I need to discover the most predictive keywords and/or expressions to accurately classify posts from the dating advice and relationship advice subreddit pages, in order to determine which advertisements should populate each page. Because this is a classification problem, we'll use Logistic Regression and Naive Bayes models. Misclassifications in this case will be fairly benign, so I will use the accuracy score and set a baseline of 63.3% to gauge success. Using TfidfVectorizer, I'll get the feature importances to determine which words have the greatest predictive power for the target variable. If successful, this model could also be used to target other pages that have similar frequencies of the same words and expressions.
See the relationship-advice-scrape and dating-advice-scrape notebooks for this section.
After converting all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.
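A minimal sketch of that step, assuming the scrape comes back as a list of dicts (the records, column names, and output filename here are illustrative):

```python
import pandas as pd

# Hypothetical scrape output: one dict per Reddit post
posts = [
    {"title": "How do I plan a first date?", "selftext": "We matched last week...", "subreddit": "dating_advice"},
    {"title": "Partner won't communicate", "selftext": "Two years in and...", "subreddit": "relationship_advice"},
]

df = pd.DataFrame(posts)
df.to_csv("dating_advice.csv", index=False)  # filename is illustrative
```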
Data Cleaning and EDA
- Dropped rows with a null selftext column because those rows are of no use to me.
- Combined the title and selftext columns into one new column, all_text.
- Examined the distributions of word counts per post for the title and selftext columns and compared the two subreddit pages.
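The cleaning and EDA steps above can be sketched as follows; the toy frame and column names (title, selftext, subreddit) are assumed from the bullets, not taken from the actual dataset:

```python
import pandas as pd

# Toy frame standing in for the combined subreddit scrape
df = pd.DataFrame({
    "title": ["How do I plan a first date?", "Partner won't communicate"],
    "selftext": ["We matched last week and...", "We've been together two years and..."],
    "subreddit": ["dating_advice", "relationship_advice"],
})

df = df.dropna(subset=["selftext"])                  # drop rows with null selftext
df["all_text"] = df["title"] + " " + df["selftext"]  # merge into one column

# Per-post word counts, compared across the two subreddits
df["word_count"] = df["all_text"].str.split().str.len()
print(df.groupby("subreddit")["word_count"].describe())
```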
Preprocessing and Modeling
Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most often, I will be correct 63.3% of the time.
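The baseline is just the share of the majority class. A quick sketch (the 633/367 split below is constructed to match the 63.3% figure, not read from the data):

```python
import pandas as pd

# Illustrative labels reproducing a 63.3% majority class
y = pd.Series(["relationship_advice"] * 633 + ["dating_advice"] * 367)

# Accuracy of always guessing the most common class
baseline = y.value_counts(normalize=True).max()
print(f"baseline accuracy: {baseline:.3f}")
```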
First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test: 75 | cross val: 74

Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scrapes; pretty bad score with high variance. Train 99%, test 72%
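The first attempt boils down to a CountVectorizer + LogisticRegression pipeline scored on train data and with cross-validation. A runnable sketch on a made-up stand-in corpus (the real X is the all_text column, y the subreddit label):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; labels mimic the two subreddits
X = [
    "first date ideas for coffee", "nervous about my first date",
    "tinder match ghosted me", "planning a fun first date",
    "my partner and i keep fighting", "how to rebuild trust with my partner",
    "married five years and struggling", "partner will not communicate",
]
y = ["dating"] * 4 + ["relationship"] * 4

pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)
print("train accuracy:", pipe.score(X, y))
# comparing train accuracy with cross-validated accuracy exposes the variance
print("cv accuracy:", cross_val_score(pipe, X, y, cv=2).mean())
```

A large gap between the train score and the cross-validated score, as in the 99% vs. 74% results above, is the high-variance (overfitting) signal the later attempts try to reduce.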
- Attempted to decrease max_features, and the score got considerably worse.
- Tried lemmatizer preprocessing instead, and the test score went up to 74%.
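The lemmatizer plugs into CountVectorizer through its tokenizer hook. The sketch below uses a toy lemma dictionary so it runs self-contained; in the project a real lemmatizer (e.g. nltk's WordNetLemmatizer, which needs the wordnet corpus downloaded) would replace the dict lookup:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy lemma map standing in for a real lemmatizer; only the hook matters here
LEMMAS = {"dates": "date", "dating": "date", "partners": "partner",
          "fights": "fight", "fighting": "fight"}

def lemma_tokenizer(doc):
    # collapse inflected forms to a shared lemma before counting
    return [LEMMAS.get(tok, tok) for tok in doc.lower().split()]

# token_pattern=None silences the warning about the unused default pattern
cv = CountVectorizer(tokenizer=lemma_tokenizer, token_pattern=None)
cv.fit(["dating dates", "partners fighting fights"])
print(sorted(cv.vocabulary_))  # inflected forms merged into three lemmas
```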