Oh, The Horror!

Sentiment Analysis

Overview

I have a friend named Kari who is always looking for a good new horror movie to watch. So am I. But what makes a horror movie good? Well, there is an entire community of horror fans who spend a lot of time answering that question. Where can I find them? On Reddit, of course! I decided to look at two large Reddit communities for movie discussions: All Things Horror and Horror Movies Only.

Now, Kari is a little tired of people just ragging on movies and directors. She wants to find a new horror movie to watch, but she doesn’t want to scroll through comment after comment of negative reviews. Kari is looking for a movie to watch, not to review herself. So I want to refer her to either All Things Horror or Horror Movies Only, whichever gives her the better chance of finding a movie she wants to watch.

Technicals

After using the Pushshift API to pull a subset of over 1500 text fields, I used VADER to assign a sentiment score to each field. I then ran a Naive Bayes classifier with CountVectorizer through a pipeline and compared it to a voting ensemble that combined a decision tree and boosting models, also fed by CountVectorizer.
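The two models described above can be sketched roughly like this with scikit-learn. This is a minimal illustration, not the project's actual code: the toy texts and labels are made up, the choice of AdaBoost and gradient boosting for the "boosts" is an assumption, and the post does not say which target the classifiers predicted, so the labels here are placeholders.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the Reddit text fields.
texts = [
    "loved this movie, great atmosphere",
    "what a fun creature feature",
    "the ending was brilliant",
    "best slasher I have seen in years",
    "terrible acting and a boring plot",
    "the director ruined the sequel",
    "awful effects, total waste of time",
    "worst remake ever made",
]
labels = ["positive"] * 4 + ["negative"] * 4  # placeholder target

# Model 1: bag-of-words counts fed into Naive Bayes.
nb_pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Model 2: the same counts fed into a majority-vote ensemble.
# Which boosting models the project used is not stated; these two are assumed.
vote = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="hard",
)
vote_pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("vote", vote),
])

nb_pipe.fit(texts, labels)
vote_pipe.fit(texts, labels)

nb_preds = nb_pipe.predict(texts)
vote_preds = vote_pipe.predict(texts)
```

With real data you would hold out a test split and compare the two pipelines on it instead of predicting on the training texts.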

Takeaways

My first roadblock was the limitation of the VADER lexicon. Horror doesn’t map well onto common notions of what “good” means: words like “horror” and “scare” read as negative to a general-purpose lexicon, even when a fan means them as praise. Given the time constraints of this project, I chose to remove variants of “horror” and “scare” from the VADER lexicon.

Comparing the character counts between the subreddits also showed a pattern. I initially assumed that my analysis would lead me to refer Kari to a specialty community like r/HorrorMoviesOnly. But the character counts, once visualized, showed how wildly subjective that community was. Therefore, I recommended r/AllThingsHorror, where a comment was more likely to be neutral or positive than negative.
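A character count comparison like the one above could start from a summary such as this. It is a small sketch using only the standard library; the function name char_count_summary and the toy posts are hypothetical, and the original analysis was visual rather than a table of statistics.

```python
from statistics import mean, median, pstdev


def char_count_summary(posts_by_subreddit):
    """Summarize post lengths (in characters) for each subreddit."""
    summary = {}
    for name, posts in posts_by_subreddit.items():
        lengths = [len(post) for post in posts]
        summary[name] = {
            "mean": mean(lengths),
            "median": median(lengths),
            "spread": pstdev(lengths),  # population standard deviation
        }
    return summary


# Hypothetical toy posts, not real Reddit data.
toy_posts = {
    "AllThingsHorror": [
        "Just watched it, pretty good.",
        "Any recommendations for tonight?",
    ],
    "HorrorMoviesOnly": [
        "No.",
        "I cannot believe anyone defends this director after the last three films.",
    ],
}

summary = char_count_summary(toy_posts)
for name, stats in summary.items():
    print(name, stats)
```

The same per-post lengths can be passed to a histogram or box plot to get the visual comparison.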