Social Media, Likability, and the Language of Hate

Sherina Wijaya, Stephanie (Yujie) Wu, and Kristie Xia

Accompanying Python notebook (via Google Colab)

Over the past few years, we have witnessed the incredible power of social media. From influencing elections to amplifying the voices of advocacy groups, from memes that spread like wildfire to the most niche jokes, social media has it all.

As 20-something social media users, we were intrigued to see how and what people posted, especially in response to tragedy. As hate crimes against members of the Asian community increased over the past year, we noticed how social media has become a powerful tool that communities utilize to organize and advocate. However, we also know that each social media platform has its own character, its own “type” of content. What is popular on Facebook may not be popular on Twitter, and vice versa.

With this in mind, we decided to look at Twitter tweets and Facebook posts to find out: what makes a post likable?

Data Collection, Cleaning, and Encoding

In order to collect data points (tweets and Facebook posts), we utilized two pre-existing packages: facebook-scraper and tweepy.

We wanted our data points to reflect the behaviors of a wide range of users rather than the voice of an official publication or organization, so for our Facebook data points, we identified eight public groups (including Asian Americans United Against Violence, Asian American Alliance, and Stop Racism Against Asian Americans - 1 Million Strong) where users can freely post. We ran the Facebook scraper to collect data from the past 7 days, which amounted to over 600 posts to analyze. Each post/data point includes the number of likes, comments, and shares and when the post was made; we also encoded boolean values indicating whether or not the post contained images, videos, or links.
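The per-post encoding can be sketched as follows. This is a minimal sketch assuming facebook-scraper's post dictionary; `encode_post` and its field handling are illustrative, not our exact notebook code.

```python
def encode_post(post):
    """Flatten one scraped post dict into the features we keep.

    Field names follow facebook-scraper's post dict; this helper
    is an illustrative sketch, not our exact notebook code.
    """
    return {
        "text": post.get("text") or "",
        "likes": post.get("likes") or 0,
        "comments": post.get("comments") or 0,
        "shares": post.get("shares") or 0,
        "time": post.get("time"),
        # Boolean flags for whether the post carries media or a link
        "has_image": bool(post.get("image")),
        "has_video": bool(post.get("video")),
        "has_link": bool(post.get("link")),
    }

# Collection loop (requires the facebook-scraper package and a
# public group name; both lines are illustrative):
# from facebook_scraper import get_posts
# rows = [encode_post(p) for p in get_posts(group="...", pages=10)]
```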

For tweets, we simply scraped for original tweets (not retweets) that contain the words “Asian hate” or the hashtags “#stopaapihate” or “#stopasianhate.” Due to the time it takes the scraper to run, we could only get 15,000 tweets, but each tweet/data point gave us insight into how many likes and retweets it got, when it was tweeted, and details about the user (such as their account description and how many followers they have).
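The tweet side can be sketched similarly. The query string, `encode_tweet` helper, and the commented-out tweepy loop are illustrative (tweepy's endpoint names vary by version; credentials are omitted):

```python
# Illustrative search query combining our keyword and hashtags,
# excluding retweets
QUERY = '("Asian hate" OR #stopaapihate OR #stopasianhate) -filter:retweets'

def encode_tweet(status):
    """Keep the fields we analyze from one tweet's JSON."""
    return {
        "text": status["full_text"],
        "favorites": status["favorite_count"],
        "retweets": status["retweet_count"],
        "created_at": status["created_at"],
        "followers": status["user"]["followers_count"],
        "user_description": status["user"]["description"] or "",
    }

# Collection loop (requires tweepy and API credentials):
# import tweepy
# api = tweepy.API(auth, wait_on_rate_limit=True)
# for status in tweepy.Cursor(api.search_tweets, q=QUERY,
#                             tweet_mode="extended").items(15000):
#     rows.append(encode_tweet(status._json))
```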

Finally, we performed sentiment analysis on both our datasets: the text of each Facebook post, the text of each tweet, and the account description of each Twitter user. These sentiment scores serve as another potential feature to help our models predict likability.

We defined likability as having more than 125 likes and encoded a boolean value for both our Facebook and Twitter dataset to indicate its likability (1 if it counts as likable, 0 otherwise).
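The labeling step is a one-liner; a minimal sketch follows. The sentiment scorer mentioned in the comments is an assumption on our part, since any polarity score in [-1, 1] would fill the same role as a feature.

```python
LIKABLE_THRESHOLD = 125

def label_likable(likes, threshold=LIKABLE_THRESHOLD):
    """Return 1 if the post/tweet counts as likable, 0 otherwise."""
    return 1 if likes > threshold else 0

# Sentiment scoring (requires a sentiment package, e.g.
# vaderSentiment; this choice is an assumption, since any polarity
# scorer in [-1, 1] would work here):
# from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# sentiment = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
```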

Exploratory Data Analysis

Let us look at several charts to better understand users and their data.

Facebook

First, let us take a look at when people post on these Facebook groups.

How many posts were made at different hours of the day

We can see that late afternoons (3–6pm) and late nights (10pm–1am) are the most popular times people post on Facebook. 6pm seems to be the most common time of day for posts, perhaps as people post after they finish work for the day. In addition, it’s interesting to see that there is still quite significant activity up until 4am. This may be due to time differences — users on the West Coast (such as California, where there are large Asian communities) would still be posting up until 1am.

Next, let us look at what people post in these groups.

The number of posts that fall into each category (contains or does not contain image, video, link)

It seems like very few posts contain images or videos, but the majority of them contain a link. This suggests that the purpose of these groups is to recirculate information from one source (news outlets, blogs, etc.) to another (Facebook).

Does the content of a post (images, videos, links) affect how many likes that post will get?

Average number of likes for the different types of posts

From the chart above, it seems that posts without images, without videos, and with links get the most likes on average. However, we need to keep in mind that there is an imbalance in the number of samples for each category. We will explore this more deeply when we build our models.

What are some of the most popular keywords people have been discussing in these groups over the last week?

A word cloud representing texts found in our Facebook dataset

According to the word cloud above, some of the most used phrases are “hate”, “American”, “community”, “please”, “Chinese”, “crime”, and “violence.” The rise in anti-Asian sentiment and hate crimes in recent weeks could have prompted group members to more actively share resources and advocate for the community.
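Under the hood, a word cloud is just word frequencies; the counting step can be sketched like this (the stopword list is illustrative, and the choice of rendering library is an assumption on our part):

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "a", "an", "to", "and", "of", "in", "is", "for"}

def top_words(texts, n=10):
    """Count non-stopword tokens across a list of post texts."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z']+", text.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(n)
```

Rendering the resulting frequencies as an image is then a single call to a word-cloud library such as the `wordcloud` package.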

Finally, how does the text within a post affect the number of likes and comments that post garners?

Number of likes (left) and comments (right) based on the sentiment score of the post

One would think that the more radical the sentiment of a post (extremely positive or negative posts, with extremely high or low sentiment scores), the more likely it is to garner likes and comments. However, that idea does not always hold true; posts with sentiments that hover around 0 (neutral posts) seem to garner more likes. Comments are less predictable. Some negative-sentiment posts garnered a large number of comments, which suggests that negative sentiment has a larger effect on comments than on likes. We will explore this further when we build our models.

Twitter

First, let us take a look at when people tweet about anti-Asian hate.

How many tweets were made at different hours of the day

Once again, 4–5pm as well as 1–2am are peak tweeting times. This perhaps describes the platform well — tweets are short and are usually spontaneous. Rather than wait until you get home from work and turn on your computer, you can tweet as you pack up from work at 4pm or as you are waiting to fall asleep at 1am.

Next, let us look at what people tweet about.

A word cloud representing texts found in our Twitter dataset

From the word cloud above, we see “Bannon”, “crimes”, “bill”, and “black” as some of the most prominent words (outside of our three search keywords). This suggests that the discourse on Twitter is different from that on Facebook. Tweets center around advocacy (passing “bill[s]”), addressing fake news (“Bannon”), as well as reconciling anti-black sentiment in Asian communities (“black”).

Finally, let us look at how the text sentiment of a tweet affects the number of retweets and favorites/likes it garners.

Number of likes/favorites (left) and retweets (right) based on the sentiment score of the tweet
Number of likes/favorites (left) and retweets (right) based on the sentiment score of the tweeter’s user description

Once again, we see that most tweets that do well (high retweets and likes) have near-neutral sentiment scores and user-description scores.

Let us now move on and try to build models to predict likability. For each of the models below, we use a 60–40 training-testing split to prevent overfitting and to evaluate out-of-sample accuracy.
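With scikit-learn, that split is one call (the stand-in arrays and `random_state` here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((100, 5))       # stand-in feature matrix
y = np.array([0, 1] * 50)    # stand-in likability labels

# 60-40 training-testing split, as used for every model below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)
```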

Model 1: Logistic Regression

As we are trying to predict binary outcomes, we decided to first try a logistic regression. Using scikit-learn’s LogisticRegression, we defined models, trained them on our training sets, and made predictions on our test sets.
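A minimal, self-contained version of that pipeline, run on synthetic stand-in data (the features and labels are illustrative; only the workflow mirrors ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 4))           # stand-in features
y = (X[:, 0] > 0.6).astype(int)    # stand-in likability labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

# RMSE on 0/1 labels (the error metric reported in this post)
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
cm = confusion_matrix(y_test, pred)
```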

Our Twitter model achieved a test RMSE of 12.25%, while our Facebook model got 12.31%. Below are confusion matrices for each model:

Confusion matrices for our Twitter (left) and Facebook (right) Logistic Regression models

These matrices suggest that both models have quite low precision and recall (0% for Twitter and 67% for Facebook). The confusion matrix also makes it clear that there may be some undersampling issues we need to keep in mind.

Model 2: Decision Trees

Another model we considered uses decision trees to classify whether or not a tweet/post is predicted to be likable. Using scikit-learn’s DecisionTreeClassifier, we set a minimum leaf size of approximately half the number of likable tweets/posts (50 tweets or 7 posts) to prevent overfitting. We then trained the model and evaluated it on our test dataset.
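A sketch of that classifier setup on stand-in data (only `min_samples_leaf` mirrors our actual settings):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 3))             # stand-in features
y = (X[:, 0] > 0.7).astype(int)      # stand-in likability labels

# min_samples_leaf=50 mirrors the Twitter model's setting; the
# Facebook model used 7 (roughly half its count of likable posts)
tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
tree.fit(X, y)
```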

Our Twitter model achieved a test RMSE of 8.16%, with the tree shown below:

Decision Tree model for tweets

The model suggests that the most important features are the number of retweets (denoted by X[0]) and the number of followers the user has (denoted by X[1]). It seems that the most likable posts are those that were retweeted at least 56 times. This is an interesting observation, as people on Twitter often say that retweets do not equal endorsements, that is, retweets do not mean you agree or like what the original tweeter was saying.

Meanwhile, our Facebook model achieved a test RMSE of 10.66%, with the tree shown below:

Decision Tree model for Facebook posts

The model suggests that the most important features are the number of times the post was shared (X[1]), the length of the text (X[2]), the post day (X[7]), and the shared sentiment score (X[9]), which denotes the sentiment of the text being shared. Posts which are shared more than 20 times are most likely to be categorized as “likable.” This is quite understandable — when you share something on Facebook, you are usually sharing content that you agree with, content that you want others to read and agree with as well.

Let us now look at the confusion matrix for both models:

Confusion matrices for our Twitter (left) and Facebook (right) Decision Tree models

The test precision of our models has definitely improved (now at 93% and 71% for Twitter and Facebook respectively), and so has the recall (now at 59% and 83% for Twitter and Facebook respectively).

Model 3: Random Forests

Let us utilize ensemble classifiers to improve prediction. Using a random forest with 20 estimators, our Twitter model achieved a test RMSE of 8.16%, while our Facebook model got 15.08%. The random forest Twitter model is not that much better than our decision tree model, while the random forest Facebook model performed worse.
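The ensemble step is a one-line change from the decision tree, sketched here on stand-in data (only `n_estimators=20` mirrors our setting):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 3))             # stand-in features
y = (X[:, 0] > 0.7).astype(int)      # stand-in likability labels

# An ensemble of 20 decision trees, each fit on a bootstrap sample
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y)
```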

Below are confusion matrices for each model:

Confusion matrices for our Twitter (left) and Facebook (right) Random Forest models

The recall of our Twitter Random Forest model has improved to 67%, while its precision fell to 84%. Meanwhile, the recall of our Facebook Random Forest model has dropped to 0%.

All in all, the random forests give us less interpretability without much better accuracy.

Conclusion

From our Exploratory Data Analysis, we learned a lot about people’s tendency to share and re-share information on Facebook, as well as people’s tendency to spontaneously tweet. We garnered insight into what each platform’s users talked about (the focus on information sharing on Facebook vs advocacy on Twitter) and we saw that people tend to be more active on these platforms in the late afternoons and late nights.

We also tried three models to predict likability: logistic regression, decision trees, and random forests. At the end of the day, the logistic regression models were not accurate enough, and the random forests were difficult to interpret while not being much more accurate. Therefore, with the dataset that we have, we believe the decision trees were our best models.

We did consider the huge class imbalance in our dataset. When we tried resampling methods, our Twitter and Facebook decision tree models achieved RMSEs of 14.72% and 23.84% respectively. These are slightly worse than our standard models, but perhaps a better representation of our models’ true accuracy (and the best among the over-sampled models).
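One simple way to do that resampling with scikit-learn's `resample` (the specific resampler is an assumption on our part; the text above only says "resampling methods"):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(2)
X = rng.random((100, 3))
y = np.array([1] * 10 + [0] * 90)    # heavy class imbalance

# Upsample the minority ("likable") class to match the majority
X_minority = X[y == 1]
X_majority = X[y == 0]
X_up = resample(X_minority, replace=True,
                n_samples=len(X_majority), random_state=0)

X_balanced = np.vstack([X_majority, X_up])
y_balanced = np.array([0] * len(X_majority) + [1] * len(X_up))
```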

Shortcomings and Next Steps

Throughout the data collection stage of this project, we encountered issues and were constrained by various limitations. Tweet scraping using tweepy took quite a long time (approximately 1.5 hours for 15,000 tweets). Facebook scraping using facebook-scraper was unstable: there were times when the scraper would return hundreds of results, and other times it would return zero. Finally, both scrapers were free and limited scraping to a 7-day window; we were unable to obtain tweets or posts beyond that. This resulted in a dataset smaller than ideal. Given more time, we could run the scraper every 7 days for several weeks in order to obtain data covering several months. There is also the option to use paid scraping services/packages with fewer limitations (for example, the ability to scrape the number of replies and quote tweets) and more stability if one were to commercialize this endeavor.

We also acknowledge that our data is not “big” (15,000 tweets and 600-something posts) and is extremely sparse; that is, there are very few “likable” tweets from the past 7 days that contain the words and hashtags we focused on. This sparsity may have limited how well our models can perform. Next time, we could use other resampling techniques (beyond random sampling) to improve our models.

In the future, we could also try using other predictive methods (such as neural networks) and explore other ways to measure text sentiment.