Can machine learning techniques predict the popularity of memes on social media?

What exactly makes memes transmitted through online social networks go viral? Research published recently in Applied Network Science uses machine learning techniques to examine and predict the success of a meme. Furthermore, these memes could provide a looking glass into the thoughts of internet users in the time of COVID-19. Nóra Balogh, a co-author of the study, talks about the findings.

As a side-effect of the COVID-19 pandemic and social distancing, more people have turned to online platforms for social interactions.

With this, memes in the form of funny images, quotes, jokes, tweets and hashtags have become an even more important social phenomenon, catching the attention of sociologists and network scientists as well as data scientists like us.

Memes can express humor and thoughts, and can draw attention to poignant cultural and political themes. Few memes actually go viral, however, and it is unclear which characteristics make a meme successful.

Many authors have explored the social network factors that lead a meme to go viral, but have set aside the impact that meme content may have on popularity.

In contrast, our research performs a content-based analysis of what makes a meme successful using advanced machine learning techniques.

We scraped 129,326 unique image-with-text memes posted on Reddit during the beginning of the global coronavirus outbreak. The viral nature of image-with-text memes makes this data well suited for a binary classification task.

Accordingly, we derived a binary target variable: a meme is labeled dank (1) if its number of upvotes, divided by the number of subscribers of the subreddit it was posted in, falls in the top 5%, and non-dank (0) otherwise.
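As a rough illustration of this labeling rule, here is a minimal Python sketch. It is not the code used in the study: the field names `upvotes` and `subscribers`, the function name `label_dank`, and the choice to rank all memes together (rather than per subreddit) are assumptions made for the example, and the toy records are made up.

```python
from typing import Dict, List


def label_dank(memes: List[Dict], top_share: float = 0.05) -> List[int]:
    """Return 1 for memes in the top `top_share` by normalized upvotes, else 0."""
    # Normalize upvotes by the subscriber count of the meme's subreddit.
    scores = [m["upvotes"] / m["subscribers"] for m in memes]
    # Number of memes that receive the positive label (at least one).
    n_top = max(1, int(len(scores) * top_share))
    # Indices ordered from the highest to the lowest normalized score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_top])
    return [1 if i in top else 0 for i in range(len(memes))]


# Made-up toy data: 20 memes in a subreddit with 1,000 subscribers.
toy = [{"upvotes": 10 * (i + 1), "subscribers": 1000} for i in range(20)]
labels = label_dank(toy)
```

With 20 toy memes and a 5% cutoff, only the single meme with the highest normalized score is labeled dank. Dividing by subscribers keeps memes from very large subreddits from dominating the positive class simply because more people could see them.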

Color content of popular memes. The average amount of each color attribute in the top 5% of memes (3,728 records) with the most normalized upvotes.
© The Authors

After data cleaning and exploratory analysis, we used optical character recognition and text analysis to investigate the relationship between textual attributes and popularity.

The word cloud featured at the top of this post, created from every word we gathered, indicates that certain topics are especially prevalent in the memes from late March 2020.

We found that, in most cases, the most frequent words are just as prevalent in the top 5% of viral memes as in the non-viral memes. The exception is the category of COVID-19 synonyms, which appeared more frequently in dank memes.
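A comparison of this kind can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the keyword set, the function name `category_share`, and the meme texts are hypothetical, and the study's actual word categories are not reproduced here.

```python
import re
from typing import Iterable, Set


def category_share(texts: Iterable[str], keywords: Set[str]) -> float:
    """Fraction of texts that contain at least one keyword (case-insensitive)."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = 0
    for text in texts:
        # Split the lower-cased text into alphanumeric tokens.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        if tokens & keywords:
            hits += 1
    return hits / len(texts)


# Hypothetical keyword category and made-up meme texts.
covid_terms = {"covid", "coronavirus", "corona"}
dank_texts = ["Stay home because of corona", "my cat at 3am"]
non_dank_texts = ["my cat at 3am", "monday mood", "coronavirus memes", "pizza time"]

dank_share = category_share(dank_texts, covid_terms)          # 0.5
non_dank_share = category_share(non_dank_texts, covid_terms)  # 0.25
```

A category that is more common among dank memes shows up as a larger share in the first group than in the second, as with the made-up texts here.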

We also analyzed low-level image attributes, such as color content, hue, value and saturation, as well as high-level image attributes that aim to describe the semantic meaning present in images.
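To make the low-level attributes concrete, the following Python sketch averages hue, saturation and value over a list of RGB pixels using only the standard library. It is an illustration, not the feature extraction used in the study: the function name `mean_hsv` and the pixel list are assumptions, and in practice the pixels would come from an image library.

```python
import colorsys
import math
from typing import Iterable, Tuple


def mean_hsv(pixels: Iterable[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Average hue, saturation and value (each in [0, 1]) over 8-bit RGB pixels."""
    sin_sum = cos_sum = s_sum = v_sum = 0.0
    n = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        # Hue is an angle on the color wheel, so average it as a direction.
        sin_sum += math.sin(2 * math.pi * h)
        cos_sum += math.cos(2 * math.pi * h)
        s_sum += s
        v_sum += v
        n += 1
    if n == 0:
        raise ValueError("no pixels given")
    # Circular mean of the hue, mapped back to [0, 1).
    mean_h = (math.atan2(sin_sum, cos_sum) / (2 * math.pi)) % 1.0
    return mean_h, s_sum / n, v_sum / n


# Made-up two-pixel "image": one pure red pixel and one black pixel.
hue, saturation, value = mean_hsv([(255, 0, 0), (0, 0, 0)])
```

Hue is averaged as a direction because it wraps around: a plain arithmetic mean of two reddish hues near 0 and near 1 would wrongly land on cyan.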

We used three supervised learning models to predict whether memes fall into the dank or non-dank category: gradient boosting, random forest, and a convolutional neural network.

The models were trained on four sets of inputs: image-only attributes, text-only attributes, image and text attributes combined, and all available attributes.

Not surprisingly, the model trained with all data outperformed the other models. This aligns with previous results in which text and network data held more predictive power for image popularity on Flickr.

Surprisingly, it is not obvious whether image-related or textual attributes have the stronger predictive power: the random forest model performed better with the image-related attributes, while the gradient boosting model performed better with the textual attributes.

We found that the success of a meme can be predicted moderately well based on its content alone; our best performing machine learning model predicts viral memes with an area under the ROC curve (AUC) of 0.68.
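For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen dank meme receives a higher predicted score than a randomly chosen non-dank meme, so 0.5 corresponds to guessing and 1.0 to a perfect ranking. The sketch below computes it directly from that definition; the function name `roc_auc` and the labels and scores are made up for illustration and are not results from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = dank) and model scores.
example_auc = roc_auc([1, 0, 1, 0, 0], [0.9, 0.8, 0.4, 0.3, 0.1])
print(round(example_auc, 3))  # prints 0.833
```

The pairwise loop is quadratic in the number of memes, which is fine for a small illustration; a rank-based computation would be the usual choice for a dataset of this size.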

We also found that both image-related and textual attributes have significant incremental predictive power over each other.

There are many promising directions for future research relevant to this project, such as analyzing memes inspired by COVID-19 alone or studying the temporal and dynamic aspects of meme success.
