Date of Award
5-1-2020
Document Type
Thesis (Undergraduate)
Department or Program
Department of Computer Science
First Advisor
Soroush Vosoughi
Abstract
Social curation platforms like Reddit are rich with user interactions such as comments, upvotes, and downvotes. Predicting these interactions before they happen is an interesting computational challenge, and such predictions can support a variety of tasks, ranging from content moderation to personality prediction. Given the vast amount of information posted on these sites, it is important to develop models that simplify this prediction task. In this paper, we present a simple clustering algorithm that helps predict the controversiality of a Reddit post using the user's profile information, their past contributions on Reddit, and the sentiment expressed in their post. On average, introducing the clusters into the prediction task improved prediction accuracy by over 20 percent, with F1 scores of 0.95 (micro) and 0.7 (macro). The classifier performs better than a majority predictor. The results also show that the overwhelming majority of users are inactive and that, when they do post, they post non-controversial content.
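The abstract reports both micro and macro F1 and compares against a majority predictor. The sketch below is not the thesis's code or data; it uses a made-up label distribution (90 non-controversial posts, 10 controversial) and hypothetical helper names to illustrate why the two averages can differ sharply on imbalanced classes, and why a majority predictor is a natural baseline.

```python
from collections import Counter

def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1; for single-label classification this equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy labels (NOT from the thesis): 0 = non-controversial, 1 = controversial.
y_true = [0] * 90 + [1] * 10

# Majority predictor: always output the most frequent label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_majority = [majority_label] * len(y_true)

micro = micro_f1(y_true, y_majority)
macro = macro_f1(y_true, y_majority)
print(f"majority baseline: micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

On this toy data the majority baseline scores well on micro F1 while its macro F1 stays below 0.5, because it never identifies a controversial post. This is why macro F1 is the more informative of the two numbers when one class dominates.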
Recommended Citation
Dara, Abenezer Daniel, "A Clustering Algorithm for Early Prediction of Controversial Reddit Posts" (2020). Dartmouth College Undergraduate Theses. 157.
https://digitalcommons.dartmouth.edu/senior_theses/157
Comments
Originally posted in the Dartmouth College Computer Science Technical Report Series, number TR2020-891.