Date of Award


Document Type

Thesis (Undergraduate)


Department of Computer Science

First Advisor

V.S. Subrahmanian

Second Advisor

Benjamin Valentino


A growing percentage of public political communication takes place on social media sites such as Twitter, and not all of it is posted by humans. If citizens are to have the final say online, we must be able to detect and weed out bot accounts. The objective of this thesis is threefold: 1) expand the pool of Twitter election data available for analysis, 2) evaluate the bot detection performance of humans on a ground-truth dataset, and 3) learn what features humans associate with accounts that they believe to be bots. In this thesis, we build a large database of over 120 million tweets from over 900,000 Twitter accounts that tweeted about political candidates running for US Senate during the 2018 American Midterm Elections. Tweet-level data were collected in real time during the two-month period surrounding the elections; account-level data were collected retrospectively in the months following the elections. Using this original dataset, we design and launch a bot detection study using a novel combination of Amazon SageMaker and Qualtrics. For ground truth, we include 39 known bot accounts from a separate 2015 Bot Challenge Dataset (BCD 2015) in the study sample. Of the 39 known bots from BCD 2015, only 11 accounts (28.2%) were correctly identified as bots by a two-thirds or unanimous annotator vote, and just 5 accounts (12.8%) were identified as bots unanimously, highlighting the difficulty of building accurate training sets for bot detection. Looking at the study results for the Senate dataset accounts, we observe that accounts that 1) post frequently and 2) retweet frequently were more likely to be labeled as bots. The Senate dataset and the associated study results offer significant opportunities for further analysis and research.


Originally posted in the Dartmouth College Computer Science Technical Report Series, number TR2019-865.