Bots in the Twittersphere
An estimated two-thirds of tweeted links to popular websites are posted by automated accounts – not human beings
CORRECTION (April 2018): In the original report, the words “liberals” and “conservatives” were reversed in one sentence. It has been corrected to read, “Suspected bots share roughly 41% of links to political sites shared primarily by liberals and 44% of links to political sites shared primarily by conservatives – a difference that is not statistically significant.” In another sentence, the word “conservatives” was mistakenly used in place of “liberals.” It has been corrected to read, “By contrast, automated accounts are estimated to share 41% of links to political sites with audiences comprised primarily of liberals, and 44% of those comprised primarily of conservatives.” These corrections do not change the conclusion that automated accounts in the study did not show evidence of a liberal or conservative “political bias” in their overall link-sharing behavior.
The word “substantially” was also removed from the following sentence: “Links associated with Twitter itself are shared by suspected bot accounts about 50% of the time – a substantially smaller share than the other primary categories of content analyzed.” The 50% figure is substantially smaller than only five of the six categories. This correction does not materially change the analysis of the report.
The role of so-called social media “bots” – automated accounts capable of posting content or interacting with other users with no direct human involvement – has been the subject of much scrutiny and attention in recent years. These accounts can play a valuable part in the social media ecosystem by answering questions about a variety of topics in real time or providing automated updates about news stories or events. At the same time, they can also be used to attempt to alter perceptions of political discourse on social media, spread misinformation, or manipulate online rating and review systems. As social media has attained an increasingly prominent position in the overall news and information environment, bots have been swept up in the broader debate over Americans’ changing news habits, the tenor of online discourse and the prevalence of “fake news” online.
In the context of these ongoing arguments over the role and nature of bots, Pew Research Center set out to better understand how many of the links being shared on Twitter – most of which refer to a site outside the platform itself – are being promoted by bots rather than humans. To do this, the Center used a list of 2,315 of the most popular websites[1] and examined the roughly 1.2 million tweets (sent by English-language users) that included links to those sites during a roughly six-week period in summer 2017. The results illustrate the pervasive role that automated accounts play in disseminating links to a wide range of prominent websites on Twitter.
How does this study define a Twitter bot?
Broadly speaking, Twitter bots are accounts that can post content or interact with other users in an automated way and without direct human input.
Bots are used for many purposes. This study focuses on a particular kind of bot behavior: bots that tweet or retweet links to content around the web. In other words, these are bots that post or promote specific websites or other online content.
Many bots do not identify themselves as bots, so this study uses a tool called Botometer to estimate the proportion of Twitter links to popular sites around the web that are posted by automated or partially automated accounts. One study suggests Botometer is about 86% accurate, and Pew Research Center conducted its own independent validation tests of the Botometer system. To acknowledge the possibility of misclassification, we use the term “suspected bots” throughout this report. For details on how Botometer functions, see the methodology.
Among the key findings of this research:
- Of all tweeted links[2][3] to popular websites, 66% are shared by accounts with characteristics common among automated “bots,” rather than human users.
- Among popular news and current event websites, 66% of tweeted links are posted by suspected bots – identical to the overall average. The share of tweeted links posted by bots is even higher among certain kinds of news sites. For example, an estimated 89% of tweeted links to popular aggregation sites that compile stories from around the web are posted by bots.
- A relatively small number of highly active bots are responsible for a significant share of links to prominent news and media sites. This analysis finds that the 500 most-active suspected bot accounts are responsible for 22% of the tweeted links to popular news and current events sites over the period in which this study was conducted. By comparison, the 500 most-active human users are responsible for a much smaller share (an estimated 6%) of tweeted links to these outlets.
- The study does not find evidence that automated accounts currently have a liberal or conservative “political bias” in their overall link-sharing behavior. This emerges from an analysis of the subset of news sites that contain politically oriented material. Suspected bots share roughly 41% of links to political sites shared primarily by liberals and 44% of links to political sites shared primarily by conservatives – a difference that is not statistically significant. By contrast, suspected bots share 57% to 66% of links from news and current events sites shared primarily by an ideologically mixed or centrist human audience.
Examples of Twitter bots in action
Bots can be used for a wide range of purposes. Here are some examples of bots that perform various tasks on Twitter:
- Netflix Bot (@netflix_bot) automatically tweets when new content has been added to the online streaming service.
- Grammar Police (@_grammar_) is a bot that identifies grammatically incorrect tweets and offers suggestions for correct usage.
- Museum Bot (@museumbot) posts random images from the Metropolitan Museum of Art.
- The CNN Breaking News Bot (@attention_cnn) is an unofficial account that sends an alert whenever CNN claims to have breaking news.
- The New York Times 4th Down Bot (@NYT4thDownBot) is a bot that provides live NFL analysis.
- PowerPost by the Washington Post (@PowerPost) is a bot that provides news about decision-makers in Washington.
These findings are based on an analysis of a random sample of about 1.2 million tweets from English-language users containing links to popular websites over the time period of July 27 to Sept. 11, 2017.[4] To construct the list of popular sites used in this analysis, the Center identified nearly 3,000 of the most-shared websites during the first 18 days of the study period and coded them based on a variety of characteristics.[5] After removing links that were dead, duplicated or directed to sites without sufficient information to classify their content, researchers arrived at a list of 2,315 websites.
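The site-selection step described above amounts to counting how often each domain appears among tweeted links and keeping the most frequent ones. The following is a minimal sketch of that idea, not the Center's pipeline: the function name `most_shared_sites` and the example URLs are illustrative assumptions, and the sketch does not resolve shortened or redirected links.

```python
from collections import Counter
from urllib.parse import urlparse


def most_shared_sites(urls, top_k):
    """Return the top_k domains by number of tweeted links (illustrative helper)."""
    domains = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same site.
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains[host] += 1
    return [site for site, _ in domains.most_common(top_k)]


# Hypothetical links, one entry per tweeted link.
urls = [
    "https://www.example.com/a",
    "http://example.com/b",
    "https://news.example.org/c",
    "https://example.net/d",
    "https://example.net/e",
    "https://example.net/f",
]
top = most_shared_sites(urls, 2)
print(top)
```

A real list would still need the manual review the report describes, such as removing dead or duplicated sites before coding them.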
First, these sites were categorized into six different topical groups based on their primary area of focus. The topical groupings included: adult content, sports, celebrity, commercial products or services, organizations or groups, and news and current events. For comparison with these primary categories, researchers put links that redirected to content within Twitter itself into a separate category.
Second, sites categorized as having a broad focus on news and current events (925 sites in total met this criterion) were subsequently coded based on three additional criteria:
- Whether a majority of the site’s content consisted of aggregated or republished material produced by other sites or publications;
- Whether the site included a politics section, and/or prominently featured political stories in its top headlines; and
- Whether the site had a contact page (a trait that can serve as a proxy for whether a site offers readers the ability to submit comments and feedback).
Third, the Center identified an additional subset of news and current events sites that featured political stories or a politics section and that primarily serve a U.S. audience. Each of these politically oriented news and current events sites was then categorized as having primarily a liberal audience, a conservative audience or a mixed readership.[6]
The next step was to examine each tweeted link to those sites and attempt to determine if the link was posted from an automated account. To identify bots, the Center used a tool known as “Botometer,” developed by researchers at the University of Southern California and Indiana University. Now in its second incarnation, Botometer estimates the likelihood that any given account is automated based on a number of criteria, including the age of the account, how frequently it posts, and the characteristics of its follower network, among other factors. Accounts estimated as having a relatively high likelihood of being automated based on Pew Research Center’s tests of the Botometer system were classified as bots for the purposes of this analysis.[7]
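The classification step can be pictured as applying a cutoff to each account's bot-likelihood score and then tallying tweeted links. The sketch below assumes each account already has a score between 0 and 1; the threshold value, the field names and the sample data are placeholders chosen for illustration, not the Center's actual settings. Following the report's counting rule, every link in a tweet counts as a separate tweeted link.

```python
# Placeholder cutoff for illustration only; the report does not state
# the Center's threshold in this section.
BOT_THRESHOLD = 0.5


def suspected_bot_link_share(tweets, scores, threshold=BOT_THRESHOLD):
    """Share of tweeted links posted by accounts scoring at or above threshold."""
    total_links = 0
    bot_links = 0
    for tweet in tweets:
        n_links = len(tweet["links"])  # each link counts separately
        total_links += n_links
        if scores[tweet["account"]] >= threshold:
            bot_links += n_links
    return bot_links / total_links if total_links else 0.0


# Hypothetical tweets and bot-likelihood scores.
tweets = [
    {"account": "a", "links": ["https://example.com/1"]},
    {"account": "b", "links": ["https://example.com/1", "https://example.org/2"]},
    {"account": "c", "links": ["https://example.net/3"]},
]
scores = {"a": 0.9, "b": 0.2, "c": 0.7}

share = suspected_bot_link_share(tweets, scores)
print(share)  # 2 of the 4 tweeted links come from suspected bots
```

Because the result depends on where the cutoff is placed, any such estimate carries the misclassification risk the report acknowledges.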
Collectively, the data gathering, site coding and bot detection analysis described above provide an answer to the following key research question: What proportion of tweeted links to popular websites are posted by automated accounts, rather than by human users?
This research is part of a series of Pew Research Center reports examining the information environment on social media and the ways that users engage in these digital spaces. Previous studies have documented the nature and sources of tweets regarding immigration news, the ways in which news is shared via social media in a polarized Congress, the degree to which science information on social media is shared and trusted, the role of social media in the broader context of online harassment, how key social issues like race relations play out on these platforms, and the patterns of how different groups arrange themselves on Twitter.
It is important to note that bot accounts do not always clearly identify themselves as such in their profiles, and any bot classification system inevitably carries some risk of error. The Botometer system has been documented and validated in an array of academic publications, and researchers from the Center conducted a number of independent validation measures of its results.[8] However, some human accounts may be misclassified as automated, while some automated accounts may be misclassified as genuine. There is therefore a degree of uncertainty in these estimates of the share of traffic by suspected bot accounts.
In addition, the analysis described in this report is based on a subset of tweets collected over a specific period of time. It is not an analysis of all websites or of all media properties, but rather an analysis of popular websites and media outlets as measured by the number of links posted on Twitter to their content. This analysis does not seek to evaluate whether these links were being shared by “good” or “bad” bots, or whether those bots are controlled from inside or outside the U.S. It also does not seek to assess the reach of the tweets in question or to determine how many human users saw, clicked through or otherwise engaged with bot-generated content.
Further details on our bot-classification effort can be found in the methodology of this report.
Automated account activity is prominent across the Twitter ecosystem
Automated accounts play a prominent role in tweeting out links to content across the Twitter ecosystem. The Center’s analysis finds that an estimated 66% of all tweeted links to the most popular websites are likely posted by automated accounts, rather than human users.
Certain types of sites – most notably those focused on adult content and sports – receive an especially large share of their Twitter links from automated accounts. Automated accounts were responsible for an estimated 90% of all tweeted links to popular websites focused on adult content during the study period. For popular websites focused on sports content, that share was estimated to be 76%.
Automated accounts make up a slightly smaller proportion – although in each case still a majority – of link shares for other types of popular sites. Most notably, the Center’s analysis finds that 66% of tweeted links to the most popular news and current events sites on Twitter are likely to have been shared by bot accounts. That figure is identical to the average for the most popular sites as a whole. Suspected automated accounts make up a larger share of links posted to popular sites focused on commercial products or services (73%) and a lesser share of sites focused on celebrity news and culture (62%). The proportion of link shares by automated accounts is the lowest for links associated with Twitter.com – that is, links that stop at Twitter and do not redirect to any external site – compared with the six topical categories in this study. Links associated with Twitter itself are shared by suspected bot accounts about 50% of the time – a smaller share than the other primary categories of content analyzed.
In focus: Popular news and current events websites are linked to in tweets by bots
Automated accounts post a substantial share of links to a wide range of online media outlets on Twitter. As noted above, the Center’s analysis estimates that 66% of tweeted links to popular news and current events websites are posted by bots. The analysis also finds that a relatively small number of automated accounts are responsible for a substantial share of the links to popular media outlets on Twitter. The 500 most-active suspected bot accounts alone were responsible for 22% of all the links to these news and current events sites over the period in which this study was conducted. By contrast, the 500 most-active human accounts were responsible for just 6% of all links to such sites.
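The concentration figures above compare the links posted by the N most-active accounts in a group with all links to the sites in question. A minimal sketch of that calculation follows; the function name `top_n_share`, the account names and the tiny data set are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def top_n_share(link_accounts, group, n):
    """Share of ALL tweeted links posted by the n most-active accounts in group.

    link_accounts: one account id per tweeted link, across all accounts.
    group: set of account ids to rank (e.g. suspected bots, or humans).
    """
    total = len(link_accounts)
    counts = Counter(a for a in link_accounts if a in group)
    top_links = sum(count for _, count in counts.most_common(n))
    return top_links / total if total else 0.0


# Hypothetical data: ten tweeted links from five accounts.
links = ["bot1"] * 5 + ["bot2"] * 2 + ["bot3"] + ["human1"] + ["human2"]
bots = {"bot1", "bot2", "bot3"}
humans = {"human1", "human2"}

bot_share = top_n_share(links, bots, 2)      # 7 of 10 links
human_share = top_n_share(links, humans, 1)  # 1 of 10 links
print(bot_share, human_share)
```

Using the same denominator for both groups is what makes the bot and human figures directly comparable.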
The Center’s analysis also indicates that certain types of news and current events sites appear especially likely to be tweeted by automated accounts. Among the most prominent of these are aggregation sites, or sites that primarily compile content from other places around the web. An estimated 89% of links to these aggregation sites over the study period were posted by bot accounts.
Automated accounts also provide a somewhat higher-than-average proportion of links to sites lacking a public contact page or email address for contacting the editor or other staff. This type of contact information can be used to submit reader feedback that may serve as the basis of corrections or additional reporting. The vast majority (90%) of the popular news and current events sites examined in this study had a public-facing, non-Twitter contact page. The small minority of sites lacking this type of contact page were shared by suspected bots at greater rates than those with contact pages. Some 75% of links to such sites were shared by suspected bot accounts during the period under study, compared with 60% for sites with a contact page.
On the other hand, certain types of news and current events sites receive a lower-than-average share of their Twitter links from automated accounts. Most notably, this analysis indicates that popular news and current events sites featuring political content have the lowest level of link traffic from bot accounts among the types of news and current events content the Center analyzed, holding other factors constant. Of all links to popular media sources prominently featuring politics or political content over the time period of the study, 57% are estimated to have originated from bot accounts.
Twitter bots post a greater share of content from centrist Twitter audiences
The question of whether the media sources shared by liberals or conservatives see more automated account traffic has been a topic of debate over the last year. Some have voiced worry that suspected bot accounts are prolific in sharing hyper-partisan political news, either on the left or right of the ideological spectrum.
However, the Center’s analysis finds that automated Twitter accounts actually share a higher proportion of links from sites that have ideologically mixed or centrist human audiences – at least within the realm of popular news and current events sites with an orientation toward political news and issues. By extension, these automated accounts are less likely to share links from sites with ideologically conservative or liberal human audiences. In addition, right-left differences in the proportion of bot traffic are not substantial.
This analysis is based on a subgroup of popular news and current events outlets that feature political stories in their headlines or have a politics section, and that serve a primarily U.S. audience. A total of 358 websites out of our full sample of 2,315 popular sites met these criteria. Researchers isolated the suspected non-automated accounts that shared links to those sites on Twitter over the time period of the study and used a statistical technique known as correspondence analysis to estimate the ideology of each site’s Twitter audience.
Correspondence analysis first measures how consistently individual sites are shared by some users and not others. It then groups them together and quantifies the degree of difference. Based on this analysis, a score of greater than zero suggests that a site’s audience is more consistently conservative, while a score less than zero suggests that a site’s audience is more consistently liberal. This is a technique based on scholarly research that estimates ideological preferences as revealed by behavior. Researchers can use this method to see which sites are shared mostly by a liberal, conservative, or moderate audience, and how many times bots share each kind of site. It is important to note that correspondence analysis produces estimates of audience ideology without any analysis of the content of the website – only the sharing patterns of human users. For more details, see the methodology section.
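To give a feel for how sharing patterns alone can place sites on a single axis, the sketch below uses reciprocal averaging, a classic iterative procedure that converges to the first axis of a correspondence analysis. It is an illustration under stated assumptions, not the Center's implementation: the site names and share counts are invented, every user and site is assumed to have at least one share, and the sign of the axis is arbitrary, so which end is read as liberal or conservative has to be decided by inspecting known sites.

```python
def reciprocal_averaging(counts, iterations=200):
    """Site scores on the first correspondence-analysis axis.

    counts[i][j] is the number of times user i shared a link to site j.
    """
    n_users, n_sites = len(counts), len(counts[0])
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(counts[i][j] for i in range(n_users)) for j in range(n_sites)]
    grand_total = sum(row_totals)

    site = [float(j) for j in range(n_sites)]  # arbitrary starting scores
    for _ in range(iterations):
        # A user's score is the weighted average of the sites they share.
        user = [sum(counts[i][j] * site[j] for j in range(n_sites)) / row_totals[i]
                for i in range(n_users)]
        # A site's score is the weighted average of the users who share it.
        site = [sum(counts[i][j] * user[i] for i in range(n_users)) / col_totals[j]
                for j in range(n_sites)]
        # Re-center and rescale so the scores do not collapse to a constant.
        mean = sum(col_totals[j] * site[j] for j in range(n_sites)) / grand_total
        site = [s - mean for s in site]
        var = sum(col_totals[j] * site[j] ** 2 for j in range(n_sites)) / grand_total
        site = [s / var ** 0.5 for s in site]
    return site


# Hypothetical share counts: rows are human users, columns are sites.
sites = ["left_a", "left_b", "mixed", "right_a", "right_b"]
counts = [
    [5, 3, 1, 0, 0],
    [4, 4, 2, 0, 0],
    [1, 1, 4, 1, 1],
    [0, 0, 2, 4, 4],
    [0, 0, 1, 3, 5],
]
scores = reciprocal_averaging(counts)
for name, score in zip(sites, scores):
    print(name, round(score, 3))
```

In this toy data, sites shared by the same users end up on the same side of zero, while the site shared across both groups lands near the center, without any reference to what the sites publish.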
The Center’s analysis finds that suspected automated accounts post a higher proportion of links to sites that are primarily shared by human users who score near the center of the ideological spectrum, rather than those shared more often by either a more liberal or a more conservative audience. Automated accounts share roughly 57% to 66% of the links to political sites that are shared by an ideologically mixed or centrist human audience, according to the analysis. By contrast, automated accounts are estimated to share roughly 41% of links to political sites with audiences comprised primarily of liberals, and 44% of those comprised primarily of conservatives. Sharing rates among sites with liberal audiences are not significantly different from those with conservative audiences. However, differences in sharing rates for sites with centrist audiences compared with those at either end of the spectrum are substantially beyond the margins of error.
It is important to note certain caveats in interpreting the findings of this analysis. First, this study only examines major media outlets as measured by the number of shares they receive on Twitter. Second, it does not examine the truthfulness (or lack thereof) of the content shared by humans and the content shared by bots. Finally, it is focused on overall sharing rates and does not account for the subsequent shares or engagement of human users.
[1] Popular sites are defined as those most frequently shared in a 1% sample of tweets posted on Twitter from the period July 27 to Aug. 14, 2017. The final list was based on a larger list of nearly 3,000 of the most-shared websites linked to on Twitter during this initial 18-day period in the study. A total of 685 were excluded because they were deactivated, duplicated, or directed to sites without sufficient information to allow researchers to classify them. See methodology for further details.
[2] A tweeted link is a link to a Twitter URL or an external URL contained in a single tweet. If two tweets contain the same link, they are counted separately. If a tweet contains two or more links, each is counted as a separate tweeted link. Some 5.2% of all tweets contained more than one link. Counting each tweet once results in an estimate of 65%, inclusive of links to the twitter.com domain.
[3] Removing links to the twitter.com domain results in an estimate of 70%. Counting each tweet only once does not change this estimate.
[4] Accounts may tweet in many different languages, but researchers only focused on those listing English as their profile language. Profile language or listed location is not necessarily a reliable measure of where the user or account is operated from.
[5] This list is based on a sample of tweets containing links collected between July 27 and Aug. 14, 2017. See methodology for more details.
[6] In order to estimate the American political orientation of website audiences, researchers excluded sites with non-U.S. audiences. Sites were assigned a human audience ideology score using an analytic technique known as “correspondence analysis.” See methodology for details.
[7] The Center constructed likelihood estimates based on its own tests of the Botometer system. See methodology for more details.
[8] For example, accounts the Center identified as automated were suspended by Twitter at a rate nearly five times greater than accounts identified as non-automated. See methodology for details.