February 20, 2014

Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters

Part 1: In-depth Analysis: Research Method and Strategy

To understand the nature of Twitter conversations, the Pew Research Center Internet Project joined with researchers at the Social Media Research Foundation, a group of scholars whose mission is to support the creation and application of open tools, open data, and open scholarship related to social media. The discovery of these six archetypical network structures emerged over several years as we examined thousands of Twitter networks on hundreds of topics. Some structures such as Polarized Crowds have been noted by other researchers and were anticipated in our exploration, but the other structures emerged by studying many maps. This kind of exploratory data analysis depends on effective visualization techniques. In our case, the key design advance was the Group-in-a-Box layout technique, which presents the results of clustering algorithms so as to clearly show the size of each cluster, connection density within each cluster, and the connection frequency between clusters.

As all exploratory data analysts do, we generated insights that we invite others to replicate with other tools, such as different visual layout techniques or statistical criteria. Our work is in the spirit of observational research that forms categories, like 17th century botanists describing the variety of flowers on a newly discovered island or astronomers whose new telescopes allow them to see different categories of galaxies. Our naming reflects conjectures about why different structures emerge. These categories and explanations are open to challenge by others who may have differing perspectives and more powerful tools.

Our tool was NodeXL, a plug-in extension to Microsoft Excel spreadsheets that enables network overview, discovery, and exploration. NodeXL allows users to import network data and perform analysis and visualization of networks. NodeXL permits anyone to connect to social media services (including Twitter, Facebook, YouTube, Flickr, Wikis, email, blogs and websites) and retrieve public data about the connections among users, pages, and documents. In the specific case of Twitter, the tool captures information about the content of each message (the “tweet”), which may contain usernames, hyperlinks and hashtags, along with information about each author’s connections to other Twitter users. In Twitter, these connections include relationships among users who follow one another, who mention one another, and who reply to one another.

We performed Twitter keyword searches which returned a set of tweets that were then used as datasets for analysis. Network connections were extracted from the content of each tweet returned in Twitter Search results. A link was created for every reply or mention we observed. In addition, NodeXL captures information about the Twitter user’s connections to other Twitter members.2 Data are also retrieved from each user’s public Twitter profile, which includes the number of tweets the user has posted, the number of other users that the user follows, and the number of other people who follow that user, among other things. Author statistics are combined with information about the connections among the people who shared the use of the same word, phrase, or term. For example, if Alice and Betty both posted a message in our dataset that includes the term “politics” and Alice follows Betty on Twitter, our data captured this relationship.
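The extraction step described above can be illustrated with a minimal sketch. It is not NodeXL's implementation: the tweet records, their field names (`author`, `text`, `in_reply_to`) and the `extract_edges` helper are hypothetical, and the choice to count a reply's leading @name once, as a reply rather than also as a mention, is an assumption.

```python
import re

# Hypothetical tweet records; the field names are illustrative, not NodeXL's schema.
MENTION = re.compile(r"@(\w+)")

def extract_edges(tweets):
    """Return (author, target, kind) links, one for every reply or mention."""
    edges = []
    for tweet in tweets:
        author = tweet["author"].lower()
        reply_to = tweet.get("in_reply_to")
        reply_to = reply_to.lower() if reply_to else None
        if reply_to:
            edges.append((author, reply_to, "reply"))
        for name in MENTION.findall(tweet["text"]):
            name = name.lower()
            # Assumption: skip the reply target (already recorded) and self-mentions.
            if name != reply_to and name != author:
                edges.append((author, name, "mention"))
    return edges

tweets = [
    {"author": "alice", "text": "@betty what do you make of politics today?", "in_reply_to": "betty"},
    {"author": "betty", "text": "Politics talk with @alice and @carol", "in_reply_to": None},
]
edges = extract_edges(tweets)
for edge in edges:
    print(edge)
```

In this toy data Alice's reply to Betty yields one link, and Betty's tweet yields two mention links, one to Alice and one to Carol.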

Only publicly available messages were analyzed in this study. No direct messages or other private content were collected or analyzed. Any message defined by its author as private (from, for example, “protected accounts”) was excluded from analysis.

There are clear limits to any dataset captured by NodeXL. The tweets we collect are snapshots of finite periods of conversation around a topic or phrase. The data here do not represent the sentiments of the full population of Twitter users or the larger period of discussion beyond the data collection window. Further, Twitter users are not representative of the full population of the United States, of Internet users, or even of social media users generally.3 Thus, we are not arguing that this analysis represents all that happens on Twitter or that it is a proxy for national sentiment on these topics. However, we believe these data sets contain useful snapshots of the structure of social media networks around topics that matter.

Taking “aerial photographs” of Twitter crowds

Our method is similar to taking aerial photographs or short videos of crowds in public spaces, particularly pictures of rallies, protests, political events, and other socially and culturally interesting phenomena. No one snapshot or video clip of a crowd completely captures the event, but taken together crowd images provide some insights into an event or gathering. Our method produces crowd photos from social media spaces, a domain that has not been widely pictured before. Like aerial crowd photographs, social media network maps show the size and structure of the crowd along with the key actors in that crowd.

These social media network maps can reveal information at the level of both individuals and groups. Social media networks often have just a few people who stand out in terms of the unique ways they connect to others. Some networks are composed of just a single group, while others are divided into sub-groups. Each group can be more or less connected to other groups. These structures tell a story about the kinds of interactions that take place in Twitter.

Networks, group density, and diversity of connections

Twitter social media network maps show how interconnected people become when they engage in conversations. People often “clump” into groups. Each network and its sub-groups can be measured in terms of the density of its internal connections. A group of people with many connections among its members is more “dense” than a group that has few connections among the same number of participants. Density is measured as the ratio of the number of relationships among a population to the total number of possible relationships. The density can vary between zero (i.e., no connections among nodes) and 1 (i.e., all nodes in a network are connected to all other nodes). As groups grow in size it is harder to interact with all other participants, so as a rule, the larger the number of people in a social network, the lower the density of their connections. As a result, no one value is a specific threshold for separating high or low density groups. Generally, though, networks are considered to be loosely-knit, low density networks when only a few of the participants are connected to one another.
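The density ratio can be made concrete with a short sketch. The `density` function below is a hypothetical helper, not part of NodeXL. It assumes that repeated links between the same two people count once and that self-links are ignored. The text does not say whether links are treated as directed, so both variants are shown: n(n-1)/2 possible pairs when direction is ignored, n(n-1) when it is kept.

```python
def density(nodes, edges, directed=False):
    """Ratio of observed relationships to possible relationships (0.0 to 1.0)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0  # no pair of people exists, so no relationship is possible
    # Keep only links among the given people, ignoring self-links.
    kept = [(a, b) for a, b in edges if a != b and a in nodes and b in nodes]
    if directed:
        observed = len(set(kept))
        possible = n * (n - 1)
    else:
        observed = len({frozenset(pair) for pair in kept})
        possible = n * (n - 1) // 2
    return observed / possible

people = ["alice", "betty", "carol", "dave"]
links = [("alice", "betty"), ("betty", "alice"), ("betty", "carol")]

print(density(people, links))                 # 2 of 6 possible pairs
print(density(people, links, directed=True))  # 3 of 12 possible directed links
```

Adding more people to `people` without adding links lowers the value, which mirrors the point that larger groups tend to be less dense.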

Some people within a sub-group connect to people outside their group. The amount of internal and external connection in a sub-group is an important indicator of how much people in that group are exposed to people with differing points of view in different groups. If there are few ties between groups, people may not be exposed to content from users in other groups. If there are many ties between groups there is likely to be a larger amount of information flowing between them.
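One simple way to compare internal and external connection is to count the ties that stay inside each group and the ties that cross between groups. The sketch below assumes group membership is already known, for example from a clustering step. The `group_of` mapping, the group labels and the `group_tie_counts` helper are hypothetical.

```python
from collections import Counter

def group_tie_counts(edges, group_of):
    """Count ties inside each group and ties between each pair of groups."""
    internal = Counter()
    external = Counter()
    for a, b in edges:
        ga, gb = group_of[a], group_of[b]
        if ga == gb:
            internal[ga] += 1
        else:
            # Sort the pair so G1-G2 and G2-G1 are counted together.
            external[tuple(sorted((ga, gb)))] += 1
    return internal, external

group_of = {"alice": "G1", "betty": "G1", "carol": "G2", "dave": "G2"}
edges = [("alice", "betty"), ("betty", "alice"), ("carol", "dave"), ("betty", "carol")]

internal, external = group_tie_counts(edges, group_of)
print(dict(internal))
print(dict(external))
```

Here only one of four ties crosses between the two groups, so little content would be expected to flow from one group to the other.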

More on network hubs and bridges

Social network maps created from collections of Twitter relationships often highlight a few individual users who occupy key positions in the network. We refer to the relatively rare highly connected users as “hubs.” Many other users follow these hub users, far more than follow the majority of other people in the network. Hubs are important because they have large audiences. Some people who have fewer connections can be equally important if their links are rare, connecting across the network to otherwise disconnected groups, acting as “bridges.” While big hubs can also occupy the important position of “bridge,” a user with just a few relatively unique connections may also be an important bridge.
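The distinction between hubs and bridges can be sketched with two simple counts. This is an illustration under stated assumptions, not the measure used in the report: hubs are approximated by in-degree (how many distinct users point at someone), and bridges by the set of other groups a user's ties reach. Network analysts often use betweenness centrality for bridges instead. The helper names, the `group_of` mapping and the sample data are hypothetical.

```python
from collections import defaultdict

def in_degree(edges):
    """Number of distinct users who point at each user."""
    sources_of = defaultdict(set)
    for source, target in edges:
        if source != target:
            sources_of[target].add(source)
    return {user: len(sources) for user, sources in sources_of.items()}

def groups_reached(edges, group_of):
    """For each user with a tie outside their own group, the other groups reached."""
    reached = defaultdict(set)
    for a, b in edges:
        ga, gb = group_of[a], group_of[b]
        if ga != gb:
            reached[a].add(gb)
            reached[b].add(ga)
    return dict(reached)

group_of = {"hub": "G1", "a": "G1", "b": "G1", "c": "G1", "x": "G2", "y": "G2"}
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("y", "x"), ("c", "x")]

degrees = in_degree(edges)
top_hub = max(degrees, key=degrees.get)
spans = groups_reached(edges, group_of)
print(top_hub, degrees[top_hub])
print(spans)
```

In this toy network the user `hub` has the largest audience, while `c`, whom no one points at, holds the only tie between the two groups.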

  2. Twitter has subsequently reduced the accessibility of the Followers network data, see: http://www.connectedaction.net/2013/06/11/over-the-edge-twitter-api-1-1-makes-follows-edges-hard-to-get/
  3. Pew Internet Report on Twitter Demographics: http://www.pewinternet.org/Series/Twitter.aspx