Tweet by Tweet: Analysing the anatomy of voice in conference live tweets

It’s now just over a week since the end of SAA2017. Back at home, I’ve been reflecting on this great conference.

I decided to spend more time at the bioarchaeology and funerary archaeology sessions at SAA as I was at CAA recently. Nevertheless, I was able to keep up with many of the discussions at the digital sessions running in parallel because they were live tweeted.

This got me thinking about coverage of the conference on Twitter, differences between the coverage of each session, how to visualise these, and more importantly what it might tell us.

Recently, I’ve been experimenting with the TAGS tool from Hawksey, which has some nice archiving and visualisation capabilities built in Google Sheets using Google Apps Script.

TAGS 6.1 and #SAA2017

Using TAGS 6.1 I did some data-mining and extracted a dataset of tweets with the SAA2017 hashtag to look for some patterns in the data. This was facilitated by the fact that the SAA tweets are well organised by sessions using #s tags, and indeed one of the first tweets in the dataset that I extracted was about the use of hashtags:

and @captain_primate had helpfully pointed out Brian Croxall’s tips for tweeting at conferences

The TAGS archive ran from 26 March 2017 at 22.53, three days before the first event, until 5 April 2017 at 22.56, approximately three days after the close of the conference. It comprised 5459 tweets from 979 users (#SAA2017 Tweeps).

Now is a good time to highlight that this dataset will not be fully representative of the conference sessions and discussions for a few reasons, the main ones being:

  1. It only represents a subset of the tweets from SAA, given that not all users used the SAA2017 hashtag, and not all session-related tweets would have been tagged with the session tag
  2. The extract may not be complete, because Twitter’s search API is “focused on relevance and not completeness” according to a statement from Twitter posted on Hawksey’s FAQs page (TAGS 2017). Others have found that the API doesn’t always represent all Twitter activity accurately (Gonzalez-Bailon et al. 2012)
  3. Finally, not everyone tweets, and those who do tweet are not necessarily representative of the conference audience

The first two points are related to working within the constraints of the source of the data and tools available. The third one is relevant to the content and focus of the tweets, which we’ll explore more below.

Despite these ‘challenges’ ☺ I decided to hack the data to analyse in Voyant Tools (a great text analysis tool I found out about via @electricarchaeo: Graham 2014).

Voyant Tools Analysis

To analyse the tweets I sorted them by date, from oldest to most recent, and loaded this into Voyant Tools as a single corpus to see what it looked like (you can access the Voyant Tools corpus here).

The total count was 95,802 words and 10,418 unique word forms, the most frequent being saa2017 (5335); rt (2140); https (2028); t.co (1897); data (730); archaeology (714); amp (543); s149 (347); digital (297); session (296).
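The counts above came from Voyant Tools, but the idea can be reproduced in a few lines. The following is a minimal sketch, not Voyant's own tokenizer: the sample tweets, the `TOKEN` pattern and the function names are all assumptions made for illustration, so the numbers it produces on the real archive would not necessarily match those reported here.

```python
import re
from collections import Counter

# Hypothetical sample tweets standing in for the "text" column of the TAGS archive.
SAMPLE_TWEETS = [
    "RT @user_a: Great forum on data management #SAA2017 #s149 https://t.co/abc123",
    "Data reuse needs better metadata &amp; documentation #SAA2017 #s149",
    "Heading to the digital archaeology session #SAA2017",
]

# Assumed token rule: a run of letters, digits or underscores, optionally
# joined by internal dots so that "t.co" survives as a single token.
TOKEN = re.compile(r"[a-z0-9_]+(?:\.[a-z0-9_]+)*")


def tokenize(text):
    """Lowercase a tweet and split it into tokens."""
    return TOKEN.findall(text.lower())


def word_frequencies(tweets):
    """Count every token across all tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(tokenize(text))
    return counts


frequencies = word_frequencies(SAMPLE_TWEETS)
total_words = sum(frequencies.values())   # total word count
unique_forms = len(frequencies)           # unique word forms
print(total_words, unique_forms, frequencies.most_common(3))
```

Even on three made-up tweets, the hashtag, the URL fragments and the `amp` left over from `&amp;` show up as ordinary words, which is the same effect seen in the real top 10.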

The graph and bubble line below show the frequency of words appearing over the duration of the dataset.

Screen Shot 2017-04-08 at 20.09.30

Screen Shot 2017-04-08 at 20.18.10

Five words (saa2017, rt, https, t.co, amp) among the top 10 are technical artefacts rather than content, which means that they are predisposed to being high frequency by default.

As SAA2017 is the search term it was expected to be included in all tweets. It was used as a hashtag on every tweet, but due to the way that TAGS archives the data it didn’t appear in the “text” field on 124 records, hence only appears on 5335 in the dataset of 5459.

RT occurred frequently, as each retweet started with those characters. t.co is Twitter’s shortened URL domain, mostly used for image links. Amp is short for ampersand and appeared because TAGS extracted the character as its HTML entity code (&amp;).

Interestingly, RTs increased greatly towards the end of the sample period, possibly because people looked back over the increasing archive of tweets from the conference as it progressed and finished.

The removal of each of these technical words from the list reveals that discussions frequently mentioned data, archaeology, digital, session, and s149.
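As a rough illustration of that filtering step, the sketch below drops the technical tokens from the reported top terms. The frequencies are the ones quoted in this post; the `TECHNICAL` set and the variable names are assumptions, and "t.co" is an assumed label for the shortened-URL token.

```python
from collections import Counter

# Frequencies as reported in this post for the named top terms.
# The shortened-URL token (1897 occurrences) is left out of this toy input.
top_terms = Counter({
    "saa2017": 5335,
    "rt": 2140,
    "https": 2028,
    "data": 730,
    "archaeology": 714,
    "amp": 543,
    "s149": 347,
    "digital": 297,
    "session": 296,
})

# Assumed list of tokens that are artefacts of the platform or the export.
TECHNICAL = {"saa2017", "rt", "https", "t.co", "amp"}

# Keep only content words, still ordered from most to least frequent.
content_terms = [
    (word, count)
    for word, count in top_terms.most_common()
    if word not in TECHNICAL
]
print(content_terms)
```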

At an archaeology conference you’d expect people to discuss archaeology and the sessions they were attending. The high frequency of data, digital and s149 looks to be due to the fact that s149 was the most tweeted session. It was the Forum “Beyond Data Management: A Conversation”, and is related to data and digital archaeology.

Furthermore, s149, data and digital may have been tweeted more frequently because of the point discussed above: Twitter users are more likely to be a subset of attendees concerned with digital topics, given that Twitter is a digital media platform.

It is possible to tie the tweeting of data and s149 together using Voyant Tools Collocates functionality, which shows the high frequencies of these two together.

Screen Shot 2017-04-08 at 20.32.45

To investigate the coverage of other sessions, I filtered for the top 10 most frequently tweeted session IDs, and reviewed the frequency of mentions over time.
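A filter of this kind could be sketched as follows. This is only an illustration under stated assumptions: it assumes session tags take the form `#s` followed by digits (e.g. `#s149`), the sample tweets are invented, and `session_ids` and `top_sessions` are hypothetical helper names rather than anything provided by TAGS or Voyant Tools.

```python
import re
from collections import Counter

# Assumed tag format: "#s" followed by digits, in either case.
# "#SAA2017" does not match because "#S" is followed by a letter.
SESSION_TAG = re.compile(r"#s(\d+)\b", re.IGNORECASE)


def session_ids(text):
    """Return the set of session IDs mentioned in one tweet."""
    return {f"s{number}" for number in SESSION_TAG.findall(text)}


def top_sessions(tweets, n=10):
    """Count tweets per session ID and return the n most tweeted."""
    counts = Counter()
    for text in tweets:
        counts.update(session_ids(text))
    return counts.most_common(n)


# Invented examples for illustration only.
sample = [
    "Forum starting now #SAA2017 #s149",
    "RT @user_a: Forum starting now #SAA2017 #s149",
    "Big data panel #saa2017 #S227",
    "Moving between #s149 and #s227 today #SAA2017",
    "No session tag here #SAA2017",
]
print(top_sessions(sample, n=2))
```

Using a set per tweet means a tweet that repeats the same tag is counted once for that session, while a tweet naming two sessions counts towards both.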

The filter showed that ‘digital archaeology’-themed sessions dominated the tweet sample. Eight out of 10 top tweeted sessions had direct links to digital archaeology or computer applications and archaeology. The most frequent was s149.

ID | Session | Categorisation
s149 | Beyond Data Management: A Conversation About “Digital Data Realities” | Digiarch
s227 | The Future Of “Big Data” In Archaeology | Digiarch
s372 | Lightning Rounds Institute For Digital Archaeology Method And Practice Project Reports | Digiarch
s37 | Archaeological Epistemology In The Digital Age | Digiarch
s224 | Burning Libraries: Environmental Impacts On Heritage And Science | Environment
s112 | How To Do Archaeological Science Using R | Comparch
s18 | Methods And Models For Teaching Digital Archaeology And Heritage | Digiarch
s312 | Current Challenges In Using 3d Data In Archaeology | Digiarch
s256 | Do Data Stop At The 49th Parallel? The State Of Archaeological Databases Digital Methodologies, Heritage Management, And Research Collaboration Through Canada And The United States | Digiarch
s330 | Investigating The Hunter-Gatherers Of Lake Baikal And Hokkaido: Integrating Individual Life Histories And High-Resolution Chronologies | Hunter-Gath.

The temporal patterning of session IDs shows clear peaks during the times when each session took place. This points to sessions being ‘live tweeted’.


Digging Deeper

To gain some more insights into the data I looked at the full list of 5459 tweets to determine the most frequent tweeters. I defined these high frequency tweeters as users contributing 0.99% or more of the entire dataset. This turned out to be 20 users, so I termed them T20 Tweeters, in contrast to all other users.

I’ve provided the counts and proportion of tweets for each T20 Tweeter below against generic usernames. This is because although the data is publicly available on Twitter, the users might not have anticipated the publication of their specific usernames on a blog (cf. Twitter 2017a).

However, I’ve maintained the handles along with the tweet text for each specific tweet in the archive, as required by Twitter’s broadcast guidelines (Twitter 2017b). If you are interested in seeing the user handles, you can find all the data in the Zenodo dataset or in the TAGS archive.

You can read more about the ethical discussion on social media data-mining online from a range of sources (Fish 2010; Social Data Science Lab 2016; Townsend and Wallace 2016; Zimmer 2010).

User | Count | % of Total SAA2017 Tweets
User_01 | 302 | 5.53%
User_02 | 195 | 3.57%
User_03 | 169 | 3.10%
User_04 | 165 | 3.02%
User_05 | 139 | 2.55%
User_06 | 102 | 1.87%
User_07 | 100 | 1.83%
User_08 | 91 | 1.67%
User_09 | 87 | 1.59%
User_10 | 85 | 1.56%
User_11 | 84 | 1.54%
User_12 | 83 | 1.52%
User_13 | 82 | 1.50%
User_14 | 81 | 1.48%
User_15 | 79 | 1.45%
User_16 | 77 | 1.41%
User_17 | 62 | 1.14%
User_18 | 57 | 1.04%
User_19 | 54 | 0.99%
User_20 | 54 | 0.99%
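The T20 cut-off can be checked with a short calculation. The sketch below uses the counts from the table and the dataset total of 5459; `User_21` is a hypothetical extra user added only to show the threshold excluding someone, and the function names are my own. Note that the 0.99% threshold is applied to the share rounded to two decimal places, since 54 tweets is fractionally under 0.99% before rounding.

```python
TOTAL_TWEETS = 5459

# Counts per anonymised user, copied from the table.
tweet_counts = {
    "User_01": 302, "User_02": 195, "User_03": 169, "User_04": 165,
    "User_05": 139, "User_06": 102, "User_07": 100, "User_08": 91,
    "User_09": 87, "User_10": 85, "User_11": 84, "User_12": 83,
    "User_13": 82, "User_14": 81, "User_15": 79, "User_16": 77,
    "User_17": 62, "User_18": 57, "User_19": 54, "User_20": 54,
    "User_21": 53,  # hypothetical user, not in the post, for illustration
}


def share_pct(count, total=TOTAL_TWEETS):
    """Percentage of all tweets, rounded to two decimals as in the table."""
    return round(100 * count / total, 2)


def high_frequency_users(counts, threshold_pct=0.99):
    """Users whose rounded share of the dataset meets the threshold."""
    return [user for user, n in counts.items() if share_pct(n) >= threshold_pct]


t20 = high_frequency_users(tweet_counts)
t20_total = sum(tweet_counts[user] for user in t20)
print(len(t20), t20_total, share_pct(t20_total))
```

On these figures the 20 users account for 2148 of the 5459 tweets, a little under two fifths of the dataset.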

Once I had the T20 tweeters and the top 10 sessions I decided to use this data to investigate two things.

Firstly, how the top 10 sessions compared to a selection of bioarchaeology and funerary archaeology sessions, which were not tweeted as frequently; and secondly, the influence of T20 Tweeters on how much each session was tweeted.

ID | Session | Categorisation
s276 | Curating The Past: The Practice And Ethics Of Skeletal Conservation | Bioarch
s92 | Bioarchaeology And Genetics | Bioarch
s139 | Manipulated Bodies: Investigating Postmortem Interactions With Human Remains | Bioarch
s31 | Bodies As Narratives: Revisiting Osteobiography As A Conceptual Tool | Bioarch
s219 | Life And Death In Ancient Nubia: Archaeological And Bioarchaeological Perspectives | Bioarch
s252 | Mortuary Practices And Funerary Archaeology I | Funarch
s245 | “Us” And “Them”: The Bioarchaeology Of Belonging | Bioarch

The Lowdown

Breaking out the tweets per session shows the high frequency of s149 tweets, and that much of the conversation came from T20 Tweeters. The disparity between the top 10 sessions and the bioarchaeology and funerary archaeology sessions is clear. The majority of the latter have very few tweets.

Only one bioarchaeology session comes close to the coverage of the top 10, which is s276, and that is due to the fact that it was extensively covered by a T20 tweeter.


Looking at the number of tweeters per session, there were high levels of participation in the conversation in s149, but some of the top 10 sessions had relatively few tweeters, particularly s224 (on the environment), s312 (a project-specific session), s256, and s330.

Greater numbers of the T20 Tweeters tweeted about the digital archaeology sessions than about the hunter-gatherer, bioarchaeology and funerary archaeology sessions.


To compare the dominance of voice across each session, I compared the % of tweets per session from the T20 Tweeters.

Across the sample, 60-80% of the Top 10 Sessions’ tweets came from T20 Tweeters, suggesting they were the dominant voices in the conversation. The situation is different in the bioarchaeology ones, where 5 of the 6 sessions were 80-100% tweeted by a T20 Tweeter: if they hadn’t been there, the sessions wouldn’t have been tweeted!
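The per-session percentage is a simple ratio of T20 tweets to all tweets carrying that session tag. A minimal sketch, assuming each tweet has already been reduced to a (user, session) pair; the records and the function name below are invented for illustration and do not reproduce the real figures.

```python
from collections import defaultdict


def t20_share_by_session(records, t20_users):
    """Percentage of each session's tweets that came from T20 users.

    records: iterable of (user, session_id) pairs, one per tweet.
    """
    totals = defaultdict(int)
    from_t20 = defaultdict(int)
    for user, session in records:
        totals[session] += 1
        if user in t20_users:
            from_t20[session] += 1
    return {s: 100 * from_t20[s] / totals[s] for s in totals}


# Invented records: four tweets per session.
t20_users = {"User_01", "User_02", "User_03"}
records = [
    ("User_01", "s149"), ("User_02", "s149"),
    ("other_a", "s149"), ("User_01", "s149"),
    ("other_b", "s330"), ("other_c", "s330"),
    ("User_03", "s330"), ("other_d", "s330"),
]
shares = t20_share_by_session(records, t20_users)
print(shares)
```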

One session from the Top 10 that stands out is the hunter-gatherer session, s330, which has the lowest % of tweets from the T20 Tweeters, suggesting that this was not attended by as many of these users.


To further investigate these patterns, I also looked at the proportion of RTs within each session’s tweets. This reveals that 20-60% were retweets in each of the top 10, except in s330. Here, over 70% of tweets were retweets. The evidence from the bioarchaeology sessions points to much less retweeting than the top 10. Interestingly, the one session of all analysed with no T20 Tweeters was s219, and this had a much higher retweet component compared to the other bioarch sessions, at over 60%.

One potential hypothesis to test based on this data is whether the proportion of retweets in a session is greater when there are fewer high frequency T20 Tweeters.

If this turns out to be the case, it might be that high frequency tweeters are generating more original content.
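To test that hypothesis, each session needs two numbers side by side: its retweet percentage and the number of distinct T20 Tweeters who tweeted it. The sketch below shows one way to pair them up, assuming a retweet can be recognised by its text starting with "RT @" and that each tweet is a (user, session, text) triple; the data and the function name are invented for illustration.

```python
def session_summary(records, t20_users):
    """Map session ID -> (retweet %, number of distinct T20 users).

    records: iterable of (user, session_id, text) triples, one per tweet.
    """
    rows = {}
    for user, session, text in records:
        row = rows.setdefault(session, {"tweets": 0, "retweets": 0, "t20": set()})
        row["tweets"] += 1
        if text.startswith("RT @"):  # assumed retweet marker
            row["retweets"] += 1
        if user in t20_users:
            row["t20"].add(user)
    return {
        session: (100 * row["retweets"] / row["tweets"], len(row["t20"]))
        for session, row in rows.items()
    }


# Invented records: one session with T20 coverage, one without.
t20_users = {"User_01", "User_02"}
records = [
    ("User_01", "s149", "Forum starting now #s149"),
    ("User_02", "s149", "Good point on metadata #s149"),
    ("other_a", "s149", "RT @User_01: Forum starting now #s149"),
    ("User_01", "s149", "Questions from the floor #s149"),
    ("other_b", "s219", "Nubia session under way #s219"),
    ("other_c", "s219", "RT @other_b: Nubia session under way #s219"),
    ("other_d", "s219", "RT @other_b: Nubia session under way #s219"),
    ("other_e", "s219", "RT @other_b: Nubia session under way #s219"),
]
summary = session_summary(records, t20_users)
print(summary)
```

With the real archive, the resulting pairs could be plotted or passed to a correlation test; with only 17 sessions analysed here, any such result would need to be treated cautiously.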



T20 Tweeters appear to be prolific tweeters who are interested in live tweeting the conference sessions they attend. As we might expect, those users are more likely to attend digital-related sessions, and hence these are the most frequently tweeted.

Where fewer T20 Tweeters live tweet the session, there may be less original content, and the RT component increases.

Another interesting aspect was the increase in RTs towards the later end of the date range, at the end of the conference and in the period afterwards. This may be due to the fact that as the conference runs it accumulates a greater number of tweets which may be retweeted, but it may also be a function of how users interact with Twitter, returning to review tweets as the event closes.

This is only a small scale investigation and there is much more to explore. However, there are other articles to be done before Easter, so I didn’t get to look at aspects such as the impact of the duration of the session on tweets, nor did I explore retweets in depth, or touch on likes, replies and follower counts.

The data used is posted to Zenodo under the DOI 10.5281/zenodo.495733 (#SAA2017 Tweeps), and is also available on GitHub, as a Voyant Tools corpus, or via the TAGS archive.

P.S. As the evidence points out, we also need to give bioarchaeology at SAA some more digital exposure, so you can check out the presentation I gave at SAA here.


#SAA2017 Tweeps. 2017. SAA2017 TAGS Tweet Archive [Data set]. Zenodo.

Fish, A. 2010. Mining Twitter and Informed Consent. Available at: Accessed: 9 April 2017.

Gonzalez-Bailon, S., Wang, N., Rivero, A., Borge-Holthoefer, J. and Moreno, Y. 2012. Assessing the Bias in Communication Networks Sampled from Twitter. SSRN Electronic Journal. DOI 10.2139/ssrn.2185134

Graham, S. 2014. Text Analysis of the Grand Jury Documents. Available at: Accessed: 6 April 2017.

Social Data Science Lab. 2016. Lab Online Guide to Social Media Research Ethics. Available at: Accessed: 9 April 2017.

TAGS. 2017. FAQs. Available at: Accessed: 9 April 2017.

Townsend, L. and Wallace, C. 2016. Social Media Research: A Guide to Ethics. University of Aberdeen. Available at: Accessed: 9 April 2017.

Twitter. 2017a. Developer Agreement & Policy. Available at: Accessed: 9 April 2017.

Twitter. 2017b. Broadcast Guidelines. Available at: Accessed: 9 April 2017.

Zimmer, M. 2010. Is it Ethical to Harvest Public Twitter Accounts without Consent? Available at: Accessed: 9 April 2017.

