There are many tools you can use for data analysis. You can dump your data into a database, but that often requires a fair bit of setup for defining schemas, etc.
If you want to do something more ad hoc you could use a spreadsheet if your data is small enough. This isn’t a bad option in many cases.
But the command line is another option that will let you do basic analysis in a fast, fluid and ad hoc way. It’s happy with handling fairly large files too.
The data
newsrw_final_sorted.tsv.zip (188 KB) is an archive of tweets from journalism.co.uk’s News Rewired conference on 3 February 2012 that used the #newsrw hashtag. Download and unzip the file and move it to a working directory somewhere.
The tools
head and tail show us the top and bottom few lines in a file.
wc by default counts the number of lines, words and characters in a file. For our purposes we’ll generally use the -l option just to count the lines.
sort will sort a file alphanumerically, or with the -n option, numerically.
uniq will output a file showing only the unique lines provided that the file is sorted first. The -c option shows a count of the number of identical lines.
cut takes a vertical slice from a file. Using the -f option you can cut one or more fields (columns) which are separated by tabs by default. The -c option lets you specify a cut from one character position to another, e.g. cut -c 1-10 will return the first 10 characters of every line (cut’s numbering starts from 1, not 0.)
The analysis
Before we start, let’s save some typing (or tab-completing) by setting a variable for the name of the file we’re working with:
$ export f=newsrw_final_sorted.tsv
$ echo $f
newsrw_final_sorted.tsv
So now we can use $f in place of the filename everywhere.
Now eyeball the file to see the kind of data we’ve got:
$ head $f
will show that the file is tab-separated with the following fields:
- timestamp
- tweet ID
- Sender’s username
- Message (Twitter calls this the “status”)
- Client
- User replied to (if any)
How many tweets?
Easy. There’s one tweet per line so just count the number of lines in the file:
$ wc -l $f
3683 newsrw_final.tsv
What date/time range do the tweets cover?
The file is sorted in timestamp order with the earliest tweet on the first line of the file and the latest tweet on the last line.
The timestamp is in the first field.
So use head -1 to grab the first line of the file and cut -f1 to grab the first field from the output. Remember to specify the filename as the input to head.
$ head -1 $f | cut -f1
2012-01-27 10:06:25 +0000
and do the same at the end of the file:
$ tail -1 $f | cut -f1
2012-02-06 22:31:11 +0000
How many tweets were sent on each day?
Our file covers the range from 27 January to 6 February. If we summarise the number of tweets per day we can very easily see the day of the conference.
First we need to extract the dates from the file. The dates are stored in a fixed-width format and always occupy the first 10 characters of each line. So:
$ cut -c1-10 $f
will output just the dates:
2012-01-27
2012-01-27
2012-01-27
2012-01-27
2012-01-27
...
2012-02-03
2012-02-03
2012-02-03
2012-02-03
2012-02-03
Not that useful in itself but we can use uniq -c to count all those repetitive lines. uniq requires all input to be sorted (it suppresses adjacent lines) so even though our file is already sorted we’ll sort it again just to show that this isn’t something you can always assume.
$ cut -c1-10 $f | sort | uniq -c
7 2012-01-27
10 2012-01-30
13 2012-01-31
19 2012-02-01
97 2012-02-02
3345 2012-02-03
48 2012-02-04
45 2012-02-05
99 2012-02-06
Who were the most prolific tweeters?
The tweeter’s usernames are in field 3. So start by extracting that on its own using cut -f3.
$ cut -f3 $f
SourceAdam
newsrewired
GabrielleNYC
newsrewired
GabrielleNYC
AndrewGrill
Kred
marksimpkins
currybet
elanazak
...
Then if we sort the output we’ll get a long list of all the tweeters with each user’s repetition grouped together. Used without options, sort sorts in alphanumeric order.
$ cut -f3 $f | sort
04SophieLouise
04SophieLouise
04SophieLouise
04SophieLouise
04SophieLouise
04SophieLouise
10Yetis
10Yetis
10Yetis
10Yetis
...
Now that’s sorted we can use uniq -c to suppress and count the repetitions.
$ $ cut -f3 $f | sort | uniq -c
6 04SophieLouise
7 10Yetis
2 123makingmoney
1 21WFMJ
1 3mil
1 ALStranne
2 AboutTheBBC
1 AbsintheSpirit
1 AdamReed
2 AdamWestbrook
...
This output now needs to be sorted in numeric order. We also want a list with the highest numbers at the top, so that means sorting in reverse order. The command we need is sort -rn – sort in reverse order, numerically.
$ cut -f3 $f | sort | uniq -c | sort -rn
175 SarahMarshall3
156 newsrewired
154 journalismnews
104 GarrettGoodman
95 aaroscape
82 mikemullane
79 journochat
77 BBCCollege
69 Jackdearlove
65 GabrielleNYC
...
Pipe this output to head or tail to return only the top or bottom ten rows from the output if required.
$ cut -f3 $f | sort | uniq -c | sort -rn | head
Which Twitter clients were people using?
You can tell a lot about a group of tweeters by looking at the clients they use to send tweets.
The client data is in field 5. We can use exactly the same method as we did for the prolific tweeters: extract the field, sort the output to group runs of the same client together, count the repetitions with uniq -c and sort the results numerically in reverse order.
$ cut -f5 $f | sort | uniq -c | sort -rn
1284 <a href="http://www.tweetdeck.com" rel="nofollow">TweetDeck</a>
427 <a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>
381 <a href="http://twitter.com/">web</a>
314 <a href="http://twitter.com/#!/download/iphone" rel="nofollow">Twitter for iPhone</a>
238 <a href="http://www.echofon.com/" rel="nofollow">Echofon</a>
...
This is giving us the right result but unfortunately it’s quite messy with all those HTML entities.
Tweets per hour during the conference
As luck would have it ;) the timestamp is at the start of every line, so we can extract all the tweets for a specific day just by grepping the file using the ^ anchor to say we want to find the pattern at the start of the line:
$ grep ^2012-02-03 $f
From here we use similar techniques as we did finding the number of tweets per day.
The hour part of the fixed-width timestamp field is stored in columns 12 and 13. We can extract that in the usual way using cut -c to grab that range of characters:
$ grep ^2012-02-03 $f | cut -c12-13
05
06
06
06
06
06
06
06
06
06
...
Now just sort that output and count the adjacent repetitions with uniq -c:
$ grep ^2012-02-03 $f | cut -c12-13 | sort | uniq -c
1 05
11 06
20 07
54 08
103 09
381 10
515 11
543 12
294 13
284 14
439 15
352 16
209 17
38 18
27 19
38 20
9 21
17 22
10 23
and there’s your visualisation with the hours in the second column and the number of tweets in that hour in the first.