Lots of Tweets
I’ve done a small “survey” of publicly accessible tweets recently. I used Twitter’s Streaming API to download a small stream of tweets (about 1-2% of all tweets during that time) for a little over 3 hours the other day. The result was roughly 285 MB of JSON data, with each tweet object delimited by a newline.
I wrote a small Python program to process these tweets and give some stats about them. The program runs in about 15 seconds on my 2008 MacBook, but I haven’t determined whether the slowness is due to reading the file or processing the data. My guess is reading the file, but I’ll need to time that later.
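One rough way to separate the two costs is to time the file read and the JSON parsing as distinct steps. This is a minimal sketch in Python 3, not the original program: the function name `time_read_and_parse` is hypothetical, and it assumes the file is UTF-8 and fits in memory (at roughly 285 MB it should, though it is not free).

```python
import json
import time


def time_read_and_parse(path):
    """Return (read_seconds, parse_seconds, parsed_count) for a newline-delimited JSON file."""
    # Step 1: time reading the whole file into memory, with no parsing at all.
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    read_seconds = time.perf_counter() - start

    # Step 2: time JSON parsing only, on the lines already held in memory.
    start = time.perf_counter()
    parsed_count = 0
    for line in lines:
        try:
            json.loads(line)
            parsed_count += 1
        except ValueError:
            # Blank or malformed lines are skipped, not counted.
            pass
    parse_seconds = time.perf_counter() - start

    return read_seconds, parse_seconds, parsed_count
```

Comparing the two timings on the real file would show which step dominates. The operating system may cache the file after the first run, so a second run can make reading look faster than it was the first time.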
The program reads the file line by line and tries to parse each JSON-encoded tweet. It then determines whether the tweet has geo data and whether it is exactly 140 characters long, and it adds the tweet’s text to a “word” distribution. I say “word”, but really it just splits the text on every space, so words include links, hashtags, usernames, and punctuation; e.g. “this.” is different from “this”.
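The steps described above could look roughly like the following. This is a sketch in Python 3 under stated assumptions, not the original program: the function name `tweet_stats` is hypothetical, it assumes each tweet object carries `text` and `geo` fields, it skips stream messages without a `text` field, and it uses `str.split()` (which splits on any run of whitespace) as a stand-in for splitting on spaces.

```python
import json
from collections import Counter


def tweet_stats(lines):
    """Compute simple statistics over an iterable of JSON-encoded tweets."""
    tweets = geo = characters = full_length = 0
    words = Counter()

    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # blank or malformed line
        if not isinstance(tweet, dict):
            continue
        text = tweet.get("text")
        if text is None:
            continue  # assumption: objects without text are not tweets

        tweets += 1
        if tweet.get("geo"):  # assumption: "geo" is null when no location is attached
            geo += 1
        characters += len(text)
        if len(text) == 140:
            full_length += 1
        # "Words" are just whitespace-separated tokens, punctuation included.
        words.update(text.split())

    # A hapax is a "word" that occurs exactly once in the whole sample.
    hapaxes = sum(1 for count in words.values() if count == 1)

    return {
        "tweets": tweets,
        "geo": geo,
        "characters": characters,
        "average_length": characters / tweets if tweets else 0.0,
        "full_length": full_length,
        "full_length_percent": 100.0 * full_length / tweets if tweets else 0.0,
        "unique_words": len(words),
        "hapaxes": hapaxes,
    }
```

Since a file object yields its lines one at a time, something like `with open("tweets.json", encoding="utf-8") as f: stats = tweet_stats(f)` would process the data without loading it all into memory; the filename here is only a placeholder.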
The program was meant to give just a simple look at the data. Here are the results:
Tweets: 139591
GEO: 996
Characters: 9228541
Average Length: 66.1112894098
140s: 4290
140%: 3.07326403565
Unique 'words': 338538
Total hapaxes: 278362