Mr. 23

Quack!

Lots of Tweets

I’ve done a small “survey” of publicly accessible tweets recently. I used Twitter’s Streaming API to download a small stream of tweets (about 1-2% of all tweets during that time) for a little over 3 hours the other day. The result was roughly 285 MB of JSON data, with each tweet object delimited by a newline.

I wrote a small Python program to process these tweets and give some stats about them. The program runs in about 15 seconds on my 2008 MacBook, but I haven’t determined whether the slowness is due to reading the file or processing the data. My guess is reading the file, but I’ll need to time that later.
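One way to settle the reading-versus-processing question is to time the two phases separately. The following is only a rough sketch, not the program described here: the function name time_phases is made up for illustration, it assumes Python 3, and it loads every line into memory first, which is fine for a few hundred MB but not for much larger files.

```python
import json
import time


def time_phases(path):
    """Return (read_seconds, parse_seconds) for a newline-delimited JSON file.

    Hypothetical helper: reads the whole file first, then parses each line,
    so the two costs can be compared.
    """
    # Phase 1: read every line from disk, no parsing.
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    read_seconds = time.perf_counter() - start

    # Phase 2: parse the lines that are already in memory.
    start = time.perf_counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON.
            pass
    parse_seconds = time.perf_counter() - start

    return read_seconds, parse_seconds
```

Calling time_phases("tweets.json") (the file name is a placeholder) gives two numbers to compare. On a second run the file may already sit in the operating system’s cache, so the read time can look better than it really is.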

The program reads the file line by line and tries to parse each JSON-encoded tweet. It then checks whether the tweet has geo data and whether it is exactly 140 characters long, and it adds the tweet’s text to a “word” distribution. I say “word”, but really it just splits the text on every space, so “words” include links, hashtags, usernames, and attached punctuation; e.g. “this.” is different from “this”.
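The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original program: the function name tweet_stats and the result keys are made up, it assumes Python 3, it assumes each tweet object carries "text" and "geo" fields, it skips lines that do not parse or have no text, and it uses split(), which splits on any run of whitespace rather than strictly on single spaces.

```python
import json
from collections import Counter


def tweet_stats(lines):
    """Compute simple stats over an iterable of JSON-encoded tweet lines."""
    tweets = geo = characters = full_length = 0
    words = Counter()

    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # not valid JSON, skip it
        # Assumption: anything without a "text" field is not a tweet.
        if not isinstance(tweet, dict) or "text" not in tweet:
            continue

        text = tweet["text"]
        tweets += 1
        characters += len(text)
        # Assumption: "geo" is null/absent when no location is attached.
        if tweet.get("geo") is not None:
            geo += 1
        if len(text) == 140:
            full_length += 1
        # Naive "words": links, hashtags, usernames, punctuation included.
        words.update(text.split())

    return {
        "tweets": tweets,
        "geo": geo,
        "characters": characters,
        "average_length": characters / tweets if tweets else 0.0,
        "140s": full_length,
        "140%": 100.0 * full_length / tweets if tweets else 0.0,
        "unique_words": len(words),
        # Hapaxes: "words" that occur exactly once in the whole sample.
        "hapaxes": sum(1 for count in words.values() if count == 1),
    }
```

Since an open file is an iterable of lines, something like `with open("tweets.json", encoding="utf-8") as f: print(tweet_stats(f))` would process the file line by line without loading it all at once (the file name is a placeholder).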

The program was meant to give just a simple look at the data, and here are the results:

Tweets:		139591
GEO:		996
Characters:	9228541
Average Length:	66.1112894098
140s:		4290
140%:		3.07326403565
Unique 'words':	338538
Total hapaxes:	278362

Filed under CompSci

  1. mr23 posted this