Store Binary Data in Twitter with Tootfiles
Sometimes a man's worst ideas lead to his finest moments: Ben Franklin decides to fly a kite in a storm and we get electricity (I'm paraphrasing that story), Alexander Graham Bell shacks up with his 15-year-old apprentice (who was deaf) and we get the telephone (paraphrasing again).
Now, I don't fly kites and my wife is neither 15 years old nor deaf, but I feel you and I might be on the cusp of solving the world's data storage needs as I, too, have a terrible idea: let's put everything in Twitter. Everything. Music, photos, tax forms--you name it, we should be able to store the data in a Twitter stream.
Brilliance begets 'tootfiles'
So yes, you probably have realized that storing all of your binary data in Twitter is one of the top five or six ideas of this decade, so you're as eager as I am to begin implementation. Luckily, I've already gone to the trouble and written a Python script that does just this: I call it tootfiles.
Note: What follows is a detailed look at how this project was built and the various choices I made along the way. If the details of character encodings and data compression bore you and you're only interested in seeing the final code or using the script yourself, you can check out the tootfiles project page over at my Github account. I promise I won't judge you.
A Quick (and faulty) Look at the Numbers
Twitter allows us to post individual text messages to its service with up to 140 bytes per message. Assuming for a moment that each binary byte can be represented as one character in a toot, it is fairly simple to gauge how many toots it will take to represent a file given its size in bytes. For example, the standard sized Google search logo weighs in at 8,558 bytes. Divided up into 140 byte segments, you might determine that we could represent this file with 62 individual toots. That sounds like a lot of bit-sized messaging, but consider that internet celebrity Robert Scoble has over 20,000 toots to his name and still has over 90,000 followers fawning over whatever it is that he actually does.
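If you want to sanity-check that arithmetic yourself, here's a quick back-of-the-envelope helper--a sketch for illustration, not code from the real script:

import math

def toots_needed(filesize, chunk=140):
    # naive estimate: one byte of file data per character of toot
    return int(math.ceil(filesize / float(chunk)))

print(toots_needed(8558))  # the Google logo: 62 toots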
A more amusing case study might be trying to store a music file in your Twitter stream. Doing a quick search for some legal music to test with, I've found a punchy little number from the year 1909 called "John, Go and Put Your Trousers On." With this short piece weighing in at 3,586,134 bytes, it would only take 25,616 individual toots to represent it as a tootfile. I'll admit that this is a disturbingly high number of messages to store a small music file, but life is about trudging through dead-end projects and never letting it break your spirit, so we're just going to pretend it's not a problem. Did you hear that? It's not a problem.
Reality Bytes
So the truth is we can't just slice the files up and stick the pieces into a toot. Binary files are a different beast from the short, nearly plaintext messages of Twitter. To fit any possible binary byte into a string that we can store in Twitter, we'll need to store these bytes using some type of standard encoding system.
Choosing an encoding system
The two candidate encoding systems I evaluated for this project were Base64 and Base85, with the trailing number indicating the size of the usable character set. To understand this a little better, remember that binary is base 2 (it can represent any number using only 1 or 0) and the standard numerical representation is base 10 (representing any number using 0-9). You could theoretically encode anything using any base, but the higher the base the tighter you can pack the data in.
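To make that concrete, here's the same number written out in a few bases at the Python prompt; the bigger the base, the fewer digits it takes:

>>> n = 255
>>> bin(n)   # base 2: eight digits
'0b11111111'
>>> str(n)   # base 10: three digits
'255'
>>> hex(n)   # base 16: two digits
'0xff'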
Size Matters...
I said before that the higher the base, the more efficiently you can pack data into your string. To examine this a little closer, let us look at the real world difference between Base64 and Base85 encodings.
Base64 uses the character ranges A-Z, a-z, and 0-9 for its first 62 character slots. Depending on the implementation (and there are a few), the remaining two characters differ. In mapping binary to this encoding scheme, Base64 uses four characters to represent three bytes of data (three bytes is 24 bits, which splits evenly into four 6-bit values, each addressable with 64 symbols). This leads to a roughly 33% increase in size when representing a binary file in Base64.
Base85 (as defined in RFC 1924) uses the character ranges 0-9, A-Z, a-z, and the 23 characters !#$%&()*+-;<=>?@^_`{|}~ to represent data. By using five characters to represent four bytes of data (85^5 is just barely larger than 2^32, so five symbols can cover every possible four-byte value), Base85 typically can represent a binary file with a 25% increase in size.
To compare these encodings using our two example files from above, I wrote a script that encodes the files with Base64 (as it's included in the standard lib) and used this pure Python implementation which does the same with Base85. You can compare the file sizes in the chart below.
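For the curious, that comparison amounts to only a few lines. Here's a sketch--note that base64.b85encode only landed in the standard library with Python 3.4, so at the time of writing I used the third-party module linked above:

import base64

raw = open('logo.png', 'rb').read()  # hypothetical test file
b64 = base64.b64encode(raw)
b85 = base64.b85encode(raw)

# expect roughly 1.33x the original size for Base64 and 1.25x for Base85
print(len(raw), len(b64), len(b85))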
There seem to be no surprises there, as the numbers back up our earlier claims with regard to encoding efficiency.
... But Character Sets Matter More
So yeah--Base85 is much more efficient; that's the pick, right? I'll now ask that you hold on for just one moment, as there is one more issue that we must consider: the characters used to represent our data in the encodings.
You see, Twitter has a little known 'feature' that does a bit of post-processing to your messages to prevent bad guys from inserting scripts into a Twitter stream. This sanitization of Twitter messages takes certain special characters and converts them into HTML entities, meaning that characters like <, >, and & get expanded into safer representations of themselves (&lt;, &gt;, and &amp;).
Sanitizing messages is great because it keeps you and me safe when we're tooting away, but this has serious implications for our data storage scheme. You see, converting these single characters into HTML entities means each one now takes up four or five bytes instead of one. Base85 includes a number of characters that Twitter translates into HTML entities, so our 140 byte messages might actually end up taking up much more than that and cause problems in the decode process.
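To see the problem in action, here's a rough approximation of that sanitization; sanitize() is my stand-in for Twitter's behavior, not their actual code:

def sanitize(message):
    # approximate Twitter's escaping of dangerous characters
    return message.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

toot = '<~9jqo^F*2M7/c~>&'  # a made-up Base85-flavored payload
print(len(toot), len(sanitize(toot)))  # the 'safe' version is longer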
For the above reason, it's smarter for us to stick with good old Base64 encoding. Even without Base85's character issues (get it?), there's something to be said for having a standard Base64 implementation baked right into most standard libraries as well.
Implementation Details
Instead of walking through every line of the code, I want to just highlight some of the other choices I made and what it means for the finished product. This is broken down into two sections--one for posting to and one for decoding files from Twitter--as the situations require different approaches. The code snippets are shown as Python functions for brevity, but you'll notice that the final code handles things with an object-oriented design.
Posting Data to Twitter
Compression
Before encoding our data to be posted to Twitter, we should really employ some type of simple data compression. Using the standard zlib package that ships with Python, I ran our two example files from before through a quick script to see how much compression really helps us. The snippet below (shortened at the expense of PEP 8) should give you an indication of how to do this on your own.
import os, sys, zlib, base64

f = open('somefile', 'rb')
fcompressed = zlib.compress(f.read())
fencoded = base64.b64encode(fcompressed)
The results from the run are shown in the following chart.
This chart actually surprised me a bit, as I had only hoped to shrink the file down enough to save some of the extra cruft we saw after encoding. While the compressed-and-encoded Google logo is only slightly smaller than the plain Base64 encoded file, our rousing hit song about trousers from 1909 has been reduced to a startling 82% of the original file's size. Your mileage will vary on this, as some file formats compress better than others while many formats are compressed by default, but it's clearly a huge win to include a bit of fast compression in our code.
Get Your toot in the Door
So now that we have a decent handle on our compressed and encoded data, it's time to think about getting it tooted. Before posting the messages to Twitter, we'll first need to split the large string of data into 140 byte chunks. The below function does just this, given a data string and defaulting the slice size to 140.
import math

def segment(data, n=140):
    ''' Given the encoded string, slice it into twitter ready array elements '''
    tootcount = int(math.ceil(len(data)/float(n)))
    slices = range(tootcount)
    slices.reverse()
    return [data[i*n:(i+1)*n] for i in slices]
You'll notice that we reverse the range in the segment() function, as we want to build the list from tail to head. Because Twitter displays a stream newest-first, this reverse order makes reading the individual toots easy later on in the decode section: we can simply iterate through the stream grabbing the toots in order.
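A quick interpreter session with a toy string shows the reversed ordering:

>>> segment('abcdefgh', n=3)
['gh', 'def', 'abc']

Posting that list front to back makes 'abc' the newest data toot (sitting just under the header), so walking the stream top-down hands back the segments in file order.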
Including some type of header helps signify the start of a file, along with a bit of metadata. Header information is appended to the list (to be posted last) using the following format:
tootlist = segment(data)

# Header Information
header = "|Tootfile:'%s' MD5:'%s' Count:'%s'|" % (filename, md5hash, tootcount)
tootlist.append(header)  # Insert the header
With this header, we're helping future decode runs by including three important pieces of information: the name of the file, the MD5 hash of the file data (for integrity checking), and the number of data segments that will follow the header.
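On the decode side, a simple regular expression is enough to pull those three fields back out. This parser is my own hypothetical illustration--the project's actual parsing may differ:

import re

header = "|Tootfile:'logo.png' MD5:'d41d8cd98f00b204e9800998ecf8427e' Count:'62'|"
m = re.match(r"\|Tootfile:'(.+)' MD5:'([0-9a-f]+)' Count:'(\d+)'\|", header)
filename, md5hash, tootcount = m.groups()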
Publishing the Segments
Publishing the toots to Twitter is accomplished using the python-twitter library. It's a fairly simple library to use (despite the uppercase method names), so I won't get into the details of this other than to provide a quick look at the publish() method:
import time, twitter

def publish(data, username, password):
    api = twitter.Api(username, password)
    for toot in data:
        for retries in range(5):
            try:
                status = api.PostUpdate(toot)
                break
            except:
                if retries == 4:
                    raise Exception('Unable to post a segment. Quitting.')
                else:
                    time.sleep(1)
    print "Finished."
There's some rudimentary fault tolerance built in, as I encountered some timeout issues when tooting a larger file. The run will fail if it cannot post a given toot after five attempts.
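Tying the posting pieces together, a hypothetical end-to-end encode-and-post run might look something like this, using the segment() and publish() functions above:

import base64, md5, zlib

raw = open('logo.png', 'rb').read()
data = base64.b64encode(zlib.compress(raw))

tootlist = segment(data)
header = "|Tootfile:'logo.png' MD5:'%s' Count:'%d'|" % \
         (md5.new(raw).hexdigest(), len(tootlist))
tootlist.append(header)

publish(tootlist, 'username', 'password')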
Retrieving Data from Twitter
This part of the project was a bit tricky, as the Twitter API does not seem to provide a good way to access a single user's stream of toots for more than a single page. My preferred approach was that retrieving a file should require nothing more than the ID of its tootfile header.
The first issue I ran into was not knowing the username to grab all of the toots from. I ended up using simplejson and a call to http://twitter.com/statuses/show/<tootid>.json to grab enough info, given a single Twitter message ID, to do the rest of our business. The key piece of info I needed was the Twitter username that owns the toot. I could have settled for requiring a full URL path to the header (which includes the username), but this was a choice I made--right or wrong--as to how I wanted this implemented.
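In code, that lookup boils down to a few lines. The field names here follow the old REST API's status object, so treat this as a sketch:

import urllib, simplejson

def find_username(tootid):
    # one status lookup gives us the owning user's screen name
    f = urllib.urlopen('http://twitter.com/statuses/show/%s.json' % tootid)
    status = simplejson.loads(f.read())
    f.close()
    return status['user']['screen_name']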
After reading a blog post from Scott Carpenter, I decided to just use the fantastic Python HTML scraping library BeautifulSoup to grab the toots out of the given user's stream. While Scott's goal was to archive all of his Twitter messages, I was able to simplify the script to grab just the info we needed. The modified section is below.
import re, math, time, urllib
from BeautifulSoup import BeautifulSoup

def walk(username, headerid, tootcount):
    tootlist = []
    grabbedtoots = 0
    url = 'http://twitter.com/%s?page=%s'
    re_status_id = re.compile(r'.*/status/([0-9]*).*')
    # find the max number of pages, based on 20 per page
    maxpages = int(math.ceil(int(tootcount)/20.0))
    for page in range(1, maxpages+1):
        f = urllib.urlopen(url % (username, page))
        soup = BeautifulSoup(f.read())
        f.close()
        toots = soup.findAll('li', {'class': re.compile(r'.*\bstatus\b.*')})
        if len(toots) == 0:
            break
        for toot in toots:
            # Do we need more toots? If so, keep going
            if grabbedtoots < int(tootcount):
                m = re_status_id.search(toot.find('a', 'entry-date')['href'])
                status_id = m.groups()[0]
                # Look for the messages directly after our header
                if int(status_id) < int(headerid):
                    data = str(toot.find('span', 'entry-content').renderContents())
                    tootlist.append(data)
                    grabbedtoots += 1
            else:
                break
        # one second delay between pages
        time.sleep(1)
    return tootlist
There's nothing too fancy going on here--just some pulling down of the Twitter pages and scraping out the messages. This implementation of walk() is naive and assumes that you only have one tootfile in your stream and that it is located at the front of your stream. I consider this a bug, so it will be fixed eventually in the actual project.
Reassembling the data is fairly trivial once you have the toots in a list. The following snippet works in reverse compared to the encoding, peeling back the layers of compression and encoding before leaving you with a string representing the file data.
import base64, md5, zlib

def decode(tootlist):
    """Decodes the raw data given a list of toots"""
    data = "".join(tootlist)
    compressed_data = base64.b64decode(data)
    rawdata = zlib.decompress(compressed_data)
    md5hash = md5.new(rawdata).hexdigest()
    return rawdata, md5hash
With that, we're done. The decode process can now check the MD5 hash of the assembled data against what's in the header and write out the rawdata to either a file or standard out.
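That last step might look something like the following hypothetical save() helper, which refuses to write anything that fails the integrity check:

import md5

def save(rawdata, header_md5, filename):
    # compare the computed hash against the one from the header toot
    if md5.new(rawdata).hexdigest() != header_md5:
        raise Exception('MD5 mismatch: tootfile is corrupt or incomplete')
    out = open(filename, 'wb')
    out.write(rawdata)
    out.close()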
Grab the source and fork it for your own needs
If you're interested in the full (and slightly more proper) Python implementation of tootfiles, you can grab the source code and documentation over at its Github project. The script can either be used as a standalone command line encoder, or you can use it as a Python library.
I'm releasing the script with an MIT license, which basically means you can do whatever you want with it.
This is mostly a proof of concept release, as the whole idea of storing meaningful amounts of data in 140 byte segments is absolutely silly. Even as I write this, I'm aware of a few bugs that would occur if someone were to try and use it heavily. With that said, I'll continue to iron things out and I will gladly accept any patches as well.
If you're interested in seeing some actual postings from the script, check out tootfiles on Twitter. Have fun, and please don't spam your followers with a tootfile--especially if you're someone that I'm following. ;)