Archiving Twitter

The Library of Congress plans to begin archiving all Twitter posts. Impressive! Except, not so much:

When do you start?
The agreement has been signed, but we still have a lot of technical details to work out: how we’ll technically transfer it, and when. There’s a built-in six-month window, so we don’t have the live Twitter archive at any given time. There is a window for people if they want to delete their tweets, things like that.

There’s a built-in lag?
Yes, so once the transfer is complete, if a researcher comes here, we’ll let them know that it’s 2006 till six months prior. And there’ll be a rolling period of transfers after that.

How much will it cost?
Well, it’s a gift; we didn’t pay for it. But it will be the cost of storing what is, right now, around 5 terabytes, and the staff effort of maybe one full-time person over the years.

Five terabytes of storage? Seriously? That’ll set you back about a thousand bucks. Make it a fancy RAID array and maybe it’s a couple thousand. They needed a gift for this?
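
For the curious, here is that back-of-the-envelope math as a quick sketch. It uses only the rough figures above (5 terabytes, about a thousand bucks, a couple thousand for RAID); the dollar amounts are my guesses rather than quoted prices, and the function and variable names are made up for illustration.

```python
# Back-of-envelope check using only the rough figures from the post.
ARCHIVE_TB = 5           # "around 5 terabytes", per the Library of Congress
PLAIN_COST_USD = 1_000   # rough guess for plain disks (an assumption, not a quote)
RAID_COST_USD = 2_000    # rough guess for a RAID array ("a couple thousand")


def implied_cost_per_tb(total_usd: float, terabytes: float) -> float:
    """Return the per-terabyte price implied by a total cost estimate."""
    return total_usd / terabytes


plain_per_tb = implied_cost_per_tb(PLAIN_COST_USD, ARCHIVE_TB)
raid_per_tb = implied_cost_per_tb(RAID_COST_USD, ARCHIVE_TB)

print(f"plain: ${plain_per_tb:.0f}/TB, RAID: ${raid_per_tb:.0f}/TB")
```

In other words, the estimate assumes something like $200 per terabyte, or $400 with redundancy. Hardware, in any case, is not the expensive part of this project.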

And I learned something else new: namely that (a) Twitter’s archives are remarkably small, and (b) they exist. I always figured there was no good way to search Twitter because they only kept tweets for a certain length of time. But no. They’ve got ’em all, and the database is so small that it could be indexed in a few hours. So why is searching Twitter so hard? And will researchers really have to “come here” to search the archives? That was left unclear at the end of the interview, but it sounds like this is Twitter’s call. Putting it online sure sounds like a better idea to me.