Topsy, Now With Every Tweet From 2006 On, Has Other Social Media Indexes In The Works

Twitter has established itself as top social networking destination, mentioned the same breath as sites like Facebook, LinkedIn, or YouTube, as well as a go-to destination for breaking news. But as a search engine that could be the Google for real-time media, Twitter still fails. For Twitter data partner Topsy, that was an opportunity.

If the web is now being cobbled together by status updates and hashtagged posts as much as it is by PageRanked websites, then a lot is being lost. In Twitter’s case, the company has only been scratching the surface of a history of tweets stretching back to 2006. An archive, to date, containing some 425 million tweets.

Topsy, one of only four Certified Resellers of Twitter’s data, says it has now indexed every tweet ever posted – something Twitter doesn’t do, and couldn’t easily reproduce due the infrastructure and costs involved. (Topsy has raised $35 million in venture capital since 2008 to get to this point, the company says.)

Meanwhile, today’s Twitter is more interested in the “now,” and the “recent,” not the distant past. A visit to search.twitter.com pulls up tweet results that only stretch back a matter of days, not months, and certainly not years. And with every passing season, that time frame compresses even more. Twitter’s index currently only goes back a week, it states. In 2009, it stretched back a week and a half. Before that, it was a month.

Topsy was able to dig into Twitter’s archive all the way back to 2010, with an expansion announced in August. Now, it can go back the full seven years. That makes it the largest and most comprehensive archive of Twitter’s data that has ever existed for free, public access. Outside of Twitter, only data partners like Gnip and The Library of Congress have had access to this data before – but it was not in a format everyday users could access and search. And it definitely wasn’t free.

According to Topsy co-founder and CTO Vipul Ved Prakash, the ability to index every tweet from Twitter’s beginning onwards – now 425 billion items across 3,500 servers – was a big data feat. “The third generation of our indexing technology has increased the density of the number of documents we can index on a server, so that means we can now run a massive index that includes every tweet,” and, he adds, Topsy will eventually be able to scale that to trillions of documents. Topsy’s competitors not building infrastructure-based businesses, Prakash challenges, won’t be able to keep up.

Though companies often like to make bold claims such as this, there is some truth to that statement. Today’s web is changing. Twitter, for example, is now pumping out some 400 million to 600 million new tweets daily, each of which becomes indexed on Topsy within 150 milliseconds. Put another way: the amount of data that Twitter will produce between now and this time next year is more than every tweet it has ever produced to date.

The amount of data being created on Twitter plus Facebook today is more than the data being created on the rest of the web.

And when you also take Facebook into account, you realize the web Google understands is now just a slice. “The amount of data being created on Twitter plus Facebook today is more than the data being created on the rest of the web” Prakash explains. “Social data has become the bigger public corpus.” (There’s your answer to “why does Google+ exist?”)

And if the social web is now the larger web, then it’s not surprising that Topsy’s ambitions expand beyond Twitter. The company’s technology is already capable of indexing every public page on other social media sites, like Facebook, in addition to links users tweet. It also has an archive of all of Google+ public posts.

“We’re building some indexes in the background that will be available in the future,” Prakash hints regarding Topsy’s future plans. However, he declined to talk in detail about what the company had built in terms of a Facebook index, saying it was “unreleased,” noting that “public pages are accessible publicly” and that “if we’re creating value for a social network – if it makes good business sense – then we will have deeper access to data.”

Topsy’s value to social media networks, like Twitter (or potentially others), is about the opportunity of what can be done with the data after collection, like offering detailed stats around the data, similar to what it does now for tweets. Topsy can count how many times a term like “Obama” was ever mentioned on the network, or inform hedge funds how users really feel about the new iPhone. Brands can monitor their social media presence, to figure out how to better target ads or influencers. Journalists can do story research. And so on.

“Processing the exhaust of social networks is a different kind of business that the business the social networks are in,” Prakash says. Twitter is concerned with building a publishing platform, and monetizing engagement around tweets, not necessarily in providing analysis of its archives. Though it has a history of moving into businesses its ecosystem partners once led, Prakash is not concerned that Twitter will do so with it because it complements Twitter, but doesn’t replace it.

Still, Topsy has ended up in a symbiotic relationship with Twitter – it pays an undisclosed fee for API access (the “firehose” of Twitter data), but Twitter also pays Topsy through a separate contractual agreement in order to build special tools like its presidential election index or Oscars index, for example. Meanwhile, Topsy’s analytics customers pay $1,000/month per seat for access to Topsy, and API customers pay based on the amount of data that’s returned.

Maybe one day Twitter will think that’s an interesting business it would like to be in itself? And if Topsy’s infrastructure can’t be easily duplicated, would Twitter like to just acquire Topsy, then? “There’s some possibility of that happening,” Prakash admits, but claims the two companies haven’t discussed this possibility. For now, Topsy is burning capital as it scales, while declining to talk customer numbers, revenues or even percentage growth.

Longer term, a historical archive of social media is valuable may be value to specific businesses and marketers, but whether or not everyday, mainstream users will feel the same is still in question. That could change over time, though. “In the next 10 years, social media could start looking like the internet…it’s a different ecosystem, where you have big silos of data, but a new market is forming here,” says Prakash. “Our ambition is to have every public social post in our index.”

Blogging!

— TechCrunch (@TechCrunch) March 7, 2007

I’ve got a secret dream to one day be a Startup Battlefield judge #tcdisrupt

— Alexia Tsotsis (@alexia) May 26, 2010

Jaiku? What did twitter ever do to you? One addiction is enough for me anyway!

— Sarah Perez (@sarahintampa) April 10, 2007

not using twitter, foo’

— Eric Rosser Eldon (@eldon) August 22, 2007