Live Web, Real Time . . . Call It What You Will, It's Gonna Take A While To Get It

Comment

This guest post is written by Mary Hodder, the founder of Dabble. Prior to Dabble, Hodder consulted for a number of startups, did research at Technorati and wrote her master's thesis at Berkeley on live web search using blog data.


Real time search is nothing new. It is a problem we've been working on for at least ten years, and we will likely still be trying to solve it ten years from now. It's a really hard problem we used to call "live web search," a term coined by Allen Searls (Doc's son) that refers to the web that is alive, with time as an element in all factors, including search.

The name change to "real time search" seems like a way to refocus attention on time as an important element of filters. But we are still faced with the same set of problems we've had for at least the past ten years. None of the companies Erick Schonfeld pointed to the other day seem to be doing anything differently from the live web search and discovery companies that came before. The new ones all seem to be fumbling around at the beginning of the problem, and in fact seem to be doing "recent search," not really real time search. While I'm sure they've worked hard on their systems, they are no closer to solving the problem than the older live web search systems were. All the new ones give a reverse-chron view, with most mixing Twitter with something else: blog data, other microblog data, photos, some kind of top list of recent trends. Some add context, like a count of activity over a period of time, how long a trend has lasted, or a histogram (Crowdeye), which both Technorati and Sphere experimented with in the early years. Or they show how many links point to something, or the number of tweets. All seem susceptible to spam and other activities that degrade the user experience, and none really provide the context and quality filters you would want if this were to actually work. All seem to need to learn the lessons we already learned in blog search and topic discovery.

Publicly available publishing systems, starting in 1999, took the value of time and incorporated it into what was being published (think Pyra, which is now Blogger, as well as Movable Type, WordPress and Flickr, among many others), as did search and discovery systems for those published bits: Technorati, Sphere, Rojo, Blogpulse, Feedster, Pubsub and others, to walk down memory lane . . . (For disclosure purposes I should state that I worked for Technorati in 2004 for 10 months, and consulted for or advised almost all the others in one form or another.)

I started working on this problem in 1999, at UC Berkeley, and eventually did my master's thesis on live web data search and topic discovery at SIMS (or the iSchool, as it's now known). From 2000 to 2004, people at SIMS would say to me, "What are you doing with blogs and data? It's just weird. Why does it matter?" But the element of time was the captivating piece missing from regular search. It's the element that makes something news, as well as the element that can group items together over a short period to show a focus of attention and activity that legacy news outlets often missed (until more recently, when they decided live web activity was interesting).


At Burning Man in 2005, under a shade structure during a hot, quiet afternoon, I remember having a four- or five-hour conversation with Barney Pell (who would later found Powerset) about the Live Web and Live Web Search: how to do it, what it meant, how to understand and present time to the user, how much was discovery and how much was search, how structured the available data was and how much you could rely on its timestamps, what meaning you could make from that data, and so on. Sergey Brin was sitting and listening, and finally, after a couple of hours, he asked me, "What is the live web, and what is live web search?" Since Barney and I had already been doing a deep dive, I had assumed Sergey knew what we were talking about, so the question surprised me. But I explained why I thought time was a huge missing element of regular search, and that this was the type of search I worked on. Barney and I continued for a couple more hours. Then it got cooler, so it was time to go admire the art, and that was the end of that. But I have wondered over the years where Google is with the live web and when they might do something with time. Twitter seems to be prodding them.

In 2006, Steven Levy and Brad Stone's Newsweek cover story "The Living Web" poked at this issue for the first time in a national forum.

When I look at the latest crop of search startups, I think: Why are we doing it all the same way again? Reinventing the wheel? Is anyone doing anything original, with either data or interface? Is anyone building on what we've learned before about the backend or UIs?

Frankly, our filters suck . . . and I suppose that if a name change gets us to think anew about better filters, I should rejoice. I'm partly to blame for the bad filters we have to date, because in working on this problem I've contributed to some of the various live web or real time systems (or whatever the word of the moment is) that have tried to solve it. We are very good at publishing our thoughts and visions, with time stamps, but not very good at the filtering side of things. The old method of information search and discovery was to open the paper or magazine, turn the pages of editorially filtered and placed information, and when you were finished, say, "Okay, I'm informed" (whether you really were or not). But the media got complacent and missed stories, and with the ease of blog publishing and sites like Flickr for photos, we could replace paper and supplement our information needs with the whole web. The only problem is, it's the whole freaking web. An avalanche. We feel anxiety on the web from the lack of the filtering and editorial grace that one or two printed news sources used to give us.

I did a study in 2002, which I repeated in 2004 and again last year, in 2008. I asked users to track their online information intake for one week. There were only 30 people in each study, chosen randomly from Craigslist ads, but what I found across each group of 30 was that the average time spent online with news and information sites was 1.25 hours in 2002, 1.85 hours in 2004 and 2.45 hours in 2008. These people are not in Silicon Valley, but they all have broadband at home and live in the US. Every one of them reported some level of anxiety over the amount of data they felt they needed to take in in order to feel informed. They often dealt with it by increasing the time they spent staying informed. They didn't know that better filters might actually reduce their anxiety.

As Erick noted, the tension in solving this problem is between memory and consciousness, or, as Bob Wyman and Salim Ismail called it at Pubsub, retrospective versus prospective search. That tension is part of the issue. But there is more.

Discovery does mean you have to introduce time as an element. The user cannot be expected to know what is bubbling up, or the specific phrases that will name the latest thing.

Some people will say "michael jackson," some will say "MJ" and some will say "king of pop." And Michael Jackson as a topic is actually pretty easy. I remember doing usability tests for a live web search and discovery system in 2003, where we asked users to search Google News and various live web systems for an incident in Australia where a "giant sea creature" was found. But since all the media covering it originated in Australia, and they'd all called it a "massive squid," and all the follow-on American sources, including bloggers, had copied the Aussie language, there were no recent hits for "giant sea creature." Testers had to think creatively about how to get to the info they knew was there, and yet it required a semantic leap. One search tester actually cried as she refused to give up, she was so determined to find the result in any of the live web systems we were testing. We begged her to stop; it was painful. Good discovery could have helped.

Another key element of discovery and live web search is getting structured data, because spidering, which Google uses to gather data for its regular retrospective web search, makes it harder to associate a time with a published work. It's hard to work with time if the only thing you know for sure is when you spidered the page. Twitter, on the other hand, has structured data: everything is published in its silo, so the sites it provides its complete stream to get it in a structured format. Twitter knows the time of each tweet, and the data is available through APIs. This is the most efficient way to draw meaning out for search, because you know the context of each piece of data for certain, with time as one of the pivots for search and discovery.
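The difference between a declared timestamp and a crawl timestamp can be sketched in a few lines. This is a hypothetical example: the `created_at` field name mirrors Twitter's API, but the item shapes and helper are invented here:

```python
# Sketch: structured feed items carry their own publication time;
# spidered pages only give you the time *you* saw them.
from datetime import datetime, timezone

def published_at(item: dict, crawl_time: datetime):
    """Return (timestamp, is_exact). Exact only when the item declares it."""
    if "created_at" in item:  # structured feed: the publisher tells us
        ts = datetime.fromisoformat(item["created_at"])
        return ts, True
    # spidered page: the best we know is when we crawled it
    return crawl_time, False

api_item = {"text": "hello", "created_at": "2009-07-04T12:30:00+00:00"}
page_item = {"html": "<p>hello</p>"}
now = datetime(2009, 7, 5, tzinfo=timezone.utc)

print(published_at(api_item, now))   # exact publication time, flagged exact
print(published_at(page_item, now))  # only the crawl time, flagged inexact
```

Everything downstream, such as trend windows and histograms, is only as trustworthy as that flag.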

You also need to get the data model right for the backend search database in order to extract meaning and link metrics. You need to understand the different corpora of data to know what things mean to users (not engineers), and figure out the spam and bad-actor problems. There is the original context the data had, and there is the UI, which is so difficult when trying to make time understandable to many users. In fact, some think that communicating the time element to regular users is so hard that time-focused search is really an "advanced search" problem.

If designed poorly, the system can push users into producing unnaturally skewed data. If the system involves some sort of filter for authority or popularity, it is subject to power law effects. (Technorati calls its metric "authority," but inbound link counts from blogs are not authority; they're just a measure of popularity.) What's a power law effect? It's when a system drives activity that unnaturally reinforces the behavior that put something at the top in the first place. For example, if one of a filter's metrics counts the number of people clicking on a top search, then the more clicks, the longer the item stays at the top of the list of searches, even if it would naturally have fallen off earlier. Conversely, if a filter's metric involves a spontaneous act driven by imagination, like writing a tweet, then exposing those items at the top of the filter might be less likely to drive up activity. However, if you show the results to users, then upon seeing a popular topic they might begin tweeting about it without having thought of it before. In other words, by revealing the metrics you focus on, you can push users to change their behavior. By driving behavior, power-law distributions keep things with some power at the top because they are at the top, or drive them higher still. It becomes a loop. And because no distinction is made between the quality or strength of a unit, or what that unit might mean to a group of users in a topic area, straight number counts just aren't very smart.
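The click-driven loop described above is easy to simulate. In this toy sketch, with entirely invented numbers, a topic that starts barely ahead on a "top searches" list captures most subsequent clicks simply by being listed, and runs away with the ranking:

```python
# Toy simulation of a rich-get-richer feedback loop in a click-ranked list.
import random

random.seed(1)
clicks = {"topic_a": 10, "topic_b": 9}  # topic_a starts barely ahead

for _ in range(1000):
    # Users click the currently listed top item with high probability,
    # regardless of intrinsic interest: exposure itself drives the metric.
    top = max(clicks, key=clicks.get)
    other = "topic_b" if top == "topic_a" else "topic_a"
    clicks[top if random.random() < 0.9 else other] += 1

print(clicks)  # the early leader has run away with it
```

A one-click head start becomes a roughly nine-to-one gap, which is the loop: the item stays on top because it is on top.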

For example, if we made a system that counted Om Malik's inbound links and called it authority, no matter the topic, I think Om would agree that even he wouldn't have great authority and insight on the subjects of, say, modern dance or metalworking, should he happen to mention those words in a blog post. But on broadband issues, he is most definitely an authority. Technorati, OneRiot, and other services that take one metric count and apply it to all topics, all circumstances, all search result matches, without context, randomize the quality of the information the user sees. They may provide a filter across the whole web, but they don't give us any real help in judging what is useful. That's why topic communities are helpful, and once you find a good editorial filter, driven by the human touch, you glom onto it for dear life, because it's such a time and energy saver.
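Topic-conditioned authority can be sketched by bucketing inbound links by the topic of the linking post rather than summing one global count. The link counts and author names below are entirely made up for illustration:

```python
# Sketch: topic-scoped authority vs. one context-free global count.

# Inbound links to each author, bucketed by the topic of the linking post.
inbound_links = {
    "om_malik": {"broadband": 480, "modern dance": 2},
    "dance_critic": {"broadband": 1, "modern dance": 350},
}

def global_authority(author: str) -> int:
    """The naive metric: one number, applied to every query."""
    return sum(inbound_links[author].values())

def topic_authority(author: str, topic: str) -> int:
    """Authority conditioned on the topic being searched."""
    return inbound_links[author].get(topic, 0)

# Globally, om_malik "outranks" the dance critic on every query...
print(global_authority("om_malik"), global_authority("dance_critic"))
# ...but on a modern-dance query the topic-scoped metric flips the order.
print(topic_authority("om_malik", "modern dance"))      # -> 2
print(topic_authority("dance_critic", "modern dance"))  # -> 350
```

The global number ranks the broadband blogger first even on dance queries; the topic-scoped number restores the context the paragraph above argues for.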

I’m under no illusions that we’re remotely close to solving Live Web or Real Time search or even recent search. We are not. Nor are we near solving discovery. But I hope we will. Sooner rather than later. Because I need it now. The opportunity is huge. It means really building algorithmically the editorial filters we have today in the form of people, while balancing the mobs’ activities. Solve that and the prize will be big.
