Vast – Aggregating Listings From The Whole Web

Vast is a search service that crawls the entire web and structures the data that it finds so that it can be categorized fully and indexed. The team launched a developer preview of the service tonight with three initial verticals available – Cars, Jobs and Profiles. Vast does an extremely impressive job of finding and categorizing data in the long tail – making it very simple for others to find listings from anywhere on the web.

The first thing that amazed me was how big Vast was. They crawl over 3 billion web pages and their intelligent algorithms have indexed 4.4 million cars for sale, 4.7 million job listings and 8.6 million user profiles – and this is just the beginning. This already makes Vast the largest car sale database on the web, the 2nd or 3rd largest job listing site, and in the top 5 for number of personal listings (profiles from social networks, blogs etc.).

The key to Vast is that they have done an excellent job of aggregating the long tail, as well as the top sites, and they want developers to now ‘steal this site’ (as it says on their homepage). Vast is making all this information available through an API which developers and site owners can use to integrate with other services or to build their own services with. The licensing terms for the API are as liberal as they can get – they just ask that you don’t do anything illegal and that you attribute Vast as the source of your data. Developers can now build their own implementation of the world largest auto search site in a matter of days and do what they like with it (place ads on it, etc.). In fact, the Vast.com site itself is an implementation of this API that took only a couple of weeks to write.

There are a few secrets to Vast that makes them very good at aggregating listings and organizing them. The first is that they are very good at crawling the whole web, not just the pages that every crawler can see but also beyond authentication screens and into what is known as the ‘deep web’. Vast is very good at extracting information out of all types of web pages, those with complex markup, Javascript, Flash , different document types etc. so the index is very comprehensive.

The second is the artificial intelligence they have developed that recognizes a listing and pulls the vital metadata out of what is usually a messy web page and indexes that. For cars this is the make, model, year, price, location information (amongst many other fields). Vast isn’t built with particular verticals in mind, it learns what it needs to find and what a job listing looks like, what an auto listing looks like, so it will only take the Vast team a month or so to setup a new vertical and have the crawlers find all the information required for it.

What they have also developed well is spam filtering methods, Vast CEO Naval Ravikant told me that 70% of the listings that they find are spam or duplicates. The results that you get out of Vast are generally very clean.

vast screen 1

The Vast business model is that later on at some point they will have paid listings within the search results, but only in results that do not have adequate results – so they will be adding sponsored listings in to ‘fill up’ the long tail of their own results. To the integrators, they just see these listings as any others. All listings link back to the original site so there isn’t as much of a threat of site owners wanting to block the Vast crawler as it is beneficial to them to have their listings indexed and syndicated out.

Vast was founded in April of 2005, but the project has been in development since 2000 and has taken multiple forms but has always been based around search. Vast is 21 people in total spread around the world with a development hub in Belgrade, Serbia and an office in San Francisco, California. They were funded by Leapfrog Ventures and Clearstone Ventures last April. Vast have established a very unique culture at which I really admire – the company is very engineering focused with every employee having a programming background and contributing code to this project. The team in San Francisco is Naval along with a few others such as a COO and VP of Products (who both also code). Their strong engineering focus comes through in their product which even at such an early stage is very well implemented, clean and stable with a huge (vast!) index.

Naval said that before he joined the team, he was looking for somebody who was crawling the web and dragging information out of thousands of sites, and he was surprised that not many people were taking that approach. Vast’s competition at the moment in the jobs space would be SimplyHired and Indeed, both of who index other sites – but the difference here being that those sites are not indexing the long tail, and a lot of human intervention is used to add source sites and to filter them. Vast’s job listings have identified 25,878 source sites – and it is finding more and more of them every day.

Google is in a position to attempt to recognize data in the way Vast does, they are already crawling the whole web – but the Google Base approach has been to setup a marketplace where others list their goods on its site. With the technology Vast have in finding listings in the long tail as well as the mainstream sites, it means they are not dependant on any one or few sources of data, and they will always be a superset of other sites out there attempting to aggregate listings just because of how well they crawl into the long tail.

Vast has it all for success, a smart and experienced team, some fantastic and innovative technology and a market strategy with its open API that will see it being used everywhere and widely adopted.