Diffbot Aims To Build The Intel Of Data For Artificial Intelligence

With a new $10 million commitment led by Tencent, one of China’s largest Internet companies, Diffbot chief executive Mike Tung has come a long way from his days of eating beans and rice in the dark and solving the math problems that would form the core of his groundbreaking artificial intelligence software.

Diffbot, which raised its first seed money in 2012, has set itself the lofty goal of being the “Intel of data” for independent artificial intelligence application developers.

Companies like Google, Facebook, and Baidu, all of which are working on artificial intelligence, have massive amounts of data at their fingertips. They and their data-entry employees can use that data to categorize and define the web in a form that AI software can later feed into its algorithms.

Small companies who don’t have the benefit of that data can turn to Diffbot.

“We’ve been working on this technology for quite a few years. It was really last year that 90% to 95% accuracy was reached. And hitting profitability last year as one of the first AI startups to do so was a turning point,” says Tung.

The major expenses for Diffbot had been electricity and bandwidth, Tung says. Unlike other artificial intelligence deep learning projects that rely on humans to classify web pages, Diffbot uses only the proprietary algorithms that it created itself and has refined over the years, according to Tung.

“We want to build the world’s largest database of structured knowledge,” he says.

If artificial intelligence is to achieve the promise (and potential peril) inherent in the technology, it still needs to be taught.

Tung compares it to teaching a child. “The technology is scouring the web and is trying to simulate what a human being is doing when they’re on the page,” he says.

Research into artificial intelligence, and into the ability to develop sentience in machines, sits at the intersection of a few very large trends in computing. It combines the development of new, and newly powerful, chipsets that can process complex data increasingly quickly; the development of new kinds of database software that can organize massive amounts of data more flexibly; and the development of nearly ubiquitous arrays of sensors and systems to collect that data.

The problem is that the data these would-be intelligences learn from and process needs to be structured in a way that the systems can recognize, and that is exactly what Diffbot does.

“We’re taking the Internet and converting it into semantic knowledge,” says Tung. The company’s secret weapon is its own AI software, which drives down the cost of developing the trillions of facts that make up the taxonomy Diffbot is creating.

“Google has this knowledge graph using human curation and it’s the same with Watson. There’s a lot of human beings behind the scenes creating the rules the way the algorithm works,” says Tung. And humans cost money that Diffbot simply doesn’t need to spend.

Tung calls it the Manhattan Project for AI, except that computers are the researchers developing the bomb.

The Path Seldom Taken

Diffbot was always going to make money. Tung never wanted profitability to be an open question, nor did he want to depend on fundraising, the founder and chief executive said.

To make money to support the development of the software, Tung pinched pennies and took on a second job after dropping out of Stanford’s graduate school, learning patent law and filing patents in the wee small hours of the morning to make rent money.

“For each patent I was able to get 20K,” says Tung. “I would be good to get rent for a few months.”

He lived on a diet of beans and rice and ramen, alternating working on the math at the core of the software with filing patent applications for money.

Once the initial product was baked, Diffbot had the singular honor of being the first company accelerated in the program that would become Stanford’s premier source for getting graduates to exit velocity with their businesses: StartX (where Tung is still a mentor).

With the initial seed money from StartX, Diffbot was able to continue its research and launch its first revenue-generating products.

“From day one we made it an on-demand service,” Tung recalls. “You pass us a URL and we will process that. For every hit to our server we earn .008 cents…”

In retrospect, it is the decision Tung is happiest about. “Our on-demand customers were paying us to structure the web,” he says.
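
For readers curious what "pass us a URL" looks like in practice, here is a minimal sketch of building one on-demand request. The endpoint and the `token` and `url` parameter names are assumptions modeled on Diffbot's public v3 Article API, not details from Tung; the token and page address are placeholders, and the code only constructs the request URL rather than calling any server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint, modeled on Diffbot's public v3 Article API (not from the article).
API_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_request_url(page_url: str, token: str) -> str:
    """Return the GET URL for one on-demand extraction call (one billable hit)."""
    if not urlsplit(page_url).scheme:
        raise ValueError("page_url must be absolute, e.g. https://example.com/")
    # urlencode percent-escapes the page URL so it survives as a single parameter.
    query = urlencode({"token": token, "url": page_url})
    return f"{API_ENDPOINT}?{query}"


# Placeholder values for illustration only.
request_url = build_request_url("https://example.com/story?id=1", "YOUR_TOKEN")

# Round trip: the server side would decode the same page URL back out of the query.
recovered_page_url = parse_qs(urlsplit(request_url).query)["url"][0]
print(request_url)
print(recovered_page_url)
```

Each such request corresponds to one "hit" in Tung's description, which is what made the service billable per page from day one.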

Many of those on-demand customers are still on board. AOL (the parent company and owner of TechCrunch), Yandex, eBay, Microsoft’s Bing search service, Cisco and Adobe all pay Diffbot for its taxonomical services — and Diffbot got to increase the scope of its learning.

A Thin, Premeditated Rig

While Diffbot couldn’t spider the web from day one, by 2015 its situation had changed. The company was profitable and confident in its ability to raise money, and its AI software was identifying data on the web with 90% to 95% reliability. It was time.

So the company started spidering the web to speed up its data collection. The goal, ultimately, is to get to trillions of discrete data points to provide a structured taxonomy for the entire internet (it’s a small goal).

Since the company began its spidering project last year, its taxonomy has grown to more than 1.2 billion objects and is adding 10 million objects per day.

By comparison, Google’s Knowledge Graph only recently passed 1 billion objects, the company notes.

Show Me The Money

Lofty goals attract big investors, and Diffbot has attracted some of the biggest.

For its seed round the company attracted a who’s who of Silicon Valley’s biggest names, including EarthLink founder Sky Dayton; Andy Bechtolsheim, co-founder of Sun Microsystems; Joi Ito, Director of MIT Media Lab; Brad Garlinghouse, CEO of YouSendIt (and formerly of TechCrunch parent company AOL); Maynard Webb, Chairman of the Board at LiveOps, formerly eBay COO; Elad Gil, VP of Corporate Strategy at Twitter; Jonathan Heiliger, former VP of Technical Operations at Facebook; Redbeacon co-founder Aaron Lee; and founder of VitalSigns Montgomery Kersten.

The latest round brought in Tencent, one of China’s largest Internet companies in one of the world’s largest markets, as a strategic investor, along with Felicis Ventures, which is building a sizable portfolio of artificial intelligence companies.

A coterie of new angels and other institutions joined as well, all of them also bold-faced names in the Valley. Among the superstar new names are Andy Bechtolsheim, the co-founder of Sun Microsystems and the first investor in Google; Amplify Ventures; Valor Capital; and Bill Lee, an early investor in SpaceX and Tesla.

Featured Image: agsandrew/Shutterstock