Don’t Read The Comments — Let Diffbot Analyze Them Instead

Diffbot’s mission, according to CEO Mike Tung, involves “teaching a robot how to read and understand web pages.” Today it expanded that understanding to include forums, comments, reviews, and other online discussions.

When Tung talks about understanding web pages, he means turning the content into structured data — say, looking at an article and identifying the title, author, text, images, topics and so on. That information, in turn, can help businesses find and track the content that’s relevant to them. (Diffbot customers include Microsoft/Bing, Cisco and eBay.)

Until today, Diffbot could analyze an article or a product page, but not the comments under the article or the reviews under the product description.

Tung said there are a couple of specific challenges when it comes to analyzing these kinds of discussions. For one thing, comments are often presented in a JavaScript widget, so it’s not as straightforward as pulling the text — it requires “a bunch of visual analysis,” he said. For another, discussions often use more casual, colloquial, and emoji-heavy English, so Diffbot needed to develop “a more specialized language model.”

You can try it yourself on Diffbot’s test-drive page, which shows Diffbot’s analysis for any page. I looked at the results for a post I wrote last week that got more comments than usual, and I could see the basic attributes of each comment: author, time, text, language and author link.
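To make "structured data" concrete, here is a minimal Python sketch of how a developer might model one extracted comment using the attributes listed above. The class, the field names and the sample record are assumptions made up for illustration; they are not Diffbot's actual response schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    """One extracted comment. Field names are illustrative, not Diffbot's schema."""
    author: str
    time: str
    text: str
    language: str
    author_link: Optional[str] = None


def comment_from_record(record: dict) -> Comment:
    """Build a Comment from a plain dict, tolerating missing fields."""
    return Comment(
        author=record.get("author", ""),
        time=record.get("time", ""),
        text=record.get("text", ""),
        language=record.get("language", ""),
        author_link=record.get("author_link"),
    )


# Made-up sample record, purely for demonstration.
sample = {
    "author": "example_reader",
    "time": "2015-01-01T12:00:00Z",
    "text": "Great post!",
    "language": "en",
}
comment = comment_from_record(sample)
print(comment.author, comment.language, comment.author_link)
```

Once comments are in a uniform shape like this, they can be filtered, counted or stored the same way regardless of which commenting system they came from.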

This gets more interesting in aggregate, when you can start finding larger trends in the conversation. Tung noted that while there are a lot of social media monitoring tools, it’s harder to track conversations across the web, where you’ll find “detailed, well-thought-out discussions.” For example, he said a shoe company could find out which shoes customers describe as most comfortable in their online conversations.
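As a rough illustration of the shoe example, the Python sketch below counts how many comments mention each product alongside the word "comfortable". It uses naive keyword matching, which is an assumption chosen for brevity and not Diffbot's method; the function name, product names and comments are all invented.

```python
from collections import Counter


def count_comfort_mentions(comments, products, keyword="comfortable"):
    """Count comments that mention a product together with the keyword.

    Naive substring matching: it would also match "uncomfortable",
    so a real system would need proper language analysis.
    """
    counts = Counter()
    for text in comments:
        lowered = text.lower()
        if keyword not in lowered:
            continue
        for product in products:
            if product.lower() in lowered:
                counts[product] += 1
    return counts


# Invented comments and product names, purely for demonstration.
comments = [
    "The Trail Runner is the most comfortable shoe I own.",
    "City Loafer looks great but pinches my toes.",
    "So comfortable! The Trail Runner fits perfectly.",
    "I find the City Loafer comfortable after a week of wear.",
]
products = ["Trail Runner", "City Loafer"]
ranking = count_comfort_mentions(comments, products).most_common()
print(ranking)  # [('Trail Runner', 2), ('City Loafer', 1)]
```

The point is only that once discussions are available as structured text, this kind of cross-site tally becomes a few lines of code rather than a scraping project.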

Diffbot says its new Discussions API supports Facebook Comments, Disqus, Livefyre, WordPress, Blogger, Automattic’s Intense Debate, Kinja, Hacker News, Reddit and more.