SpeakerText Automates And Crowdsources Video Transcripts (100 Beta Invites)

One of the big problems with video on the Web is that, other than the title, description and some meta tags, it is mostly invisible to Google and other search engines. One way to make video more SEO-friendly is to add transcriptions, but that can get expensive. An angel-funded startup called SpeakerText is (re)launching today with a very clever way to automate the transcription process and attach the full transcript to the video player in a drop-down window. You can see an example of how this works below. And if you publish a lot of videos and want to try it out yourself, we have 100 beta invites (use the code: techcrunch).

Once a video is transcribed, the transcript appears in a collapsible window below each player. Not only is all the text visible to search engines, which should help drive more search traffic to individual videos, but it is also time-stamped, so you can click on any sentence and the video will jump to that point. Any time somebody cuts and pastes a portion of the transcript into a blog or other site, a link back to that point in the video is also included. The startup previously tried a Flash wrapper for the YouTube player. It has since completely reworked its technology into what it now calls the SpeakerBar, which is more of a transcript plug-in that detects any video on your site that has a matching plug-in. SpeakerText works with video players from YouTube, Brightcove, and Blip.tv, and there is also a WordPress plug-in.

Below is a video explaining how it works, with a SpeakerBar underneath. Click on any sentence to jump to that part of the video.

SpeakerText uses a combination of speech-to-text software, natural language processing, and crowdsourced human labor to create each transcript. Video publishers submit videos they want transcribed. The videos get a rough first pass using open-source speech-to-text software called Sphinx-4, developed at Carnegie Mellon University (where co-founder Matt Swanson studied artificial intelligence; the other founders are CEO Matt Mireles and Tyler Kieft). The videos then get broken up into 5- to 8-second chunks, which are distributed to human transcribers via Mechanical Turk.
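To make the chunking step concrete, here is a minimal sketch of how a recording might be cut into pieces of 5 to 8 seconds. It is an illustration only, not SpeakerText's actual code: the function name `split_into_chunks`, the equal-length strategy and the 8-second cap are all assumptions, and a real system would presumably also try to avoid cutting in the middle of a word.

```python
import math


def split_into_chunks(duration_s, max_chunk_s=8.0):
    """Split a recording into equal-length (start, end) chunks, in seconds.

    Hypothetical helper: uses the fewest chunks that keep each one at or
    below max_chunk_s. For recordings of 10 seconds or more, every chunk
    then also comes out at 5 seconds or longer.
    """
    if duration_s <= 0:
        return []
    count = max(1, math.ceil(duration_s / max_chunk_s))
    length = duration_s / count
    chunks = []
    for i in range(count):
        start = i * length
        # Pin the final chunk to the exact end to avoid float drift.
        end = duration_s if i == count - 1 else (i + 1) * length
        chunks.append((start, end))
    return chunks


if __name__ == "__main__":
    for start, end in split_into_chunks(60.0):
        print(f"{start:5.1f}s to {end:5.1f}s")
```

Under these assumptions a one-minute video becomes eight 7.5-second micro-tasks, each of which could be handed to a different worker.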

The humans correct the text and punctuation in a digital assembly line, going through their micro-tasks quickly and efficiently. Different workers get ranked based on their work history, which helps in the assignment process. The transcribed video chunks are then pulled back together and reassembled into the complete video, with speech recognition software aligning the text to the video and adding time stamps. Natural language processing software is then used to determine where sentences begin and end, and to create meta tags for more SEO goodness.
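The reassembly step can be sketched in a similarly simplified way. The example below assumes each corrected chunk comes back from a worker tagged with the start time of its slice, so the pieces can be sorted and stitched together even if they arrive out of order. The `CorrectedChunk` type and both function names are hypothetical, and using the chunk start as the time stamp is a stand-in for the finer-grained alignment that SpeakerText does with speech recognition software.

```python
from dataclasses import dataclass


@dataclass
class CorrectedChunk:
    """One human-corrected piece of the transcript (hypothetical type)."""
    start_s: float  # where this chunk begins in the full video, in seconds
    text: str       # text after the worker's corrections


def reassemble(chunks):
    """Return (start_s, text) pairs in playback order, skipping empty chunks."""
    ordered = sorted(chunks, key=lambda c: c.start_s)
    return [(c.start_s, c.text.strip()) for c in ordered if c.text.strip()]


def format_timestamp(seconds):
    """Render a number of seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Workers finish in arbitrary order; one chunk was silence.
    returned = [
        CorrectedChunk(7.5, "and attaches it to the player."),
        CorrectedChunk(0.0, "SpeakerText transcribes your video"),
        CorrectedChunk(15.0, "   "),
    ]
    for start, text in reassemble(returned):
        print(format_timestamp(start), text)
```

A transcript stored as time-stamped pieces like this is also what would let a click on a sentence seek the player to the matching point.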

This entire assembly line process is designed with feedback loops to get better and more automated over time. The service starts at $20 a month for the SpeakerBar, plus $2 per minute for the transcriptions. That is competitive with other transcription services, which seem to start in the $3-to-$5-per-minute range, and you also get the SpeakerBar. The lower SpeakerText can get its rates, the broader its appeal will be.
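For a rough sense of how those prices compare, the sketch below works through the arithmetic. It assumes the $20 is a flat monthly fee with no transcription minutes included and that rival services charge a pure per-minute rate; neither detail is confirmed beyond the figures quoted above, and the function names are made up for the example.

```python
MONTHLY_FEE = 20.0  # SpeakerBar subscription, dollars per month (assumed flat)
PER_MINUTE = 2.0    # SpeakerText transcription, dollars per minute of video


def speakertext_monthly_cost(minutes):
    """Total monthly bill for transcribing `minutes` of video."""
    return MONTHLY_FEE + PER_MINUTE * minutes


def per_minute_service_cost(minutes, rate):
    """Bill from a service that charges only a per-minute rate."""
    return rate * minutes


def break_even_minutes(rate):
    """Minutes per month at which SpeakerText matches a per-minute service."""
    if rate <= PER_MINUTE:
        raise ValueError("rate must be above SpeakerText's per-minute price")
    return MONTHLY_FEE / (rate - PER_MINUTE)


if __name__ == "__main__":
    minutes = 30
    print("SpeakerText:", speakertext_monthly_cost(minutes))
    print("At $3/min:  ", per_minute_service_cost(minutes, 3.0))
    print("At $5/min:  ", per_minute_service_cost(minutes, 5.0))
    print("Break-even vs $3/min:", break_even_minutes(3.0), "minutes")
```

On these assumptions, a publisher transcribing 30 minutes a month would pay $80, against $90 to $150 elsewhere, while one transcribing fewer than 20 minutes a month would pay less at a $3-per-minute service.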