Crowdsourced project aims to add text-to-speech to Wikipedia

An open source project hopes to draw on crowdsourced contributions to make Wikipedia more accessible by adding text-to-speech synthesis that will let users of the online encyclopedia have portions of the text read out to them.

The speech synthesis platform is being developed in Europe by the KTH Royal Institute of Technology in Stockholm, Sweden, which Wikipedia approached to develop the feature on account of its expertise in text-to-speech synthesis.

Wikipedia will host the speech synthesis servers and an optimized version of the platform will be developed for it. But the software will also be made freely available as open source — and will be “readily usable” by any site that uses the MediaWiki software, according to KTH.

“We will build an open framework where any open source speech synthesizer can be plugged in. Since it is open source modules, it will also be possible to add or substitute certain modules in the Text-to-Speech system (TTS),” professor Joakim Gustafson, head of KTH’s speech group, tells TechCrunch.

“The TTS will be open source so anybody could of course use that functionality for any use (not only reading wiki (or other) web pages),” he adds.
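As a rough illustration of what "plugging in" a synthesizer could look like, here is a minimal sketch of a registry that dispatches text to whichever engine is registered under a name. Every name in it (`register_engine`, `synthesize`, `dummy_engine`) is hypothetical and not taken from the KTH project, and the stand-in engine returns text bytes rather than audio.

```python
from typing import Callable, Dict

# Hypothetical sketch: none of these names come from the KTH code base.
# An "engine" is modeled as any callable that turns text into audio bytes.
Synthesizer = Callable[[str], bytes]

_ENGINES: Dict[str, Synthesizer] = {}


def register_engine(name: str, engine: Synthesizer) -> None:
    """Plug an engine in under a name, replacing any earlier one."""
    _ENGINES[name] = engine


def synthesize(text: str, engine_name: str) -> bytes:
    """Send text to whichever engine is registered under engine_name."""
    if engine_name not in _ENGINES:
        raise KeyError(f"no engine registered as {engine_name!r}")
    return _ENGINES[engine_name](text)


def dummy_engine(text: str) -> bytes:
    # Stand-in for a real synthesizer: returns UTF-8 bytes, not audio.
    return text.encode("utf-8")


register_engine("dummy", dummy_engine)
```

Because callers only ever name an engine, substituting a module means registering a different callable under the same name; the calling code does not change.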

The group has conducted a pilot study already. And Wikimedia Sweden, which initiated the project, estimates that a quarter of all Wikipedia users — or nearly 125 million people per month — “need or prefer” text in spoken form, whether for literacy or visual impairment reasons.

The crowdsourcing element will let wiki users either report sentences that sound wrong or correct the sentences themselves, although the latter will require some linguistic knowledge, as it involves using a phonetic transcription to correct the dictionary.

Gustafson says the group would like to explore the possibility of letting users record how a word should be pronounced and then having that automatically correct the transcription in future. But that scenario is not how the platform will work in the first instance.

“In the first stage it will have to use phonetic transcription (IPA) to correct the dictionary, but we will explore the possibility for a user to record how it should be pronounced and that automatically correct the transcription,” he notes. “This is probably something we will do in the next project where we will extend the system to allow users to build their own voices. We will then have them read 30 minutes of text and then morph a voice (trained on 10 hours of speech) to sound like them.”
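To make the dictionary-correction idea concrete, here is a minimal sketch of a pronunciation lexicon in which user-submitted IPA corrections take precedence over the base entries. The `Lexicon` class and its methods are hypothetical, and the transcriptions are illustrative only; the project's actual data model is not described in the source.

```python
from typing import Dict, Optional


class Lexicon:
    """Hypothetical pronunciation dictionary with crowdsourced overrides."""

    def __init__(self, base: Dict[str, str]) -> None:
        # Base entries map a lowercased word to an IPA transcription.
        self._base = {word.lower(): ipa for word, ipa in base.items()}
        self._corrections: Dict[str, str] = {}

    def correct(self, word: str, ipa: str) -> None:
        """Record a user-supplied IPA transcription for a word."""
        self._corrections[word.lower()] = ipa

    def lookup(self, word: str) -> Optional[str]:
        """Return the corrected transcription if any, else the base one."""
        key = word.lower()
        return self._corrections.get(key, self._base.get(key))


# Illustrative entry only: an American-style transcription in the base
# lexicon, later overridden with a British-style one.
lexicon = Lexicon({"tomato": "təˈmeɪtoʊ"})
lexicon.correct("Tomato", "təˈmɑːtəʊ")
```

Keeping corrections in a separate layer leaves the original dictionary intact, so a bad correction can be dropped without losing the base entry.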

Crowdsourcing help will be solicited from a mix of Wikipedia users and schools with programs for kids with reading difficulties, according to Gustafson.

The funding for the platform, 2.8M Swedish kronor (~$335,000), is coming from the Swedish Post and Telecom Authority. So a version of the text-to-speech platform will be developed in Swedish first, with a “basic English voice” following, and finally a plan to do a “proof of concept” version in Arabic.

“What we want to prove is that it also works with a language with other character sets and that we need to read out right-to-left,” he notes. “For English there are a lot of open source language resources that we can plug in (dictionaries, grammars etc). For Arabic we would have to develop these resources. There is no funding for building resources in the project, only to integrate those that already exist. There are some small resources that we will use for the proof of concept.”

The project only kicked off at the start of this month so the aim is to have English, Swedish and Arabic speech engines produced “sometime about September 2017”.

After that, a further crowdsourcing effort could extend synthesized speech to the remaining 280 languages in which Wikipedia is available, albeit likely little by little.

How will the platform be optimized for Wikipedia specifically? “Wikicommons Sweden will build a client-server architecture for adding functionality to get buttons where you can hear a sentence being read aloud while you at the same time see the read words highlighted,” notes Gustafson.
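One way the highlighting could work, sketched under the assumption that the server returns a duration for each word alongside the audio (the source does not say how timing is delivered): the client finds which word is being spoken at the current playback time. The function name `word_at` is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional


def word_at(durations: List[float], t: float) -> Optional[int]:
    """Return the index of the word being spoken at time t (in seconds).

    durations[i] is the assumed length of word i in the audio. Returns
    None if t falls before the start or at/after the end of the audio.
    """
    ends = list(accumulate(durations))  # end time of each word
    if t < 0 or not ends or t >= ends[-1]:
        return None
    # The first word whose end time is strictly greater than t is playing.
    return bisect_right(ends, t)
```

A player would call this on each playback tick and move the highlight to the returned word, clearing it when the result is `None`.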