SQL is about as easy as it gets in the world of programming, and yet its learning curve is still steep enough to prevent many people from interacting with relational databases. Salesforce’s AI research team took it upon itself to explore how machine learning might be able to open doors for those without knowledge of SQL.
Their recent paper, "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning," builds on the sequence-to-sequence models typically employed in machine translation. A reinforcement learning twist allowed the team to obtain promising results translating natural language questions about a database into SQL.
In practice this means that you could simply ask who the winningest team in college football is and an appropriate database could be automatically queried to tell you that it is in fact the University of Michigan.
“We don’t actually have just one way of writing a query the correct way,” Victor Zhong, one of the Salesforce researchers who worked on the project, explained to me in an interview. “If I give a natural language question, there might be two or three ways to write the query. We use reinforcement learning to encourage use of queries that obtain same result.”
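To make the idea concrete, here is a minimal sketch of scoring a generated query by what it returns rather than by how it is written. It is an illustration of the general approach, not the paper's code: the `teams` table, the function names and the specific reward values are all assumptions chosen for the example.

```python
import sqlite3


def run_query(conn, sql):
    """Return the sorted result rows, or None if the query fails to execute."""
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_reward(conn, predicted_sql, reference_sql):
    """Score a predicted query by comparing its result to the reference result.

    The numeric values are illustrative assumptions for this sketch.
    """
    predicted = run_query(conn, predicted_sql)
    if predicted is None:
        return -2.0  # the query could not be executed at all
    expected = run_query(conn, reference_sql)
    return 1.0 if predicted == expected else -1.0


# Hypothetical toy table used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (name TEXT, wins INTEGER)")
conn.executemany("INSERT INTO teams VALUES (?, ?)", [("A", 10), ("B", 7), ("C", 3)])

reference = "SELECT name FROM teams WHERE wins > 5"
same_result = "SELECT name FROM teams WHERE NOT wins <= 5"  # written differently
wrong_result = "SELECT name FROM teams WHERE wins > 8"
invalid = "SELECT name FROM WHERE wins"

rewards = [
    execution_reward(conn, query, reference)
    for query in (same_result, wrong_result, invalid)
]
print(rewards)
```

The second query is worded differently from the reference but returns the same rows, so it earns the same reward as an exact match would. That is the behavior Zhong describes: the model is not penalized for choosing one correct phrasing over another.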
You can imagine how machine translation problems can quickly become massively complex with large vocabularies. The more you can limit the number of possible choices for each word in the output, the simpler your problem becomes. To this end, Salesforce opted to limit its vocabulary to the words used in database labels, the words in the question being asked and the words typically used in SQL queries.
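A rough sketch of how such a restricted vocabulary could be assembled for a single question follows. The tokenizer, the function name and the short list of SQL tokens are assumptions made for illustration; the paper's actual token set and preprocessing may differ.

```python
import re

# Assumed, deliberately small set of SQL tokens for this sketch.
SQL_TOKENS = ["SELECT", "FROM", "WHERE", "AND", "COUNT", "MAX", "MIN",
              "SUM", "AVG", "=", ">", "<"]


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_output_vocabulary(column_names, question, sql_tokens=SQL_TOKENS):
    """Collect the only tokens the model may emit for this question."""
    candidates = []
    for name in column_names:
        candidates.extend(tokenize(name))   # words from database labels
    candidates.extend(tokenize(question))   # words from the question itself
    candidates.extend(sql_tokens)           # words used in SQL queries
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))


vocab = build_output_vocabulary(["Team", "Wins"], "Which team has the most wins?")
print(len(vocab), vocab)
```

Instead of choosing among tens of thousands of English words at every step, a model working with this list chooses among a couple of dozen tokens, all of which could plausibly appear in the final query.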
The idea of democratizing SQL isn’t new. Startups like ClearGraph, which was recently acquired by Tableau, have made it their business to open up data with English rather than SQL.
“Some models perform execution on a database itself,” added Zhong. “But there’s potential privacy concerns if you’re asking a question about Social Security numbers.”
Outside of the paper itself, Salesforce’s biggest contribution here comes in the form of WikiSQL, the data set it constructed to aid in building its model. First, HTML tables were collected from Wikipedia. These tables became the basis for randomly generated SQL queries. The queries were used to form questions that were then passed off to humans for paraphrasing over Amazon Mechanical Turk. Each paraphrase was verified twice with additional human guidance. The result is the largest data set of its kind in existence.
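The middle steps of that pipeline, generating a random query from a table and turning it into a crude question for humans to paraphrase, might look something like the sketch below. The table, the question template and the function name are hypothetical; this is not the actual WikiSQL tooling.

```python
import random


def random_query(table_name, columns, rows, rng):
    """Generate one random SELECT query and a stilted template question."""
    select_col = rng.choice(columns)
    cond_col = rng.choice(columns)
    # Take the condition value from a real row so the query matches something.
    cond_val = rng.choice(rows)[columns.index(cond_col)]
    sql = f'SELECT "{select_col}" FROM "{table_name}" WHERE "{cond_col}" = ?'
    question = f"What is the {select_col} when the {cond_col} is {cond_val}?"
    return sql, (cond_val,), question


# Hypothetical stand-in for a table scraped from a web page.
columns = ["team", "wins"]
rows = [("A", 10), ("B", 7), ("C", 3)]

rng = random.Random(0)  # seeded so repeated runs produce the same sample
sql, params, question = random_query("teams", columns, rows, rng)
print(sql, params)
print(question)
```

The template question is intentionally awkward. In the process described above, that is where human workers come in, rewriting it into something a person would actually ask.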