Introduction
“the future of BI is Conversational” — this is what Gartner and other analysts have been telling us for the last few years.
Let’s focus on structured data, relational data to be precise. This forms the underlying storage format for most of the Business Intelligence (BI) world, irrespective of whether you are querying the database interactively or building a report in Tableau, Power BI, Qlik Sense, etc. The predominant language to interact with such storage platforms is SQL. We have already seen some products in this space, e.g., Power BI Q&A, Salesforce Photon.
We are talking about translating a Natural Language Query (NLQ) to SQL in this article, also known as a Natural Language Interface to Databases (NLIDB).
For example, let us consider a Country table with Language and Population details — illustrative schema below:
Country table: Country ID | Name | Language | Population Count
NLQ1: Which country has the maximum population count?
SQL1: Select Name, max([Population Count]) from Country;
At the core of most Natural Language Q&A systems [1], is a Natural Language Understanding Unit (NLU) module that is trying to understand the NLQ’s intent by extracting and classifying the ‘utterances’. In simple words, one can think of utterances as the key phrases in the sentence, e.g., country, maximum, population, count.
The next step is to generate the corresponding SQL query based on this information. So we need a transformation / mapping logic to map ‘country’ to the ‘Country’ table (the table to be queried), ‘maximum’ to the MAX SQL function, ‘population count’ to the column ‘Population Count’. And, this is where things start to get challenging.
Mapping NLQ utterances to the right SQL operators, esp., in determining if an utterance corresponds to a Table, Column, Primary / Foreign Key, SQL operator, in the first place —is non-trivial.
For example, without any inherent knowledge of the database schema, it is very difficult for the mapping logic to determine that the ‘count’ in this case refers to the column ‘population count’ , and not the SQL function COUNT. The problem gets amplified for complex queries, e.g.,
NLQ2: Which language is spoken by maximum number of countries?
whose SQL translation would involve both the SQL functions: MAX & COUNT. Other examples of complex queries include scenarios where we need to JOIN multiple tables.
NLQ — SQL translation Deep Dive
Keep reading with a 7-day free trial
Subscribe to Debmalya’s Substack to keep reading this post and get 7 days of free access to the full post archives.