SQL is a terrible API

A long time ago…

In 1974 the first relational database was created and along with it a new user interface language was developed to make it easy to query the data. This language (or rather sub-language) was called Structured English Query Language or SEQUEL. SEQUEL would later evolved into SQL and become the preeminent language for querying relational databases. In the 45+ years since it’s introduction SQL has evolved slightly but it is still largely similar to the version introduced in the 70s. What has changed during that time is how we interact with the database. Today, SQL is used significantly more by applications querying the database than by end users. This usage pattern has created its own ecosystem of utilities and libraries which help applications build SQL queries. One of the more popular libraries for building SQL queries in Java is Hibernate’s Criteria Builder which uses an annotation pre-processor and a builder pattern to facilitate building SQL queries. (Java’s Persistence API was based heavily on Hibernate’s Criteria Builder). But, at the end of the day the builder still generates a basic SQL string which is sent to the target database, parsed and then executed. Hibernate is just one example, but there are hundreds of other libraries that exist across various languages which all seek to provide the same basic functionality. To provide an API which abstracts away the actual building of an SQL query.

So why do these all tools exist? Is an SQL query so complicated to construct that we need libraries and utilities to help us? In general, no, most SQL queries are relatively straightforward to construct. The problem is that SQL was designed for humans and not computers. The inclusion of the word ‘English’ in the original name was not by chance, SQL was intended to be similar enough to English that it would be self descriptive and would only require minimal transformation from the spoken question to the actual query. SQL was not written to make querying easy from other applications. In order to query a database from an application, the application needs to build the SQL query programmatically at run time, which means that all errors in the SQL query string are also going to be discovered at run time.

An Impedance Mismatch

Compiled languages offer an enormous advantage over interpreted languages in that you can be confident that, if the application compiles, then it does not contain any syntax errors. This assurance removes an entire class of errors that exist in pure scripting languages. Yet, by having an application generate SQL you are re-introducing the possibility of a run time syntax error. Furthermore, the errors which do get introduced are almost always related to the construction of the SQL query rather than actual column names or keywords in the query. Meaning the column names and keywords are not as likely to be the source of syntax errors as they don’t normally change based on the user request. For example, an application which allows the searching of available flights is less likely to return different a type of data depending on the destination city. You may get back more or less result, but it will usually be the same basic set of information. Syntax errors around the column names and keywords are often found through basic testing and fixed.

The more common (and harder to find) syntax errors are introduced in the construction of the query itself.

  • Did you remember a space after the ‘SELECT’ and ‘FROM’ keywords?
  • Did you join your select fields with a comma, but remember to not include a comma after the last one?
  • If this is the first filter then we need to add the ‘WHERE’ keyword, but if this is the second one we need to add the ‘AND’ keyword, and don’t forget about the parentheses.

These types of syntax errors are often introduced based on the users search criteria and can be significantly more difficult to find with basic testing due to the sheer number of permutations. This is the real advantage of using a helper library like Hibernate Criteria Builder. They provide an assurance that if you use their libraries, the generated SQL will be free of construction syntax errors. This assurance is most often achieved through massive test coverage and a responsive development team which quickly patches any errors that are found. It would be safe to say that in the 45+ years since the introduction of SQL, millions of lines of code have been written in various languages, all attempting to work around this same basic problem. It’s a huge waste of intellectual effort for what is essentially a self inflicted problem.

Where to go from here

SQL and similar English based DSLs (domain specific languages) are immensely powerful for building complex queries quickly. As an end user, nothing is more frustrating than trying to build a complex search criteria through a form based UI. First enter the subject, then select the predicate, then enter the object, then click the plus sign to add another filter, rinse and repeat. This becomes incredibly tedious very quickly and makes building complex groupings all but impossible.

This is where an English based DSL really shines. It allows a seasoned user to rapidly create complex queries as they are doing it interactively and have to do minimum mapping from what they are trying to search for to the actual query they are running (just as SQL was initially designed for). But that is where the DSL should remain, as a tool for the end user. Do not just create an API which takes the DSL directly as, at best, you will be recreating the same impedance mismatch and likely forcing your users to create the same set of “builder” libraries.

Instead, define a flexible structure which can be used to describe a query. JSON, XML or even one of the compressed protocols like Thrift or Flatbuffers could work. Users can then create their query programmatically and completely remove the risk of introducing a construction based error because they forgot a space or included a comma at the wrong location. Your API can either accept this structure directly for querying, or you can split it up and have different endpoints accept different categories of queries based on the structure of the return type. Either way, creating a well defined structure programmatically is straight forward and can easily be supported from a multitude of languages.

As for your UI, you are going to have to create a parser for your DSL anyhow which is going to build an intermediary representation of your query. If you merge your intermediary representation with your defined query structure it should be trivial to show the user what they would need to send to the API in order to reproduce their query in the API.

In this scenario your query language can continue to evolve separate from your query structure. You can even support other English based DSLs as long as they can all generate the same query structure. But what ever you do, keep the two domains separate. SQL is a terrible API.

Joshua Gerth
Joshua Gerth
Engineering Manager
Distributed Systems Engineer
Systems Architect

My research interests include big data, language parsing and ray tracing.