2023-12-02

SQL Table and Field Extraction

Analyzing and Parsing SQL

SQL is everywhere in the data world. Every second, a huge number of SQL queries are executed against databases, both inside organizations and between them, and many data roles require SQL skills to support all kinds of data applications and analytics. This heavy use leads to a common question: which data is actually being used? Many organizations are looking for ways to understand their data cycles, data lineage and data governance. Unfortunately, I have not found a handy tool for this. In particular, I could not find one that parses a SQL query and extracts the underlying tables and fields.

I have been asked for a solution to this question quite a few times. The intuition is that even if there is a way to do it, it will not be perfect. It is also time-consuming and requires plenty of tests and experiments. In most organizations, it is nowhere near the top of the priority list. Thus, even where there is interest, nobody seems to be working in this area (if someone is, ping me and let me know).

Nonetheless, I think it is worth trying. One reason is that it is a good exercise for practicing different skills: design, analysis, problem solving, coding, etc. Another reason is that it could become a core component of the data world in the future (perhaps GenAI can do it later?). I hope it benefits people who are looking for the same solution.

So, in this article, I am going to explain in detail how I answered this question. I will do my best to share my step-by-step thinking and analysis so that you can pick up the same skills, or share something new with me based on what I have done.

OK. Let's start.

* In this article, "table" refers to both tables and views.

Overview of SQL Query

Let's take an example of a SQL query that I obtained from the web.

SELECT
  DATE_FORMAT(co.order_date, '%Y-%m') AS order_month,
  DATE_FORMAT(co.order_date, '%Y-%m-%d') AS order_day,
  COUNT(DISTINCT co.order_id) AS num_orders,
  COUNT(ol.book_id) AS num_books,
  SUM(ol.price) AS total_price,
  SUM(COUNT(ol.book_id)) OVER (
    PARTITION BY DATE_FORMAT(co.order_date, '%Y-%m')
    ORDER BY DATE_FORMAT(co.order_date, '%Y-%m-%d')
  ) AS running_total_num_books
FROM cust_order co
INNER JOIN order_line ol ON co.order_id = ol.order_id
GROUP BY
  DATE_FORMAT(co.order_date, '%Y-%m'),
  DATE_FORMAT(co.order_date, '%Y-%m-%d')
ORDER BY co.order_date ASC;

In the SQL query, there are many elements:

SQL clauses, e.g. SELECT, FROM, JOIN, GROUP BY, ORDER BY, etc.
Functions, e.g. DATE_FORMAT, SUM and COUNT.
Tables, e.g. cust_order and order_line.
Fields, e.g. order_date, order_id, price, etc.
Aliases. For tables, e.g. co (cust_order) and ol (order_line). For fields, e.g. running_total_num_books.
Reserved words, e.g. ASC, INNER, ON, etc.
Text, enclosed in single quotes, e.g. '%Y-%m-%d'. After the opening single quote, any character can appear until the closing quote is met.
Separators, e.g. comma (,), space ( ), newline (\n), brackets, etc. I refer to them all as symbols.
Dot, used to separate the server, database, schema, table and field. Usually, the server, database and schema can be omitted.

In fact, there are a few other possible elements not shown in the example:

Comments, written as //, /* */ or --, depending on the database and its proprietary syntax.
Double quotes, used to give an object name explicitly so that special characters are escaped, e.g. "My Column". A space inside the quotes is treated as a character of the field name.
Asterisk (*), a special indicator to get all the fields.
Operators, e.g. +, -, *, /, ||, etc.

As you can see, a lot of detail is conveyed inside a SQL query. This is my first, basic understanding of the SQL query, based on existing knowledge and observation. It is the starting point; later on, as more scenarios come in, it is enhanced case by case.

Tokenize the SQL Query

After the overview, the next step is to tokenize the SQL query, i.e. to find a way to break the query into tokens. Each component of the SQL must be identifiable before we can analyze it and identify the tables and fields.

But how? The straightforward way is to use whitespace (space, newline, tab, etc.) as the separator to chop the SQL into pieces. Let's have a try.

Based on the example, the result below is obtained:

1. SELECT
2.
3. DATE_FORMAT(co.order_date,
4. '%Y-%m')
5. AS
6. order_month,
... ...
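The whitespace split can be reproduced with a few lines of Python. This is only a small sketch: the input is the first lines of the example query, and the single space of indentation is an assumption made so that the empty second item shows up.

```python
import re

# First two lines of the example query; the one-space indentation is assumed.
sql = "SELECT\n DATE_FORMAT(co.order_date, '%Y-%m') AS order_month,"

# Split on every single whitespace character (space, newline, tab, ...).
pieces = re.split(r"\s", sql)

for number, piece in enumerate(pieces, start=1):
    print(f"{number}. {piece}")
```

The function call, the string and the trailing commas stay glued together, which is why whitespace alone is not enough.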

Clearly, this is not what we expected. However, let's calm down, observe and analyze the results.

Whitespace is one of the separators.
Symbols can also be separators, e.g. bracket, comma, single quote, etc.
The separator (either whitespace or symbol) comes right after the end of each token.

Then, let's revise the separator list as below.

\s (whitespace)
,
+
-
*
/
\
(
)
. (dot)
"
'
=
<
>
;

Then, let's retry the tokenization. This can be achieved with regex; I focus on the concept and idea in this article, so leave me a comment if you need more details on the regex. The tokenized result looks like this:

1. SELECT
2. DATE_FORMAT(
3. co
4. .
5. order_date,
6.
7. '
8. %Y-%m'
9. )
10.
11. AS
12. order_month,
... ...

Each token is now either clean (no trailing symbol) or not clean (with a trailing symbol). The trailing symbol can then be split off into the next row, i.e.:

1. SELECT
2. DATE_FORMAT
3. (
4. co
5. .
6. order_date
7. ,
8.
9. '
10. %Y-%m
11. '
12. )
13.
14. AS
15. order_month
16. ,
... ...

After these steps, each token is clean, i.e. either a word or a symbol. Next, the tokens need to be combined into SQL units.
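The two tokenization steps can be collapsed into one regex split. The sketch below is my own assumption of how it could look in Python, not the exact regex used for the article; the names SYMBOLS, SEPARATOR and tokenize are made up for illustration.

```python
import re

# Separator symbols from the revised list; whitespace is covered by \s.
SYMBOLS = ",+-*/\\().\"'=<>;"
SEPARATOR = re.compile(r"(\s|[" + re.escape(SYMBOLS) + r"])")

def tokenize(sql):
    """Split sql so that every word and every separator is its own token."""
    # The capturing group makes re.split keep the separators in the result;
    # empty strings appear between adjacent separators and are dropped.
    return [token for token in SEPARATOR.split(sql) if token != ""]

tokens = tokenize("DATE_FORMAT(co.order_date, '%Y-%m') AS order_month,")
for number, token in enumerate(tokens, start=1):
    print(f"{number}. {token}")
```

Because the minus sign is in the separator list, this sketch also splits the text inside the quotes into %Y, - and %m. That is harmless as long as everything between a pair of quotes is joined again in the next step.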

SQL Unit

A SQL unit is, basically, a unit that a human recognizes as one object (I am not sure what the proper name is, so I call it a SQL unit). For example, in 'Hello World', the entire 'Hello World' is one SQL unit. In Field as "Renamed As", "Renamed As" is one SQL unit. In mySchema.myTable.myField, the entire string is one SQL unit. To a human reader, each of them is a single unit.

A SQL unit is atomic and has a unique meaning. Examples of SQL units are:

String: each pair of single quotes becomes one SQL unit.
Multi-row comment: everything inside /* */ is one SQL unit.
Double-quoted string: each pair of double quotes becomes one SQL unit, similar to single quotes.
Dot: the dot concatenates the server, database, schema, table and field into one SQL unit.

The patterns in the tokens are therefore analyzed and the tokens are combined into SQL units. As a result, the below is obtained:

1. SELECT
2. DATE_FORMAT
3. (
4. co.order_date
5. ,
6.
7. '%Y-%m'
8. )
9.
10. AS
11. order_month
12. ,
... ...

As can be seen, co.order_date is a SQL unit for a field, '%Y-%m' is a string for date formatting, etc.

At this point, the whitespace tokens are redundant because any meaningful whitespace is already included inside a SQL unit. After removing them, the result looks like:

1. SELECT
2. DATE_FORMAT
3. (
4. co.order_date
5. ,
6. '%Y-%m'
7. )
8. AS
9. order_month
10. ,
... ...
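Combining tokens into SQL units and dropping the leftover whitespace could be sketched as below. This is a simplified illustration under my own assumptions: the function names are hypothetical, doubled quotes inside a string ('it''s') and a quoted name right after a dot (schema."My Table") are not handled, and single-row comments are left out.

```python
import re

SEPARATOR = re.compile(r"(\s|[" + re.escape(",+-*/\\().\"'=<>;") + r"])")

def tokenize(sql):
    return [token for token in SEPARATOR.split(sql) if token != ""]

def to_sql_units(tokens):
    """Merge quoted text, /* */ comments and dotted names; drop whitespace."""
    units, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("'", '"'):
            # Quoted unit: take everything up to the matching closing quote.
            j = i + 1
            while j < len(tokens) and tokens[j] != token:
                j += 1
            units.append("".join(tokens[i:j + 1]))
            i = j + 1
        elif token == "/" and tokens[i + 1:i + 2] == ["*"]:
            # Multi-row comment: take everything up to the closing */.
            j = i + 2
            while j + 1 < len(tokens) and tokens[j:j + 2] != ["*", "/"]:
                j += 1
            units.append("".join(tokens[i:j + 2]))
            i = j + 2
        elif token == "." and units and i + 1 < len(tokens):
            # Dot: glue the previous unit and the next token together.
            units[-1] = units[-1] + "." + tokens[i + 1]
            i += 2
        elif token.isspace():
            i += 1  # whitespace outside a unit carries no meaning
        else:
            units.append(token)
            i += 1
    return units

sql = "SELECT\n DATE_FORMAT(co.order_date, '%Y-%m') AS order_month,"
units = to_sql_units(tokenize(sql))
print(units)
```

Whitespace inside quotes survives because the quoted branch joins the raw tokens, while whitespace between units is skipped.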

Next comes identifying patterns. There are a few things that can be identified.

Identifying SQL Clause

A SQL query follows a syntax, and the syntax is structured by SQL clauses. For example, the SELECT clause lists the fields to be extracted from the table/view, and the FROM clause specifies the tables/views. Depending on the database, the clauses might differ slightly, but in general the ones below must be identifiable.

SELECT
FROM
JOIN (From)
ON (From)
WHERE
GROUP BY
ORDER BY
etc

Identifying the SQL clauses helps to distinguish whether an extracted SQL unit is a table or a field. A table can only be found in the FROM clause and the JOIN (FROM) clause. Fields can be found in all other clauses.

* This should follow the SQL syntax allowed by the database in use.
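One way to picture clause identification is to walk through the SQL units and remember which clause keyword was seen last. The sketch below is an assumption for illustration only: the keyword sets are incomplete, and it keeps FROM, JOIN and ON as separate labels even though the article groups JOIN and ON under FROM. It also ignores brackets, so an ORDER BY inside OVER ( ... ) would be mistaken for the ORDER BY clause.

```python
# Illustrative clause keywords; the real list depends on the SQL dialect.
ONE_WORD = {"SELECT", "FROM", "JOIN", "ON", "WHERE", "HAVING"}
TWO_WORD = {("GROUP", "BY"): "GROUP BY", ("ORDER", "BY"): "ORDER BY"}

def tag_clauses(units):
    """Return (clause, unit) pairs; the clause keywords themselves are dropped."""
    tagged, clause, i = [], None, 0
    while i < len(units):
        word = units[i].upper()
        following = units[i + 1].upper() if i + 1 < len(units) else ""
        if (word, following) in TWO_WORD:
            clause = TWO_WORD[(word, following)]
            i += 2
        elif word in ONE_WORD:
            clause = word
            i += 1
        else:
            tagged.append((clause, units[i]))
            i += 1
    return tagged

units = ["SELECT", "co.order_id", "FROM", "cust_order", "co",
         "INNER", "JOIN", "order_line", "ol",
         "ON", "co.order_id", "=", "ol.order_id",
         "ORDER", "BY", "co.order_date"]
tagged = tag_clauses(units)
for clause, unit in tagged:
    print(clause, unit)
```

With the labels in place, units tagged FROM or JOIN are the table candidates, and the rest are field candidates once reserved words, symbols and aliases are taken out.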

Identifying Subquery

A subquery is a SELECT statement inside another SELECT statement. It can appear in the SELECT clause for a scalar output, or in the FROM or JOIN (FROM) clause as a temporary table result.

A subquery is easy to detect because it must sit between a pair of brackets and start with SELECT, i.e. (SELECT ... ).

Identifying subqueries is important because, for table and field extraction, the only focus is the first use of a table or field. Nested queries usually re-use fields that have already been extracted to continue the calculation, so the hierarchical usage is not a concern.

For this reason, each subquery is given a SubqueryID so that tables and fields can be matched within the same subquery. This makes the matching process for tables and fields easier, as discussed later in the article.
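Assigning a SubqueryID can be sketched with a stack that follows the brackets. The code below is a hypothetical illustration, not the actual implementation: the outermost query is 0, every "(" directly followed by SELECT opens a new ID, and the matching ")" returns to the enclosing ID.

```python
def assign_subquery_ids(units):
    """Return (subquery_id, unit) pairs for a list of SQL units."""
    result = []
    enclosing = [0]   # IDs of the queries we are inside, innermost last
    brackets = []     # one flag per open bracket: did it start a subquery?
    next_id = 1
    for i, unit in enumerate(units):
        if unit == "(":
            starts_subquery = units[i + 1:i + 2] != [] and units[i + 1].upper() == "SELECT"
            brackets.append(starts_subquery)
            result.append((enclosing[-1], unit))  # the bracket belongs to the outer query
            if starts_subquery:
                enclosing.append(next_id)
                next_id += 1
            continue
        if unit == ")" and brackets and brackets.pop():
            enclosing.pop()  # leaving a subquery
        result.append((enclosing[-1], unit))
    return result

units = ["SELECT", "x.b", "FROM", "(", "SELECT", "MAX", "(", "b", ")", "b",
         "FROM", "t", ")", "x"]
ids = assign_subquery_ids(units)
for subquery_id, unit in ids:
    print(subquery_id, unit)
```

In this example the table t and the field b both carry ID 1, so b only has to be matched against t, while x.b in the outer query carries ID 0.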

Identifying Function

A function is always expressed in the form function(parameter1, parameter2, ..., parameter n). It must have a pair of brackets enclosing a parameter list. Thus, if a name is directly followed by an opening bracket and the bracket does not start a subquery, it is treated as a function.

A function is neither a table nor a field. Thus, it should be ignored for the extraction.
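The function rule can be written as a small check on neighbouring SQL units. This is a sketch under assumptions: the reserved word set is a short illustrative sample (words such as OVER or IN can also be followed by a bracket without being functions), and the helper name is made up.

```python
import re

# Words that may precede "(" without being functions (illustrative, not complete).
RESERVED = {"SELECT", "FROM", "JOIN", "ON", "WHERE", "AS", "IN", "OVER",
            "AND", "OR", "NOT", "EXISTS", "BY", "HAVING"}
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")

def find_function_positions(units):
    """Return the indexes of units that look like function names."""
    positions = []
    for i in range(len(units) - 1):
        name, bracket = units[i], units[i + 1]
        if bracket != "(":
            continue
        if not NAME.fullmatch(name) or name.upper() in RESERVED:
            continue
        starts_subquery = i + 2 < len(units) and units[i + 2].upper() == "SELECT"
        if not starts_subquery:
            positions.append(i)
    return positions

units = ["SUM", "(", "COUNT", "(", "ol.book_id", ")", ")",
         "OVER", "(", "PARTITION", "BY", "x", ")"]
positions = find_function_positions(units)
print([units[i] for i in positions])
```

SUM and COUNT are reported, while OVER is skipped because it is in the reserved word sample.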

Identifying Special Patterns

These are very difficult to define up front. It takes a lot of testing with SQL queries to figure them out.

The first example is AT TIME ZONE. None of the three words in this special pattern is in the reserved word list, but together they are allowed in SQL to specify the time zone. The pattern can be tracked as a SQL unit, but either way it is ignored because it is neither a table nor a field.

Another example is the date component that appears in some date functions, e.g. date_trunc(year, DateField).

Usually, an argument inside a function call is a field, a number or a string. Specific functions such as date_trunc, date_part, dateadd, extract and datediff, however, take a parameter called the date part to indicate which date component to use, e.g. year, month, day, etc. These date parts are not required to be quoted and are not reserved words, so a special pattern is required to identify them.

* Note that there might be more patterns to be figured out.
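The two special patterns could be detected as in the sketch below. The function list and the date part list are illustrative assumptions and would need to be extended per database; the helper name is hypothetical.

```python
# Illustrative lists; extend them for the SQL dialect in use.
DATE_PART_FUNCTIONS = {"DATE_TRUNC", "DATE_PART", "DATEADD", "EXTRACT", "DATEDIFF"}
DATE_PARTS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}

def find_special_positions(units):
    """Return the indexes of units that belong to a special pattern."""
    upper = [unit.upper() for unit in units]
    special = set()
    for i, word in enumerate(upper):
        # Pattern 1: the three words AT TIME ZONE in a row.
        if upper[i:i + 3] == ["AT", "TIME", "ZONE"]:
            special.update({i, i + 1, i + 2})
        # Pattern 2: an unquoted date part as the first argument of a date function.
        following = upper[i + 1:i + 3]
        if (word in DATE_PART_FUNCTIONS and len(following) == 2
                and following[0] == "(" and following[1] in DATE_PARTS):
            special.add(i + 2)
    return special

units = ["date_trunc", "(", "year", ",", "DateField", ")",
         "AT", "TIME", "ZONE", "'UTC'"]
special = find_special_positions(units)
print(sorted(special))
```

Only the date part in the first argument position is flagged, so a field that happens to be called year elsewhere in the query is left alone.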

Identifying Alias
There are different patterns for aliases. They include:

1. SQLUnit as Alias

This is the typical form, e.g. field as fieldName or table as tableName.

2. SQLUnit Alias

It is also typical to omit the AS, e.g. field fieldName or table tableName.

3. SQLUnit "Alias" or SQLUnit as "Alias"

It is also typical to define the alias in double quotes to escape special characters, e.g. field "field Name" or table as "table Name". AS is optional.

4. With Alias

An alias inside the WITH clause has a special syntax that allows multiple queries to be defined first and given aliases. A following query then makes use of those temporary results to produce the final SQL result.

e.g.

WITH t1 AS (
  SELECT AVG(Sales) AVG_SALES FROM Store_Sales
)
SELECT a1.* FROM Store_Sales a1, t1
WHERE a1.Sales > t1.AVG_SALES;

Instead of specifying the alias at the end, the alias is specified at the very beginning. In the example, t1 is the alias for the result of SELECT AVG(Sales) AVG_SALES FROM Store_Sales.

Therefore, it is possible to know the aliases of fields and the aliases of tables.
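The alias patterns can be sketched as checks on each SQL unit and its neighbours. Everything here is an assumption for illustration: the reserved word sample is incomplete, the labels <subquery> and <expression> are made up, and AS inside CAST(x AS type) would be wrongly reported as an alias.

```python
import re

RESERVED = {"SELECT", "FROM", "WHERE", "JOIN", "INNER", "LEFT", "RIGHT", "FULL",
            "OUTER", "CROSS", "ON", "AS", "WITH", "GROUP", "ORDER", "BY",
            "HAVING", "AND", "OR", "NOT", "IN", "ASC", "DESC", "UNION", "OVER"}
NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_.]*|"[^"]+"')

def is_name(unit):
    return bool(NAME.fullmatch(unit)) and unit.upper() not in RESERVED

def find_aliases(units):
    """Return {alias: what it renames} for a list of SQL units."""
    aliases = {}
    for i, unit in enumerate(units):
        previous = units[i - 1] if i > 0 else ""
        following = units[i + 1] if i + 1 < len(units) else ""
        if unit.upper() == "AS":
            if following == "(" and is_name(previous):
                aliases[previous] = "<subquery>"          # WITH alias AS ( ... )
            elif is_name(following):                      # unit AS alias
                aliases[following] = previous if is_name(previous) else "<expression>"
        elif is_name(unit) and previous.upper() != "AS":
            if is_name(previous):                         # unit alias (AS omitted)
                aliases[unit] = previous
            elif previous == ")":                         # function(...) alias
                aliases[unit] = "<expression>"
    return aliases

units = ["WITH", "t1", "AS", "(", "SELECT", "AVG", "(", "Sales", ")", "AVG_SALES",
         "FROM", "Store_Sales", ")", "SELECT", "a1.*", "FROM", "Store_Sales", "a1",
         ",", "t1", "WHERE", "a1.Sales", ">", "t1.AVG_SALES", ";"]
aliases = find_aliases(units)
print(aliases)
```

For the WITH example, this reports t1 as a subquery alias, AVG_SALES as an alias of an expression and a1 as an alias of Store_Sales.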

Removing Objects
The original purpose is to find the tables and fields used in the SQL query. This means removing everything that is neither a table nor a field, including:

Reserved Words
Functions
Symbols (separators)
Comments (single-row comments such as // or -- and multi-row comments /* */)
String
Operators
Numeric
Identified Patterns
Aliases

Once all of the above are identified and removed from the SQL query, what remains is only tables and fields.

The objects to remove were identified through regression testing. Once a result is obtained, it is easy to notice what else needs to be removed. The list above took at least 10 runs over different queries to identify.

Once the patterns are identified, the next step is to finalize the tables and fields discovered.
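The removal step amounts to a filter over the SQL units. The sketch below is a simplified assumption of how such a filter could look: the reserved word and symbol sets are small samples, the aliases and special pattern positions are passed in as if they had been found earlier, and the asterisk is deliberately not treated as a symbol because it can also mean "all fields".

```python
import re

RESERVED = {"SELECT", "FROM", "WHERE", "JOIN", "INNER", "ON", "AS", "GROUP",
            "ORDER", "BY", "ASC", "DESC", "AND", "OR", "DISTINCT", "OVER"}
# "*" is left out on purpose: it is either multiplication or "all fields".
SYMBOLS = set(",+-/\\()=<>;") | {"||", "<>", "<=", ">="}
NUMBER = re.compile(r"\d+(\.\d+)?")
COMMENT_STARTS = ("/*", "--", "//")

def keep_tables_and_fields(units, aliases=(), special_positions=()):
    """Drop everything that is known not to be a table or a field."""
    kept = []
    for i, unit in enumerate(units):
        following = units[i + 1] if i + 1 < len(units) else ""
        if (unit.upper() in RESERVED                # reserved words
                or unit in SYMBOLS                  # separators and operators
                or unit.startswith("'")             # strings
                or unit.startswith(COMMENT_STARTS)  # comments
                or NUMBER.fullmatch(unit)           # numerics
                or following == "("                 # function names
                or i in special_positions           # identified patterns
                or unit in aliases):                # aliases
            continue
        kept.append(unit)
    return kept

units = ["SELECT", "DATE_FORMAT", "(", "co.order_date", ",", "'%Y-%m'", ")",
         "AS", "order_month", ",", "COUNT", "(", "ol.book_id", ")", "AS", "num_books",
         "FROM", "cust_order", "co", "INNER", "JOIN", "order_line", "ol",
         "ON", "co.order_id", "=", "ol.order_id", ";"]
kept = keep_tables_and_fields(units, aliases={"order_month", "num_books", "co", "ol"})
print(kept)
```

What is left is a short list of table names and (possibly prefixed) field names, ready to be matched against the metadata.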

Matching Table and Field

After tokenization, SQL unit combination and pattern identification, the next step is to match the tables and fields against the metadata in the database.

For tables, it is straightforward: they are obtained from the FROM or JOIN (FROM) clauses. For fields, it is a bit tricky.

Because of the possible use of the asterisk, the only way is to cross-check against the database metadata. Otherwise, it is impossible to know which fields are included. Also, a field in a SQL query is often written without the table prefix. Unless there is a full list of metadata to cross-check, it is impossible to be sure which table the field belongs to.

To match a field, different kinds of matching are required:

Match the exact name (no table prefix)
Match with alias
Match the full or partial qualified name (database.schema.table.field)

Each FROM clause or JOIN (FROM) clause specifies the tables, while the corresponding SELECT clause specifies the fields. Each subquery has been assigned a SubqueryID, and the outermost query is 0. As a result, a field must come from one of the tables with the same SubqueryID, and a match on any one of the 3 criteria above is sufficient.

* It is assumed that the query is syntactically correct and executes without error. Thus, when cross-checked with the database metadata, the answer is unique.
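The three matching criteria can be sketched as one lookup function. The metadata layout (a dictionary from table name to field names), the function name and the sample schema "shop" are assumptions for illustration; name comparison is case-sensitive here, which a real database may not be.

```python
def match_field(reference, tables, aliases, metadata):
    """Resolve a field reference to (table, field) within one subquery.

    tables:   table names used in the same SubqueryID
    aliases:  {alias: table name} for the same SubqueryID
    metadata: {table name: set of field names} from the database
    """
    *prefix_parts, field = reference.split(".")
    prefix = ".".join(prefix_parts)
    if not prefix:
        candidates = list(tables)              # exact name, no table prefix
    elif prefix in aliases:
        candidates = [aliases[prefix]]         # match with alias
    else:
        candidates = [table for table in tables  # full or partial table name
                      if table == prefix or table.endswith("." + prefix)]
    hits = [(table, field) for table in candidates
            if field in metadata.get(table, ())]
    # A valid query gives exactly one hit; anything else is left unresolved.
    return hits[0] if len(hits) == 1 else None

metadata = {
    "shop.cust_order": {"order_id", "order_date"},
    "shop.order_line": {"order_id", "book_id", "price"},
}
tables = ["shop.cust_order", "shop.order_line"]
aliases = {"co": "shop.cust_order", "ol": "shop.order_line"}

print(match_field("price", tables, aliases, metadata))
print(match_field("co.order_date", tables, aliases, metadata))
print(match_field("order_line.book_id", tables, aliases, metadata))
```

An unprefixed order_id returns None in this sketch because it exists in both tables; a query that runs without error would not contain such an ambiguous reference.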

Matching Asterisk

The matching for the asterisk is similar:

Match with alias
Match the full or partial qualified name

But instead of matching a single field, it gets all the fields of the specified table. If there is no prefix, it means all fields of every table in the table list.
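Expanding the asterisk follows the same idea as matching a single field. The sketch below uses the same assumed metadata layout (table name to field names); the function name and the field list of Store_Sales are hypothetical.

```python
def expand_asterisk(reference, tables, aliases, metadata):
    """Expand "*" or "prefix.*" into (table, field) pairs."""
    prefix = reference[:-2] if reference.endswith(".*") else ""
    if not prefix:
        targets = list(tables)                  # no prefix: every table in scope
    elif prefix in aliases:
        targets = [aliases[prefix]]             # match with alias
    else:
        targets = [table for table in tables    # full or partial table name
                   if table == prefix or table.endswith("." + prefix)]
    # Fields are sorted only to make the output stable.
    return [(table, field)
            for table in targets
            for field in sorted(metadata.get(table, ()))]

metadata = {"Store_Sales": {"Store", "Sales"}, "t1": {"AVG_SALES"}}
tables = ["Store_Sales", "t1"]
aliases = {"a1": "Store_Sales"}

print(expand_asterisk("a1.*", tables, aliases, metadata))
print(expand_asterisk("*", tables, aliases, metadata))
```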

Conclusion

The purpose of this article is not to provide a tool but to share the understanding and exploration that have been done, so that the mindset for tackling this kind of problem can be understood, or, if possible, to enable you to think further.

The assumption is that the SQL query is syntactically correct and runnable. The extraction does not do any debugging; it is purely pattern recognition.

The method discussed still cannot promise 100% extraction, but as more test cases and patterns are identified, it should get closer to 100% (it also depends on which SQL dialect is used, and dialects vary). In our own testing, it can handle the majority of SQL queries.

At the end, I would like to say "practice makes perfect". Try more and you would learn more.

Thanks for reading. I hope you enjoy my post.

By the way, share your thoughts with me and leave me comments.

* If you like my post, support me by buying me a coffee: https://buymeacoffee.com/kongsoncheung.
