Saturday, March 23, 2024

Analytical Calendar - Simplification

Post Index

Analytical Calendar The Story

2024-03-23

Analytical Calendar

Simplification

Enhancement Based on User Feedback


Analytical Calendar is very simple to use: you select the following fields in sequence:

1. Date Perspective to select the date granularity

2. Unique Date to select the date for analysis

3. Analysis Perspective to select the date range

4. Comparison period to select the date range for comparison.  

Logically speaking, the sequence is quite systematic, but many end users are not familiar with it and do not understand why the date perspective should be chosen first, which raises questions about its usage.  That said, although it is simple to select these fields one by one, the sequence is not cognitively intuitive to end users.  The calendar therefore becomes difficult to use and understand.

The issue sat for a while until a piece of feedback came in.



Feedback and Analysis

Recently, feedback was received suggesting the sequence be changed to the following:

1. Analysis Perspective to select the date range

2. Date Perspective to select the date granularity

3. Unique Date to select the date for analysis

4. Comparison period to select the date range for comparison.  

Ah..... it sparks the realization that end users start their analysis not from the date granularity but from the analysis perspective, e.g. Actual, YTD, Rolling 6 Months, etc.  With this idea, the questions below popped up to revisit:

1. Is it possible to remove date perspective but can maintain the relationship against Unique Date?

2. Is it possible to apply this sequence flow?

3. Can the analytical calendar be simpler?


As a result, a solution has come out to address this.  Let's see what it does and how it solves the problem.



Analysis Perspective merges with Date Perspective

Let's first review what is included in the Date Perspective and the Analysis Perspective.  In short, one is for the date granularity and the other is for the date period range.



It has often been said that the date perspective makes the calendar inconvenient.  Understandably, humans do not deal with date granularity explicitly in daily life; more often, it is used unconsciously, without cognition.  So a closer look is taken into the date perspective and the analysis perspective.

It is found that, to some extent, the Analysis Perspective already includes the ingredient of date granularity, e.g. Rolling 2 Years, Year to Quarter, etc.  All these ranges carry a date granularity; only "Actual" does not.  So an attempt is made below to merge the date perspective into the analysis perspective.


The major change to the Analysis Perspective is, in fact, on "Actual".  It now becomes "Actual Year", "Actual Quarter", "Actual Month", "Actual Week" and "Actual Date", spelling out exactly which granularity the Actual refers to.


As a result, only three fields are now required and the date perspective can be removed.

1. Analysis Perspective

2. Unique Date

3. Comparison Perspective

Does it look better and easier now?  Yes, it was also heard that "Unique Date" is not easy to interpret; do you have any suggestion for a better term?  Meaning-wise, I think "unique date" is the most suitable.
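As a rough illustration of the merge, the granularity formerly carried by the Date Perspective can be parsed back out of the merged Analysis Perspective value.  The sketch below is a Python toy, not the actual implementation; the value names come from the examples above, and plain values such as YTD would need their own mapping.

```python
# Derive the implied date granularity from a merged Analysis Perspective
# value, so that a separate Date Perspective field is no longer needed.
# Value names follow the examples in the post; this is a sketch only.
GRANULARITIES = {"Year", "Quarter", "Month", "Week", "Date", "Day"}

def implied_granularity(analysis_perspective: str) -> str:
    # "Actual Month" -> "Month"; "Rolling 2 Years" -> "Year";
    # "Year to Quarter" -> "Quarter" (the last granularity word wins).
    for word in reversed(analysis_perspective.split()):
        w = word.rstrip("s").capitalize()  # "Years" -> "Year", "months" -> "Month"
        if w in GRANULARITIES:
            return "Date" if w == "Day" else w
    # A plain "Actual" carries no granularity, which is exactly why the
    # post splits it into "Actual Year", "Actual Month", etc.
    raise ValueError(f"no granularity in {analysis_perspective!r}")
```

A plain "Actual" deliberately raises an error here, mirroring the reason the field had to be split in the first place.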


Model Simplification

Along with the attempt to simplify the selection process and the fields required, there is also an attempt to simplify the data model so that everyone can manipulate the analytical calendar more easily.

The below is the simplified data model.


The changes include:

1. Remove the Custom Date comparison.  It complicates the data model and its use case rarely occurs.

2. Combine the Analysis Calendar and the Comparison Calendar into the Analytical Calendar.

As a result, only three tables are required.

1. ANALYTICAL_CALENDAR holds all analysis and comparison calendar details.

2. DATE_RANGE_LINKED_TABLE holds all dates associated with the selections in the analytical calendar.

3. DIM_CALENDAR provides dimension values for chart creation.


Is it simpler to use and understand now?  What do you think?  Please leave me comments so I can improve.



Thank you for reading.  I hope it helps.  Leave me a message if you have anything to discuss or views to share.




Saturday, March 16, 2024

Analytical Calendar - PowerPoint Deck


Analytical Calendar The Story

2024-03-16

Analytical Calendar

Presentation Deck

A Deck helps to explain Analytical Calendar

*A new version will be available on 2024-03-30 


This post provides a deck to help explain the analytical calendar.  The screen captures are shown below; the corresponding PowerPoint as well as the example applications can be found in the links below.


PowerPoint

https://github.com/kongson-cheung/analytical-calendar/blob/main/PowerPoint/AnalyticalCalendar_2024-03-16.pptx


QlikView Example

https://github.com/kongson-cheung/analytical-calendar/blob/main/QlikView%20Analytical%20Calendar%20Examples/AnalyticalCalendar5.4.qvf


PowerBI Example

https://github.com/kongson-cheung/analytical-calendar/blob/main/PowerBI%20Analytical%20Calendar%20Examples/AnalyticalCalendar5.4.pbix


Qlik Sense Example

https://github.com/kongson-cheung/analytical-calendar/blob/main/Qlik%20Sense%20Analytical%20Calendar%20Examples/Analytical%20Calendar%205.0_Complete_Week_With_Custom.qvf

* For the Qlik Sense example, I have got some issues in my Qlik Sense Desktop.  You might refer to the old version at the moment.


The presentation slides:


Thank you for reading.  I hope it helps.  Leave me a message if you have anything to discuss or views to share.

Saturday, February 3, 2024

Analytical Calendar - Concept in Brief


Technical Discussion | Analytical Calendar The Story | The Index Page

2024-02-03

Analytical Calendar

Concept In Brief

Core Understanding of Analytical Calendar


Analytical Calendar looks simple yet can be complicated.  To make it easier to digest and understand, I would like to provide the concept in brief.

Let's first have a review of the Analytical Calendar.  It consists of two calendars: the Analysis Calendar and the Comparison Calendar.  The Analysis Calendar defines the analysis period while the Comparison Calendar defines the comparison period.


Analytical Calendar consists of four core fields:

Date Perspective

It defines the date granularity required for the analysis and comparison.  Selection is still possible without it, but it becomes difficult to filter out the required Unique Date from the list.

Unique Date

Based on the date perspective, it shows the corresponding dates for selection.  For instance, if Year is selected in Date Perspective, Unique Date will show 2024, 2023, 2022, etc.; if Month is selected, it shows 2024-Feb, 2024-Jan, 2023-Dec, etc.

Analysis Perspective

With the Unique Date as reference, the Analysis Perspective defines the period requirements such as:

- Actual

- YTD, YTM, YTW, etc

- Rolling N Year/Quarter/Month/Week/Day

Comparison Perspective

With the analysis requirements defined, it further selects the comparison period with respect to them.

- N Year/Quarter/Month/Week/Day Before

- Custom Year/Quarter/Month/Week/Day

- N/A
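To illustrate how the Date Perspective drives the Unique Date list described above, here is a minimal Python sketch.  It covers Year and Month granularities only; the formats follow the post's examples, and the function name is mine.

```python
from datetime import date

def unique_dates(perspective: str, start: date, end: date) -> list[str]:
    # List the Unique Date values of the chosen granularity, newest
    # first, between start and end (inclusive).  Sketch only: Year and
    # Month; formats mirror the post (e.g. 2024, 2024-Feb).
    values = []
    if perspective == "Year":
        values = [str(y) for y in range(start.year, end.year + 1)]
    elif perspective == "Month":
        y, m = start.year, start.month
        while (y, m) <= (end.year, end.month):
            values.append(date(y, m, 1).strftime("%Y-%b"))
            m += 1
            if m == 13:
                y, m = y + 1, 1
    return list(reversed(values))
```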



How the Calendars Work

In the Analytical Calendar, the usage follows the flow below.

Analysis Period --> Comparison Period --> Calendar Control --> Fact Table Association


Analysis Period

The Analysis Period is purely defined by Date Perspective, Unique Date and Analysis Perspective.

Date Perspective and Unique Date are, in essence, a new method to define a date at a particular granularity.  This method eliminates conflicts in human interpretation and manipulation of the selected date through the date hierarchy, i.e. Year-->Month/Week-->Date.

Once the date with its granularity is defined, the next step is to define the date range via the Analysis Perspective.  The selected Unique Date is actually the end date of the date range.  If the date range has the same start and end date, it is called Actual.  Other ranges are represented by YTD, YTM, Rolling N Year/Quarter/Month/Week/Day.

All these then construct the Analysis Calendar.
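The end-anchored idea above can be sketched in Python for month granularity.  This is a toy, not the actual Qlik/PowerBI implementation; the function names and the restriction to months are my own assumptions.

```python
from datetime import date, timedelta

def month_end(d: date) -> date:
    # Last day of the month containing d.
    nxt = date(d.year + (d.month == 12), d.month % 12 + 1, 1)
    return nxt - timedelta(days=1)

def analysis_period(unique_month: date, perspective: str, n: int = 1) -> tuple[date, date]:
    # The selected Unique Date (here, a month) is the END of the range.
    end = month_end(unique_month)
    if perspective == "Actual":            # same start and end month
        return unique_month.replace(day=1), end
    if perspective == "YTD":               # from 1 Jan to the end month
        return date(unique_month.year, 1, 1), end
    if perspective == "Rolling":           # n months counted back from the end
        idx = unique_month.year * 12 + unique_month.month - 1 - (n - 1)
        return date(idx // 12, idx % 12 + 1, 1), end
    raise ValueError(perspective)
```

For example, Rolling 6 months ending at 2024-03 spans October 2023 through March 2024.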


Comparison Period

The Comparison Period cannot exist without the Analysis Period.  It is, in fact, a period defined with reference to the date range of the analysis period.  The Comparison Perspective is generalized into a standard representation in two ways: one defines a date range some periods before, and the other is free date selection.

There is one very special scenario for the Comparison Perspective: no comparison period.  For completeness, a value still needs to be selected in the Comparison Perspective.

As a result, three categories are available.  The Date Perspective, Unique Date, Analysis Perspective and Comparison Perspective together construct the Comparison Calendar.
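The "N periods before" style of comparison can be sketched by shifting the whole analysis range back wholesale.  A minimal Python sketch, month granularity only; names are mine:

```python
from datetime import date, timedelta

def shift_month(d: date, n: int) -> date:
    # First day of the month n months before d's month.
    idx = d.year * 12 + d.month - 1 - n
    return date(idx // 12, idx % 12 + 1, 1)

def comparison_period(a_start: date, a_end: date, n_months: int) -> tuple[date, date]:
    # "N Months Before": shift both ends of the analysis range back by
    # n_months, snapping the end to its month end.
    start = shift_month(a_start, n_months)
    end_month = shift_month(a_end, n_months)
    nxt = date(end_month.year + (end_month.month == 12), end_month.month % 12 + 1, 1)
    return start, nxt - timedelta(days=1)
```

With an analysis period of 2023-05-01 to 2023-05-31 and "1 Month Before", this yields 2023-04-01 to 2023-04-30, matching the worked example later in the post.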


Calendar Control

Both the analysis period and the comparison period reflect the selections made in the Analysis Calendar and the Comparison Calendar.  The analysis period is tagged as ANALYSIS while the comparison period is tagged as COMPARISON.  If the CONTROL is not selected with a value, it means both analysis and comparison.

This CONTROL is very important for controlling which calendar takes effect.


Fact Table Association

The fact table is associated with both the analysis and comparison periods in the form of actual date associations.  Based on the Control (ANALYSIS or COMPARISON), it can reflect either the analysis period or the comparison period.


Two examples are illustrated below.



In the first example, Date Perspective, Unique Date and Analysis Perspective first define 2023-05 as the analysis period.  The Comparison Perspective then further defines the comparison period as 2023-04.  Both periods are combined into the Calendar Control, indicating that 2023-05-01 to 2023-05-31 is the ANALYSIS period and 2023-04-01 to 2023-04-30 is the COMPARISON period.  The actual date range reflected in the fact table is then 2023-04-01 to 2023-05-31.

The second example is similar, but this time no comparison period is defined.  Thus, the Calendar Control only has the ANALYSIS period and the fact table only reflects the date range from 2023-05-01 to 2023-05-31.


Summary

The Analysis Calendar provides methods to select values defining the ANALYSIS period while the Comparison Calendar provides methods to select values defining the COMPARISON period.  Fortunately, both calendars share the same fields: Date Perspective, Unique Date and Analysis Perspective.  This is one of the main design ideas of the Analytical Calendar and one reason why it can simplify the selections.  The other main concept is the Calendar Control: it is important to have a way to distinguish the ANALYSIS and COMPARISON periods.  Incorporating these concepts altogether gives the Analytical Calendar.

I hope this brief description helps you understand the Analytical Calendar further.  Next, I will share how to apply the Analytical Calendar in other tools like PowerBI and Tableau.


Thank you for reading.  I hope it helps you.  Just leave a message if you have anything to discuss or share.









Saturday, January 27, 2024

Analytical Calendar - Minor Enhancements


Technical Discussion | Analytical Calendar The Story | The Index Page

2024-01-27

Analytical Calendar

Minor Enhancements

Feedback drives the New Design




There are three minor enhancements to the Analytical Calendar:

1. N/A in the Comparison Calendar

2. N Year Comparison in Week

3. N Day Rolling in Month End

These three were based on user feedback and enlightened the new design.


Let's take a look at the issues one by one, each with a possible solution and discussion.


N/A in the Comparison Calendar

The Analytical Calendar strictly follows the green-white-grey design in Qlik, i.e. the associative model requires four selections for the Analysis Calendar and Comparison Calendar to function: Date Perspective, Unique Date, Analysis Perspective and Comparison Perspective.

Unfortunately, some analyses only require the Analysis Calendar rather than the Comparison Calendar, and there is no way in the Comparison Calendar to bypass the comparison.  Thus, the associated results are not accurate due to the inclusion of the comparison calendar.  No matter whether one or many Comparison Perspective values are selected, the association is broken in some sense.

As illustrated below, if nothing is selected, all dates are included; even if one of the values is selected, there are still redundant dates in the comparison calendar, so the association is not accurately restricted to the required dates.



In order to fix this issue, "N/A" is introduced in the Comparison Perspective to make the association effective only for the Analysis Calendar.  "N/A" means no effect from the Comparison Calendar.  Thus, whatever is selected in the Analysis Calendar is directly reflected in the association.  An example is illustrated below.



In general, whenever a sheet has no Comparison Calendar requirement, you can simply lock the "N/A" value in the Comparison Perspective and unlock the selection when exiting the sheet.  Although selecting "N/A" looks redundant, it keeps things clear: what you select is what you see.


N Year Comparison in Week

In 2023, luckily, 1 Jan falls on a Sunday.  Thus, in a full-week calendar design, the year has a total of 53 weeks.  All other years, where 1 Jan is not a Sunday, have only 52 weeks.  For a year with 365 days, the start shifts to the next weekday in the following year; so in 2024, 1 Jan falls on Monday.  However, leap years have 366 days, which creates an irregular shift pattern.  Let's take a look at the years from 1993 to 2024.  The table below shows the weekday of 1 Jan and whether the year is a leap year.



Every 28 (7 * 4) years, the whole pattern reappears.  A Sunday 1 Jan can recur after gaps of 5, 6 or 11 years.  The 7-day week and the varying number of days in a year create a very confusing situation.

Let's then take a look at the comparison.  As illustrated below, whenever the comparison passes through a week where 1 Jan falls on Sunday, a year that would normally compare 52 weeks before should shift 1 more week.


This problem is fixed with the additional week shift.  However, I have not been able to identify a closed-form rule for it.  My method is to review all the years included in the comparison in order to make the shift.  If you come up with a purely formula-based method, please share it with me.
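The review-the-inclusive-years approach can be sketched as below.  This is a Python toy: 52 weeks per year back, plus one extra week for every spanned year whose 1 Jan falls on Sunday.  The exact set of years to scan depends on the week definition and boundary handling, so treat the range used here as an assumption.

```python
from datetime import date

def weeks_to_shift(end_year: int, years_back: int) -> int:
    # Total week shift for an "N years before" comparison in week terms:
    # 52 weeks per year, plus one extra week for each spanned year whose
    # 1 Jan is a Sunday (a 53-week year in the full-week calendar).
    shift = 52 * years_back
    for y in range(end_year - years_back + 1, end_year + 1):
        if date(y, 1, 1).weekday() == 6:  # Sunday
            shift += 1
    return shift
```

For instance, a 1-year comparison ending in 2023 (whose 1 Jan is a Sunday) shifts 53 weeks, while one ending in 2024 shifts the usual 52.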


N Day Rolling in Month End

When talking about rolling, the comparison period should have the same number of days as the analysis period.  Now, let's take the example of 2024-02-29 rolling 5 days: the date range is from 2024-02-25 to 2024-02-29 (5 days).  If it is compared to 1 year before, what should the range be?

If we simply shift the start and end dates back 1 year, the range becomes 2023-02-25 to 2023-02-28 (4 days).  That breaks the rolling-comparison concept, which requires the same number of days in both rolling periods.

Let's apply the same method we used to define the range of 2024-02-29 rolling 5 days: take the end date and count 5 days back.  So, for the comparison, we first define the end date, i.e. 2023-02-28 (1 year before 2024-02-29; more precisely, it is 2023-02-29 by deducting 1 from the year component, but since this date does not exist, it shifts to the nearest end of the month, i.e. 2023-02-28).  With the end date at 2023-02-28, rolling 5 days gives the range 2023-02-24 to 2023-02-28 (5 days).

With the same concept, take 2024-03-31 rolling 3 days compared to 1 month before.  The analysis period is from 2024-03-29 to 2024-03-31 (3 days) while the comparison period is from 2024-02-26 to 2024-02-28 (3 days).  Does this make sense to you as well?
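The end-date-first method described above can be sketched in Python.  The clamp mirrors the 2023-02-29 -> 2023-02-28 step; only the year-shift variant is shown, and the function names are mine.

```python
from calendar import monthrange
from datetime import date, timedelta

def years_before(d: date, n: int) -> date:
    # Same day n years earlier, clamped to the month end when the day
    # does not exist (2024-02-29 -> 2023-02-28).
    last = monthrange(d.year - n, d.month)[1]
    return date(d.year - n, d.month, min(d.day, last))

def rolling_comparison(end: date, days: int, n_years: int) -> tuple[date, date]:
    # Anchor the shifted END date first, then count back the same number
    # of days, so both rolling periods always have equal length.
    cmp_end = years_before(end, n_years)
    return cmp_end - timedelta(days=days - 1), cmp_end
```

This reproduces the post's example: 2024-02-29 rolling 5 days, compared to 1 year before, gives 2023-02-24 to 2023-02-28.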


Conclusion

The Analytical Calendar involves much discussion of how to manipulate complicated dates.  The irregular grouping of month, week and year makes the operations very difficult.  The innovation of the Analytical Calendar aims at reducing the effort of human interpretation of these operations and tries to standardize the method of calculation.  There is still a long way to go to standardize all of this and reach mutual agreement on the usage and calculations across the world, but I hope the shared concepts and implementation methods can continue to evolve.  One day, the analysis can become more straightforward, easy to learn and free of confusion.


Thank you for reading.  I hope it helps you.  Just leave a message if you have anything to discuss or share.




Saturday, December 2, 2023

SQL Table and Field Extraction


2023-12-02


SQL Table and Field Extraction

Analyzing and Parsing SQL

How to design, analyze and explore 
for 
the SQL parsing and extraction



SQL is very common in the realm of data.  Every second, a huge number of SQL queries are executed to perform operations in databases, all over the world, both externally and internally within organizations.  Many data roles and positions require SQL skills to support all kinds of data applications and analytics.  Nonetheless, the overwhelming demand leads to a common question: understanding what data has been used.  A lot of organizations are looking for solutions to understand their data cycles, data lineage and data governance.  Unfortunately, there is no handy tool available.  In particular, tooling for parsing a SQL query and extracting the underlying tables and fields is also unavailable.

I have been asked for a solution to this question quite a few times.  The intuition is that even if there is a way to do it, it would not be perfect.  Apparently, it is time-consuming and requires plenty of tests and experiments.  In terms of overall priority in organizations, it is definitely not on the top priority list.  Thus, even where there is interest, no one has tried to work this area out (maybe someone has; ping me and let me know).

Nonetheless, I think it is good to try this out.  One reason is that it is a good exercise to practice different skills: design, analysis, problem solving, coding, etc.  Another reason is that it could become a core component in the future of the data realm (perhaps GenAI can do it later?).  I hope it can benefit people looking for the same solution.

As a result, in this article, I am going to provide the details of how this question is answered.  I will try my best to present my step-by-step thinking and analysis for reference, so that you can pick up the skills I used, or share something new with me based on what I have done.


OK.  Let's start.

* tables in this article also refer to views.


Overview of SQL Query

Let's take as an example a SQL query that I obtained from the web.

SELECT
  DATE_FORMAT(co.order_date, '%Y-%m') AS order_month,
  DATE_FORMAT(co.order_date, '%Y-%m-%d') AS order_day,
  COUNT(DISTINCT co.order_id) AS num_orders,
  COUNT(ol.book_id) AS num_books,
  SUM(ol.price) AS total_price,
  SUM(COUNT(ol.book_id)) OVER (
    PARTITION BY DATE_FORMAT(co.order_date, '%Y-%m')
    ORDER BY DATE_FORMAT(co.order_date, '%Y-%m-%d')
  ) AS running_total_num_books
FROM cust_order co
INNER JOIN order_line ol ON co.order_id = ol.order_id
GROUP BY 
  DATE_FORMAT(co.order_date, '%Y-%m'),
  DATE_FORMAT(co.order_date, '%Y-%m-%d')
ORDER BY co.order_date ASC;

In the SQL query, there are many elements:

  1. SQL Clauses, e.g. SELECT, FROM, JOIN, GROUP BY, ORDER BY, etc.
  2. Functions, e.g. DATE_FORMAT, SUM and COUNT.
  3. Tables, e.g. cust_order, order_line.
  4. Fields, e.g. order_date, order_id, price, etc.
  5. Aliases: for tables, e.g. co (cust_order), ol (order_line); for fields, e.g. running_total_num_books.
  6. Reserved Words, e.g. ASC, INNER, ON, etc.
  7. Text, enclosed in single quotes, e.g. '%Y-%m-%d'.  After the opening single quote, any character can follow until the matching quote is met.
  8. Separators, e.g. comma (,), space ( ), newline (\n), brackets, etc.  I refer to them all as symbols.
  9. Dots, used to separate the server, database, schema, table and field.  Usually, server, database and schema can be omitted.
In fact, there are a few other possible elements not shown in the example:
  1. Comments, written as // or /* */ or -- depending on the database and its proprietary syntax.
  2. Double Quotes, used to explicitly delimit an object name and escape special characters, e.g. "My Column".  A space inside is treated as part of the name.
  3. Asterisk (*), a special indicator to get all the fields.
  4. Operators, e.g. +, -, *, /, ||, etc.

As you can see, a lot of details are conveyed inside the SQL query.  This is my first basic understanding of the SQL query, based on known knowledge and observations.  The initial version starts from the above; later on, as more scenarios come in, it is enhanced case by case.


Tokenize the SQL Query

After the SQL query overview, the next step is to tokenize the SQL query, i.e. to construct a way to break the SQL query into tokens.  Being able to identify each component in the SQL is important before we can analyze it and identify the tables and fields.

However, how do we do it?  The straightforward way for most people would be to use whitespace, i.e. space, newline, tab, etc., as a separator to chop the SQL into pieces.  Let's have a try.


Based on the example, the below result will be obtained:

1. SELECT
2. 
3. DATE_FORMAT(co.order_date,
4. '%Y-%m')
5. AS
6. order_month,
... ...


Apparently, it is not what we expected.  However, let's calm down, observe and analyze the results.

  1. Whitespace is possibly one of the separators.
  2. It looks like symbols can also be separators, e.g. brackets, commas, single quotes, etc.
  3. The separator (either whitespace or symbol) comes right after the end of each token.
Then, let's revise the separator list as below.

  1. \s (whitespace)
  2. ,
  3. +
  4. -
  5. *
  6. /
  7. \
  8. (
  9. )
  10. . (dot)
  11. "
  12. '
  13. <
  14. >
  15. ;

Then, let's re-try the tokenization (this can be achieved using regex; I will focus on the concept and idea in this article, so if you need more details on the regex, leave me a comment).  The tokenized result will be like:

1. SELECT
2. DATE_FORMAT(
3. co
4. .
5. order_date,
6. 
7. '
8. %Y-%m'
9. )
10. 
11. AS
12. order_month,
... ...


Thus, each token is now either clean, without a suffix symbol, or not clean, with a suffix symbol.  It is therefore possible to further separate each suffix symbol onto its own row, i.e.:

1. SELECT
2. DATE_FORMAT
3. (
4. co
5. .
6. order_date
7. ,
8. 
9. '
10. %Y-%m
11. '
12. )
13.
14. AS
15. order_month
16. ,
... ...


After these steps, each token is now clean, i.e. either a word or a symbol.  Next, the tokens need to be combined into SQL units.
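The two-step tokenization above can be sketched in one pass with a regex whose capturing group keeps each separator as its own token.  A minimal Python sketch; note that the '-' inside '%Y-%m' also splits at this stage and is only glued back when strings are combined into SQL units.

```python
import re

# The separator list above as one pattern; the capturing group makes
# re.split keep every separator as its own token.
SEPARATOR = re.compile(r"(\s+|[,+\-*/\\().\"'<>;])")

def tokenize(sql: str) -> list[str]:
    # Whitespace runs survive as tokens for now; they are trimmed later.
    return [t for t in re.split(SEPARATOR, sql) if t != ""]

tokens = tokenize("SELECT DATE_FORMAT(co.order_date, '%Y-%m') AS order_month")
```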


SQL Unit

A SQL unit, basically, refers to a unit that a human recognizes as a single object (I am not quite sure what to call it, so I named it a SQL unit).  For example, in 'Hello World', the entire 'Hello World' is a SQL unit.  Another example: in Field as "Renamed As", "Renamed As" is a SQL unit.  Another example: in mySchema.myTable.myField, the entire string is a SQL unit.  From the human interpretation, each is a single unit.

The SQL unit is atomic and has a unique meaning.  Examples of SQL units are:

  1. string, each single-quote pair becomes one SQL unit.
  2. multirow comment, everything inside /* */ is a SQL unit.
  3. double-quoted string, each double-quote pair becomes one SQL unit, similar to the single quote.
  4. dot, the dot concatenates the server, database, schema, table and field into one SQL unit.
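Combining tokens into SQL units can be sketched as below.  This Python toy handles single-quoted strings and dot chains only (double quotes and /* */ comments would follow the same pattern) and assumes quotes are balanced.

```python
def to_sql_units(tokens: list[str]) -> list[str]:
    # Merge clean tokens into SQL units: a quoted string becomes one
    # unit, and name.name(.name) chains are glued by their dots.
    units, i = [], 0
    while i < len(tokens):
        t = tokens[i]
        if t == "'":                         # string: consume to the closing quote
            j = i + 1
            while tokens[j] != "'":          # assumes a balanced closing quote
                j += 1
            units.append("'" + "".join(tokens[i + 1:j]) + "'")
            i = j + 1
        elif t == "." and units and i + 1 < len(tokens):
            # dot chain: glue table.field (or schema.table.field) together
            units[-1] = units[-1] + "." + tokens[i + 1]
            i += 2
        else:
            units.append(t)
            i += 1
    return units
```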

The patterns in the tokens are therefore analyzed so that tokens can be combined into SQL units.  As a result, the below is obtained:


1. SELECT
2. DATE_FORMAT
3. (
4. co.order_date
5. ,
6. 
7. '%Y-%m'
8. )
9.
10. AS
11. order_month
12. ,
... ...


As can be seen, co.order_date is a SQL unit of a field, '%Y-%m' is a string for date formatting, etc.


Up to this moment, the whitespace tokens at the beginning and end are actually redundant, because any meaningful whitespace is already included inside the SQL units.  After trimming (or removing) them, it looks like:

1. SELECT
2. DATE_FORMAT
3. (
4. co.order_date
5. ,
6. '%Y-%m'
7. )
8. AS
9. order_month
10. ,
... ...



Next, it is time to identify patterns.  There are a few things that can be identified.



Identifying SQL Clause

A SQL query has a syntax to follow, and the syntax is organized by SQL clauses.  For example, the SELECT clause holds all the fields to be extracted from the tables/views, and the FROM clause specifies the tables/views.  Depending on which database is used, the clauses might differ slightly, but in general, the below must be identifiable.

  1. SELECT
  2. FROM
  3. JOIN (From)
  4. ON (From)
  5. WHERE
  6. GROUP BY
  7. ORDER BY 
  8. etc

Identifying the SQL clauses helps to distinguish whether an extracted SQL unit is a table or a field.  Obviously, tables can only be found in the FROM clause and JOIN (FROM) clause; fields can be found in all other clauses.

* it should refer to the syntax of the SQL query allowed in the database.


Identifying Subquery

A subquery is a SELECT statement inside another SELECT statement.  It can appear in the SELECT clause for a scalar output, or in the FROM or JOIN (FROM) clause as a temporary table result.

A subquery is easy to determine because it must sit between a pair of brackets and start with SELECT, i.e. (SELECT ... ).

Identifying subqueries is important because, for table and field extraction from the SQL query, the only focus is the first use.  Nested subqueries usually re-use the extracted fields to continue calculations, so the hierarchical usage is not a concern.

Based on this, each subquery is assigned a SubqueryID so that tables and fields can be matched within the same subquery.  This concept facilitates the matching processes for tables and fields, discussed later in the article.
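Assigning SubqueryIDs can be sketched by tracking bracket depth: a "(" immediately followed by SELECT opens a new id, while a plain function bracket inherits the enclosing one.  A Python toy that assumes balanced brackets; names are mine.

```python
def tag_subqueries(units: list[str]) -> list[tuple[int, str]]:
    # Tag every SQL unit with a SubqueryID.  The outermost query is 0;
    # each "( SELECT" opens a fresh id; a function bracket keeps the
    # id of its enclosing scope.
    tagged, depth, next_id = [], 0, 1
    scope = {0: 0}                       # bracket depth -> SubqueryID
    for i, u in enumerate(units):
        if u == "(":
            depth += 1
            is_subquery = i + 1 < len(units) and units[i + 1].upper() == "SELECT"
            scope[depth] = next_id if is_subquery else scope[depth - 1]
            if is_subquery:
                next_id += 1
        tagged.append((scope[depth], u))
        if u == ")":
            depth -= 1
    return tagged
```

Whitespace units are assumed to have been trimmed already, so the SELECT lookahead sees the very next token.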


Identifying Function

A function is always expressed in the form function(parameter1, parameter2, ..., parameter n).  It must have a bracket enclosing a parameter list.  Thus, if a bracket is found and it does not belong to a subquery SQL unit, it must be a function.

A function is neither a table nor a field, so it should be ignored for the extraction.


Identifying Special Patterns

These are very difficult to define in the first place; a lot of SQL query testing is required to figure them out.

The first example is at time zone.  None of the three words in this special pattern is in the reserved word list, but they are allowed in SQL to specify the timezone.  The phrase can be tracked as a SQL unit; in any case, it is ignored as it is neither a table nor a field.

Another example is date component appearing in some of the date functions,  e.g. date_trunc(year, DateField)

Usually, inside a function call, each parameter is a field, a number or a string.  But specific functions like date_trunc, date_part, dateadd, extract and datediff take a parameter called the date part to indicate which date component to use, e.g. year, month, day, etc.  These are neither quoted nor reserved words, so a special pattern is required to identify them.


* Just take a note that there might be more to be figured out.



Identifying Alias
There are different patterns for aliases.  They include:

1. SQLUnit as Alias

It is typical, e.g. field as fieldName or table as tableName.


2. SQLUnit Alias

It is also typical to omit the AS, e.g. field fieldName or table tableName.


3. SQLUnit "Alias" or SQLUnit as "Alias"

It is also typical to define the alias with double quotes to escape special characters, e.g. field "field Name" or table as "table Name".  AS can optionally be used.


4. With Alias

An alias inside the WITH clause has a special syntax that allows multiple queries to be executed first and given alias names.  A following query then makes use of those temporary results to complete the final SQL result.

e.g. WITH t1 AS (
  SELECT AVG(Sales) AVG_SALES FROM Store_Sales
)
SELECT a1.* FROM Store_Sales a1, t1
WHERE a1.Sales > t1.AVG_SALES;


Instead of specifying the alias at the end, this specifies the alias at the very beginning.  In the example, t1 is the alias for the result of SELECT AVG(Sales) AVG_SALES FROM Store_Sales.


Therefore, it is possible to know aliases of fields and aliases of tables.



Removing Objects
The original purpose is to find the tables and fields used in the SQL query.  Thus, everything that is neither a table nor a field needs to be removed.  That includes:

  1. Reserved Words
  2. Functions
  3. Symbols (separators)
  4. Comments (single row comment // and multirow comments /* */)
  5. String
  6. Operators
  7. Numeric 
  8. Identified Patterns
  9. Aliases

Once all the above are identified and removed from the SQL query, only tables and fields remain.

The objects to remove are identified based on regression testing.  Once a result is obtained, it is obvious what further needs to be removed.  The above list took at least 10+ runs over different queries to identify.




Once the patterns are identified, the next step is to finalize the tables and fields discovered.

Matching Table and Field

After tokenization, SQL unit combination and pattern identification, the next step is to match the tables and fields against the metadata in the database.

For tables, it is straightforward to obtain them from the FROM or JOIN (FROM) clauses.  For fields, it is a bit tricky.

Because of the possible use of the asterisk, the only feasible way is to cross-check against the database metadata; otherwise, it is impossible to know which fields are included.  Also, a field in the SQL query is usually specified without the table prefix.  Unless there is a full list of metadata to cross-check, it is impossible to determine which table the field belongs to.

To match a field, different kinds of matching are required:
  1. Match the exact name (no table prefix)
  2. Match with alias
  3. Match the Full/Partial FQDN  (database.schema.table.field)

Each FROM or JOIN (FROM) clause specifies the tables while the corresponding SELECT clause specifies the fields.  Each subquery has been assigned a SubqueryID, with the outermost query being 0.  As a result, a field must come from one of the tables with the same SubqueryID.  Thus, one of the above 3 matching criteria will ensure a match.

* it is assumed the query is syntactically correct and executable without error.  Thus, cross-checking with the database metadata yields a unique answer.
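The matching rules can be sketched as below.  In this Python toy, `metadata` maps each table to its column set, and the tables and aliases passed in are those sharing the field's SubqueryID; all names are illustrative, and full server.database.schema FQDN prefixes would need their last component resolved first.

```python
def match_field(field: str, tables_in_scope: list[str],
                aliases: dict[str, str],
                metadata: dict[str, set[str]]) -> str:
    # Resolve which table a field belongs to within one SubqueryID:
    # prefixed names match via alias or table-name prefix; bare names
    # are scanned against the in-scope tables' metadata.
    if "." in field:
        prefix, name = field.rsplit(".", 1)
        table = aliases.get(prefix, prefix)      # alias, or the table itself
        if name in metadata.get(table, set()):
            return table
    else:
        for table in tables_in_scope:            # bare name: scan the scope
            if field in metadata.get(table, set()):
                return table
    raise LookupError(f"{field!r} not found in scope")
```

Because the query is assumed valid, a bare name that exists in several in-scope tables would have carried a prefix in the original SQL, so the first match suffices in this sketch.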



Matching Asterisk

The match for the asterisk is similar:
  1. Match with alias
  2. Match the Full/Partial FQDN
But instead of matching a single field, it gets all the fields of the specified table.  If there is no prefix, it means all fields of every table in the table list.




Conclusion

The purpose of this article is not to provide a tool but to share the understanding and exploration that have been done, so that the mindset for tackling this kind of problem can be understood, and, if possible, to enable you to think further.

There is an assumption that the SQL query must be syntactically correct and runnable.  The extraction does no debugging; it is purely pattern recognition.

The method discussed still cannot promise 100% extraction, but by continuing with more test cases and identified patterns, it will eventually approach 100% (it also depends on which SQL standard is used, and that varies).  In our own testing, it can tackle the majority of SQL queries.

At the end, I would like to say "practice makes perfect".   Try more and you would learn more.




Thanks for reading.  I hope you enjoy my post.


By the way, share your thoughts with me and leave me comments.