Saturday, February 3, 2024

Analytical Calendar - Concept in Brief


2024-02-03

Analytical Calendar

Concept In Brief

Core Understanding of Analytical Calendar


Analytical Calendar looks simple, yet it can be complicated.  To make it easier to digest and understand, I would like to present the concept in brief.

Let's first review the Analytical Calendar.  It consists of two calendars: the Analysis Calendar and the Comparison Calendar.  The Analysis Calendar defines the analysis period while the Comparison Calendar defines the comparison period.


Analytical Calendar consists of four core fields:

Date Perspective

It defines what date granularity is required for the analysis and comparison.  Without this selection, the calendar still works, but it becomes difficult to filter out the required Unique Date from the list.

Unique Date

Based on the Date Perspective, it shows the corresponding dates for selection.  For instance, if Year is selected in Date Perspective, Unique Date shows 2024, 2023, 2022, etc.; if Month is selected, it shows 2024-Feb, 2024-Jan, 2023-Dec, etc.

Analysis Perspective

With the Unique Date as reference, Analysis Perspective defines the period requirements such as:

- Actual

- YTD, YTM, YTW, etc

- Rolling N Year/Quarter/Month/Week/Day

Comparison Perspective

With the analysis requirements defined, it further selects the comparison period with respect to those requirements.

- N Year/Quarter/Month/Week/Day Before

- Custom Year/Quarter/Month/Week/Day

- N/A



How the Calendars Work

In Analytical Calendar, the usage follows the flow below.

Analysis Period --> Comparison Period --> Calendar Control --> Fact Table Association


Analysis Period

The Analysis Period is purely defined by Date Perspective, Unique Date and Analysis Perspective.

Date Perspective and Unique Date are, in effect, a new method to define a date at a particular granularity.  This method eliminates conflicts in human interpretation and manipulation of the selected date through a date hierarchy, i.e. Year --> Month/Week --> Date.

Once the date with its granularity is defined, the next step is to define the date range by the Analysis Perspective.  The selected Unique Date is actually the end date of the date range.  If the date range has the same start and end date, it is called Actual.  Other ranges are represented by YTD, YTM, Rolling N Year/Quarter/Month/Week/Day, etc.
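To make this concrete, below is a minimal Python sketch, purely for illustration, of how a selected Unique Date plus an Analysis Perspective could resolve into a date range.  The function name and perspective labels are my own assumptions, not part of the actual implementation.

from datetime import date, timedelta

def analysis_range(unique_date_end: date, perspective: str, n: int = 1):
    """Return (start, end) of the analysis period, where unique_date_end is
    the end date implied by the selected Unique Date."""
    end = unique_date_end
    if perspective == "Actual":
        return end, end                       # same start and end date
    if perspective == "YTD":
        return date(end.year, 1, 1), end      # year start up to the selected date
    if perspective == "Rolling N Day":
        return end - timedelta(days=n - 1), end
    raise ValueError("Unsupported perspective: " + perspective)

print(analysis_range(date(2023, 5, 31), "Rolling N Day", 5))
# (datetime.date(2023, 5, 27), datetime.date(2023, 5, 31))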

All these then construct the Analysis Calendar.


Comparison Period

The Comparison Period cannot exist without the Analysis Period.  It is, in fact, a period defined with reference to the date range of the analysis period.  The Comparison Perspective generalizes this with a standard representation in two ways: one defines a date range some distance before, and the other is a free date selection.

There is one very special scenario for the Comparison Perspective: no comparison period at all.  For completeness, a value is still required to be selected, hence the N/A option.

As a result, three categories are available.  The Date Perspective, Unique Date, Analysis Perspective and Comparison Perspective together construct the Comparison Calendar.


Calendar Control

Both the analysis period and the comparison period reflect the selections made in the Analysis Calendar and the Comparison Calendar.  The analysis period is tagged as ANALYSIS while the comparison period is tagged as COMPARISON.  If no value is selected in the CONTROL, it means both analysis and comparison.

This CONTROL is very important because it controls which calendar takes effect.


Fact Table Association

The Fact Table is associated with both the analysis and comparison periods through the actual date association.   Based on the CONTROL (ANALYSIS or COMPARISON), it reflects the analysis period and the comparison period respectively.


Two examples are illustrated below.



In the first example, Date Perspective, Unique Date and Analysis Perspective first define 2023-05 as the analysis period.  The Comparison Perspective then further defines the comparison period as 2023-04.  Both the Analysis Period and the Comparison Period are combined into the Calendar Control, indicating that 2023-05-01 to 2023-05-31 is the ANALYSIS period and 2023-04-01 to 2023-04-30 is the COMPARISON period.  The actual date range reflected in the fact table is then 2023-04-01 to 2023-05-31.

The second example is similar to the first, but this time no comparison period is defined.  Thus, the Calendar Control only has the ANALYSIS period and the Fact Table only reflects the date range from 2023-05-01 to 2023-05-31.
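To illustrate the flow (this is only a conceptual Python sketch, not the Qlik data model), the Calendar Control in the first example can be thought of as dated rows tagged ANALYSIS or COMPARISON, which the fact table is then associated with by date.  The field names here are assumptions.

from datetime import date, timedelta

def expand(start: date, end: date, control: str):
    d = start
    while d <= end:
        yield {"Date": d, "CONTROL": control}
        d += timedelta(days=1)

calendar_control = (list(expand(date(2023, 5, 1), date(2023, 5, 31), "ANALYSIS"))
                    + list(expand(date(2023, 4, 1), date(2023, 4, 30), "COMPARISON")))

# The fact table association then covers 2023-04-01 to 2023-05-31 (example 1).
# In example 2, only the ANALYSIS rows exist, covering 2023-05-01 to 2023-05-31.
print(len(calendar_control))   # 61 dated rows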


Summary

The Analysis Calendar provides methods to select values that define the ANALYSIS period while the Comparison Calendar provides methods to select values that define the COMPARISON period.  Fortunately, both calendars share the same fields: Date Perspective, Unique Date and Analysis Perspective.  This is one of the main design points of the Analytical Calendar and one reason why it can simplify the selections.  The other main concept is the Calendar Control.  It is important to have a way to distinguish the ANALYSIS and COMPARISON periods.  Incorporating these concepts altogether gives the Analytical Calendar.

I hope this brief description helps you understand the Analytical Calendar further.  Next, I will share how to apply the Analytical Calendar in other tools like Power BI and Tableau.


Thank you for reading.  I hope it helps you.  Just leave a message if you have anything to discuss or share.









Saturday, January 27, 2024

Analytical Calendar - Minor Enhancements


2024-01-27

Analytical Calendar

Minor Enhancements

Feedback drives the New Design




There are three minor enhancements to the Analytical Calendar:

1. N/A in the Comparison Calendar

2. N Year Comparison in Week

3. N Day Rolling in Month End

These three came from user feedback and shaped the new design.


Let's take a look at the issues one by one with a possible solution and discussion.


N/A in the Comparison Calendar

While using the Analytical Calendar, it strictly follows the green-white-grey design in Qlik, i.e. the associative model requires four selections for the Analysis Calendar and Comparison Calendar to function: Date Perspective, Unique Date, Analysis Perspective and Comparison Perspective.

Unfortunately, some analyses only require the Analysis Calendar rather than the Comparison Calendar, and there is no way in the Comparison Calendar to bypass the comparison.   Thus, the associated results would not be accurate due to the inclusion of the Comparison Calendar.  No matter whether one or many of the Comparison Perspective values are selected, the association is broken in some sense.

As illustrated below, if nothing is selected, all dates are included; but even if one of the values is selected, there are still redundant dates in the Comparison Calendar, so the association does not match the required dates accurately.



In order to fix this issue, "N/A" is introduced in the Comparison Perspective to make the association effective only for the Analysis Calendar.  "N/A" means no effect in the Comparison Calendar.  Thus, whatever is selected in the Analysis Calendar is directly reflected in the association.  An example is illustrated below.



In general, whenever there is no Comparison Calendar requirement in certain sheets, you can simply lock the "N/A" value in the Comparison Perspective and unlock the selection once you exit the sheet.  Although it looks redundant to select "N/A", it keeps things clear: what you select is what you see.


N Year Comparison in Week

In 2023, luckily, 1 Jan falls on a Sunday.  Thus, in a full-week calendar design, the year has a total of 53 weeks.  All other years where 1 Jan is not a Sunday have only 52 weeks.  For a year with 365 days, the start shifts to the next weekday in the following year; so in 2024, 1 Jan falls on a Monday.  However, leap years have 366 days, which creates an irregular shift pattern.  Let's take a look at the years from 1993 to 2024.  The table below shows the weekday of 1 Jan and whether the year is a leap year.



Every 28 (7 x 4) years, the whole pattern reappears.   1 Jan falling on a Sunday can happen at 5-, 6- or 11-year gaps.  The 7-day week and the total number of days in a year create a very confusing situation.

Let's then take a look at the comparison.  As illustrated below, whenever the comparison passes through a year whose 1 Jan falls on a Sunday, a comparison that would normally go 52 weeks back should shift 1 more week.


This problem is fixed with the additional week shift.  However, I have not been able to identify a simple rule for it.  My method is to review all the years included in the comparison range in order to make the shift.  If you come up with a purely formula-based method, please share it with me.
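For reference, below is a Python sketch of this review-the-inclusive-years approach.  It is only an illustration of the idea; it assumes the full-week calendar design above, where a year whose 1 Jan falls on a Sunday has 53 weeks, so the exact rule may need adjusting for your own calendar.

from datetime import date

def is_53_week_year(year: int) -> bool:
    return date(year, 1, 1).weekday() == 6          # Sunday under this design

def weeks_to_shift(analysis_date: date, n_years_back: int) -> int:
    shift = 52 * n_years_back
    target_year = analysis_date.year - n_years_back
    # add one extra week for every 53-week year crossed going backwards
    extra = sum(1 for y in range(target_year, analysis_date.year)
                if is_53_week_year(y))
    return shift + extra

print(weeks_to_shift(date(2024, 6, 1), 1))   # 53, because the span crosses 2023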


N Day Rolling in Month End

When talking about rolling, the comparison period should have the same number of days as the analysis period.   Now, let's take the example of 2024-02-29 rolling 5 days.  The date range would be from 2024-02-25 to 2024-02-29 (5 days).  If it is compared to 1 year before, what should the comparison period be?

If we simply move both the start and end dates back 1 year, the range becomes 2023-02-25 to 2023-02-28 (4 days).  That breaks the rolling-and-comparison concept, which requires the same number of days in both rolling periods.

Let's apply a similar method to how we defined the date range of 2024-02-29 rolling 5 days: it is based on the end date, looking 5 days back.  So, for the comparison, we first define the end date, i.e. 2023-02-28 (1 year before 2024-02-29; more precisely, it is 2023-02-29 by deducting 1 from the year component, but since this date does not exist, it shifts to the nearest end of the month, i.e. 2023-02-28).  With the end date at 2023-02-28, rolling 5 days gives the range 2023-02-24 to 2023-02-28 (5 days).

With the same concept, take 2024-03-31 rolling 3 days compared to 1 month before.  The analysis period is from 2024-03-29 to 2024-03-31 (3 days) while the comparison period is from 2024-02-27 to 2024-02-29 (3 days, since 2024 is a leap year and the month-end shift lands on 2024-02-29).  Does this make sense to you as well?
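For illustration, the month-end snapping rule described above could be sketched in Python as below.  The helper names are my own assumptions, not the actual implementation.

import calendar
from datetime import date, timedelta

def shift_months(d: date, months_back: int) -> date:
    # step back the month component, snapping to the month end when the day does not exist
    m = d.year * 12 + (d.month - 1) - months_back
    year, month = divmod(m, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

def rolling_comparison(end: date, rolling_days: int, months_back: int):
    cmp_end = shift_months(end, months_back)
    cmp_start = cmp_end - timedelta(days=rolling_days - 1)
    return cmp_start, cmp_end

print(rolling_comparison(date(2024, 2, 29), 5, 12))   # 2023-02-24 to 2023-02-28
print(rolling_comparison(date(2024, 3, 31), 3, 1))    # 2024-02-27 to 2024-02-29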


Conclusion

There has been much discussion in the Analytical Calendar series about how to manipulate complicated dates.  The irregular grouping of Month, Week and Year makes the operations very difficult.  The Analytical Calendar aims at reducing the effort of human interpretation of these operations and tries to standardize the method of calculation.  There is still a long way to go before all of this is standardized with mutual agreement on usage and calculations, but I hope the shared concepts and implementation methods can continue to evolve.  One day, this kind of analysis can become more straightforward, easy to learn and free of confusion.


Thank you for reading.  I hope it helps you.  Just leave a message if you have anything to discuss or share.




Saturday, December 2, 2023

SQL Table and Field Extraction


2023-12-02


SQL Table and Field Extraction

Analyzing and Parsing SQL

How to design, analyze and explore 
for 
the SQL parsing and extraction



SQL is very common in the realm of data.  Every second, a huge number of SQL queries are executed to perform operations in databases, all over the world, both externally and internally within organizations.  Many data roles and positions require SQL skills to support all kinds of data applications and analytics.  Nonetheless, the overwhelming demand leads to a common question: what data has actually been used?  A lot of organizations are looking for solutions to understand their data cycles, data lineage and data governance.  Unfortunately, there is no handy tool available; in particular, tools for parsing a SQL query and extracting the underlying tables and fields are also unavailable.

I have been asked for a solution to this question quite a few times.  My intuition is that even if there is a way to do it, it would not be perfect.  Apparently, it is time-consuming and requires plenty of tests and experiments.  In terms of overall priority in organizations, it is definitely not on the top priority list.  Thus, even where there is interest, no one is trying to work it out in this area (maybe there is; ping me and let me know).

Nonetheless, I think it is good to try this out.  One reason is that it is a good exercise to practice different skills: design, analysis, problem solving, coding, etc.  Another reason is that it could become a core component in the future of the data realm (perhaps GenAI can do it later?).  I am hoping that it can benefit people looking for the same solution.

As a result, in this article, I am going to provide the details of how this question was answered.  I will try my best to provide my step-by-step thinking and analysis for reference so that you can pick up the skills I used, or share with me something new based on what I have done.


OK.  Let's start.

* "table" in this article also refers to views.


Overview of SQL Query

Let's take an example of a SQL query that I obtained from the web.

SELECT
  DATE_FORMAT(co.order_date, '%Y-%m') AS order_month,
  DATE_FORMAT(co.order_date, '%Y-%m-%d') AS order_day,
  COUNT(DISTINCT co.order_id) AS num_orders,
  COUNT(ol.book_id) AS num_books,
  SUM(ol.price) AS total_price,
  SUM(COUNT(ol.book_id)) OVER (
  PARTITION BY DATE_FORMAT(co.order_date, '%Y-%m')
    ORDER BY DATE_FORMAT(co.order_date, '%Y-%m-%d')
  ) AS running_total_num_books
FROM cust_order co
INNER JOIN order_line ol ON co.order_id = ol.order_id
GROUP BY 
  DATE_FORMAT(co.order_date, '%Y-%m'),
  DATE_FORMAT(co.order_date, '%Y-%m-%d')
ORDER BY co.order_date ASC;

In the SQL query, there are many elements:

  1. SQL Clauses, e.g. SELECT, FROM, JOIN, GROUP BY, ORDER BY, etc.
  2. Functions, e.g. DATE_FORMAT, SUM and COUNT.
  3. Table, e.g. cust_order, order_line
  4. Field, order_date, order_id, price, etc.
  5. Alias, for table, e.g. co (cust_order), ol (order_line).  For field, e.g. running_total_num_books.
  6. Reserved Word, e.g. ASC, INNER, ON, etc.
  7. Text, enclosed by single quotes, e.g. '%Y-%m-%d'.  After the opening single quote, any character can follow until the matching quote is met.
  8. Separator, e.g. comma (,), space ( ), newline (\n), bracket, etc.  I refer to them all as symbols.
  9. Dot, used to separate the server, database, schema, table and field.  Usually, server, database and schema can be omitted.
In fact, there are a few other possible elements not shown in the example:
  1. Comments, which can be //, /* */ or -- depending on the database and its proprietary syntax.
  2. Double Quote, used to explicitly provide an object name and escape some characters, e.g. "My Column".  A space inside it is treated as a character of the field name.
  3. Asterisk (*), a special indicator to get all the fields.
  4. Operators, e.g. +, -, *, /, ||, etc.

As you can see, a lot of detail is conveyed inside the SQL query.  This is my first basic understanding of the SQL query, based on known knowledge and observations.  The initial design starts from the above; later on, as more scenarios come in, it is enhanced case by case.


Tokenize the SQL Query

After the SQL query overview, the next step is to tokenize the SQL query, i.e. to construct a way to break the SQL query into tokens.  It is important to be able to identify each component in the SQL before we can analyze it and identify the tables and fields.

However, how do we do it?  A straightforward way would be to use whitespace, i.e. space, newline, tab, etc., as a separator to chop the SQL into pieces.  Let's have a try.


Based on the example, the below result will be obtained:

1. SELECT
2. 
3. DATE_FORMAT(co.order_date,
4. '%Y-%m')
5. AS
6. order_month,
... ...


Apparently, it is not what we expected.  However, let's calm down, observe and analyze the results.

  1. Whitespace is possibly one of the separators.
  2. It looks like symbols can also be separators, e.g. bracket, comma, single quote, etc.
  3. The separator (either whitespace or symbol) comes right after the end of each token.  
Then, let's revise the separator list as below.

  1. \s (whitespace)
  2. ,
  3. +
  4. -
  5. *
  6. /
  7. \
  8. (
  9. )
  10. . (dot)
  11. "
  12. '
  13. <
  14. >
  15. ;

Then, let's re-try the tokenization (this can be achieved using regex; I will focus on the concept and idea in this article, and a regex-based sketch follows the examples below.  If there is a need for more details on the regex, leave me a comment).   So, the tokenized result will be like:

1. SELECT
2. DATE_FORMAT(
3. co
4. .
5. order_date,
6. 
7. '
8. %Y-%m'
9. )
10.
11. AS
12. order_month,
... ...


Thus, each token is now either clean, without a suffix symbol, or not clean, with a suffix symbol.  It is therefore possible to further separate the symbol onto the next row, i.e.:

1. SELECT
2. DATE_FORMAT
3. (
4. co
5. .
6. order_date
7. ,
8. 
9. '
10. %Y-%m
11. '
12. )
13.
14. AS
15. order_month
16. ,
... ...


After these steps, each token is clean, i.e. either a word or a symbol.  Next, the tokens need to be combined into SQL units.
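As mentioned above, the tokenization can be achieved with regex.  Below is a Python sketch of the idea that goes straight to the clean tokens, skipping the two intermediate passes shown above.  The pattern is an illustration only, not a complete SQL lexer.

import re

TOKEN_PATTERN = re.compile(r"""
    /\*.*?\*/            # multirow comment
  | --[^\n]*             # single row comment
  | '(?:[^']|'')*'       # single quoted string
  | "(?:[^"]|"")*"       # double quoted identifier
  | [A-Za-z_][\w$#]*     # word (clause, function, table, field, alias)
  | \d+(?:\.\d+)?        # number
  | [,()\.\*/+\-<>=;]    # single character symbols and operators
""", re.VERBOSE | re.DOTALL)

def tokenize(sql: str):
    return TOKEN_PATTERN.findall(sql)

sql = "SELECT DATE_FORMAT(co.order_date, '%Y-%m') AS order_month FROM cust_order co"
print(tokenize(sql))
# ['SELECT', 'DATE_FORMAT', '(', 'co', '.', 'order_date', ',', "'%Y-%m'",
#  ')', 'AS', 'order_month', 'FROM', 'cust_order', 'co']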


SQL Unit

SQL unit, basically, refers to a unit that is recognized as a single, human-recognizable object (I am not quite sure what to call it, so I named it SQL unit).  For example, in 'Hello World', the entire 'Hello World' is a SQL unit.  Another example: in Field as "Renamed As", "Renamed As" is a SQL unit.  Another example: mySchema.myTable.myField, where the entire string is a SQL unit.  From a human interpretation, each of these is a single unit.

The SQL unit is atomic and has a unique meaning.  Examples of SQL units are:

  1. string, each single quote pair becomes one SQL unit.
  2. multirow comments, everything inside /* */ forms one SQL unit.
  3. double quoted string, each double quote pair becomes one SQL unit, similar to single quote.
  4. dot, which concatenates the server, database, schema, table and field into one SQL unit.

The patterns in the tokens are therefore analyzed and combined into SQL units.  As a result, the below will be obtained:


1. SELECT
2. DATE_FORMAT
3. (
4. co.order_date
5. ,
6. 
7. '%Y-%m'
8. )
9.
10. AS
11. order_month
12. ,
... ...


As can be seen, co.order_date is a SQL unit for a field, '%Y-%m' is a string for date formatting, etc.
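For illustration, the SQL unit combination can be sketched in Python on top of the tokenizer above.  This only handles the dotted-name case, since quoted strings already come out as single tokens from that sketch; the function name is my own.

def to_sql_units(tokens):
    units, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        # glue  name . name . name  into a single SQL unit
        while tok[:1].isalpha() and i + 2 < len(tokens) \
                and tokens[i + 1] == "." and tokens[i + 2][:1].isalpha():
            tok = tok + "." + tokens[i + 2]
            i += 2
        units.append(tok)
        i += 1
    return units

print(to_sql_units(['co', '.', 'order_date', ',', "'%Y-%m'"]))
# ['co.order_date', ',', "'%Y-%m'"]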


Up to this point, the whitespace tokens at the beginning and the end are actually redundant because any meaningful whitespace is already included inside a SQL unit.   After trimming (or removing) them, it will look like:

1. SELECT
2. DATE_FORMAT
3. (
4. co.order_date
5. ,
6. '%Y-%m'
7. )
8. AS
9. order_month
10. ,
... ...



Next, we identify patterns.  There are a few things that can be identified.



Identifying SQL Clause

A SQL query has a syntax to follow, and the syntax is expressed by SQL clauses.  For example, the SELECT clause holds all the fields to be extracted from the table/view, and the FROM clause specifies the tables/views.  Depending on which database is used, the clauses might be slightly different, but in general the below must be identifiable.

  1. SELECT
  2. FROM
  3. JOIN (From)
  4. ON (From)
  5. WHERE
  6. GROUP BY
  7. ORDER BY 
  8. etc

Identifying the SQL clauses helps to distinguish whether an extracted SQL unit is a table or a field.  Obviously, tables can only be found in the FROM clause and the JOIN (FROM) clause.  Fields can be found in all other clauses.

* it should refer to the syntax of the SQL query allowed in the database.


Identifying Subquery

A subquery is a SELECT statement nested inside another SELECT statement.  It can appear in the SELECT clause for scalar output, or in the FROM or JOIN (FROM) clause as a temporary table result.

A subquery is easy to detect because it must be between a pair of brackets and start with SELECT, i.e. (SELECT ... ).

Identifying subqueries is important because, in terms of table and field extraction from the SQL query, the only focus is the first use of each table and field.  Nested subqueries usually re-use the extracted fields to continue calculations, so the hierarchical usage is not a concern.

Based on this, each subquery is assigned a SubqueryID so that tables and fields can be matched within the same subquery.  This concept facilitates the matching process for tables and fields; it will be discussed later in the article.


Identifying Function

A function is always expressed in the form function(parameter1, parameter2, ..., parameter n).   It must have a bracket enclosing a parameter list.  Thus, if a bracket is found and it is not a SQL unit of a subquery, it must be a function.

A function is neither a table nor a field.  Thus, it should be ignored for the extraction.


Identifying Special Patterns

These are very difficult to define in the first place.   It requires a lot of testing with SQL queries to figure them out.

The first example is AT TIME ZONE.  All three words in this special pattern are not in the reserved word list, but they are allowed in SQL to specify the timezone.  It can be tracked as a SQL unit, but in any case it is ignored as it is neither a table nor a field.

Another example is the date component appearing in some of the date functions,  e.g. date_trunc(year, DateField).

Usually, inside a function call, a parameter is either a field, a number or a string.  But specific functions like date_trunc, date_part, dateadd, extract and datediff take a parameter called the date part to indicate which date component to use, e.g. year, month, day, etc.  These are neither quoted nor reserved words.   A special pattern is required to identify them.


* Just take a note that there might be more to be figured out.



Identifying Alias
There are different patterns for alias.  They include:

1. SQLUnit as Alias

It is typical, e.g. field as fieldName or table as tableName.


2. SQLUnit Alias

It is also typical to omit the AS, e.g. field fieldName or table tableName.


3. SQLUnit "Alias" or SQLUnit as "Alias"

It is also typical to define the alias with double quotes to escape special characters, e.g. field "field Name" or table as "table Name".  AS can optionally be used.


4. With Alias

An alias inside the WITH clause has a special syntax that allows multiple queries to be executed first and given aliases.  A following query then makes use of those temporary results to complete the final SQL result.

e.g.

WITH t1 AS (
  SELECT AVG(Sales) AVG_SALES FROM Store_Sales
)
SELECT a1.* FROM Store_Sales a1, t1
WHERE a1.Sales > t1.AVG_SALES;


Instead of specifying the alias at the end, it is specified at the very beginning.  In the example, t1 is the alias for the result of SELECT AVG(Sales) AVG_SALES FROM Store_Sales.


Therefore, it is possible to know aliases of fields and aliases of tables.
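As an illustration of the idea (not the actual implementation), collecting table aliases from the SQL units might look like the Python sketch below.  The reserved-word list is deliberately shortened and the function name is my own.

RESERVED = {"SELECT", "FROM", "JOIN", "INNER", "LEFT", "RIGHT", "ON",
            "WHERE", "GROUP", "ORDER", "BY", "AS", "WITH", "AND", "OR"}

def collect_table_aliases(units):
    aliases, in_from = {}, False
    for i, u in enumerate(units):
        up = u.upper()
        if up in ("FROM", "JOIN"):
            in_from = True
        elif up in ("WHERE", "GROUP", "ORDER", "ON", "SELECT"):
            in_from = False
        elif in_from and up not in RESERVED and not u.startswith(("'", '"')):
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt and nxt.upper() == "AS" and i + 2 < len(units):
                aliases[units[i + 2]] = u          # table AS alias
            elif nxt and nxt.upper() not in RESERVED and nxt not in (",", ";"):
                aliases[nxt] = u                   # table alias (AS omitted)
    return aliases

units = ["FROM", "cust_order", "co", "INNER", "JOIN", "order_line", "ol",
         "ON", "co.order_id", "=", "ol.order_id"]
print(collect_table_aliases(units))   # {'co': 'cust_order', 'ol': 'order_line'}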



Removing Objects
The original purpose is to find out the tables and fields used in the SQL query.  Thus, everything else that is neither a table nor a field must be removed.  This includes:

  1. Reserved Words
  2. Functions
  3. Symbols (separators)
  4. Comments (single row comment // and multirow comments /* */)
  5. String
  6. Operators
  7. Numeric 
  8. Identified Patterns
  9. Aliases

Once all the above are identified and removed from the SQL query, what remains is only tables and fields.

The objects to remove were identified through regression testing.   Once a result is obtained, it is obvious what else needs to be removed.  The list above took at least 10+ runs with different queries to identify.




Once the patterns are identified, the next step is to finalize the tables and fields discovered.

Matching Table and Field

After tokenization, SQL unit combination and pattern identification, the next step is to match the tables and fields against the metadata in the database.

For tables, it is straightforward to obtain them from the FROM or JOIN (FROM) clauses.  But for fields, it is a bit tricky.

Because of the possible use of the asterisk, the only way is to cross-check against the database metadata; otherwise, it is impossible to know which fields are included.  Also, a field specified in the SQL query usually comes without a table prefix.  Unless there is a full list of metadata to cross-check, it is impossible to determine which table the field belongs to.

To match the field, there are different matching required:
  1. Match the exact name (no table prefix)
  2. Match with alias
  3. Match the Full/Partial FQDN  (database.schema.table.field)

Each FROM or JOIN (FROM) clause specifies the table while the corresponding SELECT clause specifies the fields.  Each subquery has been assigned a SubqueryID, with the outermost query being 0.   As a result, a field must come from one of the tables with the same SubqueryID.   Thus, of the above 3 matching criteria, any one of them will ensure a match.

* it is assumed the query is syntactically correct and is executable without error.  Thus, cross-checking with database metadata, the answer is unique.
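For illustration only, the matching step could be sketched as below.  The metadata dictionary, alias map and SubqueryID structure here are made-up assumptions standing in for the real database catalogue.

metadata = {
    "cust_order": {"order_id", "order_date", "customer_id"},
    "order_line": {"order_id", "book_id", "price"},
}
aliases = {"co": "cust_order", "ol": "order_line"}
tables_by_subquery = {0: ["cust_order", "order_line"]}

def resolve_field(field: str, subquery_id: int):
    candidates = tables_by_subquery[subquery_id]
    if "." in field:                      # alias.field or the tail of a FQDN
        prefix, name = field.rsplit(".", 1)
        table = aliases.get(prefix, prefix)
        if table in candidates and name in metadata.get(table, set()):
            return table, name
    else:                                 # bare field name: search the candidate tables
        for table in candidates:
            if field in metadata[table]:
                return table, field
    return None

print(resolve_field("co.order_date", 0))   # ('cust_order', 'order_date')
print(resolve_field("price", 0))           # ('order_line', 'price')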



Matching Asterisk

The match for the asterisk is similar:
  1. Match with alias
  2. Match the Full/Partial FQDN
But instead of matching a single field, it gets all the fields inside the specified table.  If there is no prefix, it means all fields from all tables in the table list.




Conclusion

The purpose of this article is not to provide a tool but to share the understanding and exploration that has been done, such that the mindset for tackling this kind of problem can be understood, or, if possible, to enable you to think further.

There is an assumption that the SQL query must be syntactically correct and runnable.  The extraction does no debugging; it is purely pattern recognition.

The method discussed still cannot promise 100% extraction, but by continuing to add test cases and identify more patterns, it will eventually get there (it also depends on which SQL standard is used, and that varies).  In our own testing, it can already tackle the majority of SQL queries.

In the end, I would like to say "practice makes perfect".   Try more and you will learn more.




Thanks for reading.  I hope you enjoy my post.


By the way, share your thoughts with me and leave me comments.













Saturday, November 18, 2023

Agile Project Management - Requirement Log, Issue Log and Test Log


2023-11-18


Agile Project Management

Requirement Log, Issue Log and Test Log

How to make use of simple Excel log files 
to 
manage data analytic project



In a fast-growing world, agile project management is very common nowadays and it has a lot of advantages over waterfall project management, in particular for data analytic projects.  One of the apparent reasons is that a data analytic project usually involves data exploration and data analysis, where it is not possible to define all requirements beforehand.  It involves, in fact, a fair amount of trial and error to understand the relationships in the data and to figure out patterns and methods that are useful for business users to find insights.  To put it simply, it is easier to understand through the phrase "seeing is believing", i.e. creating a mockup dashboard wins over a thousand words describing the details.  Thus, in an agile data analytic project, a number of sprints/cycles are in place to develop, mock up, revise and enhance until the ultimate and desired dashboard is completed.  Unless there are explicit requirements, a time-boxed approach is usually used: a rough estimate is proposed for the initial requirements, and through the development sprints, more in-depth discussion takes place to finalize the dashboard.

However, how do we manage the details and discussion?  In this article, I will share how I use simple documents to keep track of the status and details of this kind of agile data analytic project.  The documents include:

  • Requirement Log
  • Issue Log
  • Test Log

These are Excel documents for easy maintenance.  The same concepts can definitely be applied in a project management application; it depends on what is available in your organization.


Requirement Log 

The requirement log, as its name suggests, is to log the requirements.  Its main purpose is to keep track of the discussion of requirements, from the initial high-level discussion down to the detail of how to implement each chart in the dashboard.  It also includes the data mapping, indicating the source table/view and field, and descriptions of how to derive new fields.

The requirement log contains two sheets:

  • Requirement Log - the detail and status of the requirements.
  • Screen Capture - screen captures/pictures supporting the requirements.

The Requirement Log sheet contains the following columns:

  • # - the ID of the requirement.
  • Log Date - the date the item is logged.
  • Raised By - who provides the requirement.
  • Follow up by - who follows up the implementation of the requirement.
  • Last Follow Up Date - when this item was last updated.
  • Priority - the priority indicating the importance.
  • Status - New, In Progress, Pending Confirmation, Completed, etc., i.e. a status that indicates the situation.
  • Description - the details of the requirement.
  • Remark - additional information to supplement the item.
  • Screen Capture - a reference number for the screen capture, e.g. #1

To highlight the description a bit, it is in a format like below:


YYYY-MM-DD (Who logged)
Details


The description is accumulative: whenever there is an update to the requirement or any discussion, it is logged in the description.  As a result, there is a very clear trace of the requirement.  A requirement can start at a very high level, e.g. a dashboard for support cases with major KPIs.  That item can then be closed while it is broken down into other items to continue the follow-up, with the description simply referencing the # numbers to keep the trace.  It looks simple, but it also requires experience to use this requirement log wisely.
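As a purely hypothetical illustration (the dates, names and item numbers below are made up), an accumulated description might read:

2024-01-05 (Mary)
Initial request: a dashboard for support cases showing the major KPIs.

2024-01-19 (Tom)
Mockup reviewed with users; the KPI chart is to be split by case priority.  Broken down into #12 and #13; this item is closed.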



For the screen captures, it is normally very difficult to put them inline in Excel because a long description needs more space, which shifts the screen capture out of position and eventually messes up the entire document.  A wiser approach is to put the screen captures freely into another sheet with a # number, so they can be located when required.  When you try it out, you will find its beauty.



All items must be completed or closed before the project can go into the testing/UAT stage.

The requirement log also contains the tables and fields, and derived fields, as requirements.  I do not go into the details here as it is quite common to have a list of fields to be used and how they will be renamed to business names.  But it should be emphasized that more sheets can be added to support the requirements.  The aim is to have a single source of truth for requirements.


Issue Log

The issue log is very similar to the requirement log, with two sheets: the issue log and the screen capture.  It aims at tracking all the issues that arise in the project.  The issue log contains the following columns:

  • # - it is ID of the issue.
  • Log Date - it is the date that the issue is logged.
  • Raised By - who reports the issue.
  • Follow up by - who is following up the issue.
  • Last Follow Up Date - when this issue is last followed up.
  • Category - which area this issue is belonging to.
  • Status - New, In Progress, Resolved, etc, a status to indicate the progress.
  • Description - the details of the issue.  It has similar format to requirement log having date, who and details for each follow up.
  • Action - what actions have been done to fix the issue.  It has similar format to requirement log having date, who and details for each follow up.
  • Screen Capture - a reference number for the screen capture, e.g. #1

To reduce the number of documents, the requirement log and the issue log could be combined into a single document for easier management and follow-up.

All items in the issue log are required to be resolved before the project can go into the testing stage.


Test Log

The test log is also very similar to the requirement log, but the details focus on the testing issues.  It contains the following columns:

  • # - it is ID of the testing issue.
  • Log Date - it is the date that the test issue is logged.
  • Raised By - it is who finds out the issue.
  • Follow up by - it is who to follow up the fix.
  • Last Follow Up Date - when this item has been last updated.
  • Description - the details of the issue.
  • Status - New, In Progress, Pending User Check, Pending Fix, Fixed, etc, i.e. a status that can indicate the situation.
  • Screen Capture - it is providing a reference number of the screen capture, e.g. #1

All items in the test log must be completed before the migration process to production can start.


Conclusion

These three documents, the requirement log, the issue log and the test log, can already cover the majority of an agile data analytic project.  This simple framework still requires experience in how to write the details in a meaningful way.  One rule of thumb is to bear in mind the MVP, i.e. minimum viable product: in other words, there should be no redundant details.  All details should be neat and simple.  However, it depends on your understanding and how well you can ignore the "noise" and irrelevance.  I have been using this for a few years and it works fine, and a number of fellows have also picked this up and continued this kind of neat, MVP approach.  Try it out and see if you can ride on the concepts for your data journey.





I hope you like the sharing.  Thank you for reading!  Drop me a message for discussion.

Saturday, October 7, 2023

Masking Data Environment


2023-10-07


Synthesize Data Environment

Techniques Sharing

How to use Production Data to Synthesize Sample Data 
for 
Development and Testing



One of the issues in the data realm is the lack of development and testing data.  Even when testing data is claimed to be available, it is often loosely coupled and does not conform to data integrity.  Using this kind of development and testing data may not cover all the scenarios that can happen in production, and thus development and testing do not get the best results.  Even worse, it increases the effort, difficulty, time and cost.

Ideally, production data is the best data to use for development and testing, but there are many reasons, like security, performance and maintenance concerns, that hinder this possibility.  In this article, I will try to share some ideas and techniques that make it possible to use production data to generate development and testing data.  The methods discussed might not be perfect and still require more input and enhancement.  The ideas, however, are good to share and discuss.  Kindly let me know what you think and leave a message for discussion.




Techniques

There are a few techniques to be used including:

  • Data Masking
  • Data Shuffling
  • Randomizing Values

These techniques can be applied individually, but a combination of them is required to get a better result.  Production data is the best source for development and testing because it has all the actual scenarios captured in the dataset and all columns already hold the expected values.  However, because of data sensitivity, it is not open to everyone.  Now imagine that the values in the data table are all retained but just shuffled, masked or randomized: the result is close to the production data.  Although the integrity might not be 100% enforced, the majority of the synthesized data is possibly already enough for development and testing.

Some simple analysis can be performed on each table to see which method would fit best.


Data Masking

Data masking is obviously to mask the data.  To be more specific, it can be achieved in several ways:

  • hash function mask (string / integer)
  • fixed string/number mask


A hash function mask applies a hash function to the values such that a fixed string or fixed number is generated.  This operation is irreversible, i.e. the hashed value cannot be turned back into the original value.  Some examples of hash functions are SHA1, SHA2, MD5, etc.  However, it is possible to "guess" or "deduce" the original values if candidate values can be hashed to the same hash values.  In order to avoid such a situation, a fixed constant, e.g. the current date, can be added to the value before the hash generation.

A fixed string/number mask converts the value into a fixed string or number.  If the data is too sensitive, it is better just to mask it with ***MASKED DATA***, for example.
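For illustration, assuming the synthesis job is written in Python, the two masking options might look like the sketch below.  The salt value, sample values and function names are my own assumptions.

import hashlib

SALT = "2023-10-07"          # fixed constant / current date added before hashing

def hash_mask(value) -> str:
    # irreversible mask; the salt lowers the chance of guessing by re-hashing known values
    return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()

def fixed_mask(_value) -> str:
    return "***MASKED DATA***"

print(hash_mask("CUST-000123"))    # 64-character hex digest; same input, same mask
print(fixed_mask("secret note"))   # ***MASKED DATA***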



Data Shuffling

Data shuffling means that all the values in a column are shuffled.  This is good for maintaining general values such as gender (Male, Female), Product, etc.  The shuffled values are still meaningful, but they end up in other rows.

Data shuffling is also very useful when applied to a key column, as the data associations then become more random.



Randomizing Values

Randomizing values is, obviously, to randomize numbers, dates or values.

For a number, it is easy to apply a random function.

For a date, it is easy to apply a random function as well, together with a start date.  The start date plus the randomized number of days becomes the randomized date.

For values, a list of values should be ready beforehand and the random value is picked from the list.

Random values are always bounded by a range or a list of values.
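A small Python sketch of shuffling and randomizing, using only the standard library, is given below.  The column contents, ranges and value lists are made-up assumptions.

import random
from datetime import date, timedelta

genders = ["Male", "Female", "Male", "Male", "Female"]
random.shuffle(genders)                    # same values, now in other rows

amounts = [round(random.uniform(10, 500), 2) for _ in range(5)]       # bounded range

start = date(2023, 1, 1)
order_dates = [start + timedelta(days=random.randrange(0, 365)) for _ in range(5)]

products = [random.choice(["Book", "Pen", "Notebook"]) for _ in range(5)]   # from a list

print(genders, amounts, order_dates, products, sep="\n")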




More Considerations

In general, there are several kinds of data:

  • Key
  • Fixed list of data values
  • Fixed range of values
  • Free Text

Key - if it is not sensitive, shuffling would be enough.  Otherwise, apply data masking and shuffling.  For masking, it is a must to include a fixed constant or date in the hash function.

Fixed list of data values - shuffling or randomizing would be good choices.

Fixed range of values - randomizing would be good.

Free text - the above techniques are not applicable; it needs more consideration.

Synthesizing data is not about duplicating the entire production dataset and converting it into development and testing data.  Only a portion needs to be extracted and have the above techniques applied.  Synthesizing the data periodically is also good practice to lower the chance that the data can be deduced.


Conclusion

The above techniques are not perfect, but they are a starting point for considering how to automate development and testing data.   One beauty of these techniques is their simplicity.  The idea is not to reproduce the full set of production data, but a portion of it that is enough to facilitate development and testing.  There are possibly plenty of other methods to achieve the same purpose.

Currently, there is also a topic of using generative AI to generate data for the same purpose.  Perhaps a combination of these techniques and AI power will be the next phase to move on to.



I hope you like the sharing.  Thank you for reading!  Drop me a message for discussion.