
Saturday, October 7, 2023



Synthesize Data Environment

Techniques Sharing

How to Use Production Data to Synthesize Sample Data for Development and Testing



One of the common issues in the data realm is the lack of good development and testing data.  Even when testing data is claimed to be available, it is often loosely coupled and does not conform to data integrity constraints.  Using this kind of development and testing data may not cover all scenarios that can happen in production, so development and testing do not produce the best results.  Even worse, it increases the effort, difficulty, time and cost.

Ideally, production data is the best data to use for development and testing, but many concerns, such as security, performance and maintenance, hinder this possibility.  In this article, I will share some ideas and techniques that make it possible to use production data to generate development and testing data.  The method discussed is not perfect and still requires more input and enhancement.  The idea, however, is worth sharing and discussing.  Kindly let me know what you think and leave a message for discussion.




Techniques

There are a few techniques that can be used, including:

  • Data Masking
  • Data Shuffling
  • Randomizing Values

These techniques can be applied individually, but a combination of them gives a better result.  Production data is the best source for development and testing because it captures all actual scenarios and every column already holds the expected values.  However, because of data sensitivity, it is not open to everyone.  Now imagine that all the values in the data tables are kept, only shuffled, masked or randomized: the result is close to the production data.  Although integrity might not be 100% enforced, the majority of the synthesized data is likely good enough for development and testing.

Some simple analysis can be performed on each table to see which methodology fits best, as sketched below.
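
As a minimal sketch of what such an analysis could look like (the rules, thresholds and sample columns below are my own assumptions, not a fixed methodology), the following Python snippet profiles each column of a pandas DataFrame and suggests a technique:

    import pandas as pd

    def suggest_technique(df, key_columns):
        """Profile each column and suggest which technique fits best."""
        suggestions = {}
        for col in df.columns:
            if col in key_columns:
                suggestions[col] = "shuffling (plus masking if sensitive)"
            elif pd.api.types.is_numeric_dtype(df[col]):
                suggestions[col] = "randomizing within a range"
            elif df[col].nunique() <= 20:  # assumed cutoff for a fixed value list
                suggestions[col] = "shuffling or randomizing from the value list"
            else:
                suggestions[col] = "free text - needs more consideration"
        return suggestions

    # Hypothetical customer table for illustration
    df = pd.DataFrame({"customer_id": [101, 102, 103],
                       "gender": ["M", "F", "M"],
                       "balance": [100.0, 250.5, 80.2]})
    print(suggest_technique(df, key_columns={"customer_id"}))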


Data Masking

Data masking is, obviously, about masking the data.  To be more specific, it can be achieved in several ways:

  • hash function mask (string / integer)
  • fixed string/number mask


A hash function mask applies a hash function to the values so that a fixed string or number is generated.  This operation is irreversible, i.e. the hashed value cannot be converted back to the original value.  Some examples of hash functions are SHA-1, SHA-2, MD5, etc.  However, it is still possible to "guess" or "deduce" the original value by hashing candidate values and checking whether they produce the same hash.  To avoid this situation, a fixed constant or the current date can be added to the value before the hash is generated.

A fixed string/number mask converts the value into a fixed string or number.  If the value is too sensitive, it is better just to mask it with, for example, ***MASKED DATA***.
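
As a minimal sketch of the two masks in Python (the helper names and the salt value are my own; any hash function from the list above would work the same way, here SHA-256 from the SHA-2 family):

    import hashlib

    SALT = "2023-10-07"  # a fixed constant or the current date, kept secret

    def hash_mask(value):
        """Irreversible mask: salted SHA-256 digest of the value."""
        return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()

    def fixed_mask(_value):
        """Replace any value with a fixed placeholder."""
        return "***MASKED DATA***"

    print(hash_mask("john.doe@example.com"))   # a 64-character hex string
    print(fixed_mask("john.doe@example.com"))  # '***MASKED DATA***'

Without the salt, anyone could hash a guessed value and compare digests; adding it makes that kind of dictionary attack much harder.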



Data Shuffling

Data shuffling means that all the values in a column are shuffled across rows.  It is good for maintaining general values such as gender (Male, Female), product, etc.  The shuffled values are still meaningful, but they end up in other rows.

Data shuffling is also very useful when applied to key columns, since the data associations then become more random.
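
A minimal sketch of column shuffling with pandas (the table and column names are hypothetical; the seed is only for a reproducible demo):

    import pandas as pd

    def shuffle_column(df, column, seed=None):
        """Return a copy of df with one column's values permuted across rows."""
        out = df.copy()
        # sample(frac=1) permutes the rows; to_numpy() drops the index so the
        # assignment is positional rather than index-aligned
        out[column] = out[column].sample(frac=1, random_state=seed).to_numpy()
        return out

    df = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                       "gender": ["M", "F", "F", "M"]})
    print(shuffle_column(df, "gender", seed=42))

Shuffling the key column the same way keeps every key value valid while breaking the real association between rows.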



Randomizing Values

Randomizing values is, obviously, about randomizing numbers, dates or values.

For a number, it is easy to apply a random function within a range.

For a date, a random function can be applied as well, together with a start date: the start date plus the randomized number of days becomes the randomized date.

For values, a list of values should be prepared beforehand, and the random value is drawn from that list.

Random values are always bounded by a range or a list of values, as sketched below.
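
A minimal sketch of the three randomizers in Python (the bounds, start date and value list below are illustrative assumptions):

    import random
    from datetime import date, timedelta

    def random_number(low, high):
        """Random number bounded by a range."""
        return random.uniform(low, high)

    def random_date(start, max_days):
        """Start date plus a random number of days."""
        return start + timedelta(days=random.randint(0, max_days))

    def random_value(values):
        """Random value drawn from a prepared list."""
        return random.choice(values)

    print(random_number(0, 1000))
    print(random_date(date(2020, 1, 1), 365))
    print(random_value(["Gold", "Silver", "Bronze"]))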




More Considerations

In general, there are several kinds of data:

  • Key
  • Fixed list of data values
  • Fixed range of values
  • Free Text

Key - if it is not sensitive, shuffling alone is enough.  Otherwise, apply both data masking and shuffling.  For masking, it is a must to add a fixed constant or date (a salt) into the hash function input.

Fixed list of data values - shuffling or randomizing would be good choices.

Fixed range of values - randomizing would be good.

Free text - the above techniques are not applicable; free text needs more consideration.
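
Putting the mapping above together, here is a minimal dispatch sketch in Python (the kind labels and helper signature are my own assumptions; free text simply falls back to a fixed mask for now):

    import hashlib
    import random

    SALT = "2023-10-07"  # fixed constant or current date for the hash mask

    def synthesize_value(kind, value, value_list=None, value_range=None):
        """Apply the technique mapped to each kind of data."""
        if kind == "key":
            # sensitive key: salted, irreversible hash (shuffle the column afterwards)
            return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()
        if kind == "fixed_list":
            return random.choice(value_list)  # shuffling the column is the alternative
        if kind == "fixed_range":
            low, high = value_range
            return random.uniform(low, high)
        return "***MASKED DATA***"  # free text: mask it until a better approach exists

    print(synthesize_value("key", 12345))
    print(synthesize_value("fixed_list", None, value_list=["Male", "Female"]))
    print(synthesize_value("fixed_range", None, value_range=(0, 100)))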

Synthesizing data does not mean duplicating the entire production dataset and converting it into development and testing data.  Only a portion needs to be extracted and have the above techniques applied.  Synthesizing the data periodically is also good practice, as it lowers the chance that the original data can be deduced.


Conclusion

The above techniques are not perfect, but they are a starting point for considering how to automate the generation of development and testing data.  One beauty of these techniques is their simplicity.  The idea is not to reproduce the full set of production data, but a portion of it that is enough to facilitate development and testing.  There are plenty of other methods that could achieve the same purpose.

Currently, there is also a topic of using generative AI to generate data for the same purpose.  Perhaps a combination of these techniques and AI power will be the next phase to move on to.



I hope you found this sharing useful.  Thank you for reading!  Drop me a message for discussion.