2025-02-15
Data Flow Practice
Generic Pattern for Reference and Discussion
Business analytics usually involves two core parts: data transformation and data visualization. Data transformation goes by many names, such as data wrangling, data preparation, or data processing, but whatever it is called, it is the process of preparing analytic data for visualization or presentation.
In many analytics teams, a tool takes in the source data, performs a series of data operations, and saves the output to storage. The data flow inside that tool is often a headache for data analysts, because writing a maintainable flow is not easy.
This article aims to introduce some important areas of data flow design and the practices around them. The field changes rapidly, so it does not cover everything, but it offers a set of concepts you might consider following.
Data Flow Process
Every data flow has three main components:
- Input
- Preparation
- Output
Input is the process of bringing the necessary data into the tool for processing. Necessary data means the required columns and required rows. Required columns are the columns needed in the output, or those used to derive new columns. Required rows are the relevant records, for instance a rolling 4 years of data. The input process is crucial for filtering out irrelevant data, because it saves resources downstream by avoiding redundant operations.
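To make this concrete, here is a minimal sketch in Python with pandas (the file name, column names, and the 4-year window are all hypothetical) of an input step that brings in only the required columns and rows:

```python
import pandas as pd

# Input: bring in only the columns required downstream (hypothetical names)
# and only the relevant rows, e.g. a rolling 4 years of records.
orders = pd.read_csv(
    "orders.csv",
    usecols=["order_id", "order_date", "customer_id", "amount"],
    parse_dates=["order_date"],
)
cutoff = pd.Timestamp.today() - pd.DateOffset(years=4)
orders = orders[orders["order_date"] >= cutoff]
```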
Many data analysts do not pay much attention to this. It is understandable: while a data flow is being implemented, its logic keeps changing rapidly. In the end, however, the redundant input should be tidied up. It is good for system performance, and it also improves readability and maintainability.
Preparation is the process of preparing the necessary business data for the output. It generally involves tasks such as data cleansing, creating new columns for business purposes, joining or unioning tables, grouping, and so on. It can be a complicated process, depending on the analytic requirements. Most commonly, it prepares a data model for the visualization tool: enterprise data is taken in, and the business units build on top of it to generate new value based on their operational or analytic purposes. Without a proper approach, this easily turns into a messy flow with low performance and low maintainability. In fact, there are a few steps worth considering for a better flow.
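Continuing the hypothetical orders table from the input sketch above, a preparation step might cleanse the data, derive a business column, and aggregate it; the business rule shown here is purely an assumption for illustration:

```python
# Preparation (illustrative only): cleanse, derive, then aggregate.
orders["customer_id"] = orders["customer_id"].str.strip()   # trim stray spaces
orders["order_year"] = orders["order_date"].dt.year         # derived column
orders["is_large_order"] = orders["amount"] > 1000          # assumed business rule

yearly = (
    orders.groupby(["customer_id", "order_year"], as_index=False)
          .agg(total_amount=("amount", "sum"),
               large_orders=("is_large_order", "sum"))
)
```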
Output is, obviously, the process of saving the data into storage. It is also a process of tidying up the data for downstream usage; a data flow exists so that downstream processes can continue it and keep generating value from the data. The tidy-up usually involves column renaming, for example from the database naming convention to the business naming convention, to facilitate visualization. The output of one data flow is, in fact, the input of another, so both the upstream and downstream impact should be well understood for the best data flow development.
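And continuing the same sketch, the output step is usually little more than a rename from the database naming convention to business-friendly names, followed by a write to the target storage (the path, names, and file format are again assumptions):

```python
# Output: rename to business-friendly names, then persist for downstream use.
yearly = yearly.rename(columns={
    "customer_id": "Customer ID",
    "order_year": "Order Year",
    "total_amount": "Total Amount",
    "large_orders": "Large Order Count",
})
yearly.to_csv("output/customer_yearly_summary.csv", index=False)
```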
Single Table Pattern
Let's take a simple case where the data flow involves only one table.
All data operations start with a single table (semi-structured data is not in tabular format, but the concept here is the same: deal with a single set of data initially).
Dealing with a single table involves the generic steps shown in the diagram below.
Let's break it down into the three core components of a data flow and explain what each actually does.
The input process, as discussed, brings in the necessary columns and rows by specifying column names and row filters. Whatever the data, the first rule of thumb is to bring in only the relevant data.
The preparation process then manipulates the brought-in data with formulas or logic: data cleansing (trimming spaces, splitting text, changing date formats, and so on), deriving new columns for business analysis, or creating new join keys (needed when more tables are involved). Once the required processing is complete, the next step is to remove the irrelevant columns, keeping only the columns required for the output.
The output process then renames all the relevant columns for downstream usage, most often applying the business naming convention, and finally writes the data out to the target storage.
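One simple way to keep this single-table pattern visible is to keep each step as its own named stage, so the flow reads top to bottom as input, preparation, and output. The sketch below is only a structural illustration with hypothetical names; your own tool may express the same idea as separate flow nodes rather than functions.

```python
import pandas as pd

def input_step(path: str) -> pd.DataFrame:
    # Bring in only the relevant columns and rows.
    df = pd.read_csv(path, usecols=["order_id", "order_date", "amount"],
                     parse_dates=["order_date"])
    cutoff = pd.Timestamp.today() - pd.DateOffset(years=4)
    return df[df["order_date"] >= cutoff]

def prepare_step(df: pd.DataFrame) -> pd.DataFrame:
    # Cleanse, derive new columns, then keep only what the output needs.
    df["order_year"] = df["order_date"].dt.year
    return df[["order_id", "order_year", "amount"]]

def output_step(df: pd.DataFrame, path: str) -> None:
    # Rename to the business naming convention and write out.
    df.rename(columns={"order_id": "Order ID",
                       "order_year": "Order Year",
                       "amount": "Amount"}).to_csv(path, index=False)

output_step(prepare_step(input_step("orders.csv")), "orders_for_reporting.csv")
```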
The pattern looks simple, but in reality you will see many data flows that do not follow this simple practice. The reason is understandable: with rapidly changing requirements and data, the practice can be difficult to keep. Yet if it is maintained, the flow is far easier to work with than if it is not, and the benefit is obvious: the data flow becomes much more pleasant to read and digest.
Two Table Pattern
Let's then take another simple case, one involving two tables.
Dealing with two tables involves the generic steps shown in the diagram below.
Two tables are just two single tables. Each follows the same input and preparation practice individually before the data is combined into a combined table. Each single table should first complete all of its single-table operations, meaning it finishes creating any derived fields that rely only on its own data. This way, the data flow is easier to track and trace.
Let's then take a closer look at the combined table's preparation and output processes.
The preparation process is actually the same as for a single table, with one additional step: combining the data using a join or union. After combining the tables, the data operations continue, now covering the operations that require columns from both tables. Once completed, the irrelevant columns are removed.
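As a rough sketch of the two-table pattern, here are two tiny hypothetical tables that are assumed to have already finished their own single-table input and preparation steps; the join key, the derived column, and the kept columns are all made up for illustration:

```python
import pandas as pd

# Two hypothetical tables, each assumed to have finished its own
# single-table input and preparation steps already.
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": ["A", "B"],
                       "amount": [500, 1500]})
customers = pd.DataFrame({"customer_id": ["A", "B"],
                          "customer_name": ["Acme", "Beta"],
                          "credit_limit": [2000, 1000]})

# Combine, then continue with operations that need columns from both tables.
combined = orders.merge(customers, on="customer_id", how="left")
combined["amount_vs_credit"] = combined["amount"] / combined["credit_limit"]

# Remove the columns that are not required for the output.
combined = combined[["order_id", "customer_name", "amount", "amount_vs_credit"]]
```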
The output process is the same as in the single table pattern: it renames the columns to facilitate downstream operations before saving out the data.
Again, the pattern looks simple, but in reality you will see many data flows that do not follow it. You can imagine it as a step-by-step process: first deal with merely a single table, then with two.
Three Table Pattern
Three tables are very similar to two. As before, the single tables are handled first and then the combined table; the only difference is that there are now two combine steps.
The generic steps are shown in the diagram below.
It simply adds a step to combine the third table against the already-combined two tables. The general steps are the same. The concept is clear: deal with one table, then two, then three, and so on.
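Continuing the hypothetical two-table sketch above, the third table is simply combined against the already-combined result before the usual preparation and output steps carry on; again, the table and join key are made up:

```python
# A third hypothetical table, already through its own single-table steps.
regions = pd.DataFrame({"customer_name": ["Acme", "Beta"],
                        "region": ["East", "West"]})

# Second combine step: join the third table against the combined table,
# then the usual preparation and output steps continue as before.
combined = combined.merge(regions, on="customer_name", how="left")
```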
It sounds easy, so try to keep the flow clear.
Optimization vs. Readability
Optimization is not a must-do task, but it generally brings big advantages to enterprises working on data flows. Optimization often reduces human readability, because it restructures the data flow so the computer can run it faster and more efficiently. It is not hard to imagine how that saves money: nowadays, every piece of cloud resource is actual money.
So, when working on a data flow, which is more important: human readability or computer efficiency? The obvious answer is to strike a balance. Frankly, though, the flow should first be human readable and only then made to run efficiently, because it is humans who maintain the data flow. Unless there is a very strong, very technical team, let's vote for human readability. In these decades of digitalization, more business users are starting to deal with data. Even as business users' skills grow, they are still unlikely to be more technical than IT specialists. At least in my personal experience, enabling the business units to get into the data world brings far more benefit than merely having a fast-running flow. Do you agree?