Friday, September 8, 2023

Analytical Calendar - How to Use


2023-09-08

Analytical Calendar



The Analytical Calendar is extremely useful in interactive dashboards.  It contains all the core components for manipulating dates and calendars, but it is re-designed in a simplified way that benefits both analysts and developers.

On the front-end, because there are just an analysis perspective and a comparison perspective, the presentation can be more consistent.  Instead of providing different charts for different date granularities, a single chart can present different periods at different date granularities.  More details will be shared later in this article.

On the back-end, because everything is pre-calculated in the data model, no complicated expressions need to be developed.  This greatly saves effort for the developer.  More details will be shared later in this article.

In short, this article demonstrates the Analytical Calendar and its benefits.  It also covers how it helps analysts select a date period with just a few selections, and how it helps developers reduce the time and effort of design and development.

Let's start.

* this analytical calendar is an enhanced version of the one from my previous posts.

Two examples are shared on GitHub:

A date-level example with complete weeks and with custom dates:

https://github.com/kongson-cheung/analytical-calendar/blob/main/Qlik%20Sense%20Analytical%20Calendar%20Examples/Analytical%20Calendar%205.0_Complete_Week_With_Custom.qvf

A month-level example:

https://github.com/kongson-cheung/analytical-calendar/blob/main/Qlik%20Sense%20Analytical%20Calendar%20Examples/Analytical%20Calendar%205.0_Month_Level.qvf


Analytical Calendar

The Analytical Calendar is a composite of associated tables that provides a time-intelligent experience in both usage and development.  It is a generic design that can be applied to different kinds of interactive dashboard.  The following sections cover the core components so you can understand how it is used.  There are four main components that require user selection:

  • Date Perspective
  • Unique Date
  • Analysis Perspective
  • Comparison Perspective

By selecting only these four elements, time-intelligent date range selection can be performed.


Date Perspective

It is uncommon for an analysis to be performed at multiple granularity levels at once.  Even where it is possible, it takes a lot of assumptions to make it happen.  Defining a proper date granularity makes the overall analysis consistent and easier.  It is named Date Perspective to make it easier for general users to understand.



Date Perspective consists of only five values: Year, Quarter, Month, Week and Date.  It ensures the proper date granularity for the overall analysis and, more technically, it constrains the corresponding selections to maintain integrity.  For example,


if Year is selected,
Unique Date: 2023, 2022, 2021, etc.
Analysis Perspective: Actual, Rolling 2 Years, etc.
Comparison Perspective: 1 Year before, 2 Years before, etc.


if Quarter is selected, 
Unique Date: 2023-Q3, 2023-Q2, etc.
Analysis Perspective: Actual, YTQ, Rolling 2 Quarters, etc.
Comparison Perspective: 1 Year before, 1 Quarter before, etc.


if Month is selected, 
Unique Date: 2023-Sep, 2023-Aug, etc.
Analysis Perspective: Actual, YTM, Rolling 2 Months, etc.
Comparison Perspective: 1 Year before, 1 Quarter before, 1 Month before, etc.


if Week is selected, 
Unique Date: 2023-W41, 2023-W40, etc.
Analysis Perspective: Actual, YTW, Rolling 2 Weeks, etc.
Comparison Perspective: 1 Year before, 1 Week before, etc.


if Date is selected, 
Unique Date: 2023-09-09, 2023-09-08, etc.
Analysis Perspective: Actual, YTD, MTD, Rolling 2 Days, etc.
Comparison Perspective: 1 Year before, 1 Quarter before, 1 Month before, 1 Week before, 1 Day before, etc.
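The integrity rules illustrated above boil down to a lookup from the selected Date Perspective to the options that may be offered.  Here is a Python sketch with abbreviated option lists; the values are taken from the examples above, not from the actual implementation:

```python
# Valid options per Date Perspective (abbreviated; values from the examples above)
PERSPECTIVES = {
    "Year":  {"analysis": ["Actual", "Rolling 2 Years"],
              "comparison": ["1 Year before", "2 Years before"]},
    "Month": {"analysis": ["Actual", "YTM", "Rolling 2 Months"],
              "comparison": ["1 Year before", "1 Quarter before", "1 Month before"]},
    "Date":  {"analysis": ["Actual", "YTD", "MTD", "Rolling 2 Days"],
              "comparison": ["1 Year before", "1 Month before", "1 Day before"]},
}

def valid_options(perspective: str, kind: str):
    """Only options matching the chosen granularity are offered for selection."""
    return PERSPECTIVES[perspective][kind]

print(valid_options("Month", "comparison"))
# -> ['1 Year before', '1 Quarter before', '1 Month before']
```

Because the lookup is driven entirely by the chosen perspective, an invalid combination (e.g. YTD while the perspective is Month) can never be offered in the first place.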


In any analysis, it is important to first define the date "perspective" for the analysis.  Compared to the traditional design, this avoids losing focus when manipulating the date components.



Unique Date

Unique Date further clarifies the overall analysis by selecting the exact date representation at the specific date granularity.

In the traditional design, the user must first select Year, then Quarter or Month, then Day.  For developers, it is a disaster to determine the date granularity based on the selection: a lot of rule-based logic must be defined in the expressions to manipulate the selection.  Because of this, the chance increases that multiple charts are built just for the sake of presenting different date levels.

For general users, it generates confusion, especially when changing the selection or the view for the analysis.  They first need to remove the selection and then apply the new one.  However, Year, Quarter, Month, Week and Day conflict with each other, e.g. if you select a month, you should not also use a week, or vice versa.  If the selection is overridden, there is no mechanism to restore a proper selection.  Everything relies on human intelligence to rectify the date component selection.

Although this is certainly not a problem for advanced users, the analytical calendar design still hopes to alleviate the issue by making use of the unique date.



Unique Date is a union of all the date representations across the granularities Year, Quarter, Month, Week and Date.

For example,

Year: 2023, 2022, 2021, etc.
Quarter: 2023-Q3, 2023-Q2, etc.
Month: 2023-Sep, 2023-Aug, etc.
Week: 2023-W41, 2023-W40, etc.
Date: 2023-09-09, 2023-09-08, etc.


In this design, the user just selects the exact date representation directly.  No human interpretation or computation is required for a correct and accurate selection.  In short, what you select is what you get.
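The union of representations above can be sketched in Python.  Note that the week numbering here is ISO 8601, which is an assumption on my part; the calendar in the example apps may number weeks differently:

```python
from datetime import date

MONTH_ABBR = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()

def unique_date_labels(d: date):
    """One label per granularity for the same day.

    The union of these labels across all days forms the Unique Date field.
    """
    quarter = (d.month - 1) // 3 + 1
    iso_year, iso_week, _ = d.isocalendar()   # ISO 8601 weeks (assumption)
    return {
        "Year":    f"{d.year}",
        "Quarter": f"{d.year}-Q{quarter}",
        "Month":   f"{d.year}-{MONTH_ABBR[d.month - 1]}",
        "Week":    f"{iso_year}-W{iso_week:02d}",
        "Date":    d.isoformat(),
    }

print(unique_date_labels(date(2023, 9, 8)))
```

Selecting any one of these labels fixes both the granularity and the exact period, which is what removes the ambiguity of the traditional Year/Quarter/Month/Day filters.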



Analysis Perspective

Analysis, in general, always deals with a period.  A period means a start date and an end date at the same granularity level.  Among the many kinds of representation available, the combination of a date plus a range type appears to be the most commonly used, for example 2023-09-09 YTD or 2023-06 Rolling 6 Months, i.e. a unique date plus a date range type.  To be precise, there is also a date range type called Actual, which represents the unique date period itself.

The Analysis Perspective also strictly follows date integrity.  For example, if Unique Date=2023, it does not allow you to select YTD, YTM, Rolling 5 Months, etc.  Only combinations that fall within the same date granularity are offered for selection.  Many traditional designs do not do this very well.


Comparison Perspective

Comparison is another important element in analysis.  Instead of just looking at the current performance, it is also very common to compare against historical performance to see whether it is good or bad.

The most common comparisons are last year, last month, or a specific date.  To make this more generic, it becomes last X years, last X months or a custom date.  A custom date means any unique date; it can be a Year, Quarter, Month, Week or Date.


In the implementation, this is represented as X Years before, X Quarters before, X Months before, X Weeks before and X Days before.  For a custom date, it is a list of date representations similar to the Unique Date.  Similarly, date integrity is applied.  For example, for 2023 Actual, the selection of X Months before or X Weeks before is not allowed because they are not at a valid granularity level.
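Each "X periods before" comparison is a simple shift at the chosen granularity.  For example, shifting a month back by X months can be computed as follows (a sketch, not the calendar's actual implementation):

```python
def months_before(year: int, month: int, x: int):
    """Shift a (year, month) pair back by x months, e.g. for '2 Months before'."""
    idx = year * 12 + (month - 1) - x   # months since year 0, minus the shift
    return idx // 12, idx % 12 + 1

print(months_before(2023, 9, 2))   # -> (2023, 7)
print(months_before(2023, 1, 3))   # -> (2022, 10)
```

Pre-calculating these shifted periods in the data model is exactly what spares the front-end expressions from doing this arithmetic on the fly.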



Front-end Design Improvement


With the capability of the Analytical Calendar, the front-end can now be neater and cleaner, because there are only two perspectives for periods, i.e. the analysis period and the comparison period.

The analysis period is the period for analysis; it is similar to the traditional design of selecting year, quarter, month, week and day.  The comparison period is the period for comparison; in the traditional design it is usually baked into the chart, and each chart represents one kind of comparison.

By making use of the Analytical Calendar, every chart can now provide either or both the analysis and comparison periods.  There is no need to replicate the same chart so that each copy shows an individual Year, Month or Week for the KPI and the comparison.  All charts provide their information on the fly based on the user selection.

In other words, it provides a coherent view in terms of usage and experience.  To further help analysts understand what is being analyzed, a textbox in the top left corner shows both the selection details and the actual date range selected.  In this way, it is crystal clear what analysis and what comparison are being performed, from the selections through to all the charts shown on the sheet.


Back-end Design Improvement

To use the Analytical Calendar, only two variables are required, named vSetAnalysisAnalysisCalendar and vSetAnalysisComparisonCalendar.  The former filters the Analysis Perspective period while the latter filters the Comparison Perspective period.

The variable vSetAnalysisAnalysisCalendar is very simple, as illustrated below.  It just sets %CONTROL_PERSPECTIVE equal to ANALYSIS and makes sure there is no Comparison Perspective selection.

Similarly, the variable vSetAnalysisComparisonCalendar is also very simple, as illustrated below.  It just ignores the selection where %CONTROL_PERSPECTIVE equals ANALYSIS.


With these two variables, all the expressions in chart development are straightforward.  It just needs applying the set analysis with the two variables.

When requiring Analysis Perspective, the below set analysis can be used.


When requiring Comparison Perspective, the below set analysis can be used.
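The original app screenshots are not reproduced here, so as a hedged sketch, the two variables and their use in measures might look something like the following; the exact field names and definitions are assumptions based on the description above, not the actual implementation:

```
// Sketch only – the actual definitions in the example apps may differ.
SET vSetAnalysisAnalysisCalendar   = '%CONTROL_PERSPECTIVE = {''ANALYSIS''}, [Comparison Perspective] = ';
SET vSetAnalysisComparisonCalendar = '%CONTROL_PERSPECTIVE = ';  // ignore the ANALYSIS control selection

// Analysis Perspective measure
Sum({< $(vSetAnalysisAnalysisCalendar) >} Sales)

// Comparison Perspective measure
Sum({< $(vSetAnalysisComparisonCalendar) >} Sales)
```

The point is that the period logic lives in the data model and the variables; the measures themselves stay trivially simple.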

As the front-end is now consistent and coherent, the same expressions can be applied across all the charts.  Unless a specific chart requires something special, these two kinds of expressions satisfy the majority of use cases.  The Analytical Calendar can certainly also be applied to more advanced scenarios, which are not covered here.


Other Advantages

Indeed, the Analytical Calendar has other advantages too.  When designing a data model, most of the time the choice is between pre-building the data and calculating on the fly.  The rule of thumb is always to pre-build if possible, to minimize the computation required during use.

When manipulating dates, there are also a few scenarios that may not be generic for every usage.  For example, if today is 2023-09-09, should the selection of 2023-Sep be allowed?  In my experience, this depends on the preference of the end users, but it becomes a question for developers of how to develop a proper expression.  Another example is the cross-year week.  There are different ways to handle an incomplete week, such as making it a full week counted as the previous year's week, or letting the week remain incomplete.  This is a headache not just for developers but also for the analysts using the interactive dashboards.

Thus, the design of the Analytical Calendar also incorporates this element.  It allows all of these to be pre-calculated in the data model, so no very complicated on-the-fly expressions are needed to calculate the results, while still maintaining the Analysis Perspective and Comparison Perspective.  Below are some scenarios for sharing.







This concludes how to use the Analytical Calendar.  Please leave me a message if further discussion is needed.


Thank you for reading.  I hope it helps you.  Just leave a message if you have anything to discuss or share.

Saturday, August 26, 2023

How to Develop Alteryx Custom Tool


2023-08-26


How to Develop Alteryx Custom Tool

with the help of Platform SDK

AYX UI SDK, AYX Python SDK v2, AYX Plugin CLI



The Platform SDK allows you to develop Alteryx custom tools and plugins.  Alteryx provides a lot of amazing tools, but not everything comes out of the box.  As in my previous posts on the QVD Input Tool and QVD Output Tool, Alteryx does not natively have the capability to manipulate QVD files.  In the past, there were the Engine and HTML GUI SDKs.  As mentioned in the Alteryx Platform SDK help, that outdated technology is limited for extension; there is a need for newer technology that opens up opportunities for improvement.

For details of the Platform SDK, you may refer to the help.  This article focuses on the basics to let you understand how to develop a simple Alteryx custom tool.




Environment Setup

For basic requirements, please refer to the AYX Python SDK v2 help page.

Before you can start the development, there are a few steps required.

1. Install Miniconda3 (download from the site and install).


2. Open Anaconda Prompt (click on the Windows button and search for it)


3. Create a virtual environment with the command
    conda create -n [ENV_NAME] python=3.8.5

e.g. conda create -n test_env python=3.8.5

4. Activate the environment.
    conda activate [ENV_NAME]

e.g. conda activate test_env

5. Install the AYX Python SDK.
    pip install ayx-python-sdk


6. Install the AYX Plugin CLI
    pip install ayx-plugin-cli


7. Create (or initialize) the AYX Plugin Workspace
    ayx_plugin_cli sdk-workspace-init

    A few questions to answer to help initialize the configuration files.

    Package Name: [The name of the package]
    Tool Category [Python SDK Examples]: [The category of the tools]
    Description []: Description of the package.
    Author []: Who is the author of the package.
    Company []: The company name.
    Backend Language (python): python

8. Install NodeJS.  It is required to develop the UI for the tool.
    conda install -c anaconda nodejs=16.13.1




9. Create the AYX Plugin
    ayx_plugin_cli create-ayx-plugin --use-ui
    * --use-ui must be specified explicitly to make sure the UI part is created by the plugin CLI.




    A few questions to answer to help initialize the configuration files.

Tool Name: [The name of the tool]
Tool Type (input, multiple-inputs, multiple-outputs, optional, output, single-input-single-output, multi-connection-input-anchor) [single-input-single-output]: [the type of the tool]
Description []: [description of the tool]
Tool Version [1.0]: [version of the tool]
DCM Namespace []:  [DCM Namespace to be used]

Once it is done, you will have a folder structure created as below.

Backend Development (Python)

After the initialization of the environment, workspace and tool, there is a backend file created under the path \backend\ayx_plugins.

Open this with an editor or any IDE to start the tool development.  There are four important functions:

def __init__(self, provider: AMPProviderV2) -> None:
This initializes the tool, usually the variables to be used in the backend code.

def on_incoming_connection_complete(self, anchor: namedtuple) -> None:
This is called when an incoming anchor completes.  An input tool does not need to handle this.

def on_record_batch(self, batch: "Table", anchor: namedtuple) -> None:
This handles the records provided in batches.  An input tool does not need to manipulate this.

def on_complete(self) -> None:
Once the batches have been processed, this frees up resources and finalizes the completion.  For an input tool, all the logic happens here.

For each function, the comment is already in the backend plugin file. 
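To make the lifecycle concrete, here is a minimal structural sketch of a backend class with these four functions.  The real plugin receives an AMPProviderV2 from ayx_python_sdk; a stand-in provider is used here purely to illustrate the call order, so treat the class body as a sketch rather than a working tool:

```python
class PassThroughTool:
    """Sketch of the four lifecycle hooks in an SDK v2 backend plugin."""

    def __init__(self, provider) -> None:
        # Initialize state; the real tool receives an AMPProviderV2 here.
        self.provider = provider
        self.batches = []

    def on_incoming_connection_complete(self, anchor) -> None:
        # Called once an incoming anchor has finished sending its data.
        self.provider.log(f"connection complete: {anchor}")

    def on_record_batch(self, batch, anchor) -> None:
        # Each batch of records arrives here; a pass-through tool would
        # write the batch to its output anchor.
        self.batches.append(batch)

    def on_complete(self) -> None:
        # Free resources and finalize; an input tool does all its work here.
        self.provider.log(f"processed {len(self.batches)} batch(es)")


class DummyProvider:
    """Stand-in for AMPProviderV2, just collecting log messages."""
    def __init__(self):
        self.messages = []

    def log(self, msg):
        self.messages.append(msg)


provider = DummyProvider()
tool = PassThroughTool(provider)
tool.on_incoming_connection_complete("Input")
tool.on_record_batch([1, 2, 3], "Input")
tool.on_complete()
print(provider.messages)
```

The generated plugin file follows this same shape; you mostly fill in the bodies of these four methods.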

Frontend Development (Reactjs)

For front-end development, locate the path \ui\TestInputTool\src.  There is a file called index.tsx; this is the file to be updated.

The content is ReactJS.

More UI help can be found in the Alteryx UI help.


Create the YXI and Deploy to Alteryx Designer

Once the frontend, backend and configuration (icon, description, version, etc.) are completed, the tool can be packaged into a YXI file to share, and also installed into Alteryx Designer, simply by running the command:
ayx_plugin_cli designer-install --use-ui
* --use-ui must be specified explicitly if UI component is involved.

It will prompt a question.
Install Type (user, admin) [user]: user installs into the user location for a single user only, while admin installs into the admin location so that all users on the machine can use the tool.

user location: C:\Users\[user]\AppData\Roaming\Alteryx\Tools
admin location: C:\ProgramData\Alteryx\Tools


Test in Designer

Open Alteryx Designer; the tool palette now contains a new category named "Test Tools" from which the Test Input Tool can be selected.  Once it is executed, there is a sample result.



This concludes how to develop an Alteryx custom tool.  For details about the Python code, ReactJS, etc., leave me a message if further discussion is needed.


Thank you for reading.  I hope it helps you.  Just leave a message if you have anything to discuss or share.



Thursday, August 17, 2023

Alteryx QVD Output Tool - Prototype


2023-08-17


Alteryx QVD Output Tool

Alteryx Custom Tool




In the previous post, the Alteryx QVD Input Tool prototype was shared; following it, this post introduces the Alteryx QVD Output Tool.

It reads the data from Alteryx and converts it into QVD, i.e. it converts each column into a list of symbols and symbol indexes, and then compacts each record, via the symbol indexes, into the fewest bits required to store the record data.  The XML information is kept in memory during the process.  Once everything is ready, it flushes out the XML, the symbols and the records.
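The core idea of that conversion, unique symbols per column, an index per record value, and the fewest bits needed to address the symbols, can be sketched as follows (a toy illustration, not the tool's actual code):

```python
def pack_column(values):
    """Map a column to (symbols, bit_width, indexes) the way QVD does."""
    symbols = []
    indexes = []
    for v in values:
        if v not in symbols:
            symbols.append(v)          # each unique value is stored only once
        indexes.append(symbols.index(v))
    # fewest bits needed to address all symbols
    bit_width = max(1, (len(symbols) - 1).bit_length())
    return symbols, bit_width, indexes

print(pack_column(["A", "B", "A", "B"]))  # -> (['A', 'B'], 1, [0, 1, 0, 1])
```

Every column is reduced this way, and the per-record indexes of all columns are then bit-packed side by side into the record bytes.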


Alteryx QVD Output Tool

The Alteryx QVD Output Tool is very simple.  It just takes an Alteryx input stream and writes all the content into a QVD file.  The input UI is as below.



* the new Alteryx SDK now uses ReactJS, where it is not possible to get past browser security to obtain a full file path.  Thus, there is no button to pop up a dialog asking for a file location; instead, there is only a text field for inputting the path.  If you have any clue how to get around this, it is welcome.  The prototype merely hopes to show the possibility of integrating with QVD files.


The prototype

If you would like to try it, you can download it from my GitHub:  https://github.com/kongson-cheung/Alteryx-QVD-Tools/blob/main/yxi/QVD%20Tools_v1.1.yxi

I have shared the core files used to create this Alteryx QVD Output Tool.  Since the SDK includes a large number of files, I did not upload them all.  If you need any help, feel free to drop me a message.


* Note: this is still a very early prototype.  It still requires a number of improvements for intensive use.


Next

I will try to summarize how to develop a custom Alteryx tool.


Thank you for reading.  I hope it helps you.  Your sharing is appreciated if you have anything to discuss or share.



Monday, August 7, 2023

Alteryx QVD Input Tool - Prototype


2023-08-07


Alteryx QVD Input Tool

Alteryx Custom Tool


With the findings in the previous post (QlikView Data File (QVD) - Reverse Engineering), I have developed a prototype of the Alteryx QVD Input Tool.  Alteryx is an extremely good tool for data wrangling and contains a bundle of tools that allow simple data transformation and predictive analysis.  However, it cannot deal with QVD; it only allows reading and writing QVX.

Alteryx, in fact, is a good upstream data stage for QlikView and Qlik Sense.  It can help businesses make clear use of their data, with drag-and-drop capability to explore, transform and try new business logic.  QVD integration would be beneficial for Qlik at the lower end of the data cycle.  This may sound like an advertisement, but it is real project experience that leads to this conclusion.  QVX does not work well under heavy usage; its performance is similar to CSV.  QVD is still the best for Qlik.


Alteryx QVD Input Tool

The Alteryx QVD Input Tool is very simple.  It just takes a QVD file, reads all the content and converts it into an Alteryx output stream.  The input UI is as below.


* the new Alteryx SDK now uses ReactJS, where it is not possible to get past browser security to obtain a full file path.  Thus, there is no button to pop up a dialog asking for a file location; instead, there is only a text field for inputting the path.  If you have any clue how to get around this, it is welcome.  The prototype merely hopes to show the possibility of integrating with QVD files.


An Example

Taking a QVD file as an example.



The QVD file contains 2 columns named Num and Text, with a total of 4 records.  In Alteryx, the result runs as below.

If you would like to try it, you can download it from my GitHub:  https://github.com/kongson-cheung/Alteryx-QVD-Tools/blob/main/yxi/QVD%20Tools_v1.0.yxi

I have shared the core files used to create this Alteryx QVD Input Tool.  Since the SDK includes a large number of files, I did not upload them all.  If you need any help, feel free to drop me a message.


* Note: this is still a very early prototype.  It still requires a number of improvements for intensive use.


Next

The Alteryx QVD Input Tool is built with the Platform SDK, UI SDK and Python SDK.  This is a first prototype for reading QVD.  I am exploring a prototype of an Alteryx QVD Output Tool that writes QVD as output.


Thank you for reading.  I hope it helps you.  Your sharing is appreciated if you have anything to discuss or share.




Monday, July 31, 2023

QlikView Data File (QVD) - Reverse Engineering


2023-07-31


QlikView Data File (QVD)

Reverse Engineering

Revealing what is inside a QVD file



QVD is a Qlik proprietary format that is widely used in Qlik products.  The format works very well within the Qlik environment, but outside of Qlik products it is not possible to convert data into the QVD format.

QVD is famous for its performance and compression.  Compared to CSV, Excel and other formats, it can have a 10x performance gain, because the format can in fact be loaded directly into memory and used directly by the Qlik products.

From a technical perspective, this article will discuss QVD in detail so that we can understand the amazing elements of this file format.


Note: I analyzed this because I was working on a project that hoped to read/write QVD using Alteryx.  I put some effort into this reverse engineering and developed some prototypes, but then a simpler method was used and the information and effort seemed wasted.  Hopefully, with this article, everyone can applaud the design and also learn a bit about how to do reverse engineering.  If I get enough time, I will release the prototype on GitHub.



QVD is a file that consists of three major parts:

1) XML

The XML provides the metadata information about the QVD.  It describes the QVD table and the QVD fields, with a number of internally used elements.


2) Symbol

A symbol is a unique value of a field.  Each field has a list of symbols.  This is why QVD is highly compressed: each unique value in a field is only saved once.  Each symbol is indexed by a unique number, i.e. 0, 1, 2, ...


3) Record Data

Record data is stored in binary format.  Each record is stored with the record byte size.  Within the record bytes, each field value is indicated by a bit offset and length.  This portion of binary data can then be converted into the index used to look up the symbol value.  The entire method makes use of bit operations and bit masks to reduce the bytes required.  This is one of the main reasons why the data volume is so highly compressed.


XML

The QVD XML is illustrated below:


<QvdTableHeader>
    <QvBuildNo>...</QvBuildNo>
    <CreatorDoc>...</CreatorDoc>
    <CreateUtcTime>...</CreateUtcTime>
    <SourceCreateUtcTime>...</SourceCreateUtcTime>
    <SourceFileUtcTime>...</SourceFileUtcTime>
    <SourceFileSize>...</SourceFileSize>
    <StaleUtcTime>...</StaleUtcTime>
    <TableName>...</TableName>
    <Fields>
        <QvdFieldHeader>
            <FieldName>...</FieldName>
            <BitOffset>...</BitOffset>
            <BitWidth>...</BitWidth>
            <Bias>...</Bias>
            <NumberFormat>
                <Type>...</Type>
                <nDec>...</nDec>
                <UseThou>...</UseThou>
                <Fmt>...</Fmt>
                <Dec>...</Dec>
                <Thou>...</Thou>
            </NumberFormat>
            <NoOfSymbols>...</NoOfSymbols>
            <Offset>...</Offset>
            <Length>...</Length>
            <Comment>...</Comment>
            <Tags>
                <String>...</String>
                <String>...</String>
            </Tags>
        </QvdFieldHeader>
    </Fields>
    <Compression>...</Compression>
    <RecordByteSize>...</RecordByteSize>
    <NoOfRecords>...</NoOfRecords>
    <Offset>...</Offset>
    <Length>...</Length>
    <Lineage>
        <LineageInfo>
            <Discriminator>...</Discriminator>
            <Statement>...</Statement>
        </LineageInfo>
    </Lineage>
    <Comment>...</Comment>
</QvdTableHeader>


Some of the core tags are explained:

QvBuildNo

The QVD version.

CreateUtcTime

The QVD file created date time.

TableName

The QVD table name.

QvdFieldHeader

The details about each field for the QVD parser.

QvdFieldHeader/FieldName

The Field Name

QvdFieldHeader/BitOffset

Within the record bytes, the starting bit from which to extract the symbol index.

QvdFieldHeader/BitWidth

Within the record bytes, how many bits to extract, starting from the bit offset, in order to get the symbol index.

QvdFieldHeader/Bias

It is a special indicator for special handling.

QvdFieldHeader/NoOfSymbols

The number of symbols in the field.

RecordByteSize

The number of bytes required to store one record in this QVD dataset.

NoOfRecords

The number of records in the QVD.


Symbol

In each field, there is a list of symbols stored in a pattern of symbol type followed by symbol data.

The symbol type is 1 byte indicating what kind of data follows and how to parse it.

The symbol data is the data content stored in the file.  It is unique within each field for each symbol.



Symbol types 5 and 6 are where the DUAL data type is introduced.  Dual is a special data type describing how Qlik stores data: a pair of number and text in the form (Text, Number).  All data in QVD are, in fact, in dual form.  For example, the integer 25 is stored as (NULL, 25).  The text "Hello" is stored as ("Hello", NULL).  A date is special: it is stored as ("DATE-STRING", DATE_INT), and a datetime as ("DATETIME-STRING", DATE_NUMBER).  In general, it can be any (Text, Number) pair, but typically DUAL handles dates and datetimes.  Sometimes color codes also make use of dual, e.g. (RED, 1), (GREEN, 2), etc.


The known (as result of reverse engineering) symbol types are:

1. Symbol Type = 1

4-byte integer is in this type.


2. Symbol Type = 2

8-byte number is in this type.  This also includes decimal numbers.


3. Symbol Type = 4

Text is in this type.  A NULL char at the end indicates the end of the text.


4. Symbol Type = 5

A 4-byte integer along with NULL-terminated text.  It represents a date in the form (Text, 4-byte integer).  In fact, other than dates, it can store any text/integer pair.


5. Symbol Type = 6

An 8-byte number along with NULL-terminated text.  It represents a datetime in the form (Text, 8-byte number).  In fact, other than datetimes, it can store any text/number pair.


The order in which these symbols are read determines the corresponding symbol index.  For example, if "c" is the first symbol read for a field, it gets index 0; if "a" is the second symbol read, it gets index 1.  No particular sorting is required.
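The symbol-area layout described above can be sketched as a small parser.  The byte layout assumed here (type byte, then a little-endian number where applicable, then NULL-terminated text) is my reading from this reverse engineering, so treat it as an assumption rather than a specification:

```python
import struct

def parse_symbols(buf: bytes):
    """Parse a field's symbol area into a list of (text, number) duals."""
    symbols, pos = [], 0
    while pos < len(buf):
        stype = buf[pos]; pos += 1
        if stype == 1:                       # 4-byte integer
            (num,) = struct.unpack_from("<i", buf, pos); pos += 4
            symbols.append((None, num))
        elif stype == 2:                     # 8-byte number
            (num,) = struct.unpack_from("<d", buf, pos); pos += 8
            symbols.append((None, num))
        elif stype == 4:                     # NULL-terminated text
            end = buf.index(0, pos)
            symbols.append((buf[pos:end].decode("utf-8"), None)); pos = end + 1
        elif stype == 5:                     # dual: 4-byte int + text
            (num,) = struct.unpack_from("<i", buf, pos); pos += 4
            end = buf.index(0, pos)
            symbols.append((buf[pos:end].decode("utf-8"), num)); pos = end + 1
        elif stype == 6:                     # dual: 8-byte number + text
            (num,) = struct.unpack_from("<d", buf, pos); pos += 8
            end = buf.index(0, pos)
            symbols.append((buf[pos:end].decode("utf-8"), num)); pos = end + 1
        else:
            raise ValueError(f"unknown symbol type {stype}")
    return symbols

# Example buffer: a type-4 text "A", then a type-1 integer 25
buf = bytes([4]) + b"A\x00" + bytes([1]) + struct.pack("<i", 25)
print(parse_symbols(buf))   # -> [('A', None), (None, 25)]
```

The position of each parsed symbol in the returned list is exactly its symbol index.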

Moreover, special attention is required for how NULL is handled.  An example makes this easier to understand.  Take a table with three fields Num, Text and Dummy, with the following data:


Num,Text,Dummy

2, A,

1, B,

1, A,

2, B,

,,


Num has 3 symbols: the values 2, 1 and NULL.

Text has 3 symbols: the values A, B and NULL.

Dummy has 1 symbol: the value NULL.


The symbols stored for Num will be

[Symbol Type = 5][1, "1"] => index = 0

[Symbol Type =5][2, "2"] => index = 1

[Symbol Type =5][NULL, NULL] => index = 2


It requires 8 bytes + 2 bytes (utf-8, 2 bytes for a char) for the symbol data.  To indicate 3 symbol indexes, it requires 2 bits.

As a result in the QVDFieldHeader, BitOffset is 0 and BitWidth is 2 and Bias is 0.


* it is treated as type 5 because the data type is not well known.  This happens when the data comes from a CSV without a data type specification.  If it comes from a DB, there is a mapping between the DB type and the type to be used in the QVD.


The symbols stored for Text will be

[Symbol Type =4][A] => index = 0

[Symbol Type =4][B] => index = 1

[Symbol Type =4][NULL] => index = 2

It requires 2 bytes (utf-8, 2 bytes for a char) for the symbol data.  To indicate 3 symbol indexes, it requires 2 bits.

As a result, in the QvdFieldHeader, BitOffset is 2, BitWidth is 6 and Bias is 0.  Since it is the last column with data, it takes up all the remaining bits to form a full byte, i.e. 6 bits, even though the smallest required width is just 2 bits.  And funnily, "the last column" means the last column with data; if a column is all NULL, it is not treated as the last column.


The symbol stored for Dummy will be

[Symbol Type =4][NULL] => index = 0

It requires nothing for record storage, but symbol type 4 still requires storing 1 byte, i.e. the NULL byte.

As a result, in the QvdFieldHeader, BitOffset is 0, BitWidth is 0 and Bias is 0.  Offset, width and bias all being zero indicates that no bits are required in the record.


Record Data

In QVD, each record does not store the exact field values.  Instead, it stores the symbol indexes of all the fields.

Taking the same example used in the symbol illustration, the table below:


Num,Text,Dummy

2, A,

1, B,

1, A,

2, B,

,,


First record: Num[index=0], Text[index=0], Dummy nothing; represented as [0000] (2 bits for Num, 2 bits for Text).

Second record: Num[index=1], Text[index=1], Dummy nothing; represented as [0101].

Third record: Num[index=1], Text[index=0], Dummy nothing; represented as [0001].

Fourth record: Num[index=0], Text[index=1], Dummy nothing; represented as [0100].

Fifth record: Num[index=2], Text[index=2], Dummy nothing; represented as [1010].

The first field is stored in the rightmost bits while the last field is stored in the leftmost bits.

Thus, the 5 records are stored as [0000], [0101], [0001], [0100], [1010].  Just enough bytes are used to hold this binary data, i.e. 8 bits, 1 byte.  Thus, they become [0000 0000], [0000 0101], [0000 0001], [0000 0100], [0000 1010] => 0, 5, 1, 4, 10.  These 5 integers represent a total of 15 values.

To complete the QvdTableHeader description, the RecordByteSize is 1 and NoOfRecords is 5, because there are 5 rows.
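Decoding a packed record back into symbol indexes is a matter of shifting and masking.  A small Python sketch using this example's layout (Num at BitOffset 0, BitWidth 2; Text starting at bit 2, of which only 2 bits carry information even though the header declares a larger BitWidth):

```python
def extract_index(record_byte: int, bit_offset: int, bit_width: int) -> int:
    """Pull a field's symbol index out of a packed record byte."""
    return (record_byte >> bit_offset) & ((1 << bit_width) - 1)

records = [0, 5, 1, 4, 10]                          # the packed bytes from the example
num  = [extract_index(b, 0, 2) for b in records]    # Num: BitOffset 0, BitWidth 2
text = [extract_index(b, 2, 2) for b in records]    # Text: BitOffset 2, 2 useful bits
print(num)    # -> [0, 1, 1, 0, 2]
print(text)   # -> [0, 1, 0, 1, 2]
```

The extracted indexes match the per-record indexes listed above; looking each one up in the field's symbol list reconstructs the original row values.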


* The exact usage of Bias still needs more investigation.


Reverse Engineering - What have been done?

To further understand the format, a simple way is to generate a simple QVD file and look into the details, for example a single column with 2 rows of integers only.  Keep iterating with different data and reviewing the details; it becomes easy to spot the changes.  It may require a text editor that can show invisible bytes like NULL, EOT, etc.  A good editor for this is Notepad++.

Obviously, there may still be more handling in QVD, but this already showcases the beauty of it and why it can process that fast and compress that much.

Personally, in terms of file operations, I seriously hope that Qlik can further expand this usage, because nowadays data usage is huge and files are everywhere.  A proper and manageable file format is important, in particular to support cloud computing, where an enhanced QVD might do the trick as well.  If it can break through this area and become open for everyone to use, it will be greatly beneficial to everyone dealing with data.