Monday, July 31, 2023

QlikView Data File (QVD) - Reverse Engineering

2023-07-31

QlikView Data File (QVD)

Reverse Engineering

Revealing what is inside a QVD file

QVD is a Qlik proprietary format that is widely use in the Qlik products. And this format works very well within Qlik environment but other than using Qlik products, it is not possible to convert data into QVD format.
QVD is famous on its performance and compression. Comparing to CSV, Excel and other formats, it has 10x performance gain because the format can, in fact, directly be loaded into memory and directly be used by the Qlik products.
From technically perspective, this article will try to discuss the QVD in detail so that we can understand the amazing elements in this QVD file.

Note: I was analyzing this because I was trying to do a project that hopes to read/write QVD using Alteryx. I then put some effort for this reverse engineering and developed some prototypes but then a simpler method is used and the information and effort seem deemed. Hopefully, with this article, I think it is good for every to applause for the design and also learn how to do reverse engineering a bit. If I get enough time, I will release the prototype in github.

QVD Component

QVD is a file that contains of three major parts:
1) XML
The XML is providing the metadata information about the QVD. It describes the QVD table and the QVD field with a number of internal used elements.

2) Symbol
Symbol means the unique value of a field. Each field has a list of symbols. This is why QVD is highly compressed because each unique value in a field is only saved once. Each symbol is indexed by a unique number, i.e. 0, 1, 2, ....

3) Record Data
Record data is stored in binary format. Each record is stored with the size of a record byte size. With the record bytes, each field value in a record indicated of offset and length. This portion of binary data can then be converted into the index to get the symbol value. The entire method makes uses of of bit operations and bit masks to reduce the byte required. This is one of the main reasons why the data volume is highly compressed.

XML
The QVD XML is illustrated below:

<QvdTableHeader>
<QvBuildNo>...</QvBuildNo>
<CreatorDoc>...</CreatorDoc>
<CreateUtcTime>...</CreateUtcTime>
<SourceCreateUtcTime>...</SourceCreateUtcTime>
<SourceFileUtcTime>...</SourceFileUtcTime>
<SourceFileSize>...</SourceFileSize>
<StaleUtcTime>...</StaleUtcTime>
<Fields>
      <QvdFieldHeader>
          <FieldName>...</FieldName>
          <BitOffset>...<BitOffset>
          <BitWidth>...<BitWidth>
          <Bias>...</Bias>
          <NumberFormat>
              <Type>...</Type>
              <nDec>...</nDec>
              <UseThou>...</UseThou>
              <Fmt>...</Fmt>
              <Dec>...</Dec>
              <Thou>...</Thou>
          </NumberFormat>
          <NoOfSymbols>...</NoOfSymbols>
          <Offset>...</Offset>
          <Length>...</Length>
          <Comment>...</Comment>
          <Tags>
              <String>...</String>
              <String>...</String>
          </Tags>
      </QvdFieldHeader>
   </Fields>
<Compression>...</Compression>
<RecordByteSize>...</RecordByteSize>
<NoOfRecords>...</NoOfRecords>
<Offset>...</Offset>
<Length>...</Length>
<Lineage>
      <LineageInfo>
          <Discriminator>...</Discriminator>
          <Statement>...</Statement>
      </LineageInfo
</Lineage>
<Comment>...</Comment>
</QvdTableHeader>

Some of the core tags are explained:
QvBuildNo
The QVD version.
CreateUtcTime
The QVD file created date time.
TableName
The QVD table name.
QvdFieldHeader
The details about each field for the QVD parser
QvdFieldHeader/FieldName
The Field Name
QvdFieldHeader/BitOffset
In the record byte, which starting bit to start extract the symbol index.
QvdFieldHeader/BitWidth
In the record byte, how many bits to extract starting from bit offset in order to get the symbol index.
QvdFieldHeader/Bias
It is a special indicator for special handling.
QvdFieldHeader/NoOfSymbols
The number of symbols in the field.
RecordByteSize
The size required to store a record in this QVD dataset.
NoOfRecords
The number of records in the QVD

Symbol
In each fields, there is a list of symbols stored with the pattern of symbol type and symbol data.
Symbol Type is 1 byte to indicate the what kind of data and pattern to parse.
Symbol data is the data content stored in the file. It is unique in each field for each symbol.

For symbol type 5 and 6, it is where DUAL data type is introduced. Dual is a special data type that how Qlik stores data. It is a pair of number and text in the form of (Text, Number). All data in QVD, in fact, are in dual form. For example, integer 25, it is stored as (NULL, 25). A text "Hello" is stored as ("Hello", NULL). A date is special that it stores ("DATE-STRING", DATE_INT). A datetime is stored as ("DATETIME-STRING", DATE_NUMBER). In general, it can be any (Text, Number) pair but generally, DUAL is tackling date and datetime. Sometime, color code will also make use of dual, e.g. (RED, 1), (GREEN, 2), etc.

The known (as result of reverse engineering) symbol types are:
1. Symbol Type = 1
4-byte integer is in this type.

2. Symbol Type = 2
8-byte number is in this type. This also include decimal point numbers.

3. Symbol Type = 4
Text is in this type. And a NULL char is at the end to indicate the end of the text.

4. Symbol Type = 5
4-byte integer along with text with a NULL end. It is date with the form (Text, 4-byte integer). In fact, other than date, it is possible to store any text/integer pair.

5. Symbol Type = 6
8-byte number along with text with a NULL end. It is datetime with the form (Text, 8-byte number) In fact, other than date, it is possible to store any text/number pair.

The order of how these symbols are read indicates the corresponding symbol index. For example in a field, "c" is the first read for the field, it will have a index =0, the "a" is the second symbol read, it will have index = 1. It does not require a proper sorting.
Moreover, it is also required special attention on how it manipulates NULL. Using an example will be easier to understand. With a table with three fields Num, Text, Dummy with the following data:

Num,Text,Dummy
2, A,
1, B,
1, A,
2, B,
,,

Num has 3 symbols including values of 2, 1 and NULL.
Text has 3 symbols including values of A, B and NULL
Dummy has 1 symbol including values of i.e. NULL

The symbol stored of Num will be
[Symbol Type =5][1 and "1"] => index =0
[Symbol Type =5][2, "2"] => index = 1
[Symbol Type =5][NULL, NULL] => index = 2

It requires 8 bytes + 2 bytes (utf-8, 2 bytes for a char) for the symbol data. To indicate 3 symbol indexes, it requires 2 bits.
As a result in the QVDFieldHeader, BitOffset is 0 and BitWidth is 2 and Bias is 0.

* it will treat as 5 as it does not know the data type well. It happens when the data is coming from a CSV without data type specification. If it is coming from DB, it has a mapping between the DB type and the type to be used in QVD.

The symbol stored of Text will be
[Symbol Type =4][A] => index = 0
[Symbol Type =4][B] => index = 1
[Symbol Type =4][NULL] => index = 2
It requires 2 bytes (utf-8, 2 bytes for a char) for the symbol data. To indicate 3 symbol indexes, it requires 2 bits.
As a result in the QVDFieldHeader, BitOffset is 2 and BitWidth is 6 and Bias is 0. Since it is the last column with data, it will take up all bits to form a full byte, i.e. 6-bits even the smallest and required is just 2 bits. And funny is that the last column means the last column with data. If it is all null, it would not treat as the last column.

The symbol stored of Dummy will be
[Symbol Type =4][NULL] => index = 0
It requires nothing for data storage. But the symbol type 4 is still required to store 1 bytes, i.e. NULL byte.
As a result in the QVDFieldHeader, BitOffset is 0 and BitWidth is 0 and Bias is 0. Offset, width and bias are zeros indicate no bytes are required for record.

Record Data
In QVD, each record is not storing the exact field values. Indeed, it stores the symbol indexes of the all fields.
Taking the same example used in the symbol illustration, the table below:

Num,Text,Dummy
2, A,
1, B,
1, A,
2, B,
,,

First record: require Num[index=0] and Text[index=0] and dummy=nothing, the record represents as [0000], 2 bits for Num, 2 bits for Text.
Second record: require Num[index=1] and Text[index=1] and dummy=nothing, 3 bits represents as [0101]
Third record: require Num[index=1] and Text[index=0] and dummy=nothing, 2 bits represents as [0001]
Forth record: require Num[index=0] and Text[index=1] and dummy=nothing, 2 bits represents as [0100]
Fifth record: require Num[index=2] and Text[index=2] and dummy=nothing, 2 bits represents as [1010]
The first field will be stored in the rightmost bits while the last field will be stored in the left most bits.
Thus, the 5 four records will be stored as [0000], [0101] [0001] [0100], [1010]. The just enough bytes are required to hold these binary data, i.e. 8 bits, 1 byte. Thus, they become [0000 0000], [0000 0101], [0000 0100], [0000 0001], [00001010] => 0, 5, 1, 4, 10. These 5 integers are used to represent total of 15 values.
To complete the QvdTableHeader description, the RecordByteSize is 1. And NoOfRecords are 5 because there are 5 rows.

* Bias, still, needs more investigation on the exact usage.

Reverse Engineering - What have been done?
In order to try further understanding the format, a simple way to is generate a simple QVD file and look into the details. For example a single column with 2 rows with integer only. Keep iterating with different data and review the details, it is easy to spot the changes. It might require a notepad editor that can show the invisible byte like NULL, EOT, etc. A good notepad editor is notepad++.
Obviously, there might be still more handling in QVD but it is already showcase the beauty of it and why it can process that fast and compress that much.
Personally, it terms of file operation, I seriously hope that Qlik can further expand this usage because nowadays the data usage is huge and file is everywhere. A proper and manageable file is important. In particular to support the cloud computing, an enhanced of QVD might do the trick as well. If it can break through the area and become open-source to use, it will greatly beneficial to everyone dealing with data.

No comments:

Post a Comment

Subscribe to: Post Comments (Atom)