For symbol type 5 and 6, it is where DUAL data type is introduced. Dual is a special data type that how Qlik stores data. It is a pair of number and text in the form of (Text, Number). All data in QVD, in fact, are in dual form. For example, integer 25, it is stored as (NULL, 25). A text "Hello" is stored as ("Hello", NULL). A date is special that it stores ("DATE-STRING", DATE_INT). A datetime is stored as ("DATETIME-STRING", DATE_NUMBER). In general, it can be any (Text, Number) pair but generally, DUAL is tackling date and datetime. Sometime, color code will also make use of dual, e.g. (RED, 1), (GREEN, 2), etc.
The known (as result of reverse engineering) symbol types are:
1. Symbol Type = 1
4-byte integer is in this type.
2. Symbol Type = 2
8-byte number is in this type. This also include decimal point numbers.
3. Symbol Type = 4
Text is in this type. And a NULL char is at the end to indicate the end of the text.
4. Symbol Type = 5
4-byte integer along with text with a NULL end. It is date with the form (Text, 4-byte integer). In fact, other than date, it is possible to store any text/integer pair.
5. Symbol Type = 6
8-byte number along with text with a NULL end. It is datetime with the form (Text, 8-byte number) In fact, other than date, it is possible to store any text/number pair.
The order of how these symbols are read indicates the corresponding symbol index. For example in a field, "c" is the first read for the field, it will have a index =0, the "a" is the second symbol read, it will have index = 1. It does not require a proper sorting.
Moreover, it is also required special attention on how it manipulates NULL. Using an example will be easier to understand. With a table with three fields Num, Text, Dummy with the following data:
Num,Text,Dummy
2, A,
1, B,
1, A,
2, B,
,,
Num has 3 symbols including values of 2, 1 and NULL.
Text has 3 symbols including values of A, B and NULL
Dummy has 1 symbol including values of i.e. NULL
The symbol stored of Num will be
[Symbol Type =5][1 and "1"] => index =0
[Symbol Type =5][2, "2"] => index = 1
[Symbol Type =5][NULL, NULL] => index = 2
It requires 8 bytes + 2 bytes (utf-8, 2 bytes for a char) for the symbol data. To indicate 3 symbol indexes, it requires 2 bits.
As a result in the QVDFieldHeader, BitOffset is 0 and BitWidth is 2 and Bias is 0.
* it will treat as 5 as it does not know the data type well. It happens when the data is coming from a CSV without data type specification. If it is coming from DB, it has a mapping between the DB type and the type to be used in QVD.
The symbol stored of Text will be
[Symbol Type =4][A] => index = 0
[Symbol Type =4][B] => index = 1
[Symbol Type =4][NULL] => index = 2
It requires 2 bytes (utf-8, 2 bytes for a char) for the symbol data. To indicate 3 symbol indexes, it requires 2 bits.
As a result in the QVDFieldHeader, BitOffset is 2 and BitWidth is 6 and Bias is 0. Since it is the last column with data, it will take up all bits to form a full byte, i.e. 6-bits even the smallest and required is just 2 bits. And funny is that the last column means the last column with data. If it is all null, it would not treat as the last column.
The symbol stored of Dummy will be
[Symbol Type =4][NULL] => index = 0
It requires nothing for data storage. But the symbol type 4 is still required to store 1 bytes, i.e. NULL byte.
As a result in the QVDFieldHeader, BitOffset is 0 and BitWidth is 0 and Bias is 0. Offset, width and bias are zeros indicate no bytes are required for record.
Record Data
In QVD, each record is not storing the exact field values. Indeed, it stores the symbol indexes of the all fields.
Taking the same example used in the symbol illustration, the table below:
Num,Text,Dummy
2, A,
1, B,
1, A,
2, B,
,,
First record: require Num[index=0] and Text[index=0] and dummy=nothing, the record represents as [0000], 2 bits for Num, 2 bits for Text.
Second record: require Num[index=1] and Text[index=1] and dummy=nothing, 3 bits represents as [0101]
Third record: require Num[index=1] and Text[index=0] and dummy=nothing, 2 bits represents as [0001]
Forth record: require Num[index=0] and Text[index=1] and dummy=nothing, 2 bits represents as [0100]
Fifth record: require Num[index=2] and Text[index=2] and dummy=nothing, 2 bits represents as [1010]
The first field will be stored in the rightmost bits while the last field will be stored in the left most bits.
Thus, the 5 four records will be stored as [0000], [0101] [0001] [0100], [1010]. The just enough bytes are required to hold these binary data, i.e. 8 bits, 1 byte. Thus, they become [0000 0000], [0000 0101], [0000 0100], [0000 0001], [00001010] => 0, 5, 1, 4, 10. These 5 integers are used to represent total of 15 values.
To complete the QvdTableHeader description, the RecordByteSize is 1. And NoOfRecords are 5 because there are 5 rows.
* Bias, still, needs more investigation on the exact usage.
Reverse Engineering - What have been done?
In order to try further understanding the format, a simple way to is generate a simple QVD file and look into the details. For example a single column with 2 rows with integer only. Keep iterating with different data and review the details, it is easy to spot the changes. It might require a notepad editor that can show the invisible byte like NULL, EOT, etc. A good notepad editor is notepad++.
Obviously, there might be still more handling in QVD but it is already showcase the beauty of it and why it can process that fast and compress that much.
Personally, it terms of file operation, I seriously hope that Qlik can further expand this usage because nowadays the data usage is huge and file is everywhere. A proper and manageable file is important. In particular to support the cloud computing, an enhanced of QVD might do the trick as well. If it can break through the area and become open-source to use, it will greatly beneficial to everyone dealing with data.