Showing posts with label QVD. Show all posts

Thursday, August 17, 2023

Alteryx QVD Output Tool - Prototype

 Post Index

2023-08-17


Alteryx QVD Output Tool

Alteryx Custom Tool




The previous post shared the Alteryx QVD Input Tool prototype. This post introduces the Alteryx QVD Output Tool.

It reads the data from Alteryx and converts it into QVD: each column is converted into a list of symbols and symbol indexes, and each record is then compacted by packing its symbol indexes into the fewest bits required to store the record data.  The XML information is kept in memory during the process.  Once everything is ready, the tool flushes out the XML, the symbols and the records.
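As a rough illustration of the first step described above, the sketch below turns one column into a symbol list and a list of symbol indexes. The function name `build_symbols` is hypothetical and is not taken from the prototype's source; it only shows the idea of storing each unique value once.

```python
def build_symbols(values):
    """Return (symbols, indexes) for one column.

    symbols: the unique values in first-seen order.
    indexes: for every row, the position of its value in symbols.
    """
    symbols = []
    position = {}  # value -> symbol index
    indexes = []
    for value in values:
        if value not in position:
            position[value] = len(symbols)
            symbols.append(value)
        indexes.append(position[value])
    return symbols, indexes


symbols, indexes = build_symbols(["c", "a", "c", "b"])
print(symbols)  # ['c', 'a', 'b']
print(indexes)  # [0, 1, 0, 2]
```

Only the indexes go into the record data, which is why repeated values cost very little space.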


Alteryx QVD Output Tool

The Alteryx QVD Input Tool is very simple.  It takes in a QVD file, reads all the content and converts it into an Alteryx output stream.  The input UI is shown below.



* The new Alteryx SDK uses React, and its security restrictions make it impossible to obtain the full file path.  Thus, there is no button that opens a file dialog.  Instead, there is only a text field for entering the path.  If you have any idea how to get around this, it is welcome.  The prototype aims to show that integration with QVD files is possible.


The prototype

If you would like to try it, you can download it from my GitHub:  https://github.com/kongson-cheung/Alteryx-QVD-Tools/blob/main/yxi/QVD%20Tools_v1.1.yxi

I have shared the core files used to create this Alteryx QVD Output Tool.  Since the SDK includes a large number of files, I did not upload them all.  If you need any help, feel free to drop me a message.


* Note: this is still a very early version of the prototype.  It needs a number of improvements before intensive use.


Next

I will try to summarize how to develop a custom Alteryx tool.


Thank you for reading.  I hope it helps you.  I would appreciate it if you shared any thoughts or discussion you have.



Monday, August 7, 2023

Alteryx QVD Input Tool - Prototype

 Post Index

2023-08-07


Alteryx QVD Input Tool

Alteryx Custom Tool


Building on the findings in the previous post (QlikView Data File (QVD) - Reverse Engineering), I have developed a prototype of the Alteryx QVD Input Tool.  Alteryx is an extremely good tool for data wrangling and contains a bundle of tools for simple data transformation and predictive analysis.  However, it cannot work with QVD.  It can only read and write QVX.

Alteryx is, in fact, a good upstream data tool for QlikView and Qlik Sense.  Its drag-and-drop capability helps the business explore and transform data and try out new business logic.  QVD integration would be beneficial for Qlik further downstream in the data cycle.  This may sound like an advertisement, but it is a conclusion drawn from real project experience.  QVX does not work well under heavy usage, and its performance is similar to CSV.  QVD is still the best format for Qlik.


Alteryx QVD Input Tool

The Alteryx QVD Input Tool is very simple.  It takes in a QVD file, reads all the content and converts it into an Alteryx output stream.  The input UI is shown below.


* The new Alteryx SDK uses React, and its security restrictions make it impossible to obtain the full file path.  Thus, there is no button that opens a file dialog.  Instead, there is only a text field for entering the path.  If you have any idea how to get around this, it is welcome.  The prototype aims to show that integration with QVD files is possible.


An Example

Take a QVD file as an example.



The QVD file contains 2 columns named Num and Text and has 4 records in total.  In Alteryx, the result is as below.

If you would like to try it, you can download it from my GitHub:  https://github.com/kongson-cheung/Alteryx-QVD-Tools/blob/main/yxi/QVD%20Tools_v1.0.yxi

I have shared the core files used to create this Alteryx QVD Input Tool.  Since the SDK includes a large number of files, I did not upload them all.  If you need any help, feel free to drop me a message.


* Note: this is still a very early version of the prototype.  It needs a number of improvements before intensive use.


Next

The Alteryx QVD Input Tool is built with the Platform SDK, UI SDK and Python SDK.  This is a first prototype for reading QVD.  I am exploring a prototype of an Alteryx QVD Output Tool that writes QVD as output.


Thank you for reading.  I hope it helps you.  I would appreciate it if you shared any thoughts or discussion you have.




Monday, July 31, 2023

QlikView Data File (QVD) - Reverse Engineering

Post Index 

2023-07-31


QlikView Data File (QVD)

Reverse Engineering

Revealing what is inside a QVD file



QVD is a Qlik proprietary format that is widely used in Qlik products.  The format works very well within the Qlik environment, but without Qlik products it is not possible to convert data into QVD format.

QVD is famous for its performance and compression.  Compared with CSV, Excel and other formats, it has about a 10x performance gain because the format can, in fact, be loaded directly into memory and used directly by Qlik products.

From a technical perspective, this article discusses QVD in detail so that we can understand the amazing elements inside a QVD file.


Note: I analyzed this because I was working on a project that hoped to read/write QVD using Alteryx.  I put some effort into this reverse engineering and developed some prototypes, but a simpler method was used in the end, so the information and effort seemed wasted.  Hopefully, this article lets everyone appreciate the design and also learn a bit about how to do reverse engineering.  If I get enough time, I will release the prototype on GitHub.



A QVD file consists of three major parts:

1) XML

The XML provides the metadata about the QVD.  It describes the QVD table and the QVD fields with a number of internally used elements.


2) Symbol

A symbol is a unique value of a field.  Each field has a list of symbols.  This is why QVD is highly compressed: each unique value in a field is saved only once.  Each symbol is indexed by a unique number, i.e. 0, 1, 2, ....


3) Record Data

Record data is stored in binary format.  Each record occupies a fixed number of bytes, the record byte size.  Within the record bytes, each field value is located by a bit offset and a bit length.  This portion of the binary data is then converted into the index used to get the symbol value.  The entire method makes use of bit operations and bit masks to reduce the bytes required.  This is one of the main reasons why the data volume is highly compressed.


XML

The QVD XML is illustrated below:


<QvdTableHeader>

    <QvBuildNo>...</QvBuildNo>

    <CreatorDoc>...</CreatorDoc>

    <CreateUtcTime>...</CreateUtcTime>

    <SourceCreateUtcTime>...</SourceCreateUtcTime>

    <SourceFileUtcTime>...</SourceFileUtcTime>

    <SourceFileSize>...</SourceFileSize>

    <StaleUtcTime>...</StaleUtcTime>

    <TableName>...</TableName>

    <Fields>

        <QvdFieldHeader>

            <FieldName>...</FieldName>

            <BitOffset>...</BitOffset>

            <BitWidth>...</BitWidth>

            <Bias>...</Bias>

            <NumberFormat>

                <Type>...</Type>

                <nDec>...</nDec>

                <UseThou>...</UseThou>

                <Fmt>...</Fmt>

                <Dec>...</Dec>

                <Thou>...</Thou>

            </NumberFormat>

            <NoOfSymbols>...</NoOfSymbols>

            <Offset>...</Offset>

            <Length>...</Length>

            <Comment>...</Comment>

            <Tags>

                <String>...</String>

                <String>...</String>

            </Tags>

        </QvdFieldHeader>

    </Fields>

    <Compression>...</Compression>

    <RecordByteSize>...</RecordByteSize>

    <NoOfRecords>...</NoOfRecords>

    <Offset>...</Offset>

    <Length>...</Length>

    <Lineage>

        <LineageInfo>

            <Discriminator>...</Discriminator>

            <Statement>...</Statement>

        </LineageInfo>

    </Lineage>

    <Comment>...</Comment>

</QvdTableHeader>


Some of the core tags are explained below:

QvBuildNo

The QVD version.

CreateUtcTime

The date and time the QVD file was created.

TableName

The QVD table name.

QvdFieldHeader

The details of each field, used by the QVD parser.

QvdFieldHeader/FieldName

The field name.

QvdFieldHeader/BitOffset

The bit position within the record bytes at which extraction of the symbol index starts.

QvdFieldHeader/BitWidth

The number of bits to extract from the record bytes, starting at the bit offset, to get the symbol index.

QvdFieldHeader/Bias

It is a special indicator for special handling.

QvdFieldHeader/NoOfSymbols

The number of symbols in the field.

RecordByteSize

The number of bytes required to store one record in this QVD dataset.

NoOfRecords

The number of records in the QVD.
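To make the tags above more concrete, here is a minimal sketch of reading them with Python's standard `xml.etree.ElementTree`. It assumes the XML header ends at the closing `</QvdTableHeader>` tag and that everything after it is binary data. The `sample` bytes are made up for illustration and are not a complete QVD file, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

END_TAG = b"</QvdTableHeader>"


def split_header(raw):
    """Split raw file bytes into (parsed XML header, remaining bytes)."""
    end = raw.index(END_TAG) + len(END_TAG)
    return ET.fromstring(raw[:end]), raw[end:]


def field_layout(header):
    """Collect the per-field values needed to decode the record data."""
    fields = []
    for field in header.iter("QvdFieldHeader"):
        fields.append({
            "name": field.findtext("FieldName"),
            "bit_offset": int(field.findtext("BitOffset")),
            "bit_width": int(field.findtext("BitWidth")),
            "bias": int(field.findtext("Bias")),
            "no_of_symbols": int(field.findtext("NoOfSymbols")),
        })
    return fields


# Made-up, heavily shortened header followed by five record bytes.
sample = (
    b"<QvdTableHeader><TableName>Demo</TableName><Fields>"
    b"<QvdFieldHeader><FieldName>Num</FieldName>"
    b"<BitOffset>0</BitOffset><BitWidth>2</BitWidth><Bias>0</Bias>"
    b"<NoOfSymbols>3</NoOfSymbols></QvdFieldHeader>"
    b"</Fields><RecordByteSize>1</RecordByteSize>"
    b"<NoOfRecords>5</NoOfRecords></QvdTableHeader>"
    + bytes([0, 5, 1, 4, 10])
)

header, body = split_header(sample)
print(header.findtext("TableName"))         # Demo
print(int(header.findtext("NoOfRecords")))  # 5
print(field_layout(header)[0]["bit_width"])  # 2
```

A real file may contain extra separator bytes between the XML and the binary part, so the returned `body` should be treated as a starting point only.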


Symbol

In each field, there is a list of symbols, each stored as a symbol type followed by symbol data.

The symbol type is 1 byte that indicates what kind of data follows and how to parse it.

The symbol data is the data content stored in the file.  It is unique for each symbol in a field.



Symbol types 5 and 6 are where the DUAL data type comes in.  Dual is the special data type Qlik uses to store data.  It is a pair of a number and a text in the form (Text, Number).  All data in QVD is, in fact, in dual form.  For example, the integer 25 is stored as (NULL, 25).  The text "Hello" is stored as ("Hello", NULL).  A date is special in that it is stored as ("DATE-STRING", DATE_INT).  A datetime is stored as ("DATETIME-STRING", DATE_NUMBER).  In general, a dual can be any (Text, Number) pair, but it is mostly used for dates and datetimes.  Sometimes color codes also make use of dual, e.g. (RED, 1), (GREEN, 2), etc.


The known (as result of reverse engineering) symbol types are:

1. Symbol Type = 1

4-byte integer is in this type.


2. Symbol Type = 2

8-byte number is in this type.  This also includes numbers with decimal points.


3. Symbol Type = 4

Text is in this type.  A NULL char at the end indicates the end of the text.


4. Symbol Type = 5

4-byte integer along with text with a NULL end.  It is a date in the form (Text, 4-byte integer).  In fact, other than dates, it can store any text/integer pair.


5. Symbol Type = 6

8-byte number along with text with a NULL end.  It is a datetime in the form (Text, 8-byte number).  In fact, other than datetimes, it can store any text/number pair.


The order in which the symbols are read determines the symbol index.  For example, if "c" is the first symbol read for a field, it has index = 0, and if "a" is the second symbol read, it has index = 1.  No sorting is required.

Special attention is also required for how NULL is handled.  An example makes this easier to understand.  Take a table with three fields, Num, Text and Dummy, with the following data:
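The symbol types listed above could be parsed roughly as follows. This is only a sketch under assumptions that the article does not confirm: numbers are little-endian, the 8-byte number is an IEEE double, text is UTF-8, and the number comes before the text for types 5 and 6. The function names are hypothetical.

```python
import struct


def read_text(buf, pos):
    """Read a NULL-terminated UTF-8 string starting at pos."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def read_symbols(buf, count):
    """Parse `count` symbols into (text, number) pairs, in read order."""
    symbols, pos = [], 0
    for _ in range(count):
        kind = buf[pos]  # 1-byte symbol type
        pos += 1
        text = number = None
        if kind in (1, 5):    # 4-byte integer (assumed little-endian)
            (number,) = struct.unpack_from("<i", buf, pos)
            pos += 4
        elif kind in (2, 6):  # 8-byte number (assumed IEEE double)
            (number,) = struct.unpack_from("<d", buf, pos)
            pos += 8
        elif kind != 4:
            raise ValueError(f"unknown symbol type {kind}")
        if kind in (4, 5, 6):  # text part, NULL-terminated
            text, pos = read_text(buf, pos)
        symbols.append((text, number))
    return symbols


# Hand-built sample: two type-5 symbols and one type-4 symbol.
sample = (
    b"\x05" + struct.pack("<i", 2) + b"2\x00"
    + b"\x05" + struct.pack("<i", 1) + b"1\x00"
    + b"\x04" + b"A\x00"
)
print(read_symbols(sample, 3))  # [('2', 2), ('1', 1), ('A', None)]
```

The position of each pair in the returned list is its symbol index, which matches the read-order rule described above.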


Num,Text,Dummy

2, A,

1, B,

1, A,

2, B,

,,


Num has 3 symbols: 2, 1 and NULL.

Text has 3 symbols: A, B and NULL.

Dummy has 1 symbol, i.e. NULL.


The symbols stored for Num will be

[Symbol Type = 5][2, "2"] => index = 0

[Symbol Type = 5][1, "1"] => index = 1

[Symbol Type = 5][NULL, NULL] => index = 2


It requires 8 bytes + 2 bytes (utf-8, 2 bytes for a char) for the symbol data.  To address 3 symbol indexes, it requires 2 bits.

As a result, in the QvdFieldHeader, BitOffset is 0, BitWidth is 2 and Bias is 0.


* The values are treated as type 5 because the data type is not known precisely.  This happens when the data comes from a CSV without a data type specification.  If the data comes from a database, there is a mapping between the database type and the type used in QVD.


The symbols stored for Text will be

[Symbol Type =4][A] => index = 0

[Symbol Type =4][B] => index = 1

[Symbol Type =4][NULL] => index = 2

It requires 2 bytes (utf-8, 2 bytes for a char) for the symbol data.  To address 3 symbol indexes, it requires 2 bits.

As a result, in the QvdFieldHeader, BitOffset is 2, BitWidth is 6 and Bias is 0.  Since it is the last column with data, it takes up all the remaining bits to form a full byte, i.e. 6 bits, even though only 2 bits are required.  The funny thing is that the last column means the last column with data.  A column that is all NULL is not treated as the last column.


The symbol stored for Dummy will be

[Symbol Type =4][NULL] => index = 0

It requires nothing for data storage, but symbol type 4 still requires 1 byte to be stored, i.e. the NULL byte.

As a result, in the QvdFieldHeader, BitOffset is 0, BitWidth is 0 and Bias is 0.  Offset, width and bias all being zero indicates that no bits are required in the record.
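The BitOffset and BitWidth values in this example can be reproduced with the sketch below. It encodes only the behaviour observed in this article (minimum bits per field, zero bits for a single-symbol field, and the last field with data padded up to a full byte). It is an assumption that these rules hold for every QVD file, and the function name `layout` is hypothetical.

```python
def layout(symbol_counts):
    """Compute {field: (bit_offset, bit_width)} and the record byte size.

    symbol_counts: list of (field_name, number_of_symbols) in field order.
    """
    # Minimum bits needed to address each field's symbols (0 for 1 symbol).
    widths = {name: max(count - 1, 0).bit_length()
              for name, count in symbol_counts}

    result, offset = {}, 0
    for name, _ in symbol_counts:
        width = widths[name]
        result[name] = (offset if width else 0, width)
        offset += width

    # Round up to whole bytes and give the spare bits to the
    # last field that actually carries data.
    record_byte_size = (offset + 7) // 8
    padding = record_byte_size * 8 - offset
    with_data = [name for name, _ in symbol_counts if widths[name]]
    if with_data and padding:
        last = with_data[-1]
        last_offset, last_width = result[last]
        result[last] = (last_offset, last_width + padding)
    return result, record_byte_size


fields, size = layout([("Num", 3), ("Text", 3), ("Dummy", 1)])
print(fields)  # {'Num': (0, 2), 'Text': (2, 6), 'Dummy': (0, 0)}
print(size)    # 1
```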


Record Data

In QVD, a record does not store the actual field values.  Instead, it stores the symbol indexes of all the fields.

Taking the same example used in the symbol illustration, the table is:


Num,Text,Dummy

2, A,

1, B,

1, A,

2, B,

,,


First record: requires Num[index=0], Text[index=0] and nothing for Dummy.  The record is represented as [0000], 2 bits for Num and 2 bits for Text.

Second record: requires Num[index=1], Text[index=1] and nothing for Dummy.  It is represented as [0101].

Third record: requires Num[index=1], Text[index=0] and nothing for Dummy.  It is represented as [0001].

Fourth record: requires Num[index=0], Text[index=1] and nothing for Dummy.  It is represented as [0100].

Fifth record: requires Num[index=2], Text[index=2] and nothing for Dummy.  It is represented as [1010].

The first field is stored in the rightmost bits while the last field is stored in the leftmost bits.

Thus, the 5 records are stored as [0000], [0101], [0001], [0100], [1010].  Just enough bytes are used to hold this binary data, i.e. 8 bits, 1 byte per record.  They therefore become [0000 0000], [0000 0101], [0000 0001], [0000 0100], [0000 1010] => 0, 5, 1, 4, 10.  These 5 integers represent a total of 15 values.

To complete the QvdTableHeader description, RecordByteSize is 1 and NoOfRecords is 5 because there are 5 rows.
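The packing and unpacking of this example can be sketched with plain bit shifts and masks. The field layout (Num at offset 0 with width 2, Text at offset 2 with width 6, Dummy with width 0) is taken from the example. The little-endian byte order for records wider than one byte is an assumption, and the function names are hypothetical.

```python
FIELDS = [("Num", 0, 2), ("Text", 2, 6), ("Dummy", 0, 0)]  # (name, offset, width)


def pack_records(rows, record_byte_size, fields):
    """Pack rows of symbol indexes into fixed-size binary records."""
    out = bytearray()
    for row in rows:
        word = 0
        for name, bit_offset, bit_width in fields:
            mask = (1 << bit_width) - 1
            word |= (row[name] & mask) << bit_offset
        out += word.to_bytes(record_byte_size, "little")
    return bytes(out)


def unpack_records(data, record_byte_size, fields):
    """Recover the symbol indexes of every field from the record bytes."""
    rows = []
    for start in range(0, len(data), record_byte_size):
        word = int.from_bytes(data[start:start + record_byte_size], "little")
        rows.append({name: (word >> bit_offset) & ((1 << bit_width) - 1)
                     for name, bit_offset, bit_width in fields})
    return rows


# Symbol lists in read order, as in the example (None stands for NULL).
SYMBOLS = {"Num": [2, 1, None], "Text": ["A", "B", None], "Dummy": [None]}

records = bytes([0, 5, 1, 4, 10])
for row in unpack_records(records, 1, FIELDS):
    print([SYMBOLS[name][index] for name, index in row.items()])
# [2, 'A', None]
# [1, 'B', None]
# [1, 'A', None]
# [2, 'B', None]
# [None, None, None]
```

A field with width 0 always decodes to index 0, which is why Dummy needs no bits in the record.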


* Bias still needs more investigation to determine its exact usage.


Reverse Engineering - What has been done?

To understand the format further, a simple way is to generate a simple QVD file and look into the details, for example a single column with 2 rows of integers only.  By iterating with different data and reviewing the details, it is easy to spot the changes.  This might require a text editor that can show invisible bytes like NULL, EOT, etc.  A good one is Notepad++.

Obviously, there may be more handling in QVD, but this already showcases the beauty of the format and why it can be processed so fast and compressed so much.

Personally, in terms of file operations, I seriously hope that Qlik can expand the use of this format, because data usage is huge nowadays and files are everywhere.  A proper and manageable file format is important.  In particular, to support cloud computing, an enhanced QVD might do the trick as well.  If it could break out of the Qlik environment and become open source, it would greatly benefit everyone dealing with data.
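If no suitable editor is at hand, a few lines of Python can show the invisible bytes as well. The helper below is a generic hex dump and is not part of the prototype. The sample bytes are made up, and in practice you would pass the bytes read from a small test QVD file.

```python
def hex_dump(data, width=16):
    """Return a hex + ASCII view of data, one line per `width` bytes."""
    pad = width * 3
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hexes = " ".join(f"{byte:02x}" for byte in chunk)
        # Non-printable bytes such as NULL or EOT are shown as '.'
        text = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        lines.append(f"{start:08x}  {hexes:<{pad}} {text}")
    return "\n".join(lines)


# Made-up bytes: a type-4 text symbol "A" followed by its NULL terminator.
print(hex_dump(b"\x04A\x00"))
```

Comparing the dumps of two files that differ in a single value makes the changed bytes easy to spot.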