Monday, July 31, 2023

QlikView Data File (QVD) - Reverse Engineering

Post Index

2023-07-31

QlikView Data File (QVD)

Reverse Engineering

Revealing what is inside a QVD file

QVD is a Qlik proprietary format that is widely use in the Qlik products. And this format works very well within Qlik environment but other than using Qlik products, it is not possible to convert data into QVD format.
QVD is famous on its performance and compression. Comparing to CSV, Excel and other formats, it has 10x performance gain because the format can, in fact, directly be loaded into memory and directly be used by the Qlik products.
From technically perspective, this article will try to discuss the QVD in detail so that we can understand the amazing elements in this QVD file.

Note: I was analyzing this because I was trying to do a project that hopes to read/write QVD using Alteryx. I then put some effort for this reverse engineering and developed some prototypes but then a simpler method is used and the information and effort seem deemed. Hopefully, with this article, I think it is good for every to applause for the design and also learn how to do reverse engineering a bit. If I get enough time, I will release the prototype in github.

QVD Component

QVD is a file that contains of three major parts:
1) XML
The XML is providing the metadata information about the QVD. It describes the QVD table and the QVD field with a number of internal used elements.

2) Symbol
Symbol means the unique value of a field. Each field has a list of symbols. This is why QVD is highly compressed because each unique value in a field is only saved once. Each symbol is indexed by a unique number, i.e. 0, 1, 2, ....

3) Record Data
Record data is stored in binary format. Each record is stored with the size of a record byte size. With the record bytes, each field value in a record indicated of offset and length. This portion of binary data can then be converted into the index to get the symbol value. The entire method makes uses of of bit operations and bit masks to reduce the byte required. This is one of the main reasons why the data volume is highly compressed.

XML
The QVD XML is illustrated below:

<QvdTableHeader>
<QvBuildNo>...</QvBuildNo>
<CreatorDoc>...</CreatorDoc>
<CreateUtcTime>...</CreateUtcTime>
<SourceCreateUtcTime>...</SourceCreateUtcTime>
<SourceFileUtcTime>...</SourceFileUtcTime>
<SourceFileSize>...</SourceFileSize>
<StaleUtcTime>...</StaleUtcTime>
<Fields>
      <QvdFieldHeader>
          <FieldName>...</FieldName>
          <BitOffset>...<BitOffset>
          <BitWidth>...<BitWidth>
          <Bias>...</Bias>
          <NumberFormat>
              <Type>...</Type>
              <nDec>...</nDec>
              <UseThou>...</UseThou>
              <Fmt>...</Fmt>
              <Dec>...</Dec>
              <Thou>...</Thou>
          </NumberFormat>
          <NoOfSymbols>...</NoOfSymbols>
          <Offset>...</Offset>
          <Length>...</Length>
          <Comment>...</Comment>
          <Tags>
              <String>...</String>
              <String>...</String>
          </Tags>
      </QvdFieldHeader>
   </Fields>
<Compression>...</Compression>
<RecordByteSize>...</RecordByteSize>
<NoOfRecords>...</NoOfRecords>
<Offset>...</Offset>
<Length>...</Length>
<Lineage>
      <LineageInfo>
          <Discriminator>...</Discriminator>
          <Statement>...</Statement>
      </LineageInfo
</Lineage>
<Comment>...</Comment>
</QvdTableHeader>

Some of the core tags are explained:
QvBuildNo
The QVD version.
CreateUtcTime
The QVD file created date time.
TableName
The QVD table name.
QvdFieldHeader
The details about each field for the QVD parser
QvdFieldHeader/FieldName
The Field Name
QvdFieldHeader/BitOffset
In the record byte, which starting bit to start extract the symbol index.
QvdFieldHeader/BitWidth
In the record byte, how many bits to extract starting from bit offset in order to get the symbol index.
QvdFieldHeader/Bias
It is a special indicator for special handling.
QvdFieldHeader/NoOfSymbols
The number of symbols in the field.
RecordByteSize
The size required to store a record in this QVD dataset.
NoOfRecords
The number of records in the QVD

Symbol
In each fields, there is a list of symbols stored with the pattern of symbol type and symbol data.
Symbol Type is 1 byte to indicate the what kind of data and pattern to parse.
Symbol data is the data content stored in the file. It is unique in each field for each symbol.

For symbol type 5 and 6, it is where DUAL data type is introduced. Dual is a special data type that how Qlik stores data. It is a pair of number and text in the form of (Text, Number). All data in QVD, in fact, are in dual form. For example, integer 25, it is stored as (NULL, 25). A text "Hello" is stored as ("Hello", NULL). A date is special that it stores ("DATE-STRING", DATE_INT). A datetime is stored as ("DATETIME-STRING", DATE_NUMBER). In general, it can be any (Text, Number) pair but generally, DUAL is tackling date and datetime. Sometime, color code will also make use of dual, e.g. (RED, 1), (GREEN, 2), etc.

The known (as result of reverse engineering) symbol types are:
1. Symbol Type = 1
4-byte integer is in this type.

2. Symbol Type = 2
8-byte number is in this type. This also include decimal point numbers.

3. Symbol Type = 4
Text is in this type. And a NULL char is at the end to indicate the end of the text.

4. Symbol Type = 5
4-byte integer along with text with a NULL end. It is date with the form (Text, 4-byte integer). In fact, other than date, it is possible to store any text/integer pair.

5. Symbol Type = 6
8-byte number along with text with a NULL end. It is datetime with the form (Text, 8-byte number) In fact, other than date, it is possible to store any text/number pair.

The order of how these symbols are read indicates the corresponding symbol index. For example in a field, "c" is the first read for the field, it will have a index =0, the "a" is the second symbol read, it will have index = 1. It does not require a proper sorting.
Moreover, it is also required special attention on how it manipulates NULL. Using an example will be easier to understand. With a table with three fields Num, Text, Dummy with the following data:

Num,Text,Dummy
2, A,
1, B,
1, A,
2, B,
,,

Num has 3 symbols including values of 2, 1 and NULL.
Text has 3 symbols including values of A, B and NULL
Dummy has 1 symbol including values of i.e. NULL

The symbol stored of Num will be
[Symbol Type =5][1 and "1"] => index =0
[Symbol Type =5][2, "2"] => index = 1
[Symbol Type =5][NULL, NULL] => index = 2

It requires 8 bytes + 2 bytes (utf-8, 2 bytes for a char) for the symbol data. To indicate 3 symbol indexes, it requires 2 bits.
As a result in the QVDFieldHeader, BitOffset is 0 and BitWidth is 2 and Bias is 0.

* it will treat as 5 as it does not know the data type well. It happens when the data is coming from a CSV without data type specification. If it is coming from DB, it has a mapping between the DB type and the type to be used in QVD.

The symbol stored of Text will be
[Symbol Type =4][A] => index = 0
[Symbol Type =4][B] => index = 1
[Symbol Type =4][NULL] => index = 2
It requires 2 bytes (utf-8, 2 bytes for a char) for the symbol data. To indicate 3 symbol indexes, it requires 2 bits.
As a result in the QVDFieldHeader, BitOffset is 2 and BitWidth is 6 and Bias is 0. Since it is the last column with data, it will take up all bits to form a full byte, i.e. 6-bits even the smallest and required is just 2 bits. And funny is that the last column means the last column with data. If it is all null, it would not treat as the last column.

The symbol stored of Dummy will be
[Symbol Type =4][NULL] => index = 0
It requires nothing for data storage. But the symbol type 4 is still required to store 1 bytes, i.e. NULL byte.
As a result in the QVDFieldHeader, BitOffset is 0 and BitWidth is 0 and Bias is 0. Offset, width and bias are zeros indicate no bytes are required for record.

Record Data
In QVD, each record is not storing the exact field values. Indeed, it stores the symbol indexes of the all fields.
Taking the same example used in the symbol illustration, the table below:

Num,Text,Dummy
2, A,
1, B,
1, A,
2, B,
,,

First record: require Num[index=0] and Text[index=0] and dummy=nothing, the record represents as [0000], 2 bits for Num, 2 bits for Text.
Second record: require Num[index=1] and Text[index=1] and dummy=nothing, 3 bits represents as [0101]
Third record: require Num[index=1] and Text[index=0] and dummy=nothing, 2 bits represents as [0001]
Forth record: require Num[index=0] and Text[index=1] and dummy=nothing, 2 bits represents as [0100]
Fifth record: require Num[index=2] and Text[index=2] and dummy=nothing, 2 bits represents as [1010]
The first field will be stored in the rightmost bits while the last field will be stored in the left most bits.
Thus, the 5 four records will be stored as [0000], [0101] [0001] [0100], [1010]. The just enough bytes are required to hold these binary data, i.e. 8 bits, 1 byte. Thus, they become [0000 0000], [0000 0101], [0000 0100], [0000 0001], [00001010] => 0, 5, 1, 4, 10. These 5 integers are used to represent total of 15 values.
To complete the QvdTableHeader description, the RecordByteSize is 1. And NoOfRecords are 5 because there are 5 rows.

* Bias, still, needs more investigation on the exact usage.

Reverse Engineering - What have been done?
In order to try further understanding the format, a simple way to is generate a simple QVD file and look into the details. For example a single column with 2 rows with integer only. Keep iterating with different data and review the details, it is easy to spot the changes. It might require a notepad editor that can show the invisible byte like NULL, EOT, etc. A good notepad editor is notepad++.
Obviously, there might be still more handling in QVD but it is already showcase the beauty of it and why it can process that fast and compress that much.
Personally, it terms of file operation, I seriously hope that Qlik can further expand this usage because nowadays the data usage is huge and file is everywhere. A proper and manageable file is important. In particular to support the cloud computing, an enhanced of QVD might do the trick as well. If it can break through the area and become open-source to use, it will greatly beneficial to everyone dealing with data.

Wednesday, July 19, 2023

Qlik Sense Server Migration - API and CURL Detail

Post Index

2023-07-19

How to use CURL and QRS API

Qlik Sense Server Migration Overview

Qlik Sense Server Migration

API and CURL Detail

Migrating Qlik Sense items using APIs

There are a number of QS items required to be migrated in the Qlik Sense server migration process. This article focuses two core components to illustrate the concept and steps required for migration. First, it will discuss about the App migration that involves App, App Object and App Content. Then it will discuss about the Reload Task that involves Reload Task, System Event and Composite Event.
For other, the concept is similar. I am happy to help if you have questions.

General QRS API Call
More technically, Qlik Sense Server Migration involves a few types of QRS API calls:

1) GET
To obtain the metadata in JSON and this JSON will be modified or trimmed off in order to POST/PUT into the new cloud server.
2) POST
POST command is generally to create a new item in the QS server.
3) PUT
PUT command is usually to update the QS item that is already created in the server.

For POST and PUT, usually, it requires details to be provided in the HTTP body. It would be much easier to GET the metadata directly from the on-premises server and modify the necessary component in the JSON for the POST and PUT APIs. One obvious example is the modified date and it must be larger than the last modified date. More related to this article discussion is the app ID. Since APP ID will be generated after upload, it should maintain a mapping in order to update the JSON for the POST and PUT APIs.

App Object ID vs Engine Object ID
These two IDs are tricky and easy to get confused. When the app object is created, a unique GUID will be created and this ID is used internally in QS for QMC. At the same time when the object is created, the Engine Object ID is also created. It is an in-app unique ID to distinguish the object. It is for the qvf used.
In the QRS API, app object ID is referring to the App Object ID. However, in QMC or in the link, it is showing the Engine Object ID.

QS Application Migration
Before migration QS applications, firstly, some QS items should be migrated beforehand. For instance, data connection, extension, content libraries, stream, etc. This makes sure the application can make use of all these components.
In fact, to migrate an application, it is similar to do what manually do with the following steps:
1) Export the app with data and cover all the base, shared and private app objects.
2) Import the app with data with all the exported app objects and maintain the ownership
3) Publish the app to stream (if necessary)
4) Unapprove the app object
5) Unpublish the app object
6) Change the owner of the app object to maintain all the ownership

* If app is exported without data, the always one selected value setting in the field will be lost. If you are certain that the application does not have this setting, it is recommended to export without data and do the reload afterwards.

Export App
There are three methods that are available to obtain the QS app with all the app objects:

1) using QRS API /qrs/app/exportapps (refer to May 2023 version)
This is only available in or above May 2023 version of Qlik Sense. The export scope should be configured to "all".
2) copy the app directly in the Qlik shared folder, i.e. \\server-name\QlikShare\Apps
This folder contains all the Qlik Sense applications that are in the Qlik Sense server. The files in this folder are named by the QS Application ID without extension. These files are actually a binary qvf.

3) Publish/Approve the sheets and then export the app.

This method has impact to the end users because the private and community app objects will be published and approved to base app objects. After triggering the export, it can simultaneously unapprove and unpublish the app objects but it still take a while for these operations. If the server is frozen, this method would be a good choice.

This method requires the following QRS API:
1) PUT /qrs/app/object/{id}/publish
To publish app objects into sharing state.
2) POST /qrs/app/object/{id}/approve
To approve app objects into base state.
3) POST /qrs/app/{id}/export
To create the download link and export the app
4) GET /qrs/download/app/{id}/{exportTicketId}/{fileName}
To download the app based on the download link
5) * if require to restore app object state, POST /qrs/app/object/{id}/unapprove
To unapprove the app objects back to sharing state.
6) * if require to restore app object state, PUT /qrs/app/object/{id}/unpublish
To unpublish the app objects back to private / personal state.

Import App
There are two APIs available for importing the QVF into the Qlik Sense server:
1) POST /qrs/app/import
This requires the qvf first located in the app folder. C:\ProgramData\Qlik\Sense\App

2) POST /qrs/app/upload
This method is recommended.
Apparently, manual upload is also possible.

Publish App to Stream
The next step is to publish the app into stream with the QRS API.
PUT /qrs/app/{id}/publish
Some of the applications might be in the work stream that does not require this step.

Unapprove App and Change Ownership
This step is a bit tricky because we are not using the unapprove API. Instead,
the app object put is used.
PUT /qrs/app/object/{id}
As a result, we can update both the unapprove flag as well as the app object ownership in one-go.

Unpublish App
The last step is to unpublish the app object if they are private/personal.
PUT /qrs/app/object/{id}/unpublish
*** Note if binary QVF is used
The binary qvf files obtain all the app objects even it is deleted (reference). So, there should be a steps to remove the redundant objects. This can easily be cross-checked with the app object metadata and do a DELETE /qrs/app/object/{id}.
And there are a few more drawbacks:
The binary qvf files obtain all the app objects even it is deleted (reference). So, there should be a steps to remove the redundant objects. This can easily be cross-checked with the app object metadata and do a DELETE /qrs/app/object/{id}.
i) If the application has referenced to the In-App pictures, the link is broken. One way to fix this is to make sure of the Engine JSON API to replace the on-premises app ID to the cloud app ID.
ii) the app properties are also lost. This needs to be fixed by PUT /qrs/app/{id}. Obviously, it requires a JSON that can be obtained by GET /qrs/app/{id}
iii) the app content is lost. And it requires to POST /qrs/appcontent/{appid}/uploadfile. The content physical files can be found in \\server-name\QlikShare\Static Content\App Content.
QS Reload Task Migration
The reload task migration is less complicated compared to the app migration. The only tricky part is to make sure the reload task is pointing to the updated app ID. And reload task always comes with trigger where
a) Schema event is based on the time schedule.
b) Composite event is used to chain up the app reload.

Overall, the below API calls will be used to obtain the metadata.
GET /qrs/reloadtask/full
GET /qrs/schemaevent/full
GET /qrs/compositeevent/full

The below API call will be used to create the reload task, schema event as well as the compositive event.
POST /qrs/reloadtask
POST /qrs/schemaevent
POST /qrs/compositeevent

Thank you for reading. I hope you find it useful. See you in the next post :)

Saturday, July 15, 2023

Qlik Sense Server Migration - API and CURL Overview

Post Index

2023-07-15

How to use CURL and QRS API

Qlik Sense Server Migration

API and CURL Overview

Migrating Qlik Sense items using APIs

There are a lot of discussions on how to perform a Qlik Sense server migration . One typical method is to use backup and restore the PostgreSQL database and then update the configuration. Instead of this traditional method, this article focuses the discussion on how to perform a Qlik Sense migration using QRS API with CURL to move QS items from on-premises servers to cloud servers.
This article provides an overview and provides an understanding about the basics for preparation. For detailed implementation, it will be discussed in future posts later (will provide the links later when they are ready). So, let's start.

Qlik Sense Migration Concept

Using QRS API to perform the migration means that the new QS server is first configured in place and then ready to accept the move-over QS items by QRS API. The new QS server should be readily configured:
  a) Share location (the shared content location)
  b) Node setup (join to the central node cluster)
   c) Service Configuration (Proxy, Engine, Scheduler, etc)

After the new server is ready, in general, the below Qlik Sense items will be involved in the migration
  a) Data Connection
  b) Stream
  c) Extension
  d) Content Libraries
  e) Applications
  f) App Objects (Base/Private, Publish, Approved)
  g) App Content
  h) ODAG Links
  i) Reload Task (including schema event, composite event)
  j) System Rules (including Security Rules, license rules, etc)
  k) more ... (like tag, custom properties and so on), depending on the usage

The above should cover majority of the usage pattern. More importantly, the QS items have inter-dependencies. For example, Stream must be ready before an application can be published to it. The above sequence can be used as a simple sequence for creation for reference.
In addition, all the active users should be synchronized from the user directory before the migration to start in order to preserve the ownership of the QS items.

Migration Control Master

Before migration, it is important to mark down the QS items that are going to be migrated. The easiest way is to make use of the QRS API to obtain the metadata directly from the Qlik Sense server. In general, it will be something similar to below, taking app as an example.
1) GET /qrs/app
This provides the condensed details
2) GET /qrs/app/full
This provides the full details.
If taking the aforementioned pattern, the below QRS API calls will be used to extract the metadata:
1) GET /qrs/dataconnection/full
The data connection to be used for loading data source data.
2) GET /qrs/stream/full
The stream that will be used to publish app.
3) GET /qrs/extension/full
The extension used for extra chart types.
4) GET /qrs/contentlibrary/full
The additional libraries to provide picture or media files.
5) GET /qrs/app/full
The QS application.
6) GET /qrs/app/object/full
The app objects including sheet, story, bookmark, etc.
7) GET /qrs/app/content/full
The in-app picture and media files.
8) GET /qrs/odaglink/full
The ODAG link that allows usage.
9) GET /qrs/odaglinkusage/full
The actual usage for the ODAG links.
10) GET /qrs/reloadtask/full
The reload task to trigger data reload in QS application.
11) GET /qrs/schemaevent/full
The reload task trigger by schedule.
12) GET /qrs/compositeevent/full
The reload task trigger by event.
13) GET /qrs/systemrule/full
The rule including security rules, license rules, etc.
14) GET /qrs/license/professionalaccessgroup/full
The group for professional license.
15) GET /qrs/license/analyzeraccessgroup/full
The group for analyzer license.
16) GET /qrs/user
The QS users.

Migration Process Overview
There are a few core steps required in the migration process. It includes:
1) Extraction
It extracts all the metadata from the on-premises server via QRS API. The result will be in JSON format and it is then required to convert into Excel as a metadata list.
2) Review
The metadata lists will then be reviewed by administrators to analyze what are required to migrate and what are redundant. Once this process is done, it becomes control master lists as a control to make sure all items marked "migrate" should be migrated to the new cloud server.
3) Prepare
Once the control master list is ready, it will then help to prepare the generation script for CURL batch file as well as the JSON for migration. This is tricky and a clear sequence of process is needed.
4) Migration
When all the CURL batches and JSON files are ready, it can then start the migration processes.
5) Verification
Once all items are then migrated to the cloud server, it is required to extract again both the on-premises metadata as well as the cloud metadata in addition to the control master. They are then matched against each other to make sure items:
1) should be left in on-premises
2) should be migrated to cloud
3) are new to cloud

Advantages of Using API

One difficulty in migrating the Qlik Sense items over to the new servers is to maintain the GUID. It is required for Qlik Sense to link up all the components internally and work together. The integrity should be maintained. If the GUID is changed, it means the corresponding dependent items or references have to be updated. Otherwise, the relationship is broken, it creates lots of disfunctionality.
If the QS items are created via QMC console, a random GUID will be created. And a lot of effort is required to match the old/new GUID and it increases the tasks to re-configure all these dependencies.
Fortunately, using QRS API, a majority of the QS items can retain their GUID which can make the migration easier and faster. Although it is not possible to maintain the application ID (also the Application Object ID, please do not confuse with the Engine Object ID, will explain further during the detailed migration process), it already smoothens the migration process to reduce the changes required.
More discussion will be provided when discussing the migration details.

Thanks for reading and I hope you find it useful. See you in the next post! :)

Labels

Monday, July 31, 2023

Reverse Engineering

Wednesday, July 19, 2023

API and CURL Detail

Saturday, July 15, 2023

API and CURL Overview

Migrating Qlik Sense items using APIs

Migration Control Master