Huffman Coding In SQLite – A Primer For Mobile Forensics

While working on revisions for my Smart Phone Course, I thought I could perhaps craft a primer that would help explain Huffman Coding in SQLite a bit better than I had explained it during the class. Below is the result of that endeavor, written between flights to Saudia Arabia. I hope that it is helpful for you.

1. Introduction

This article is designed to provide you with an introduction to a concept know as Huffman Coding. Understanding the basics of how to decode Huffman coded variable length integers within SQlite database records is of relevance to any forensic examiner seeking to test and validate forensic software or to carve for SQLite records.

The examiner is encouraged to explore the references listed after the primer for further understanding and information related to developing search algorithms.

2. What is “Huffman Coding”?

Huffman coding is an algorithm used for lossless data compression. It is used in SQLite to store the values of the record size, key and record header variable length integers. It is also used in other file formats such as PKZIP (and its variants) JPEG and MP3.

Huffman coding uses one byte for a decimal value from 0 to 127, two bytes for a value up to 16,383, three bytes up to 2097152, continuing up to a nine byte maximum.

3. SQLite Record Structure

In SQLite 64 bit variable length integers – which use Huffman coding -are used to store the value of the record size, the key and the header data of an individual record. Below is a graphical representation of a SQLite data record

SQLite Record Structure

SQLite Record Structure

The bytes that make up the individual header follow in order the Schema Layer of the database. The values of the bytes will be interpreted as in the below table. The value column refers to the hex value of the byte not the decimal value. To find the length of the bytes listed in the last two rows you need to first convert the byte to decimal and then follow the formula listed in the third right most column.

SQLite Header Values

SQLite Header Values

Remember that we said that any value above 127 decimal requires a special conversion because the decimal value has been compressed using the Huffman algorithm. This will be detailed in the following sections.

4. Huffman Coding Example

We will now look at an instance of an SMS message that employs Huffman coding to store the decimal value of the record size. The total record length does not factor in the record size byte or the bytes that make up the key (record number). The size of the record reflects the record header plus the data section. The first byte of the below record has been highlighted to show the beginning value with which we start to calculate our record size.

Header Size Byte

Header Size Byte

5. Decoding the sequence – Two methods

I will show you two methods for calculating the decimal value of the total byte length of the record. The first method, while lengthy will give you a detailed understanding of how huffman coding works within SQLite. The second method is equally valid and provides for a shortcut. I reccomend trying a few on your own using the lengthier method to fully understand the ins and outs before heading to the “shortcut”.

The variable integers in the SQLite records are stored in big endian format regardless of the host operating system. This means that the most significant bit will be stored first and the least significant bit stored last.

The first method we will be looking at converts our byte of interest from hexidecimal to binary to determine if the most significant bit is set. If this bit is set we then know that the next byte in the sequence is also part of the calculation. This byte is then converted to binary and checked to see its most significant bit is set. If it is, we proceed onto the next byte and so on.

Lets look at our above text message and the byte we have highlighted. The value of our byte is \x81. Converted to binary its value is 1000 0001. As we can see from the below graphic the most significant bit – shown in red – is set. Therefore we need to look at the next byte for our calculation

Hex 81 Most Significant Bit Set

Hex 81 Most Significant Bit Set

We now need to look at the next byte in the sequence. in this case it is \x36. Converted to binary its value is 0011 0110.

Hex 36 Most Significant Bit Not Set

As is shown in red, this byte’s most significant bit is not set. Therefore our calculation uses only the first two bytes in the record to calculate size. To do this calculation we leave off the most significant bit of each byte- shown below in green and red – and concatinate the remaining seven bits of each (starting with the binary of the first byte \x81). The resulting binary is then converted to decimal for the size of our record.

Converted Binary Values - Record Size

Converted Binary Values - Record Size

The next method uses a calculation for determining the stored compressed decimal value. You will recall from our previous discussion that a byte that is greater than 127 decimal uses 2 bytes to store its value up to decimal 16,383 and three bytes up to 2,097,151 etc. 127, 16,383, and 2,097,151 are special because they are related to powers of 2^7; i.e., 127 = (2^7-1), 16,383 = (2^14-1), and 2,097,171 = (2^21-1). Given this, when we examine the first byte of our record – the one that determines the size of the record – /x81 or decimal 129 we see that it is greater than 127 and smaller than 16,383 and will therefore use 2 variable integer bytes to store its uncompressed value. The formula to calculate the uncompressed value using two bytes in the calculation is as follows

Huffman Decoding 2 Byte Sequence

Huffman Decoding 2 Byte Sequence

The value of “x” is the value of the first byte converted to decimal and the value of “y” is the second byte converted to decimal. The calculation is continued below

Huffman Decoding 2 Byte Complete

Huffman Decoding 2 Byte Complete

As you can see from the above calculation the value obtained from this method is the same as the other method. It is unlikely that values above 16,383 will be encountered, however the formula for calculating a compressed value stored within three variable integers is as follows ( x x 128) x 16384 + ( y – 128) x 128 + z. Again the value of “x” is the value of the first byte converted to decimal, the value of “y” is the second byte converted to decimal and the value of “z” is the third byte converted to decimal.

6. Conclusion

We have now explored the basics of how SQLite stores decimal numbers over 127 in 64 bit variable integers using an algorithm called Huffman Coding. Understanding this concept will help forensic practitioners in testing and validating software used to analyse SQLite databases and in crafting searches for deleted and unallocated records that may remain extant on suspect media. The reader is encouraged to explore the references listed below for further understanding and insight into SQLite and searching for records in unallocated space.

Comments or questions are welcome and can be directed to linuxchimp at gmail dot com.

7. References

  •  Murilo Tito Pereira, “Forensic analysis of the Firefox 3 Internet history and recovery of deleted SQLite records”, Digital Investigation 5 (2009) 93-103, Elsevier

8. Acknowledgements

No article is written without help. I would like to thank Dr. Gary Kessler for his valuable eye and suggestions as well as DC Shafik Punja of the Calgary Police and Sheran Gunasekera of Zenconsult, LTD for our discsussions regarding variable length integers.



  1. cool

    Hi, I was doing research related to sqlite and this article helps me a lot. Thanks, by the way, there are two minor typos. In the 5th section, 2,097,171 = (2^21-1) should be 2,097,151 = (2^21-1) and ( x x 128) x 16384 + ( y – 128) x 128 + z should be ( x – 128) x 16384 + ( y – 128) x 128 + z

  2. Pingback: SQLite Database File Format Complete | ComTechStash