Database – the Void

SQLite with Built-in Online Compression

Sep 18, 2011

Summary (TL;DR)

This is an experimental mod of Sqlite with built-in online compression support. The design and implementation are discussed, limitations and benchmarks are provided, and the source code and a prebuilt DLL are included. Use the TOC to jump to the topic of interest.

Update ⇛ Source code is now hosted on GitHub

Background

Both Sqlite and MySql support compressed (and encrypted) databases. Well, more or less. Sqlite’s support is limited to read-only databases that are compressed offline, while MySql’s support is limited to compressing strings (as far as I can tell.)

While working on WikiDesk, a Wikipedia browser project, I knew the database could easily grow to 100s of gigabytes. The database of choice here is Sqlite because of its compactness and mobility. The English Wikipedia dump is already in the range of 100s to 1000s of gigs (depending on the dump type.) WikiDesk not only supports different Wikipedia languages, but also different projects, such as Wikinews, Wikibooks and Wiktionary, among many others in all available languages, all in the same database. Theoretically, one can import all possible Wiki content into a single database.

The opportunity of compressing this highly-redundant wiki-code mixed with Unicode text was pretty much obvious. So it was reasonable to assume others must have had a similar case and added compression support to Sqlite. My search only yielded the aforementioned cases.

A part of me was happy to have found no precedent project. I was more than happy to roll up my sleeves and get to hacking.

Design Survey

There are many ways to go about designing a compressed database file. My main purpose, however, was to have fully-transparent, online and realtime compression support. So the design must accommodate updates and deletions as well as any other modify operation supported by Sqlite.

An obvious approach is the one used by MySql, namely to compress the fields independently. This is simple and relatively straightforward. However, it’d mean that LIKE couldn’t be used on compressed string fields. Collation, sorting and other features would be absent as well. In fact, the fields in question couldn’t be TEXT at all. In addition, one would have to explicitly compress fields, remember which ones are compressed and remember to uncompress them before use. Very limited, I thought, and probably not worth the effort. Another approach is to do this at a low level, such that it’d be transparent to the caller. Such an extension to Sqlite exists, but it will not yield much gain on small fields. I suspect NTFS compression would give better results.

NTFS has built-in compression support. It was well worth the effort of testing it. On an English SimpleWiki dump I could compress the database file down to about 57% of its original size (see benchmarks below.) Pretty decent. However, I couldn’t control it at all. I couldn’t set the chunk size, compression level or anything save for enabling and disabling it. In addition, the user could disable it and lose all the benefits. Database-level compression is more promising. A similar result can be achieved using FuseCompress or compFUSEd (on Linux), although the user must install such a filesystem first.

A major problem with database files, as far as online compression is concerned, is that the database's logical structure typically stores pointers to file offsets, such that there is a one-to-one mapping between the physical and logical structures. This is reasonable, as the database is really a large and complex data structure on disk (as opposed to memory.) The btree or rtree nodes are typically page indexes, where all pages have a predefined, database-wide fixed size. Disrupting this structure would render the file corrupted. The purpose of the fixed-size pages is to simplify the allocation and management of space. This scheme is also used by memory and disk managers alike.

If we compress a page in the database, the page would now contain two regions: data and free-space. To utilize the free-space, we could write a portion of the next page in the free-space, and the remaining in the next page, and so on for all pages. But then we’d have to keep track of each page’s fragments somehow. To avoid that, we can leave the free-space unused, but then we’d get no net saved disk space, as the free-space would still be allocated on disk.

I could store the new indexes and offsets in some allocation table appended to the file. But I’d have to do a lot of data moving, reallocation, (de)fragmentation and whatnot just to keep track of the free ranges and so on. Obviously this approach was pretty complicated and would take much more involved design and coding. Also, Sqlite page sizes are multiples of the disk sector size for atomicity. I would have to be thoroughly familiar with the Sqlite design and implementation to embark on such a largish project, if I wanted it finished and working.

The ‘be lazy’ motto seems to work well for programmers who are efficiency-oriented and hate repetitive and error-prone work. What would be the simplest approach that could work? Going back to NTFS, one could learn a lesson or two on transparent compression. The secret is that NTFS can simply allocate any free cluster on the disk, write the compressed data to it and update the file's allocation map. A file's clusters are tracked as a list of runs, much like a linked list, so it is very easy to insert, remove and modify links in the chain. Files, on the other hand, are arrays of bytes abstracted from the disk structure. Moving bits around in an array is much more complicated and time consuming than updating nodes in a linked list.

What is needed is the advantage of a file-system applied on the level of files.

What if we could tell the file-system that these free-space regions of the file are really unused? NTFS supports sparse files in addition to compressed files. This could be used to our advantage. All we’d have to do is mark the free-space in each page as unused and the file-system will make them available to other files on the disk, reducing the net used disk space of the database.

The Design

Sqlite supports page sizes from 512 to 65536 bytes. Since we can’t break a single page, the smallest compression unit must be at least 64 Kbytes long. In addition, the compression-unit of NTFS compression also seems to be 64 Kbytes. This means that a sparse range must be at least as large as a compression-unit to be deallocated from disk and marked as free. This puts a clear limitation on the amount of saving we can achieve using this design: compression won’t save any disk space unless it reduces the size in multiples of 64 Kbytes. A chunk size of exactly 64 Kbytes would therefore be totally useless, as there could be no saving at all. Instead, a multiple of 64 Kbytes is used as the compression unit, internally called a chunk.

When data is written, it’s first written into a memory buffer. This buffer tracks the changes to the chunk and its offset in the file, and serves as the input to compression. When the chunk needs flushing, the data is first compressed and the compressed data is written at the chunk offset. The remainder of the chunk is marked as a sparse region. NTFS deallocates any naturally-aligned compression units that are completely sparse. Partially written units are physically allocated on disk and 0-valued bytes are written to disk.
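The deallocation rule above can be put into numbers with a small sketch. It assumes the 64 Kbyte unit size mentioned in this post and that compressed data always starts at the beginning of a chunk; the name `reclaimable_bytes` is made up for illustration and is not part of the mod.

```c
#include <assert.h>
#include <stdint.h>

/* NTFS compression-unit size as assumed in this post: 64 Kbytes. */
#define UNIT_SIZE (64u * 1024u)

/*
 * Hypothetical helper: bytes of one chunk that could be deallocated after the
 * compressed data is written at the start of the chunk and the remainder is
 * marked sparse. Only whole, naturally-aligned units past the data count.
 */
uint32_t reclaimable_bytes(uint32_t chunk_size, uint32_t compressed_size)
{
    assert(chunk_size % UNIT_SIZE == 0);  /* chunks are multiples of the unit */

    if (compressed_size >= chunk_size)
        return 0;                         /* nothing was gained by compressing */

    /* Round the occupied part up to the next unit boundary. */
    uint32_t used = ((compressed_size + UNIT_SIZE - 1) / UNIT_SIZE) * UNIT_SIZE;
    return chunk_size - used;
}
```

Under these assumptions, a 256 Kbyte chunk that compresses to 100,000 bytes occupies two units and frees the other two (128 Kbytes), while a 64 Kbyte chunk frees nothing for any non-empty compressed output.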

When reading data, the complete chunk containing the requested byte-offset is read and decompressed, and the requested bytes are copied from the buffered data back to the caller. The sparse bytes are transparently read in as 0-valued bytes. This is done by NTFS and relieves us from tracking sparse regions.
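As a rough illustration of the read path, here is how a requested file offset might be mapped to its chunk. This is a sketch under the assumption that chunks are fixed-size and laid out back to back from offset 0; the `chunk_pos` type and `locate` function are hypothetical names, not taken from the mod's source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of where a byte lives in the chunked file. */
typedef struct {
    uint64_t chunk_index;   /* which chunk holds the byte */
    uint64_t chunk_offset;  /* file offset where that chunk starts */
    uint32_t inner_offset;  /* position of the byte inside the decompressed chunk */
} chunk_pos;

/* Map a logical file offset to its chunk, assuming fixed-size chunks. */
chunk_pos locate(uint64_t file_offset, uint32_t chunk_size)
{
    assert(chunk_size > 0);

    chunk_pos pos;
    pos.chunk_index  = file_offset / chunk_size;
    pos.chunk_offset = pos.chunk_index * chunk_size;
    pos.inner_offset = (uint32_t)(file_offset - pos.chunk_offset);
    return pos;
}
```

A reader would then load the whole chunk starting at `chunk_offset`, decompress it, and copy bytes starting at `inner_offset`.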

Initially very fast compression libraries were used to avoid sacrificing too much performance. FastLz, Lz4 and MiniLzo were tested but the results weren’t very promising, compression-wise. As such the current build uses Zlib.

Implementation

The compression mod is written as a VFS Shim. This has the advantage of avoiding any modifications to the Sqlite code base.

Enabling compression must be done before opening any database files. A single function is defined as follows:

int sqlite3_compress(
    int trace,
    int compressionLevel
    );

trace can be a value between 0 and 7. When 0, tracing is disabled; larger values enable tracing of increasingly lower-level operations. Trace logs are written to stderr. Pass -1 for the default.

compressionLevel can be a value between 1 and 9, where 1 gives the fastest performance at the expense of compression ratio and 9 gives the best compression at the expense of performance. Pass -1 for the default, which is typically level-6.

To enable compression, this function is simply called before calling sqlite3_open. The compression level may be changed between runs; however, unless a chunk is modified, its data will not be recompressed with the new level.

Only the main database is compressed. The journal or any other temporary files aren’t compressed.

Limitations

Besides the fact that the code is in an experimental state, there are some things unsupported or even unsupportable by this mod. First and foremost, only this mod can read compressed databases. The original Sqlite will declare compressed databases corrupted. However, this mod can and should detect uncompressed databases and disable compression silently (but use at your own risk.)

Since NTFS sparse file support is the key to achieving compression, the mod is literally useless on non-NTFS systems.

Sqlite is known to be quite resilient in the face of file corruption. This can no longer be supported at the same level as in the official release. In addition, corruption would destroy much more data than a single page. With the compression library and the new code also comes the increased risk of crashing or being unstable.

Among the untested and probably unsupported features of Sqlite are:

Online database backup.
Multiprocess read/write.
Vacuum.
Data recovery.
Shell and 3rd-party tools.

Performance-wise, there is virtually no caching implemented beyond the current chunk. This is bare-bones caching and there is a lot of room for performance improvements.

Benchmarks

An import of an English SimpleWiki dump was used as the benchmark. The main table holds an auto-increment index, a timestamp, the page title and the page contents (both Unicode).

256 Kbyte Chunks and Level-6 Compression (sizes in KBytes)
	Original	Sqlite Compressed
NTFS Normal	204,438 (100%)	73,296 (35.85%)
NTFS Compressed	117,460 (57.45%)	57,431 (28.09%)

1024 Kbyte Chunks and Level-9 Compression (sizes in KBytes)
	Original	Sqlite Compressed
NTFS Normal	204,438 (100%)	67,712 (33.12%)
NTFS Compressed	117,460 (57.45%)	66,220 (32.39%)

It’s quite obvious that the savings with the modified Sqlite are substantial compared to NTFS compression on the original file. Interestingly, NTFS compression applied to an already-compressed file still yields gains. This is because NTFS can deallocate at the level of clusters, which are 4096 bytes, as opposed to the sparse method’s compression-unit of 64 Kbytes. Since the free regions are written as zero bytes and aren’t deallocated unless a complete 64 Kbyte unit is zeroed out, it seems reasonable to assume NTFS compression is crunching these zero-padded regions and deallocating them, as its unit is only 4096 bytes.

It should also be noted that while statistically we should get better compression with larger chunk sizes and higher compression levels, this isn’t linear. In fact, increasing the chunk size may lead to reduced net gains in file size due to the 64 Kbyte compression-unit of NTFS. That is, if two chunks could each save a single unit (64 Kbytes,) doubling the chunk size (such that both would be compressed together as one chunk) might not be able to save 128 Kbytes, in which case the savings would be reduced from two units to a single, resulting in a 64 Kbyte larger file than we had with the original chunk-size. This heavily depends on both the data and the compression, of course.

Performance

A synthetic test was run using text generated from an alphabet of alphanumeric characters plus symbols, with random lengths of <1MB. Zlib seems to perform slowly on this random data (although the number of possible codes is small.) A chunk size of 256 Kbytes and a compression level of 6 were used. 50 random rows are generated and inserted with incremental Ids (two-column table); the 50 rows are then selected by Id and the texts compared to the originals. Next, new texts are generated with new lengths, this time <2MB, and the rows are updated. Again the 50 rows are selected by Id and compared to the updated originals. The resultant database file is 50,686 Kbytes.

The original Sqlite code ran the test in 13.3 seconds, while with default compression and no tracing (to avoid any overheads) the same test finished in 64.7 seconds (4.86x slower), resulting in a 41,184 KByte file. Both tests ran on the same generated data. The file was on a RAMDisk to minimize disk overhead.

Considering that the data was random and synthetic and insert/update rate was equal to select rates, the results are reasonable. In practice, reads are typically more frequent than writes. With proper caching this should reduce the performance overhead significantly.

Download

The code holds the same copyright claims as Sqlite, namely none. The code is experimental. Use it at your own risk.

Download the code and prebuilt DLL. This sqlite3.dll is built from the version 3.7.7.1 amalgamation, which was generated from the original sources by the original configure and make files with the default settings/flags. The compression code is added, and it is built using VS2010 SP1 and statically linked to the runtime libraries; as such, it has no dependencies.

Building

To build the code, first download a recent Sqlite version. The 3.7.7.1 amalgamation is perfect. The latest Zlib must also be downloaded and built.

Add the Zlib headers to the include path, copy the vfs_compress.c file next to the Sqlite sources and build it. Next, build the sqlite3.c amalgamation (or the original sources) and link the binaries of sqlite3, vfs_compress and Zlib to create the executable.

Future Plans

A good percentage of the official Sqlite tests pass successfully. But the corruption and format-validating tests unsurprisingly fail. Increasing the supported cases is a prime goal at this point. Getting the mod to “stable with known-limitation” status would be a major milestone. Improving performance is another goal that isn’t very difficult to attain. Having the ability to enable/disable compression on any database is also beneficial and will add more protection against misuse. It’d also be interesting to attempt supporting compression without NTFS sparse files support. This, while much more complicated, would work on any system and not on NTFS alone.

As a bonus, it’s almost trivial to add encryption on top of the compression subsystem.

Any comments, ideas, feedback and/or constructive criticism are more than welcome.

September 18, 2011
Posted by Ashod Nakashian at 9:27 am
Code Snippet, Programming
Tagged with: Code, Compression, Database, Design, Optimization, Sqlite

Revisiting OLTP and OLAP

Aug 26, 2011

Databases are as ubiquitous as computers. They are often, erroneously, associated only with data-intensive, long-term storage solutions. Yet in reality they are used in one shape or form in most programs. From word processors to media players to web browsers, all depend on databases to realize some of their features. Not to mention e-commerce, flight booking, libraries and, of course, government intelligence data. Even when no feature requires a database per se, user configuration and settings are often stored in databases. Where previously Windows programs depended on the Registry and Unix programs on plain-text files, increasingly new products are utilizing the portable and vastly superior SQLite instead.


Each application of course has its own set of requirements, use-cases and patterns. Yet, surprisingly or otherwise, there are two main patterns that can be said to parent the rest: OLTP and OLAP, or Online Transaction Processing and Online Analytical Processing respectively. Consequently, database design and optimization depend heavily on our processing needs and models. (Here I’m not concerned with database engine design and architecture, but rather with the use-case patterns and processing models.) To get a better feel for each, let’s consider typical cases.

Online Transaction Processing (OLTP)

This type of application is chiefly characterized by performing real-time, localized, mission-critical operations. That is, the operations concerned must complete virtually immediately (hence real-time processing,) they typically involve a very small number of entries independent of the remaining data (hence localized processing,) and there is zero tolerance for data inconsistency, corruption or undefined states. The latter property is what requires transactional processing, such that either the operation completely succeeds with all its sub-parts or it completely fails, restoring the previous state of all involved entries.
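The all-or-nothing property can be illustrated without any database at all. The following is only an in-memory sketch: the account array, the `transfer` function and the "no negative balance" rule are all invented for the example, and a real engine would use journaling or logging rather than copying the state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ACCOUNTS 3  /* hypothetical fixed number of accounts */

/*
 * Move `amount` from one balance to another.
 * Either both balances change, or neither does.
 */
bool transfer(long balances[ACCOUNTS], int from, int to, long amount)
{
    assert(from >= 0 && from < ACCOUNTS && to >= 0 && to < ACCOUNTS);

    long snapshot[ACCOUNTS];
    memcpy(snapshot, balances, sizeof snapshot);      /* "begin": remember the previous state */

    balances[from] -= amount;                         /* first sub-part */
    balances[to]   += amount;                         /* second sub-part */

    if (amount < 0 || balances[from] < 0) {           /* a constraint was violated */
        memcpy(balances, snapshot, sizeof snapshot);  /* "rollback": restore everything */
        return false;
    }
    return true;                                      /* "commit" */
}
```

A failed transfer leaves every balance exactly as it was, which is the behavior an OLTP application relies on from its database.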

In OLTP the operations are mostly Insert and Update, with occasional Delete operations for housekeeping, although typically deletion is deferred for performance and/or for later data mining and reporting. The tables involved are typically normalized around the data entities.

Examples include online shopping, flight and medical data and custom data processing.

Online Analytical Processing (OLAP)

In contrast to OLTP, this type of application is characterized by performing offline, aggregate and low-priority operations. The operations involved here can be executed at low-demand hours, often on archive databases and secondary servers (hence offline processing,) they typically aggregate a very large amount of data to generate statistical data, reports or to find outliers (hence aggregation,) and, since they are processed offline, they are designed to have low priority and a low isolation level (read uncommitted). Since there are few or no side effects to failing, they are also designed to fail rather than potentially interfere with competing OLTP operations (if executed on the same database.)

OLAP operations are mostly Select operations with virtually no data modification (except for storing the results, if at all, typically in dedicated tables.) These operations not only aggregate a large number of entries, with heavy use of aggregate functions, but they typically join a significant number of tables.

Examples include reporting, billing, statistical analysis and historic data processing.

Similarities and Differences

Since the differences between these two patterns lie in the underlying database operations they perform, it’s no surprise that there isn’t a multitude of basic patterns. The two main operation types are read and modify. That is, the applications that mostly read data (OLAP) typically perform massive read operations, potentially across the complete data, with few or no modifications, and those that depend on heavy data modification (OLTP) care most about the integrity of the few entries they modify, with little or no interest in other data.

However, one must make note of the fact that modification implies reading. This is crucial for correctly appreciating the performance characteristics of the two patterns. The three data-modifying operations, Insert, Update and Delete, must all first perform Select operation(s) in one form or another. While this is more apparent for Update and Delete, Inserts with constraints must first validate that no constraint is violated. This often involves a lookup operation in the constraint index(es). Only in the most basic and simplest cases could an Insert operation be free of all lookup operations (when foreign key, unique or other constraints are completely absent.)

This doesn’t mean that OLTP operations are a superset of OLAP operations. To try and optimize first and foremost for OLAP with the hope that OLTP operations would naturally also run faster, thanks to the improved read operations that it could utilize, is a fallacy. The two are worlds apart and typically don’t have overlapping indexable data (more on this below.) Where they do share fields, indexes on them would most certainly be welcome to both. Unfortunately, that doesn’t materialize nearly as much as one would like.

Typical properties of OLTP vs. OLAP
	*Transactional Processing*	*Analytical Processing*
Data	Current and in-progress.	Retired, historic or archived.
Typical Operations	Insert, Update, Delete.	Select.
Function Types	Based on business requirements.	Based on data mining.
Normalization	Highly normalized, entity modeled.	Possibly denormalized to minimize joins.
Data Locality	Few tables, few entries.	Large aggregation across many tables.
Indexing Strategy	Limited to a few highly-selective fields.	Generous indexing depending on the queries.
Operation Duration	Typically very short, sub-millisecond range.	Extended over minutes and hours.
Caching Friendliness	Highly volatile data with strict persistence requirements.	Static data, very cache friendly.
Backup Requirement	Inconsistency and data loss may cost business.	All operations can be rerun, backup is redundant.
Query Complexity	Trivial to simple.	Complex to inhumane and unreadable.

Hybrids

The above descriptions are certainly not the only possibilities. A combination of both is more reasonable to expect in practice than the theoretical and somewhat idealized cases given above. One such case might include functionality that must aggregate several groups of data without suffering phantom reads. Such a requirement means not only that the aggregate query must run with a high isolation level, reducing parallelism, but also that indexes must be added so it finishes as fast as possible, freeing up the entries that concurrent transactions might require. The additional indexes would incur an unwelcome cost for data modification (which must update the indexes as well.)

Optimization

Due to the differences in the operations involved, and indeed in their requirements, optimization differs in each case.

OLTP

To maximize performance, OLTP operations would use highly selective conditions on normalized tables in queries that are completely indexed (rather than partially indexed). This will get us to the target data as fast as possible, or, in case of insertion, will verify the absence of collisions equally quickly. Typically the primary key is used, which itself may be an alias of the internal row-id.

Next we’d want to localize locks as much as possible to maximize concurrency and minimize collateral impact. Fine-grained lock hints such as row-level locks may be used. Normalization helps here by keeping the impact of the transactions as local as possible. In addition, we’d want to minimize the average row size to improve cache hits and avoid undue penalty. Once again, normalization does most of the work here too.

Finally, OLTP operations would want to minimize, if not avoid, indexes, triggers, views and any operation that doesn’t help it reach and update the data as fast as possible. To that end, indexes are only useful to reach the data fast. All others would simply slow down virtually all operations, save for any selects we may have.

OLAP

For OLAP on the other hand it’d be best to work with denormalized data, to minimize or avoid joins.

Unlike OLTP, it’s very hard to predict or prioritize the conditions most useful to OLAP. This is because the tables and fields of importance depend on the particular query in question. Indeed, different queries might depend on very different fields and therefore indexes. So indexing some field might be useful to only an isolated query and not others. For a billing system, not only the customer and orders are of importance, but the dates of orders and the billing period as well. Similarly, to generate sales reports, the queries involved would select on products and dates. Yet an inventory audit query might depend on a different set of fields. Hence, indexes are best decided based on the concrete queries in question.

To improve performance, advanced features such as Materialized Views (aka Indexed Views) may be utilized, although these are unfriendly to transactional processing.

Hybrids

From the above it’s quite clear that not only do OLTP and OLAP differ in their operations, but consequently optimizing their queries is apparently in conflict as well. The requirements for maximizing performance in each case are simply contradictory to one another. But often both types of processing are done in our products. How can we optimize them?

Optimization for both OLTP and OLAP

It must be said from the outset that optimizing for two contradictory sets of requirements can be a recipe for disaster. That is, we might end up with worse performance for most queries and even degrade the database design in the hope of improving performance. With hubris some might advocate indexing all fields in query conditions, creating an uninformed mess in their wake. In fact, since OLTP operations are inherently very short lived, the overhead of updating superfluous indexes will probably go unnoticed, yet a long-running OLAP operation might get a noticeable boost. From this the gulls in question might pat themselves on the back for a job well done, completely oblivious to the fact that the OLTP operation in question will probably run millions of times, each time incurring the slight cost of the index updates, negating the apparent speed boost. To avoid such scenarios, we must understand that a compromise is perhaps unavoidable and approach the problem methodically.
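The trap described above is simple arithmetic, which can be sketched as follows. The function name and all the figures used with it are hypothetical; real numbers would have to come from profiling.

```c
#include <assert.h>

/*
 * Hypothetical estimate of the net time gained, per reporting period, by
 * adding one index that helps OLAP queries but must be maintained by every
 * OLTP operation. A negative result means the accumulated OLTP overhead
 * outweighs the OLAP speed-up.
 */
double net_index_gain_seconds(double olap_runs, double seconds_saved_per_olap_run,
                              double oltp_runs, double seconds_added_per_oltp_run)
{
    assert(olap_runs >= 0 && oltp_runs >= 0);
    return olap_runs * seconds_saved_per_olap_run
         - oltp_runs * seconds_added_per_oltp_run;
}
```

With made-up figures: a report that runs once and finishes 60 seconds sooner, against 5 million transactions that each take 20 microseconds longer, comes out at roughly 60 - 100 = -40 seconds, a net loss even though each individual transaction barely changed.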

Here are some design patterns that are often employed to maximize performance across both types of queries.

Data Archiving

Perhaps the simplest solution is to avoid performing both OLTP and OLAP type operations on the same data-set. To achieve this, we’ll have to split the data over two sets of tables. One, highly normalized and optimized for OLTP and another, perhaps denormalized, OLAP-optimized set.

A data-retirement scheme will be used to move the retired data from the OLTP table(s) into the archive (OLAP) tables. This may be performed using a background process, or using triggers. The latter may not be desirable, as triggers would add latency to the OLTP queries; on the other hand, the cache hits and the avoidance of extra code and of scanning the OLTP tables might make it a good compromise.

Dynamic Indexing

A number of applications process data in batches. Each batch is processed in transactional fashion and, once all the data has been processed completely, reports and other aggregate functions are executed on the final set. Custom data processing is perhaps a good example here, where some data is processed (for example by means of transformation, conversion or other modification) and tracked by transactions in a database. The database accounts for every data item and its states as it is modified, with timestamps and warning/error codes where applicable.

Such a processing scheme has the advantage of allowing for an OLTP type optimization for the data processing stage until all the data is completely processed, whereupon OLAP-optimized indexes are dynamically added. The overhead of adding or modifying the indexes (yes, some might be dropped, while new ones added) might win us an overall significant amount of CPU time. Of course we’d need to profile and benchmark thoroughly to decide which indexes to remove and which to add.

Biting the Bullet: Conjoining Both

In some cases we need to aggregate both retired and archived data as well as current and in-transit data. If this scenario is of prime importance, then archiving the data for offline processing might simply be unhelpful, as we must also join the OLTP tables to get as-yet unarchived data. Also, we can’t work in stages and perform dynamic indexing since we have to work with both current and archived data.

For obvious reasons this is the worst of both worlds. In this case, we must very carefully balance two forces:

Indexes for optimum OLTP performance.
Indexes for the most important OLAP queries.

Here analytic profiling of the different operations, their execution pattern and respective use-cases is necessary. Once we get a rough idea of the relative importance of our database operations, we can start collecting empirical data using benchmarks to evaluate the different approaches. Since the database design, indexes and other attributes must be shared among all database operations, we must choose the minimum set of indexes etc. that give the maximum performance across all operations.

The difficulty is of course in finding this optimum point, as each index helpful to the OLAP queries, but unnecessary to the OLTP ones, is an overhead for the high-rate transactions. On the other hand, strategic indexes could boost the performance of aggregate queries by orders of magnitude.

Here one is bound to spend a good amount of time using database profiling, statistical analysis and benchmarking. (In a future post I’ll address this topic head-on.) The downside of all this is that once the priorities of our OLAP queries change, then our particular database design will perhaps be outdated too.

Conclusion

The requirements of OLTP and OLAP are as different as are their use-cases. Optimizing for both is a very tricky business, but sometimes unavoidable. If we must maximize performance, we must invest a considerable amount of time designing a well-balanced database schema and, using extensive profiling and benchmarking tools combined with analytic methods, only then can we decide on the design that maximizes performance for our particular needs.

(Anything missing? If you have something interesting on the topic to share, please don’t keep it to yourself.)