Sep 182011

Summary (TL;DR)

This an experimental mod of Sqlite with built-in online compression support. Design and implementation are discussed, limitation and benchmarks provided and source code as well as prebuilt DLL are included. Use the TOC to jump to the topic of interest.


Both Sqlite and MySql support compressed (and encrypted) databases. Well, more or less. Sqlite’s support is limited to read-only databases that are compressed offline, while MySql’s support is limited to compressing strings (as far as I can tell.)

While working on WikiDesk, a Wikipedia browser project, I knew the database could easily grow to 100s of gigabytes. The database of choice here is Sqlite because of it’s compactness and mobility. The English Wikipedia dump is already in the range of 100s to 1000s of gigs (depending on the dump type.) WikiDesk not only supports different Wikipedia languages, but also different projects, such as Wikinews, Wikibooks and Wiktionary, among many others in all available languages, all in the same database. Theoretically, one can import all possible Wiki content into a single database.

The opportunity of compressing this highly-redundant wiki-code mixed with Unicode text was pretty much obvious. So it was reasonable to assume others must have had a similar case and added compression support to Sqlite. My search only yielded the aforementioned cases.

A part of me was happy to have found no precedent project. I was more than happy to roll-up my sleeves and get to hacking.

Design Survey

There are many ways to go about designing a compressed database file. My main purpose, however, was to have fully-transparent, online and realtime compression support. So the design must accommodate updates and deletions as well as any other modify operation supported by Sqlite.

An obvious approach is the one used by MySql, namely to compress the fields independently. This is simple and relatively speaking straight forward. However it’d mean that LIKE couldn’t be used on compressed string fields. Collation and sorting and other features would be absent as well. In fact the fields in question couldn’t be TEXT at all. In addition, one had to explicitly compress fields, remember which is compressed and remember to uncompress before using them. Very limited I thought and probably wouldn’t be worth the effort. Another approach is to do this on a low level, such that it’d be transparent to the caller. Such an extension to Sqlite exists but this will not yield much gain on small fields. I suspect NTFS compression would give better results.

NTFS has built-in compression support. It was well worth the effort of testing it. On an English SimpleWiki dump I could compress the database file down to about 57% of its original size (see benchmarks below.) Pretty decent. However I couldn’t control it at all. I couldn’t set the chunk size, compression level or anything save for enabling and disabling it. In addition, the user could disable it and lose all the benefits. Database-level compression is more promising. A similar result can be achieved using FuseCompress or compFUSEd (on Linux), albeit, the user must install such a filesystem first.

A major problem with database files, as far as online compression is concerned, is that the database logical-structure typically stores pointers to file offsets, such that there is a one-to-one mapping between the physical and logical-structures. This is reasonable as the database is really a large and complex datastructure on disk (as opposed to memory.) The btree or rtree nodes are typically page indexes, where all pages have a predefined, database-wide fixed size. Disrupting this structure would render the file corrupted. The purpose of the fixed-size pages is to simplify the allocation and management of space. This scheme is also used by memory and disk-managers alike.

If we compress a page in the database, the page would now contain two regions: data and free-space. To utilize the free-space, we could write a portion of the next page in the free-space, and the remaining in the next page, and so on for all pages. But then we’d have to keep track of each page’s fragments somehow. To avoid that, we can leave the free-space unused, but then we’d get no net saved disk space, as the free-space would still be allocated on disk.

I could store the new indexes and offsets in some allocation table appended to the file. But I’d have to do a lot of data moving, reallocation, (de)fragmentation and whatnot just to keep track of the free ranges and so on. Obviously this approach was pretty complicated and would take much more involved design and coding. Also, Sqlite page-sizes are multiple of disk sector size for atomicity. I had to be thoroughly familiar with the Sqlite design and implementation to embark on such a largish project, if I wanted it finished and working.

The ‘be lazy’ motto seems to work well for programmers who are efficiency-oriented and hate repetitive and error-prone work. What would be the simplest approach that could work? Going back to NTFS one could learn a lesson or two on transparent compression. The secret is that NTFS can simply allocate any free inode on the disk, write the compressed data to it and update the index table. Inodes are linked lists, so it is very easy to insert/remove and modify the chain. Files, on the other hand, are arrays of bytes abstracted from the disk structure. Moving bits around in an array is much more complicated and time consuming than updating nodes in a linked-list.

What is needed is the advantage of a file-system applied on the level of files.

What if we could tell the file-system that these free-space regions of the file are really unused? NTFS supports sparse files in addition to compressed files. This could be used to our advantage. All we’d have to do is mark the free-space in each page as unused and the file-system will make them available to other files on the disk, reducing the net used disk space of the database.

The Design

Sqlite supports pages of 512-65535 bytes long. Since we can’t break a single page, the smallest compression unit must be at least 64 Kbyte long. In addition, the compression-unit of NTFS compression seems to be also 64 Kbytes. This means that a sparse range must be at least as large as a compression-unit to be deallocated from disk and marked as free. This puts a clear limitation on the amount of saving we can achieve using this design; Compression won’t save any disk space unless it reduces the size in multiples of 64 Kbytes. A multiple of 64 Kbytes is used as the compression unit, internally called a chunk. Indeed, a chunk size of 64 Kbytes would be totally useless as there could be no saving at all.

When data is written it’s first written into a memory buffer. This buffer is used to track changes to the chunk, it’s offset in the file and use to compress the data. When the chunk needs flushing the data is first compressed and the compressed data written to the chunk offset. The remainder of the chunk is marked as a sparse region. NTFS deallocates any naturally-aligned compression units that are completely sparse. Partially written units are physically allocated on disk and 0-valued bytes are written to disk.

When reading data, the complete chunk of the requested byte-offset is read, decompressed and from the buffered data the requested bytes copied back to the caller. The sparse bytes are transparently read-in as 0-valued bytes. This is done by NTFS and relieves us from tracking sparse regions.

Initially very fast compression libraries were used to avoid sacrificing too much performance. FastLz, Lz4 and MiniLzo were tested but the results weren’t very promising, compression-wise. As such the current build uses Zlib.


The compression mod is written as a VFS Shim. This has the advantage of avoiding any modifications to the Sqlite code base.

Enabling compression must be done before opening any database files. A single function is defined as follows:

int sqlite3_compress(
    int trace,
    int compressionLevel

trace can be a value between 0 and 7. When 0 tracing is disabled, larger values enable tracing of increasingly lower-level operations. Trace logs are written to stderr. -1 for default.

compressionLevel can be a value between 1 and 9, where 1 gives the fastest performance at the expense of compression ratio and 9 gives the best compression at the expense of performance. -1 for default, which is typically level-6.

To enable compression this function is simply called before calling sqlite3_open. Compression level may be changed between runs, however unless a chunk is modified, the data will not be recompressed with the new level.

Only the main database is compressed. The journal or any other temporary files aren’t compressed.


Besides the fact that the code is in an experimental state, there are some things unsupported or even unsupportable by this mod. First and foremost only this mod can read compressed databases. The original Sqlite will declare compressed databases corrupted. However, this mod can and should detect uncompressed databases and disables compression silently (but use at your own risk.)

Since NTFS sparse file support is the key to achieving compression, the mod is literally useless on non-NTFS systems.

Sqlite is known to be quite resilient in the face of file corruption. This can no longer be supported with the same level as it is with the official release. In addition, corruptions would destroy much more data than a single page. With the compression library and the new code also comes the increased risk of crashing or being unstable.

Of the untested and probably unsupported features of Sqlite are:

  • Online database backup.
  • Multiprocess read/write.
  • Vacuum.
  • Data recovery.
  • Shell and 3rd-party tools.

Performance wise, there is virtually no caching implemented beyond the current chunk. This is bare-bone caching and there is a lot of room for performance improvements.


An import of an English SimpleWiki dump was used as benchmark. The main table holds an auto-increment index, timestamp, the page title and the page contents (both Unicode).

256 Kbyte Chunks and Level-6 Compression (sizes in KBytes)
OriginalSqlite Compressed
NTFS Normal204,438 (100%)73,296 (35.85%)
NTFS Compressed117,460 (57.45%)57,431 (28.09%)

1024 Kbyte Chunks and Level-9 Compression (sizes in KBytes)
OriginalSqlite Compressed
NTFS Normal204,438 (100%)67,712 (33.12%)
NTFS Compressed117,460 (57.45%)66,220 (32.39%)

It’s quite obvious that the savings with the modified Sqlite are substantial as compared to NTFS compression on the original file. Interestingly, NTFS compression when applied on a compressed file still yields gains. This is because of inefficiencies of the Zlib (deflate) compression (which is less so for level-6 than 9) and because NTFS can deallocate at the level of clusters, which are 4096 bytes, as opposed to the sparse method’s compression-unit of 64 Kbytes. Since the free-regions are written as zero-bytes and they aren’t deallocated unless a complete 64 Kbyte unit is completely zeroed out, it seems reasonable to assume NTFS compression is crunching these zero-padded regions and deallocating them as it’s unit is only 4096 bytes.

It should also be noted that while statistically we should get better compression with larger chunk sizes and higher compression levels, this isn’t linear. In fact, increasing the chunk size may lead to reduced net gains in file size due to the 64 Kbyte compression-unit of NTFS. That is, if two chunks could each save a single unit (64 Kbytes,) doubling the chunk size (such that both would be compressed together as one chunk) might not be able to save 128 Kbytes, in which case the savings would be reduced from two units to a single, resulting in a 64 Kbyte larger file than we had with the original chunk-size. This heavily depends on both the data and the compression, of course.


A synthetic test done using generated text from an alphabet consisting of alpha-numerical plus symbol with random lengths of <1MB were done. Zlib seems to perform slowly on this random data (although the number of possible codes is small.) Chunk size of 256 Kbytes and compression-level of 6 was used. 50 random rows are generated and inserted with incremental Ids (two-column table,) the 50 rows are selected using the Ids and the texts compared to the original, new texts are generated with new lengths, this time of length <2MB and the rows updated. Again the 50 rows are selected by Id and compared to the updated-originals. The resultant database file is 50,686 Kbytes.

The original Sqlite code run the test in 13.3 seconds, while using default compression and no tracing (to avoid any overheads) the same test finished in 64.7 seconds (4.86x slower) resulting in a 41,184 KByte file. Both tests ran on the same generated data. The file was on a RAMDisk to minimize disk overhead.

Considering that the data was random and synthetic and insert/update rate was equal to select rates, the results are reasonable. In practice, reads are typically more frequent than writes. With proper caching this should reduce the performance overhead significantly.


The code holds the same copyright claims as Sqlite, namely none. The code is experimental. Use it at your own risk.

Download the code and prebuilt DLL. This sqlite3.dll is version amalgamation created with the default settings/flags from the amalgamation created from original sources by the original configure and make files. The compression code is added and it’s built using VS2010 Sp1 and statically liked to the runtime libraries, as such it has no dependencies.


To build the code, first download a recent Sqlite version. The amalgamation is perfect. The latest Zlib must also be downloaded and built.

Add the Zlib headers to the include path, copy the vfs_compress.c file next to sqlite sources and build. Next, build sqlite3.c amalgamation (or the original sources) and link the binaries of sqlite3, vfs_compress and Zlib to create the executable.

Future Plans

A good percentage of the official Sqlite tests pass successfully. But the corruption and format-validating tests unsurprisingly fail. Increasing the supported cases is a prime goal at this point. Getting the mod to “stable with known-limitation” status would be a major milestone. Improving performance is another goal that isn’t very difficult to attain. Having the ability to enable/disable compression on any database is also beneficial and will add more protection against misuse. It’d also be interesting to attempt supporting compression without NTFS sparse files support. This, while much more complicated, would work on any system and not on NTFS alone.

As a bonus, it’s almost trivial to add encryption on top of the compression subsystem.

Any comments, ideas, feedback and/or constructive criticism are more than welcome.

Aug 262011

Databases are as ubiquitous as computers. They are often, erroneously, attributed to data-intensive, long-term storage solutions. Yet in reality they are used in one shape or form in most programs. From word processors to media players to web-browsers. All depend on databases to realize some of their features. This, not mentioning, e-commerce, flight booking, libraries and, of course, government intelligence data. Even when no feature requires a database per-se, user configuration and settings are often stored in databases. Where previously Windows programs depended on the Registry and Unix programs on plain-text files, increasingly new products are utilizing the portable and vastly superior SQLite instead.

Image via Wikipedia

Each application of course has its own set of requirements, use-cases and patterns. Yet, surprisingly or otherwise, there are two main patterns that can be said to parent the rest: OLTP and OLAP. Or, Online Transaction Processing and Online Analytical Processing respectively. Consequently, database design and optimization depends heavily on our processing needs and models. (Here I’m not concerned with database engine design and architecture, rather on the use-case patterns and processing models.) To get a better feel of each, let’s consider typical cases for each.

Online Transaction Processing (OLTP)

This type of applications are chiefly characterized by performing real-time, localized, mission-critical operations. That is, the operations concerned must complete virtually immediately (hence real-time processing,) they typically involve a very small number of entries independent of the remaining data (hence localized processing,) and there is zero-tolerance for data inconsistency, corruption or undefined states. The latter property is what requires transactional processing such that either the operation completely succeeds with all its sub-parts or it completely fails restoring the previous state of all involved entries.

In OLTP the operations are mostly that of Insert and Update, with occasional Delete operations for housekeeping, although typically deletion is differed for performance and/or for later data mining and reporting. The tables involved are typically normalized around the data entities.

Examples include online shopping, flight and medical data and custom data processing.

Online Analytical Processing (OLAP)

In contrast to OLTP, this type of applications are characterized by performing offline, aggregate and low-priority operations. The operations involved here can be executed at low-demand hours, often on archive databases and secondary servers (hence offline processing,) they typically aggregate a very large amount of data to generate statistical data, reports or to find outliers (hence aggregation,) and, since they are offline processing, they are designed to have low-priority, low isolation level (read uncommitted) and, since there is little or no side-effects to failing, they are designed to fail rather than potentially interfere with competing OLTP operations (if executed on the same database.)

OLAP operations are mostly Select operations with virtually no data modification (except for storing the results, if at all, typically in dedicated tables.) These operations not only aggregate large number of entries, with heavy use of aggregate functions, but they typically join a significant number of tables.

Examples include reporting, billing, statistical analysis and historic data processing.

Similarities and Differences

Since the differences between these two patterns lies in the underlying database operations they perform, it’s no surprise that there aren’t a multitude of basic patterns. The two main operation types are that of read and modify. That is, the applications that mostly read data (OLAP) typically perform massive read operations potentially across the complete data with few or no modification, and those that depend on heavy data modification care most about the integrity of the few entries they modify with little or no interest in other data.

However, one must make note of the fact that modification implies reading. This is crucial for correctly appreciating the performance characteristics of the two patterns. The three data-modifying operations, Insert, Update and Delete, all ought first perform Select operation(s) in one form or another. While this is more apparent for Update and Delete, Inserts with constraints must first validate that no constraint is violated. This often involves a lookup operation in the constraint index(es). Only in the most basic and simplest cases could an Insert operation be free of all lookup operations (when foreign key, unique or other constraints are completely absent.)

This doesn’t mean that OLTP operations are a superset of OLAP operations. To try and optimize first and foremost for OLAP with the hope that OLTP operations would naturally also run faster, thanks to the improved read operations that it could utilize, is a fallacy. The two are worlds apart and typically don’t have overlapping indexable data (more on this below.) Where they do share fields, indexes on them would most certainly be welcome to both. Unfortunately, that doesn’t materialize nearly as much as one would like.

Typical properties of OLTP vs. OLAP
Transactional ProcessingAnalytical Processing
DataCurrent and in-progress.Retired, historic or archived.
Typical OperationsInsert, Update, Delete.Select.
Function TypesBased on business requirements.Based on data mining.
NormalizationHighly normalized, entity modeled.Possibly denormalized to minimize joins.
Data LocalityFew tables, few entries.Large aggregation across many tables.
Indexing StrategyLimited to a few highly-selective fields.Generous indexing depending on the queries.
Operation DurationTypically very short, sub-millisecond range.Extended over minutes and hours.
Caching FriendlinessHighly volatile data with strict persistence requirements.Static data, very cache friendly.
Backup RequirementInconsistency and data loss may cost business.All operations can be rerun, backup is redundant.
Query ComplexityTrivial to simple.Complex to inhumane and unreadable.


The above descriptions are certainly not the only possibilities. A combination of both is more reasonable to expect in practice rather than the theoretical and somewhat idealized cases given above. Some such case might include functionality that must aggregate several groups of data without suffering phantom reads. Such a requirement means that not only the aggregate query must run with a high isolation level, reducing parallelism, but that it must also add indexes to finish as fast as possible to free up the entries that concurrent transactions might require. The additional indexes would incur unwelcome cost for data modification (which must also update the indexes as well.)


Due to the differences in the operations involved, and indeed their requirements, optimization in each case is different than the other.


To maximize performance, OLTP operations would use highly selective conditions on normalized tables in queries that are completely indexed (rather than partially indexed). This will get us to the target data as fast as possible, or, in case of insertion, will verify the absence of collisions equally quickly. Typically the primary key is used, which itself may be an alias of the internal row-id.

Next we’d want to localize locks as much as possible to maximize concurrency and minimize collateral impact. Fine grained lock hints such as row-level locks may be used. Normalization helps here by further containing the impact of the transactions as local as possible. In addition, we’d want to minimize the average row-size to improve cache-hits and avoid undue penalty. Once again, Normalization does most of the work here too.

Finally, OLTP operations would want to minimize, if not avoid, indexes, triggers, views and any operation that doesn’t help it reach and update the data as fast as possible. To that end, indexes are only useful to reach the data fast. All others would simply slow down virtually all operations, save for any selects we may have.


For OLAP on the other hand it’d be best to work with denormalized data, to minimize or avoid joins.

Unlike OLTP, it’s very hard to predict or prioritize the conditions most useful to OLAP. This is because depending on the particular query in question, the tables and fields of importance are decided. Indeed, different queries might depend on very different fields and therefore indexes. So indexing some field might be useful to only an isolated query and not others. For a billing system not only the customer and orders are of importance, but the dates of orders and the billing period as well. Similarly, to generate sales reports, the queries involved would select on products and dates. Yet an inventory audit query might depend on a different set of fields. Hence, indexes are best decided based on the concrete queries in question.

To improve performance advanced features such as Materialized Views (aka Indexed Views) may be utilized, which are unfriendly to transactional processing.


From the above it’s quite clear that not only do OLTP and OLAP differ in their operations, but consequently optimizing their queries are apparently in conflict with each other. The requirements for maximizing performance in each case is simply contradictory to one another. But often both types of processing are done in our products. How can we optimize them?

Optimization for both OLTP and OLAP

It must be said from the outset that optimizing for two contradictory set of requirements can be a recipe for disaster. That is, we might end up with worse performance for most queries and even degrade the database design in hope of improving performance. With hubris some might advocate indexing all fields in query conditions, creating an uninformed mess in their wake. In fact, since OLTP operations are inherently very short lived, the overhead of updating superfluous indexes will probably go unnoticed, yet a long-running OLAP operations might get a noticeable boost that the OLAP queries get. From this the gulls in question might pat themselves for a job well done, completely oblivious to the fact that the OLTP operation in question will probably run millions of times, each time incurring the slight cost of the index updates, negating the apparent speed boost. To avoid such scenarios, we must understand that a case of compromise is perhaps unavoidable and approach the problem methodically.

Here are some design patterns that are often employed to maximize performance across both types of queries.

Data Archiving

Perhaps the simplest solution is to avoid performing both OLTP and OLAP type operations on the same data-set. To achieve this, we’ll have to split the data over two sets of tables. One, highly normalized and optimized for OLTP and another, perhaps denormalized, OLAP-optimized set.

A data-retirement scheme will be used to move the retired data from the OLTP table(s) into the archive (OLAP) tables. This may be performed using a background process, or using triggers. The latter may not be desirable as triggers would add latency to the OLTP queries, however, on the other hand, the cache hits and the avoidance of extra code and scanning of the OLTP tables might be a good compromise.

Dynamic Indexing

A number of application process data in batches. Each batch is processed in transactional fashion and once all the data has been processed completely, reports and other aggregate functions are executed on the final set. Custom data processing is perhaps a good example here, where some data is processed (for example by means of transformation, conversion or other modification) and tracked by transcations in a database. The database accounts for every data item, its states as it is modified with timestamps and warning/error codes where applicable.

Such a processing scheme has the advantage of allowing for an OLTP type optimization for the data processing stage until all the data is completely processed, whereupon OLAP-optimized indexes are dynamically added. The overhead of adding or modifying the indexes (yes, some might be dropped, while new ones added) might win us an overall significant amount of CPU time. Of course we’d need to profile and benchmark thoroughly to decide which indexes to remove and which to add.

Biting the Bullet: Conjoining Both

In some cases we need to aggregate both retired and archived data as well as current and in-transit data. If this scenario is of prime importance, then archiving the data for offline processing might simply be unhelpful, as we must also join the OLTP tables to get as-yet unarchived data. Also, we can’t work in stages and perform dynamic indexing since we have to work with both current and archived data.

For obvious reasons this is the worst of both worlds. In this case, we must very carefully balance two forces:

  1. Indexes for optimum OLTP performance.
  2. Indexes for the most important OLAP queries.

Here analytic profiling of the different operations, their execution pattern and respective use-cases is necessary. Once we get a rough idea of the relative importance of our database operations, we can start collecting empirical data using benchmarks to evaluate the different approaches. Since the database design, indexes and other attributes must be shared among all database operations, we must choose the minimum set of indexes etc. that give the maximum performance across all operations.

The difficultly is of course in finding this optimum point. As each index helpful to the OLAP queries, but unnecessary to the OLTP ones, is an overhead for the high-rate transactions. On the other hand, strategic indexes could boost the performance of aggregate queries by orders of magnitude.

Here one is bound to spend a good amount of time using database profiling, statistical analysis and benchmarking. (In a future post I’ll address this topic head-on.) The downside of all this is that once the priorities of our OLAP queries change, then our particular database design will perhaps be outdated too.


The requirements of OLTP and OLAP are as different as are their use-cases. Optimizing for both is a very tricky business, but sometimes unavoidable. If we must maximize performance, we must invest a considerable amount of time designing a well-balanced database schema and, using extensive profiling and benchmarking tools combined with analytic methods, only then can we decide on the design that maximizes performance for our particular needs.

(Anything missing? If you have something interesting on the topic to share, please don’t keep it to yourself.)

Related articles
May 302011

The story I’m about to tell is the worst case of leaky abstraction that I’ve encountered and had to resolve. Actually, it’s the most profound example that I know of. Profound here is used in a negative sense, of course. This isn’t the performance issues Joel brings as examples to leaky abstraction. Nor is it a case of getting a network exception while working with files, because the folder is now mapped to a network drive. Nor a case of getting out-of-disk-space error while writing to a file-backed memory block. This is a case of frustration and a very hard-to-catch defect. A case of a project that came too close to failing altogether, had we not figured it out in time.

http-Request from browser via webserver and back

Image via Wikipedia


Our flagship project worked with 3rd party servers which for technical reasons had to be installed on the same machine as our product. This was an obvious shortcoming and had to change. The requirements came in and we were asked to move the interfaces with all 3rd party servers into lightweight distributed processes. These new processes were dubbed Child processes, while the central process, well, Central. A Child would run on a different machine on the network, communicating with the Central process to get commands, execute them and return the results.

The design was straightforward: all processes would be running as services listening on some configurable port. The commands were simple application-level message objects, each with its own type, serialized on the wire. The operations were synchronous and we needn’t progress updates or heartbeats. We left the network design synchronous as well, for simplicity.

The developer who was assigned the communication-layer had background in web-services and proposed to use standard HTTP protocol. We thought about it and declared that while it would have some small overhead, the simplicity of reusing some library would be a plus. After all, HTTP has data recovery and is a standard protocol. And if we really cared about overhead, we’d use UDP which has no duplicate detection, data recovery or even ordered packet transmission. Plus, the developer who’d work on this feature was comfortable with HTTP. So why not?

As it turns out, HTTP was the single worst decision made on this particular project.

Since we were now transmitting potentially sensitive data, the requirements were amended to include data encryption to protect our customers’ data from network sniffers. We used a standard asymmetric encryption for extra security. This meant that we had to generate a pair of public and private keys each time we connected. We devised a protocol to communicate the key the Child must have using a symmetric encryption algorithm. We were confident this was secure enough for our needs and it wasn’t overly complicated to implement.


The project was complete when I took the product for a final set of developer white-box testing. This is something I learned to do before shipping any product, as I do have the responsibility of designing the features, I also feel responsible for looking under the hood in case there is some potential mechanical or technical issue. Much like your car mechanic would do before you go on an off-road trip.

That’s when things started to fall apart. All worked fine, except every now and then I’d get errors and the Child would disconnect. Inspecting the logs showed data-encryption exceptions. The deciphering function was failing. Every single developer who run the code verified that it worked fine without any problems whatsoever. I asked them to pay attention to this issue. They came back saying all was perfectly fine. It was only on my machine!

Mind you, I learned not to blame before I eliminate every possible culprit. And the prime suspect is, always, a bug. A developer error. So I started sniffing around. Visiting the design back and forth. Nothing made sense of the issue. The code works, almost all the time. Then it fails. Reconnect again, and it works fine… until it fails again.

Troubleshooting this issue wasn’t fun, precisely because it wasn’t fruitful. No amount of debugging helped, or, in fact, could ever help. The puzzle had to be solved by reason. Experimentation showed as much as I had already gathered from the logs. Still, I tried different scenarios. One thing was for sure, you couldn’t tell when it’ll fail next.

The Leak

Remember that HTTP is a connectionless protocol. That is, it’s designed to communicate a single request and its response and disconnect. This is the typical scenario. It holds no connections and no states, therefore, it has no session. On the web, sessions are realized by the HTTP server. An HTTP server would typically create some unique key upon login or first-request missing a key, and it would track all subsequent requests by getting the said key either in the URL or using cookies. In any event, even though a web-service may have support for sessions, the underlying protocol is still connectionless and stateless.

To improve performance, an afterthought of reusing connections was added. This is typically called Keep-Alive. The idea is that a flag is added to the HTTP header which tells the server not to close the connection immediately after responding to the request, anticipating further requests. This is reasonable, as a web-page typically loads multiple images and embedded items from the same server. The client and server supporting Keep-Alive reuse the same connection for several requests, until one of them closes the connection. What is most important in this scenario is that if either party doesn’t respect this hint, nothing would break. In fact, nothing would work any different, except, of course, for the extra connections and disconnections that would occur for each request.

Since the implementor of this feature was a savvy web developer, he always had this flag set. And so, as long as the connection wasn’t interrupted, or indeed, the underlying library we were using didn’t decide to close the connection on whim, all was well and we had no problems. However, when a new request went with a new connection, rather than an existing one, the Child’s server would accept a new socket, on a new port, rather than the previously open socket. This is what was happening on my test environment. Perhaps it was the fact that I was testing across VM images that triggered the disconnections. Anyway, this newly opened socket on the Child has no encryption details associated with it. It’s a brand-new connection. It should expect encryption key exchange. But due to implementation details, the request would have an ‘encrypted’ flag set and the Child wouldn’t mind that we negotiated no cryptographic keys. It’d go ahead and try to decipher the request, only, it couldn’t. Resulting in the logged encryption exception followed by disconnection.

Post Mortem

Once the issue was figured out, the solution was simple, albeit costly. The HTTP abstraction had leaked an ugly property that we had assumed abstracted away. At design time, we couldn’t care what protocol we used to carry our bits. Encryption was added almost as an afterthought. True that encryption does require a state. However looking at our code, the socket-level connection was abstracted by layers and layers of library code. In fact, all we had was a single static function which took a URL string for a request. We had serialized and encoded the request message in base-64 and appended to the URL, which contained the server hostname/ip and port; standard web request, really.

On the communication layer, we had this single URL construction and the request call. On the data layer, we had the encryption, serialization and data manipulation logic. On the application layer, well, there were no network details whatsoever. Most of the previous code which worked locally had remained the same, with the implementation changed to interface the new network layer. So in a sense the code evolved and adapted to its final form and it wasn’t anywhere near apparent that we had leaked a major problem into our code.

In hindsight, we should’ve taken matters into our hands and implement a session-based protocol directly. This would make sense because we’d be in complete control of all network matters. For one, with HTTP we couldn’t change the sockets to use async logic, nor could we change the buffer sizes and timeouts. Perhaps we didn’t need to, but considering the gigabytes/hour we expected to transfer, sooner or later we’d have to optimize and tune the system for performance. But, the developer assigned was inexperienced and we couldn’t afford the time overhead. Personally, I feared things would get too complicated for the assigned developer to handle. I let him pick the protocol he was most comfortable with. And that’s the real danger of leaky abstraction; everyone is tricked, including the experienced.

Indeed, we ended up rewriting the communication layer. First the HTTP code was replaced with plain sockets using TCP/IP. Next, sessions were added, such that disconnections were recoverable. That is, the data layer didn’t care whether communication was interrupted or not. We weren’t going to rely on the fact that we controlled the sockets. Disconnections were made irrelevant by design. And finally, our protocol required a strict sequence of initialization and handshake that insured correct state. Once the code was working as expected, we changed the sockets to use async interface for maximum performance.

Overall, we spent an extra 2 man/months and, as a result, the data+communication layer was sped up several times over. Still, this was one hell of a case of leaky abstraction.

Further Reading:

Many are asking why not use SSL? The answer is because HTTP was the wrong choice in the first place.
We weren’t building a web-service. This was a backend server communicating with the main product. We didn’t want to limit the protocol, features or extendability by the choice of communication details. SSL would resolve the encryption issue, but we’d have had to implement an HTTPS server. In addition, whatever application protocol we would eventually realize, it had to be connectionless. The encryption layer simply uncovered this design implication that we had overlooked, hence the leak in the communication protocol abstraction. At that point we didn’t have any control messages nor did we have requests that needed states, later we added both. In fact, we added a command to iterate over a large set of data, returning one item at a time. HTTP/S would’ve made this harder to implement, as the state would have had to be sent with each request. Control messages and heartbeats would’ve been pointless.
In short, HTTP gave us very little and the connectionless nature caused as a lot of trouble. We got rid of both. We exchanged a wrong solution with the right one, hardly reinventing the wheel, if you ask me.

QR Code Business Card