Ashod Nakashian

Oct 02 2011
 

I’ve just released Now Reading Redux 6.1 to the general public. NRR is a WordPress plugin for those who read and like to keep their libraries organized. The plugin tracks each book’s status as read, currently reading or yet to read. Books may be put on hold to be returned to later. Books may be reviewed and rated.

For those who don’t like sharing all their books with everyone, books may be marked as private rather than public. Private books don’t show on the public pages and only appear when the reader is logged in.

In this release a new graph is displayed on the Library page showing the books read per month over the past year. For a working example please see my library.

Get it from here. For issues and feature requests please go here.

Sep 27 2011
 

Haven’t we all written Fibonacci sequence generators, sorting and searching algorithms and a handful of data-structures while learning to program? What’s common in these problems is their simplicity. I almost said triviality, but there are nuances that the beginner will stumble over and, hopefully, learn from. (And isn’t that the whole point of the exercise anyway?)

Fibonacci is a great introduction to recursion. Sorting teaches us the fundamentals of algorithms, complexity and of course data-structures and searching. We learn that not all ways of organizing data are born equal. Linked lists have a leg over arrays some of the time, but their ranks are reversed at other times.
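
For illustration, a minimal sketch of the classic recursive Fibonacci: a fine first exercise in recursion, and an early taste of exponential blow-up once n grows.

/* A minimal sketch: the classic recursive Fibonacci. Calling fib(n)
 * spawns on the order of 1.6^n calls, recomputing the same subproblems
 * over and over; a first, very tangible lesson in complexity. */
#include <stdio.h>

static unsigned long long fib(unsigned n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

int main(void)
{
    for (unsigned n = 0; n <= 40; ++n)    /* watch it slow to a crawl */
        printf("fib(%u) = %llu\n", n, fib(n));
    return 0;
}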

This is how we learn about our tools, how to use them and when. It’s the equivalent of painting fences. Without these mental gymnastics we’d have to open up textbooks and look up “how to find duplicates in two sets”, or wonder why the data is sometimes shuffled around, but not always (hint: dictionaries typically don’t preserve order, unless you use one that explicitly guarantees it).

The answer to “why is our product all of a sudden too slow and customers are complaining now when it was lightning fast for years?” isn’t in any textbook. However, if one assumes the customer data might have doubled, a seasoned programmer would first check for a quadratic algorithm (or worse) somewhere in the code-path.

While there is no doubt, at least in my mind, that it’s absolutely necessary to work out all these contrived problems, from writing a 4-operation calculator with an expression parser to solving wick wack woe, I think we should include a whole other genre of problems in our training kits.

The Unsolvables

There is a group of problems that, as far as we know, have no efficient solutions. That is, the solution is known to be found only by brute force, or an approximation exists using heuristics (basically trial-and-error or some form of genetic algorithm.) There are the obvious usual suspects, the NP-Complete folk. But not only them. There are also algorithms that run in quadratic and higher polynomial time that aren’t practical beyond some modest input size.

There is no better way to get a very real feel of what it means for an algorithm to run in quadratic time than to actually write the code and see how you age as your machine churns away hopelessly, all the while another algorithm with a lower complexity has just finished.

Take Quick sort for example. Implementing the textbook algorithm will work swiftly, that is, until you feed it sorted data. Multiplication using the school method is simple and elegant, until you apply it to large numbers. Trying to optimize such a primitive algorithm is very instructive. It teaches you that it’s not about optimization; the complexity is inherent. Yet we don’t use Merge sort with its O(n log n) worst-case performance, and default to Quick sort with its O(n^2) worst-case characteristic. This is not at all obvious to a beginner, nor even to many experienced programmers. Theory and practice sometimes do part ways. And that’s yet another important lesson.
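
Here is a small sketch of that Quick sort pitfall, using a naive last-element pivot: feed it already-sorted data and every partition is maximally lopsided, so comparisons and recursion depth degrade to O(n^2).

/* Textbook Quick sort with a naive last-element pivot (Lomuto partition).
 * On random data it flies; on already-sorted data every partition leaves
 * one side empty, giving O(n^2) comparisons and O(n) recursion depth. */
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

void quicksort(int *v, long lo, long hi)
{
    if (lo >= hi)
        return;
    int pivot = v[hi];                  /* naive pivot choice */
    long i = lo;
    for (long j = lo; j < hi; ++j)      /* partition around the pivot */
        if (v[j] < pivot)
            swap_int(&v[i++], &v[j]);
    swap_int(&v[i], &v[hi]);
    quicksort(v, lo, i - 1);
    quicksort(v, i + 1, hi);            /* empty on ascending-sorted input */
}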

Without attempting to solve a travelling salesman problem or a knapsack problem we are in danger of never really understanding why complexity is important. Why some algorithms are hopeless and will get back to haunt you one day, while others seem to be stuck in their misery and can hardly be improved.

And what better way to understand how and, more importantly, why modern cryptography works than to try to factorize numbers? Searching for prime numbers is yet another long-standing problem, one where primality testing was only recently proved to be polynomial, and all of it ties directly into one-way functions.
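
To make the brute-force nature concrete, here is a small sketch of trial-division factorization; fine for a 12-digit number, hopeless for the hundreds-of-digits moduli modern cryptography relies on.

/* Naive trial-division factorization. Easy to write, and utterly hopeless
 * for numbers the size of RSA moduli; that asymmetry (easy to multiply,
 * hard to factor) is what one-way functions are about in practice. */
#include <stdio.h>

static void factorize(unsigned long long n)
{
    for (unsigned long long p = 2; p * p <= n; ++p)
        while (n % p == 0) {
            printf("%llu ", p);
            n /= p;
        }
    if (n > 1)
        printf("%llu", n);
    printf("\n");
}

int main(void)
{
    factorize(600851475143ULL);   /* a 12-digit example factors instantly */
    return 0;
}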

There is also another purpose to this exercise. It’s not obvious where the difficulty of these unsolved problems lies. At first sight, almost anyone presented with such a problem will come up with some solution of sorts. It’s not until much later, and with much mental effort, that one notices the error of their ways. It might be a hidden complexity introduced in their algorithm, unnoticed, that negated the gains they scored elsewhere. It might be that the obvious or straightforward solution misses a very simple, but equally crucial, case. Or it may be that their solution is broken in one way or another. Either way, it’s not enough to know about the problems and their status and move on. There is much to be learned from attempting to solve unsolved problems.

The Impossibles

But why stop there? Why stop at the travelling salesman or a variation of the knapsack problem? While we’re at it, let’s introduce non-computability and non-computable functions.

I’m sure these topics are very well studied in some grad schools. But the average CS undergrad would probably firmly believe that the fastest-growing functions are exponential. (I have yet to get a different answer during an interview.) Whatever happened to Busy Beavers? A non-computable function, the Busy Beaver function, actually grows faster than any computable function! Now try to beat that in the department of slow algorithms.

Conclusion

I think it would be a great service to our industry if, every once in a while, schools assigned an unsolvable problem. Send the students home with a travelling salesman problem or two and ask them to bring in the numbers the next week. It’d prove much more instructive to see the 5-city problem solved in seconds while the 25-city…

And who knows, we might just get a Dantzig moment! (Incidentally, he’s the inventor of Simplex, the single most useful algorithm in optimization.)

Sep 25 2011
 

Building a large and complex project such as WebKit is anything but straightforward. WebKit is the core of multiple browsers, most notably Safari and Chrome. Even the official instructions are sometimes lacking. This is mostly because such large projects are moving targets and it’s hard to keep track of all the changes in the requirements. Another issue is that most developers don’t set up build systems from scratch regularly; probably no more than once or twice during the lifetime of a project. Even the blogs, articles, forums and mailing lists that discuss building WebKit are almost completely outdated.

At any rate, these notes are up-to-date as of September 2011; I hope you find them helpful.

Background

WebKit can be built in different flavors. The differences are important, because one must setup the build system slightly differently for each flavor. There are essentially 4 different flavors: Apple Windows, WinCairo, Qt and GTK. All four require external libraries to be present for the build to succeed. I have no experience building the Qt and GTK flavors. Here I’ll cover Apple Windows and WinCairo. I do not cover building Chromium.

Build Environment

There are some common requirements, so first and foremost make sure these are set up correctly. You may follow the instructions described here. Unless you want to build using VS2005, ignore the instructions specific to VS2005.

Here is a summary:

  1. Install Cygwin. Use this downloader. Preferably install to C:\cygwin (I’ll assume that’s where it is). Never to a path with spaces! Make sure you install: svn, curl, unzip, make, perl, python, gcc, gcc-g++, gcc4, gcc4-g++, bison, flex, gperf.
  2. Install QuickTime SDK in the default location.
  3. Install DirectX SDK. However, if you’ll build with VS2005, don’t install anything newer than February 2010. It’s worth noting that the August 2007 version builds WebKit just fine (as of r95358 with default flags).
  4. Install Visual Studio 2005 or 2008 with all updates and Service Packs. For VS2005 please refer to the updates listed here. Personally, I suggest sticking with 2008.
  5. Install Windows SDK for Server 2003 R2 or newer. See below for more details.
  6. Download the Support Libraries from Apple. Copy the file (don’t extract, just copy the zip) into the WebKit source folder.
  7. Create WEBKITLIBRARIESDIR and WEBKITOUTPUTDIR environment variables. The values must be fully-qualified Windows paths and not Cygwin/unix paths. If you don’t want to change your system-wide path, skip this and see the script below.

Windows SDK

If you have multiple Windows SDKs (as is typically the case when installing multiple Visual Studio versions) then you will have to setup your target Visual Studio to use the SDK of choice. There are two ways to achieve this.

The first method is to change the BIN, INCLUDE and LIB directories from the Visual Studio Options dialog. From the Tools > Options menu, on the Projects and Solutions > VC++ Directories page, set the top entry of Executable files to the %SDK%\bin folder, put %SDK%\lib at the top of the Library files entries and %SDK%\include at the top of the Include files entries, where %SDK% is the installation path of the SDK. Note: This will change the Visual Studio paths for all instances. So don’t make this change if other projects depend on your current settings, unless you care more about WebKit, of course.

The second method is much less intrusive. We can change the environment variables in a shell instance and pass them on to Visual Studio. This is relatively easy and convenient, but requires scripting. In the Tools/Scripts folder of WebKit there is a script called pdevenv. This Perl script is responsible for creating a Windows CMD file which ultimately runs Visual Studio. The reason for this is that Bash can’t execute the batch files necessary for setting up Visual Studio’s environment.

Here is an example of a modified pdevenv script that sets up the SDK:

#!/usr/bin/perl -w

use strict;
use warnings;
use File::Temp qw/tempfile/;
use FindBin;
use lib $FindBin::Bin;
use webkitdirs;

my ($fh, $path) = tempfile(UNLINK => 0, SUFFIX => '.cmd') or die;

chomp(my $vcBin = `cygpath -w "$FindBin::Bin/../vcbin"`);
chomp(my $scriptsPath = `cygpath -w "$FindBin::Bin"`);

# Reverse the order of these if you prefer to use VS2008.
my $vsToolsVar;
if ($ENV{'VS80COMNTOOLS'}) {
    $vsToolsVar = "VS80COMNTOOLS";
} elsif ($ENV{'VS90COMNTOOLS'}) {
    $vsToolsVar = "VS90COMNTOOLS";
} else {
    print "*************************************************************\n";
    print "Cannot find Visual Studio tools dir.\n";
    print "Please ensure that \$VS80COMNTOOLS or \$VS90COMNTOOLS\n";
    print "is set to a valid location.\n";
    print "*************************************************************\n";
    die;
}

# Comment the following to see the environment variables used.
print $fh "\@echo off\n\n";

# Setup Visual Studio.
print $fh "call \"\%" . $vsToolsVar . "\%\\vsvars32.bat\"\n\n";
print $fh "set PATH=$vcBin;$scriptsPath;\%PATH\%\n\n";

# Setup the 2003 SDK. Don't call its own SetEnv as it doesn't detect VS2008 or newer.
print $fh "Set MSSdk=C:\\Program Files\\Microsoft Platform SDK for Windows Server 2003 R2\n\n";
print $fh "Set INETSDK=%MSSdk%\n\n";
print $fh "Set MSSdk=%MSSdk%\n\n";
print $fh "Set Mstools=%MSSdk%\n\n";
print $fh "Set Lib=%MSSdk%\\Lib;%Lib%\n\n";
print $fh "Set Include=%MSSdk%\\Include;%Include%\n\n";
print $fh "Set Path=%MSSdk%\\Bin;%MSSdk%\\Bin\\WinNT;%path%\n\n";

# Run Visual Studio.
print $fh "IF EXIST \"\%VSINSTALLDIR\%\\Common7\\IDE\\devenv.com\" (devenv.com /useenv " . join(" ", @ARGV) . ") ELSE ";
print $fh "VCExpress.exe /useenv " . join(" ", @ARGV) . "\n";

close $fh;
chmod 0755, $path;
chomp($path = `cygpath -w -s '$path'`);
exec("cmd /c \"call $path\"");

Note: If you already have SDK 6.0, 6.0A, 7.0, 7.0A or 7.1, you should be fine, but be warned: DirectX conflicts with the 7.x SDKs, so make sure the DirectX headers come after the Windows SDK headers. With 6.0 or 6.0A you should have no problems building.

Building with VS2008

  1. Install VS2008 with all Service Packs and updates. (The Express version needs a bit of a push to work. Please follow these instructions.)
  2. Run update-webkit and build-webkit at least once. Regardless of the outcome, delete the contents of WEBKITOUTPUTDIR folder, wherever it is.
  3. Run VS2008 and open C:\cygwin\home\{username}\WebKitTrunk\Source\WebKit\win\WebKit.vcproj\WebKit.sln. You’ll be prompted to convert the project files from VS2005 format to 2008. Convert and save-all.
  4. WebKit projects have no tolerance for warnings and they are set to the highest warning level. Since VS2008 does emit some superfluous warnings (such as not inlining a force-inline function), the build will fail with these settings. As such, we need to change them.
    Select every project that doesn’t end in “generated”, right-click and select Properties. In the Properties window, from the top, change the Configuration to “All Configurations”. Navigate to Configuration Properties/C++/General in the left tree and on the right side change the Warning Level to “Off: Turn Off All Warnings (/W0)” and Treat Warnings As Errors to “No”. Click OK to close this dialog.
    If you don’t see C++/General in the left tree, it means you selected a non-C++ project. You may do this one project at a time.
  5. The project files are now ready. Save all and exit Visual Studio.

Build Notes

  1. Disable Anti-Virus scanners. (Can you spell incredible?) Cygwin needs to simulate fork, which requires cloning process memory exactly. Many AV products add system hooks to scan processes in-memory. This disrupts the memory layout and DLL relocation, resulting in differences in the memory of forked processes and, ultimately, memory corruption. The issue may never happen even with an AV running. However, if you see fork, out-of-memory or virtual-memory allocation failures, try disabling the AV and try again. Some don’t remove their system hooks when temporarily disabled, so you’ll need to uninstall them completely.
  2. Don’t build the Trunk. WebKit is large, complex and fast moving. Unless you like testing the limits of your sanity, you may want to stay away from the Trunk, which can be broken and problem-laden. Instead, get a successfully-built and packaged source archive from the nightly builds. At least have one recent successful build from a nightly archive before attempting the Trunk.
  3. Place Cygwin/bin before others in Path. In the system path, place the Cygwin binary folder after the System folders but before anything else. Example: %SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;C:\cygwin\bin;. This will ensure that if you have Perl, SVN and other similar utilities installed outside of Cygwin, the Cygwin versions will take precedence. NOTE: This may break other products that rely on the non-Cygwin variants; in that case, you’ll have to perform this change in a build script (see below).
  4. Place WebKit sources in your Cygwin home directory. If you install Cygwin in the root, your home should be here: C:\cygwin\home\{username}. WebKit should be in a folder directly within your home. Example: C:\cygwin\home\{username}\webkit.
  5. Add the scripts folder to the path. The WebKit scripts must be reachable at all times, so make sure C:\cygwin\home\{username}\webkit\Tools\Scripts is prepended to the path.
  6. Don’t update the WinCairo code unless you want to build that flavor. Running update-webkit --wincairo changes enough in the support libraries to render the Apple Windows flavor broken. If this happens and you can’t build Apple Windows, simply delete the WebKitLibraries folder and run update-webkit (without --wincairo!).
  7. Run update-webkit at least once. Even if you don’t need to update the code, you must run this script at least once. Note that if you’ve downloaded a nightly-build package this script won’t update the code to the Trunk! It’ll only update the support libraries and other scripts.
  8. You may have to build twice before it works. The build almost always fails the first time, and again after deleting the WEBKITOUTPUTDIR folder, so the build script must be executed more than once to get a successful build.
  9. Look for the first failed project to fix errors. Since the WebKit build copies many source files as well as output binaries from one project to another, it’s vital that all dependencies of a sub-project succeed before it is built. That is, if a project fails, typically due to missing headers, identifiers etc., before panicking look for the first failed project and try to fix that. Most probably the errors are due to a previously broken dependency. This can get tricky, as some projects depend on external scripts that may have partially completed and stop reporting errors on subsequent builds. In such a case you won’t see failed dependencies and the sub-project in question will be the first to fail. To make sure this is the case, delete the WEBKITOUTPUTDIR folder and build again. If you get errors, note the first error you got and build again. Repeat this until you get the same error; that is the error you need to fix.
  10. Disable warnings as errors. If you get error C2220: warning treated as error - no 'object' file generated, then follow the instructions for disabling warnings as errors in the Building with VS2008 section above.

Here is a bash script to set up the environment, update and build. You must set your home folder name. Save this file in the webkit folder and name it build.

#! /bin/sh

echo
echo "Exporting environment varibles..."
export PATH=~/webkit/Tools/Scripts:$PATH
export WEBKITLIBRARIESDIR=C:\\cygwin\\home\\{username}\\webkit\\WebKitLibraries\\win
export WEBKITOUTPUTDIR=C:\\cygwin\\home\\{username}\\webkit\\WebKitBuild

echo
echo "Updating source tree..."
update-webkit

echo
echo "Building..."
build-webkit $*

Apple Windows

This flavor is the one used by Safari on Windows. The external libraries are copyrighted by Apple and aren’t redistributable. As such, the build system requires you to manually download an archive holding them from Apple after accepting the license agreement. This build, therefore, cannot be distributed on the internet with the external libraries. Having said that, one should at least try to get this flavor successfully built before attempting the others.

To build this flavor follow these steps:

  1. Run the Cygwin shell.
  2. cd webkit
  3. ./build

WinCairo

To build the WinCairo flavor one previously had to manually download and install the various libraries, often having to build them by hand or find prebuilt versions. Thanks to the efforts of volunteers, the update-webkit script now automates all that. If you have built the Apple Windows flavor, you should be able to build WinCairo using the following:

  1. Run the Cygwin shell.
  2. cd webkit
  3. ./build --wincairo

If the build fails, try running the build script again.

However, if you get “CF” identifier errors, consider updating the CF library manually. All public CF library identifiers and even filenames either begin or end with “CF”, so they are easy to spot. WebKit WinCairo uses a modified version of the original Apple sources called OpenCFLite. Download a recent version from the project’s page or get the trunk. You’ll have to bite the bullet and build the library yourself, update the headers and libs, and build WebKit again.

To build OpenCFLite, open the solution in VS2005 or 2008 (depending on which you’ll use for the WebKit build). Choose the Release configuration and build the CFLite project. Ignore the mountain of warnings and see whether the build succeeded. If all goes well, in the root of the OpenCFLite source folder you should see a dist folder. Copy the contents of dist into %WEBKITLIBRARIESDIR%\win, overwriting existing files.

Delete the %WEBKITOUTPUTDIR% folder and build WebKit WinCairo again. You should get a clean build.

Debugging

Now that you have a successful build, you may want to build the Debug version by passing the --debug switch to the build-webkit script. You can execute this script directly (without running the above build script or update-webkit) or you can modify the above build script to the same effect.

With the debug build we can use the Visual Studio debugger or the ultra-powerful WinDbg. In Visual Studio simply open the solution file from C:\cygwin\home\{username}\WebKitTrunk\Source\WebKit\win\WebKit.vcproj\WebKit.sln, select one of the runner test projects as the startup and hit F5.

Happy hacking and don’t forget, you can contribute by submitting bug reports and sending patches and fixes. Good luck.

References

I couldn’t have succeeded in building WebKit without the help of numerous developers who shared their experience on the web. Here is a list of the most helpful and relevant pages. Be forewarned however, they are mostly outdated!

Sep 18 2011
 

Summary (TL;DR)

This is an experimental mod of Sqlite with built-in online compression support. Design and implementation are discussed, limitations and benchmarks are provided, and source code as well as a prebuilt DLL are included. Use the TOC to jump to the topic of interest.

Background

Both Sqlite and MySql support compressed (and encrypted) databases. Well, more or less. Sqlite’s support is limited to read-only databases that are compressed offline, while MySql’s support is limited to compressing strings (as far as I can tell.)

While working on WikiDesk, a Wikipedia browser project, I knew the database could easily grow to 100s of gigabytes. The database of choice here is Sqlite because of its compactness and mobility. The English Wikipedia dump is already in the range of 100s to 1000s of gigs (depending on the dump type.) WikiDesk not only supports different Wikipedia languages, but also different projects, such as Wikinews, Wikibooks and Wiktionary, among many others in all available languages, all in the same database. Theoretically, one can import all possible Wiki content into a single database.

The opportunity of compressing this highly-redundant wiki-code mixed with Unicode text was pretty much obvious. So it was reasonable to assume others must have had a similar case and added compression support to Sqlite. My search only yielded the aforementioned cases.

A part of me was happy to have found no precedent project. I was more than happy to roll-up my sleeves and get to hacking.

Design Survey

There are many ways to go about designing a compressed database file. My main purpose, however, was to have fully-transparent, online and realtime compression support. So the design must accommodate updates and deletions as well as any other modify operation supported by Sqlite.

An obvious approach is the one used by MySql, namely to compress the fields independently. This is simple and relatively straightforward. However, it’d mean that LIKE couldn’t be used on compressed string fields. Collation, sorting and other features would be absent as well. In fact the fields in question couldn’t be TEXT at all. In addition, one would have to explicitly compress fields, remember which ones are compressed and remember to uncompress them before use. Very limited, I thought, and probably not worth the effort. Another approach is to do this on a low level, such that it’d be transparent to the caller. Such an extension to Sqlite exists, but it won’t yield much gain on small fields. I suspect NTFS compression would give better results.

NTFS has built-in compression support. It was well worth the effort of testing it. On an English SimpleWiki dump I could compress the database file down to about 57% of its original size (see benchmarks below.) Pretty decent. However I couldn’t control it at all. I couldn’t set the chunk size, compression level or anything save for enabling and disabling it. In addition, the user could disable it and lose all the benefits. Database-level compression is more promising. A similar result can be achieved using FuseCompress or compFUSEd (on Linux), albeit, the user must install such a filesystem first.

A major problem with database files, as far as online compression is concerned, is that the database logical-structure typically stores pointers to file offsets, such that there is a one-to-one mapping between the physical and logical structures. This is reasonable, as the database is really a large and complex data structure on disk (as opposed to memory.) The btree or rtree nodes are typically page indexes, where all pages have a predefined, database-wide fixed size. Disrupting this structure would render the file corrupted. The purpose of the fixed-size pages is to simplify the allocation and management of space. This scheme is also used by memory and disk managers alike.

If we compress a page in the database, the page would now contain two regions: data and free-space. To utilize the free-space, we could write a portion of the next page in the free-space, and the remaining in the next page, and so on for all pages. But then we’d have to keep track of each page’s fragments somehow. To avoid that, we can leave the free-space unused, but then we’d get no net saved disk space, as the free-space would still be allocated on disk.

I could store the new indexes and offsets in some allocation table appended to the file. But I’d have to do a lot of data moving, reallocation, (de)fragmentation and whatnot just to keep track of the free ranges and so on. Obviously this approach was pretty complicated and would take a much more involved design and coding effort. Also, Sqlite page sizes are multiples of the disk sector size for atomicity. I’d have to be thoroughly familiar with the Sqlite design and implementation to embark on such a largish project, if I wanted it finished and working.

The ‘be lazy’ motto seems to work well for programmers who are efficiency-oriented and hate repetitive and error-prone work. What would be the simplest approach that could work? Going back to NTFS one could learn a lesson or two on transparent compression. The secret is that NTFS can simply allocate any free inode on the disk, write the compressed data to it and update the index table. Inodes are linked lists, so it is very easy to insert/remove and modify the chain. Files, on the other hand, are arrays of bytes abstracted from the disk structure. Moving bits around in an array is much more complicated and time consuming than updating nodes in a linked-list.

What is needed is the advantage of a file-system applied on the level of files.

What if we could tell the file-system that these free-space regions of the file are really unused? NTFS supports sparse files in addition to compressed files. This could be used to our advantage. All we’d have to do is mark the free-space in each page as unused and the file-system will make them available to other files on the disk, reducing the net used disk space of the database.

The Design

Sqlite supports page sizes of 512 to 65536 bytes. Since we can’t break a single page, the smallest compression unit must be at least 64 Kbytes long. In addition, the compression unit of NTFS compression also seems to be 64 Kbytes. This means that a sparse range must be at least as large as a compression unit to be deallocated from disk and marked as free. This puts a clear limitation on the amount of saving we can achieve with this design; compression won’t save any disk space unless it reduces the size in multiples of 64 Kbytes. A multiple of 64 Kbytes is used as the compression unit, internally called a chunk. Indeed, a chunk size of 64 Kbytes would be totally useless, as there could be no saving at all.

When data is written, it’s first written into a memory buffer. This buffer is used to track changes to the chunk and its offset in the file, and is used to compress the data. When the chunk needs flushing, the data is first compressed and the compressed data is written at the chunk’s offset. The remainder of the chunk is marked as a sparse region. NTFS deallocates any naturally-aligned compression units that are completely sparse. Partially written units are physically allocated on disk and 0-valued bytes are written to disk.
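
Here is a rough sketch of that flush path, assuming Windows and Zlib (the function and buffer names are illustrative, not the actual mod’s code):

/* Sketch of flushing one chunk: deflate it, write the compressed bytes at
 * the chunk's file offset, then declare the tail of the chunk a zeroed
 * (sparse) range so NTFS can deallocate any fully-unused 64 KB units.
 * Assumes FSCTL_SET_SPARSE was already issued once on the file handle. */
#include <windows.h>
#include <winioctl.h>
#include <zlib.h>

#define CHUNK_SIZE (256 * 1024)

BOOL flush_chunk(HANDLE hFile, LONGLONG chunkOffset,
                 const unsigned char *chunk /* CHUNK_SIZE bytes */)
{
    static unsigned char out[CHUNK_SIZE + CHUNK_SIZE / 1000 + 64];
    uLongf outLen = sizeof(out);
    DWORD written, returned;
    OVERLAPPED ov = {0};
    FILE_ZERO_DATA_INFORMATION zero;

    if (compress2(out, &outLen, chunk, CHUNK_SIZE, 6) != Z_OK)
        return FALSE;

    /* Write the compressed data at the start of the chunk. */
    ov.Offset     = (DWORD)(chunkOffset & 0xFFFFFFFF);
    ov.OffsetHigh = (DWORD)(chunkOffset >> 32);
    if (!WriteFile(hFile, out, (DWORD)outLen, &written, &ov))
        return FALSE;

    /* Mark the remainder of the chunk as a zeroed (sparse) range; NTFS frees
     * any 64 KB compression units that fall entirely inside it. */
    zero.FileOffset.QuadPart      = chunkOffset + (LONGLONG)outLen;
    zero.BeyondFinalZero.QuadPart = chunkOffset + CHUNK_SIZE;
    return DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zero, sizeof(zero),
                           NULL, 0, &returned, NULL);
}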

When reading data, the complete chunk containing the requested byte offset is read and decompressed, and the requested bytes are copied back to the caller from the buffered data. The sparse bytes are transparently read in as 0-valued bytes. This is done by NTFS and relieves us from tracking sparse regions.

Initially very fast compression libraries were used to avoid sacrificing too much performance. FastLz, Lz4 and MiniLzo were tested but the results weren’t very promising, compression-wise. As such the current build uses Zlib.

Implementation

The compression mod is written as a VFS Shim. This has the advantage of avoiding any modifications to the Sqlite code base.
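
For readers unfamiliar with the VFS layer, the general shim pattern looks roughly like the sketch below (illustrative names; the actual vfs_compress.c additionally installs its own sqlite3_io_methods to intercept xRead/xWrite and do the chunked (de)compression):

/* A minimal pass-through VFS shim skeleton. It wraps the default VFS and
 * registers itself as the new default; a compressing shim would also
 * replace the io methods of the opened file to hook reads and writes. */
#include "sqlite3.h"

static sqlite3_vfs shim_vfs;        /* our VFS, a copy of the default */
static sqlite3_vfs *root_vfs = 0;   /* the wrapped (default) VFS */

static int shim_open(sqlite3_vfs *vfs, const char *zName,
                     sqlite3_file *pFile, int flags, int *pOutFlags)
{
    (void)vfs;
    /* Delegate to the real VFS; a compressing shim would swap
     * pFile->pMethods here to intercept the subsequent I/O. */
    return root_vfs->xOpen(root_vfs, zName, pFile, flags, pOutFlags);
}

int shim_register(void)
{
    root_vfs = sqlite3_vfs_find(0);     /* the current default VFS */
    if (!root_vfs)
        return SQLITE_ERROR;
    shim_vfs = *root_vfs;               /* inherit everything ... */
    shim_vfs.zName = "compress-shim";   /* ... except the name and xOpen */
    shim_vfs.xOpen = shim_open;
    return sqlite3_vfs_register(&shim_vfs, 1 /* make it the default */);
}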

Enabling compression must be done before opening any database files. A single function is defined as follows:

int sqlite3_compress(
    int trace,
    int compressionLevel
    );

trace can be a value between 0 and 7. When 0, tracing is disabled; larger values enable tracing of increasingly lower-level operations. Trace logs are written to stderr. Use -1 for the default.

compressionLevel can be a value between 1 and 9, where 1 gives the fastest performance at the expense of compression ratio and 9 gives the best compression at the expense of performance. Use -1 for the default, which is typically level 6.

To enable compression, this function is simply called before calling sqlite3_open. The compression level may be changed between runs; however, unless a chunk is modified, its data will not be recompressed with the new level.
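
A minimal usage sketch (hypothetical file and table names), assuming the DLL above or a build that links vfs_compress:

/* Enable compression, then open and use the database as usual.
 * Error handling is kept to a minimum for brevity. */
#include <stdio.h>
#include "sqlite3.h"

int main(void)
{
    sqlite3 *db = 0;

    /* Must be called before sqlite3_open(): no tracing, default level. */
    sqlite3_compress(0, -1);

    if (sqlite3_open("wiki.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS page(title TEXT, body TEXT);",
                 0, 0, 0);
    sqlite3_close(db);
    return 0;
}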

Only the main database is compressed. The journal or any other temporary files aren’t compressed.

Limitations

Besides the fact that the code is in an experimental state, there are some things unsupported or even unsupportable by this mod. First and foremost, only this mod can read compressed databases. The original Sqlite will declare compressed databases corrupted. However, this mod can and should detect uncompressed databases and disable compression silently (but use at your own risk.)

Since NTFS sparse file support is the key to achieving compression, the mod is literally useless on non-NTFS systems.

Sqlite is known to be quite resilient in the face of file corruption. This can no longer be supported at the same level as the official release. In addition, a corruption would destroy much more data than a single page. With the compression library and the new code also comes an increased risk of crashes or instability.

Of the untested and probably unsupported features of Sqlite are:

  • Online database backup.
  • Multiprocess read/write.
  • Vacuum.
  • Data recovery.
  • Shell and 3rd-party tools.

Performance-wise, there is virtually no caching implemented beyond the current chunk. This is bare-bones caching and there is a lot of room for performance improvements.

Benchmarks

An import of an English SimpleWiki dump was used as benchmark. The main table holds an auto-increment index, timestamp, the page title and the page contents (both Unicode).

256 Kbyte Chunks and Level-6 Compression (sizes in KBytes)

                  | Original Sqlite   | Compressed
  NTFS Normal     | 204,438 (100%)    | 73,296 (35.85%)
  NTFS Compressed | 117,460 (57.45%)  | 57,431 (28.09%)

1024 Kbyte Chunks and Level-9 Compression (sizes in KBytes)

                  | Original Sqlite   | Compressed
  NTFS Normal     | 204,438 (100%)    | 67,712 (33.12%)
  NTFS Compressed | 117,460 (57.45%)  | 66,220 (32.39%)

It’s quite obvious that the savings with the modified Sqlite are substantial compared to NTFS compression on the original file. Interestingly, NTFS compression applied to an already-compressed file still yields gains. This is because of inefficiencies of the Zlib (deflate) compression (which is less so for level-6 than 9) and because NTFS can deallocate at the level of clusters, which are 4096 bytes, as opposed to the sparse method’s compression unit of 64 Kbytes. Since the free regions are written as zero bytes and aren’t deallocated unless a complete 64 Kbyte unit is entirely zeroed out, it seems reasonable to assume NTFS compression is crunching these zero-padded regions and deallocating them, as its unit is only 4096 bytes.

It should also be noted that while statistically we should get better compression with larger chunk sizes and higher compression levels, this isn’t linear. In fact, increasing the chunk size may lead to reduced net gains in file size due to the 64 Kbyte compression-unit of NTFS. That is, if two chunks could each save a single unit (64 Kbytes,) doubling the chunk size (such that both would be compressed together as one chunk) might not be able to save 128 Kbytes, in which case the savings would be reduced from two units to a single, resulting in a 64 Kbyte larger file than we had with the original chunk-size. This heavily depends on both the data and the compression, of course.

Performance

A synthetic test was done using generated text from an alphabet consisting of alphanumeric characters plus symbols, in random lengths of <1MB. Zlib seems to perform slowly on this random data (although the number of possible codes is small.) A chunk size of 256 Kbytes and a compression level of 6 were used. 50 random rows are generated and inserted with incremental Ids (two-column table,) the 50 rows are selected using the Ids and the texts compared to the originals, then new texts are generated with new lengths, this time <2MB, and the rows updated. Again the 50 rows are selected by Id and compared to the updated originals. The resultant database file is 50,686 Kbytes.

The original Sqlite code ran the test in 13.3 seconds, while with default compression and no tracing (to avoid any overheads) the same test finished in 64.7 seconds (4.86x slower), resulting in a 41,184 KByte file. Both tests ran on the same generated data. The file was on a RAMDisk to minimize disk overhead.

Considering that the data was random and synthetic and insert/update rate was equal to select rates, the results are reasonable. In practice, reads are typically more frequent than writes. With proper caching this should reduce the performance overhead significantly.

Download

The code holds the same copyright claims as Sqlite, namely none. The code is experimental. Use it at your own risk.

Download the code and prebuilt DLL. This sqlite3.dll is the version 3.7.7.1 amalgamation, created from the original sources by the original configure and make files with the default settings/flags. The compression code is added and it’s built using VS2010 SP1 and statically linked to the runtime libraries, so it has no dependencies.

Building

To build the code, first download a recent Sqlite version. The 3.7.7.1 amalgamation is perfect. The latest Zlib must also be downloaded and built.

Add the Zlib headers to the include path, copy the vfs_compress.c file next to the sqlite sources and build it. Next, build the sqlite3.c amalgamation (or the original sources) and link the binaries of sqlite3, vfs_compress and Zlib to create the executable.

Future Plans

A good percentage of the official Sqlite tests pass successfully. But the corruption and format-validating tests unsurprisingly fail. Increasing the supported cases is a prime goal at this point. Getting the mod to “stable with known-limitation” status would be a major milestone. Improving performance is another goal that isn’t very difficult to attain. Having the ability to enable/disable compression on any database is also beneficial and will add more protection against misuse. It’d also be interesting to attempt supporting compression without NTFS sparse files support. This, while much more complicated, would work on any system and not on NTFS alone.

As a bonus, it’s almost trivial to add encryption on top of the compression subsystem.

Any comments, ideas, feedback and/or constructive criticism are more than welcome.

Aug 26 2011
 

Databases are as ubiquitous as computers. They are often, erroneously, associated only with data-intensive, long-term storage solutions. Yet in reality they are used in one shape or form in most programs, from word processors to media players to web browsers. All depend on databases to realize some of their features. And that’s without mentioning e-commerce, flight booking, libraries and, of course, government intelligence data. Even when no feature requires a database per se, user configuration and settings are often stored in databases. Where previously Windows programs depended on the Registry and Unix programs on plain-text files, increasingly new products are utilizing the portable and vastly superior SQLite instead.


Each application of course has its own set of requirements, use-cases and patterns. Yet, surprisingly or otherwise, there are two main patterns that can be said to parent the rest: OLTP and OLAP, or Online Transaction Processing and Online Analytical Processing respectively. Consequently, database design and optimization depends heavily on our processing needs and models. (Here I’m not concerned with database engine design and architecture, but rather with the use-case patterns and processing models.) To get a better feel for each, let’s consider their typical cases.

Online Transaction Processing (OLTP)

This type of application is chiefly characterized by performing real-time, localized, mission-critical operations. That is, the operations concerned must complete virtually immediately (hence real-time processing,) they typically involve a very small number of entries independent of the remaining data (hence localized processing,) and there is zero tolerance for data inconsistency, corruption or undefined states. The latter property is what requires transactional processing, such that either the operation completely succeeds with all its sub-parts, or it completely fails, restoring the previous state of all involved entries.

In OLTP the operations are mostly Insert and Update, with occasional Delete operations for housekeeping, although typically deletion is deferred for performance and/or for later data mining and reporting. The tables involved are typically normalized around the data entities.

Examples include online shopping, flight and medical data and custom data processing.

Online Analytical Processing (OLAP)

In contrast to OLTP, this type of application is characterized by performing offline, aggregate and low-priority operations. The operations involved here can be executed at low-demand hours, often on archive databases and secondary servers (hence offline processing,) and they typically aggregate a very large amount of data to generate statistics, reports or to find outliers (hence aggregation.) Being offline, they are designed to have low priority and a low isolation level (read uncommitted) and, since there is little or no side-effect to failing, they are designed to fail rather than potentially interfere with competing OLTP operations (if executed on the same database.)

OLAP operations are mostly Select operations with virtually no data modification (except for storing the results, if at all, typically in dedicated tables.) These operations not only aggregate a large number of entries, with heavy use of aggregate functions, but they typically also join a significant number of tables.

Examples include reporting, billing, statistical analysis and historic data processing.

Similarities and Differences

Since the differences between these two patterns lie in the underlying database operations they perform, it’s no surprise that there isn’t a multitude of basic patterns. The two main operation types are read and modify. That is, the applications that mostly read data (OLAP) typically perform massive read operations, potentially across the complete data set, with few or no modifications, while those that depend on heavy data modification care most about the integrity of the few entries they modify, with little or no interest in other data.

However, one must note that modification implies reading. This is crucial for correctly appreciating the performance characteristics of the two patterns. The three data-modifying operations, Insert, Update and Delete, must all first perform Select operation(s) in one form or another. While this is more apparent for Update and Delete, Inserts with constraints must first validate that no constraint is violated. This often involves a lookup operation in the constraint index(es). Only in the most basic and simplest cases could an Insert operation be free of all lookup operations (when foreign key, unique or other constraints are completely absent.)
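
A tiny illustration of that point, with a hypothetical schema (SQLite used here only because it’s handy): the UNIQUE constraint creates an implicit index, and every Insert must probe it before the row can be accepted.

/* A modify implies a read: the second INSERT fails only because the
 * constraint lookup in the implicit unique index found an existing entry. */
#include <stdio.h>
#include "sqlite3.h"

int main(void)
{
    sqlite3 *db;
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db,
        "CREATE TABLE customer(id INTEGER PRIMARY KEY, email TEXT UNIQUE);"
        "INSERT INTO customer(email) VALUES('a@example.com');", 0, 0, 0);

    if (sqlite3_exec(db,
        "INSERT INTO customer(email) VALUES('a@example.com');",
        0, 0, 0) != SQLITE_OK)
        printf("constraint lookup rejected the duplicate: %s\n",
               sqlite3_errmsg(db));

    sqlite3_close(db);
    return 0;
}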

This doesn’t mean that OLTP operations are a superset of OLAP operations. To optimize first and foremost for OLAP in the hope that OLTP operations would naturally also run faster, thanks to the improved read operations they could utilize, is a fallacy. The two are worlds apart and typically don’t have overlapping indexable data (more on this below.) Where they do share fields, indexes on them would most certainly be welcome to both. Unfortunately, that doesn’t materialize nearly as often as one would like.

Typical properties of OLTP vs. OLAP

Property             | Transactional Processing                                   | Analytical Processing
Data                 | Current and in-progress.                                   | Retired, historic or archived.
Typical Operations   | Insert, Update, Delete.                                    | Select.
Function Types       | Based on business requirements.                            | Based on data mining.
Normalization        | Highly normalized, entity modeled.                         | Possibly denormalized to minimize joins.
Data Locality        | Few tables, few entries.                                   | Large aggregation across many tables.
Indexing Strategy    | Limited to a few highly-selective fields.                  | Generous indexing depending on the queries.
Operation Duration   | Typically very short, sub-millisecond range.               | Extended over minutes and hours.
Caching Friendliness | Highly volatile data with strict persistence requirements. | Static data, very cache friendly.
Backup Requirement   | Inconsistency and data loss may cost business.             | All operations can be rerun, backup is redundant.
Query Complexity     | Trivial to simple.                                         | Complex to inhumane and unreadable.

Hybrids

The above descriptions are certainly not the only possibilities. A combination of both is more reasonable to expect in practice than the theoretical and somewhat idealized cases given above. One such case might include functionality that must aggregate several groups of data without suffering phantom reads. Such a requirement means not only that the aggregate query must run with a high isolation level, reducing parallelism, but also that indexes must be added so it finishes as fast as possible, to free up the entries that concurrent transactions might require. The additional indexes would incur an unwelcome cost for data modification (which must update the indexes as well.)

Optimization

Due to the differences in the operations involved, and indeed their requirements, optimization in each case is different than the other.

OLTP

To maximize performance, OLTP operations would use highly selective conditions on normalized tables in queries that are completely indexed (rather than partially indexed). This will get us to the target data as fast as possible, or, in case of insertion, will verify the absence of collisions equally quickly. Typically the primary key is used, which itself may be an alias of the internal row-id.

Next we’d want to localize locks as much as possible to maximize concurrency and minimize collateral impact. Fine grained lock hints such as row-level locks may be used. Normalization helps here by further containing the impact of the transactions as local as possible. In addition, we’d want to minimize the average row-size to improve cache-hits and avoid undue penalty. Once again, Normalization does most of the work here too.

Finally, OLTP operations would want to minimize, if not avoid, indexes, triggers, views and any operation that doesn’t help it reach and update the data as fast as possible. To that end, indexes are only useful to reach the data fast. All others would simply slow down virtually all operations, save for any selects we may have.

OLAP

For OLAP on the other hand it’d be best to work with denormalized data, to minimize or avoid joins.

Unlike OLTP, it’s very hard to predict or prioritize the conditions most useful to OLAP, because the tables and fields of importance are decided by the particular query in question. Indeed, different queries might depend on very different fields and therefore indexes. So indexing some field might be useful only to an isolated query and not to others. For a billing system not only the customers and orders are of importance, but the dates of orders and the billing period as well. Similarly, to generate sales reports, the queries involved would select on products and dates. Yet an inventory audit query might depend on a different set of fields. Hence, indexes are best decided based on the concrete queries in question.
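
As a hedged sketch with a hypothetical orders table, a composite index tailored to a monthly billing query might look like this, while being useless to, say, an inventory audit (and pure overhead to the OLTP insert path):

/* Query-driven OLAP indexing on a hypothetical schema. Each index serves a
 * concrete report query; neither helps the transactional insert path. */
#include "sqlite3.h"

void add_reporting_indexes(sqlite3 *db)
{
    sqlite3_exec(db,
        /* Serves: SELECT customer_id, SUM(amount) FROM orders
         *         WHERE order_date BETWEEN ?1 AND ?2 GROUP BY customer_id; */
        "CREATE INDEX IF NOT EXISTS idx_orders_date_customer"
        "    ON orders(order_date, customer_id, amount);"
        /* Serves a sales-by-product report; dead weight for billing. */
        "CREATE INDEX IF NOT EXISTS idx_orders_product_date"
        "    ON orders(product_id, order_date);",
        0, 0, 0);
}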

To improve performance advanced features such as Materialized Views (aka Indexed Views) may be utilized, which are unfriendly to transactional processing.

Hybrids

From the above it’s quite clear that not only do OLTP and OLAP differ in their operations, but consequently optimizing their queries puts them in apparent conflict with each other. The requirements for maximizing performance in each case are simply contradictory to one another. But often both types of processing are done in our products. How can we optimize for them?

Optimization for both OLTP and OLAP

It must be said from the outset that optimizing for two contradictory sets of requirements can be a recipe for disaster. That is, we might end up with worse performance for most queries and even degrade the database design in the hope of improving performance. With hubris, some might advocate indexing all fields in query conditions, creating an uninformed mess in their wake. In fact, since OLTP operations are inherently very short-lived, the overhead of updating superfluous indexes will probably go unnoticed, yet the long-running OLAP operations might get a noticeable boost. From this, the gulls in question might pat themselves on the back for a job well done, completely oblivious to the fact that the OLTP operation in question will probably run millions of times, each time incurring the slight cost of the index updates, negating the apparent speed boost. To avoid such scenarios, we must understand that some compromise is perhaps unavoidable and approach the problem methodically.

Here are some design patterns that are often employed to maximize performance across both types of queries.

Data Archiving

Perhaps the simplest solution is to avoid performing both OLTP and OLAP type operations on the same data-set. To achieve this, we’ll have to split the data over two sets of tables. One, highly normalized and optimized for OLTP and another, perhaps denormalized, OLAP-optimized set.

A data-retirement scheme will be used to move the retired data from the OLTP table(s) into the archive (OLAP) tables. This may be performed using a background process or using triggers. The latter may not be desirable as triggers add latency to the OLTP queries; on the other hand, the cache hits and the avoidance of extra code and of scanning the OLTP tables might make them a good compromise.

Dynamic Indexing

A number of applications process data in batches. Each batch is processed in transactional fashion and, once all the data has been processed completely, reports and other aggregate functions are executed on the final set. Custom data processing is perhaps a good example here, where some data is processed (for example by means of transformation, conversion or other modification) and tracked by transactions in a database. The database accounts for every data item and its state as it is modified, with timestamps and warning/error codes where applicable.

Such a processing scheme has the advantage of allowing an OLTP-type optimization for the data processing stage until all the data is completely processed, whereupon OLAP-optimized indexes are dynamically added. The overhead of adding or modifying the indexes (yes, some might be dropped, while new ones are added) might win us an overall significant amount of CPU time. Of course, we’d need to profile and benchmark thoroughly to decide which indexes to remove and which to add.
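
A rough sketch of that flow, with a hypothetical item table (names are illustrative):

/* Dynamic indexing: keep the batch (OLTP) stage lean, then pay the OLAP
 * indexing cost once, just before the reports run. */
#include "sqlite3.h"

void process_batch_then_report(sqlite3 *db)
{
    /* OLTP stage: drop report-only indexes while items are being processed. */
    sqlite3_exec(db, "DROP INDEX IF EXISTS idx_item_state_time;", 0, 0, 0);
    /* ... per-item transactions insert/update rows in item here ... */

    /* OLAP stage: build the indexes the aggregate queries need, once. */
    sqlite3_exec(db,
        "CREATE INDEX idx_item_state_time ON item(state, finished_at);"
        "ANALYZE;", 0, 0, 0);
    /* ... reports and aggregate queries run here ... */
}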

Biting the Bullet: Conjoining Both

In some cases we need to aggregate both retired and archived data as well as current and in-transit data. If this scenario is of prime importance, then archiving the data for offline processing might simply be unhelpful, as we must also join the OLTP tables to get as-yet unarchived data. Also, we can’t work in stages and perform dynamic indexing since we have to work with both current and archived data.

For obvious reasons this is the worst of both worlds. In this case, we must very carefully balance two forces:

  1. Indexes for optimum OLTP performance.
  2. Indexes for the most important OLAP queries.

Here analytic profiling of the different operations, their execution pattern and respective use-cases is necessary. Once we get a rough idea of the relative importance of our database operations, we can start collecting empirical data using benchmarks to evaluate the different approaches. Since the database design, indexes and other attributes must be shared among all database operations, we must choose the minimum set of indexes etc. that give the maximum performance across all operations.

The difficulty is of course in finding this optimum point. Each index helpful to the OLAP queries, but unnecessary to the OLTP ones, is an overhead for the high-rate transactions. On the other hand, strategic indexes could boost the performance of aggregate queries by orders of magnitude.

Here one is bound to spend a good amount of time using database profiling, statistical analysis and benchmarking. (In a future post I’ll address this topic head-on.) The downside of all this is that once the priorities of our OLAP queries change, then our particular database design will perhaps be outdated too.

Conclusion

The requirements of OLTP and OLAP are as different as their use-cases. Optimizing for both is a very tricky business, but sometimes unavoidable. If we must maximize performance, we have to invest a considerable amount of time designing a well-balanced database schema; only by using extensive profiling and benchmarking tools, combined with analytic methods, can we then decide on the design that maximizes performance for our particular needs.

(Anything missing? If you have something interesting on the topic to share, please don’t keep it to yourself.)

Aug 10 2011
 

Few directors dare to leave the audience in complete darkness with most eerie music, and to do so not only once, but twice. György Ligeti’s Atmospheres makes for an exclusive performance during the first full three minutes of one of the most audacious productions of the 20th century.

[Image via Wikipedia: An Orion III, Pan Am's first Space Clipper]

The movie isn’t a typical sci-fi. The slow pace, meticulously set-up scenes and larger-than-life photography take the viewer’s breath away. Kubrick, a perfectionist in his own right, makes no compromises. The pace of the movie does perfect justice to the surreal, extravagant and colorful shots, coupled with the moving music of Aram Khachaturian, Ligeti and Strauss. One can’t appreciate this work of art without first appreciating that the movie is the journey and not the finale.

Considering the stunning special effects in the movie and the human-like voice of HAL 9000, it’s quite surprising that virtually every sci-fi before and since fails to replicate this rather reasonable feat of inevitable progress. And this was done in 1968, no less. Instead, we’re left with robotic-sounding computers and other “smart” gadgets, even though some of them depict events centuries in the future. To add insult to injury, most of them simply do away with the difficulties and logistics of gravity-free environments. Perhaps the latter is explained in pragmatic terms, but to their credit Kubrick and Clarke don’t compromise on this point; instead they actually make it a great experience. The photography of the workout scene and all those involving walking about, be it to serve food on the moon-bus or to get into a shuttle, is simply worth every second on screen, as were the multitudes of hours spent creating it. Among the scenes that truly stand out in this regard are those where two people stand perpendicular to each other, going about their business undisturbed.

No one better than Stanley Kubrick could take the masterpiece of Arthur C. Clarke, yet another perfectionist, and do this good a job. Indeed, Kubrick personally must have had an appreciable influence on the development of the plot and story. Even then, his work wouldn’t be complete without sharing credit for the screenplay with Clarke.

Throughout the movie I suspected the music of the opening, the moon-bus scene, the intermission and of course the final scene was most probably that of Krzysztof Penderecki. Kubrick had made perfect use of Penderecki’s music in his 1980 The Shining, in addition to Ligeti’s Lontano, which, unbeknownst to me, was also used in the first radio series of The Hitchhiker’s Guide to the Galaxy, along with Melodien and Volumina, which I’ve listened to extensively. Penderecki’s music is no less unique and indeed eerie than Ligeti’s, which more than explains why portions of his cello concerto were used in The Exorcist, which, unlike The Shining, I found lacking and somewhat ludicrous.

After finally watching the movie, I’d like to find time to read the book. The plot and themes touched upon in the book are simply much more profound than a movie could do justice to. Even at almost 150 minutes, the movie leaves the ending open to wild speculation and personal interpretation. Still, this is certainly one of those classics that not only sci-fi fans, but also those appreciative of grandiose and thoroughly beautiful cinematography, would most certainly enjoy. Especially when watched on a large screen with a decent sound system.

Jul 17 2011
 

Some of us have to work a good 16 hours a day, or more. Some split this time between school and a job, multiple jobs, a job and a hobby project, or spend it on their one-and-only job or startup. After a while, waking up becomes a struggle. Disoriented, exhausted and sleep-deprived. We work hard because we care. Because we want to make the best of our projects, be they personal, academic or professional. Here are some of the things that I found improve this situation significantly when working on major projects for long periods. This isn’t be-all, end-all advice – there is no such thing. They are just good guidelines. I know, they are so simple and sound so obvious.

Disclaimer: I didn’t include exercise and other healthy activities. This isn’t medical or lifestyle advice. It’s just good notes to make the best of a major and important project. If this doesn’t work for you, don’t blame me.

Sleep

This is the single most important factor that can make or break you. Get a good night’s sleep. It’s false economy to pull an all-nighter and wake up late. Reverse it; sleep when you feel sleepy and wake up early. If you find it very hard to wake up and/or work, get as much done before sleep, but don’t overdo it. Try to improve your morning performance and shift your work hours towards the morning. Sure, sometimes we need to send some mail or prepare for a demo. Working extended hours in those cases is probably fine, but make sure you make up for them soon. If possible, finish the absolute minimum before sleep and get the rest done in the morning. When we disrupt our sleep patterns by pulling all-nighters, it’s harder to make up for them.

The trick to good sleep is first and foremost to get in bed in time. Your body will give you the right cues, just pay attention. If you feel sleepy, go to bed; don’t wait 15 minutes to finish something. Be ready for it. Anticipate when your body will be ready to sleep and make sure you’ll be ready by then. Don’t do last-minute things like brushing your teeth, sending mail or setting up wake-up alarms right before stepping into bed. Get these done beforehand. If you miss that perfect time, your body will find it harder to get into deep sleep, which is the most regenerative type.

A sleepy mind with all the right answers will probably perform worse than an alert and fresh one with half the answers. Work hard only when you can afford to be sleepy and slow. Never overwork before a big day like an exam, interview, client meeting or project planning session. Finish all the work at least two nights before the big day and make sure you get baby-sleep during the last couple of nights. Remember that our long-term memory needs deep sleep to consolidate new information. Last-night study will not only leave you sluggish and out of your zone during the exam, but you won’t remember most of what you studied a few hours later.

Exceptions are pretty much the norm. Plan for the long term, not for every situation. Try to get an average of 7 hours of sleep per night during an average week. Figure out your own natural average; it may be different. Sleep also boosts our immune system.

Corollary

Alternatively, if you can’t fall asleep, don’t try too hard. Either get some work done until you’re sleepy (at which point drop everything and go to bed), or read a book in bed until you doze off (but don’t fall asleep with your glasses and the lights on). Don’t spend a couple of hours tossing and turning in bed; use that time to get something done instead. If your eyes are tired, try listening to an audiobook or some music.

Eat

Nutrition comes next. When trying to meet a deadline we might skip a meal or two, grab junk food or just run on coffee or coke. If you can plan your day, and you know it’s going to be a long one, make sure you make room for a good full meal. If you can go out for a lunch break, do so. You’ll be able to get a decent meal and give yourself a break. This will give you physical energy and have a recreational effect; you’ll come back refreshed. Avoid staying on a bad diet for long periods of time. Minimize coffee if possible. Drink some fresh juice, tea, hot/cold chocolate and other beverages, including water. Caffeine, like alcohol, is a diuretic: it dehydrates your body. Its effect on alertness doesn’t last very long either.

Make sure you don’t go to important meetings on an empty stomach. Being hungry makes most of us edgy and easily irritated. It’ll probably make you impatient as well, which isn’t a trait you want when making critical decisions.

Make sure your body is getting essential nutrients. Your immune system is at its weakest when you’re stressed and sleep-deprived; make sure you’re not malnourished as well when going on a spree. So the daily pizza has to make room for other, healthier meals. Lunch breaks with nutritious meals will more than pay for themselves when you don’t end up bedridden for days on end, or dragging yourself to work for a couple of weeks with a red, runny nose, when the flu season hits.

Corollary

Don’t overeat! Not being hungry doesn’t mean having a 110%-full stomach either. This is especially true if you have to do mental and/or physical activity (as opposed to mechanical and tedious work). Eating too much will make you sluggish and sleepy. Moderation is the key.

Don’t Drink (too much)

It’s well known that a bit of drink after a long day is a good relaxant. This works best with lighter drinks like wine, martini, champagne and beer, in small quantities. Consume hard liquor, or excessive amounts of the softer ones, at your peril. Alcohol is known to disrupt our sleep patterns. It’ll dehydrate you, leaving you thirsty all night, and give you a nice buzzing headache in the morning. So not only will you not get a good deep sleep, you’ll also wake up tired and hungover. The best way to use alcohol as a relaxant is, after dinner, to drink no more than 100-150ml (half a cup) and, once you feel a bit buzzed, go to bed. The difference between half a beer and a full bottle will probably cost you the next day.

If you must, drink on Fridays or when you can afford to take the next day off. But don’t drink like there’s no tomorrow.

Take a Break

During a long day, as well as during a long project, make sure you get refreshing power breaks. The lunch break outside the office is one such break. Try to do something unrelated, even if it’s on the same project or subject. I take coffee breaks (though I avoid coffee as such) when I get stuck or between tasks. This forces me to get up and stretch my muscles as well as socialize. The chances that I’ll be distracted are extremely high, which is the whole point of the break. However, make sure you won’t be dragged into something extended. Limit the break to 10 minutes, 15-20 at most. Socializing in person is a great way to do this, but don’t get into a global-warming argument! Even talking about work will be refreshing. I tend to read a few pages from a non-technical book, like popular science. Watching a funny sketch or video clip, or reading a blog article, is also a good way to get away.

Take power naps whenever you can, if that’s your thing. For some people who like napping, even dozing off for 10 minutes during the day gives a great boost. If you can’t nap, try stretching out on a sofa and relaxing. Reading, or listening to chill music, can also help you get a grip during a hectic day.

Every so often, take an early leave or a day off. Go do something completely different and unlike your daily routine. Even if you just stay in and sleep, go for a walk or watch a movie, you’ll get back to your project much more refreshed and enthused.

Corollary

Be wary of getting out of your zone. If you’re making progress and things are rolling like a well-oiled machine, don’t stop! In fact, avoid distractions at all costs. Being in the zone is when we’re most efficient and productive. Switch your IM to “Busy” or “Do not disturb”. Check email and get back to colleagues later. Make it clear when you don’t want to be interrupted unless the building is collapsing, so your colleagues will be mindful. When taking breaks or power naps, be very aware of the time. Set alarms and go back to work when your time is up. If you’re too tired to work, then either go get some sleep, discuss work-related topics or do some other mentally undemanding, mechanical task. Any progress is better than idle chatter or web surfing (aka watching funny pics).

Don’t Work Too Hard

In some professions, keeping up with the industry can be critical for success. Burying one’s head in a project for an extended time might not be the wisest of decisions. Don’t neglect the rest of the world. Working too hard on your project will probably hit diminishing returns beyond some point anyway. Instead, try to keep your proverbial finger on the pulse of the trends. Spend some time reading the news; read about similar projects, success and failure stories, and blogs with insightful technical and non-technical information. Look for smart ways to take shortcuts and reuse other successful platforms or components. Look for good patterns and stories from people like yourself. Keep an eye on the competition, both existing and potential. But don’t overwhelm yourself with news and obsessive competition tracking. Get back to your project and stay focused.

Working too hard may not be the most efficient way to build a successful project. Be mindful of the alternatives. Having a well-rested and fresh mind will certainly help with this.

Bottom-line (TL;DR)

  • Sleep well: makes you fresh, active and in your zone. But if you can’t sleep, get some work done.
  • Eat well: replenish your energy and nutrients. But don’t overeat.
  • Don’t drink: not when you have to work the next day. But half a pint before sleep may help.
  • Take a break: refresh and clear your mind. But don’t get carried away.
  • Don’t work too hard: keep updated on news, competition and advice from others. But don’t overwhelm yourself.
Jul 092011
 

Threading is on the front pages more than ever, so to speak. Where only a few years ago programmers could brush the topic aside as complicated and not very useful, it’s increasingly hard to hide behind those excuses. Multi-core systems are the status quo, and the only real way to get better performance (read: to scale) is to write multi-threaded code.

So it’s no surprise that the topic is revived and there is more talk about it. And what’s a discussion of threading without mention of the pitfalls and problems? Consider the following question:

// x and y are shared; assume another thread may write to them.
int x = 1;
int y = 0;

if (y == 0)
{
   // Can x be anything but 1 at this point?
}

There is no shortage of questions of this type. And there is no shortage of answers either!

The only correct answer to the above question is: It depends on the language!

What does the keyword volatile do? Same answer.

Does my code suffer from the ABA problem? Same answer (but that one is more complicated; a small sketch of it follows further below).

Similar question, same answer.

// Thread 1
x = 1;
y = 0;

// Thread 2
int a = x;
int b = y; // if b == 0, is a == 1?
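
To make the “it depends” concrete for the snippet above, here is a minimal Java sketch of the same publish-and-observe pattern. Take it as a sketch under assumptions: the two variables are shared fields, and the flag is set to a value distinct from its initial one, so that observing it actually tells us the writer ran. In Java, declaring the flag volatile creates a happens-before edge from the write to the read, so a reader that sees the flag is guaranteed to also see x == 1; remove the volatile and that guarantee evaporates.

public class Publish {
    static int x = 0;
    // 'ready' plays the role of 'y' above; volatile is what makes the guarantee hold in Java.
    static volatile boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            x = 1;        // plain write
            ready = true; // volatile write: publishes everything written before it
        });
        Thread reader = new Thread(() -> {
            if (ready) {  // volatile read
                // Happens-before guarantees this prints 1 whenever 'ready' is seen as true.
                System.out.println("x = " + x);
            }
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}

The same source-level shape in C or C++, with a plain (or even volatile) flag, gives no such ordering guarantee; there you would reach for an atomic with release/acquire semantics instead.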

Here is a pdf presentation of a lock-free hashtable in Java. Here is the video presentation. Again, if the same algorithm is implemented in a different language, the result will be different. Namely, it won’t work.
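
Since the ABA problem came up above, here is the small sketch I promised: a deliberately single-threaded Java illustration (the class and variable names are mine, purely for this example). A compare-and-swap only checks that the reference is the one we saw earlier; if another thread swapped it away and back in the meantime, the CAS still succeeds even though the state changed underneath us.

import java.util.concurrent.atomic.AtomicReference;

public class AbaSketch {
    static final class Node {
        final String label;
        Node(String label) { this.label = label; }
    }

    public static void main(String[] args) {
        Node a = new Node("A");
        Node b = new Node("B");
        AtomicReference<Node> top = new AtomicReference<>(a);

        // "Thread 1" reads the current top, intending to CAS against it later.
        Node seen = top.get(); // sees A

        // Meanwhile, "Thread 2" replaces A with B, then puts A back.
        top.set(b);
        top.set(a);

        // Thread 1's CAS succeeds: the reference is A again,
        // even though the structure changed twice in between.
        boolean swapped = top.compareAndSet(seen, new Node("C"));
        System.out.println("CAS succeeded despite ABA: " + swapped); // prints true
    }
}

In Java the usual remedy is AtomicStampedReference, which pairs the reference with a version stamp; in a language with manual memory management the same pattern tends to resurface as a node being freed and reallocated at the same address, which is part of why the answer, once again, depends on the language.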

Why? Because the memory models are different. C++ has a very, very different memory model from .NET’s, which in turn differs from Java’s. (It wouldn’t be entirely wrong to say that, until the C++11 standard, C/C++ didn’t define a memory coherency model as such.) Each, in turn, probably differs from yet other languages. And talking about threading without knowing the memory model is like assuming you can drive at the same speed on all roads. Well, some aren’t even paved, while others have a minimum speed limit.

It’s very unfortunate that an already complex subject is made even more confusing by the differences in languages and their guarantees, or lack thereof. There is no shortage of confused, misleading and downright incorrect information when it comes to threading. Consider that locks don’t even protect your code against out-of-order execution. But in which language is that?

This is made even more complicated by the fact that processors and hardware architectures come with their own sets of guarantees and reservations. So reading up on those without taking into account what the language in question abstracts away from the hardware is yet another way to throw oneself down the rabbit hole.

So, the next time you read (or write) about threading issues, make sure you know which language is in question. If it doesn’t say, stop reading. Without that small one-word detail, everything you learn is wrong, regardless of how detailed and coherent it is. Because precise doesn’t necessarily mean accurate.

Jul 082011
 

Reading Cowboy Coding and the Pink Sombrero, I’m reminded of the two occasions when I deserved that Pink Sombrero, or at least a sombrero. Since I’ll have done cowboy coding exactly twice before retiring, allow me to share the experience with you.

[Image: Bacardi mojito served in a Bacardi-branded glass (via Wikipedia)]

This happened a handful of years ago. I was working for a prominent player in the electronic trading market. Our internal process dictated a design review by an architect; code review by at least two developers, including one from a different team; commit; an SCM build signed off to QA; QA sign-off to Operations after testing; and pre-market testing by Ops with real currency before final deployment. Our customers were bankers and traders gambling millions on commodities around the globe, so there was a reason for such a strict process.

During a relaxed Sunday evening I got a call from the Frankfurt Operations team: something about the deployment tests failing. I was on standby for this deployment as technical support for Ops. The market would open Monday, and there was a pre-market test that we couldn’t miss.

I’m in Yerevan. The developers responsible for the deployed component are in Mountain Time, in deep sleep. The European market-open time is in between. Immediately I was on my office workstation via remote desktop, talking with a sleepy manager, the VP of Software and a couple of guys from Ops. I started reading logs and code. I was promised that at least one of the original coders would review my fix. After a marathon of about 7 hours we had a fix. The other guy reviewed it and, to my surprise, the VP of Software approved hotfixing my local build on the production farm.

Obviously this isn’t like a CSS fix on a live website. Leaving the code size and complexity aside, the module in question typically processes thousands of orders a day, each worth a few grand, some tens of grand. Mistakes here cost a tad more than page views. Luckily, all went better than expected. The next day I went to the office feeling like a walking million dollars. Where is that pink sombrero when you deserve it?

A couple of months went by before I got a similar call again. This time some commodity price was missing or incorrect. The fix wasn’t very difficult: we had to find the correct price, write a small DB update script to patch the Price table, and test. The only thing was that there wasn’t nearly as much time as the last time. Specifically, the pre-market test had come to a close, so we had to fix the issue before market open and go live without testing.

Fortunately, I found the price of the commodity in question on a test server and wrote the SQL to update the DB. Without blinking, the VP of Software gave me the green light to patch the production servers. Still no sombrero! All went as planned, except that the price was wrong!

Early the next day at work I was already being questioned by my manager and her supervisor about the fix. “Who approved it?!?” they asked impatiently, in the most irritating tone, before telling me how it blew up. It turned out the test server had old data. Everybody and their dog was in feisty mode.

And that was the last time I went cowboy coding, sans sombrero. Lesson learned, and I don’t care who approves the cowboy coding… although I wish I’d had a Mojito while at it.

Jul 052011
 

Moses and Monotheism is the first full-length book by Sigmund Freud that I’ve read. While I did read a number of articles and excerpts by Freud, I hadn’t had the chance to read a complete work until now. There is a lot to be said about the man and his works, and indeed a lot has been said and debated.

Freud needs no introduction, or rather he often needs reintroduction: a very controversial and misunderstood person, if there ever was one. The father of psychoanalysis had much to say outside his prime interest, and we’re lucky he published some of his more controversial material. The reason I say this is very much evident in Moses and Monotheism.

The book is an easy read, even though it’s translated from German. The topic, however, is very controversial. In fact, I suspect it was mostly his religious convictions, and his ideas about religious figures and history, that condemned him to the enduring hatred of the masses. Even though I was acquainted with his pen and style, I was still very much amused and pleasantly surprised quite a number of times.

The book’s roots are in two articles Freud wrote for Imago, a scientific journal, back in the 1930s. After escaping to England from Nazi-annexed Austria, he was encouraged to publish the remaining text, together with the articles he had already written, in book form. The idea is to analyse the psychology behind the different stories, laws and events written in the Old Testament; that is, the author tries to reconstruct, as far as possible, what really could have happened to the Israelites in Egypt. There is a healthy dose of references to historians and other specialists, which lends more credibility to his views.

There are two reasons this book is very well worth the read. First and foremost, the author isn’t coy about calling out the limitations of, and issues with, the ideas he puts forth. He’s quick to criticize himself. On many occasions he mentions how problematic or limited some of his claims are, and suggests what could be tested and what is pure guesswork. It’s always refreshing to read such a sober and mature author.

Second, on a number of points he made me pause to reflect on his explanations. I had to remind myself that this was the father of psychoanalysis after all, not some random author. Yet I was still blown away by some of his points. Take, for example, a question that I had spent countless hours contemplating without progress: why do many (most?) people sympathize with figures, civilizations, lifestyles and virtually everything historic? That is, why is it almost universally accepted that there was a great past our ancestors enjoyed, and that everything has been in constant decline and degeneration since? From people rejecting modern medicine in favor of “natural” and “traditional” medicine, all the way to blaming every evil on modern cities and urbanization.

There is something about the past that is universally appealing. Reading Moses and Monotheism yielded a very reasonable explanation: we associate the past with our childhood, which is typically remembered as peaceful and blissful. Sounds like Freud, doesn’t it? He didn’t need to make a strong case for this explanation. Perhaps he, or someone else, already did that elsewhere; but as it stands, he offers something everyone can more or less relate to from experience.

While overall the book was a very good read, an almost complete century of progress in archaeology, history and psychology should have something to say about the then-untested hypotheses the author puts forth, not to mention new evidence unknown at the time. And while none of these fields is even remotely close to mine, I could tell he’s probably already been proven wrong on some of his educated guesses. Still, considering what he had to work with, he did a fascinating job overall. A thought-provoking and insightful book.
