Sep 27 2011
 

Haven’t we all written Fibonacci sequence generators, sorting and searching algorithms and a handful of data-structures while learning to program? What’s common in these problems is their simplicity. I almost said triviality, but there are nuances that the beginner will stumble on and, hopefully, learn from. (And isn’t that the whole point of the exercise anyway?)

Fibonacci is a great introduction to recursion. Sorting teaches us the fundamentals of algorithms, complexity and of course data-structures and searching. We learn that not all ways of organizing data are born equal. Linked lists have a leg up on arrays some of the time, but the ranks are reversed at other times.

This is how we learn about our tools, how to use them and when. It’s the equivalent of painting fences. Without these mental gymnastics we’d have to open up textbooks and look up “how to find duplicates in two sets” or wonder why the data is sometimes shuffled around, but not always (hint: dictionaries typically don’t preserve order, unless you use one that explicitly guarantees order).
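
As it happens, the “duplicates in two sets” exercise is a single linear pass once you know the tools. A minimal C++ sketch (the values are made up purely for illustration):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

int main()
{
    std::set<int> a = {1, 3, 5, 7, 9};
    std::set<int> b = {3, 4, 5, 6, 7};

    // Both inputs are ordered, so the common elements fall out of one merge-like pass.
    std::vector<int> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));

    for (int v : common)
        std::cout << v << ' ';      // prints: 3 5 7
    std::cout << '\n';
}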

The answer to “why is our product all of a sudden too slow and customers are complaining now when it was lightning fast for years?” isn’t in any textbook. However, if one assumes the customer data might have doubled, a seasoned programmer would first check for a quadratic algorithm (or worse) somewhere in the code-path.

While there is no doubt, at least in my mind, that it’s absolutely necessary to work out all these contrived problems, from writing a 4-operation calculator with an expression parser to solving wick wack woe, I think we should include a whole other genre of problems in our training kits.

The Unsolvables

There is a group of problems that, as far as we know, have no practical solutions. That is, the exact solution can only be found by brute force, or an approximation exists using heuristics (basically trial-and-error or some form of genetic algorithm). The obvious usual suspects are there, the NP-complete folk, but not only those: there are algorithms that run in quadratic or higher polynomial time that aren’t practical beyond some modest size.

There is no better way to get a very real feel for what it means for an algorithm to run in quadratic time than to actually write the code and watch yourself age as your machine churns away hopelessly, while another algorithm with lower complexity has long since finished.

Take Quick sort, for example. The textbook implementation works swiftly, that is, until you feed it sorted data. Multiplication using the school method is simple and elegant, until you apply it to large numbers. Trying to optimize such a primitive algorithm is very instructive. It teaches you that it’s not about optimization; the complexity is inherent. Yet we don’t use Merge sort with its O(n log n) worst-case performance; we default to Quick sort with its O(n²) worst-case characteristic. This is not at all obvious to a beginner, nor even to many experienced programmers. Theory and practice sometimes do part ways. And that’s yet another important lesson.
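
A quick way to experience this first-hand is to feed a naive, first-element-pivot Quick sort an already-sorted array. A rough sketch (the input size is arbitrary; the tail recursion is turned into a loop only so the stack survives long enough to show the quadratic slowdown):

#include <cstdio>
#include <utility>
#include <vector>

// Textbook-style Quick sort with a naive first-element pivot.
static void quicksort(std::vector<int>& a, int lo, int hi)
{
    while (lo < hi) {
        int pivot = a[lo];
        int i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        quicksort(a, lo, j);  // recurse into the left part...
        lo = i;               // ...and loop on the right to keep the stack shallow
    }
}

int main()
{
    // Already sorted: every partition is maximally lopsided, so the run time
    // degenerates to O(n^2). Shuffled input of the same size finishes instantly.
    std::vector<int> v(50000);
    for (int i = 0; i < (int)v.size(); ++i) v[i] = i;
    quicksort(v, 0, (int)v.size() - 1);
    std::printf("done\n");
}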

Without attempting to solve a travelling salesman problem or a knapsack problem, we are in danger of never really understanding why complexity is important; why some algorithms are hopeless and will come back to haunt you one day, while others seem stuck in their misery and can hardly be improved.
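
To make the point concrete, here is a brute-force travelling salesman sketch (random coordinates and sizes of my own choosing, purely illustrative). It examines (n-1)! tours; bump n up a few notches and watch the run time explode:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const int n = 11;  // 10! tours is tolerable; try 13 or 14 and go make coffee
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> coord(0.0, 100.0);

    std::vector<double> xs(n), ys(n);
    for (int i = 0; i < n; ++i) { xs[i] = coord(rng); ys[i] = coord(rng); }

    auto dist = [&](int a, int b) { return std::hypot(xs[a] - xs[b], ys[a] - ys[b]); };

    // Fix city 0 as the start and try every ordering of the rest.
    std::vector<int> tour(n - 1);
    std::iota(tour.begin(), tour.end(), 1);

    double best = 1e300;
    do {
        double total = dist(0, tour.front());
        for (size_t i = 0; i + 1 < tour.size(); ++i)
            total += dist(tour[i], tour[i + 1]);
        total += dist(tour.back(), 0);
        if (total < best) best = total;
    } while (std::next_permutation(tour.begin(), tour.end()));

    std::printf("Best tour over %d cities: %.2f\n", n, best);
}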

And what better way to understand how and, more importantly, why modern cryptography works than to try to factorize numbers? Primality is yet another long-standing problem; it was only recently proved that primality testing can be done in polynomial time, and all of it ties back to one-way functions.
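
A naive trial-division factorizer makes the asymmetry tangible: multiplying the two primes below is instantaneous, while recovering them takes on the order of a billion loop iterations, and every couple of extra digits multiplies the pain. (The primes here are just convenient, well-known ones.)

#include <cstdint>
#include <cstdio>

int main()
{
    // Two well-known primes; their product is a 19-digit semiprime.
    const std::uint64_t p = 1000000007ull;
    const std::uint64_t q = 1000000009ull;
    const std::uint64_t n = p * q;

    // Trial division: roughly sqrt(n) steps, i.e. exponential in the digit count.
    std::uint64_t f = 2;
    while (f * f <= n && n % f != 0)
        ++f;

    std::printf("%llu = %llu * %llu\n",
                (unsigned long long)n,
                (unsigned long long)f,
                (unsigned long long)(n / f));
}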

There is also another purpose to this exercise. It’s not obvious where the difficulty of these unsolved problems lies. At first sight, almost anyone presented with such a problem will come up with a solution of sorts. It’s not until much later, and with much mental effort, that one notices the error of their ways. It might be a hidden complexity they introduced into their algorithm, oblivious to the fact that it negated the gains they scored elsewhere. It might be that the obvious or straightforward solution misses a very simple, but equally crucial, case. Or it may be that their solution is broken in one way or another. Either way, it’s not enough to know about these problems and their status and move on. There is much to be learned from attempting to solve unsolved problems.

The Impossibles

But why stop there? Why stop at the travelling salesman or a variation of the knapsack problem? While we’re at it, let’s introduce non-computability and noncomputable functions.

I’m sure these topics are very well studied in some grad schools. But the average CS undergrad would probably firmly believe that the fastest-growing functions are the exponential ones. (I have yet to get a different answer in an interview.) Whatever happened to Busy Beavers? Apparently, a noncomputable function can actually grow faster than any computable function! Now try to beat that in the department of slow algorithms.
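
To put a face on the beast, here is a tiny simulator for the classic 2-state, 2-symbol busy beaver champion (the transition table is the standard one). It halts after 6 steps with 4 ones on the tape, yet the known lower bounds for 6-state machines are already astronomically large, and the function as a whole outgrows every computable function:

#include <cstdio>
#include <map>

int main()
{
    // The 2-state, 2-symbol busy beaver champion.
    // rules[state][symbol] = { symbol to write, head move, next state } ('H' halts).
    struct Rule { int write; int move; char next; };
    const Rule rules[2][2] = {
        { {1, +1, 'B'}, {1, -1, 'B'} },   // state A on 0, on 1
        { {1, -1, 'A'}, {1, +1, 'H'} },   // state B on 0, on 1
    };

    std::map<long long, int> tape;        // unbounded tape, cells default to 0
    long long head = 0;
    char state = 'A';
    long long steps = 0;

    while (state != 'H') {
        int& cell = tape[head];
        const Rule& r = rules[state - 'A'][cell];
        cell = r.write;
        head += r.move;
        state = r.next;
        ++steps;
    }

    long long ones = 0;
    for (std::map<long long, int>::const_iterator it = tape.begin(); it != tape.end(); ++it)
        ones += it->second;
    std::printf("Halted after %lld steps with %lld ones on the tape.\n", steps, ones);
}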

Conclusion

I think it would be a great service to our industry if, every once in a while, schools assigned an unsolvable problem. Send the students home with a travelling salesman problem or two and ask them to bring in the numbers the next week. It’d prove much more instructive to see the 5-city problem solved in seconds while the 25-city…

And who knows, we might just get a Dantzig moment! (Incidentally, he’s the inventor of Simplex, the single most useful algorithm in optimization.)

Sep 25 2011
 

Building a large and complex project such as WebKit is anything but straightforward. WebKit is the core of multiple browsers, most notably Safari and Chrome. Even the official instructions are sometimes lacking. This is mostly because such large projects are moving targets and it’s hard to keep track of all the changes in the requirements. Another issue is that most developers don’t set up build systems from scratch regularly; probably no more than once or twice during the lifetime of a project. Even the blogs, articles, forums and mailing lists that discuss building WebKit are almost completely outdated.

At any rate, these notes are up to date as of September 2011; I hope you find them helpful.

Background

WebKit can be built in different flavors. The differences are important, because one must set up the build system slightly differently for each flavor. There are essentially four flavors: Apple Windows, WinCairo, Qt and GTK. All four require external libraries to be present for the build to succeed. I have no experience building the Qt and GTK flavors; here I’ll cover Apple Windows and WinCairo. I do not cover building Chromium.

Build Environment

There are some common requirements, so first and foremost make sure these are set up correctly. You may follow the instructions described here. Unless you want to build using VS2005, ignore the VS2005-specific instructions.

Here is a summary:

  1. Install Cygwin. Use this downloader. Preferably install to C:\cygwin (I’ll assume that’s where it is). Never to a path with spaces! Make sure you install: svn, curl, unzip, make, perl, python, gcc, gcc-g++, gcc4, gcc4-g++, bison, flex, gperf.
  2. Install QuickTime SDK in the default location.
  3. Install DirectX SDK. However, if you’ll build with VS2005, don’t install anything newer than February 2010. It’s worth noting that the August 2007 version builds WebKit just fine (as of r95358 with default flags).
  4. Install Visual Studio 2005 or 2008 with all updates and Service Packs. For VS2005 please refer to the updates listed here. Personally, I suggest sticking with 2008.
  5. Install Windows SDK for Server 2003 R2 or newer. See below for more details.
  6. Download the Support Libraries from Apple. Copy the file (don’t extract, just copy the zip) into the WebKit source folder.
  7. Create WEBKITLIBRARIESDIR and WEBKITOUTPUTDIR environment variables. The values must be fully-qualified Windows paths and not Cygwin/unix paths. If you don’t want to change your system-wide environment variables, skip this and see the script below.

Windows SDK

If you have multiple Windows SDKs (as is typically the case when installing multiple Visual Studio versions) then you will have to setup your target Visual Studio to use the SDK of choice. There are two ways to achieve this.

The first method is to change the BIN, INCLUDE and LIB directories from the Visual Studio Options dialog. From the Tools > Options menu, under Projects and Solutions > VC++ Directories, set the top entry of Executable files to the %SDK%\bin folder, put %SDK%\lib at the top of the Library files entries and %SDK%\include at the top of the Include files entries, where %SDK% is the installation path of the SDK. Note: This will change the Visual Studio paths for all instances. So don’t make this change if other projects depend on your current settings, unless you care more about WebKit, of course.

The second method is much less intrusive. We can change the environment variables in a shell instance and pass them on to Visual Studio. This is relatively easy and convenient, but requires scripting. In the Tools/Scripts folder of WebKit there is a script called pdevenv. This Perl script is responsible for creating a Windows CMD file which ultimately runs Visual Studio. The reason is that Bash can’t execute the scripts necessary for setting up Visual Studio’s environment.

Here is an example of a modified pdevenv script that sets up the SDK:

#!/usr/bin/perl -w

use strict;
use warnings;
use File::Temp qw/tempfile/;
use FindBin;
use lib $FindBin::Bin;
use webkitdirs;

my ($fh, $path) = tempfile(UNLINK => 0, SUFFIX => '.cmd') or die;

chomp(my $vcBin = `cygpath -w "$FindBin::Bin/../vcbin"`);
chomp(my $scriptsPath = `cygpath -w "$FindBin::Bin"`);

# Reverse the order of these if you prefer to use VS2008.
my $vsToolsVar;
if ($ENV{'VS80COMNTOOLS'}) {
    $vsToolsVar = "VS80COMNTOOLS";
} elsif ($ENV{'VS90COMNTOOLS'}) {
    $vsToolsVar = "VS90COMNTOOLS";
} else {
    print "*************************************************************\n";
    print "Cannot find Visual Studio tools dir.\n";
    print "Please ensure that \$VS80COMNTOOLS or \$VS90COMNTOOLS\n";
    print "is set to a valid location.\n";
    print "*************************************************************\n";
    die;
}

# Comment the following to see the environment variables used.
print $fh "\@echo off\n\n";

# Setup Visual Studio.
print $fh "call \"\%" . $vsToolsVar . "\%\\vsvars32.bat\"\n\n";
print $fh "set PATH=$vcBin;$scriptsPath;\%PATH\%\n\n";

# Setup the 2003 SDK. Don't call its own SetEnv as it doesn't detect VS2008 or newer.
print $fh "Set MSSdk=C:\\Program Files\\Microsoft Platform SDK for Windows Server 2003 R2\n\n";
print $fh "Set INETSDK=%MSSdk%\n\n";
print $fh "Set MSSdk=%MSSdk%\n\n";
print $fh "Set Mstools=%MSSdk%\n\n";
print $fh "Set Lib=%MSSdk%\\Lib;%Lib%\n\n";
print $fh "Set Include=%MSSdk%\\Include;%Include%\n\n";
print $fh "Set Path=%MSSdk%\\Bin;%MSSdk%\\Bin\\WinNT;%path%\n\n";

# Run Visual Studio.
print $fh "IF EXIST \"\%VSINSTALLDIR\%\\Common7\\IDE\\devenv.com\" (devenv.com /useenv " . join(" ", @ARGV) . ") ELSE ";
print $fh "VCExpress.exe /useenv " . join(" ", @ARGV) . "\n";

close $fh;
chmod 0755, $path;
chomp($path = `cygpath -w -s '$path'`);
exec("cmd /c \"call $path\"");

Note: If you already have SDK 6.0, 6.0A, 7.0, 7.0A or 7.1, you should be fine, but be warned: DirectX conflicts with 7.x SDKs, so make sure DirectX headers come after Windows SDK. If you have 6.0 or 6.0A you should have no problems building.

Building with VS2008

  1. Install VS2008 with all Service Packs and updates. (The Express version needs a bit of a push to work. Please follow these instructions.)
  2. Run update-webkit and build-webkit at least once. Regardless of the outcome, delete the contents of WEBKITOUTPUTDIR folder, wherever it is.
  3. Run VS2008 and open C:\cygwin\home\{username}\WebKitTrunk\Source\WebKit\win\WebKit.vcproj\WebKit.sln. You’ll be prompted to convert the project files from VS2005 format to 2008. Convert and save-all.
  4. WebKit projects have no tolerance for warnings and they are set to the highest level. Since VS2008 does emit some superfluous warnings (such as not inlining a force-inline function), the build will fail with these settings. As such, we need to change these settings.
    Select every project that doesn’t end in “generated”, right-click and select Properties. In the Properties window, from the top, change the Configuration to “All Configurations”. Navigate to Configuration Properties/C++/General in the left tree and on the right-side change the Warning Level to “Off: Turn Off All Warnings (/W0)” and Treat Warnings As Errors to “No”. Click OK to close this dialog.
    If you don’t see C++/General in the left tree it means you selected a non-C++ project. You may do this one project at a time.
  5. The project files are now ready. Save all and exit Visual Studio.

Build Notes

  1. Disable Anti-Virus scanners. (Can you spell incredible?) Cygwin needs to simulate fork, which needs to clone process memory exactly. Many AV products add system hooks to scan processes in-memory. These hooks disrupt the memory layout, and DLL relocation leads to differences in the memory of forked processes, resulting in memory corruption. This issue may never happen even with an AV running. However, if you see fork, out-of-memory or virtual-memory allocation failures, try disabling the AV and try again. Some don’t remove their system hooks when temporarily disabled, so you may need to uninstall them completely.
  2. Don’t build the Trunk. WebKit is large, complex and fast moving. Unless you like testing the limits of your sanity, you may want to stay away from the Trunk, which can be broken and problem-laden. Instead, get a successfully-built and packaged source archive from the nightly builds. At least have one recent successful build from a nightly archive before attempting the Trunk.
  3. Place Cygwin/bin before others in Path. In the system path, place the Cygwin binary folder after the System folders but before anything else. Example: %SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;C:\cygwin\bin;. This will ensure that if you have Perl, SVN and other similar utilities installed outside of Cygwin, the Cygwin version will take precedence. NOTE: This may break other products that rely on the non-Cygwin variants; in that case, you’ll have to perform this change in a build script (see below).
  4. Place WebKit sources in your Cygwin home directory. If you install Cygwin in the root, your home should be here: C:\cygwin\home\{username}. WebKit should be in a folder directly within your home. Example: C:\cygwin\home\{username}\webkit.
  5. Add the scripts folder to the path. The WebKit scripts must be reachable at all times, so make sure C:\cygwin\home\{username}\webkit\Tools\Scripts is prepended to the path.
  6. Don’t update the WinCairo code unless you want to build that flavor. Running update-webkit --wincairo seems to change enough in the code to break the Apple Windows flavor. If this happens and you can’t build Apple Windows, simply delete the WebKitLibraries folder and run update-webkit (without --wincairo!).
  7. Run update-webkit at least once. Even if you don’t need to update the code, you must run this script at least once. Note that if you’ve downloaded a nightly-build package this script won’t update the code to the Trunk! It’ll only update the support libraries and other scripts.
  8. You may have to build twice before it works. The build almost always fails the first time and after deleting the WEBKITOUTPUTDIR folder, so the build script must be executed more than once to get a successful build.
  9. Look for the first failed project to fix errors. Since the WebKit build copies many source files as well as output binaries from one project to another, it’s vital that all dependencies of a sub-project succeed before it is built. That is, if a project fails, typically due to missing headers, identifiers etc, before panicking look for the first failed project and try to fix it. Most probably the errors are due to a previously broken dependency. This can get tricky, as some projects depend on external scripts that may be partially completed and stop reporting errors on subsequent builds. In such a case you won’t see failed dependencies and the sub-project in question would be the first to fail. To make sure this is the case, delete the WEBKITOUTPUTDIR folder and build again. If you get errors, note the first error you got and build again. Repeat this until you get the same error. This is the error you need to fix.
  10. Disable warnings as errors. If you get error C2220: warning treated as error - no 'object' file generated, then follow the instructions for disabling warnings as errors in the Building with VS2008 section above.

Here is a bash script to setup the environment, update and build. You must set your home folder name. Save this file in the webkit folder, name it build.

#! /bin/sh

echo
echo "Exporting environment varibles..."
export PATH=~/webkit/Tools/Scripts:$PATH
export WEBKITLIBRARIESDIR=C:\\cygwin\\home\\{username}\\webkit\\WebKitLibraries\\win
export WEBKITOUTPUTDIR=C:\\cygwin\\home\\{username}\\webkit\\WebKitBuild

echo
echo "Updating source tree..."
update-webkit

echo
echo "Building..."
build-webkit $*

Apple Windows

This flavor is the one used by Safari on Windows. The external libraries are copyrighted by Apple and aren’t redistributable. As such, the build system will require one to manually download an archive holding them from Apple after accepting the license agreement. This build, therefore, cannot be distributed on the internet with the external libraries. Having said that, one should at least try to get this flavor successfully built before attempting others.

To build this flavor follow these steps:

  1. Run the Cygwin shell.
  2. cd webkit
  3. ./build

WinCairo

To build the WinCairo flavor one used to have to manually download and install the various libraries, often building them by hand or hunting down prebuilt versions. Thanks to the efforts of volunteers, the update-webkit script now automates all that. If you have built the Apple Windows flavor, you should be able to build WinCairo using the following:

  1. Run the Cygwin shell.
  2. cd webkit
  3. ./build --wincairo

If the build fails, try running the build script again.

However if you get “CF” identifier errors, consider updating the CF library manually. All public CF library identifiers and even filenames either begin or end with “CF”, so they are easy to spot. WebKit WinCairo uses a modified version of the original Apple sources called OpenCFLite. Download a recent version from the project’s page or get the trunk. You’ll have to bite the bullet and build the library yourself, update the headers and libs and build WebKit again.

To build OpenCFLite, open the solution in VS2005 or 2008 (depending on which you’ll use for the WebKit build). Choose the Release configuration and build the CFLite project. Ignore the mountain of warnings and see whether the build succeeded. If all goes well, in the root of the OpenCFLite source folder you should see a dist folder. Now copy the contents of dist into %WEBKITLIBRARIESDIR%\win, overwriting existing files.

Delete the %WEBKITOUTPUTDIR% folder and build WebKit WinCairo again. You should get a clean build.

Debugging

Now that you have a successful build, you may want to build the Debug version by passing the --debug switch to the build-webkit script. You can execute this script directly (without running the above build script or update-webkit) or you can modify the above build script to the same effect.

With the debug build we can use the Visual Studio debugger or the ultra-powerful WinDbg. In Visual Studio simply open the solution file from C:\cygwin\home\{username}\WebKitTrunk\Source\WebKit\win\WebKit.vcproj\WebKit.sln, select one of the test-runner projects as the startup project and hit F5.

Happy hacking and don’t forget, you can contribute by submitting bug reports and sending patches and fixes. Good luck.

References

I couldn’t have succeeded in building WebKit without the help of numerous developers who shared their experience on the web. Here is a list of the most helpful and relevant pages. Be forewarned however, they are mostly outdated!

Sep 18 2011
 

Summary (TL;DR)

This is an experimental mod of Sqlite with built-in online compression support. The design and implementation are discussed, limitations and benchmarks are provided, and the source code as well as a prebuilt DLL are included. Use the TOC to jump to the topic of interest.

Background

Both Sqlite and MySql support compressed (and encrypted) databases. Well, more or less. Sqlite’s support is limited to read-only databases that are compressed offline, while MySql’s support is limited to compressing strings (as far as I can tell.)

While working on WikiDesk, a Wikipedia browser project, I knew the database could easily grow to 100s of gigabytes. The database of choice here is Sqlite because of its compactness and mobility. The English Wikipedia dump is already in the range of 100s to 1000s of gigs (depending on the dump type.) WikiDesk not only supports different Wikipedia languages, but also different projects, such as Wikinews, Wikibooks and Wiktionary, among many others, in all available languages, all in the same database. Theoretically, one can import all possible Wiki content into a single database.

The opportunity of compressing this highly-redundant wiki-code mixed with Unicode text was pretty much obvious. So it was reasonable to assume others must have had a similar case and added compression support to Sqlite. My search only yielded the aforementioned cases.

A part of me was happy to have found no precedent project. I was more than happy to roll-up my sleeves and get to hacking.

Design Survey

There are many ways to go about designing a compressed database file. My main purpose, however, was to have fully-transparent, online and realtime compression support. So the design must accommodate updates and deletions as well as any other modify operation supported by Sqlite.

An obvious approach is the one used by MySql, namely to compress the fields independently. This is simple and, relatively speaking, straightforward. However, it’d mean that LIKE couldn’t be used on compressed string fields. Collation, sorting and other features would be absent as well; in fact, the fields in question couldn’t be TEXT at all. In addition, one would have to explicitly compress fields, remember which are compressed and remember to decompress them before use. Very limited, I thought, and probably not worth the effort. Another approach is to do this at a low level, such that it’d be transparent to the caller. Such an extension to Sqlite exists, but it will not yield much gain on small fields. I suspect NTFS compression would give better results.

NTFS has built-in compression support. It was well worth the effort of testing it. On an English SimpleWiki dump I could compress the database file down to about 57% of its original size (see benchmarks below.) Pretty decent. However I couldn’t control it at all. I couldn’t set the chunk size, compression level or anything save for enabling and disabling it. In addition, the user could disable it and lose all the benefits. Database-level compression is more promising. A similar result can be achieved using FuseCompress or compFUSEd (on Linux), albeit, the user must install such a filesystem first.

A major problem with database files, as far as online compression is concerned, is that the database’s logical structure typically stores pointers to file offsets, such that there is a one-to-one mapping between the physical and logical structures. This is reasonable, as the database is really a large and complex data structure on disk (as opposed to in memory.) The b-tree or r-tree nodes are typically page indexes, where all pages have a predefined, database-wide fixed size. Disrupting this structure would render the file corrupted. The purpose of the fixed-size pages is to simplify the allocation and management of space. This scheme is also used by memory and disk managers alike.

If we compress a page in the database, the page would now contain two regions: data and free-space. To utilize the free-space, we could write a portion of the next page in the free-space, and the remaining in the next page, and so on for all pages. But then we’d have to keep track of each page’s fragments somehow. To avoid that, we can leave the free-space unused, but then we’d get no net saved disk space, as the free-space would still be allocated on disk.

I could store the new indexes and offsets in some allocation table appended to the file. But I’d have to do a lot of data moving, reallocation, (de)fragmentation and whatnot just to keep track of the free ranges and so on. Obviously this approach was pretty complicated and would take a much more involved design and coding effort. Also, Sqlite page sizes are multiples of the disk sector size for atomicity. I’d have to be thoroughly familiar with the Sqlite design and implementation to embark on such a largish project, if I wanted it finished and working.

The ‘be lazy’ motto seems to work well for programmers who are efficiency-oriented and hate repetitive and error-prone work. What would be the simplest approach that could work? Going back to NTFS one could learn a lesson or two on transparent compression. The secret is that NTFS can simply allocate any free inode on the disk, write the compressed data to it and update the index table. Inodes are linked lists, so it is very easy to insert/remove and modify the chain. Files, on the other hand, are arrays of bytes abstracted from the disk structure. Moving bits around in an array is much more complicated and time consuming than updating nodes in a linked-list.

What is needed is the advantage of a file-system applied on the level of files.

What if we could tell the file-system that these free-space regions of the file are really unused? NTFS supports sparse files in addition to compressed files. This could be used to our advantage. All we’d have to do is mark the free-space in each page as unused and the file-system will make them available to other files on the disk, reducing the net used disk space of the database.

The Design

Sqlite supports pages of 512 to 65535 bytes. Since we can’t break up a single page, the smallest compression unit must be at least 64 Kbytes. In addition, the compression unit of NTFS compression also seems to be 64 Kbytes. This means that a sparse range must be at least as large as a compression unit to be deallocated from disk and marked as free. This puts a clear limitation on the amount of saving we can achieve with this design; compression won’t save any disk space unless it reduces the size in multiples of 64 Kbytes. A multiple of 64 Kbytes is therefore used as the compression unit, internally called a chunk. Indeed, a chunk size of 64 Kbytes would be totally useless, as there could be no saving at all.

When data is written, it’s first written into a memory buffer. This buffer is used to track changes to the chunk and its offset in the file, and to compress the data. When the chunk needs flushing, the data is first compressed and the compressed data is written at the chunk’s offset. The remainder of the chunk is marked as a sparse region. NTFS deallocates any naturally-aligned compression units that are completely sparse. Partially written units are physically allocated on disk and 0-valued bytes are written to disk.
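
For illustration, deallocating the free tail of a chunk on NTFS comes down to two DeviceIoControl calls on the database file handle. Here is a minimal sketch; the function name and the omission of error handling are mine, and in the real mod this of course happens inside the VFS write path:

#include <windows.h>
#include <winioctl.h>

// Mark [offset, offset + length) as a zeroed, sparse "hole". NTFS frees any
// 64 KB compression units that fall entirely inside the zeroed range; reads
// from the hole transparently return 0-valued bytes.
static BOOL MarkRangeSparse(HANDLE hFile, LONGLONG offset, LONGLONG length)
{
    DWORD bytes = 0;

    // The file itself must be flagged as sparse (once per file).
    if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL))
        return FALSE;

    FILE_ZERO_DATA_INFORMATION zero;
    zero.FileOffset.QuadPart = offset;
    zero.BeyondFinalZero.QuadPart = offset + length;

    return DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zero, sizeof(zero),
                           NULL, 0, &bytes, NULL);
}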

When reading data, the complete chunk containing the requested byte-offset is read and decompressed, and the requested bytes are copied back to the caller from the buffered data. The sparse bytes are transparently read in as 0-valued bytes. This is done by NTFS and relieves us from tracking sparse regions.

Initially very fast compression libraries were used to avoid sacrificing too much performance. FastLz, Lz4 and MiniLzo were tested but the results weren’t very promising, compression-wise. As such the current build uses Zlib.

Implementation

The compression mod is written as a VFS Shim. This has the advantage of avoiding any modifications to the Sqlite code base.

Enabling compression must be done before opening any database files. A single function is defined as follows:

int sqlite3_compress(
    int trace,
    int compressionLevel
    );

trace can be a value between 0 and 7. When 0, tracing is disabled; larger values enable tracing of increasingly lower-level operations. Trace logs are written to stderr. Use -1 for the default.

compressionLevel can be a value between 1 and 9, where 1 gives the fastest performance at the expense of compression ratio and 9 gives the best compression at the expense of performance. -1 for default, which is typically level-6.

To enable compression, this function is simply called before calling sqlite3_open. The compression level may be changed between runs; however, unless a chunk is modified, its data will not be recompressed at the new level.
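
A minimal usage sketch (the database name is arbitrary; sqlite3_compress is the function declared above, linked in from the mod):

#include <stdio.h>
#include "sqlite3.h"

extern int sqlite3_compress(int trace, int compressionLevel);

int main(void)
{
    sqlite3 *db = NULL;

    /* Must be called before any database is opened:
       no tracing, default (level-6) compression. */
    sqlite3_compress(0, -1);

    if (sqlite3_open("wiki.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    /* ... use the database exactly as usual; chunks are
       compressed and marked sparse behind the scenes ... */

    sqlite3_close(db);
    return 0;
}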

Only the main database is compressed. The journal or any other temporary files aren’t compressed.

Limitations

Besides the fact that the code is in an experimental state, there are some things unsupported, or even unsupportable, by this mod. First and foremost, only this mod can read compressed databases. The original Sqlite will declare compressed databases corrupted. However, this mod can and should detect uncompressed databases and silently disable compression (but use at your own risk.)

Since NTFS sparse file support is the key to achieving compression, the mod is literally useless on non-NTFS systems.

Sqlite is known to be quite resilient in the face of file corruption. This can no longer be supported at the same level as the official release. In addition, corruption would destroy much more data than a single page. With the compression library and the new code also comes an increased risk of crashes or instability.

Among the untested and probably unsupported features of Sqlite are:

  • Online database backup.
  • Multiprocess read/write.
  • Vacuum.
  • Data recovery.
  • Shell and 3rd-party tools.

Performance wise, there is virtually no caching implemented beyond the current chunk. This is bare-bone caching and there is a lot of room for performance improvements.

Benchmarks

An import of an English SimpleWiki dump was used as benchmark. The main table holds an auto-increment index, timestamp, the page title and the page contents (both Unicode).

256 Kbyte Chunks and Level-6 Compression (sizes in KBytes)

                     Original            Sqlite Compressed
  NTFS Normal        204,438 (100%)      73,296 (35.85%)
  NTFS Compressed    117,460 (57.45%)    57,431 (28.09%)

1024 Kbyte Chunks and Level-9 Compression (sizes in KBytes)

                     Original            Sqlite Compressed
  NTFS Normal        204,438 (100%)      67,712 (33.12%)
  NTFS Compressed    117,460 (57.45%)    66,220 (32.39%)

It’s quite obvious that the savings with the modified Sqlite are substantial compared to NTFS compression on the original file. Interestingly, NTFS compression applied on top of an already-compressed file still yields gains. This is because of inefficiencies in Zlib (deflate) compression (less so for level 6 than for 9) and because NTFS can deallocate at the level of clusters, which are 4096 bytes, as opposed to the sparse method’s compression unit of 64 Kbytes. Since the free regions are written as zero bytes and aren’t deallocated unless a complete 64 Kbyte unit is entirely zeroed out, it seems reasonable to assume NTFS compression is crunching these zero-padded regions and deallocating them, as its unit is only 4096 bytes.

It should also be noted that while statistically we should get better compression with larger chunk sizes and higher compression levels, this isn’t linear. In fact, increasing the chunk size may lead to reduced net gains in file size due to the 64 Kbyte compression unit of NTFS. That is, if two chunks could each save a single unit (64 Kbytes,) doubling the chunk size (such that both would be compressed together as one chunk) might not be able to save 128 Kbytes, in which case the savings would be reduced from two units to one, resulting in a file 64 Kbytes larger than we had with the original chunk size. This heavily depends on both the data and the compression, of course.

Performance

A synthetic test was performed using text generated from an alphabet of alphanumeric characters plus symbols, in rows of random lengths under 1 MB. Zlib seems to perform slowly on this random data (although the number of possible codes is small.) A chunk size of 256 Kbytes and a compression level of 6 were used. 50 random rows are generated and inserted with incremental IDs (a two-column table,) the 50 rows are selected using the IDs and the texts compared to the originals, then new texts are generated with new lengths, this time under 2 MB, and the rows updated. Again the 50 rows are selected by ID and compared to the updated originals. The resultant database file is 50,686 Kbytes.

The original Sqlite code ran the test in 13.3 seconds; with default compression and no tracing (to avoid any overhead), the same test finished in 64.7 seconds (4.86x slower), resulting in a 41,184 KByte file. Both tests ran on the same generated data. The file was on a RAM disk to minimize disk overhead.

Considering that the data was random and synthetic and insert/update rate was equal to select rates, the results are reasonable. In practice, reads are typically more frequent than writes. With proper caching this should reduce the performance overhead significantly.

Download

The code holds the same copyright claims as Sqlite, namely none. The code is experimental. Use it at your own risk.

Download the code and prebuilt DLL. This sqlite3.dll is built from the version 3.7.7.1 amalgamation, created from the original sources by the original configure and make files with the default settings/flags. The compression code is added and it’s built using VS2010 SP1 and statically linked to the runtime libraries, so it has no dependencies.

Building

To build the code, first download a recent Sqlite version. The 3.7.7.1 amalgamation is perfect. The latest Zlib must also be downloaded and built.

Add the Zlib headers to the include path, copy the vfs_compress.c file next to sqlite sources and build. Next, build sqlite3.c amalgamation (or the original sources) and link the binaries of sqlite3, vfs_compress and Zlib to create the executable.

Future Plans

A good percentage of the official Sqlite tests pass successfully, but the corruption and format-validating tests unsurprisingly fail. Increasing the supported cases is a prime goal at this point. Getting the mod to “stable with known limitations” status would be a major milestone. Improving performance is another goal that isn’t very difficult to attain. Having the ability to enable/disable compression on any database would also be beneficial and would add more protection against misuse. It’d also be interesting to attempt supporting compression without NTFS sparse-file support. This, while much more complicated, would work on any system and not just NTFS.

As a bonus, it’s almost trivial to add encryption on top of the compression subsystem.

Any comments, ideas, feedback and/or constructive criticism are more than welcome.

Aug 26 2011
 

Databases are as ubiquitous as computers. They are often, erroneously, thought of as just data-intensive, long-term storage solutions. Yet in reality they are used in one shape or form in most programs, from word processors to media players to web browsers; all depend on databases to realize some of their features. This is without mentioning e-commerce, flight booking, libraries and, of course, government intelligence data. Even when no feature requires a database per se, user configuration and settings are often stored in one. Where previously Windows programs depended on the Registry and Unix programs on plain-text files, new products increasingly use the portable and vastly superior SQLite instead.

Each application of course has its own set of requirements, use-cases and patterns. Yet, surprisingly or otherwise, there are two main patterns that can be said to parent the rest: OLTP and OLAP. Or, Online Transaction Processing and Online Analytical Processing respectively. Consequently, database design and optimization depends heavily on our processing needs and models. (Here I’m not concerned with database engine design and architecture, rather on the use-case patterns and processing models.) To get a better feel of each, let’s consider typical cases for each.

Online Transaction Processing (OLTP)

This type of application is chiefly characterized by performing real-time, localized, mission-critical operations. That is, the operations concerned must complete virtually immediately (hence real-time processing,) they typically involve a very small number of entries independent of the remaining data (hence localized processing,) and there is zero tolerance for data inconsistency, corruption or undefined states. The latter property is what requires transactional processing, such that either the operation completely succeeds with all its sub-parts, or it completely fails, restoring the previous state of all involved entries.

In OLTP the operations are mostly Insert and Update, with occasional Delete operations for housekeeping, although deletion is typically deferred for performance and/or for later data mining and reporting. The tables involved are typically normalized around the data entities.

Examples include online shopping, flight and medical data and custom data processing.

Online Analytical Processing (OLAP)

In contrast to OLTP, this type of application is characterized by performing offline, aggregate and low-priority operations. The operations involved here can be executed at low-demand hours, often on archive databases and secondary servers (hence offline processing,) they typically aggregate a very large amount of data to generate statistics, reports or to find outliers (hence aggregation,) and, since they run offline, they are designed to have low priority and a low isolation level (read uncommitted) and, since there are few or no side-effects to failing, they are designed to fail rather than potentially interfere with competing OLTP operations (if executed on the same database.)

OLAP operations are mostly Select operations with virtually no data modification (except for storing the results, if at all, typically in dedicated tables.) These operations not only aggregate a large number of entries, with heavy use of aggregate functions, but they typically join a significant number of tables.

Examples include reporting, billing, statistical analysis and historic data processing.

Similarities and Differences

Since the difference between these two patterns lies in the underlying database operations they perform, it’s no surprise that there isn’t a multitude of basic patterns. The two main operation types are read and modify. That is, the applications that mostly read data (OLAP) typically perform massive read operations, potentially across the complete data set, with few or no modifications, while those that depend on heavy data modification care most about the integrity of the few entries they modify, with little or no interest in the rest of the data.

However, one must note that modification implies reading. This is crucial for correctly appreciating the performance characteristics of the two patterns. The three data-modifying operations, Insert, Update and Delete, must all first perform Select operation(s) in one form or another. While this is more apparent for Update and Delete, Inserts with constraints must first validate that no constraint is violated. This often involves a lookup operation in the constraint index(es). Only in the most basic and simplest cases could an Insert operation be free of all lookups (when foreign key, unique and other constraints are completely absent.)

This doesn’t mean that OLTP operations are a superset of OLAP operations. To try and optimize first and foremost for OLAP with the hope that OLTP operations would naturally also run faster, thanks to the improved read operations that it could utilize, is a fallacy. The two are worlds apart and typically don’t have overlapping indexable data (more on this below.) Where they do share fields, indexes on them would most certainly be welcome to both. Unfortunately, that doesn’t materialize nearly as much as one would like.

Typical properties of OLTP vs. OLAP:

  • Data. Transactional: current and in-progress. Analytical: retired, historic or archived.
  • Typical operations. Transactional: Insert, Update, Delete. Analytical: Select.
  • Function types. Transactional: based on business requirements. Analytical: based on data mining.
  • Normalization. Transactional: highly normalized, entity modeled. Analytical: possibly denormalized to minimize joins.
  • Data locality. Transactional: few tables, few entries. Analytical: large aggregation across many tables.
  • Indexing strategy. Transactional: limited to a few highly-selective fields. Analytical: generous indexing depending on the queries.
  • Operation duration. Transactional: typically very short, sub-millisecond range. Analytical: extended over minutes and hours.
  • Caching friendliness. Transactional: highly volatile data with strict persistence requirements. Analytical: static data, very cache friendly.
  • Backup requirement. Transactional: inconsistency and data loss may cost business. Analytical: all operations can be rerun; backup is redundant.
  • Query complexity. Transactional: trivial to simple. Analytical: complex to inhumane and unreadable.

Hybrids

The above descriptions are certainly not the only possibilities. A combination of both is more reasonable to expect in practice than the theoretical and somewhat idealized cases given above. One such case might include functionality that must aggregate several groups of data without suffering phantom reads. Such a requirement means that not only must the aggregate query run at a high isolation level, reducing parallelism, but it must also have supporting indexes so it finishes as fast as possible, to free up the entries that concurrent transactions might require. The additional indexes would incur an unwelcome cost on data modification (which must update the indexes as well.)

Optimization

Due to the differences in the operations involved, and indeed their requirements, optimization in each case is different than the other.

OLTP

To maximize performance, OLTP operations would use highly selective conditions on normalized tables in queries that are completely indexed (rather than partially indexed). This will get us to the target data as fast as possible, or, in case of insertion, will verify the absence of collisions equally quickly. Typically the primary key is used, which itself may be an alias of the internal row-id.

Next, we’d want to localize locks as much as possible to maximize concurrency and minimize collateral impact. Fine-grained lock hints such as row-level locks may be used. Normalization helps here by keeping the impact of transactions as local as possible. In addition, we’d want to minimize the average row size to improve cache hits and avoid undue penalty. Once again, normalization does most of the work here too.

Finally, OLTP operations would want to minimize, if not avoid, indexes, triggers, views and anything that doesn’t help them reach and update the data as fast as possible. To that end, indexes are only useful for reaching the data fast; everything else would simply slow down virtually all operations, save for whatever selects we may have.

OLAP

For OLAP on the other hand it’d be best to work with denormalized data, to minimize or avoid joins.

Unlike OLTP, it’s very hard to predict or prioritize the conditions most useful to OLAP, because the tables and fields of importance are determined by the particular query in question. Indeed, different queries might depend on very different fields and therefore indexes, so indexing some field might be useful to only one isolated query and not others. For a billing system, not only the customers and orders are of importance, but the order dates and the billing period as well. Similarly, to generate sales reports, the queries involved would select on products and dates, yet an inventory audit query might depend on a different set of fields. Hence, indexes are best decided based on the concrete queries in question.

To improve performance advanced features such as Materialized Views (aka Indexed Views) may be utilized, which are unfriendly to transactional processing.

Hybrids

From the above it’s quite clear that not only do OLTP and OLAP differ in their operations but, consequently, optimizing their queries puts them in apparent conflict with each other. The requirements for maximizing performance in each case simply contradict one another. But often both types of processing are done in our products. How can we optimize for them?

Optimization for both OLTP and OLAP

It must be said from the outset that optimizing for two contradictory sets of requirements can be a recipe for disaster. That is, we might end up with worse performance for most queries and even degrade the database design in the hope of improving performance. With hubris, some might advocate indexing all fields in query conditions, creating an uninformed mess in their wake. In fact, since OLTP operations are inherently very short-lived, the overhead of updating superfluous indexes will probably go unnoticed, while a long-running OLAP operation might get a noticeable boost. From this the gulls in question might pat themselves on the back for a job well done, completely oblivious to the fact that the OLTP operation in question will probably run millions of times, each time incurring the slight cost of the index updates, negating the apparent speed boost. To avoid such scenarios, we must understand that some compromise is perhaps unavoidable and approach the problem methodically.

Here are some design patterns that are often employed to maximize performance across both types of queries.

Data Archiving

Perhaps the simplest solution is to avoid performing both OLTP and OLAP type operations on the same data-set. To achieve this, we’ll have to split the data over two sets of tables. One, highly normalized and optimized for OLTP and another, perhaps denormalized, OLAP-optimized set.

A data-retirement scheme will be used to move retired data from the OLTP table(s) into the archive (OLAP) tables. This may be performed using a background process or using triggers. The latter may not be desirable, as triggers add latency to the OLTP queries; on the other hand, the cache hits and the avoidance of extra code and of scanning the OLTP tables might make it a good compromise.

Dynamic Indexing

A number of applications process data in batches. Each batch is processed in transactional fashion and, once all the data has been processed completely, reports and other aggregate functions are executed on the final set. Custom data processing is perhaps a good example here, where some data is processed (for example by means of transformation, conversion or other modification) and tracked by transactions in a database. The database accounts for every data item and its state as it is modified, with timestamps and warning/error codes where applicable.

Such a processing scheme has the advantage of allowing for OLTP-type optimization during the data-processing stage until all the data is completely processed, whereupon OLAP-optimized indexes are dynamically added. Even with the overhead of adding or modifying the indexes (yes, some might be dropped while new ones are added), this might win us a significant amount of CPU time overall. Of course we’d need to profile and benchmark thoroughly to decide which indexes to remove and which to add.

Biting the Bullet: Conjoining Both

In some cases we need to aggregate both retired and archived data as well as current and in-transit data. If this scenario is of prime importance, then archiving the data for offline processing might simply be unhelpful, as we must also join the OLTP tables to get as-yet unarchived data. Also, we can’t work in stages and perform dynamic indexing since we have to work with both current and archived data.

For obvious reasons this is the worst of both worlds. In this case, we must very carefully balance two forces:

  1. Indexes for optimum OLTP performance.
  2. Indexes for the most important OLAP queries.

Here analytic profiling of the different operations, their execution pattern and respective use-cases is necessary. Once we get a rough idea of the relative importance of our database operations, we can start collecting empirical data using benchmarks to evaluate the different approaches. Since the database design, indexes and other attributes must be shared among all database operations, we must choose the minimum set of indexes etc. that give the maximum performance across all operations.

The difficulty is of course in finding this optimum point. Each index helpful to the OLAP queries, but unnecessary to the OLTP ones, is an overhead on the high-rate transactions. On the other hand, strategic indexes can boost the performance of aggregate queries by orders of magnitude.

Here one is bound to spend a good amount of time using database profiling, statistical analysis and benchmarking. (In a future post I’ll address this topic head-on.) The downside of all this is that once the priorities of our OLAP queries change, then our particular database design will perhaps be outdated too.

Conclusion

The requirements of OLTP and OLAP are as different as their use-cases. Optimizing for both is a very tricky business, but sometimes unavoidable. If we must maximize performance, we have to invest a considerable amount of time designing a well-balanced database schema; only with extensive profiling and benchmarking, combined with analytic methods, can we decide on the design that maximizes performance for our particular needs.

(Anything missing? If you have something interesting on the topic to share, please don’t keep it to yourself.)

Jul 09 2011
 

Threading is on the front pages more than ever, so to speak. Where only a few years ago programmers could brush the topic aside as complicated and not-very-useful, it’s increasingly harder to hide behind those excuses. Multi-core systems are the status quo and the only real way to get better performance (read: scale) is to have multi-threaded code.

So it’s no surprise that the topic has been revived and there is more talk about it. And what discussion of threading would be complete without the pitfalls and problems? Consider the following question:

int x = 1;
int y = 0;

if (y == 0)
{
   // Can x be anything but 1 at this point?
}

There is no shortage of this type of questions. And there is no shortage of answers either!

The only correct answer to the above question is: It depends on the language!

What does the keyword volatile do? Same answer.

Does my code suffer from the ABA problem? Same answer (but that one is more complicated)

Similar question, same answer.

// Thread 1
x = 1;
y = 0;

// Thread 2
int a = x;
int b = y; // if b == 0, is a == 1?

Here is a pdf presentation of a lock-free hashtable in Java. Here is the video presentation. Again, if the same algorithm is implemented in a different language, the result will be different. Namely, it won’t work.

Why? Because the memory models are different. C++ has a very, very different memory model from .Net’s, which in turn differs from Java’s. (It wouldn’t be entirely wrong to say that C/C++ don’t define any memory coherency model as such.) Each in turn probably differs from other languages. And talking about threading without knowing the memory model is like assuming you can drive at the same speed on all roads. Well, some aren’t even paved, and some have minimum speed limits.
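
To see what “it depends on the language” buys you in practice: the brand-new C++11 standard (being finalized as of this writing) finally defines a memory model, and with it the question above gets a definite answer, provided you state how the variables are shared. A sketch (y starts at 1 here, so observing 0 proves the writer ran):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0);
std::atomic<int> y(1);

void writer()
{
    x.store(1, std::memory_order_relaxed);
    y.store(0, std::memory_order_release);   // "publish": everything before is visible...
}

void reader()
{
    if (y.load(std::memory_order_acquire) == 0)           // ...to whoever acquires y == 0
        assert(x.load(std::memory_order_relaxed) == 1);   // guaranteed, never fires
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}

Make x and y plain ints and the very same code is a data race: the standard then promises nothing at all, which is precisely the point.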

It’s very unfortunate that an already complex subject is made even more confusing by the differences in languages and their guarantees, or lack thereof. There is no shortage of confused, misleading and downright incorrect information when it comes to threading. Consider that locks don’t even protect your code against out-of-order execution. But in which language is that?

This is made even more complicated by the fact that processors and hardware architectures come with their own set of guarantees and reservations. So reading on those without taking into account what the language in question abstracts from the hardware, is yet another way to throw oneself in the rabbit hole.

So, the next time you read (or write) about threading issues, make sure you know which language is in question. If it doesn’t say, stop reading. Without that small one-word detail, everything you learn is wrong, regardless of how detailed and coherent it is. Because precise doesn’t necessarily mean accurate.

Jul 08 2011
 

Reading Cowboy Coding and the Pink Sombrero I’m reminded of the two occasions when I deserved that Pink Sombrero, or at least a sombrero. Since I’ll have done cowboy coding exactly twice before retiring, allow me to share the experience with you.

This happened a handful of years ago. I was working for a prominent player in the electronic trading market. Our internal process dictated design review by an architect; code review by at least two developers, including one from a different team; commit; an SCM build signed off to QA; QA sign-off to Operations after testing; and Ops performing pre-market testing with real currency before final deployment. Our customers are bankers and traders gambling millions on commodities around the globe, so there is a reason for such a strict process.

During a relaxed Sunday evening I got a call from Frankfurt Operations team. Something about the deployment tests failing. I was on standby for this deployment as a technical support for the Ops. The market opens Monday and there was a pre-market test that we couldn’t miss.

I’m in Yerevan. The developers responsible for the deployed component are in Mountain Time, in deep sleep. The European market-open time is in between. Immediately I was on my office workstation via remote desktop talking with a sleepy manager, VP of Software and a couple of guys from Ops. I started reading logs and code. I was promised at least one of the original coders to review my fix. After a marathon of about 7 hours we had a fix. The other guy reviewed it and, to my surprise, the VP of Software approved hotfixing my local build on the production farm.

Obviously this isn’t like a CSS fix on a live web-site. Leaving the code size and complexity aside, the module in question typically processes thousands of orders a day, each worth a few grand, some tens of thousands. Mistakes here cost a tad more than page views. Luckily, all went better than expected. The next day I went to the office feeling like a walking million dollars. Where is that pink sombrero when you deserve it?

A couple of months went by when again I got a similar call. This time some commodity price was missing or incorrect. The fix wasn’t very difficult: we had to find the correct price, write a small DB update script to patch the Price table, and test. The only thing was that there wasn’t nearly as much time as last time. Specifically, the pre-market test had come to a close and we had to fix the issue before market-open and go live without testing.

Fortunately, I found the price of the commodity in question on a test server and wrote the SQL to update the DB. Without blinking, the VP of Software gave me the green light to patch the production servers. Still no sombrero! All went as planned, except the price was wrong!

Early the next day at work I was already being asked by my manager and her supervisor about the fix. “Who approved it?!?” they asked impatiently and in the most irritating tone, before telling me how it blew up. Turns out the test server had old data. Everybody and their dog were in feisty mode.

And that was the last time I went cowboy coding, sans sombrero. Lesson learned, and I don’t care who approves cowboy coding… although I wish I’d had a Mojito while at it.

Jul 02 2011
 

A customer noticed a sudden and sharp increase in the database’s disk consumption. The alarm that went off was the low disk-space monitor; apparently the database in question had left only a few spare GBs on disk. The concerned customer opened a ticket asking two questions: Is the database growth normal and expected? And what are the average physical storage requirements per row?

The answer to the first question had to do with their particular actions, which was case-specific. To answer the second, however, one either has to keep track of each table’s schema, adding up the typical/maximum size of each field and calculating indexes and their sizes, or one can simply do the math on a typical dataset using SQL code. Obviously the latter is simpler and preferred.

Google returns quite a number of results (1, 2, 3, 4 and 5). For MS SQL, it seems that virtually all rely on the sp_spaceused stored proc. SQL Server also has an undocumented sproc, sp_msforeachtable, which runs over each table in the database and executes a command, passing each table’s name as a parameter. While it isn’t at all difficult to do this manually (looping over sys.tables is hardly a feat), calling this one-liner is still very convenient. So it’s no surprise that virtually all samples online just do that.

Here is an sproc that prints the total database size, reserved size, data size, index size and unused sizes. In addition, the sproc prints the same numbers for each table with the total number of rows in all tables at the end.

My prime interest wasn’t just to learn about the database size, which can be achieved using sp_spaceused without any params, nor to just learn about each table’s share, which can be done by passing the table name in question to sp_spaceused. My main purpose was to get a breakdown of the average row-size per table.

So, here is a similar script to do exactly that. The script first updates the page and row counts for the whole database (which may take a long time, so disable this step on production databases). In addition, it calculates the totals and averages of each data-point for all tables, as well as the average data size (data + index) and wasted bytes (reserved + unused) per row for each table. All the information for the tables is printed via a single join statement, returning one rowset with all the relevant data.

-- Copyright (c) 2011, Ashod Nakashian
-- All rights reserved.
-- 
-- Redistribution and use in source and binary forms, with or without modification,
-- are permitted provided that the following conditions are met:
-- 
-- o Redistributions of source code must retain the above copyright notice, 
-- this list of conditions and the following disclaimer.
-- o Redistributions in binary form must reproduce the above copyright notice, 
-- this list of conditions and the following disclaimer in the documentation and/or
-- other materials provided with the distribution.
-- o Neither the name of the author nor the names of its contributors may be used to endorse
-- or promote products derived from this software without specific prior written permission.
--
-- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
-- EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
-- OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT 
-- SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 
-- INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-- PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
-- INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
-- LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-- OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--
-- Show physical size statistics for each table in the database.
--
SET NOCOUNT ON

-- Update all page and count stats.
-- Comment for large tables on production!
DBCC UPDATEUSAGE(0) 

-- Total DB size.
EXEC sp_spaceused

-- Per-table statistics.
DECLARE @t TABLE
( 
    [name] NVARCHAR(128),
    [rows] BIGINT, 
    [reserved] VARCHAR(18), 
    [data] VARCHAR(18), 
    [index_size] VARCHAR(18),
    [unused] VARCHAR(18)
)

-- Collect per-table data in @t.
INSERT @t EXEC sp_msForEachTable 'EXEC sp_spaceused ''?'''

-- Calculate the averages and totals.
INSERT INTO @t
SELECT 'Average', AVG(rows),
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([reserved], 0, LEN([reserved]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([data], 0, LEN([data]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([index_size], 0, LEN([index_size]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([unused], 0, LEN([unused]) - 1) AS int))) + ' KB'
FROM   @t
UNION ALL
SELECT 'Total', SUM(rows),
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([reserved], 0, LEN([reserved]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([data], 0, LEN([data]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([index_size], 0, LEN([index_size]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([unused], 0, LEN([unused]) - 1) AS int))) + ' KB'
FROM   @t

-- Holds per-row average kbytes.
DECLARE @avg TABLE
( 
    [name] NVARCHAR(128),
    [data_per_row] VARCHAR(18),
    [waste_per_row] VARCHAR(18)
)

-- Calculate the per-row average data in kbytes.
INSERT INTO @avg
SELECT t.name, 
    CONVERT(varchar(18),
        CONVERT(decimal(20, 2),
            (CAST(SUBSTRING(t.[data], 0, LEN(t.[data]) - 1) AS float) +
             CAST(SUBSTRING(t.[index_size], 0, LEN(t.[index_size]) - 1) AS float)) 
            / NULLIF([rows], 0))) + ' KB', 
    CONVERT(varchar(18),
        CONVERT(decimal(20, 2),
            (CAST(SUBSTRING(t.[reserved], 0, LEN(t.[reserved]) - 1) AS float) +
             CAST(SUBSTRING(t.[unused], 0, LEN(t.[unused]) - 1) AS float))
            / NULLIF([rows], 0))) + ' KB'
FROM @t t

-- Join the two tables using the table names.
SELECT t.name, t.rows, t.reserved, t.data, t.index_size, t.unused, a.data_per_row, a.waste_per_row
FROM @t t
INNER JOIN @avg a ON t.name = a.name

There is quite a bit of data conversion and casting here that isn’t necessarily very performant, but there isn’t much choice, and optimizing further is probably unnecessary. Since there are so many different ways to get the same output, I’ll leave any variations up to the reader. Suggestions and improvements are more than welcome; please use the comments to share your alternatives.

This may easily be wrapped in an sproc for convenience. I hope you find it useful and handy.
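For instance, a minimal sketch of such a wrapper might look like the following (the procedure name is hypothetical, and the body is simply the script above, pasted in unchanged):

-- Hypothetical wrapper around the script above; the name is arbitrary.
CREATE PROCEDURE dbo.usp_TableSizeStats
AS
BEGIN
    SET NOCOUNT ON
    -- ... the full script above goes here, unchanged ...
END
GO

-- Then, whenever needed:
EXEC dbo.usp_TableSizeStats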

Jun 262011
 

Every so often a piece of technology comes along and changes everything. Once we experience this new way of doing things, we can no longer understand how we survived without it. After we sent our very first emails, walking to the post office to drop off mail seemed unearthly. And who would go back to a plain text-editor after using an IDE?

Git1 didn’t seem the answer to my needs. I’d been using Subversion (SVN) since 2006 and I’d been a very happy camper indeed. Before that I used CVS and, although I was inexperienced with Version Control Systems (VCS), it was a major improvement over MS Source Safe (which I had used for almost 6 years before that). I use SVN at home and at work. I’ve grown so used to and dependent on version control that I use SVN for my documents and other files, not just code. But Git? Why would I need Git?

When Git came onto the scene there were already some Distributed VCS (DVCS) around (as opposed to centralized VCS, such as CVS and SVN). But Linus made an impression with his Google Talk. I wanted to try this new piece of technology regardless of my needs. It was just too tasty to pass up. At the first opportunity, I installed the core tools and Git Extensions to ease my way with some visual feedback (I learn by reading and experimenting).

Now that I’ve played around with Git for a while, and I’ve successfully moved some of my projects from SVN to Git, I can share my experience. Here is why I use Git even when not working with a team (where it’s infinitely more useful).

Commit Often, Commit Many

Commits with half a dozen -unrelated- changes are no strangers to us. A developer might add some new function, refactor another and rename an interface member all in the same change-set. This is counter-productive, because reviewing such unrelated code changes is made artificially more difficult than necessary. But if the review unit is the commit unit, then developers will combine multiple changes to reduce overhead and push them onto their colleagues. This is unfortunate, because the code should evolve in the best way possible, uninfluenced by unrelated artificial forces such as tooling nuances. More than reviewing, though, combined commits cause a lot of headache and lost productivity when we need to go back in time to find a specific line of code, roll back or merge. But what if the changes were related? What if we need to make a few KLOCs of change for the code to even build successfully? The centralized VCS would recommend a branch. But unless the sub-project is long-term, branching is yet another overhead that developers try to avoid.

With Git, these problems are no more, thanks to local commits. With local commits, one can (and should) commit as often as possible. The change log need be nothing more than a single sentence. The changes aren’t reflected anywhere until we decide to push them onto the server. There is no longer a distinction between major changes and minor changes; all changes can be subdivided as much as necessary. No longer does one need to keep local backups2, create personal branches or make every change visible company-wide or publicly. Local commits give us full-fledged version control without introducing new or extra work. When we’re done, we just update the repository with one push command.

If you need to keep some piece of code around, but do not wish to send it for review and commit it, you’d normally have to copy it somewhere. With local commits, you can indeed commit it, with a relevant commit log. In a subsequent change-set you can delete it, with the full guarantee that you can retrieve it from Git later. Since this is done locally, no one complains and no one needs to review it. The code will be preserved forever in the repository once we push it. Later, when we resurrect it, it will be reviewed as it becomes part of the current code. Indeed, with local commits you can experiment with much freedom, with both the advantages of version control and the subsequent preservation of your bits in the repository for posterity.

Notice that all this applies equally well to private projects, single-developer public projects and multi-developer projects. The organizational advantages only become more valuable the more participants there are.

Easy Merging

Even with local commits, sooner or later we’ll need to branch off and work on a parallel line of code. And if our project is useful to anyone, the branches will diverge faster than you can check out. Merging is the currency of branching. Anyone who’s tried merging knows it is more often than not painful. This is typically because what’s being merged are the tips/heads of the branches in question. These two incarnations of our code become increasingly difficult to reconcile the more changes they have experienced in their separate lives.

But any VCS by definition has full history, which can be leveraged to improve merging. So why is this a Git advantage? Git has two things going for it. First and foremost, it has the full history locally. That’s right: your working-copy (WC) is not a copy of what you checked out; rather, it’s a clone of the repository. So while a centralized VCS can take advantage of the repository’s history, with Git this information is readily available in your WC. The second is that with local commits the commit unit is typically very small; this helps merging quite a bit, as Git can have higher confidence about where lines moved and what was changed into what.

Overall, merging with Git is otherworldly. So far, no centralized VCS can even match the accuracy of Git’s merge output.

Explicit Exclusion

With Source Safe, CVS and SVN it’s not rare to get broken builds because of missing files. After some point in a project’s life, adding new files becomes sporadic. It’s common to forget to put the new files under the VCS, only to be reminded by colleagues and broken-build emails, to the humiliation of the developer who missed the files, of course. If reviews are mandatory, then fixing this error involves at least one other developer, who needs to sign off the new patch for committing.

This problem arises from the fact that with these traditional, centralized VCSs, files are excluded implicitly (by default) and opted in when necessary. With Git, the opposite is the case: everything under the root is included by default, and exclusion is the exception. This sounds very trivial, but the consequences are anything but. Not only does this save time and avoid embarrassing mistakes, it’s also more natural. Virtually always, a file within the project tree is a file necessary for the project; the exceptions are few indeed. If you think about it, most of the exceptions are files generated by tools. These are excluded by file extension and folder name in the ignore file (.gitignore for Git). Rarely do we add files that shouldn’t be stored and tracked by the VCS. If it’s not automatically generated during build, then it should be in the repository.
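As an illustration, a hypothetical .gitignore for a typical Visual Studio project might contain nothing more than a handful of patterns for tool-generated output (these entries are examples, not from the original post):

# Hypothetical .gitignore: only tool-generated files are excluded;
# everything else under the root is tracked by default.
bin/
obj/
*.suo
*.user
*.log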

Conclusion

Git is a paradigm shift in version control. It’s not just a set of new features; it’s a different way of organizing change-sets and, by extension, of writing code. Git gives us better automation and tooling, and at the same time it encourages us to employ healthy and useful practices. In fact, the features outlined above make good use of Git’s distributed architecture, so it’s no coincidence that it’s so useful even for a single-user project.

If you’re using SVN, consider copying your repository over to Git using git-svn and playing around. Git can synchronize with the SVN repository until you decide to abandon one or the other. In addition, GitHub is a great online repository. As a learning tool, consider forking any of the countless projects and playing around.

1 Git has no exclusive monopoly on the discussed advantages; however, I’m reviewing my experience with Git in particular. Hg, Bazaar and others will have to wait for another time.
2 Here I’m concerned with backing up code that we don’t want to discard yet, but don’t want to commit either. Data backup is still necessary.

Jun 182011
 

I realize that the original article of the same title was longer than what most would like to read. So here is an abridged version.

By now everybody and their screensaver have heard the Optimization Mantra: Don’t Do It! This is commonly wrapped in a three-rule package, the first two of which are copies of the mantra, while the third adds the wise word “Yet” to the one-and-only true rule and addresses it to the “expert.”

Premature optimization: most of us have been there. And that’s what makes those words very familiar. Words of wisdom, if you will. We’ve all decided to do a smart trick or two before fleshing out the algorithm, and even before checking whether it compiles, let alone checking the result, only to be dumbfounded by the output. I can figure it out! we declare… and after half a day, we’d be damned if we rewrote that stupid function from scratch. No chance, bub.

The rules are sound. No doubt. Another rule of optimization, when the time comes, is to use profilers and never, ever, make costly assumptions. And any assumption is probably costly. That, too, is sound. These are words of wisdom, no doubt. But taken at face value, they could cause some harm.

In all but the smallest projects one must use a profiler, consult with others and especially talk with module owners, veteran developers and the architects before making any changes. The change-set must be planned, designed and well managed. The larger the project, the more this stage becomes important. No funny tricks, please.

Efficient Code != Premature Optimization

The traditional wisdom tells us to avoid premature optimization and, when absolutely necessary, to first use a profiler. But both of these can also be interpreted as follows: it’s OK to write inefficient and bloated code, and when necessary, we’ll see what the profiler comes up with.

Performance as an afterthought is very costly. Extremely so. But the alternative isn’t premature optimization. There is a very thin line between the well-thought-out, designed code you’d expect a professional to produce and student toy-project-style coding. The latter focuses on getting the problem of the moment solved, without any regard to error handling, performance or indeed maintenance.

It’s not premature optimization to use a dictionary/map instead of a list or array when reads are more common than writes. It’s not premature optimization to use an O(n) algorithm (if not an O(log₂ n) one) instead of the O(n²) one, when it isn’t much more complicated than what we would otherwise write. Similarly, moving invariant data outside a loop isn’t premature optimization.
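As a minimal illustrative sketch (in Java; the Customer type and the lookup scenario are hypothetical, purely to make the point), compare a linear list scan with a single map lookup:

import java.util.List;
import java.util.Map;

// Hypothetical example: when reads dominate, preferring a map over a list
// is sound design, not premature optimization.
class ContainerChoice {
    static final class Customer {
        final int id;
        final String name;
        Customer(int id, String name) { this.id = id; this.name = name; }
    }

    // O(n): may scan the entire list on every lookup.
    static Customer findInList(List<Customer> customers, int id) {
        for (Customer c : customers) {
            if (c.id == id) {
                return c;
            }
        }
        return null;
    }

    // O(1) expected: a single hash lookup.
    static Customer findInMap(Map<Integer, Customer> customersById, int id) {
        return customersById.get(id);
    }
}

Choosing the map here isn’t a micro-optimization; it’s simply matching the container to the dominant operation.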

As much as I’d hate to have a pretentious show-off in my team who’d go around “optimizing” code by making wild guesses and random changes, without running a profiler or talking with their colleagues, I’d hate it even more if the team spent their time cleaning up after one another. It’s easy to write code without thinking more than a single step ahead. It’s easy to type some code, run it, add random trace logs (instead of properly debugging), augment the code, run again, and repeat until the correct output is observed. As dull and dead-boring as that is.

I’m not suggesting that the extreme worst-case I’ve described is the norm (although you’d be surprised to learn just how common it is). My point is that there is a golden mean between “premature optimization” and “garbage coding.”

The Cost of Change

It’s well documented that the cost of change increases exponentially the later a project is in its development cycle. (See, for example, Code Complete.) This cost is sometimes overlooked, thanks to the Rule of Optimization. The rule strongly discourages thinking about performance, when one should at least give it proper thought at design time.

This doesn’t suggest optimization-oriented development. Rather, having a conscious grasp of the performance implications can avoid a lot of painful change down the road. As we’ve already said, designing and writing efficient code doesn’t necessarily mean premature optimization. It just means we’re being responsible: balancing the cost by investing a little early and avoiding a high cost in the future. For a real-life example, see Robert O’Callahan’s post.

Conclusion

Premature optimization is a major trap. The wisdom of the community tells us to avoid experimenting on our production code and to postpone optimization as much as possible. Only when the code is mature, and only when necessary, should we, with the aid of a profiler, find the hot-spots and then, and only then, very carefully optimize the code.

This strategy encourages developers to come up with inefficient, thoughtless and -often- outright ugly code. All in the name of avoiding premature optimization. Furthermore, it incorrectly assumes that profiling is a magic solution to improving performance.

There is no excuse for writing inefficient code when the alternative is available at little or no cost. There is no excuse for not thinking through the algorithm before typing. No excuse for leaving old experimental bits and pieces around because we might need them later, or because we’ll clean up later when we optimize. The cost of poor design and badly performing code is very high.

Let’s optimize later, but let’s write efficient code, not optimal, just efficient, from the get-go.

Jun 172011
 
Abridged version here.

By now everybody and their screensaver have heard the Optimization Mantra: Don’t Do It! This is commonly wrapped in a three-rule package (I suspect there is a word for that), the first two of which are copies of the mantra, while the third adds the wise word “Yet” to the one-and-only true rule and addresses it to the “expert.” I suspect originally the middle “rule” didn’t exist and was added later for effect, and perhaps to bring the total to the magic number of three.

I can almost imagine Knuth after figuring out a single-character bug in a bunch of code, with coffee mugs and burger wraps (or whatever it was that was popular in the ’60s) scattered around the desk… eyes bloodshot, sleepless and edgy, declaring “Premature optimization is the root of all evil.” (But in my mind he uses more graphic synonyms for ‘evil’.)

Knuth later attributed that bit about premature optimization to Tony Hoare (the author of Quicksort), thereby distorting my mental image of young Knuth swearing as he fixed his code, only to be saved later by Hoare himself, who apparently doesn’t remember uttering or coining such words. (Somebody’s got a bad memory… maybe more than one.)

Smart Aleck

Premature optimization: most of us have been there. And that’s what makes those words very familiar. Words of wisdom, if you will. We’ve all decided to do a smart trick or two before fleshing out the algorithm, and even before checking whether it compiles, let alone checking the result, only to be dumbfounded by the output. I can figure it out! we declare… and after half a day, we’d be damned if we rewrote that stupid function from scratch. No chance, bub.

Probably the smarter amongst us would learn from the experience of such dogged persistence and avoid trickery the next time around. While few would admit to the less-intelligent decisions they took in the past, at least some will have learned a lesson or two when the next opportunity knocked.

The aforementioned trickery doesn’t have to be optimization trickery, mind you. Some people (read: everyone) like to be smart-asses every so often and show off. Sure, many end up shooting themselves in the foot and making fools of themselves. But that doesn’t stop the kids from doing a crazy jump while waving to their friends, iPod on and eating crackers, just to impress someone… who typically turns around precisely when they shouldn’t. (Did I mention skateboards?)

The rules are sound. No doubt. Another rule of optimization, when the time comes, is to use profilers and never, ever, make costly assumptions. And any assumption is probably costly. That, too, is sound. These are words of wisdom, no doubt. But taken at face value, they could cause some harm.

Let’s take it from the top, leaving aside the smaller projects we might have started and tended and developed for years. Most projects involve multiple developers and typically span generations of developers. They are legacy projects, in the sense of having a long and rich history. No one person can tell you this history, let alone describe all parts of the code. On such a project, if performance is an issue, you shouldn’t go about shooting in the dark and spending days or even weeks on your hunches. Such an approach will not only waste time, it will also add a lot of noise and pollute the code-base and source repository (if you commit to the trunk, which you should never do until you’re done and ready to merge).

In such a case, one must use a profiler, consult with others and especially talk with module owners, veteran developers and the architects before making any changes. The change-set must be planned, designed and well managed. The larger the project, the more this stage becomes important. No funny tricks, please.

Efficient Code != Premature Optimization

Among the standard interview questions we often ask (and get asked) are those on data-structures and algorithms (discrete math and information theory). I typically ask candidates to compare the data-structures in terms of performance, which should cover both internal details and complexity characteristics (big O). It’s also a good opportunity to see how organized their thoughts are. We use arrays, lists and maps/dictionaries quite often, and not having a good grasp of their essence is a shortcoming. As a follow-up to this I typically ask how they decide which to use. This isn’t an easy question, I realize. Some things are hard to put into words, even when we have a good understanding of them in our minds. But interviews aren’t meant to be easy.

The worst answer I ever got was “I use List, because that’s what I get.” To which I had to ask “Where?” Apparently, the candidate had worked on a legacy project that used Lists almost exclusively and, bizarrely, she never had a need for anything else. The best answer typically gives a description of the use-case. That is, the candidate describes the algorithm they’ll implement and, from that, decides which container to use.

The best answer isn’t merely a more detailed or technical answer. It’s the best answer because it’s the only one that gives you reasons. The candidate must have thought about the problem and decided on an algorithm to use; they must have quantified the complexity of the algorithm (big O); and they must have known the performance characteristics of the different containers for the operations their algorithm needs. They have thought about the problem and their solution thoroughly before choosing containers, designing classes and whatnot.

The traditional wisdom tells us to avoid premature optimization and, when absolutely necessary, to first use a profiler. But both of these can also be interpreted as follows: it’s OK to write inefficient and bloated code, and when necessary, we’ll see what the profiler comes up with.

Performance as an afterthought is very costly. Extremely so. But I don’t recommend premature optimization. There is a very thin line between the well-thought-out, designed code you’d expect a professional to produce and student toy-project-style coding. The latter focuses on getting the problem of the moment solved, without any regard to error handling, performance or indeed maintenance. We’ve all done it: multiple similar functions; the same function reused for different purposes, with too many responsibilities; unreasonable resource consumption and horrible performance characteristics that the author is probably oblivious to. And so on.

It’s not premature optimization to use a dictionary/map instead of a list (or whatever the most common container in your language of choice is), not when we have to read items most of the time. It’s not premature optimization if we use an O(n) algorithm (if not an O(log₂ n) one) instead of the O(n²) one, when it isn’t much more complicated than what we would otherwise write. It’s not premature optimization if we refactor a function so that it doesn’t handle multiple unrelated cases. Similarly, moving invariant data outside a loop isn’t premature optimization. Nor is caching the results of a very complex calculation that we don’t need to redo.
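To make the loop-invariant point concrete, here is a hedged sketch in Java (the PricingPolicy type and the discount scenario are hypothetical, purely for illustration):

// Hypothetical type; the call is assumed loop-invariant and non-trivial to compute.
interface PricingPolicy {
    double currentDiscountRate();
}

class InvariantHoisting {
    // Before: the invariant discount factor is recomputed on every iteration.
    static double totalBefore(double[] prices, PricingPolicy policy) {
        double total = 0.0;
        for (double price : prices) {
            total += price * (1.0 - policy.currentDiscountRate());
        }
        return total;
    }

    // After: the invariant is computed once, outside the loop.
    static double totalAfter(double[] prices, PricingPolicy policy) {
        final double factor = 1.0 - policy.currentDiscountRate();
        double total = 0.0;
        for (double price : prices) {
            total += price * factor;
        }
        return total;
    }
}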

Regex object construction is typically an expensive operation due to the parsing and optimizations involved. Some dynamic languages allow for runtime compilation for further optimization. If the expression string isn’t modified, creating a new instance of this object over and over isn’t smart. In C# this would be constructing a Regex object with RegexOptions.Compiled; in Java it’s the Pattern.compile() that runs behind every matches() call on a String. Making the compiled object a static member is the smartest solution, and hardly a premature optimization. And the list goes on.
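In Java, for instance, a minimal sketch might look like this (the pattern and class are hypothetical):

import java.util.regex.Pattern;

// Hypothetical validator: the expression never changes, so compile it once.
class OrderIdValidator {
    // Parsed and compiled a single time, when the class is initialized.
    private static final Pattern ORDER_ID = Pattern.compile("[A-Z]{2}-\\d{6}");

    // Wasteful: String.matches() re-compiles the pattern on every call.
    static boolean isValidSlow(String candidate) {
        return candidate.matches("[A-Z]{2}-\\d{6}");
    }

    // Reuses the pre-compiled pattern.
    static boolean isValid(String candidate) {
        return ORDER_ID.matcher(candidate).matches();
    }
}

The same reasoning applies to the C# Regex object with RegexOptions.Compiled mentioned above.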

As much as I’d hate to have a pretentious show-off in my team who’d go around “optimizing” code by making wild guesses and random changes, without running a profiler or talking with their colleagues, I’d hate it even more if the team spent their time cleaning up after one another. It’s easy to write code without thinking more than a single step ahead. It’s easy to type some code, run it, add random trace logs (instead of properly debugging), augment the code, run again, and repeat until the correct output is observed.

I don’t know about you, but to me, writing and modifying code instead of designing and working out the algorithm beforehand is simply counter-productive. It’s not fun either. Similarly, debugging is much more interesting and engaging than adding random trace logs until we figure out what’s going on.

I’m not suggesting that the extreme worst-case I’ve described is the norm (although you’d be surprised to learn just how common it is). My point is that there is a golden mean between “premature optimization” and “garbage coding.”

The Cost of Change

When it’s time to spend valuable resources on optimization (including the cost of buying profilers), I don’t expect us to discover that we needed a hash-table instead of an array after all. Rather, I should expect the profiler to come up with more subtle insights: information that we couldn’t easily guess (and indeed shouldn’t need to). I should expect the seniors in the team to have a good grasp of the performance characteristics of the project, its weak points and its limitations. Surely the profiler will give us accurate information, but unless we are in a good position to make informed and educated guesses, the profiler won’t help us much. Furthermore, understanding and analyzing the profiler’s output isn’t trivial. And if we have no clue what to change, and how our changes would affect performance, we’ll use the profiler much like the student who prints variable traces and repeatedly makes random changes until the expected output appears. In short, the profiler just gives us raw data; we still have to interpret it, design a change-set and have a reasonably sound expectation of improved performance. Otherwise, profiling will be pure waste.

It’s well documented that the cost of change increases exponentially the later a project is in its development cycle. (See, for example, Code Complete.) This means a design issue caught during design or planning will cost next to nothing to fix. But try to fix that design defect when you’re performing system-testing, after most modules are integrated and working, and you’ll find that the change cascades over all the subsequent stages and the work already completed.

This cost is sometimes overlooked, thanks to the Rule of Optimization. The rule strongly discourages thinking about performance, when one should at least give it proper thought once the business and technical requirements are finalized (as far as design is concerned) and an initial design is complete. The architecture should answer the performance question. And at every step of the development path, developers must consider the consequences of their choices, algorithms and data-structures.

This doesn’t suggest optimization-oriented development. Rather, having a conscious grasp of the performance implications can avoid a lot of painful change down the road. As we’ve already said, designing and writing efficient code doesn’t necessarily mean premature optimization. It just means we’re being responsible: balancing the cost by investing a little early and avoiding a high cost in the future. For a real-life example see Robert O’Callahan’s post linked above.

I know there is a camp that, by this point, is probably laughing at my naïve thinking. By the time I finish optimizing, or even writing efficient and clean code, they’ll say, their product will have shipped and the customer will have bought the latest and fastest hardware, offsetting their disadvantage in performance. “What’s the point?” they add. While this is partially true (and it has happened before), given the same data the better performing product will still finish sooner on the same hardware. In addition, now that processors have stopped scaling vertically, code better designed for concurrent scalability (horizontal scaling) will outperform even the best single-threaded algorithm. And that’s not to mention that data outgrows hardware any day.

Conclusion

Premature optimization is a major trap. We learn by falling, getting up, dusting ourselves off and falling again. We learn by making mistakes. The wisdom of the community tells us to avoid experimenting on our production code and to postpone optimization as much as possible. Only when the code is mature, and only when necessary, should we, with the aid of a profiler, find the hot-spots and then, and only then, very carefully optimize the code.

This strategy encourages developers to come up with inefficient, thoughtless and -often- outright ugly code. All in the name of avoiding premature optimization. Furthermore, it incorrectly assumes that profiling is a magic solution to improving performance. It neglects to mention how involved profiling is. Those who have no clue why their code is bogged down won’t know what to change even if the profiler screams out where the slowest statement is.

There is no excuse for writing inefficient code when the alternative is available at little or no cost. There is no excuse for not thinking through the algorithm before typing. No excuse for leaving old experimental bits and pieces around because we might need them later, or because we’ll clean up later when we optimize. The cost of poor design and badly performing code is very high: it’s the least maintainable code, and it can be very costly to improve.

Let’s optimize later, but let’s write efficient code, not optimal, just efficient, from the get-go.
