Mar 28 2015
 

Whatever you make of self-help, and whether you take this to constitute a form of it or not, here I present the top three means I have found over the years to get productive. I believe they are the simplest, yet most effective, methods for being productive and efficient. But do not limit yourself to them; this offers but a good start.

Productivity is certainly relative. For our purposes, a wide-enough definition would be accomplishing what one needs, or has, to do. That is, completing a task.

Self-help is an oxymoron of a special sort. It wants the beneficiary, who is typically the customer of the self-help guru selling books, seminars, and workshops, to believe that they can help themselves. Assuming for the moment that one could help herself, it’s harder to imagine her doing so by seeking help from another. Perhaps self-confidence and empowerment are the first methods in the self-help industry’s repertoire.

While I don’t subscribe to the self-help movement, nor think it effective, I do believe advice borne of hard-earned experience can help the inexperienced and seasoned alike. My bookshelf betrays my bias against self-help and motivational material, which exceeds my bias against fiction and is only diminished by my bias against cargo cults. With this in mind, and with much reluctance, I put these three points to words.

TL;DR: Read the headings and jump to the conclusion. (Bonus, read the intro.)

Introduction

One of the potent forces that underlie anxiety and worry is lack of control. Control is an umbrella word that spans many aspects. We don’t have control over the things we don’t understand or comprehend. We also don’t have control over what we can’t change. Put together, they make for an explosive force that drains motivation and energy. Complex tasks are notorious for being opaque and out of reach. In addition, they don’t give us clues as to where to start.

As we procrastinate and put off complex tasks, which often turn out not to be nearly as hard or complex as we had thought, we lose valuable time. The impending deadline reminds us of the magnitude of the task which, in turn, makes the deadline feel too close to be realistic. This vicious cycle is hard to break without actually getting to work.

Once we have an understanding of what we’re up against, what we need to accomplish, and where to start, we have gained some mastery over it. We have control. Anxiety, too, is more controlled. We feel confident not just because we’re familiar with the task and how to handle it, but because we are also in a much better position to deal with uncertainty and surprises. In fact, this feeling of control and calm is so potent that it resembles the warmth we feel when the wind on a cold day dies down for a minute or two. It feels like we’re no longer cold, and we relax, forgetting that it’s still cold and the wind is bound to pick up. Don’t let the reward of breaking down tasks, planning, and organizing, as important as they are, substitute for real progress on the tasks themselves. Remember that controlling anxiety is an overhead; the task still demands all the effort we ought to put into it.

No project or situation is ideal, of course, nor do all plans pan out as expected. This fuels the anxiety and gives more reason to put off tasks until past the eleventh hour, when we are guaranteed to fail. This three-point approach deals with both anxiety and procrastination by claiming control and managing tasks in a friendly way. It doesn’t try to change our habits by limiting leisure time; rather, it paces the time we spend on productive tasks. It helps us understand what we have to accomplish and makes us think about the steps towards it. Finally, it gives us valuable feedback to improve our future planning and to rectify biased impressions about what we spend our time on.

Ⅰ. Make a List

The first major step is the one that goes the furthest in helping us get a job done: enumerating it. By creating a list, we have to go through the mental process of identifying the steps necessary to reach our goal. This process of enumeration is, it turns out, one of the biggest hurdles in getting something done.

Before we start working on any task that we feel burdened by, we need to put it in perspective. I often find myself procrastinating and avoiding tasks that a few years ago would have been incomparably daunting, such as shopping for some hardware. I put it off longer than necessary, even though all I have to do is browse a few candidates online, read a few reviews and compare prices and features, before hitting the magical button that will render a box at my doorstep a mere few days later. The fact that I listed what I have to do, ironically, makes the task sound as simple as it really is. But we all know how often we put off similarly simple tasks. Calling a friend, sending an email, working out, reading that book you’ve always meant to read but somehow found uninviting, and so on.

Making a list achieves two things. First, it forces us to go through the mental process of visualizing what we have to do. This is a major effort in and of itself, for more than one reason. Neuroscience tells us that by imagining or thinking about an act, our brain fires what are called mirror neurons. These neurons fire essentially as they would when we actually carry out the act itself. Imagining a physical workout fires the neurons that would activate when we physically do the workout. This is what makes us cringe when we hear about a painful incident, cry when we hear of a loss, and pull our limbs in when we see or hear of someone in harm’s way. By going through what we would have to do to get the task at hand accomplished, we literally make our brain go through it without any physical consequence. A simulation, or virtual-reality version, of things.

The second advantage of making lists is the breakdown. Most non-trivial tasks involve multiple steps. These steps, in their turn, can sometimes be split into further sub-tasks or steps. This process of simplification is of course welcome. We can then avoid the large upfront cost of working on the task in one sitting, which might end up taking too much time or simply waste quality time that could otherwise go into other important tasks.

I probably wouldn’t like wasting a beautiful weekend browsing online shops, say, to replace my router; it’s just too much work, at the expense of an otherwise perfectly serviceable weekend, for something that isn’t nearly rewarding or fun. However, I can search for reviews of the best routers of the year and quickly go through them for an initial survey of the market landscape. In another sitting, I can look up the top models on my preferred online shop to get a better picture of what buyers think and what the prices are like. In a separate sitting I can compare the features of the top 3-5 models that I think are within my budget and meet my needs. By this stage I should be almost ready to check out and place an order.

Having split the cost over a number of sittings, I have gained a number of advantages. First, it wouldn’t feel like a major undertaking. Second, and more importantly, I would have much more time to think about my options. This latter point is hard to overstate. Our subconscious brain is very good at processing complex situations at a very low cost. When we consciously think about a problem we devote virtually all of our attention and focus to it. This is very costly and, with limited time, doesn’t yield nearly as good decisions as one would hope. Delegating to the subconscious, or “sleeping on” a decision as it’s often called, gives us valuable time to process by changing the processing faculty, which is almost like getting a second opinion of sorts.

But does sending an email really need a list? While it doesn’t necessarily have multiple parts to it to be broken down in a list, we still need to place it as a task among others that we have to do. Putting it in context makes it easier for us to see the work ahead of us and prioritize before getting busy. Another advantage is that we don’t have to send the email in a single sitting. If it’s an important email (or like this post, an elaborate one,) we probably need to treat it as a writing task. Then we can outline the main points in a sitting, flesh it out in another, and revise and polish it in a third, before we finally hit the send button.

Finally, if there are unknown steps, or the order of tasks is not clear, do not worry. Just add to the list what you think is necessary or likely to be done. Add comments to your notes so you can return to them as more information becomes available. Invariably, as we progress through a multi-step task, we learn more about it and better understand what actions need to be taken to accomplish it. Feel free to split tasks, replace them, or combine them; it’s all part of the process of organization and planning. The list will make these uncertain steps much more transparent and manageable.

Ⅱ. Limit it

One of the things that makes us dread a task is the feeling of wasting quality time on something unrewarding. We’d rather watch that movie, browse the net for entertainment, play a game, etc., than do the laundry, read a book, get some work done, or file our tax forms. The difference between these two groups is primarily their pleasure rewards. While it’s important to have clean clothes and get the tax paperwork done, they are necessities that we would happily do away with if we could. The reward they bring is the avoidance of negative repercussions. In comparison, playing a game or watching a movie has positive rewards, and the negative repercussions, such as postponing cleaning the dishes, are minimal or easily justified.

Incidentally, the tasks with positive rewards are typically not called productive. This probably owes to the fact that such activity is best labelled play rather than work. At any rate, for our purposes, watching movies could also be a task, especially if one is in the review business. It is up to us to decide what is a task and what isn’t, not society. But we should be conscious of the two competing groups, as there will always be tasks that we prefer to do at the expense of the ones that we need, or have, to do. Procrastination is finding excuses to do the former rather than the latter.

A solution to this mental hurdle is to limit the time we are willing to spend on the more important, but less rewarding, tasks. This is in contrast to limiting the time we spend between productive tasks. It might seem more reasonable to limit the time we spend on entertainment rather than on productive tasks, but that only gives us an excuse to put entertainment first and procrastinate our way through the day.

It’s far more effective to cut a deal, so to speak, with ourselves. Spend no more than 20 to 30 minutes on the task at hand, then do anything of your choosing for another limited period. The only requirement is to prevent any distraction during those 25 minutes or so: no checking email, answering phone calls, checking your social network, and so on. Put your best into those few minutes and get as much done on the task as you can. Once the time is up, switch gears to anything you like. Repeat.

This approach is often called Pomodoro, after the tomato-shaped kitchen timer. Limiting time works because it puts an upper limit on our time investment and gives us something to look forward to. Once we are fully engaged with the task at hand, we might find it easier to finish it, even if we overrun our time limit, than to break out of the zone and be forced to start over. Because the cost of getting in and out of the zone, where we are most productive, is rather high, we should avoid distractions that we might naively think are instantaneous and therefore safe to multitask on. A quick email check might take a second or two, but when we see a new email we can’t avoid reading the subject line, which makes us think about the sender, the topic, and what it might contain. At this point we’re practically out of our zone and have forgotten what we were doing. Going back to our task might take a good several minutes to pick up where we left off, often because we can’t remember where we had gotten to and have to waste valuable time finding the exact point of departure.

This is not unlike what happens when we are interrupted while reading (if we don’t mark our place immediately, that is). We lose track not only of the last thing we read (often the sentence is interrupted midway,) but, more importantly, of where we were in the text. Marking the text on the screen is easier than in a printed book or even on a reader (and please, please, don’t dog-ear any book; you are almost never its last reader). I’m often surprised by how off the mark I am when I try to resume and guess where I was in the text, even when I know the page. Like the legendary boiling frog, unaware of its predicament, we progress through a task in small increments that, like the water heating up around the frog, feel seamless and continuous. We don’t notice where we are unless we step back and compare a previous stage to the current one. Interruptions force us to repeat a number of steps or, worse, to jump ahead and, after wasting some more time, realize that we have skipped too far ahead and promptly have to backtrack. This process is often repeated several times until we are back to the same mental state we were in when interrupted, only after wasting valuable time.

Ⅲ. Time it

Humans are notoriously bad at guessing and estimating. We are especially bad because of the illusion that we can pinpoint the value, duration, measure, etc. of anything familiar to us. If you doubt this, try to guess the height of colleagues or friends whose heights you don’t know, but whom you have met countless times. Write down your estimates and then ask them to measure and compare notes. Worse still is when you try to sort the heights of people you’re thoroughly familiar with. You soon realize how hard it is just to place them in relative order to one another, which should be vastly easier than putting a number on their height or weight. Try the same with virtually anything and you’ll see how far short of the mark you fall. Of course we aren’t equally bad at all estimations; some are harder than others. The point is that if you were to say how much time you spent on emailing, surfing, chatting, etc., you’d find out that you aren’t accurate at all, that is, after you’ve timed these activities.

By timing how long we spend on different activities we get a more accurate picture of the cost of each. This enables us to better prioritize and manage them. It might feel that doing the laundry takes forever, but in reality it probably takes comparable time to, if not less than, checking Facebook or Reddit. Even though the latter feels like a quick five-minute task, the reality is that we probably spend dozens of minutes at a stretch with no commitment to stop after a few minutes. Laundry, on the other hand, is certainly tedious and menial, but more likely than not limited in duration. Where the internet is open-ended and can take us into its endless labyrinths and bizarre corners, laundry, by comparison, hardly varies at all. Understandably, the latter’s monotony is the source of its being boring, and the former’s open-endedness the source of its intrigue and excitement.

By tracking the time we spend on different activities, even imprecisely, by checking the time before and after and mentally assessing the difference, the relative feel of how big each task is will change. I know it will take me a good 4 hours to assemble a brand-new computer, from its boxed parts to getting to my mailbox, precisely because I’ve kept track every time I had to do it. Although it is a fun activity, I know by the end of it I’ll be as tired as at the end of a long workday. Similarly, I know I spend far more time on email than it felt like before measuring. This made me think of ways to reduce that time. One solution that was very productive was to minimize both the number of times I hit reply and the length of my responses.

Conclusion

There is no shortage of task management software or sites. But one doesn’t need anything fancy. In most cases one doesn’t need more than a simple editable list (a.k.a. a text editor, or a notepad) and a timer. I’ve avoided making suggestions for software or sites because the research is part of the learning curve (but don’t procrastinate on it). It’s also best to find the tool one is most comfortable with. I will say, though, that I’ve often used sticky notes and text editors to track daily tasks. They are as effective as the more complex project management tools, especially for short-term or daily tasks.

The above three points are as simple as one can get in terms of organization. Before you start a day’s work, go through the top things you need to accomplish and write them down. You can prioritize quickly if that is easy or given. Break down the more complex tasks into sub-tasks that you can accomplish in a stretch of 20 minutes or so. Tackle them one by one in Pomodoro sittings and keep track of how much time they are actually taking. Be conscious of distractions and make notes of them, preferably next to the tasks.

By planning, knowing where one is going, controlling the effort, and monitoring progress, we are as organized and methodical as we can be, with minimal overhead.

Try it out, and share your experience.

Mar 18 2013
 

I don’t need to see the source code of an API to code against it. In fact, I actively discourage depending (even psychologically) on the inner details of an implementation. The contract should be sufficient. Of course I’m assuming a well-designed API with good (or at least decent) documentation. But often reality is more complicated than an ideal world.

Empty try{}

While working with System.Diagnostics.Process in the context of Parallel.ForEach things became a bit too complicated. (I’ll leave the gory details to another post.) What prompted this post was a weird pattern that I noticed while browsing Process.cs, the source code for the Process class (to untangle said complicated scenario).

RuntimeHelpers.PrepareConstrainedRegions();
try {} finally {
   retVal = NativeMethods.CreateProcessWithLogonW(
		   startInfo.UserName,
		   startInfo.Domain,
		   password,
		   logonFlags,
		   null,            // we don't need this since all the info is in commandLine
		   commandLine,
		   creationFlags,
		   environmentPtr,
		   workingDirectory,
		   startupInfo,        // pointer to STARTUPINFO
		   processInfo         // pointer to PROCESS_INFORMATION
	   );
   if (!retVal)
	  errorCode = Marshal.GetLastWin32Error();
   if ( processInfo.hProcess!= (IntPtr)0 && processInfo.hProcess!= (IntPtr)NativeMethods.INVALID_HANDLE_VALUE)
	  procSH.InitialSetHandle(processInfo.hProcess);
   if ( processInfo.hThread != (IntPtr)0 && processInfo.hThread != (IntPtr)NativeMethods.INVALID_HANDLE_VALUE)
	  threadSH.InitialSetHandle(processInfo.hThread);
}

This is from StartWithCreateProcess(), a private method of Process. So it’s no surprise that it’s doing a bunch of Win32 native API calls. What stands out is the empty try{} construct. But also notice the RuntimeHelpers.PrepareConstrainedRegions() call.

Thinking of possible reasons for this, I suspected it had to do with run-time guarantees. The RuntimeHelpers.PrepareConstrainedRegions() call is part of the Constrained Execution Regions (CER) machinery. So why the need for an empty try if we have the PrepareConstrainedRegions call? Regrettably, I conflated the two. In reality, the empty try construct has nothing to do with CER and everything to do with execution interruption by means of ThreadAbortException.

A quick search hit Siddharth Uppal’s The empty try block mystery where he explains that Thread.Abort() never interrupts code in the finally clause. Well, not quite.

Thread.Abort and finally

A common interview question, after going through the semantics of try/catch/finally, is to ask the candidate whether finally is always executed (since that’s usually the wording used). Are there no scenarios where one could conceivably expect the code in finally never to get executed? A creative response is an “infinite loop” in the try or catch (if it gets called). Novice candidates are easily confused when they consider a premature termination of the process (or thread). After all, one would expect some consistency in the behavior of code. So why shouldn’t finally always get executed, even in a process termination scenario?

It’s not difficult to see that there is a struggle of powers between the terminating party and the code/process in question. If finally always executed, there would be no guarantees for termination. We cannot guarantee both that finally will always execute and that termination will always succeed; one or both must have weak guarantees (or at least weaker guarantees than the other). When push comes to shove, we (as users or administrators) want to have full control over our machines, so we choose to have the ultimate magic wand to terminate any misbehaving (or just undesirable) process.

The story is a little different when it comes to aborting individual threads, however. Where on the process level the operating system can terminate it with a sweeping gesture, in the managed world things are more controlled. The CLR can see to it that any thread that is about to be terminated (by a call to Thread.Abort()) is wound down cleanly, respecting all the language and runtime rules. This includes executing finally blocks as well as finalizers.

ThreadAbortException weirdness

When aborting a thread, the apparent behavior is that of an exception thrown from within the thread in question. When another thread invokes Thread.Abort() on our thread, a ThreadAbortException is raised in our thread’s code. This is called an asynchronous thread abort, as opposed to a synchronous abort, where a thread invokes Thread.CurrentThread.Abort() (invariably on itself). Other asynchronous exceptions include OutOfMemoryException and StackOverflowException.

The behavior, then, is exactly as one would expect when an exception is raised. The exception bubbles up the stack, executing catch and finally blocks as one would expect from any other exception. There are, however, a couple of crucial differences between ThreadAbortException and other Exceptions (with the exception of StackOverflowException, which can’t be caught at all). First, this exception can’t be suppressed by simply catching it – it is automatically rethrown right after exiting the catch clause that caught it. Second, throwing it does not abort the running thread (it must be done via a call to Thread.Abort()).

The source of this behavior of ThreadAbortException is the abort-requested flag, which is set when Thread.Abort() is invoked (but not when the exception is thrown directly). The CLR then checks this flag at certain safe points and proceeds to raise the exception, which can normally surface between any two machine instructions. The runtime, however, guarantees that the exception will not be thrown while a finally block or unmanaged code is executing.

So the expectation of the novice interviewee (and Mr. Uppal’s) was right after all. Except it wasn’t. We are back full circle to the tension between the purpose of aborting a thread and the possibility of ill-behaved code never giving up at all. In fact, I am being too generous when I label code that wouldn’t yield to a request to abort as “ill behaved.” Because ThreadAbortException is automatically rethrown from catch clauses, the only way to suppress it is to explicitly call Thread.ResetAbort(), which clears the abort-requested flag. This is intentional, as developers are very much in the habit of writing catch-all clauses.
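
To make the rethrow-unless-reset behavior concrete, here is a small sketch (a hypothetical example targeting the classic .NET Framework, where Thread.Abort is supported; the worker just sleeps to stand in for real work):

using System;
using System.Threading;

class AbortDemo
{
    static void Main()
    {
        Thread worker = new Thread(() =>
        {
            try
            {
                try
                {
                    // Stand-in for long-running work.
                    Thread.Sleep(Timeout.Infinite);
                }
                catch (ThreadAbortException)
                {
                    Console.WriteLine("Caught ThreadAbortException.");
                    // Without this call the exception is automatically
                    // rethrown the moment this catch block exits.
                    Thread.ResetAbort(); // clears the abort-requested flag
                }
                Console.WriteLine("Abort suppressed; the thread lives on.");
            }
            finally
            {
                Console.WriteLine("finally runs either way.");
            }
        });

        worker.Start();
        Thread.Sleep(100); // give the worker time to reach its Sleep
        worker.Abort();    // raises ThreadAbortException in the worker
        worker.Join();
    }
}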

AppDomain hosting

So far we’ve assumed that we just might need to terminate a process, no questions asked. But why would one need to abort individual threads within a process? The answer lies with hosting. In environments such as IIS or SQL Server, the server should be both fast and reliable. This led to the design of compartmentalizing processes beyond threads. An AppDomain groups execution units within a single process such that spawning a new instance is fast (faster than spawning a complete new process,) while still allowing it to be unloaded on demand. When an AppDomain instance takes longer than the configured time (or consumes more of some resource than it should,) it’s deemed ill-behaved and the server will want to unload it. Unloading includes aborting every thread within the AppDomain in question.

The problem is yet again one of conflicting guarantees. This time, though, the termination logic needs to play along, or else. When terminating a process, the complete process memory is released, along with all its system resources; if the managed code or the CLR doesn’t do that, the operating system will. In a hosted execution environment, the host wants to have full control over the lifetime of an AppDomain (with all its threads,) all the while, when it decides to purge it, not wanting to destabilize the process or, worse, itself or the system at large. When unloading an AppDomain, the server wants to give it a chance to clean up and release any shared resources, including files, sockets and synchronization objects (i.e. locks,) to name but a few. This is because the process will continue running, hopefully for a very long time. Hence the behavior of ThreadAbortException, which executes every catch and finally block as it should.

In return, any code that wants to play rough gets to call Thread.ResetAbort() and go on with its life, thereby defeating the control that the server enjoyed. The server invariably has the upper hand, of course. Once a second time limit is exceeded after invoking Thread.Abort(), the server may, in the words of Tarantino, “go medieval” on the misbehaving AppDomain.

Rude Abort/Unload/Exit

When a thread requested to abort doesn’t play along, it warrants rudeness. The CLR allows a host to specify an escalation policy for such events, so that the host can escalate a normal thread abort into a rude thread abort. Similarly, a normal AppDomain unload or process exit may be escalated to a rude one.

But we all know that the server doesn’t want to be too inconsiderate. It wouldn’t want to jeopardize its own stability in the wake of this arms race between it and the hosted AppDomain. For that, it wants some guarantees: more stringent guarantees from the code in question that it will not misbehave again when given half a chance. One such guarantee is that the finalization code will not have ill side-effects.

In a rude thread abort event, the CLR forgoes calling any finally blocks (except those marked as Constrained Execution Regions, or CER for short) as well as any normal finalizers. But unlike mere finally blocks, finalizers are a much more serious bunch. They are functions with consequences. Recall that finalization serves the purpose of releasing system resources. In a completely managed environment, with garbage collection and cleanup, the only resources that need special care are those that aren’t managed. In an ideal scenario, one wouldn’t need to implement finalizers at all. The case is different when we need to wrap a system resource that is not managed (this includes P/Invoking native DLLs). All system resources that are represented by the framework are released using exactly such finalizers.

Admittedly, if we are developing standalone applications, as opposed to hosted ones, we don’t have to worry about the possibility of escalation and rude aborts or unloads. But then again, why should we worry about Thread.Abort() at all in such a case? Either our own code could issue such a harsh request, which we should avoid like the plague in favor of more civil cancellation methods, such as raising events or setting shared flags (with some special care), or our code is in a library that may be called either from a standalone application or a hosted one. Only in the latter case must we worry about, and prepare for, the possibility of rude abort/unload.

Critical Finalization and CER

Dispose() is called in finally blocks, either manually or via the using clause. So the only reliable way to dispose of objects during such an upheaval is to have finalizers on them. And not just any finalizers, but critical finalizers. This is the same technique used by SafeHandle to ensure that native handles are correctly released in the event of a catastrophic failure.
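
For illustration, here is a minimal sketch of such a wrapper (the Win32 calls are real, but the wrapper class itself is a hypothetical example, not framework code):

using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// SafeHandle derives from CriticalFinalizerObject, so ReleaseHandle is
// eagerly prepared and still runs during a rude AppDomain unload.
sealed class SafeLocalAllocHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr LocalAlloc(uint flags, UIntPtr bytes);

    [DllImport("kernel32.dll")]
    private static extern IntPtr LocalFree(IntPtr handle);

    private SafeLocalAllocHandle() : base(true) { }

    public static SafeLocalAllocHandle Allocate(uint bytes)
    {
        SafeLocalAllocHandle wrapper = new SafeLocalAllocHandle();
        wrapper.SetHandle(LocalAlloc(0, (UIntPtr)bytes));
        return wrapper;
    }

    // Runs in the critical finalizer; it must not fail or block.
    protected override bool ReleaseHandle()
    {
        return LocalFree(handle) == IntPtr.Zero;
    }
}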

When things get serious, only finalizers marked as safe are called. Unsurprisingly, attributes are used to mark methods as safe. The contract between the CLR and the code is a strict one. First, we communicate how critical a function is in the face of asynchronous exceptions by marking its reliability. Next, we waive our right to allocate memory, which isn’t trivial, since allocation happens transparently in some cases, such as P/Invoke marshaling, locking and boxing. In addition, the only methods we can call from within a CER block are those with strong reliability guarantees. Virtual methods that aren’t prepared in advance cannot be called either.

This brings us full circle to RuntimeHelpers.PrepareConstrainedRegions(). What this call does is tell the CLR to fully prepare the code that follows: allocating all necessary memory, ensuring there is sufficient stack space, and JIT-compiling the code, which in turn loads any assemblies we may need.

Here is a sample code that demonstrates how this works in practice. When the ReliabilityContract attribute is commented out, the try block is executed before the finally block fails. With the ReliabilityContract attribute, however, the PrepareConstrainedRegions() call fails to allocate all the necessary memory beforehand and therefore doesn’t even attempt to execute the try clause, nor the finally; instead, the exception is thrown immediately.
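
In outline, such a demo is just the prepare-then-try/finally pattern with a contract-annotated cleanup method. The following is a minimal compilable sketch of that shape (it illustrates the pattern only; it does not, by itself, induce the allocation failure described above, and the method names are illustrative):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.ConstrainedExecution;

class CerDemo
{
    static void Main()
    {
        // Ask the CLR to eagerly prepare (JIT, allocate for) the catch and
        // finally blocks of the try that immediately follows.
        RuntimeHelpers.PrepareConstrainedRegions();
        try
        {
            Console.WriteLine("try: doing work");
        }
        finally
        {
            Cleanup();
        }
    }

    // The contract declares that this method will not corrupt state and will
    // always succeed; only such methods belong inside a constrained region.
    // (Console.WriteLine is kept here purely for demonstration; a real CER
    // body would avoid anything that may allocate.)
    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    static void Cleanup()
    {
        Console.WriteLine("finally: constrained cleanup");
    }
}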

There are three ways to execute code in Constrained Execution Regions (from the BCL Team Blog):

  • ExecuteCodeWithGuaranteedCleanup, a stack-overflow safe form of a try/finally.
  • A try/finally block preceded immediately by a call to RuntimeHelpers.PrepareConstrainedRegions. The try block is not constrained, but all catch, finally, and fault blocks for that try are.
  • As a critical finalizer – any subclass of CriticalFinalizerObject has a finalizer that is eagerly prepared before an instance of the object is allocated.
    • A special case is SafeHandle’s ReleaseHandle method, a virtual method that is eagerly prepared before the subclass is allocated, and called from SafeHandle’s critical finalizer.

Conclusion

Normally, the CLR guarantees clean execution and cleanup (via finally and finalization code) even in the face of asynchronous exceptions and thread aborts. However, it reserves the right to take harsher measures if the host escalates things. Writing code in finally blocks to avoid dealing with the possibility of asynchronous exceptions, while not the best practice, will work. When we abuse this, by resetting abort requests or spending too long in finally blocks, the host will escalate to a rude unload and aggressively rip out the AppDomain with all its threads, bypassing finally blocks unless they are in a CER, as the code above does.

So, finally blocks are not executed when a rude abort/unload is in progress (unless in CER), when the process is terminated (by the operating system), when an unhandled exception is raised (typically in unmanaged code or in the CLR) or in background threads (IsBackground == true) when all foreground threads have exited.

Nov 16 2011
 

Transactions are the seat-belts of SQL servers. They protect our data by promising “all or nothing”. Data consistency at any level couldn’t be guaranteed without them. In a sense, transactions work because they guarantee atomicity. They work because they allow you to abandon your changes and walk away at will, completely secure in the knowledge that not only are your changes thus far invisible to anyone else, but also that they didn’t go anywhere near your valuable data.

Any reasonably-sized database codebase, therefore, is bound to have transactions peppered all about. Yet having a flawed implementation could be as bad as not having any at all. Perhaps even worse, given the false confidence. To my shock and horror, we had just discovered such a situation in one of our products. The opportunity to get to the bottom of transactions was pretty much forced on us, but it was a welcome opportunity nonetheless.

Here I’ll limit this investigation to MS SQL Server (2005 and 2008 were used for testing).

Here is what we’re trying to achieve:

  1. Understand how MS SQL transactions work.
  2. Transactions must support nesting, and a rollback should abort them all.
  3. Develop a template that uses transactions correctly with error handling.

Test Bed

Let’s start with a boilerplate database as a test-bed.

-- Create an uncreatively named database.
CREATE DATABASE DeeBee
GO
USE DeeBee
GO

-- Create a table to use for data modification.
CREATE TABLE Tee(
	-- We'll generate overflow exception at will using this field.
	Num TINYINT
)
GO

Now let’s see what happens if we do multiple inserts and one fails…

-- Sproc to do multiple inserts with failure.
CREATE PROCEDURE [Multi_Insert]
AS
	-- Normal case
	INSERT INTO Tee VALUES(1);
	-- Overflow case
	INSERT INTO Tee VALUES(2000);
	-- Normal case again
	INSERT INTO Tee VALUES(3);
GO

To execute the above sproc, run the following (repeat after each change to the sproc above):

-- Clean slate.
DELETE FROM Tee
GO

-- Execute complex statements.
EXEC Multi_Insert;
GO

-- Check the results
SELECT * FROM Tee
GO

Results (for the exec alone):

(1 row(s) affected)
Msg 220, Level 16, State 2, Procedure Multi_Insert, Line 8
Arithmetic overflow error for data type tinyint, value = 2000.
The statement has been terminated.

(1 row(s) affected)

Two things to observe here. First, unsurprisingly, the first insert went as planned, as can be seen in the first “1 row(s) affected” message, followed by the overflow error. Second, and this is the important part, the error didn’t abort or change the flow of the sproc; the third insert was executed as well.

Selecting everything in Tee gives us:

1
3

Error Handling

In many cases we wouldn’t want to resume our operations when one fails. A quick solution might look something like this:

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
	-- Normal case
	INSERT INTO Tee VALUES(1);
	if @@ERROR <> 0
		RETURN

	-- Overflow case
	INSERT INTO Tee VALUES(2000);
	if @@ERROR <> 0
		RETURN

	-- Normal case again
	INSERT INTO Tee VALUES(3);
GO

This would work, but it only prevents further operations after the initial error; it doesn’t undo previous changes. In addition, it’s very tedious, especially with complex sprocs, and it looks ugly. Let’s see what we can do with exception handling…

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	-- Normal case
	INSERT INTO Tee VALUES(1);
	-- Overflow case
	INSERT INTO Tee VALUES(2000);
	-- Normal case again
	INSERT INTO Tee VALUES(3);
END TRY
BEGIN CATCH
	PRINT ERROR_MESSAGE();
END CATCH
GO

This is functionally equivalent to the previous version; however, here we have clean control flow and no longer need to taint our code with error checking. The print statement isn’t necessary; I just added it to show that the same error occurs. The result of this sproc is a single insert into Tee with the value 1.

Transactions

Let’s do the same, this time using transactions only.

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN
	BEGIN TRANSACTION
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END
GO

The above sproc will result in the following, familiar, output:

(1 row(s) affected)
Msg 220, Level 16, State 2, Procedure Multi_Insert, Line 9
Arithmetic overflow error for data type tinyint, value = 2000.
The statement has been terminated.

(1 row(s) affected)

This is the exact same behavior as without the transaction! This means transactions don’t automatically add protection, recovery and undo. Not only did we not revert the first insert after the error, but we went on to execute the remainder of the transaction. The only way to fix this is to actually issue a ROLLBACK TRANSACTION statement. Of course we must do this by first detecting that something went wrong. Since we already know that we must do this manually, we should either check for errors or wrap the code in TRY/CATCH clauses.

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	ROLLBACK TRANSACTION
	PRINT ERROR_MESSAGE();
END CATCH
GO

Results:

(1 row(s) affected)

(0 row(s) affected)
Arithmetic overflow error for data type tinyint, value = 2000.

Finally, modifications are rolled back and the table is back to a clean slate, as it was before executing the sproc.

Nested Transactions

Now that we have a working transaction with correct exception handling, let’s try to develop a version that is symmetric and can be nested. We could use the above try/transaction/catch/rollback pattern, except it’s not robust. To see why, let’s take a slightly more complex case. Suppose you used this pattern in the sprocs that do modifications, and in one case you called one sproc from another.

-- Sproc that does complex operations.
CREATE PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	ROLLBACK TRANSACTION
	PRINT ERROR_MESSAGE();
END CATCH
GO

The above sproc is written using the same pattern we used in Multi_Insert. We know the inner Multi_Insert will hit an exception, roll back its transaction and return. But what about the outer transaction in SP_Complex? Will the first insert be rolled back? Will the last insert get executed?

Results:

(1 row(s) affected)

(1 row(s) affected)

(0 row(s) affected)
Arithmetic overflow error for data type tinyint, value = 2000.
Msg 3903, Level 16, State 1, Procedure SP_Complex, Line 19
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.
Transaction count after EXECUTE indicates that a COMMIT or ROLLBACK TRANSACTION statement is missing. Previous count = 1, current count = 0.

Things didn’t go as planned. Checking for modifications in the Tee table shows that indeed the transaction worked as expected and all changes were rolled back. We still got an error though! To understand what’s going on, let’s see what the error was about.

First, notice that there were two inserts, one in the outer transaction and the second in the inner one, yet both were rolled back (the table should be empty). Then we print the overflow error, which indicates that the second insert in Multi_Insert threw. So far, so good. Then we hit an unanticipated error message.

The clue to what happened is in “Transaction count after EXECUTE […]”. It seems that the SQL engine checks the transaction count after EXEC statements. Since we called ROLLBACK in Multi_Insert, it appears that this rolled back not only the inner transaction, but also the outer one. This is confirmed by two facts. First, the table hasn’t been changed and all our inserts have been undone. Second, the inner ROLLBACK does match the BEGIN TRANSACTION in Multi_Insert, and so does the pair in SP_Complex. So the only way we could get a mismatch error is if something in Multi_Insert had an effect larger than its scope. (Also, what’s with the “Transaction count” references?)

To see what actually happened, let’s trace the execution path:

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [Multi_Insert]. Transactions: ' + Convert(varchar, @@TRANCOUNT);
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [Multi_Insert]: ' + ERROR_MESSAGE();
	ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Sproc that does complex operations.
ALTER PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [SP_Complex]. Transactions: ' + Convert(varchar,@@TRANCOUNT);

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [SP_Complex]: ' + ERROR_MESSAGE();
	ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

Results:

IN [SP_Complex]. Transactions: 1

(1 row(s) affected)
IN [Multi_Insert]. Transactions: 2

(1 row(s) affected)

(0 row(s) affected)
ERR [Multi_Insert]: Arithmetic overflow error for data type tinyint, value = 2000.
Rolled back. Transactions: 0
ERR [SP_Complex]: Transaction count after EXECUTE indicates that a COMMIT or ROLLBACK TRANSACTION statement is missing. Previous count = 1, current count = 0.
Msg 3903, Level 16, State 1, Procedure SP_Complex, Line 21
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.
Rolled back. Transactions: 0

To get the transaction count we use @@TRANCOUNT. Looking at the output from the trace, we see the exact same thing, except now it’s obvious that the transaction count is 2 within the inner transaction, yet after catching the overflow exception and rolling back, the transaction count has dropped to 0! This confirms it: ROLLBACK works not only on the current transaction, but escalates all the way to the top-most transaction. The reference to “Previous count = 1, current count = 0.” is about the transaction count before the EXEC statement and after it. The ROLLBACK in Multi_Insert made both the COMMIT and ROLLBACK statements in SP_Complex obsolete, or rather invalid.

Correct Nesting

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [Multi_Insert]. Transactions: ' + Convert(varchar, @@TRANCOUNT);
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [Multi_Insert]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Sproc that does complex operations.
ALTER PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [SP_Complex]. Transactions: ' + Convert(varchar, @@TRANCOUNT);

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [SP_Complex]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

By checking whether or not an inner transaction has already rolled back, we avoid any mismatches. This works because, with TRY/CATCH, errors always transfer control to the CATCH clause, preventing any statements after the error from executing. If we catch an exception yet @@TRANCOUNT is 0, then we must conclude that an inner sproc has issued more commits than the transactions it created, or that it rolled back the transaction(s). In both cases we must not ROLLBACK unless we have a valid transaction, and we must always ROLLBACK in a CATCH clause when we do have one.

Notice that we don’t check @@TRANCOUNT before COMMIT TRANSACTION. The reason is that the only way to reach it is for all the previous statements to have executed successfully. True, if a mismatched COMMIT is issued somewhere we’ll get a mismatch error, but that’s a programmer error and must not be suppressed. By suppressing that case, our code would not work as expected (in a transaction), because all subsequent statements would execute outside any transaction and therefore couldn’t be rolled back. So we let that mismatch blow up.

The Template

So, now that we have a good understanding of how transactions and error handling work, we can finally create a template that we can reuse, safe in the knowledge that it’s nestable and will behave as expected in all cases.

BEGIN TRY
	BEGIN TRANSACTION

	--
	-- YOUR CODE HERE
	--

	-- Don't check for mismatching. Such an error is fatal.
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	-- An inner sproc might have rolled-back already.
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
END CATCH

Full test-bed for reference:

IF EXISTS(SELECT * FROM sys.sysdatabases where name='DeeBee')
DROP DATABASE DeeBee
GO

-- Create an uncreatively named database.
CREATE DATABASE DeeBee
GO
USE DeeBee
GO

-- Create a table to use for data modification.
CREATE TABLE Tee(
	-- We'll generate overflow exception at will using this field.
	Num TINYINT
)
GO


-- Sproc to do multiple inserts with failure.
CREATE PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [Multi_Insert]. Transactions: ' + Convert(varchar, @@TRANCOUNT);
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [Multi_Insert]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Sproc that does complex operations.
CREATE PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [SP_Complex]. Transactions: ' + Convert(varchar, @@TRANCOUNT);

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [SP_Complex]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Clean slate.
DELETE FROM Tee
GO

-- Execute complex statements.
--EXEC Multi_Insert;
EXEC SP_Complex;
GO

-- Check the results
SELECT * FROM Tee
GO

Conclusion

Error handling is often tricky and full of gotchas. We should be careful not to make any unwarranted assumptions, especially when working across languages. Error handling, and more importantly transactions, in SQL don’t necessarily behave the way your favorite programming language does. By carefully reading the documentation, experimenting and adding traces, we can gain insight and develop robust solutions.

However, the above isn’t the last word on error handling or transactions. Your needs might be different and your use-cases might dictate other priorities. Perhaps my approach misses some subtle case that you can’t afford to ignore. Or maybe this is good enough for a boilerplate template. Either way, comments, corrections and suggestions are more than welcome.

Aug 26 2011
 

Databases are as ubiquitous as computers. They are often, erroneously, associated exclusively with data-intensive, long-term storage solutions. Yet in reality they are used in one shape or form in most programs, from word processors to media players to web browsers; all depend on databases to realize some of their features. That is not to mention e-commerce, flight booking, libraries and, of course, government intelligence data. Even when no feature requires a database per se, user configuration and settings are often stored in databases. Where previously Windows programs depended on the Registry and Unix programs on plain-text files, increasingly new products are utilizing the portable and vastly superior SQLite instead.


Each application of course has its own set of requirements, use-cases and patterns. Yet, surprisingly or otherwise, there are two main patterns that can be said to parent the rest: OLTP and OLAP, or Online Transaction Processing and Online Analytical Processing respectively. Consequently, database design and optimization depend heavily on our processing needs and models. (Here I’m not concerned with database engine design and architecture, but rather with the use-case patterns and processing models.) To get a better feel for each, let’s consider typical cases.

Online Transaction Processing (OLTP)

This type of application is chiefly characterized by performing real-time, localized, mission-critical operations. That is, the operations concerned must complete virtually immediately (hence real-time processing,) they typically involve a very small number of entries independent of the remaining data (hence localized processing,) and there is zero tolerance for data inconsistency, corruption or undefined states. The latter property is what requires transactional processing, such that either the operation completely succeeds with all its sub-parts, or it completely fails, restoring the previous state of all involved entries.

In OLTP the operations are mostly Insert and Update, with occasional Delete operations for housekeeping, although typically deletion is deferred for performance and/or for later data mining and reporting. The tables involved are typically normalized around the data entities.

Examples include online shopping, flight and medical data and custom data processing.
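
For instance, a typical OLTP unit of work looks something like the following (a sketch against a hypothetical, normalized order schema, where OrderId is assumed to be an IDENTITY column):

-- Hypothetical OLTP transaction: a handful of keyed, localized
-- modifications that must succeed or fail as a single unit.
BEGIN TRANSACTION
	INSERT INTO Orders(CustomerId, OrderDate)
	VALUES (42, GETDATE());

	INSERT INTO OrderLines(OrderId, ProductId, Quantity)
	VALUES (SCOPE_IDENTITY(), 7, 3);

	UPDATE Stock
	SET OnHand = OnHand - 3
	WHERE ProductId = 7;
COMMIT TRANSACTION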

Online Analytical Processing (OLAP)

In contrast to OLTP, this type of application is characterized by performing offline, aggregate and low-priority operations. The operations involved here can be executed at low-demand hours, often on archive databases and secondary servers (hence offline processing,) they typically aggregate a very large amount of data to generate statistics, reports or to find outliers (hence aggregation,) and, since they run offline, they are designed to have low priority and a low isolation level (read uncommitted) and, since there is little or no side-effect to failing, they are designed to fail rather than potentially interfere with competing OLTP operations (if executed on the same database.)

OLAP operations are mostly Select operations with virtually no data modification (except for storing the results, if at all, typically in dedicated tables.) These operations not only aggregate a large number of entries, with heavy use of aggregate functions, but they typically also join a significant number of tables.

Examples include reporting, billing, statistical analysis and historic data processing.
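
A typical OLAP query, by contrast, might look something like this (again a sketch over a hypothetical sales schema; note the read-uncommitted hints and the wide date range):

-- Hypothetical yearly revenue report: read-only, aggregates a large
-- date range, joins several tables and tolerates dirty reads.
SELECT c.Region, p.Category, SUM(ol.Quantity * ol.UnitPrice) AS Revenue
FROM OrderLines ol WITH (NOLOCK)
JOIN Orders o WITH (NOLOCK) ON o.OrderId = ol.OrderId
JOIN Customers c WITH (NOLOCK) ON c.CustomerId = o.CustomerId
JOIN Products p WITH (NOLOCK) ON p.ProductId = ol.ProductId
WHERE o.OrderDate >= '2010-01-01' AND o.OrderDate < '2011-01-01'
GROUP BY c.Region, p.Category;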

Similarities and Differences

Since the difference between these two patterns lies in the underlying database operations they perform, it’s no surprise that there isn’t a multitude of basic patterns. The two main operation types are read and modify. That is, the applications that mostly read data (OLAP) typically perform massive read operations, potentially across the complete data set, with few or no modifications, and those that depend on heavy data modification care most about the integrity of the few entries they modify, with little or no interest in other data.

However, one must note that modification implies reading. This is crucial for correctly appreciating the performance characteristics of the two patterns. The three data-modifying operations, Insert, Update and Delete, must all first perform Select operation(s) in one form or another. While this is more apparent for Update and Delete, Inserts with constraints must first validate that no constraint is violated. This often involves a lookup in the constraint index(es). Only in the most basic and simplest cases could an Insert operation be free of all lookups (when foreign key, unique or other constraints are completely absent.)

This doesn’t mean that OLTP operations are a superset of OLAP operations. To try to optimize first and foremost for OLAP in the hope that OLTP operations would naturally also run faster, thanks to the improved read operations, is a fallacy. The two are worlds apart and typically don’t have overlapping indexable data (more on this below.) Where they do share fields, indexes on them would most certainly be welcome to both. Unfortunately, that doesn’t materialize nearly as often as one would like.

Typical properties of OLTP vs. OLAP

Property              | Transactional Processing (OLTP)                             | Analytical Processing (OLAP)
Data                  | Current and in-progress.                                    | Retired, historic or archived.
Typical Operations    | Insert, Update, Delete.                                     | Select.
Function Types        | Based on business requirements.                             | Based on data mining.
Normalization         | Highly normalized, entity modeled.                          | Possibly denormalized to minimize joins.
Data Locality         | Few tables, few entries.                                    | Large aggregation across many tables.
Indexing Strategy     | Limited to a few highly-selective fields.                   | Generous indexing depending on the queries.
Operation Duration    | Typically very short, sub-millisecond range.                | Extended over minutes and hours.
Caching Friendliness  | Highly volatile data with strict persistence requirements.  | Static data, very cache friendly.
Backup Requirement    | Inconsistency and data loss may cost business.              | All operations can be rerun; backup is redundant.
Query Complexity      | Trivial to simple.                                          | Complex to inhumane and unreadable.

Hybrids

The above descriptions are certainly not the only possibilities. A combination of both is more reasonable to expect in practice than the theoretical and somewhat idealized cases given above. One such case might include functionality that must aggregate several groups of data without suffering phantom reads. Such a requirement means not only that the aggregate query must run with a high isolation level, reducing parallelism, but also that it must add indexes to finish as fast as possible, to free up the entries that concurrent transactions might require. The additional indexes would incur an unwelcome cost for data modification (which must update the indexes as well.)

Optimization

Due to the differences in the operations involved, and indeed their requirements, optimization in each case is different than the other.

OLTP

To maximize performance, OLTP operations would use highly selective conditions on normalized tables in queries that are completely indexed (rather than partially indexed). This gets us to the target data as fast as possible or, in the case of insertion, verifies the absence of collisions equally quickly. Typically the primary key is used, which itself may be an alias of the internal row-id.

Next we’d want to localize locks as much as possible to maximize concurrency and minimize collateral impact. Fine-grained lock hints such as row-level locks may be used. Normalization helps here by keeping the impact of the transactions as local as possible. In addition, we’d want to minimize the average row size to improve cache hits and avoid undue penalty. Once again, normalization does most of the work here too.

Finally, OLTP operations would want to minimize, if not avoid, indexes, triggers, views and any construct that doesn’t help them reach and update the data as fast as possible. To that end, indexes are only useful insofar as they help reach the data fast; everything else would simply slow down virtually all operations, save for whatever selects we may have.
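
Put together, an OLTP statement tends to look like this sketch (hypothetical table; the point is the highly selective, fully indexed predicate and the narrow lock footprint):

-- Reach exactly one row through the primary key and keep the lock
-- footprint as small as possible.
BEGIN TRANSACTION
	UPDATE Account WITH (ROWLOCK)
	SET Balance = Balance - 50
	WHERE AccountId = 42;	-- highly selective: the clustered primary key
COMMIT TRANSACTION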

OLAP

For OLAP, on the other hand, it’s best to work with denormalized data, to minimize or avoid joins.

Unlike OLTP, it’s very hard to predict or prioritize the conditions most useful to OLAP. This is because the tables and fields of importance are dictated by the particular query in question. Indeed, different queries might depend on very different fields and therefore indexes. So indexing some field might be useful only to an isolated query and not to others. For a billing system, not only are the customers and orders of importance, but the dates of orders and the billing period as well. Similarly, to generate sales reports, the queries involved would select on products and dates. Yet an inventory audit query might depend on a different set of fields. Hence, indexes are best decided based on the concrete queries in question.
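
As a sketch, for the hypothetical billing query above, the index would be shaped around that one query:

-- Covering index shaped around the monthly billing query.
CREATE INDEX IX_Orders_OrderDate
ON Orders(OrderDate)
INCLUDE (CustomerId, TotalAmount);
GO

-- The query it serves: bill each customer for one month.
SELECT CustomerId, SUM(TotalAmount) AS MonthlyTotal
FROM Orders
WHERE OrderDate >= '2011-07-01' AND OrderDate < '2011-08-01'
GROUP BY CustomerId;
GO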

To improve performance further, advanced features such as materialized views (a.k.a. indexed views) may be employed, though they are unfriendly to transactional processing.
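
In SQL Server terms (a sketch only; syntax and restrictions vary by product), an indexed view over the hypothetical OrderFacts table above could pre-aggregate sales per product, at the price of every write having to maintain the view:

    -- SQL Server requires SCHEMABINDING, two-part table names, and
    -- COUNT_BIG(*) for an indexed view that uses GROUP BY.
    CREATE VIEW dbo.SalesByProduct
    WITH SCHEMABINDING
    AS
    SELECT ProductID,
           SUM(Amount)  AS TotalAmount,
           COUNT_BIG(*) AS RowCountBig
    FROM dbo.OrderFacts
    GROUP BY ProductID;
    GO

    -- The unique clustered index is what materializes the view; from now on
    -- every INSERT/UPDATE/DELETE on OrderFacts also updates this aggregate.
    CREATE UNIQUE CLUSTERED INDEX IX_SalesByProduct ON dbo.SalesByProduct (ProductID);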

Hybrids

From the above it’s quite clear that not only do OLTP and OLAP differ in their operations, but consequently optimizing their queries are apparently in conflict with each other. The requirements for maximizing performance in each case is simply contradictory to one another. But often both types of processing are done in our products. How can we optimize them?

Optimization for both OLTP and OLAP

It must be said from the outset that optimizing for two contradictory sets of requirements can be a recipe for disaster. That is, we might end up with worse performance for most queries, and even degrade the database design, in the hope of improving performance. With hubris, some might advocate indexing every field that appears in a query condition, creating an uninformed mess in their wake. Since OLTP operations are inherently short-lived, the overhead of updating the superfluous indexes will probably go unnoticed, while the long-running OLAP operations get a noticeable boost. The culprits might then pat themselves on the back for a job well done, oblivious to the fact that the OLTP operation in question will probably run millions of times, each time incurring the slight cost of the index updates and negating the apparent speed boost. To avoid such scenarios, we must accept that some compromise is probably unavoidable and approach the problem methodically.

Here are some design patterns that are often employed to maximize performance across both types of queries.

Data Archiving

Perhaps the simplest solution is to avoid performing both OLTP- and OLAP-type operations on the same data set. To achieve this, we split the data over two sets of tables: one highly normalized and optimized for OLTP, and another, perhaps denormalized, OLAP-optimized set.

A data-retirement scheme moves retired data from the OLTP table(s) into the archive (OLAP) tables. This may be done with a background process or with triggers. The latter may be undesirable because triggers add latency to the OLTP queries; on the other hand, the better cache hits and the avoidance of extra code that scans the OLTP tables might make them a reasonable compromise.
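
A minimal sketch of the background-process variant, assuming hypothetical Orders and OrdersArchive tables and a Status column that marks retired rows:

    -- Move retired rows into the archive in a single transaction so a crash
    -- cannot leave rows half-moved. In a real system the batch would be keyed
    -- off a captured set of IDs (or a cutoff date) so that rows retired while
    -- this runs are not deleted before being archived.
    BEGIN TRANSACTION;

    INSERT INTO OrdersArchive (OrderID, CustomerID, ProductID, OrderDate, Amount)
    SELECT OrderID, CustomerID, ProductID, OrderDate, Amount
    FROM Orders
    WHERE Status = 'retired';

    DELETE FROM Orders
    WHERE Status = 'retired';

    COMMIT TRANSACTION;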

Dynamic Indexing

A number of applications process data in batches. Each batch is processed in a transactional fashion and, once all the data has been processed completely, reports and other aggregate functions are executed on the final set. Custom data processing is perhaps a good example here: some data is processed (for example, by transformation, conversion, or other modification) and tracked by transactions in a database. The database accounts for every data item and its states as it is modified, with timestamps and warning/error codes where applicable.

Such a processing scheme has the advantage of allowing OLTP-type optimization during the data-processing stage; once all the data has been processed, OLAP-optimized indexes are added dynamically. The overhead of adding or modifying the indexes (yes, some might be dropped while new ones are added) can win us a significant amount of CPU time overall. Of course, we'd need to profile and benchmark thoroughly to decide which indexes to remove and which to add.
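
A sketch of the idea, with hypothetical table and index names (the DROP INDEX syntax shown is SQL Server's):

    -- Before the batch: drop the reporting index so the transactional stage
    -- does not pay for its maintenance on every row it touches.
    DROP INDEX IX_WorkItems_Report ON WorkItems;

    -- ... transactional batch processing runs here, updating WorkItems ...

    -- After the batch: rebuild the reporting index once, then run the
    -- aggregate queries against the now fully processed, static data.
    CREATE INDEX IX_WorkItems_Report ON WorkItems (CompletedAt, ErrorCode);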

Biting the Bullet: Conjoining Both

In some cases we need to aggregate both retired (archived) data and current, in-flight data. If this scenario is of prime importance, then archiving the data for offline processing is simply unhelpful, since we must also join against the OLTP tables to get the as-yet-unarchived data. Nor can we work in stages and apply dynamic indexing, since we have to work with current and archived data at the same time.
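
Such a query typically has to stitch the two data sets together, along these lines (hypothetical names again):

    -- Aggregate across the live OLTP table and the archive in one pass.
    -- The live Orders table gets hit with an OLAP-style range scan, which is
    -- precisely the access pattern its OLTP-tuned schema was not designed for.
    SELECT CustomerID, SUM(Amount) AS TotalAmount
    FROM (
        SELECT CustomerID, Amount, OrderDate FROM Orders          -- current, in-flight
        UNION ALL
        SELECT CustomerID, Amount, OrderDate FROM OrdersArchive   -- retired
    ) AS AllOrders
    WHERE OrderDate >= '2015-01-01'
    GROUP BY CustomerID;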

For obvious reasons this is the worst of both worlds. In this case, we must very carefully balance two forces:

  1. Indexes for optimum OLTP performance.
  2. Indexes for the most important OLAP queries.

Here, analytic profiling of the different operations, their execution patterns, and their respective use-cases is necessary. Once we have a rough idea of the relative importance of our database operations, we can start collecting empirical data through benchmarks to evaluate the different approaches. Since the database design, indexes, and other attributes must be shared among all database operations, we must choose the minimum set of indexes (and so on) that gives the maximum performance across all operations.

The difficulty, of course, lies in finding this optimum point. Each index that helps the OLAP queries but is unnecessary to the OLTP ones is an overhead on the high-rate transactions; on the other hand, strategic indexes can boost the performance of aggregate queries by orders of magnitude.

One is bound to spend a good amount of time here on database profiling, statistical analysis, and benchmarking. (In a future post I'll address this topic head-on.) The downside is that once the priorities of our OLAP queries change, our carefully balanced database design will probably be outdated too.

Conclusion

The requirements of OLTP and OLAP are as different as their use-cases. Optimizing for both is a very tricky business, but sometimes unavoidable. If we must maximize performance, we have to invest considerable time designing a well-balanced database schema; only then, using extensive profiling and benchmarking combined with analytic methods, can we decide on the design that maximizes performance for our particular needs.

(Anything missing? If you have something interesting on the topic to share, please don’t keep it to yourself.)

Jun 26 2011
 

Every so often a piece of technology comes along and changes everything. Once we experience this new way of doing things, we can no longer understand how we survived without it. After we sent our very first emails, walking to the post office to drop off mail seemed unearthly. And who'd trade an IDE back for a plain text-editor?

[Image: Git icon, created for the Open Icon Library. Via Wikipedia.]

Git1 didn’t seem the answer to my needs. I’ve been using Subversion (SVN) since 2006 and I’ve been a very happy camper indeed. Before that I used CVS and, although inexperienced with Version Control Systems (VCS), it was a major improvement over MS Source Safe (which I had used for almost 6 years before that.) I use SVN at home and at work. I’ve grown used and dependent on version control so much that I use SVN for my documents and other files, not just code. But Git? Why would I need Git?

When Git came onto the scene there were already some Distributed VCSs (DVCS) around (as opposed to centralized VCSs such as CVS and SVN). But Linus made an impression with his Google Talk. I wanted to try this new piece of technology regardless of my needs; it was just too tasty to pass up. At the first opportunity, I installed the core tools and Git Extensions to ease my way in with some visual feedback (I learn by reading and experimenting).

Now that I’ve played around with Git for a while, and I’ve successfully moved some of my projects from SVN to Git, I can share my experience. Here is why I use Git even when not working with a team (where it’s infinitely more useful.)

Commit Often, Commit Many

Commits with half a dozen unrelated changes are no strangers to us. A developer might add a new function, refactor another, and rename an interface member all in the same change-set. This is counter-productive, because reviewing such unrelated code changes is made artificially harder than necessary. But if the review unit is the commit unit, then developers combine multiple changes to reduce overhead and push them onto their colleagues. This is unfortunate, because code should evolve in the best way possible, uninfluenced by unrelated artificial forces such as tooling nuances. Beyond reviewing, combined commits cause much headache and lost productivity when we need to go back in time to find a specific line of code, roll back, or merge. But what if the changes are related? What if we need to make a few KLOCs of change for the code to even build successfully? A centralized VCS would recommend a branch, but unless the sub-project is long-term, branching is yet another overhead that developers try to avoid.

With Git, these problems are no more, thanks to local commits. With local commits, one can (and should) commit as often as possible. The change log need no longer be anything more than a single sentence. The changes aren't reflected anywhere until we decide to push them to the server. There is no longer a distinction between major and minor changes; all changes can be subdivided as much as necessary. No longer does one need to make local backups2, create personal branches, or make every change visible company-wide or publicly. Local commits are a full-fledged VCS that doesn't introduce new or extra work. When we're done, we update the repository with a single push command.

If you need to keep some piece of code around, but don't wish to send it for review and commit, you'd otherwise have to copy it somewhere. With local commits, you can simply commit it, with a relevant commit log. In a subsequent change-set you can delete it, with the full guarantee that you can retrieve it from Git later. Since this is done locally, no one complains and no one needs to review it. The code will be preserved in the repository forever once we push. Later, when we resurrect it, it will be reviewed as it becomes part of the current code. Indeed, with local commits you can experiment with great freedom, enjoying both version control and the subsequent preservation of your bits in the repository for posterity.

Notice that all this applies equally well to private projects, single-developer public projects, and multi-developer projects. The organizational advantages only become more valuable as the number of participants grows.

Easy Merging

Even with local commits, sooner or later we'll need to branch off and work on a parallel line of code. And if our project is useful to anyone, the branches will diverge faster than you can checkout. Merging is the currency of branching, and anyone who's tried it knows it is more often than not painful. This is typically because what's being merged are the tips/heads of the branches in question, and these two incarnations of our code become increasingly difficult to reconcile the more changes they have accumulated in their separate lives.

But any VCS, by definition, has full history, which can be leveraged to improve merging. So why is this a Git advantage? Git has two things going for it. First and foremost, it has the full history locally. That's right: your working copy (WC) is not a copy of what you checked out; rather, it's a clone of the repository. So while a centralized VCS can take advantage of the repository's history, for Git this information is readily available in your WC. Second, with local commits the commit unit is typically very small, which helps merging quite a bit, as the merge can have higher confidence about where lines moved and what was changed into what.

Overall, merging with Git is otherworldly. So far, no centralized VCS can even match the accuracy of Git’s merge output.

Explicit Exclusion

With Source Safe, CVS, and SVN, it's not rare to get broken builds because of missing files. After some point in a project's life, adding new files becomes sporadic, and it's common to forget to put the new files under the VCS, only to be reminded by colleagues and broken-build emails, to the humiliation of the developer who missed the files, of course. If reviews are mandatory, fixing this error involves at least one more developer, who needs to sign off on the new patch before it can be committed.

This problem arises from the fact that with these traditional, centralized VCSs, files are excluded implicitly (by default) and opted in when necessary. With Git, the opposite is the case: everything under the root is included by default, and exclusion is the exception. This sounds trivial, but the consequences are anything but. Not only does it save time and avoid embarrassing mistakes, it's also more natural. Virtually every file within the project tree is a file the project needs; the exceptions are few indeed. If you think about it, most of the exceptions are files generated by tools, and these are excluded by file extension and folder name in the ignore file (.gitignore for Git). Rarely do we add files that shouldn't be stored and tracked by the VCS: if it isn't automatically generated during the build, it belongs in the repository.

Conclusion

Git is a paradigm shift in version control. It's not just a set of new features; it's a different way of organizing change-sets and, by extension, of writing code. Git gives us better automation and tooling, and at the same time encourages healthy and useful practices. The features outlined above all make good use of Git's distributed architecture, so it's no coincidence that it's so useful even for a single-developer project.

If you’re using SVN, consider copying it over to Git using git-svn and playing around. Git can synchronize with the SVN, until you decide to abandon one or the other. In addition, GitHub is a great online repository. As a learning tool, consider forking any of the countless projects and play around.

1 Git has no exclusive monopoly on the advantages discussed here; however, I'm reviewing my experience with Git in particular. Hg, Bazaar, and others will have to wait for another time.
2 Here I'm concerned with backing up code that we don't want to discard yet, but don't want to commit either. Data backup is still necessary.
