Transactions are the seat-belts of SQL servers. They protect our data by promising “all or nothing”. Data consistency on any level couldn’t be guaranteed without it. In a sense transactions work because they guarantee atomicity. They work because they allow you to abandon your changes and walk away at will, completely secure in the knowledge that not only your changes thus far aren’t visible to anyone, but also that your changes didn’t go anywhere near your valuable data.

Any reasonably-sized database code, therefore, is bound to have transactions peppered all about. Yet having a flawed implementation could be as worse as not having any at all. Perhaps even worse, given the false confidence. To my shock and horror, we just discovered a similar situation in one of our products. The opportunity to get to the bottom of transactions was pretty much forced, but it was a welcome opportunity nonetheless.

Here I’ll limit this investigation to MS SQL (2005 and 2008 used for testing).

Here is what we’re trying to achieve:

  1. Understand how MS SQL transactions work.
  2. Transactions must support nesting and rollback should abort all.
  3. Develop a template that use transactions correctly with error handling.

Test Bed

Let’s start by a boilerplate database as a test-bed.

-- Create an uncreatively named database.
CREATE DATABASE DeeBee
GO
USE DeeBee
GO

-- Create a table to use for data modification.
CREATE TABLE Tee(
	-- We'll generate overflow exception at will using this field.
	Num TINYINT
)
GO

Now let’s see what happens if we do multiple inserts and one fails…

-- Sproc to do multiple inserts with failure.
CREATE PROCEDURE [Multi_Insert]
AS
	-- Normal case
	INSERT INTO Tee VALUES(1);
	-- Overflow case
	INSERT INTO Tee VALUES(2000);
	-- Normal case again
	INSERT INTO Tee VALUES(3);
GO

To execute the above sproc, run the following (repeat after each change to the sproc above):

-- Clean slate.
DELETE FROM Tee
GO

-- Execute complex statements.
EXEC Multi_Insert;
GO

-- Check the results
SELECT * FROM Tee
GO

Results (for the exec alone):

(1 row(s) affected)
Msg 220, Level 16, State 2, Procedure Multi_Insert, Line 8
Arithmetic overflow error for data type tinyint, value = 2000.
The statement has been terminated.

(1 row(s) affected)

Two things to observe here: First, unsurprisingly the first insert went as planned as can be seen in the first “1 row(s) affected” message, followed by the overflow error. Second, which is important here, is that this didn’t abort or change the flow of the sproc, rather, the 3rd insert was executed as well.

Selecting everything in the Tee gives us:

1
3

Error Handling

In many cases we wouldn’t want to resume with our operations when one fails. A quick solution might look something like this:

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
	-- Normal case
	INSERT INTO Tee VALUES(1);
	if @@ERROR <> 0
		RETURN

	-- Overflow case
	INSERT INTO Tee VALUES(2000);
	if @@ERROR <> 0
		RETURN

	-- Normal case again
	INSERT INTO Tee VALUES(3);
GO

This would work, but it only prevents further operations after the initial error, it doesn’t undo previous changes. In addition, it’s very tedious, especially with complex sprocs and looks ugly. Let’s see what we can do with exception handling…

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	-- Normal case
	INSERT INTO Tee VALUES(1);
	-- Overflow case
	INSERT INTO Tee VALUES(2000);
	-- Normal case again
	INSERT INTO Tee VALUES(3);
END TRY
BEGIN CATCH
	PRINT ERROR_MESSAGE();
END CATCH
GO

This is functionally equivalent to the previous version, however here we have a clean control-flow and no longer do we need to taint our code with error checking. The print statement isn’t necessary, I just added it to show that the same error occurs. The result of this sproc is a single insert into Tee with the value 1.

Transactions

Let’s do the same, this time using transactions only.

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN
	BEGIN TRANSACTION
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END
GO

The above sproc will result in the following, familiar, output:

(1 row(s) affected)
Msg 220, Level 16, State 2, Procedure Multi_Insert, Line 9
Arithmetic overflow error for data type tinyint, value = 2000.
The statement has been terminated.

(1 row(s) affected)

This is the exact same behavior as without the transactions! This means transactions don’t automatically add protection, recovery and undo. Not only didn’t we revert the first insert after the error, but we went on to execute the remainder of the transaction. The only way to fix this is to actually issue a ROLLBACK TRANSACTION statement. Of course we must do this by first detecting that something went wrong. Since we already know that we must do this manually, we should either check for errors or wrap the code in TRY/CATCH clauses.

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	ROLLBACK TRANSACTION
	PRINT ERROR_MESSAGE();
END CATCH
GO

Results:

(1 row(s) affected)

(0 row(s) affected)
Arithmetic overflow error for data type tinyint, value = 2000.

Finally, modifications are rolled back and the table is back to a clean slate, as it was before executing the sproc.

Nested Transactions

Now that we have a working transaction with correct exception handling, let’s try to develop a version that is symmetric and can be nested. We could use the above try/transaction/catch/rollback pattern, except, it’s not robust. To see why that is so, let’s take a little bit more complex case. Suppose you used this pattern in the sprocs that do modification, and in one case you called one sproc from another.

-- Sproc that does complex operations.
CREATE PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	ROLLBACK TRANSACTION
	PRINT ERROR_MESSAGE();
END CATCH
GO

The above sproc is written using the same pattern we used in Multi_Insert. We know the inner Multi_Insert will have an exception, it’ll rollback the transaction and return. But what about the outer transaction in SP_Complex? Would the first insert be rolled back? Will the last insert get executed?

Results:

(1 row(s) affected)

(1 row(s) affected)

(0 row(s) affected)
Arithmetic overflow error for data type tinyint, value = 2000.
Msg 3903, Level 16, State 1, Procedure SP_Complex, Line 19
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.
Transaction count after EXECUTE indicates that a COMMIT or ROLLBACK TRANSACTION statement is missing. Previous count = 1, current count = 0.

Things didn’t go as planned. Checking for modifications in the Tee table shows that indeed the transaction worked as expected and all changes were rolled back. We still got an error though! To understand what’s going on, let’s see what the error was about.

First, notice that there were two inserts, one happened in the outer transaction and the second from the inner one, yet both were rolled back (the table should be empty). Then we print the overflow error, which indicates that the second insert in Multi_Insert threw. So far, so good. Then we hit an unanticipated error message. The clue to what happened is in “Transaction count after EXECUTE [...]“. It seems that SQL engine is checking for transaction count after EXEC statements. Since we called ROLLBACK in Multi_Insert it seems that this has rolled back not only the inner transaction, but also the outer. This is confirmed by two facts. First, the table hasn’t been changed and all our inserts have been undone. Second, the inner ROLLBACK does match the BEGIN TRANSACTION in Multi_Insert and so does the pair in SP_Complex. So the only way we could get a mismatch error is if something in Multi_Insert had an effect larger than its scope. (Also, what’s with “Transaction count” references?)

To see what actually happened, let’s trace the execution path:

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [Multi_Insert]. Transactions: ' + Convert(varchar, @@TRANCOUNT);
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [Multi_Insert]: ' + ERROR_MESSAGE();
	ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Sproc that does complex operations.
ALTER PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [SP_Complex]. Transactions: ' + Convert(varchar,@@TRANCOUNT);

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [SP_Complex]: ' + ERROR_MESSAGE();
	ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

Results:

IN [SP_Complex]. Transactions: 1

(1 row(s) affected)
IN [Multi_Insert]. Transactions: 2

(1 row(s) affected)

(0 row(s) affected)
ERR [Multi_Insert]: Arithmetic overflow error for data type tinyint, value = 2000.
Rolled back. Transactions: 0
ERR [SP_Complex]: Transaction count after EXECUTE indicates that a COMMIT or ROLLBACK TRANSACTION statement is missing. Previous count = 1, current count = 0.
Msg 3903, Level 16, State 1, Procedure SP_Complex, Line 21
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.
Rolled back. Transactions: 0

To get the transaction count we use @@TRANCOUNT. Now if we look at the output from the trace, we’ll see the exact same thing, except now it’s obvious that the transaction count is at 2 within the inner transaction, yet after catching the overflow exception and rolling back, the transaction count has dropped to 0! This confirms it. ROLLBACK works not only on the current transaction, but escalates all the way to the top-most transaction. The references to the “Previous count = 1, current count = 0.” is regarding the transaction count before the EXEC statement and after it. The ROLLBACK in Multi_Insert made both the COMMIT and ROLLBACK statements in SP_Complex obsolete, or rather invalid.

Correct Nesting

-- Sproc to do multiple inserts with failure.
ALTER PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [Multi_Insert]. Transactions: ' + Convert(varchar, @@TRANCOUNT);
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [Multi_Insert]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Sproc that does complex operations.
ALTER PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [SP_Complex]. Transactions: ' + Convert(varchar, @@TRANCOUNT);

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [SP_Complex]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

By checking whether or not an inner transaction has rolled back already we avoid any mismatches. This works because using try/catch errors always execute the catch clauses, preventing any statements after the error to execute. If our transactions catch an exception, yet the @@TRANCOUNT is 0, then we must conclude that an inner sproc has made more commits than the transactions it created, or it rolled back the transaction(s). In both cases we must not ROLLBACK unless we have a valid transaction, and we must always ROLLBACK if we’re in a CATCH clause if we have a valid transaction. Notice that we don’t check for @@TRANCOUNT to COMMIT TRANSACTION. The reason is because the only case where we could get to it is when the execution of the previous statements are successful. True that if somewhere a mismatched COMMIT is done, we’ll get a mismatch, but that’s a programmer error and must not be suppressed. That is, by suppressing that case, our code will not work as expected (in a transaction) because all subsequent statements will execute outside transactions, and therefore can’t be rolled back. So we let that mismatch blow up.

The Template

So, now that we have a good understanding of how transactions and error handling works, we can finally create a template that we can reuse, safe in the knowledge that it’s nestable and will behave as expected in all cases.

BEGIN TRY
	BEGIN TRANSACTION

	--
	-- YOUR CODE HERE
	--

	-- Don't check for mismatching. Such an error is fatal.
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	-- An inner sproc might have rolled-back already.
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
END CATCH

Full test-bed for reference:

IF EXISTS(SELECT * FROM sys.sysdatabases where name='DeeBee')
DROP DATABASE DeeBee
GO

-- Create an uncreatively named database.
CREATE DATABASE DeeBee
GO
USE DeeBee
GO

-- Create a table to use for data modification.
CREATE TABLE Tee(
	-- We'll generate overflow exception at will using this field.
	Num TINYINT
)
GO

-- Sproc to do multiple inserts with failure.
CREATE PROCEDURE [Multi_Insert]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [Multi_Insert]. Transactions: ' + Convert(varchar, @@TRANCOUNT);
		-- Normal case
		INSERT INTO Tee VALUES(1);
		-- Overflow case
		INSERT INTO Tee VALUES(2000);
		-- Normal case again
		INSERT INTO Tee VALUES(3);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [Multi_Insert]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Sproc that does complex operations.
CREATE PROCEDURE [SP_Complex]
AS
BEGIN TRY
	BEGIN TRANSACTION
		PRINT 'IN [SP_Complex]. Transactions: ' + Convert(varchar, @@TRANCOUNT);

		-- Normal case
		INSERT INTO Tee VALUES(0);

		-- Execute a bad sproc.
		EXEC Multi_Insert;

		-- Normal case again
		INSERT INTO Tee VALUES(5);
	COMMIT TRANSACTION
END TRY
BEGIN CATCH
	PRINT 'ERR [SP_Complex]: ' + ERROR_MESSAGE();
	IF (@@TRANCOUNT > 0)
		ROLLBACK TRANSACTION
	PRINT 'Rolled back. Transactions: ' + Convert(varchar, @@TRANCOUNT);
END CATCH
GO

-- Clean slate.
DELETE FROM Tee
GO

-- Execute complex statements.
--EXEC Multi_Insert;
EXEC SP_Complex;
GO

-- Check the results
SELECT * FROM Tee
GO

Conclusion

Error handling is often tricky and full of gotcha’s. We should be careful not to make any unwarranted assumptions, especially when working cross-languages. The error handling, and more importantly transactions, in SQL don’t necessarily behave like your favorite programming language. By carefully reading documentation, experimenting and adding traces we can get insight and develop robust solutions.

However the above isn’t the last word on error handling or transactions. Your needs might be different and your use-cases might dictate other priorities. Perhaps my approach misses some subtle case that you can’t afford to ignore. Or may be this is good enough for a boilerplate template. Either way, comments, corrections and suggestions are more than welcome.

Share
 

Databases are as ubiquitous as computers. They are often, erroneously, attributed to data-intensive, long-term storage solutions. Yet in reality they are used in one shape or form in most programs. From word processors to media players to web-browsers. All depend on databases to realize some of their features. This, not mentioning, e-commerce, flight booking, libraries and, of course, government intelligence data. Even when no feature requires a database per-se, user configuration and settings are often stored in databases. Where previously Windows programs depended on the Registry and Unix programs on plain-text files, increasingly new products are utilizing the portable and vastly superior SQLite instead.

Image via Wikipedia

Each application of course has its own set of requirements, use-cases and patterns. Yet, surprisingly or otherwise, there are two main patterns that can be said to parent the rest: OLTP and OLAP. Or, Online Transaction Processing and Online Analytical Processing respectively. Consequently, database design and optimization depends heavily on our processing needs and models. (Here I’m not concerned with database engine design and architecture, rather on the use-case patterns and processing models.) To get a better feel of each, let’s consider typical cases for each.

Online Transaction Processing (OLTP)

This type of applications are chiefly characterized by performing real-time, localized, mission-critical operations. That is, the operations concerned must complete virtually immediately (hence real-time processing,) they typically involve a very small number of entries independent of the remaining data (hence localized processing,) and there is zero-tolerance for data inconsistency, corruption or undefined states. The latter property is what requires transactional processing such that either the operation completely succeeds with all its sub-parts or it completely fails restoring the previous state of all involved entries.

In OLTP the operations are mostly that of Insert and Update, with occasional Delete operations for housekeeping, although typically deletion is differed for performance and/or for later data mining and reporting. The tables involved are typically normalized around the data entities.

Examples include online shopping, flight and medical data and custom data processing.

Online Analytical Processing (OLAP)

In contrast to OLTP, this type of applications are characterized by performing offline, aggregate and low-priority operations. The operations involved here can be executed at low-demand hours, often on archive databases and secondary servers (hence offline processing,) they typically aggregate a very large amount of data to generate statistical data, reports or to find outliers (hence aggregation,) and, since they are offline processing, they are designed to have low-priority, low isolation level (read uncommitted) and, since there is little or no side-effects to failing, they are designed to fail rather than potentially interfere with competing OLTP operations (if executed on the same database.)

OLAP operations are mostly Select operations with virtually no data modification (except for storing the results, if at all, typically in dedicated tables.) These operations not only aggregate large number of entries, with heavy use of aggregate functions, but they typically join a significant number of tables.

Examples include reporting, billing, statistical analysis and historic data processing.

Similarities and Differences

Since the differences between these two patterns lies in the underlying database operations they perform, it’s no surprise that there aren’t a multitude of basic patterns. The two main operation types are that of read and modify. That is, the applications that mostly read data (OLAP) typically perform massive read operations potentially across the complete data with few or no modification, and those that depend on heavy data modification care most about the integrity of the few entries they modify with little or no interest in other data.

However, one must make note of the fact that modification implies reading. This is crucial for correctly appreciating the performance characteristics of the two patterns. The three data-modifying operations, Insert, Update and Delete, all ought first perform Select operation(s) in one form or another. While this is more apparent for Update and Delete, Inserts with constraints must first validate that no constraint is violated. This often involves a lookup operation in the constraint index(es). Only in the most basic and simplest cases could an Insert operation be free of all lookup operations (when foreign key, unique or other constraints are completely absent.)

This doesn’t mean that OLTP operations are a superset of OLAP operations. To try and optimize first and foremost for OLAP with the hope that OLTP operations would naturally also run faster, thanks to the improved read operations that it could utilize, is a fallacy. The two are worlds apart and typically don’t have overlapping indexable data (more on this below.) Where they do share fields, indexes on them would most certainly be welcome to both. Unfortunately, that doesn’t materialize nearly as much as one would like.

Typical properties of OLTP vs. OLAP
Transactional ProcessingAnalytical Processing
DataCurrent and in-progress.Retired, historic or archived.
Typical OperationsInsert, Update, Delete.Select.
Function TypesBased on business requirements.Based on data mining.
NormalizationHighly normalized, entity modeled.Possibly denormalized to minimize joins.
Data LocalityFew tables, few entries.Large aggregation across many tables.
Indexing StrategyLimited to a few highly-selective fields.Generous indexing depending on the queries.
Operation DurationTypically very short, sub-millisecond range.Extended over minutes and hours.
Caching FriendlinessHighly volatile data with strict persistence requirements.Static data, very cache friendly.
Backup RequirementInconsistency and data loss may cost business.All operations can be rerun, backup is redundant.
Query ComplexityTrivial to simple.Complex to inhumane and unreadable.

Hybrids

The above descriptions are certainly not the only possibilities. A combination of both is more reasonable to expect in practice rather than the theoretical and somewhat idealized cases given above. Some such case might include functionality that must aggregate several groups of data without suffering phantom reads. Such a requirement means that not only the aggregate query must run with a high isolation level, reducing parallelism, but that it must also add indexes to finish as fast as possible to free up the entries that concurrent transactions might require. The additional indexes would incur unwelcome cost for data modification (which must also update the indexes as well.)

Optimization

Due to the differences in the operations involved, and indeed their requirements, optimization in each case is different than the other.

OLTP

To maximize performance, OLTP operations would use highly selective conditions on normalized tables in queries that are completely indexed (rather than partially indexed). This will get us to the target data as fast as possible, or, in case of insertion, will verify the absence of collisions equally quickly. Typically the primary key is used, which itself may be an alias of the internal row-id.

Next we’d want to localize locks as much as possible to maximize concurrency and minimize collateral impact. Fine grained lock hints such as row-level locks may be used. Normalization helps here by further containing the impact of the transactions as local as possible. In addition, we’d want to minimize the average row-size to improve cache-hits and avoid undue penalty. Once again, Normalization does most of the work here too.

Finally, OLTP operations would want to minimize, if not avoid, indexes, triggers, views and any operation that doesn’t help it reach and update the data as fast as possible. To that end, indexes are only useful to reach the data fast. All others would simply slow down virtually all operations, save for any selects we may have.

OLAP

For OLAP on the other hand it’d be best to work with denormalized data, to minimize or avoid joins.

Unlike OLTP, it’s very hard to predict or prioritize the conditions most useful to OLAP. This is because depending on the particular query in question, the tables and fields of importance are decided. Indeed, different queries might depend on very different fields and therefore indexes. So indexing some field might be useful to only an isolated query and not others. For a billing system not only the customer and orders are of importance, but the dates of orders and the billing period as well. Similarly, to generate sales reports, the queries involved would select on products and dates. Yet an inventory audit query might depend on a different set of fields. Hence, indexes are best decided based on the concrete queries in question.

To improve performance advanced features such as Materialized Views (aka Indexed Views) may be utilized, which are unfriendly to transactional processing.

Hybrids

From the above it’s quite clear that not only do OLTP and OLAP differ in their operations, but consequently optimizing their queries are apparently in conflict with each other. The requirements for maximizing performance in each case is simply contradictory to one another. But often both types of processing are done in our products. How can we optimize them?

Optimization for both OLTP and OLAP

It must be said from the outset that optimizing for two contradictory set of requirements can be a recipe for disaster. That is, we might end up with worse performance for most queries and even degrade the database design in hope of improving performance. With hubris some might advocate indexing all fields in query conditions, creating an uninformed mess in their wake. In fact, since OLTP operations are inherently very short lived, the overhead of updating superfluous indexes will probably go unnoticed, yet a long-running OLAP operations might get a noticeable boost that the OLAP queries get. From this the gulls in question might pat themselves for a job well done, completely oblivious to the fact that the OLTP operation in question will probably run millions of times, each time incurring the slight cost of the index updates, negating the apparent speed boost. To avoid such scenarios, we must understand that a case of compromise is perhaps unavoidable and approach the problem methodically.

Here are some design patterns that are often employed to maximize performance across both types of queries.

Data Archiving

Perhaps the simplest solution is to avoid performing both OLTP and OLAP type operations on the same data-set. To achieve this, we’ll have to split the data over two sets of tables. One, highly normalized and optimized for OLTP and another, perhaps denormalized, OLAP-optimized set.

A data-retirement scheme will be used to move the retired data from the OLTP table(s) into the archive (OLAP) tables. This may be performed using a background process, or using triggers. The latter may not be desirable as triggers would add latency to the OLTP queries, however, on the other hand, the cache hits and the avoidance of extra code and scanning of the OLTP tables might be a good compromise.

Dynamic Indexing

A number of application process data in batches. Each batch is processed in transactional fashion and once all the data has been processed completely, reports and other aggregate functions are executed on the final set. Custom data processing is perhaps a good example here, where some data is processed (for example by means of transformation, conversion or other modification) and tracked by transcations in a database. The database accounts for every data item, its states as it is modified with timestamps and warning/error codes where applicable.

Such a processing scheme has the advantage of allowing for an OLTP type optimization for the data processing stage until all the data is completely processed, whereupon OLAP-optimized indexes are dynamically added. The overhead of adding or modifying the indexes (yes, some might be dropped, while new ones added) might win us an overall significant amount of CPU time. Of course we’d need to profile and benchmark thoroughly to decide which indexes to remove and which to add.

Biting the Bullet: Conjoining Both

In some cases we need to aggregate both retired and archived data as well as current and in-transit data. If this scenario is of prime importance, then archiving the data for offline processing might simply be unhelpful, as we must also join the OLTP tables to get as-yet unarchived data. Also, we can’t work in stages and perform dynamic indexing since we have to work with both current and archived data.

For obvious reasons this is the worst of both worlds. In this case, we must very carefully balance two forces:

  1. Indexes for optimum OLTP performance.
  2. Indexes for the most important OLAP queries.

Here analytic profiling of the different operations, their execution pattern and respective use-cases is necessary. Once we get a rough idea of the relative importance of our database operations, we can start collecting empirical data using benchmarks to evaluate the different approaches. Since the database design, indexes and other attributes must be shared among all database operations, we must choose the minimum set of indexes etc. that give the maximum performance across all operations.

The difficultly is of course in finding this optimum point. As each index helpful to the OLAP queries, but unnecessary to the OLTP ones, is an overhead for the high-rate transactions. On the other hand, strategic indexes could boost the performance of aggregate queries by orders of magnitude.

Here one is bound to spend a good amount of time using database profiling, statistical analysis and benchmarking. (In a future post I’ll address this topic head-on.) The downside of all this is that once the priorities of our OLAP queries change, then our particular database design will perhaps be outdated too.

Conclusion

The requirements of OLTP and OLAP are as different as are their use-cases. Optimizing for both is a very tricky business, but sometimes unavoidable. If we must maximize performance, we must invest a considerable amount of time designing a well-balanced database schema and, using extensive profiling and benchmarking tools combined with analytic methods, only then can we decide on the design that maximizes performance for our particular needs.

(Anything missing? If you have something interesting on the topic to share, please don’t keep it to yourself.)

Related articles
Share
 

Every so often a piece of technology comes along and changes everything. Once we experience this new way of doing things, we can no longer understand how we survived without it. After we sent our very first emails, walking to the post office to drop mail seemed unearthly. And who’d replace an IDE with a text-editor?

git icon, created for the Open Icon Library

Image via Wikipedia

Git1 didn’t seem the answer to my needs. I’ve been using Subversion (SVN) since 2006 and I’ve been a very happy camper indeed. Before that I used CVS and, although inexperienced with Version Control Systems (VCS), it was a major improvement over MS Source Safe (which I had used for almost 6 years before that.) I use SVN at home and at work. I’ve grown used and dependent on version control so much that I use SVN for my documents and other files, not just code. But Git? Why would I need Git?

When Git came to the scene there were already some Distributed VCS (DVCS) around (as opposed to centralized VCS, such as CVS and SVN.) But Linus made an impression with his Google Talk. I wanted to try this new piece of technology regardless of my needs. It was just too tasty to pass up. At the first opportunity, I installed the core tools and Git Extensions to ease my way with some visual feedback (I learn by reading and experimenting.)

Now that I’ve played around with Git for a while, and I’ve successfully moved some of my projects from SVN to Git, I can share my experience. Here is why I use Git even when not working with a team (where it’s infinitely more useful.)

Commit Often, Commit Many

Commits with half a dozen of -unrelated- changes is no stranger to us. A developer might add some new function, refactor another and rename an interface member all in the same change-set. This is counter-productive, because reviewing such unrelated code-change is artificially made more difficult than necessary. But, if the review unit is the commit unit, then developers combine multiple changes to reduce overhead and push them onto their colleagues. This is unfortunate, because the code should evolve in the best way possible, uninfluenced by unrelated artificial forces, such as tooling nuances. But more than reviewing, combined commits cause much headache and lost productivity when we need to go back in time and find a specific line of code, rollback or merge. But what if the changes were related? What if we need to make a few KLOCs of change for the code to even build successfully? The centralized VCS would recommend a branch. But unless the sub-project is long-term, branching is yet another overhead that developers try to avoid.

With Git, these problems are no more, thanks to local commits. With local commits, one can (and should) commit as often as possible. The change log no longer is anything more than a single sentence. The changes aren’t reflected anywhere, until we decide to push the changes onto the server. There is no longer a distinction between major changes and minor changes. All changes can be subdivided as much as necessary. No longer does one need to do local backups2, create personal branches or make every change visible company-wide or publically. Local commits are full-fledged VCS that doesn’t introduce new or extra work. When we’re done, we just update the repository in one push command.

If you need to keep some piece of code around, but do not wish to send it for review and commit, you need to copy it somewhere. With local commits, you can indeed commit it, with relevant commit-log. In a subsequent change-set, you can delete it, with full guarantee that you can retrieve it from Git later. Since this is done locally, no one is complaining and no one needs to review it. The code will be forever preserved in the repository when we push it. Later when we resurrect it, it will be reviewed as it becomes part of the current code. Indeed, with local commits, you can experiment with much freedom, with both the advantage of version-control and the subsequent repository preservation of your bits for posterity.

Notice that all this applies equally-well to private projects, single-developer public projects and multi-developer projects. The organizational advantages are only more valuable the more the participants.

Easy Merging

Even with local commits, sooner or later we’ll need to branch off and work on a parallel line of code. And if our project is useful to anyone, the branches will diverge faster than you can checkout. Merging code is the currency of branching. Anyone who’s tried merging should know this is more often than not painful. This is typically because what’s being merged are the tips/heads of the branches in question. These two incarnations of our code are increasingly more difficult to reconcile the more changes they had experienced in their separated lives.

But any VCS by definition has full history, which can be leveraged to improve merging. So why is this a Git advantage? Git has two things going for it. First and foremost, it has full history locally. That’s right. Your working-copy (WC) is not a copy of what you checked-out, rather it’s a clone of the repository. So while centralized VCS can take advantage of the repository’s history, for Git this information is readily in your WC. The second is that with local commits, the commit unit is typically very small, this helps merging quite a bit, as it can have higher confidence regarding where the lines moved and what was changed into what.

Overall, merging with Git is otherworldly. So far, no centralized VCS can even match the accuracy of Git’s merge output.

Explicit Exclusion

With Source Safe, CVS and SVN it’s not rare to get broken builds because of missing files. After some point in a project’s life, adding new files takes a sporadic pattern. It’s common to forget to add the new files under the VCS, only to be reminded by colleagues and broken build emails to the humiliation of the developer who missed the files, of course. If reviews are mandatory, then fixing this error involves at least another developer, who need to sign-off the new patch for committing.

This problem arises from the fact that with these traditional, centralized VCSs, files are excluded implicitly (by default) and they are opted-in when necessary. With Git, the opposite is the case: everything under the root is included by default, exclusion is the exception. This sounds very trivial, but the consequences are anything but. Not only does this save time and avoid embarrassing mistakes, but it’s also more natural. Virtually always a file within the project tree is a file necessary for the project. The exceptions are few indeed. If you think about it, most of the exceptions are files generated by tools. These are excluded by file extension and folder names in the ignore file (.gitignore for Git.) Rarely do we add any files that shouldn’t be stored and tracked by the VCS. If it’s not automatically generated during build, then it should be in the repository.

Conclusion

Git is a paradigm shift in version-control. It’s not just a set of new features, it’s a different way of organizing change-sets, and by extension writing code. Git gives us better automation and tooling, at the same time it encourages us to employ healthy and useful practices. In fact, the features outlined above, do make a good use of the distributed architecture of Git. So it’s not a coincidence that it’s so much useful even for the single-user project.

If you’re using SVN, consider copying it over to Git using git-svn and playing around. Git can synchronize with the SVN, until you decide to abandon one or the other. In addition, GitHub is a great online repository. As a learning tool, consider forking any of the countless projects and play around.

1 Git has no exclusive monopoly on the discussed advantages, however I’m reviewing my experience with Git in particular. Hg, Bazaar and others will have to wait for another time.
2 Here I’m concerned with code back up that we don’t want to discard yet, but don’t want to commit either. Data backup is still necessary.

Share
QR Code Business Card