Ashod Nakashian

Jul 04 2011

The BBC just released 60 years’ worth of the Reith Lectures. Since 1948, each year (except ’77 and ’92) a prominent speaker has been invited to deliver a series of lectures on a relevant and debated topic of the time. The first year’s lecturer was none other than Bertrand Russell, who gave 6 lectures on Authority and the Individual.

The series isn’t unlike the Messenger Lectures of Cornell University, although the latter apparently doesn’t make the lectures publicly available. That is, save for the great lectures delivered by Richard Feynman in 1964, which Microsoft restored to showcase their Silverlight technology and its video features. Project Tuva, as Microsoft calls it, refers to the attempt by Feynman and his longtime friend Ralph Leighton to travel to Tuva. The project, which the two friends dubbed Tuva or Bust, is documented in Ralph’s book of the same name.

There is a wealth of historic and once-in-a-lifetime lectures and public appearances by eminent figures archived away, collecting dust. The BBC isn’t the first to make freely available what only becomes more useful and valuable the more it proliferates. The topics of the 20th century are by and large the topics of the 21st. This isn’t simply because our most pressing issues have their backdrops in the previous century, but also because most issues are fundamentally the same.

Even in the case of Project Tuva, a commercial institution chose to promote and advertise its product by restoring and releasing to the public what could otherwise have been buried by time among discarded tapes and equipment. It isn’t at all important how the message is conveyed, so long as it reaches our ears and minds. More institutions, organizations and governments should sponsor similar efforts. In fact, donations to start a new web-based series should be well worth the effort. What used to be highly costly to make publicly available now costs only fractions of cents per person to download from across the globe. Indeed, utilizing peer-based distribution networks such as BitTorrent, the cost could drop to near zero (on average.) TED is perhaps the best example of a similar model, although they rent a real venue with a rather elaborate and fancy stage. At TED the social aspect, enjoyed by a lucky (and wealthy) few, is as important as the ideas shared. But the more the better.

The Reith Lectures are available for download, some including transcripts as well. The list features names from all fields. Most notable are physicist Robert Oppenheimer (1953,) geneticist Steve Jones (1993,) neuroscientist Vilayanur Ramachandran (2003,) and astronomer Martin Rees (2010.) I’m very happy to report that it reads ‘Indefinitely’ next to the availability tag.

Jul 02 2011

A customer noticed a sudden and sharp increase in the database disk consumption. The alarm that went off was the low disk-space monitor. Apparently the database in question had left only a few spare GBs on disk. The concerned customer opened a ticket asking two questions: is the database growth normal and expected, and what is the average physical storage requirement per row?

The answer to the first question had to do with their particular actions, which was case specific. However, to answer the second, one either has to keep track of each table’s schema, adding up the typical/maximum size of each field and calculating indexes and their sizes, or one could simply do the math on a typical dataset using SQL code. Obviously the latter is simpler and preferred.

Google returns quite a number of results (1, 2, 3, 4 and 5.) For MS SQL, it seems that virtually all rely on the sp_spaceused stored proc. SQL Server has an undocumented sproc, sp_msforeachtable, which runs over each table in the database and executes a given command, passing each table’s name as a parameter. While it isn’t at all difficult to do this manually (looping over sys.Tables is hardly a feat,) calling this one-liner is still very convenient. So it’s no surprise that virtually all samples online do just that.

Here is an sproc that prints the total database size, reserved size, data size, index size and unused sizes. In addition, the sproc prints the same numbers for each table with the total number of rows in all tables at the end.

My prime interest wasn’t just to learn about the database size, which can be achieved using sp_spaceused without any params, nor to just learn about each table’s share, which can be done by passing the table name in question to sp_spaceused. My main purpose was to get a breakdown of the average row-size per table.

So, here is a similar script that does exactly that. The script first updates the page and row counts for the whole database (which may take a long time, so disable this on production databases.) In addition, it calculates the totals and averages of each data-point over all tables, as well as the average data size (data + index) and wasted bytes (reserved + unused) per row for each table. All the per-table information is then printed via a single join statement to return one rowset with all the relevant data.

-- Copyright (c) 2011, Ashod Nakashian
-- All rights reserved.
-- 
-- Redistribution and use in source and binary forms, with or without modification,
-- are permitted provided that the following conditions are met:
-- 
-- o Redistributions of source code must retain the above copyright notice, 
-- this list of conditions and the following disclaimer.
-- o Redistributions in binary form must reproduce the above copyright notice, 
-- this list of conditions and the following disclaimer in the documentation and/or
-- other materials provided with the distribution.
-- o Neither the name of the author nor the names of its contributors may be used to endorse
-- or promote products derived from this software without specific prior written permission.
--
-- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
-- EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
-- OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT 
-- SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 
-- INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-- PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
-- INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
-- LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-- OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--
-- Show physical size statistics for each table in the database.
--
SET NOCOUNT ON

-- Update all page and count stats.
-- Comment for large tables on production!
DBCC UPDATEUSAGE(0) 

-- Total DB size.
EXEC sp_spaceused

-- Per-table statistics.
DECLARE @t TABLE
( 
    [name] NVARCHAR(128),
    [rows] BIGINT, 
    [reserved] VARCHAR(18), 
    [data] VARCHAR(18), 
    [index_size] VARCHAR(18),
    [unused] VARCHAR(18)
)

-- Collect per-table data in @t.
INSERT @t EXEC sp_msForEachTable 'EXEC sp_spaceused ''?'''

-- Calculate the averages and totals.
INSERT INTO @t
SELECT 'Average', AVG(rows),
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([reserved], 0, LEN([reserved]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([data], 0, LEN([data]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([index_size], 0, LEN([index_size]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), AVG(CAST(SUBSTRING([unused], 0, LEN([unused]) - 1) AS int))) + ' KB'
FROM   @t
UNION ALL
SELECT 'Total', SUM(rows),
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([reserved], 0, LEN([reserved]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([data], 0, LEN([data]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([index_size], 0, LEN([index_size]) - 1) AS int))) + ' KB',
    CONVERT(varchar(18), SUM(CAST(SUBSTRING([unused], 0, LEN([unused]) - 1) AS int))) + ' KB'
FROM   @t

-- Holds per-row average kbytes.
DECLARE @avg TABLE
( 
    [name] NVARCHAR(128),
    [data_per_row] VARCHAR(18),
    [waste_per_row] VARCHAR(18)
)

-- Calculate the per-row average data in kbytes.
INSERT INTO @avg
SELECT t.name, 
    CONVERT(varchar(18),
        CONVERT(decimal(20, 2),
            (CAST(SUBSTRING(t.[data], 0, LEN(t.[data]) - 1) AS float) +
             CAST(SUBSTRING(t.[index_size], 0, LEN(t.[index_size]) - 1) AS float)) 
            / NULLIF([rows], 0))) + ' KB', 
    CONVERT(varchar(18),
        CONVERT(decimal(20, 2),
            (CAST(SUBSTRING(t.[reserved], 0, LEN(t.[reserved]) - 1) AS float) +
             CAST(SUBSTRING(t.[unused], 0, LEN(t.[unused]) - 1) AS float))
            / NULLIF([rows], 0))) + ' KB'
FROM @t t

-- Join the two tables using the table names.
SELECT t.name, t.rows, t.reserved, t.data, t.index_size, t.unused, a.data_per_row, a.waste_per_row
FROM @t t, @avg a
WHERE t.name = a.name

There is quite a bit of data conversion and casting that isn’t necessarily very performant, but here there isn’t much choice and optimizing further is probably unnecessary. But since there are so many different ways to get the same output, I’ll leave any variations up to the readers. Suggestions and improvements are more than welcome. Please use comments to share your alternatives.

This may easily be wrapped in an sproc for convenience. I hope you find it useful and handy.
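
For those who want the wrapper, here is a minimal sketch (the procedure name is hypothetical; paste the body of the script above in place of the placeholder comment):

-- Wrap the statistics script above in a stored procedure for easy reuse.
CREATE PROCEDURE dbo.usp_TableSizeStats
AS
BEGIN
    SET NOCOUNT ON
    -- Body of the script above goes here
    -- (consider leaving DBCC UPDATEUSAGE out on production databases).
    EXEC sp_spaceused
END
GO

-- Then simply:
EXEC dbo.usp_TableSizeStats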

Jun 26 2011

Every so often a piece of technology comes along and changes everything. Once we experience this new way of doing things, we can no longer understand how we survived without it. After we sent our very first emails, walking to the post office to drop mail seemed unearthly. And who’d replace an IDE with a text-editor?

git icon, created for the Open Icon Library (image via Wikipedia)

Git[1] didn’t seem the answer to my needs. I’ve been using Subversion (SVN) since 2006 and I’ve been a very happy camper indeed. Before that I used CVS which, although I was inexperienced with Version Control Systems (VCS), was a major improvement over MS Source Safe (which I had used for almost 6 years before that.) I use SVN at home and at work. I’ve grown so used to and dependent on version control that I use SVN for my documents and other files, not just code. But Git? Why would I need Git?

When Git came onto the scene there were already some Distributed VCS (DVCS) around (as opposed to centralized VCS, such as CVS and SVN.) But Linus made an impression with his Google Talk. I wanted to try this new piece of technology regardless of my needs. It was just too tasty to pass up. At the first opportunity, I installed the core tools and Git Extensions to ease my way with some visual feedback (I learn by reading and experimenting.)

Now that I’ve played around with Git for a while, and I’ve successfully moved some of my projects from SVN to Git, I can share my experience. Here is why I use Git even when not working with a team (where it’s infinitely more useful.)

Commit Often, Commit Many

Commits with half a dozen -unrelated- changes are no strangers to us. A developer might add a new function, refactor another and rename an interface member, all in the same change-set. This is counter-productive, because reviewing such unrelated code-changes is made artificially more difficult than necessary. But if the review unit is the commit unit, then developers combine multiple changes to reduce overhead and push them onto their colleagues. This is unfortunate, because the code should evolve in the best way possible, uninfluenced by unrelated artificial forces such as tooling nuances. But beyond reviewing, combined commits cause much headache and lost productivity when we need to go back in time to find a specific line of code, roll back or merge. And what if the changes were related? What if we need to make a few KLOCs of change for the code to even build successfully? The centralized VCS would recommend a branch. But unless the sub-project is long-term, branching is yet another overhead that developers try to avoid.

With Git, these problems are no more, thanks to local commits. With local commits, one can (and should) commit as often as possible. The change log need no longer be anything more than a single sentence. The changes aren’t reflected anywhere else until we decide to push them onto the server. There is no longer a distinction between major changes and minor changes. All changes can be subdivided as much as necessary. No longer does one need to keep local backups[2], create personal branches or make every change visible company-wide or publicly. Local commits give you a full-fledged VCS that doesn’t introduce new or extra work. When we’re done, we just update the repository in one push command.
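
For illustration, a typical local-commit session might look something like this (file names and commit messages are made up); nothing leaves your machine until the final push:

git add -p                      # stage one small, self-contained change
git commit -m "Extract cache invalidation into its own function"
git add Parser.cs
git commit -m "Rename ParseEx to ParseExpression"
# ...keep committing locally, as often as needed...
git push origin master          # publish the whole series in one command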

If you need to keep some piece of code around, but do not wish to send it for review and commit it, you’d normally have to copy it somewhere. With local commits, you can indeed commit it, with a relevant commit log. In a subsequent change-set, you can delete it, with a full guarantee that you can retrieve it from Git later. Since this is done locally, no one complains and no one needs to review it. The code will be preserved in the repository forever once we push. Later, when we resurrect it, it will be reviewed as it becomes part of the current code. Indeed, with local commits you can experiment with much freedom, with both the advantages of version control and the subsequent preservation of your bits in the repository for posterity.

Notice that all this applies equally well to private projects, single-developer public projects and multi-developer projects. The organizational advantages only become more valuable the more participants there are.

Easy Merging

Even with local commits, sooner or later we’ll need to branch off and work on a parallel line of code. And if our project is useful to anyone, the branches will diverge faster than you can check out. Merging code is the currency of branching. Anyone who’s tried merging knows it is more often than not painful. This is typically because what’s being merged are the tips/heads of the branches in question. These two incarnations of our code become increasingly difficult to reconcile the more changes they have experienced in their separate lives.

But any VCS by definition has full history, which can be leveraged to improve merging. So why is this a Git advantage? Git has two things going for it. First and foremost, it has the full history locally. That’s right: your working-copy (WC) is not a copy of what you checked out, it’s a clone of the repository. So while a centralized VCS can take advantage of the repository’s history, with Git this information is readily in your WC. The second is that with local commits the commit unit is typically very small; this helps merging quite a bit, as the merge can have higher confidence regarding where lines moved and what was changed into what.

Overall, merging with Git is otherworldly. So far, no centralized VCS can even match the accuracy of Git’s merge output.
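
As a sketch, a local branch-and-merge round trip is as simple as this (the branch name is hypothetical):

git checkout -b fix-session-timeout    # branch locally
# ...a series of small local commits...
git checkout master
git merge fix-session-timeout          # merge with the full repository history at hand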

Explicit Exclusion

With Source Safe, CVS and SVN it’s not rare to get broken builds because of missing files. After some point in a project’s life, adding new files becomes sporadic. It’s common to forget to add the new files to the VCS, only to be reminded by colleagues and broken-build emails, to the humiliation of the developer who missed the files, of course. If reviews are mandatory, then fixing this error involves at least one other developer, who needs to sign off on the new patch for committing.

This problem arises from the fact that with these traditional, centralized VCSs, files are excluded implicitly (by default) and opted in when necessary. With Git, the opposite is the case: everything under the root is included by default, and exclusion is the exception. This sounds very trivial, but the consequences are anything but. Not only does this save time and avoid embarrassing mistakes, it’s also more natural. Virtually always, a file within the project tree is a file necessary for the project. The exceptions are few indeed. If you think about it, most of the exceptions are files generated by tools. These are excluded by file extension and folder name in the ignore file (.gitignore for Git.) Rarely do we add files that shouldn’t be stored and tracked by the VCS. If it’s not automatically generated during the build, then it should be in the repository.
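
For example, a typical ignore file is short; the patterns below are just common tool-generated artifacts and will vary per project:

# .gitignore -- exclusion is the exception
*.o
*.obj
*.pdb
*.suo
*.user
bin/
obj/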

Conclusion

Git is a paradigm shift in version control. It’s not just a set of new features, it’s a different way of organizing change-sets, and by extension of writing code. Git gives us better automation and tooling; at the same time, it encourages us to employ healthy and useful practices. In fact, the features outlined above make good use of Git’s distributed architecture. So it’s no coincidence that it’s so useful even for a single-developer project.

If you’re using SVN, consider copying your repository over to Git using git-svn and playing around. Git can synchronize with the SVN repository until you decide to abandon one or the other. In addition, GitHub is a great online repository. As a learning tool, consider forking any of the countless projects and playing around.
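
A minimal git-svn round trip, assuming a standard-layout SVN repository at a hypothetical URL, looks like this:

git svn clone --stdlayout http://svn.example.com/myproject
cd myproject
# ...work and commit locally as usual...
git svn rebase     # pull in new SVN revisions
git svn dcommit    # replay local commits back onto the SVN trunk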

[1] Git has no exclusive monopoly on the discussed advantages; however, I’m reviewing my experience with Git in particular. Hg, Bazaar and others will have to wait for another time.
[2] Here I’m concerned with backing up code that we don’t want to discard yet, but don’t want to commit either. Data backup is still necessary.

Jun 23 2011

We live in a time when communication is ever more effortless and taken for granted. So much so that audiences are impatient to get to the point and authors need to say more in less.

I learned this the hard way. My most recent article, which weighed in at ~2200 words, was quickly buried when submitted to a social site. I could tell the length had something to do with it. I wasted no time; I shredded two-thirds of the article and came up with an abridged version. At 800+ words, at least one person still complained that it wasn’t abridged enough. Yet, where the original got fewer than 40 views, the abridged version got over 2000 hits in the first day and was translated into Japanese.

This is very unfortunate. At one extreme, one should just state their conclusions as tersely as possible; at the other, one should write a book-load to make well-founded arguments. The latter is for when the topic you’re trying to tackle is complicated, controversial, highly misunderstood or all of the above. You have little choice but to go to length stating where you’re coming from and where your arguments lead. What about the other extreme? When can or should one be terse? Hard to say, but one thing is for sure: being concise and articulate is exceedingly difficult.

Yet, fortunately, there are those who appreciate a well fleshed-out article. The same article, unabridged, seems to have made the front page of DZone.com, where it was republished, and from there over 600 hits followed to this site (2300 more on DZone.)

But how long is too long? It turns out it depends on the subject and the target audience. On the web, I suspect most readers typically want to get the gist in under 400 words. Should every lengthy article get an abridged version? Probably not. But if one wants to be heard, one should be mindful of their target audience. You can have the most insightful things to say, yet if no one has the patience to listen, then you might as well do something different… or rather, do it differently.

Jun 18 2011

I realize that the original article of the same title was longer than what most would like to read. So here is an abridged version.

By now everybody and their screensaver have heard the Optimization Mantra: Don’t Do It! This is commonly wrapped in a three-rule package. The first two rules are copies of the mantra, and the third adds the wise word “Yet” to the one-and-only true rule and addresses it to the “expert.”

Premature optimization: most of us have been there. And that’s what makes those words very familiar. Words of wisdom, if you will. We’ve all decided to do a smart trick or two before fleshing out the algorithm and even checking if it compiles, let alone checking the result, only to be dumbfounded by the output. I can figure it out! we declare… and after half a day, we’d be damned if we rewrote that stupid function from scratch. No chance, bub.

The rules are sound. No doubt. Another rule of optimization, when the time comes, is to use profilers and never, ever, make costly assumptions. And any assumption is probably costly. That, too, is sound. These are words of wisdom, no doubt. But taken at face value, they could cause some harm.

In all but the smallest projects one must use a profiler, consult with others and especially talk with module owners, veteran developers and the architects before making any changes. The change-set must be planned, designed and well managed. The larger the project, the more this stage becomes important. No funny tricks, please.

Efficient Code != Premature Optimization

The traditional wisdom tells us to avoid premature optimization and when absolutely necessary, we should first use a profiler. But both of these can also be interpreted as follows: it’s OK to write inefficient and bloated code, and when necessary, we’ll see what the profiler comes up with.

Performance as an afterthought is very costly. Extremely so. But the alternative isn’t premature optimization. There is a very thin line between the well-thought-out and designed code that you’d expect a professional to output and student toy-project style coding. The latter focuses on getting the problem-of-the-moment solved, without any regard to error handling or performance or indeed maintenance.

It’s not premature optimization to use a dictionary/map instead of a list or array if reading is more common. It’s not premature optimization to use an O(n) algorithm (if not an O(log₂ n) one) instead of the O(n²) algorithm, when it isn’t much more complicated than what we’d otherwise use. Similarly, moving invariant data outside a loop isn’t premature optimization.

As much as I’d hate to have a pretentious show-off in my team, who’d go around “optimizing” code by making wild guesses and random changes, without running a profiler or talking with their colleagues, I’d hate it even more if the team spent their time cleaning up after one another. It’s easy to write code without thinking more than a single step ahead. It’s easy to type some code, run it, add random trace logs (instead of properly debugging,) augment the code, run again, and repeat until the correct output is observed. As dull and dead-boring as that is.

I’m not suggesting that the extreme worst-case I’ve described is the norm (although you’d be surprised to learn just how common it is.) My point is that there is a golden mean between “premature optimization” and “garbage coding.”

The Cost of Change

It’s well documented that the cost of change increases exponentially the later a project is in its development cycle. (See for example Code Complete.) This cost is sometimes overlooked, thanks to the Rule of Optimization. The rule highly discourages thinking about performance, when one should at least give it good thought when designing.

This doesn’t suggest optimization-oriented development. Rather, having a conscious grasp of the performance implications can avoid a lot of painful change down the road. As we’ve already iterated, designing and writing efficient code doesn’t necessarily mean premature optimization. It just means we’re responsible and we are balancing the cost by investing a little early and avoiding a high cost in the future. For a real-life example see Robert O’Callahan’s post.

Conclusion

Premature optimization is a major trap. The wisdom of the community tells us to avoid experimenting on our production code and to postpone optimization as much as possible. Only when the code is mature, and only when necessary, should we, with the aid of a profiler, identify the hot-spots and then, and only then, very carefully optimize the code.

This strategy encourages developers to come up with inefficient, thoughtless and -often- outright ugly code. All in the name of avoiding premature optimization. Furthermore, it incorrectly assumes that profiling is a magic solution to improving performance.

There is no excuse for writing inefficient code if the alternative is available at little or no cost. There is no excuse for not thinking the algorithm through ahead of typing. No excuse for leaving old experimental bits and pieces around because we might need them later, or because we’ll clean up later when we optimize. The cost of poorly designed, badly performing code is very high.

Let’s optimize later, but let’s write efficient code, not optimum, just efficient, from the get-go.

Jun 17 2011
Fig. 4. Illustration of the constrained optimi... (image via Wikipedia)

Abridged version here.

By now everybody and their screensaver have heard the Optimization Mantra: Don’t Do It! This is commonly wrapped in a three-rule package (I suspect there is a word for that). The first two rules are copies of the mantra, and the third adds the wise word “Yet” to the one-and-only true rule and addresses it to the “expert.” I suspect that originally the middle “rule” didn’t exist and that it was later added for effect, and perhaps to get the total to the magic number of three.

I can almost imagine Knuth after figuring out a single-character bug in a bunch of code, with coffee mugs and burger wraps (or whatever it was that was popular in the ’60s) scattered around the desk… eyes bloodshot, sleepless and edgy, declaring “Premature optimization is the root of all evil.” (But in my mind he uses more graphic synonyms for ‘evil’.)

Knuth later attributed that bit about premature optimization to Tony Hoare (the author of QuickSort), thereby distorting my mental image of young Knuth swearing as he fixed his code, only later to be saved by Hoare himself, who apparently doesn’t remember uttering or coining such words. (Somebody’s got a bad memory… maybe more than one.)

Smart Aleck

Premature optimization: most of us have been there. And that’s what makes those words very familiar. Words of wisdom, if you will. We’ve all decided to do a smart trick or two before fleshing out the algorithm and even checking if it compiles, let alone checking the result, only to be dumbfounded by the output. I can figure it out! we declare… and after half a day, we’d be damned if we rewrote that stupid function from scratch. No chance, bub.

Probably the smarter amongst us would learn from the experience of such dogged persistence and avoid trickery the next time around. While few would admit to the less-intelligent decisions they took in the past, at least some will have learned a lesson or two when the next opportunity knocked.

The aforementioned trickery doesn’t have to be optimization trickery, mind you. Some people (read: everyone) like to be a smart-ass every so often and show off. Sure, many end up shooting themselves in the foot and making fools of themselves. But that doesn’t stop the kids from doing a crazy jump while waving to their friends, iPod on and eating crackers, just to impress someone… who typically turns around precisely when they shouldn’t. (Did I mention skateboards?)

The rules are sound. No doubt. Another rule of optimization, when the time comes, is to use profilers and never, ever, make costly assumptions. And any assumption is probably costly. That, too, is sound. These are words of wisdom, no doubt. But taken at face value, they could cause some harm.

Let’s take it from the top. Leaving aside the smaller projects we might have started and for years tended and developed, most projects involve multiple developers and typically span generations of them. They are legacy projects, in the sense of having a long and rich history. No one person can tell you this history, let alone describe all parts of the code. On such a project, if performance is an issue, you shouldn’t go about shooting in the dark and spending days or even weeks on your hunches. Such an approach will not only waste time, it will add a lot of noise and pollute the code-base and source repository (if you commit to the trunk, which you should never do until done and ready to merge.)

In such a case, one must use a profiler, consult with others and especially talk with module owners, veteran developers and the architects before making any changes. The change-set must be planned, designed and well managed. The larger the project, the more this stage becomes important. No funny tricks, please.

Efficient Code != Premature Optimization

Among the standard interview questions we often ask (and get asked) are those on data-structures and algorithms (discrete math and information theory.) I typically ask candidates to compare the data-structures in terms of performance, which should cover both internal details and complexity characteristics (big O). It’s also a good opportunity to see how organized their thoughts are. We use arrays, lists and maps/dictionaries quite often, and not having a good grasp of their essence is a shortcoming. As a follow-up to this I typically ask how they decide which to use. This isn’t an easy question, I realize. Some things are hard to put into words, even when we have a good understanding of them in our minds. But interviews aren’t meant to be easy.

The worst answer I ever got was “I use List, because that’s what I get.” To which I had to ask “Where?” Apparently, the candidate worked on a legacy project that used Lists almost exclusively and, bizarrely, she never had a need for anything else. The best answer typically gives a description of the use-case. That is, the candidate describes the algorithm they’ll implement and, from that, decides which container to use.

The best answer isn’t merely a more detailed or technical answer. Not just. It’s the best answer because it’s the only answer that gives you reasons. The candidate must have thought about the problem and decided on an algorithm to use, they must’ve quantified the complexity of the algorithm (big O) and they must’ve known the performance characteristics of the different containers for the operations their algorithm needs. They have thought about the problem and their solution thoroughly before choosing containers, designing classes and whatnot.

The traditional wisdom tells us to avoid premature optimization and when absolutely necessary, we should first use a profiler. But both of these can also be interpreted as follows: it’s OK to write inefficient and bloated code, and when necessary, we’ll see what the profiler comes up with.

Performance as an afterthought is very costly. Extremely so. But I don’t recommend premature optimization. There is a very thin line between the well-thought-out and designed code that you’d expect a professional to output and student toy-project style coding. The latter focuses on getting the problem-of-the-moment solved, without any regard to error handling or performance or indeed maintenance. We’ve all done it: multiple similar functions; the same function reused for different purposes with too many responsibilities; unreasonable resource consumption and horrible performance characteristics that the author is probably oblivious to. And so on.

It’s not premature optimization to use a dictionary/map instead of a list or the most common container in your language of choice. Not when we have to read items most of the time. It’s not premature optimization if we use an O(n) algorithm (if not an O(log₂ n) one) instead of the O(n²) algorithm, when it isn’t much more complicated than what we’d otherwise use. It’s not premature optimization if we refactor a function so it doesn’t handle multiple unrelated cases. Similarly, moving invariant data outside a loop isn’t premature optimization. Nor is caching the results of a very complex calculation that we don’t need to redo.
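
To make the container point concrete, here is a small Java sketch (the class and field names are made up): a read-mostly lookup done by scanning a list on every read, versus building a map once and reading from it in constant time.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Catalog {
    static class Item { String sku; double price; }

    // Read-mostly lookup done the lazy way: O(n) per read, scanning the whole list.
    static double priceOfSlow(List<Item> items, String sku) {
        for (Item it : items)
            if (it.sku.equals(sku))
                return it.price;
        return 0.0;
    }

    // Build an index once (O(n)), then every read is O(1) on average.
    static Map<String, Double> indexBySku(List<Item> items) {
        Map<String, Double> bySku = new HashMap<>();
        for (Item it : items)
            bySku.put(it.sku, it.price);
        return bySku;
    }
}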

Regex object construction is typically an expensive operation, due to the parsing and optimizations involved. Some dynamic languages allow for runtime compilation for further optimization. If the expression string isn’t modified, creating a new instance of this object multiple times isn’t smart. In C# this would be repeatedly creating a Regex object with RegexOptions.Compiled, and in Java having Pattern.compile() invoked behind the scenes every time matches() is called on a string. Making the object a static member is the smartest solution and hardly a premature optimization. And the list goes on.
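
A small Java sketch of the regex case (the pattern itself is illustrative): String.matches() re-parses and re-compiles the expression on every call, while a static, precompiled Pattern pays the parsing cost exactly once.

import java.util.regex.Pattern;

class SkuValidator {
    // Wasteful: String.matches() compiles the expression on every single call.
    static boolean isSkuSlow(String s) {
        return s.matches("[A-Z]{3}-\\d{4}");
    }

    // Compiled once, reused by every call; hardly premature optimization.
    private static final Pattern SKU = Pattern.compile("[A-Z]{3}-\\d{4}");

    static boolean isSku(String s) {
        return SKU.matcher(s).matches();
    }
}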

As much as I’d hate to have a pretentious show-off in my team, who’d go around “optimizing” code by making wild guesses and random changes, without running a profiler or talking with their colleagues, I’d hate it even more if the team spent their time cleaning up after one another. It’s easy to write code without thinking more than a single step ahead. It’s easy to type some code, run it, add random trace logs (instead of properly debugging,) augment the code, run again, and repeat until the correct output is observed.

I don’t know about you, but to me, writing and modifying code instead of designing and working out the algorithm beforehand is simply counter-productive. It’s not fun either. Similarly, debugging is much more interesting and engaging than adding random trace logs until we figure out what’s going on.

I’m not suggesting that the extreme worst-case I’ve described is the norm (although you’d be surprised to learn just how common it is.) My point is that there is a golden mean between “premature optimization” and “garbage coding.”

The Cost of Change

When it’s time to spend valuable resources on optimization (including the cost of buying profilers,) I don’t expect us to discover that we needed a hash-table instead of an array after all. Rather, I should expect the profiler to come up with more subtle insights; information that we couldn’t easily guess (and indeed shouldn’t need to.) I should expect the seniors in the team to have a good grasp of the performance characteristics of the project, the weak points and the limitations. Surely the profiler will give us accurate information, but unless we are in a good position to make informed and educated guesses, the profiler won’t help us much. Furthermore, understanding and analyzing the profiler’s output isn’t trivial. And if we have no clue what to change, and how our changes would affect the performance, we’ll use the profiler much like the student who prints traces of variables and repeatedly makes random changes until the output is the one expected. In short, the profiler just gives us raw data; we still have to interpret it, design a change-set and have a reasonably sound expectation of improved performance. Otherwise, profiling will be pure waste.

It’s well documented that the cost of change increases exponentially the later a project is in its development cycle. (See for example Code Complete.) This means a design issue caught during design or planning will cost next to nothing to fix. However, try to fix that design defect when you’re performing system-testing, after having most modules integrated and working, and you’ll find that the change cascades over all the subsequent stages and the work already completed.

This cost is sometimes overlooked, thanks to the Rule of Optimization. The rule highly discourages thinking about performance, when one should at least give it good thought once the business and technical requirements are finalized (as far as design is concerned) and an initial design is complete. The architecture should answer the performance question. And at every step of the development path developers must consider the consequences of their choices, algorithms and data-structures.

This doesn’t suggest optimization-oriented development. Rather, having a conscious grasp of the performance implications can avoid a lot of painful change down the road. As we’ve already iterated, designing and writing efficient code doesn’t necessarily mean premature optimization. It just means we’re responsible and we are balancing the cost by investing a little early and avoiding a high cost in the future. For a real-life example see Robert O’Callahan’s post linked above.

I know there is a camp that by this point is probably laughing at my naïve thinking. By the time I finish optimizing, or even writing efficient and clean code, they’ll say, their product would’ve shipped and the customer would’ve bought the latest and fastest hardware, which will offset their disadvantage in performance. “What’s the point?” they add. While this is partially true (and it has happened before,) given the same data, the better-performing product will still finish sooner on the same hardware. In addition, now that processors have stopped scaling vertically, code designed for concurrent scalability (horizontal scaling) will outperform even the best single-threaded algorithm. This is not to mention that data outgrows hardware any day.

Conclusion

Premature optimization is a major trap. We learn by falling, getting up, dusting off and falling again. We learn by making mistakes. The wisdom of the community tells us to avoid experimenting on our production code and to postpone optimization as much as possible. Only when the code is mature, and only when necessary, should we, with the aid of a profiler, identify the hot-spots and then, and only then, very carefully optimize the code.

This strategy encourages developers to come up with inefficient, thoughtless and -often- outright ugly code. All in the name of avoiding premature optimization. Furthermore, it incorrectly assumes that profiling is a magic solution to improving performance. It neglects to mention how involved profiling is. Those who have no clue as to why their code is bogged down won’t know what to change even if the profiler screamed out where the slowest statement is.

There is no excuse for writing inefficient code if the alternative is available at little or no cost. There is no excuse for not thinking the algorithm through ahead of typing. No excuse for leaving old experimental bits and pieces around because we might need them later, or because we’ll clean up later when we optimize. The cost of poorly designed, badly performing code is very high. It’s the least maintainable code and can have a very high cost to improve.

Let’s optimize later, but let’s write efficient code, not optimum, just efficient, from the get-go.

Jun 05 2011
Stack of books in Gould's Book Arcade, Newtown... (image via Wikipedia)

To read is to fly: it is to soar to a point of vantage which gives a view over wide terrains of history, human variety, ideas, shared experience and the fruits of many inquiries.
– A. C. Grayling

There is hardly any other activity that you can perform on your own and alone, yet simultaneously share the experience with someone else. Paradoxical as it sounds, reading is one. Books have been likened to many things, not least a good friend. And indeed a good book is at least as good as a friend, but perhaps even more. Reading a well-written book takes you on journeys across ages and worlds. But that’s not the real magic of books or reading. The magic is in seeing the world anew. Seeing the world through the eyes of a complete stranger… or maybe an old and dear friend. For a good book is well worth returning to and reading over and over.

I’ll say it here: one of the dreams that I wish to realize at an old(er) age is, having read all the books I wished to read, to reread my favorites. Very little would compete with that personal joy of having all the books I loved reading the first time to read over again. Like revisiting your childhood playground, like planning a reunion with schoolmates, I’ll look forward to reopening those old pages again.

But why reading? With all the technology we enjoy nowadays, why can’t other media completely replace reading? Of course they have, to a significant extent, replaced books and reading. But I can’t find any other form of media that can present thought as well as the written word can. That is, the best way to preserve and present thoughts is to use language. Whether spoken or written, language is the best tool we have to communicate our thoughts. And while the spoken word can give new depths to the words uttered, primarily by changing tone, volume and enunciation, writing them gives the audience many more degrees of freedom in consuming the material.

I do acknowledge that there is a whole category of concepts that we can hardly describe in words. We may choose to call these concepts the language singularities: where language as we know it breaks down. All forms of art can be said to have evolved, to lesser or greater degrees, to fill this cleft in our language. But even then, art without context is too abstract to communicate thoughts, ideas and complex concepts unambiguously. It does a great job of communicating the aspects of our thoughts that we still can’t speak or write in only (if you’d forgive the pun) so many words. Art is complementary to language, but can never replace it. Language is more precise and richer and, perhaps most importantly, can describe what can’t be. Using words, you can discuss paradoxes and other-worldly what-if scenarios. We can even talk about objects that we can’t create physically, because they’re logically impossible or because physics as we know it doesn’t allow for such objects to be. Like thinking about something being nowhere. Or a curved path being shorter than a straight one.

Give me a man or woman who has read a thousand books and you give me an interesting companion. Give me a man or woman who has read perhaps three and you give me a dangerous enemy indeed.
– Anne Rice, The Witching Hour

Books serve more than one purpose. If it weren’t for books, civilization as we know it wouldn’t exist. More accurately, I should say that if it weren’t for the written word, passing knowledge across generations would’ve been near impossible. Thanks to the scraps we inherited, we know not only what happened in the past and what some thinkers created, but also how some texts were forged, plagiarized and even distorted. We know how the powerful rewrote history. We even know why many, many texts didn’t survive. In some cases the lifetime of the then-current paper technology was as short as a hundred or maybe two hundred years. But we also know that the important texts were copied and recopied by scribes. And indeed, the heretical, competing, unapproved texts were systematically sought out and destroyed, and forever perished.

This is precisely what we lose when we don’t read. By not reading, not only do we not get to know what generations upon generations thought and did, but we also don’t get to know what there isn’t to know. That is, when we read, we know much more than what’s written; we also know what’s not written about. This may sound tautological, but it’s not. It’s easy to assume and guess, say, what the ancient Egyptians knew and could do. However, it’s a completely different thing to read what they wrote and discover there is not a single word of advanced technology beyond their age and time. It’s sobering to know what’s missing from the historical record. Granted, there have been systematic distortions by rivals left and right, and we can never know for a fact that what’s missing didn’t really exist. However, we will know that it’s missing, probably because it didn’t exist. At the very least, when we do make claims, we’ll know how much they’re backed by evidential facts, or in many cases the lack thereof. And this is why books are important. Hardly can one know anything without sharing what others claim to know.

Some hold a single book and revere it as the most important book. The first and the last. The only book worth reading. They challenge others to find anything comparable to the beauty and wisdom of their book of choice. They challenge others to come up with anything even remotely similar to the words written in their book. Invariably, I ask them: how do you know? How would you know where that single book stands without reading anything else?

While many powerful social movements were at least in part fueled by fiction (Adventures of Huckleberry Finn on racial issues for example,) the fact remains that fiction needs to maintain an entertainment aspect. This quality of being entertaining, to my perception, compromises the integrity of the material. Put differently, to know what’s factual and what’s artistic, one has to work very hard, which will probably rob the book of its fun. I prefer to read fiction for the entertainment value and artistic and cultural characteristics, but I get my info and facts from nonfiction. Indeed, I find fiction disarraying when I’m attempting lucidity.

May 30 2011

The story I’m about to tell is the worst case of leaky abstraction that I’ve encountered and had to resolve. Actually, it’s the most profound example that I know of. Profound here is used in a negative sense, of course. This isn’t one of the performance issues Joel brings up as examples of leaky abstraction. Nor is it a case of getting a network exception while working with files because the folder is now mapped to a network drive. Nor a case of getting an out-of-disk-space error while writing to a file-backed memory block. This is a case of frustration and a very hard-to-catch defect. A case of a project that came dangerously close to failing altogether; we figured it out just in time.

http-Request from browser via webserver and back (image via Wikipedia)

Background

Our flagship project worked with 3rd party servers which for technical reasons had to be installed on the same machine as our product. This was an obvious shortcoming and had to change. The requirements came in and we were asked to move the interfaces with all 3rd party servers into lightweight distributed processes. These new processes were dubbed Child processes, while the central process, well, Central. A Child would run on a different machine on the network, communicating with the Central process to get commands, execute them and return the results.

The design was straightforward: all processes would run as services listening on some configurable port. The commands were simple application-level message objects, each with its own type, serialized on the wire. The operations were synchronous and we needed no progress updates or heartbeats. We left the network design synchronous as well, for simplicity.

The developer who was assigned the communication layer had a background in web services and proposed to use the standard HTTP protocol. We thought about it and decided that while it would have some small overhead, the simplicity of reusing an existing library would be a plus. After all, HTTP has data recovery and is a standard protocol. And if we really cared about overhead, we’d use UDP, which has no duplicate detection, data recovery or even ordered packet transmission. Plus, the developer who’d work on this feature was comfortable with HTTP. So why not?

As it turns out, HTTP was the single worst decision made on this particular project.

Since we were now transmitting potentially sensitive data, the requirements were amended to include data encryption to protect our customers’ data from network sniffers. We used a standard asymmetric encryption for extra security. This meant that we had to generate a pair of public and private keys each time we connected. We devised a protocol to communicate the key the Child must have using a symmetric encryption algorithm. We were confident this was secure enough for our needs and it wasn’t overly complicated to implement.

Trouble

The project was complete when I took the product for a final round of developer white-box testing. This is something I learned to do before shipping any product: since I have the responsibility of designing the features, I also feel responsible for looking under the hood in case there is some potential mechanical or technical issue. Much like your car mechanic would do before you go on an off-road trip.

That’s when things started to fall apart. All worked fine, except every now and then I’d get errors and the Child would disconnect. Inspecting the logs showed data-encryption exceptions. The deciphering function was failing. Every single developer who ran the code verified that it worked fine without any problems whatsoever. I asked them to pay attention to this issue. They came back saying all was perfectly fine. It was only on my machine!

Mind you, I’ve learned not to assign blame before I eliminate every possible culprit. And the prime suspect is, always, a bug. A developer error. So I started sniffing around, visiting the design back and forth. Nothing made sense of the issue. The code works, almost all the time. Then it fails. Reconnect again, and it works fine… until it fails again.

Troubleshooting this issue wasn’t fun, precisely because it wasn’t fruitful. No amount of debugging helped, or, in fact, could ever help. The puzzle had to be solved by reason. Experimentation showed only as much as I had already gathered from the logs. Still, I tried different scenarios. One thing was for sure: you couldn’t tell when it would fail next.

The Leak

Remember that HTTP is a connectionless protocol. That is, it’s designed to communicate a single request and its response and disconnect. This is the typical scenario. It holds no connections and no state, therefore it has no session. On the web, sessions are realized by the HTTP server. An HTTP server would typically create some unique key upon login, or upon the first request missing a key, and it would track all subsequent requests by receiving said key either in the URL or via cookies. In any event, even though a web service may have support for sessions, the underlying protocol is still connectionless and stateless.

To improve performance, an afterthought of reusing connections was added. This is typically called Keep-Alive. The idea is that a flag is added to the HTTP header which tells the server not to close the connection immediately after responding to the request, anticipating further requests. This is reasonable, as a web page typically loads multiple images and embedded items from the same server. A client and server supporting Keep-Alive reuse the same connection for several requests, until one of them closes the connection. What is most important in this scenario is that if either party doesn’t respect this hint, nothing breaks. In fact, nothing works any differently, except, of course, for the extra connections and disconnections that would occur for each request.
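
For illustration, the hint is nothing more than a header on an otherwise ordinary request (the host, port and path here are hypothetical), and either side is free to ignore it:

GET /command?data=BASE64PAYLOAD HTTP/1.1
Host: child-host.example.com:8080
Connection: Keep-Alive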

Since the implementor of this feature was a savvy web developer, he always had this flag set. And so, as long as the connection wasn’t interrupted, or indeed the underlying library we were using didn’t decide to close the connection on a whim, all was well and we had no problems. However, when a new request went out on a new connection, rather than an existing one, the Child’s server would accept a new socket, on a new port, rather than the previously open socket. This is what was happening in my test environment. Perhaps it was the fact that I was testing across VM images that triggered the disconnections. Anyway, this newly opened socket on the Child had no encryption details associated with it. It was a brand-new connection. It should expect an encryption key exchange. But due to implementation details, the request would have an ‘encrypted’ flag set and the Child wouldn’t mind that we had negotiated no cryptographic keys. It’d go ahead and try to decipher the request, only it couldn’t, resulting in the logged encryption exception followed by disconnection.

Post Mortem

Once the issue was figured out, the solution was simple, albeit costly. The HTTP abstraction had leaked an ugly property that we had assumed was abstracted away. At design time, we didn’t care what protocol we used to carry our bits. Encryption was added almost as an afterthought. True, encryption does require state. However, looking at our code, the socket-level connection was abstracted away by layers and layers of library code. In fact, all we had was a single static function which took a URL string for a request. We had serialized the request message, encoded it in base-64 and appended it to the URL, which contained the server hostname/IP and port; a standard web request, really.

On the communication layer, we had this single URL construction and the request call. On the data layer, we had the encryption, serialization and data-manipulation logic. On the application layer, well, there were no network details whatsoever. Most of the previous code which had worked locally remained the same, with the implementation changed to interface with the new network layer. So in a sense the code evolved and adapted to its final form, and it wasn’t anywhere near apparent that we had leaked a major problem into our code.

In hindsight, we should’ve taken matters into our own hands and implemented a session-based protocol directly. This would have made sense because we’d have been in complete control of all network matters. For one, with HTTP we couldn’t change the sockets to use async logic, nor could we change the buffer sizes and timeouts. Perhaps we didn’t need to, but considering the gigabytes/hour we expected to transfer, sooner or later we’d have to optimize and tune the system for performance. But the developer assigned was inexperienced and we couldn’t afford the time overhead. Personally, I feared things would get too complicated for the assigned developer to handle. I let him pick the protocol he was most comfortable with. And that’s the real danger of leaky abstraction: everyone is tricked, including the experienced.

Indeed, we ended up rewriting the communication layer. First the HTTP code was replaced with plain sockets using TCP/IP. Next, sessions were added, such that disconnections were recoverable. That is, the data layer didn’t care whether communication was interrupted or not. We weren’t going to rely on the fact that we controlled the sockets; disconnections were made irrelevant by design. And finally, our protocol required a strict sequence of initialization and handshaking that ensured correct state. Once the code was working as expected, we changed the sockets to use an async interface for maximum performance.

Overall, we spent an extra 2 man-months and, as a result, the data and communication layers were sped up several times over. Still, this was one hell of a case of leaky abstraction.


Update:
Many are asking why we didn’t use SSL. The answer is that HTTP was the wrong choice in the first place.
We weren’t building a web service. This was a backend server communicating with the main product. We didn’t want to limit the protocol, features or extensibility by the choice of communication details. SSL would resolve the encryption issue, but we’d have had to implement an HTTPS server. In addition, whatever application protocol we would eventually realize on top of HTTP/S would have had to be connectionless. The encryption layer simply uncovered this design implication that we had overlooked, hence the leak in the communication-protocol abstraction. At that point we didn’t have any control messages, nor did we have requests that needed state; later we added both. In fact, we added a command to iterate over a large set of data, returning one item at a time. HTTP/S would’ve made this harder to implement, as the state would have had to be sent with each request. Control messages and heartbeats would’ve been pointless.
In short, HTTP gave us very little and its connectionless nature caused us a lot of trouble. We got rid of both. We exchanged a wrong solution for the right one, hardly reinventing the wheel, if you ask me.

May 29 2011

I was looking for a good book that made a good case for theistic beliefs without being preachy. That is, a book that introduced me to the arguments upon which the world religions build their theologies. The three world religions I speak of are the Abrahamic religions. Abraham, a prophet recognized by these religions, is considered the first man to have had the honor of being spoken to directly by God. The oldest manuscripts to recognize Abraham’s status and record the encounter can be found in the Hebrew Bible, or, as the Christians call it, the Old Testament. As such, Jewish theology seemed to be the most reasonable source for the foundations I was looking forward to studying. This is one of the books that was suggested.

Cover of "God According to God: A Physici... (cover via Amazon)

Gerald Schroeder’s book is subtitled “A Physicist Proves We’ve Been Wrong About God All Along.” I get it, he has a degree in physics. What does that have to do with anything? Is that not an attempt at appealing to authority? Should we trust his views, before even reading a single line, just because he has a degree in physics? Or does it show that he knows what he’s talking about any better?

God According to God is well written. The author is not only a good writer but also well versed in all the topics he touches upon, and he frequently acknowledges the obvious counter-arguments to the points he makes. In chapter 3, "The Unlikely Planet Earth," he uses the Drake equation to estimate the number of Earth-like planets in the visible universe, and at the end of the chapter he concludes:

The estimated number of stars in the entire visible universe is in the order of 10^22. This indicates that in the entire universe there may be approximately 10^4, or 10,000, earthlike planets circling a sunlike star. These 10,000 potentially earthlike planets would be distributed among the 10^11, or 100,000,000,000, galaxies in the entire visible universe. That comes out to be one earthlike planet for each 10,000,000 galaxies. The probability that any one galaxy would have more than one life-bearing stellar system is slim indeed.

To be honest, at this point I had already read three chapters and was a bit surprised that his conclusion wasn't that Earth was the only possible host of life; part of the reason for that expectation is his obvious bias toward demonstrating how unique and rare life on Earth is. Although his assumptions are a bit conservative (for example, he doesn't consider the possibility of life on moons orbiting large planets, such as Titan), his conclusion is spot on. For what it's worth, I thought he wasted a good number of pages in this chapter, as the conclusion, if anything, convinced me that Earth is just a fluke, with possibly 10,000 more sprinkled around. Whatever is so special about that escapes me.

The book can be divided into two logical domains, physics and theology, though of course they don't share an equal number of pages. The division is so stark that one might think the respective chapters were written by completely different authors; as a matter of fact, there are contradictions between them. In chapter 2, "The Origins of Life," he writes:

Our cosmic genesis began billions of years ago in our perspective of time, first as beams of energy, then as the heavier elements fashioned within stars and supernovae from the primordial hydrogen and helium, next as stardust remnants expelled in the bursts of supernovae, and finally reaching home as rocks and water and a few simple molecules that became alive on the once molten earth.

Later, in chapter 4 “Nature Rebels”:

In the Garden of Eden, 2,448 years prior to this revelation at Sinai, Adam and Eve were confronted with the identical options.

This caused me so much cognitive dissonance that I went back to find the section where the cosmic origin, what he calls the "Big Bang Creation," is described. This physicist apparently accepts that our planet has billions of years behind it, yet he maintains that Adam and Eve were in the Garden of Eden exactly 2,448 years before the revelation at Sinai! Considering the era in which the Garden of Eden encounters supposedly occurred, and the lack of numbers in any biblical or other sources, that figure is extraordinarily precise. Not only does it go unexplained, but Schroeder also assumes the reader has already accepted the Garden of Eden events as told in the Bible. In fact, that is my main point here: the author assumes the reader is a believer, well acquainted with the theology, and he is essentially supplying scientific backing while, as becomes apparent in later chapters, adding his own interpretation and understanding of the nature of God.

Perhaps the title should have given a clue or two as to the author's conviction regarding his understanding of God's nature and plan. There are perhaps fewer color hues in a rainbow than there are interpretations and explanations of God's nature, plan, and instructions to the human race. The author of God According to God adds yet another, and it is not a conventional one; at least it isn't to me.

In chapter 6 “Arguing with God”:

The sequence of events at and following the binding give compelling force to the supposition that the God of the Bible not only wants a dialogue with us humans, but even more than that. God expects such, and if the situation seems unjust or unjustified, then, beyond a dialogue, God wants us to argue. If our case is strong enough, God will even “give in,” or at least modify the Divine directive. Moses seems to have understood this trait of the Divine.

A few pages down:

Argument seems to be the standard and the expected biblical operating procedure in our encounters with the Divine. The surprise is that, having designed and created our universe with all its magnificence and granted us the freedom of choice, God wants us, expects us, to interact with the Divine about how to run the universe.

In the next chapter “In Defense of God”:

As I read the events of the Bible, in human terms I see God in a sort of emotional bind. God desperately wants us to choose life, a dynamic, purposeful existence, but doesn’t want to force us along that line. Hence we are granted the liberating tzimtzum of creation. God has to hold back and let us try. When we really mess up, God steps in. It’s so human. Mom teaches junior to play chess. Looking over his shoulder as her son makes his moves on the board, she sees a trap developing. He is about to lose his queen. If she wants her kid to learn to think ahead, to envision the distant outcome of the initial move before that move is made, she will do well to keep her hands in her pockets and let him make the error or at most give a few very general suggestions, as God through the Bible gives to us. It’s frustrating, even painful, but it is part of the learning process, Divine as well as human.

The above quotes are not the only passages that made me stop reading and pause… for a while. It might be that I had expected the run-of-the-mill explanations and arguments. Instead, I found radically new concepts, ideas I hadn't encountered before, some of which could well be called heretical. If we make a strong case arguing with God, "God will even 'give in,'" and "[…] God wants us, expects us, to interact with the Divine about how to run the universe." And apparently, there is a "Divine as well as human" learning process!

Whatever your stance on God and religion, God According to God isn't a rehash of age-old arguments, nor is it the typical "science proves the existence of God" kind of book. Gerald Schroeder is very well read in ancient Jewish texts, his Hebrew skills are of translator caliber, and his science is, as far as I can tell, solid. Overall, I learned quite a bit from the historical writings and the ancient Jewish theology blended in with the science, and from the idea of God learning along with us as we go. It's just that I didn't get what I paid for.

May 022011
 

Data sharing between threads is a tricky business. Anyone with any kind of experience with multi-threaded code will give you 1,001 synonyms for "tricky," most of which you probably wouldn't use in front of your parents. The problem I'm about to present, however, has nothing to do with threading and everything to do with data sharing and leaky abstractions.

There is a pattern that is used very often when one object must be notified symmetrically at the beginning and end of another object's lifetime. That is, suppose we have a class that needs to be told when a certain other class is created, and again when it is destroyed. One way to achieve this is to simply set a flag to true in the constructor of the second object and back to false in its destructor.

This particular example is in C++ but that’s just to illustrate the pattern.

class Object
{
public:
    explicit Object(SomeComponent& comp) : m_component(comp)
    {
        m_component.setOnline(true); // We're online.
    }

    ~Object()
    {
        m_component.setOnline(false); // Offline.
    }

private:
    SomeComponent& m_component; // The component we notify; shared, not owned.
};

This looks foolproof: there is no way the flag will not get set, so long as Object is created and destroyed as intended. Typically, our code will be used as follows:

Object* pObject = new Object(component);
// component knows we are online and processing...

delete pObject; // Go offline and cleanup.

Now let’s see how someone might use this class…

// use a smart pointer to avoid memory leaks...
std::auto_ptr<Object> apObject;

// Recreate a new object...
apObject.reset(new Object(component));

See a problem? The code fails miserably, and it's not even obvious why: there are implicit assumptions and a leaky abstraction at work. Let's dissect that last line…

Object* temp_object = new Object(component); // the new Object is constructed first
  Object::Object();
    component.setOnline(true);   // the flag was already true!
delete apObject.ptr;             // auto_ptr::reset() then deletes the old instance
  Object::~Object();
    component.setOnline(false);  // OUCH! The new instance is alive, but the flag is now false.
apObject.ptr = temp_object;      // only now does the auto_ptr take ownership of the new instance

See what happened?

Both authors wrote pretty straightforward code, and they couldn't have done better without making assumptions beyond the scope of their work. This is a trap that is very easy to fall into, and it's far from fun. Consider how one could have detected the problem in the first place; it's not obvious. The flag was being set correctly, yet sometimes it ended up wrong: whenever there is an existing Object instance and we create another one to replace it, the flag ends up false. The first time we create an Object, all works fine. The second time, the component seems to be unaware of us setting the flag to true.

Someone noticed the failure, assumed the flag wasn't always set, or was perhaps set incorrectly, reviewed the class code and, sure enough, concluded that it was all correct. Looking at the use-case of Object, we don't necessarily run through the guts of auto_ptr; after all, it's a building block, a pattern, an abstraction over a memory block. One would take a quick look, see that an instance of Object is created and stored in an auto_ptr, and find nothing out of the ordinary.

So why did the code fail?

The answer is on multiple levels. First and foremost, we had shared data that wasn't reference counted. This is the major failing point: the shared data is a liability because it isn't covered by the abstraction of independent object instances, the very abstraction auto_ptr assumes when it treats what it holds as an independent memory block. We challenged the assumptions auto_ptr makes and failed to safeguard our implicitly shared data.

In other words, we had two instances of Object alive at the same time, but the flag we were updating had only two states, true and false, so it had no way of tracking anything beyond a single piece of information; in our case, whether we were online or not. The author of Object made a dangerous assumption: that the flag's state is equivalent to Object's lifetime. That assumption should have raised the question of whether more than one instance of Object can exist. Asking it would have avoided a lot of problems down the road, but it wasn't obvious and perhaps never occurred to anyone.

Second, even if we assume that logically there can be only one instance of Object, unless we make it impossible to create a second instance by means of language features, we are bound to see misuse, as clearly happened here. And we can't blame the cautious programmer who used auto_ptr, either.

If something shouldn’t happen, prevent it by making it impossible to happen.

Solutions

The solutions aren't that simple. An obvious one is to take the flag-setting calls out of Object and make them manually. However, this defeats the purpose of putting them where one couldn't possibly forget or miss calling them. Consider the case where the flag should be set to false when Object is destroyed, but the destruction happens during stack unwinding caused by an exception; we would then have to catch the exception and clear the flag ourselves, which is never as straightforward as one would like, especially in complex and mature production code. Indeed, the automatic guarantees of the language (in this case, that the ctor and dtor are always called) are huge advantages that we can't afford to give up.
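
For comparison, here is a rough sketch of that manual alternative, assuming Object no longer touches the flag itself and with a hypothetical doWork() standing in for whatever might throw; every exit path has to remember to clear the flag:

// Sketch only: assumes Object's ctor/dtor no longer set the flag themselves.
void useObjectManually(SomeComponent& component)
{
    component.setOnline(true);        // the manual "constructor" side
    Object* pObject = new Object(component);
    try
    {
        doWork(*pObject);             // hypothetical work that may throw
    }
    catch (...)
    {
        component.setOnline(false);   // must not forget this path...
        delete pObject;
        throw;                        // re-throw after cleanup
    }
    component.setOnline(false);       // ...nor this one
    delete pObject;
}

Every new early return or exception path is another place to forget the call, which is exactly what the ctor/dtor pattern was buying us.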

One possible solution is to prevent more than one Object from existing at a time. But that can be very problematic: consider the case where we have multiple component instances and we want a separate Object per component, not a globally unique Object instance.
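
In that spirit ("make it impossible to happen"), one sketch, and only a sketch, is to have the component itself refuse a second attachment; the choice of throwing, like the names, is mine, not from the original code:

#include <stdexcept>

// Sketch: the component refuses to be put online twice at the same time.
class SomeComponent
{
public:
    void setOnline(bool online)
    {
        if (online && m_online)
            throw std::logic_error("a live Object is already attached to this component");
        m_online = online;
    }

private:
    bool m_online = false;
};

With this guard, the auto_ptr::reset() misuse above fails loudly when the second Object is constructed, instead of silently leaving the flag in the wrong state. It doesn't fix the design, but it turns a subtle corruption into an immediate error.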

As I said, there is no easy solution. The one I'd use is the next best thing to preventing instance creation: counting the number of instances. However, even if we reference count the Objects, or the calls that set the flag, we must in any event redefine the contract. What does it mean to have multiple instances of Object, and multiple calls setting the flag to true? Does it mean we still have one responsible object, and what guarantees that? What if there are other constraints; might some other code assume there is only one instance of Object whenever the flag is set?
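
A minimal sketch of that counting approach, under the assumption that we are free to change the component's interface (attach/detach are names I made up; Object's ctor and dtor would call them instead of setOnline):

// Sketch: the component counts attached Objects instead of storing a bare flag.
class SomeComponent
{
public:
    void attach() { ++m_attached; }                      // called from Object's ctor
    void detach() { if (m_attached > 0) --m_attached; }  // called from Object's dtor

    bool isOnline() const { return m_attached > 0; }     // online while any Object is alive

private:
    int m_attached = 0;
};

With this, the construct-new-then-delete-old sequence performed by auto_ptr::reset() leaves the count at one, which is the correct state; but, as the questions above show, the contract still has to spell out what overlapping instances mean.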

All of the questions that flow from these suggested solutions demand that we define, or redefine, the contracts and assumptions of our objects. And whatever solution we agree on will have its own set of requirements, and perhaps its own assumptions, if we're not careful.

Conclusion

Using design patterns and best practices is, without a doubt, highly recommended, yet ironically it can sometimes lead to the most unexpected results. This is no criticism of following such recommendations from experienced specialists and industry leaders; rather, the trouble comes from combining abstractions in ways that not only hide some very fundamental assumptions in our design and implementation, but even create situations where the implicit assumptions of our code are challenged. The case presented is a good example: had the developers not used the ctor/dtor pattern for setting the flag, or had they not used auto_ptr, no such problem would have arisen, although they would have had other failure points, as already mentioned.

Admittedly, without experience it’s near impossible to catch similar cases simply by reading code or, preferably, while designing. And inexperience has no easy remedy. But if someone figures out a trick, don’t hesitate to contact me.
