Coding – the Void

The Second Law of Optimization (Abridged)

Jun 182011

I realize that the original article of the same title was longer than what most would like to read. So here is an abridged version.

By now everybody and their screensaver have heard the Optimization Mantra: Don’t Do It! This is commonly wrapped in a three-rule package. The first two of which are copies of the mantra, and the third adds the wise word “Yet” to the one-and-only true rule and addresses it to the “expert.”

Premature optimization; Most of us have been there. And that’s what makes those words very familiar. Words of wisdom, if you will. We’ve all decided to do a smart trick or two before fleshing out the algorithm and even checking if it compiles, let alone checking the result, only to be dumbfounded by the output. I can figure it out! we declare… and after half a day, we’d be damned if we rewrote that stupid function from scratch. No chance, bub.

The rules are sound. No doubt. Another rule of optimization, when the time comes, is to use profilers and never, ever, make costly assumptions. And any assumption is probably costly. That, too, is sound. These are words of wisdom, no doubt. But, taken at face-value they could cause some harm.

In all but the smallest projects one must use a profiler, consult with others and especially talk with module owners, veteran developers and the architects before making any changes. The change-set must be planned, designed and well managed. The larger the project, the more this stage becomes important. No funny tricks, please.

Efficient Code != Premature Optimization

The traditional wisdom tells us to avoid premature optimization and when absolutely necessary, we should first use a profiler. But both of these can also be interpreted as follows: it’s OK to write inefficient and bloated code, and when necessary, we’ll see what the profiler comes up with.

Performance as an afterthought is very costly. Extremely so. But the alternative isn’t premature optimization. There is a very thin line between well-thought and designed code that you’d expect a professional to output and the student toy-project style coding. The latter focuses on getting the problem-of-the-moment solved, without any regards to error handling or performance or indeed maintenance.

It’s not premature optimization to use dictionary/map instead of a list or array if reading is more common. It’s not premature optimization to use an O(n) algorithm instead of the O(n²) that isn’t much more complicated than what we’ll use (if not an O(log₂ n) algorithm). Similarly, moving invariant data outside a loop isn’t premature optimization.

As much as I’d hate to have a pretentious show-off in my team, who’d go around “optimizing” code by making wild guesses and random changes, without running a profiler or talking with their colleagues, I’d hate it even more if the team spent their time cleaning up after one another. It’s easy to write code without thinking more than a single step ahead. It’s easy to type some code, run it, add random trace logs (instead of properly debugging,) augment the code, run again, and repeat until the correct output is observed. As dull and dead-boring as that is.

I’m not suggesting that this extreme worse-case that I’ve described is the norm (although you’d be surprised to learn just how common it is.) My point is that there is a golden mean between “premature optimization” and “garbage coding.”

The Cost of Change

It’s well documented that the cost of change increases exponentially the later a project is in it’s development cycle. (See for example Code Complete.) This cost is sometimes overlooked, thanks to the Rule of Optimization. The rule highly discourages thinking about performance when one should at least give it a good thinking when designing.

This doesn’t suggest optimization-oriented development. Rather, having a conscious grasp of the performance implications can avoid a lot of painful change down the road. As we’ve already iterated, designing and writing efficient code doesn’t necessarily mean premature optimization. It just means we’re responsible and we are balancing the cost by investing a little early and avoiding a high cost in the future. For a real-life example see Robert O’Callahan’s post.

Conclusion

Premature optimization is a major trap. The wisdom of the community tells us to avoid experimenting on our production code and postponing optimization as much as possible. Only when the code is mature, and only when necessary should we, by the aid of a profiler, decide the hot-spots and then, and only then, very carefully optimize the code.

There are no excuses to writing inefficient code if the alternative is available at a small or no cost. There is no excuse to not thinking the algorithm ahead of typing. No excuse to leaving old experimental bits and pieces because we might need them later, or that we’ll cleanup later when we optimize. The cost of poor design, badly performing code is very high.

Let’s optimize later, but let’s write efficient code, not optimum, just efficient, from the get-go.

June 18, 2011
Posted by Ashod Nakashian at 9:07 am
1 Response
Programming
Tagged with: Coding, Optimization
Font Size:
A A A

The Second Law of Optimization

Jun 172011

Image via Wikipedia

Abridged version here.

By now everybody and their screensaver have heard the Optimization Mantra: Don’t Do It! This is commonly wrapped in a three-rule package (I suspect there is a word for that). The first two of which are copies of the mantra, and the third adds the wise word “Yet” to the one-and-only true rule and addresses it to the “expert.” I suspect originally the middle “rule” didn’t exist and it was later added for effect, and perhaps to get the total to the magic-number of three.

I can almost imagine Knuth after figuring out a single-character bug in a bunch of code, with coffee mugs and burger wraps (or whatever it was that was popular in the ’60s) scattered around the desk… eyes red-shot, sleepless and edgy, declaring “Premature optimization is the root of all evil.” (But in my mind he uses more graphic synonyms for ‘evil’.)

Knuth later attributed that bit about premature optimization to Tony Hoare (the author of QuickSort) thereby distorting my mental image of young Knuth swearing as he fixed his code, only later to be saved by Hoare himself who apparently doesn’t remember uttering or coining such words. (Somebody’s got bad memory… may be more than one.)

Smart Aleck

Probably the smarter amongst us would learn from the experience of such dogged persistence and avoid trickery the next time around. While few would admit to the less-intelligent decisions they took in the past, at least some will have learned a lesson or two when the next opportunity knocked.

The aforementioned trickery doesn’t have to be optimization trickery, mind you. Some people (read: everyone) likes to be a smart-ass every so often and show off. Sure, many end up shooting themselves in the foot and making fools of themselves. But that doesn’t stop the kids from doing a crazy jump while waving to their friends, iPod on and eating crackers, just to impress someone… who typically turns around precisely when they shouldn’t. (Did I mention skateboards?)

Let’s take it from the top. Leaving aside the smaller projects we might have started and for years tended and developed. Most projects involve multiple developers and typically span generations of developers. They are legacy projects, in the sense of having a long and rich history. No one person can tell you this history, let alone describe all parts of the code. On such a project, if performance is an issue, you shouldn’t go about shooting in the dark and spending days or even weeks on your hunches. Such an approach will not only waste time, add a lot of noise and pollute the code-base and source repository (if you commit to the trunk, which you should never do, until done and ready to merge.)

In such a case, one must use a profiler, consult with others and especially talk with module owners, veteran developers and the architects before making any changes. The change-set must be planned, designed and well managed. The larger the project, the more this stage becomes important. No funny tricks, please.

Efficient Code != Premature Optimization

Of the standard interview questions we often ask (and get asked) are those on the data-structures and algorithms (discrete math and information theory.) I typically ask candidates to compare the data-structures in terms of performance which should cover both internal details and complexity characteristics (big O). It’s also a good opportunity to see how organized their thoughts are. We use arrays, lists and maps/dictionaries quite often and not having a good grasp of their essence is a shortcoming. As a follow-up to this I typically ask how they decide which to use. This isn’t an easy question, I realize. Some things are hard to put into words, even when we have a good understanding of them in our minds. But, interviews aren’t meant to be easy.

The worst answer I ever got was “I use List, because that’s what I get.” To which I had to ask “Where?” Apparently, the candidate worked on a legacy project that used Lists almost exclusively and bizarrely she never had a need for anything else. The best answer typically gives a description of the use-case. That is, they describe the algorithm they’ll implement, and from that, they decide which container to use.

The best answer isn’t merely a more detailed or technical answer. Not just. It’s the best answer because it’s the only answer that gives you reasons. The candidate must have thought about the problem and decided on an algorithm to use, they must’ve quantified the complexity of the algorithm (big O) and they must’ve known the performance characteristics of the different containers for the operations their algorithm needs. They have thought about the problem and their solution thoroughly before choosing containers, designing classes and whatnot.

Performance as an afterthought is very costly. Extremely so. But I don’t recommend premature optimization. There is a very thin line between well-thought and designed code that you’d expect a professional to output and the student toy-project style coding. The latter focuses on getting the problem-of-the-moment solved, without any regards to error handling or performance or indeed maintenance. We’ve all done it; Multiple similar functions; Same function reused for different purposes with too many responsibilities; Unreasonable resource consumption and horrible performance characteristics that the author is probably oblivious to. And so on.

It’s not premature optimization to use dictionary/map instead of a list or the most common container in your language of choice. Not when we have to read items most of the time. It’s not premature optimization if we use an O(n) algorithm instead of the O(n²) that isn’t much more complicated than what we’ll use (if not an O(log₂ n) algorithm). It’s not premature optimization if we refactor a function so it wouldn’t handle multiple unrelated cases. Similarly, moving invariant data outside a loop isn’t premature optimization. Nor is caching very complex calculation results that we don’t need to redo.

Regex object construction is typically an expensive operation due to the parsing and optimizations involve. Some dynamic languages allow for runtime compilation for further optimization. If the expression string isn’t modified, creating a new instance of this object multiple times isn’t smart. In C# this would be creating a Regex object with RegexOptions.Compiled and in Java a Pattern.compile() called from matches() on a string. Making the object a static member is the smartest solution and hardly a premature optimization. And the list goes on.

I don’t know about you, but to me, writing and modifying code instead of designing and working out the algorithm beforehand is simply counter productive. It’s not fun either. Similarly, debugging is much more interesting and engaging than adding random trace logs until we figure out what’s going on.

The Cost of Change

When it’s time to spend valuable resources on optimization (including the cost of buying profilers,) I don’t expect us to discover that we needed a hash-table instead of an array after all. Rather, I should expect the profiler to come up with more subtle insights. Information that we couldn’t easily guess (and indeed we shouldn’t need to.) I should expect the seniors in the team to have a good grasp of the performance characteristics of project, the weak points and the limitations. Surely the profiler will give us accurate information, but unless we are in a good position to make informed and educated guesses, the profiler won’t help us much. Furthermore, understanding and analyzing the profiler’s output isn’t trivial. And if we have no clue what to change, and how our changes would affect the performance, we’ll use the profiler much like the student who prints traces of variables and repeatedly makes random changes until the output is the one expected. In short, the profiler just gives us raw data, we still have to interpret it, design a change-set and have a reasonably sound expectation of improved performance. Otherwise, profiling will be pure waste.

It’s well documented that the cost of change increases exponentially the later a project is in it’s development cycle. (See for example Code Complete.) This means a design issue caught during designing or planning will cost next to nothing to fix. However, try to fix that design defect when you’re performing system-testing, after having most modules integrated and working, and you’ll find that the change cascades over all the consequent stages and work completed.

This cost is sometimes overlooked, thanks to the Rule of Optimization. The rule highly discourages thinking about performance when one should at least give it a good thinking when the business and technical requirements are finalized (as far as design is concerned) and an initial design is complete. The architecture should answer to the performance question. And at every step of the development path developers must consider the consequences of their choices, algorithms and data-structures.

I know there is a camp that by this point is probably laughing at my naïve thinking. By the time I finish optimizing or even writing efficient and clean code, they’ll say, their product would’ve shipped and the customer would’ve bought the latest and faster hardware that will offset their disadvantage in performance. “What’s the point?” they add. While this is partially true, (and it has happened before,) given the same data, the better performing product will still finish sooner on the same hardware. In addition, now that processors have stopped scaling vertically, the better designed code for concurrent scalability (horizontal scaling) will outperform even the best algorithm. This, not to mention, data outgrows hardware any day.

Conclusion

Premature optimization is a major trap. We learn by falling, getting up, dusting off and falling again. We learn by making mistakes. The wisdom of the community tells us to avoid experimenting on our production code and postponing optimization as much as possible. Only when the code is mature, and only when necessary should we, by the aid of a profiler, decide the hot-spots and then, and only then, very carefully optimize the code.

This strategy encourages developers to come up with inefficient, thoughtless and -often- outright ugly code. All in the name of avoiding premature optimization. Furthermore, it incorrectly assumes that profiling is a magic solution to improving performance. It neglects to mention how involved profiling is. Those who had no clue as to why their code is bogged down, won’t know what to change even if the profiler screamed where the slowest statement is.

Let’s optimize later, but let’s write efficient code, not optimum, just efficient, from the get-go.

June 17, 2011
Posted by Ashod Nakashian at 5:33 pm
5 Responses
Programming
Tagged with: Coding, Optimization
Font Size:
A A A

A Bad Case of Leaky Abstraction

May 302011

The story I’m about to tell is the worst case of leaky abstraction that I’ve encountered and had to resolve. Actually, it’s the most profound example that I know of. Profound here is used in a negative sense, of course. This isn’t the performance issues Joel brings as examples to leaky abstraction. Nor is it a case of getting a network exception while working with files, because the folder is now mapped to a network drive. Nor a case of getting out-of-disk-space error while writing to a file-backed memory block. This is a case of frustration and a very hard-to-catch defect. A case of a project that came too close to failing altogether, had we not figured it out in time.

http-Request from browser via webserver and back

Image via Wikipedia

Background

Our flagship project worked with 3rd party servers which for technical reasons had to be installed on the same machine as our product. This was an obvious shortcoming and had to change. The requirements came in and we were asked to move the interfaces with all 3rd party servers into lightweight distributed processes. These new processes were dubbed Child processes, while the central process, well, Central. A Child would run on a different machine on the network, communicating with the Central process to get commands, execute them and return the results.

The design was straightforward: all processes would be running as services listening on some configurable port. The commands were simple application-level message objects, each with its own type, serialized on the wire. The operations were synchronous and we needn’t progress updates or heartbeats. We left the network design synchronous as well, for simplicity.

The developer who was assigned the communication-layer had background in web-services and proposed to use standard HTTP protocol. We thought about it and declared that while it would have some small overhead, the simplicity of reusing some library would be a plus. After all, HTTP has data recovery and is a standard protocol. And if we really cared about overhead, we’d use UDP which has no duplicate detection, data recovery or even ordered packet transmission. Plus, the developer who’d work on this feature was comfortable with HTTP. So why not?

As it turns out, HTTP was the single worst decision made on this particular project.

Since we were now transmitting potentially sensitive data, the requirements were amended to include data encryption to protect our customers’ data from network sniffers. We used a standard asymmetric encryption for extra security. This meant that we had to generate a pair of public and private keys each time we connected. We devised a protocol to communicate the key the Child must have using a symmetric encryption algorithm. We were confident this was secure enough for our needs and it wasn’t overly complicated to implement.

Trouble

The project was complete when I took the product for a final set of developer white-box testing. This is something I learned to do before shipping any product, as I do have the responsibility of designing the features, I also feel responsible for looking under the hood in case there is some potential mechanical or technical issue. Much like your car mechanic would do before you go on an off-road trip.

That’s when things started to fall apart. All worked fine, except every now and then I’d get errors and the Child would disconnect. Inspecting the logs showed data-encryption exceptions. The deciphering function was failing. Every single developer who run the code verified that it worked fine without any problems whatsoever. I asked them to pay attention to this issue. They came back saying all was perfectly fine. It was only on my machine!

Mind you, I learned not to blame before I eliminate every possible culprit. And the prime suspect is, always, a bug. A developer error. So I started sniffing around. Visiting the design back and forth. Nothing made sense of the issue. The code works, almost all the time. Then it fails. Reconnect again, and it works fine… until it fails again.

Troubleshooting this issue wasn’t fun, precisely because it wasn’t fruitful. No amount of debugging helped, or, in fact, could ever help. The puzzle had to be solved by reason. Experimentation showed as much as I had already gathered from the logs. Still, I tried different scenarios. One thing was for sure, you couldn’t tell when it’ll fail next.

The Leak

Remember that HTTP is a connectionless protocol. That is, it’s designed to communicate a single request and its response and disconnect. This is the typical scenario. It holds no connections and no states, therefore, it has no session. On the web, sessions are realized by the HTTP server. An HTTP server would typically create some unique key upon login or first-request missing a key, and it would track all subsequent requests by getting the said key either in the URL or using cookies. In any event, even though a web-service may have support for sessions, the underlying protocol is still connectionless and stateless.

To improve performance, an afterthought of reusing connections was added. This is typically called Keep-Alive. The idea is that a flag is added to the HTTP header which tells the server not to close the connection immediately after responding to the request, anticipating further requests. This is reasonable, as a web-page typically loads multiple images and embedded items from the same server. The client and server supporting Keep-Alive reuse the same connection for several requests, until one of them closes the connection. What is most important in this scenario is that if either party doesn’t respect this hint, nothing would break. In fact, nothing would work any different, except, of course, for the extra connections and disconnections that would occur for each request.

Since the implementor of this feature was a savvy web developer, he always had this flag set. And so, as long as the connection wasn’t interrupted, or indeed, the underlying library we were using didn’t decide to close the connection on whim, all was well and we had no problems. However, when a new request went with a new connection, rather than an existing one, the Child’s server would accept a new socket, on a new port, rather than the previously open socket. This is what was happening on my test environment. Perhaps it was the fact that I was testing across VM images that triggered the disconnections. Anyway, this newly opened socket on the Child has no encryption details associated with it. It’s a brand-new connection. It should expect encryption key exchange. But due to implementation details, the request would have an ‘encrypted’ flag set and the Child wouldn’t mind that we negotiated no cryptographic keys. It’d go ahead and try to decipher the request, only, it couldn’t. Resulting in the logged encryption exception followed by disconnection.

Post Mortem

Once the issue was figured out, the solution was simple, albeit costly. The HTTP abstraction had leaked an ugly property that we had assumed abstracted away. At design time, we couldn’t care what protocol we used to carry our bits. Encryption was added almost as an afterthought. True that encryption does require a state. However looking at our code, the socket-level connection was abstracted by layers and layers of library code. In fact, all we had was a single static function which took a URL string for a request. We had serialized and encoded the request message in base-64 and appended to the URL, which contained the server hostname/ip and port; standard web request, really.

On the communication layer, we had this single URL construction and the request call. On the data layer, we had the encryption, serialization and data manipulation logic. On the application layer, well, there were no network details whatsoever. Most of the previous code which worked locally had remained the same, with the implementation changed to interface the new network layer. So in a sense the code evolved and adapted to its final form and it wasn’t anywhere near apparent that we had leaked a major problem into our code.

In hindsight, we should’ve taken matters into our hands and implement a session-based protocol directly. This would make sense because we’d be in complete control of all network matters. For one, with HTTP we couldn’t change the sockets to use async logic, nor could we change the buffer sizes and timeouts. Perhaps we didn’t need to, but considering the gigabytes/hour we expected to transfer, sooner or later we’d have to optimize and tune the system for performance. But, the developer assigned was inexperienced and we couldn’t afford the time overhead. Personally, I feared things would get too complicated for the assigned developer to handle. I let him pick the protocol he was most comfortable with. And that’s the real danger of leaky abstraction; everyone is tricked, including the experienced.

Indeed, we ended up rewriting the communication layer. First the HTTP code was replaced with plain sockets using TCP/IP. Next, sessions were added, such that disconnections were recoverable. That is, the data layer didn’t care whether communication was interrupted or not. We weren’t going to rely on the fact that we controlled the sockets. Disconnections were made irrelevant by design. And finally, our protocol required a strict sequence of initialization and handshake that insured correct state. Once the code was working as expected, we changed the sockets to use async interface for maximum performance.

Overall, we spent an extra 2 man/months and, as a result, the data+communication layer was sped up several times over. Still, this was one hell of a case of leaky abstraction.

The pitfall of smart-pointers and implicit data sharing

May 022011

Data sharing between threads is a tricky business. Anyone with any kind of experience with multi-threaded code will give you a 1001 synonyms for “tricky,” most of which you probably wouldn’t use in front of your parents. The problem I’m about the present, however, has zero to do with threading and everything with data sharing and leaky abstraction.

This is a pattern that is used very often when one object is used symmetrically at the beginning and end of another’s lifetime. That is, suppose we have a class that needs to get notified when a certain other class is created, and then again when it’s destroyed. One way to achieve this, is to simply set a flag once to true and a second time to false, in the constructor and destructor of the second object, respectively.

This particular example is in C++ but that’s just to illustrate the pattern.

class Object
{
public:

Object(SomeComponent& comp) : m_component(comp)
{
    m_component.setOnline(true); // We’re online.
}

~Object()
{
    m_component.setOnline(false); // Offline.
}
};

This looks fool-proof, as there is no way the flag will not get set, so long that Object is created and destroyed as intended. Typically, our code will be used as follows:

Object* pObject = new Object(component);
// component knows we are online and processing...

delete pObject; // Go offline and cleanup.

Now let’s see how someone might use this class…

// use smart pointer to avoid memory leaks...
std::auto_ptr<object> apObject;

// Recreate a new object...
apObject.reset(new Object(component));

See a problem? The code fails miserably! And it’s not even obvious. Why? Because there are implicit assumptions and a leaky abstraction at work. Let’s dice the last line…

Object* temp_object = new Object(component); // create new Object
  Object::Object();
    component.setOnline(true);  // was already true!
delete apObject.ptr; // new instance passed to auto_ptr
  Object::~Object(); // old instance deleted
    component.setOnline(false); // OUCH!
apObject.ptr = temp_object;

See what happened?

Both authors wrote pretty straightforward code. They couldn’t have done better without making assumptions beyond the scope of their work. This is a pattern that is very easy to run into, and it’s far from fun. Consider how one could have detected the problem in the first place. It’s not obvious. The flag was set correctly, but sometimes would fail! That is, whenever there is an Object instance, and we create another one to replace the first, the flag ends up being false. The first time we create an Object, all works fine. The second time, component seems to be unaware of us setting the flag to true.

Someone noticed the failure, assumed the flag wasn’t always set, or may be incorrectly set, reviewed the class code and sure enough concluded that all was correct. Looking at the use-case of Object we don’t necessarily run through the guts of auto_ptr. After all, it’s a building block; a pattern; an abstraction of a memory block. One would take a quick look, see that an instance of Object is created and stored in an auto_ptr. Again, nothing out of the ordinary.

So why did the code fail?

The answer is on multiple levels. First and foremost we had a shared data that wasn’t reference counted. This is a major failing point. The shared data is a liability because it’s not in the abstraction of object instances. The very same abstraction assumptions that auto_ptr makes; it points to independent memory blocks. What we did is we challenged the assumptions that auto_ptr makes and failed to safe-guard our implicitly-shared data.

In other words, we had two instances of Object at the same time, but the flag we were updating had only two states: true and false. Thereby, it had no way of tracking anything beyond a single piece of information. In our case, we were tracking whether we were online or not. The author of Object made very dangerous assumptions. First and foremost, the assumption that the flag’s state is equivalent to Object’s lifetime proved to be very misleading. Because this raised the question of whether or not more than one instance of Object can exist. That question would have avoided a lot of problems down the road, however it wasn’t obvious and perhaps never occurred to anyone.

Second, even if we assume that there can logically be one instance of Object, without proving that it’s impossible to create second instances by means of language features, we are bound to misuse, as clearly happened here. And we can’t blame the cautious programmer who used auto_ptr either.

If something shouldn’t happen, prevent it by making it impossible to happen.

Solutions

The solutions aren’t that simple. An obvious solution is to take out the flag setting calls from within Object and call them manually. However this defies the point of having them where one couldn’t possibly forget or miss calling them, in case of a bug. Consider the case when we should set the flag to false when Object is destroyed, but this happens due to an exception, which automatically destroys the Object instance. In such a case, we should catch the exception and set the said flag to false. This, of course, is never as straight forward as one would like, especially in complex and mature production code. Indeed, using the automatic guarantees of the language (in this case calling the ctor and dtor automatically) are clearly huge advantages that we can’t afford to ignore.

One possible solution is to prevent the creation of Object more than once at a time. But this can be very problematic. Consider the case when we have multiple component instances, and we are interested in a different Object per component, not a globally unique Object instance.

As I said, no easy solution. The solution that I’d use is the next best thing to instance creation prevention. Namely, to count the number of instances. However, even if we reference count the Objects, or even the calls to setting the flag, in all events, we must redefine the contract. What does it mean to have multiple instance of Object and multiple calls to set the flag to true? Does it mean we still have one responsible object and what guarantees that? What if there are other constraints, might some other code assume only one instance of Object when that flag is set?

All of the questions that flow from our suggested solutions demand us to define, or redefine, the contracts and assumptions of our objects. And whatever solution we agree on, it will have its own set of requirements and perhaps even assumption, if we’re not too careful.

Conclusion

Using design patterns and best practices are without a doubt highly recommended. Yet ironically sometimes they may lead to the most unexpected results. This is no criticism of using such recommendations from experienced specialists and industry leaders, rather, it’s a result of combining abstractions in such a way that not only hides some very fundamental assumptions in our design and/or implementation, but even creates situations where some of the implicit assumptions of our code are challenged. The case presented is a good example. Had the developers not used the ctor/dtor pattern for setting the said flag, or had they not used auto_ptr, no such problem would’ve arisen. Albeit, they would have had other failure points, as already mentioned.

Admittedly, without experience it’s near impossible to catch similar cases simply by reading code or, preferably, while designing. And inexperience has no easy remedy. But if someone figures out a trick, don’t hesitate to contact me.

May 2, 2011
Posted by Ashod Nakashian at 3:34 am
2 Responses
From the Trenches, Programming
Tagged with: C++, Coding, Leaky Abstraction, Pitfall
Font Size:
A A A

Why Code Readability Matters

Mar 012011

Lately my colleagues have been commenting on my relentless notes on code styling in our internal code-reviews. Evidently I’ve been paying more and more attention to what could rightly be called cosmetics. Sure there are bad conventions and, well, better ones. Admittedly, coding conventions are just that, conventions. No matter how much we debate tabs vs. spaces and how much we search for truth, they remain subjective and very much a personal preference. Yet, one can’t deny that there are some better conventions that do in fact improve the overall quality of the code.

But code quality is not what worries me. Not as much as worrying about eye strain and accidental misreadings. Both of which are also factors in the overall code quality, no doubt. It’s not rare that we find ourselves agonizing over a mistake that if it weren’t for haste and misunderstanding, could’ve been completely avoided.

Image via Wikipedia

Developers don’t write code, not just. We spend the majority of our time adding, removing and editing small pieces of code. In fact, most of the time we do small changes in an otherwise large code-base. Rarely do we write new independent pieces of code, unless of course we are starting a project from scratch. Most projects are legacy projects in the sense that they are passed from generations of developers to the next. We add new functions to implement new features, rewrite some and edit some others to meet our needs. All of which means reading a great deal of code to understand which statement is doing what and decide the best change-set to get the output we plan. But even when we work on a completely independent module or class, once we have the smallest chunk of code written, we already need to go back and forth to interweave the remaining code.

Oddly, as I grew in experience I also spent increasingly more time reading code before making additions or changes. This seems reasonable in hindsight. It’s much easier to write new code, much more difficult to try to understand existing code and to find ways to adapt them. Experience tells us that the easiest route is not necessarily the best in the long run. Sure, we might get faster results if we write that case-insensitive string comparison function for the 30th time whenever we need one, or so we might think. But the fact is that chances are that we’ll get a few incorrect results as well, which we might not even notice until it’s too late. With a little effort, we should find a good implementation that we can reuse and learn how to use it. Indeed, that implementation might very well be in our code-base. My example of string comparison is not all that artificial, considering that some languages don’t come with standard string comparison functions, with Unicode, case and locale options. Some languages do have standardized (or at least canonical) functions, but you need to explicitly include the library reference in your projects and that might involve SCM and build server issues. Not fun when the deadline is looming, and when exactly was the last time we didn’t have an imminent deadline? (Notice that I do not mean to imply that DIY coding is a deadly sin. It probably is in most cases, but surely there are good reasons to write your own. Like early optimization, the first two rules are don’t do it. But that’s another topic for another day.)

Another great side-effect to reading code is that one will eventually find bugs and at least report them, if not fix them on the spot (and send for review). So instead of creating new sources of bugs, reuse reduces code duplication and possibly resolves old bugs.

When we anticipate and reasonably-accurately estimate the overhead of possible bugs in our home-grown functions, scattered around the code-base, in comparison to the overhead of learning and incorporating a higher quality library function, then and only then do we stand a chance of making reasonable decisions. Unfortunately, understanding the difficulty of such comparisons and especially grasping the perils of code duplication combined with the Dunning–Kruger effect when it comes to estimating expected bugs in one’s own code, requires, paradoxically, experience to become obvious.

The whole issue of maintaining legacy code vs. writing from scratch is a different battle altogether. However, I will only say that improving on legacy code in any pace is almost always superior to the demolition approach. It’s less risky, you still have working bits and pieces and you have a reference code right on your plate. And here is my point: to avoid rewriting, one must learn to read existing code, and read them well.

To make things more concrete, here are some of my top readability points:

Consistency.
I can read code with virtually any convention, no problem. I’ve grown to learn to adapt. If I think I need to read code, then I better accept that I may very well read code not written in my favorite style. One thing that is nearly impossible to adapt to, however, is when the code looks like it’s written by a 5 year old who just switched from Qwerty keyboard to Dvorak.
Whatever convention you choose or get forced to use, just stick to it. Make sure it’s what you use everywhere in that project.
Consistency.
There is no way to stress how important this is. People experiment with braces-on-next-line and braces-on-same-line and forget that their experiments end up in the source repository for the frustration of everyone else on the team. Seriously, experimentation is fine. Finding your preferred style is perfectly Ok. Just don’t commit them. Not in the Trunk at least.
Space out the code.
Readable code is like readable text. As much as everyone hates reading text-walls, so do developers who have to read sandwiched code. Put empty lines between all scopes. An end-brace is a great place to add that extra line to make it easier to read and find the start of the next scope. Need to comment? Great. Just avoid newspaper-style coding, where the code is turned into a two-column format, code on the left, comments on the right. There are very few cases that one needs to do that. Comments go above the code block or statement, not next to it. Comments are also more readable if they are preceded by a blank line. There is no shortage of disk space. Break long functions, break mega classes and even break long files. Make your functions reasonable in size. Use cyclomatic complexity metrics to know when your code is too complex and hence hard to read and understand. Refactor.
Stop banner-commenting.
If you need to add 5 lines of boxed comments within functions, then it’s a good indication you need to start a new function. The banner can go above the new function. As much as I’d rather not have it there either, it’s much better than having a 1000+ lines in a single function with half a dozen banners. After all, if you need to signal the start of a new logical block, functions do that better. Same goes to breaking long functions, classes and files. (Unlike what the image above may imply, that image is just an image of some code, I do not endorse the style it adopts.)
Don’t use magic numbers.
Constants are a great tool to give names to what otherwise is a seemingly magic number that makes the code work. Default values, container limits, timeouts and user-messages are just some examples of good candidates for constants. Just avoid converting every integer into a constant. That’s not the purpose of constants and it reduces code readability.
Choose identifier names wisely.
There has been no shortage of literature on naming conventions and the perils of choosing names that lack significant differences. Yet, some still insist on using very ambiguous names and reuse them across logical boundaries. The identifiers not only need to be meaningful and unambiguous, but they also need to be consistent across functions, classes, modules and even the complete project. At least an effort to make them consistent should be made. If you need to use acronyms, then make sure you use the same case convention. Have database or network connections, use the same identifier names everywhere. It’s much, much more readable to see an identifier and immediately know what the type is. Incidentally, this was the original purpose of Hungarian notation (original paper). Same goes to function names. ‘Get’ can be used for read-only functions that have no side-effects. Avoid using verbs and nouns in an identifier then implement logic that contradicts the natural meaning of those words. It’s downright inhumane. If a functions claims to get some value, then that function should never modify any other value (with the possible exception of updating caches, if any, which should only boost performance without affecting correctness.) Prefer using proper names in identifiers. Meaningful names, nouns and verbs can be a great help, use them to their limit.
Don’t chain declarations.
Chaining variable declarations is a very common practice in C. This is partially due to the fact that Ansi C didn’t allow declaring variables after any executable statement. Developers had to declare everything up-front and at the top of their functions. This meant that all the variables of a given type were declared in a chain of commas. Declare variables in the innermost scope possible and as close to the usage statement as possible. Avoid recycling variables, unless it’s for the very same purpose. Never re-purpose variables, instead, declare new ones.
A similar practice is sometimes used to chain calls where a single function call which takes one or more arguments is called by first calling other functions and passing their return values as arguments to the first call. These two types of chaining make readability a painful process. Some people chain 3 or 4 function calls within one another, thinking it’s more compact and therefore better, somehow. It’s not. Not when someone needs to scan the code for side-effects and to find a hard bug. A possible exception to this is chaining calls one after another. Some classes are designed to chain calls such that each call returns a reference to the original instance to call other members. This is often done on string and I/O classes. This type of chaining, if not excessively used and too long, can be helpful and improve readability.
Limit line and file lengths.
Scrolling horizontally is a huge pain. No one can be reasonably expected to read anything while scrolling left and right for each line. But some people have larger screens than others. Yet some have weaker eye-sight, so use larger fonts. I’ve seen code apparently aligned to 120 characters per line, with occasional 150+ characters per line, back when 17″ screens were the norm. Some don’t even care or keep track of line length. This is very unfortunate as it makes other people’s lives very hard, especially those with smaller screens, lower resolutions and large fonts (read: bad eye-sight.) Code can’t be word-wrapped like article text (without the help of advanced editors.)
A good starting point is the legacy standard of 80 characters-per-line (cpl). If we can stick to that, great. We should make an effort to minimize the line-length and 80 characters is not too narrow to make things garbled. For teams who can all work with 100 or 120 cpl, the extra line width can give quite a bit flexibility and improve readability. But again be careful because too long a line and readability suffers yet again. Think how newspapers and magazines have narrow columns and how readable they are. Our wide screens with large footprints are like newspapers and moving our heads to read across the line, as opposed to moving just our eyes, is an extra physical strain and certainly a speed bump that we can do without.
Make scopes explicit.
Languages often have implicit scopes. The first statement after a conditional or a loop in C-style languages implicitly fall under the conditional or loop, but not consecutive statements. It’s much harder to parse the end of statements using one’s eyeballs to figure out which statement belongs to the implicit scope. Making these cases explicit makes it much easier to see them without much effort. Adding braces and an empty line after the closing brace, even around single-statement scopes does improve readability quite a bit.
Comment.
Code is very well understood by machines, even when obsecure and mangled. Humans on the other hand need to understand purpose and reason, beyond seeing what a piece of code would do. Without understanding the reasons behind particular constructs, operations and values, no developer can maintain the code in any sensible way. Comments are a great tool to communicate the less obvious aspects of the code. Explain the business needs, implicit assumptions, workarounds, special-requests and the temporary solutions. Talk to your fellow developers through comments. But, as every rule comes with exceptions, avoid over-commenting at all costs. Resist the temptation to declare that an error is about to occure right before throwing an exception. The throw statement is self-commenting. Don’t repeat yourself. Rather, explain why the case is considered unrecoverable at that point, if it’s not obvious. Prefer to write code that speaks for itself. Self-descriptive code is the most readable code.

As a corollary, some code should be made hard to read. You read that right. Code that depends on hacks, bad practices or temporary solutions should be made harder to read. Perhaps my best example is casting. It’s not rare that we need to cast instances of objects to specific types, and while this is a standard practice, it isn’t exactly highly recommended either. Making the casting a little less readable is a good way to force the reader to squint and make sure it’s not a bogus cast. This isn’t unlike italic text which makes reading the text harder and therefore makes it stand out. In the same vein, temporary solutions and hacks should be made to stand out. Adding TODO and FIXME markers are a good start. BUGBUG is another marker for known issues or limitations. Actually, if there is any place for that banner-comment, this would be it. Make known issues stand out like a sore thumb.

As with most things, the above is mostly a guideline, rules of thumb if you will. They are not set in stone and they aren’t complete by any stretch of imagination. There are many obviously important cases left out. Languages, ecosystems, frameworks and libraries come with their own nuances. What’s clear in one language might be very ambiguous in another, and therefore frowned upon. The important thing is to write with the reader in mind, to avoid confusing styles and especially to be mindful of the context and when an explanatory comment is due, and when refactoring is a better choice.

Conclusion

Good developers write code fast and with low bug-rate. Great developers read existing code, reuse and improve them, generating higher quality code with no duplication and reduced bug-rate for both new and existing code. We should write readable code; written for humans and not just for machines. And we should read code and improve, rather than rewriting.

March 1, 2011
Posted by Ashod Nakashian at 7:17 am
30 Responses
Programming
Tagged with: Coding, Coding Conventions
Font Size:
A A A