May 30, 2011
 

The story I’m about to tell is the worst case of leaky abstraction that I’ve encountered and had to resolve. Actually, it’s the most profound example that I know of; profound in a negative sense, of course. This isn’t one of the performance issues Joel Spolsky cites as examples of leaky abstractions. Nor is it a case of getting a network exception while working with files because the folder is now mapped to a network drive, nor of getting an out-of-disk-space error while writing to a file-backed memory block. This is a case of frustration and a very hard-to-catch defect; a case of a project that came too close to failing altogether, had we not figured it out in time.

[Image via Wikipedia: an HTTP request from browser via webserver and back]

Background

Our flagship project worked with 3rd party servers which for technical reasons had to be installed on the same machine as our product. This was an obvious shortcoming and had to change. The requirements came in and we were asked to move the interfaces with all 3rd party servers into lightweight distributed processes. These new processes were dubbed Child processes, while the central process, well, Central. A Child would run on a different machine on the network, communicating with the Central process to get commands, execute them and return the results.

The design was straightforward: all processes would run as services listening on a configurable port. The commands were simple application-level message objects, each with its own type, serialized on the wire. The operations were synchronous, and we needed neither progress updates nor heartbeats. We left the network design synchronous as well, for simplicity.

The developer assigned to the communication layer had a background in web services and proposed using the standard HTTP protocol. We thought about it and declared that while it would carry some small overhead, the simplicity of reusing an existing library would be a plus. After all, HTTP has data recovery and is a standard protocol. And if we really cared about overhead, we’d use UDP, which has no duplicate detection, data recovery or even ordered packet transmission. Plus, the developer who’d work on this feature was comfortable with HTTP. So why not?

As it turns out, HTTP was the single worst decision made on this particular project.

Since we were now transmitting potentially sensitive data, the requirements were amended to include data encryption, to protect our customers’ data from network sniffers. We used standard asymmetric encryption for extra security, which meant we had to generate a pair of public and private keys each time we connected. We devised a protocol to communicate the key the Child needed for the symmetric encryption algorithm. We were confident this was secure enough for our needs, and it wasn’t overly complicated to implement.

Trouble

The project was complete when I took the product through a final round of developer white-box testing. This is something I learned to do before shipping any product: since I’m responsible for designing the features, I also feel responsible for looking under the hood in case there’s some potential mechanical or technical issue, much like your car mechanic would before you go on an off-road trip.

That’s when things started to fall apart. All worked fine, except every now and then I’d get errors and the Child would disconnect. Inspecting the logs showed data-encryption exceptions: the deciphering function was failing. Every single developer who ran the code verified that it worked fine without any problems whatsoever. I asked them to pay attention to this issue. They came back saying all was perfectly fine. It failed only on my machine!

Mind you, I’ve learned not to assign blame before eliminating every possible culprit. And the prime suspect is, always, a bug: a developer error. So I started sniffing around, going over the design back and forth. Nothing explained the issue. The code worked almost all the time. Then it failed. Reconnect, and it worked fine… until it failed again.

Troubleshooting this issue wasn’t fun, precisely because it wasn’t fruitful. No amount of debugging helped, or, in fact, could ever help; the puzzle had to be solved by reason. Experimentation showed only as much as I had already gathered from the logs. Still, I tried different scenarios. One thing was for sure: you couldn’t tell when it would fail next.

The Leak

Remember that HTTP is a connectionless protocol. That is, it’s designed to communicate a single request and its response, then disconnect. This is the typical scenario. It holds no connection and no state; therefore, it has no session. On the web, sessions are realized by the HTTP server. An HTTP server will typically create some unique key upon login, or upon the first request missing a key, and track all subsequent requests by receiving said key either in the URL or in a cookie. In any event, even though a web service may support sessions, the underlying protocol is still connectionless and stateless.

To improve performance, connection reuse was added as an afterthought, typically called Keep-Alive. The idea is that a flag added to the HTTP header tells the server not to close the connection immediately after responding to the request, in anticipation of further requests. This is reasonable, as a web page typically loads multiple images and embedded items from the same server. A client and server that support Keep-Alive reuse the same connection for several requests, until one of them closes it. What is most important in this scenario is that if either party doesn’t respect the hint, nothing breaks. In fact, nothing works any differently, except, of course, for the extra connections and disconnections that occur for each request.
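The hint itself is just a single header line. An HTTP/1.0-style exchange might look like the following (host, port, path and payload are invented for illustration):

```http
GET /command?msg=BASE64DATA HTTP/1.0
Host: child-host:8080
Connection: Keep-Alive

HTTP/1.0 200 OK
Connection: Keep-Alive
Content-Length: 42
```

If either side ignores the header and closes the socket, the next request simply rides on a fresh connection; with plain HTTP nothing is lost.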

Since the implementor of this feature was a savvy web developer, he always had this flag set. And so, as long as the connection wasn’t interrupted, and the underlying library we were using didn’t decide to close the connection on a whim, all was well and we had no problems. However, when a new request went out on a new connection rather than an existing one, the Child’s server would accept a new socket, on a new port, rather than use the previously open socket. This is what was happening in my test environment; perhaps it was the fact that I was testing across VM images that triggered the disconnections. This newly opened socket on the Child had no encryption details associated with it. It was a brand-new connection, and the Child should have expected a key exchange on it. But due to implementation details, the request would arrive with its ‘encrypted’ flag set, and the Child wouldn’t mind that we had negotiated no cryptographic keys. It’d go ahead and try to decipher the request; only, it couldn’t, resulting in the logged encryption exception followed by disconnection.
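To make the failure mode concrete, here is a hypothetical sketch of the Child’s request handling. The names and structure are invented for illustration, not taken from the actual codebase; the point is that per-socket encryption state simply doesn’t exist on a fresh socket, yet the request still claims to be encrypted:

```cpp
#include <map>
#include <stdexcept>
#include <string>

struct Session {
    std::string key; // empty until the key exchange has run on this socket
};

// One session per accepted socket, keyed by its descriptor.
std::map<int, Session> g_sessions;

std::string handleRequest(int socket, const std::string& payload, bool encrypted)
{
    // A brand-new socket silently gets a default, keyless session here.
    Session& session = g_sessions[socket];
    if (encrypted) {
        if (session.key.empty())
            throw std::runtime_error("decipher failed: no key negotiated");
        // ... decrypt payload with session.key ...
    }
    return payload;
}
```

When Keep-Alive holds, every request arrives on the socket that did the key exchange and all is well; the moment the library opens a new connection, the lookup yields a keyless session and the decipher step blows up.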

Post Mortem

Once the issue was figured out, the solution was simple, albeit costly. The HTTP abstraction had leaked an ugly property that we had assumed was abstracted away. At design time, we couldn’t have cared less what protocol carried our bits. Encryption was added almost as an afterthought. True, encryption does require state. However, looking at our code, the socket-level connection was abstracted away by layers and layers of library code. In fact, all we had was a single static function which took a URL string for a request. We had serialized the request message, encoded it in base-64 and appended it to the URL, which contained the server hostname/IP and port; a standard web request, really.

On the communication layer, we had this single URL construction and the request call. On the data layer, we had the encryption, serialization and data-manipulation logic. On the application layer, well, there were no network details whatsoever. Most of the previous code, which had worked locally, remained the same, with the implementation changed to interface with the new network layer. So in a sense the code evolved and adapted to its final form, and it was nowhere near apparent that we had leaked a major problem into our code.

In hindsight, we should’ve taken matters into our own hands and implemented a session-based protocol directly. This would have made sense because we’d have been in complete control of all network matters. For one, with HTTP we couldn’t change the sockets to use async logic, nor could we change the buffer sizes and timeouts. Perhaps we didn’t need to, but considering the gigabytes/hour we expected to transfer, sooner or later we’d have to optimize and tune the system for performance. But the developer assigned was inexperienced, and we couldn’t afford the time overhead. Personally, I feared things would get too complicated for him to handle, so I let him pick the protocol he was most comfortable with. And that’s the real danger of leaky abstraction: everyone is tricked, including the experienced.

Indeed, we ended up rewriting the communication layer. First, the HTTP code was replaced with plain TCP/IP sockets. Next, sessions were added, such that disconnections were recoverable; that is, the data layer didn’t care whether communication was interrupted or not. We weren’t going to rely on the fact that we controlled the sockets: disconnections were made irrelevant by design. And finally, our protocol required a strict sequence of initialization and handshake that ensured correct state. Once the code was working as expected, we changed the sockets to use an async interface for maximum performance.
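The strict handshake can be pictured as a small per-connection state machine. This is an illustrative sketch with invented names, not our actual protocol code: a connection must walk through the handshake steps in order before any data command is accepted.

```cpp
// Every connection starts in AwaitingHello and must advance in strict order:
// Hello -> KeyExchange -> Ready. Any out-of-order message is rejected.
enum class SessionState { AwaitingHello, AwaitingKeyExchange, Ready };

class ChildSession {
public:
    bool onHello()       { return advance(SessionState::AwaitingHello, SessionState::AwaitingKeyExchange); }
    bool onKeyExchange() { return advance(SessionState::AwaitingKeyExchange, SessionState::Ready); }
    // Data commands are only honored once the handshake has completed.
    bool onData() const  { return m_state == SessionState::Ready; }

private:
    bool advance(SessionState expected, SessionState next) {
        if (m_state != expected)
            return false; // protocol violation: message arrived out of order
        m_state = next;
        return true;
    }
    SessionState m_state = SessionState::AwaitingHello;
};
```

With this in place, a fresh socket claiming to carry encrypted data is impossible by construction: it simply hasn’t reached the Ready state yet.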

Overall, we spent an extra two man-months and, as a result, the data and communication layers were sped up several times over. Still, this was one hell of a case of leaky abstraction.

Update:
Many are asking: why not use SSL? The answer is that HTTP was the wrong choice in the first place.
We weren’t building a web service. This was a backend server communicating with the main product. We didn’t want to limit the protocol, features or extensibility by the choice of communication details. SSL would have resolved the encryption issue, but we’d have had to implement an HTTPS server. In addition, whatever application protocol we eventually realized would have had to be connectionless. The encryption layer simply uncovered this design implication that we had overlooked, hence the leak in the communication-protocol abstraction. At that point we didn’t have any control messages, nor did we have requests that needed state; later we added both. In fact, we added a command to iterate over a large set of data, returning one item at a time. HTTP/S would’ve made this harder to implement, as the state would have had to be sent with each request. Control messages and heartbeats would’ve been pointless.
In short, HTTP gave us very little, and its connectionless nature caused us a lot of trouble. We got rid of both. We exchanged a wrong solution for the right one; hardly reinventing the wheel, if you ask me.

May 29, 2011
 

I was looking for a good book that made a good case for the theistic beliefs without being preachy. That is, a book that introduced me to the arguments upon which the world religions build their theologies. The three world religions I speak of are the Abrahamic religions. Abraham, a prophet recognized by the world religions, is considered the first man to have had the honor of being spoken to directly by God. The oldest manuscripts to recognize Abraham’s status and record the encounter can be found in the Bible, or, as the Christians call it, the Old Testament. As such, the Jewish theology seemed to be the most reasonable source to contain the foundations I was looking forward to studying. This is one of the books suggested.

Cover of "God According to God: A Physici...

Cover via Amazon

Gerald Schroeder‘s book is subtitled “A Physicist Proves We’ve Been Wrong About God All Along.” I get it, he has a degree in physics. What does that have to do with anything? Is that not an attempt at appealing to authority? Should we trust his views, before even reading a single line, just because he has a degree in physics? Or does it show that he knows what he’s talking about any better?

God According to God is well written. The author is clearly not only a good writer but also well-versed in all the topics he touches upon. Schroeder frequently admits the obvious counter-arguments to the points he makes. Take chapter 3, “The Unlikely Planet Earth,” where, using Drake’s equation, he calculates the number of Earth-like planets in the visible universe. At the end of the chapter he concludes:

The estimated number of stars in the entire visible universe is in the order of 10²². This indicates that in the entire universe there may be approximately 10⁴, or 10,000, earthlike planets circling a sunlike star. These 10,000 potentially earthlike planets would be distributed among the 10¹¹, or 100,000,000,000, galaxies in the entire visible universe. That comes out to be one earthlike planet for each 10,000,000 galaxies. The probability that any one galaxy would have more than one life-bearing stellar system is slim indeed.

To be honest, at this point I had already read three chapters and was a bit surprised that his conclusion wasn’t that Earth was by far the only possible host of life. Part of the reason for this expectation is his obvious bias toward demonstrating how unique and rare life on Earth is. Although his assumptions are a bit conservative (for example, he doesn’t consider the possibility of life on moons orbiting large planets, such as Titan), his conclusion is spot on. For what it’s worth, I thought he wasted a good number of pages in this chapter, as the conclusion, if anything, convinced me that Earth is just a fluke, with possibly 10,000 more sprinkled around. What is so special about that escapes me.

The book can be divided into two logical domains, Physics and Theology, though of course they don’t share an equal number of pages. The division is so stark that one might think the respective chapters were written by completely different authors. As a matter of fact, there are contradictions between them. In chapter 2, “The Origins of Life,” he writes:

Our cosmic genesis began billions of years ago in our perspective of time, first as beams of energy, then as the heavier elements fashioned within stars and supernovae from the primordial hydrogen and helium, next as stardust remnants expelled in the bursts of supernovae, and finally reaching home as rocks and water and a few simple molecules that became alive on the once molten earth.

Later, in chapter 4 “Nature Rebels”:

In the Garden of Eden, 2,448 years prior to this revelation at Sinai, Adam and Eve were confronted with the identical options.

This caused me so much cognitive dissonance that I went back to find the section where the cosmic origin, what he calls the “Big Bang Creation,” is described. This physicist apparently holds the belief that our planet has billions of years behind it, yet he maintains that Adam and Eve were in the Garden of Eden exactly 2,448 years before the revelation at Sinai! Considering the era when the Garden of Eden encounters supposedly occurred and the lack of numbers in any biblical or other sources, the above figure is extremely precise. Not only does that go unexplained, but Schroeder assumes the reader has already accepted the Garden of Eden events as told in the Bible. In fact, that is my main point here: the author assumes the reader is a believer, well-acquainted with the theology, and he’s basically giving it scientific backing and, as becomes apparent in later chapters, throwing in his own interpretation and understanding of the nature of God.

Perhaps the title might have given a clue or two as to the author’s conviction regarding his understanding of God’s nature and plan. There are perhaps fewer color hues in a rainbow than there are interpretations and explanations of God’s nature, plan and instructions to the human race. The author of God According to God adds yet another, and it’s not a conventional one; at least it isn’t to me.

In chapter 6 “Arguing with God”:

The sequence of events at and following the binding give compelling force to the supposition that the God of the Bible not only wants a dialogue with us humans, but even more than that. God expects such, and if the situation seems unjust or unjustified, then, beyond a dialogue, God wants us to argue. If our case is strong enough, God will even “give in,” or at least modify the Divine directive. Moses seems to have understood this trait of the Divine.

A few pages down:

Argument seems to be the standard and the expected biblical operating procedure in our encounters with the Divine. The surprise is that, having designed and created our universe with all its magnificence and granted us the freedom of choice, God wants us, expects us, to interact with the Divine about how to run the universe.

In the next chapter “In Defense of God”:

As I read the events of the Bible, in human terms I see God in a sort of emotional bind. God desperately wants us to choose life, a dynamic, purposeful existence, but doesn’t want to force us along that line. Hence we are granted the liberating tzimtzum of creation. God has to hold back and let us try. When we really mess up, God steps in. It’s so human. Mom teaches junior to play chess. Looking over his shoulder as her son makes his moves on the board, she sees a trap developing. He is about to lose his queen. If she wants her kid to learn to think ahead, to envision the distant outcome of the initial move before that move is made, she will do well to keep her hands in her pockets and let him make the error or at most give a few very general suggestions, as God through the Bible gives to us. It’s frustrating, even painful, but it is part of the learning process, Divine as well as human.

The above quotes are not the only passages that made me stop reading and pause… for a while. It might be that I had expected run-of-the-mill explanations and arguments. Instead, I found radically new concepts, ideas I hadn’t encountered before. I can see that some of these ideas could be called heretical. If we make a strong case arguing with God, “God will even ‘give in,’” and “[…] God wants us, expects us, to interact with the Divine about how to run the universe.” And apparently, there is a “Divine as well as human” learning process!

Whatever your stance on God and religion, God According to God isn’t a rehash of age-old arguments. Nor is it the typical “science proves the existence of God” kind of book. Gerald Schroeder is very well read in ancient Jewish texts. His Hebrew skills are of translator caliber. His science is, as far as I can tell, solid. Overall, I learned quite a bit from the historical writings and the ancient Jewish theology blended in with the science, and from the idea of God striving to learn as we go. It’s just that I didn’t get what I paid for.

May 2, 2011
 

Data sharing between threads is a tricky business. Anyone with any kind of experience with multi-threaded code will give you 1001 synonyms for “tricky,” most of which you probably wouldn’t use in front of your parents. The problem I’m about to present, however, has nothing to do with threading and everything to do with data sharing and leaky abstraction.

This pattern comes up often when one object is used symmetrically at the beginning and end of another’s lifetime. That is, suppose we have a class that needs to be notified when a certain other class is created, and again when it’s destroyed. One way to achieve this is to simply set a flag, once to true and a second time to false, in the constructor and destructor of the second object, respectively.

This particular example is in C++ but that’s just to illustrate the pattern.

class Object
{
public:

Object(SomeComponent& comp) : m_component(comp)
{
    m_component.setOnline(true); // We’re online.
}

~Object()
{
    m_component.setOnline(false); // Offline.
}

private:
    SomeComponent& m_component; // the shared component we notify
};

This looks fool-proof: there is no way the flag will not get set, so long as Object is created and destroyed as intended. Typically, our code will be used as follows:

Object* pObject = new Object(component);
// component knows we are online and processing...

delete pObject; // Go offline and cleanup.

Now let’s see how someone might use this class…

// use smart pointer to avoid memory leaks...
std::auto_ptr<Object> apObject;

// Recreate a new object...
apObject.reset(new Object(component));

See a problem? The code fails miserably! And it’s not even obvious why: there are implicit assumptions and a leaky abstraction at work. Let’s dissect that last line…

Object* temp_object = new Object(component); // create the new Object first
  Object::Object();
    component.setOnline(true);   // was already true!
delete apObject.ptr;             // reset() now deletes the old instance
  Object::~Object();
    component.setOnline(false);  // OUCH! the new Object is now "offline"
apObject.ptr = temp_object;      // finally, store the new instance

See what happened?

Both authors wrote pretty straightforward code. They couldn’t have done better without making assumptions beyond the scope of their work. This is a pattern that is very easy to run into, and it’s far from fun. Consider how one could have detected the problem in the first place. It’s not obvious. The flag was set correctly, but sometimes things would fail! That is, whenever there is an Object instance and we create another one to replace it, the flag ends up false. The first time we create an Object, all works fine. The second time, component seems to be unaware of us setting the flag to true.

Someone noticed the failure, assumed the flag wasn’t always set, or was incorrectly set, reviewed the class code and, sure enough, concluded that all was correct. Looking at the use-case of Object, we don’t necessarily run through the guts of auto_ptr. After all, it’s a building block; a pattern; an abstraction of a memory block. One would take a quick look, see that an instance of Object is created and stored in an auto_ptr. Again, nothing out of the ordinary.

So why did the code fail?

The answer is on multiple levels. First and foremost, we had shared data that wasn’t reference counted. This is the major failing point. The shared data is a liability because it falls outside the abstraction of independent object instances, the very abstraction auto_ptr assumes when it treats its pointee as an independent memory block. We challenged the assumptions that auto_ptr makes and failed to safeguard our implicitly shared data.

In other words, we had two instances of Object alive at the same time, but the flag we were updating had only two states: true and false. Thus it had no way of tracking anything beyond a single piece of information; in our case, whether we were online or not. The author of Object made very dangerous assumptions. First and foremost, the assumption that the flag’s state is equivalent to Object’s lifetime proved very misleading, because it hid the question of whether or not more than one instance of Object may exist. Asking that question would have avoided a lot of problems down the road; however, it wasn’t obvious and perhaps never occurred to anyone.

Second, even if we assume that logically there can be only one instance of Object, without making it impossible to create a second instance by means of language features, we are bound to see misuse, as clearly happened here. And we can’t blame the cautious programmer who used auto_ptr either.

If something shouldn’t happen, prevent it by making it impossible to happen.

Solutions

The solutions aren’t that simple. An obvious one is to take the flag-setting calls out of Object and make them manually. However, this defeats the point of having them where one couldn’t possibly forget or miss calling them. Consider the case where we should set the flag to false when Object is destroyed, but the destruction happens due to an exception, which automatically destroys the Object instance. In such a case, we’d have to catch the exception and set the flag to false ourselves. This, of course, is never as straightforward as one would like, especially in complex and mature production code. Indeed, the automatic guarantees of the language (in this case, calling the ctor and dtor automatically) are clearly huge advantages that we can’t afford to ignore.

One possible solution is to prevent the creation of more than one Object at a time. But this can be very problematic. Consider the case where we have multiple component instances and are interested in a different Object per component, not a globally unique Object instance.

As I said, no easy solution. The solution I’d use is the next best thing to preventing instance creation: counting the number of instances. However, even if we reference count the Objects, or the calls that set the flag, in any event we must redefine the contract. What does it mean to have multiple instances of Object and multiple calls setting the flag to true? Does it mean we still have one responsible object, and what guarantees that? What if there are other constraints; might some other code assume only one instance of Object exists when that flag is set?
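A minimal sketch of the counting approach, reworking the classes from this post (SomeComponent’s interface is invented for illustration): the component flips its online flag only on the 0→1 and 1→0 transitions, so overlapping Object lifetimes no longer knock it offline.

```cpp
// The component counts how many live Objects reference it and toggles the
// online flag only when the count crosses zero in either direction.
class SomeComponent {
public:
    void addRef()  { if (++m_refs == 1) setOnline(true); }
    void release() { if (--m_refs == 0) setOnline(false); }
    bool isOnline() const { return m_online; }

private:
    void setOnline(bool online) { m_online = online; }
    int  m_refs = 0;
    bool m_online = false;
};

class Object {
public:
    explicit Object(SomeComponent& comp) : m_component(comp) { m_component.addRef(); }
    ~Object() { m_component.release(); }

private:
    SomeComponent& m_component;
};
```

With this, the auto_ptr reset sequence (new instance constructed before the old one is destroyed) moves the count 1→2→1 and the component stays online throughout; but note this sketch settles only the flag, not the larger contract questions above.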

All of the questions that flow from our suggested solutions demand that we define, or redefine, the contracts and assumptions of our objects. And whatever solution we agree on will have its own set of requirements, and perhaps even assumptions, if we’re not careful.

Conclusion

Using design patterns and best practices is without a doubt highly recommended, yet ironically it may sometimes lead to the most unexpected results. This is no criticism of taking such recommendations from experienced specialists and industry leaders; rather, it’s a result of combining abstractions in a way that not only hides some very fundamental assumptions in our design and/or implementation, but even creates situations where some of the implicit assumptions of our code are challenged. The case presented is a good example. Had the developers not used the ctor/dtor pattern for setting the flag, or had they not used auto_ptr, no such problem would’ve arisen; albeit they would have had other failure points, as already mentioned.

Admittedly, without experience it’s near impossible to catch similar cases simply by reading code or, preferably, while designing. And inexperience has no easy remedy. But if someone figures out a trick, don’t hesitate to contact me.
