The story I’m about to tell is the worst case of leaky abstraction that I’ve encountered and had to resolve. Actually, it’s the most profound example that I know of. Profound here is used in a negative sense, of course. This isn’t the performance issues Joel brings as examples to leaky abstraction. Nor is it a case of getting a network exception while working with files, because the folder is now mapped to a network drive. Nor a case of getting out-of-disk-space error while writing to a file-backed memory block. This is a case of frustration and a very hard-to-catch defect. A case of a project that came too close to failing altogether, had we not figured it out in time.

http-Request from browser via webserver and back

Image via Wikipedia

Background

Our flagship project worked with 3rd party servers which for technical reasons had to be installed on the same machine as our product. This was an obvious shortcoming and had to change. The requirements came in and we were asked to move the interfaces with all 3rd party servers into lightweight distributed processes. These new processes were dubbed Child processes, while the central process, well, Central. A Child would run on a different machine on the network, communicating with the Central process to get commands, execute them and return the results.

The design was straightforward: all processes would be running as services listening on some configurable port. The commands were simple application-level message objects, each with its own type, serialized on the wire. The operations were synchronous and we needn’t progress updates or heartbeats. We left the network design synchronous as well, for simplicity.

The developer who was assigned the communication-layer had background in web-services and proposed to use standard HTTP protocol. We thought about it and declared that while it would have some small overhead, the simplicity of reusing some library would be a plus. After all, HTTP has data recovery and is a standard protocol. And if we really cared about overhead, we’d use UDP which has no duplicate detection, data recovery or even ordered packet transmission. Plus, the developer who’d work on this feature was comfortable with HTTP. So why not?

As it turns out, HTTP was the single worst decision made on this particular project.

Since we were now transmitting potentially sensitive data, the requirements were amended to include data encryption to protect our customers’ data from network sniffers. We used a standard asymmetric encryption for extra security. This meant that we had to generate a pair of public and private keys each time we connected. We devised a protocol to communicate the key the Child must have using a symmetric encryption algorithm. We were confident this was secure enough for our needs and it wasn’t overly complicated to implement.

Trouble

The project was complete when I took the product for a final set of developer white-box testing. This is something I learned to do before shipping any product, as I do have the responsibility of designing the features, I also feel responsible for looking under the hood in case there is some potential mechanical or technical issue. Much like your car mechanic would do before you go on an off-road trip.

That’s when things started to fall apart. All worked fine, except every now and then I’d get errors and the Child would disconnect. Inspecting the logs showed data-encryption exceptions. The deciphering function was failing. Every single developer who run the code verified that it worked fine without any problems whatsoever. I asked them to pay attention to this issue. They came back saying all was perfectly fine. It was only on my machine!

Mind you, I learned not to blame before I eliminate every possible culprit. And the prime suspect is, always, a bug. A developer error. So I started sniffing around. Visiting the design back and forth. Nothing made sense of the issue. The code works, almost all the time. Then it fails. Reconnect again, and it works fine… until it fails again.

Troubleshooting this issue wasn’t fun, precisely because it wasn’t fruitful. No amount of debugging helped, or, in fact, could ever help. The puzzle had to be solved by reason. Experimentation showed as much as I had already gathered from the logs. Still, I tried different scenarios. One thing was for sure, you couldn’t tell when it’ll fail next.

The Leak

Remember that HTTP is a connectionless protocol. That is, it’s designed to communicate a single request and its response and disconnect. This is the typical scenario. It holds no connections and no states, therefore, it has no session. On the web, sessions are realized by the HTTP server. An HTTP server would typically create some unique key upon login or first-request missing a key, and it would track all subsequent requests by getting the said key either in the URL or using cookies. In any event, even though a web-service may have support for sessions, the underlying protocol is still connectionless and stateless.

To improve performance, an afterthought of reusing connections was added. This is typically called Keep-Alive. The idea is that a flag is added to the HTTP header which tells the server not to close the connection immediately after responding to the request, anticipating further requests. This is reasonable, as a web-page typically loads multiple images and embedded items from the same server. The client and server supporting Keep-Alive reuse the same connection for several requests, until one of them closes the connection. What is most important in this scenario is that if either party doesn’t respect this hint, nothing would break. In fact, nothing would work any different, except, of course, for the extra connections and disconnections that would occur for each request.

Since the implementor of this feature was a savvy web developer, he always had this flag set. And so, as long as the connection wasn’t interrupted, or indeed, the underlying library we were using didn’t decide to close the connection on whim, all was well and we had no problems. However, when a new request went with a new connection, rather than an existing one, the Child’s server would accept a new socket, on a new port, rather than the previously open socket. This is what was happening on my test environment. Perhaps it was the fact that I was testing across VM images that triggered the disconnections. Anyway, this newly opened socket on the Child has no encryption details associated with it. It’s a brand-new connection. It should expect encryption key exchange. But due to implementation details, the request would have an ‘encrypted’ flag set and the Child wouldn’t mind that we negotiated no cryptographic keys. It’d go ahead and try to decipher the request, only, it couldn’t. Resulting in the logged encryption exception followed by disconnection.

Post Mortem

Once the issue was figured out, the solution was simple, albeit costly. The HTTP abstraction had leaked an ugly property that we had assumed abstracted away. At design time, we couldn’t care what protocol we used to carry our bits. Encryption was added almost as an afterthought. True that encryption does require a state. However looking at our code, the socket-level connection was abstracted by layers and layers of library code. In fact, all we had was a single static function which took a URL string for a request. We had serialized and encoded the request message in base-64 and appended to the URL, which contained the server hostname/ip and port; standard web request, really.

On the communication layer, we had this single URL construction and the request call. On the data layer, we had the encryption, serialization and data manipulation logic. On the application layer, well, there were no network details whatsoever. Most of the previous code which worked locally had remained the same, with the implementation changed to interface the new network layer. So in a sense the code evolved and adapted to its final form and it wasn’t anywhere near apparent that we had leaked a major problem into our code.

In hindsight, we should’ve taken matters into our hands and implement a session-based protocol directly. This would make sense because we’d be in complete control of all network matters. For one, with HTTP we couldn’t change the sockets to use async logic, nor could we change the buffer sizes and timeouts. Perhaps we didn’t need to, but considering the gigabytes/hour we expected to transfer, sooner or later we’d have to optimize and tune the system for performance. But, the developer assigned was inexperienced and we couldn’t afford the time overhead. Personally, I feared things would get too complicated for the assigned developer to handle. I let him pick the protocol he was most comfortable with. And that’s the real danger of leaky abstraction; everyone is tricked, including the experienced.

Indeed, we ended up rewriting the communication layer. First the HTTP code was replaced with plain sockets using TCP/IP. Next, sessions were added, such that disconnections were recoverable. That is, the data layer didn’t care whether communication was interrupted or not. We weren’t going to rely on the fact that we controlled the sockets. Disconnections were made irrelevant by design. And finally, our protocol required a strict sequence of initialization and handshake that insured correct state. Once the code was working as expected, we changed the sockets to use async interface for maximum performance.

Overall, we spent an extra 2 man/months and, as a result, the data+communication layer was sped up several times over. Still, this was one hell of a case of leaky abstraction.

Further Reading:


Update:
Many are asking why not use SSL? The answer is because HTTP was the wrong choice in the first place.
We weren’t building a web-service. This was a backend server communicating with the main product. We didn’t want to limit the protocol, features or extendability by the choice of communication details. SSL would resolve the encryption issue, but we’d have had to implement an HTTPS server. In addition, whatever application protocol we would eventually realize, it had to be connectionless. The encryption layer simply uncovered this design implication that we had overlooked, hence the leak in the communication protocol abstraction. At that point we didn’t have any control messages nor did we have requests that needed states, later we added both. In fact, we added a command to iterate over a large set of data, returning one item at a time. HTTP/S would’ve made this harder to implement, as the state would have had to be sent with each request. Control messages and heartbeats would’ve been pointless.
In short, HTTP gave us very little and the connectionless nature caused as a lot of trouble. We got rid of both. We exchanged a wrong solution with the right one, hardly reinventing the wheel, if you ask me.

Share
 

Data sharing between threads is a tricky business. Anyone with any kind of experience with multi-threaded code will give you a 1001 synonyms for “tricky,” most of which you probably wouldn’t use in front of your parents. The problem I’m about the present, however, has zero to do with threading and everything with data sharing and leaky abstraction.

This is a pattern that is used very often when one object is used symmetrically at the beginning and end of another’s lifetime. That is, suppose we have a class that needs to get notified when a certain other class is created, and then again when it’s destroyed. One way to achieve this, is to simply set a flag once to true and a second time to false, in the constructor and destructor of the second object, respectively.

This particular example is in C++ but that’s just to illustrate the pattern.

class Object
{
public:

Object(SomeComponent& comp) : m_component(comp)
{
    m_component.setOnline(true); // We’re online.
}

~Object()
{
    m_component.setOnline(false); // Offline.
}
};

This looks fool-proof, as there is no way the flag will not get set, so long that Object is created and destroyed as intended. Typically, our code will be used as follows:

Object* pObject = new Object(component);
// component knows we are online and processing...

delete pObject; // Go offline and cleanup.

Now let’s see how someone might use this class…

// use smart pointer to avoid memory leaks...
std::auto_ptr<object> apObject;

// Recreate a new object...
apObject.reset(new Object(component));

See a problem? The code fails miserably! And it’s not even obvious. Why? Because there are implicit assumptions and a leaky abstraction at work. Let’s dice the last line…

Object* temp_object = new Object(component); // create new Object
  Object::Object();
    component.setOnline(true);  // was already true!
delete apObject.ptr; // new instance passed to auto_ptr
  Object::~Object(); // old instance deleted
    component.setOnline(false); // OUCH!
apObject.ptr = temp_object;

See what happened?

Both authors wrote pretty straightforward code. They couldn’t have done better without making assumptions beyond the scope of their work. This is a pattern that is very easy to run into, and it’s far from fun. Consider how one could have detected the problem in the first place. It’s not obvious. The flag was set correctly, but sometimes would fail! That is, whenever there is an Object instance, and we create another one to replace the first, the flag ends up being false. The first time we create an Object, all works fine. The second time, component seems to be unaware of us setting the flag to true.

Someone noticed the failure, assumed the flag wasn’t always set, or may be incorrectly set, reviewed the class code and sure enough concluded that all was correct. Looking at the use-case of Object we don’t necessarily run through the guts of auto_ptr. After all, it’s a building block; a pattern; an abstraction of a memory block. One would take a quick look, see that an instance of Object is created and stored in an auto_ptr. Again, nothing out of the ordinary.

So why did the code fail?

The answer is on multiple levels. First and foremost we had a shared data that wasn’t reference counted. This is a major failing point. The shared data is a liability because it’s not in the abstraction of object instances. The very same abstraction assumptions that auto_ptr makes; it points to independent memory blocks. What we did is we challenged the assumptions that auto_ptr makes and failed to safe-guard our implicitly-shared data.

In other words, we had two instances of Object at the same time, but the flag we were updating had only two states: true and false. Thereby, it had no way of tracking anything beyond a single piece of information. In our case, we were tracking whether we were online or not. The author of Object made very dangerous assumptions. First and foremost, the assumption that the flag’s state is equivalent to Object’s lifetime proved to be very misleading. Because this raised the question of whether or not more than one instance of Object can exist. That question would have avoided a lot of problems down the road, however it wasn’t obvious and perhaps never occurred to anyone.

Second, even if we assume that there can logically be one instance of Object, without proving that it’s impossible to create second instances by means of language features, we are bound to misuse, as clearly happened here. And we can’t blame the cautious programmer who used auto_ptr either.

If something shouldn’t happen, prevent it by making it impossible to happen.

Solutions

The solutions aren’t that simple. An obvious solution is to take out the flag setting calls from within Object and call them manually. However this defies the point of having them where one couldn’t possibly forget or miss calling them, in case of a bug. Consider the case when we should set the flag to false when Object is destroyed, but this happens due to an exception, which automatically destroys the Object instance. In such a case, we should catch the exception and set the said flag to false. This, of course, is never as straight forward as one would like, especially in complex and mature production code. Indeed, using the automatic guarantees of the language (in this case calling the ctor and dtor automatically) are clearly huge advantages that we can’t afford to ignore.

One possible solution is to prevent the creation of Object more than once at a time. But this can be very problematic. Consider the case when we have multiple component instances, and we are interested in a different Object per component, not a globally unique Object instance.

As I said, no easy solution. The solution that I’d use is the next best thing to instance creation prevention. Namely, to count the number of instances. However, even if we reference count the Objects, or even the calls to setting the flag, in all events, we must redefine the contract. What does it mean to have multiple instance of Object and multiple calls to set the flag to true? Does it mean we still have one responsible object and what guarantees that? What if there are other constraints, might some other code assume only one instance of Object when that flag is set?

All of the questions that flow from our suggested solutions demand us to define, or redefine, the contracts and assumptions of our objects. And whatever solution we agree on, it will have its own set of requirements and perhaps even assumption, if we’re not too careful.

Conclusion

Using design patterns and best practices are without a doubt highly recommended. Yet ironically sometimes they may lead to the most unexpected results. This is no criticism of using such recommendations from experienced specialists and industry leaders, rather, it’s a result of combining abstractions in such a way that not only hides some very fundamental assumptions in our design and/or implementation, but even creates situations where some of the implicit assumptions of our code are challenged. The case presented is a good example. Had the developers not used the ctor/dtor pattern for setting the said flag, or had they not used auto_ptr, no such problem would’ve arisen. Albeit, they would have had other failure points, as already mentioned.

Admittedly, without experience it’s near impossible to catch similar cases simply by reading code or, preferably, while designing. And inexperience has no easy remedy. But if someone figures out a trick, don’t hesitate to contact me.

Share
QR Code Business Card