Reading Cowboy Coding and the Pink Sombrero I’m reminded of the two occasions when I deserved that Pink Sombrero, or at least a sombrero. Since I’ll have done cowboy coding exactly twice before retiring, allow me to share the experience with you.

Bacardi mojito served in Bacardi branded glass.

Image via Wikipedia

This happened a handful of years ago. I was working for a prominent player in the electronic trading market. Our internal process dictated design review by an architect, code review by at least two developers, including one from a different team, commit, SCM build signed-off to QA, QA sign-off to Operations after testing, Ops performs pre-market testing with real currency before final deployment. Our customers are bankers and traders gambling millions on commodities around the globe, so there is a reason for such a strict process.

During a relaxed Sunday evening I got a call from Frankfurt Operations team. Something about the deployment tests failing. I was on standby for this deployment as a technical support for the Ops. The market opens Monday and there was a pre-market test that we couldn’t miss.

I’m in Yerevan. The developers responsible for the deployed component are in Mountain Time, in deep sleep. The European market-open time is in between. Immediately I was on my office workstation via remote desktop talking with a sleepy manager, VP of Software and a couple of guys from Ops. I started reading logs and code. I was promised at least one of the original coders to review my fix. After a marathon of about 7 hours we had a fix. The other guy reviewed it and, to my surprise, the VP of Software approved hotfixing my local build on the production farm.

Obviously this isn’t like a CSS fix on a live web-site. Leaving the code size and complexity aside, the module in question typically processes thousands of orders a day, each worth a few grands, some tens. Mistakes here cost a tad more than page views. Luckily, all went better then expected. The next day I went to the office and I felt like a walking million dollars. Where is that pink sombrero when you deserve it?

A couple of months went by, when again I got a similar call. This time some commodity price was missing or incorrect. The fix wasn’t very difficult. We had to find the correct price, write a small DB update script to patch the Price table and test. Only thing was that there wasn’t nearly as much time as last time. Specifically, the pre-market test had come to a close and we had to fix the issue before market-open and go live without testing.

Fortunately, I found the price of the commodity in question on a test server. Wrote the SQL to update the DB. Without blinking, I got green light to patch the production servers by the VP of Software. Still no sombrero! All went as planned, except, the price was wrong!

Early the next day at work I was already being asked by my manager and her supervisor about the fix. “Who approved it?!?” they asked impatiently and in the most irritating tone, before telling me how it blew up. Turns out the test server had old data. Everybody and their dog were in feisty mode.

And that was the last time I went cowboy coding, sans sombrero. Lesson learned, and I don’t care who approves cowboy coding… although I wish I had a Mojito while at it.

Share
 

The story I’m about to tell is the worst case of leaky abstraction that I’ve encountered and had to resolve. Actually, it’s the most profound example that I know of. Profound here is used in a negative sense, of course. This isn’t the performance issues Joel brings as examples to leaky abstraction. Nor is it a case of getting a network exception while working with files, because the folder is now mapped to a network drive. Nor a case of getting out-of-disk-space error while writing to a file-backed memory block. This is a case of frustration and a very hard-to-catch defect. A case of a project that came too close to failing altogether, had we not figured it out in time.

http-Request from browser via webserver and back

Image via Wikipedia

Background

Our flagship project worked with 3rd party servers which for technical reasons had to be installed on the same machine as our product. This was an obvious shortcoming and had to change. The requirements came in and we were asked to move the interfaces with all 3rd party servers into lightweight distributed processes. These new processes were dubbed Child processes, while the central process, well, Central. A Child would run on a different machine on the network, communicating with the Central process to get commands, execute them and return the results.

The design was straightforward: all processes would be running as services listening on some configurable port. The commands were simple application-level message objects, each with its own type, serialized on the wire. The operations were synchronous and we needn’t progress updates or heartbeats. We left the network design synchronous as well, for simplicity.

The developer who was assigned the communication-layer had background in web-services and proposed to use standard HTTP protocol. We thought about it and declared that while it would have some small overhead, the simplicity of reusing some library would be a plus. After all, HTTP has data recovery and is a standard protocol. And if we really cared about overhead, we’d use UDP which has no duplicate detection, data recovery or even ordered packet transmission. Plus, the developer who’d work on this feature was comfortable with HTTP. So why not?

As it turns out, HTTP was the single worst decision made on this particular project.

Since we were now transmitting potentially sensitive data, the requirements were amended to include data encryption to protect our customers’ data from network sniffers. We used a standard asymmetric encryption for extra security. This meant that we had to generate a pair of public and private keys each time we connected. We devised a protocol to communicate the key the Child must have using a symmetric encryption algorithm. We were confident this was secure enough for our needs and it wasn’t overly complicated to implement.

Trouble

The project was complete when I took the product for a final set of developer white-box testing. This is something I learned to do before shipping any product, as I do have the responsibility of designing the features, I also feel responsible for looking under the hood in case there is some potential mechanical or technical issue. Much like your car mechanic would do before you go on an off-road trip.

That’s when things started to fall apart. All worked fine, except every now and then I’d get errors and the Child would disconnect. Inspecting the logs showed data-encryption exceptions. The deciphering function was failing. Every single developer who run the code verified that it worked fine without any problems whatsoever. I asked them to pay attention to this issue. They came back saying all was perfectly fine. It was only on my machine!

Mind you, I learned not to blame before I eliminate every possible culprit. And the prime suspect is, always, a bug. A developer error. So I started sniffing around. Visiting the design back and forth. Nothing made sense of the issue. The code works, almost all the time. Then it fails. Reconnect again, and it works fine… until it fails again.

Troubleshooting this issue wasn’t fun, precisely because it wasn’t fruitful. No amount of debugging helped, or, in fact, could ever help. The puzzle had to be solved by reason. Experimentation showed as much as I had already gathered from the logs. Still, I tried different scenarios. One thing was for sure, you couldn’t tell when it’ll fail next.

The Leak

Remember that HTTP is a connectionless protocol. That is, it’s designed to communicate a single request and its response and disconnect. This is the typical scenario. It holds no connections and no states, therefore, it has no session. On the web, sessions are realized by the HTTP server. An HTTP server would typically create some unique key upon login or first-request missing a key, and it would track all subsequent requests by getting the said key either in the URL or using cookies. In any event, even though a web-service may have support for sessions, the underlying protocol is still connectionless and stateless.

To improve performance, an afterthought of reusing connections was added. This is typically called Keep-Alive. The idea is that a flag is added to the HTTP header which tells the server not to close the connection immediately after responding to the request, anticipating further requests. This is reasonable, as a web-page typically loads multiple images and embedded items from the same server. The client and server supporting Keep-Alive reuse the same connection for several requests, until one of them closes the connection. What is most important in this scenario is that if either party doesn’t respect this hint, nothing would break. In fact, nothing would work any different, except, of course, for the extra connections and disconnections that would occur for each request.

Since the implementor of this feature was a savvy web developer, he always had this flag set. And so, as long as the connection wasn’t interrupted, or indeed, the underlying library we were using didn’t decide to close the connection on whim, all was well and we had no problems. However, when a new request went with a new connection, rather than an existing one, the Child’s server would accept a new socket, on a new port, rather than the previously open socket. This is what was happening on my test environment. Perhaps it was the fact that I was testing across VM images that triggered the disconnections. Anyway, this newly opened socket on the Child has no encryption details associated with it. It’s a brand-new connection. It should expect encryption key exchange. But due to implementation details, the request would have an ‘encrypted’ flag set and the Child wouldn’t mind that we negotiated no cryptographic keys. It’d go ahead and try to decipher the request, only, it couldn’t. Resulting in the logged encryption exception followed by disconnection.

Post Mortem

Once the issue was figured out, the solution was simple, albeit costly. The HTTP abstraction had leaked an ugly property that we had assumed abstracted away. At design time, we couldn’t care what protocol we used to carry our bits. Encryption was added almost as an afterthought. True that encryption does require a state. However looking at our code, the socket-level connection was abstracted by layers and layers of library code. In fact, all we had was a single static function which took a URL string for a request. We had serialized and encoded the request message in base-64 and appended to the URL, which contained the server hostname/ip and port; standard web request, really.

On the communication layer, we had this single URL construction and the request call. On the data layer, we had the encryption, serialization and data manipulation logic. On the application layer, well, there were no network details whatsoever. Most of the previous code which worked locally had remained the same, with the implementation changed to interface the new network layer. So in a sense the code evolved and adapted to its final form and it wasn’t anywhere near apparent that we had leaked a major problem into our code.

In hindsight, we should’ve taken matters into our hands and implement a session-based protocol directly. This would make sense because we’d be in complete control of all network matters. For one, with HTTP we couldn’t change the sockets to use async logic, nor could we change the buffer sizes and timeouts. Perhaps we didn’t need to, but considering the gigabytes/hour we expected to transfer, sooner or later we’d have to optimize and tune the system for performance. But, the developer assigned was inexperienced and we couldn’t afford the time overhead. Personally, I feared things would get too complicated for the assigned developer to handle. I let him pick the protocol he was most comfortable with. And that’s the real danger of leaky abstraction; everyone is tricked, including the experienced.

Indeed, we ended up rewriting the communication layer. First the HTTP code was replaced with plain sockets using TCP/IP. Next, sessions were added, such that disconnections were recoverable. That is, the data layer didn’t care whether communication was interrupted or not. We weren’t going to rely on the fact that we controlled the sockets. Disconnections were made irrelevant by design. And finally, our protocol required a strict sequence of initialization and handshake that insured correct state. Once the code was working as expected, we changed the sockets to use async interface for maximum performance.

Overall, we spent an extra 2 man/months and, as a result, the data+communication layer was sped up several times over. Still, this was one hell of a case of leaky abstraction.

Further Reading:


Update:
Many are asking why not use SSL? The answer is because HTTP was the wrong choice in the first place.
We weren’t building a web-service. This was a backend server communicating with the main product. We didn’t want to limit the protocol, features or extendability by the choice of communication details. SSL would resolve the encryption issue, but we’d have had to implement an HTTPS server. In addition, whatever application protocol we would eventually realize, it had to be connectionless. The encryption layer simply uncovered this design implication that we had overlooked, hence the leak in the communication protocol abstraction. At that point we didn’t have any control messages nor did we have requests that needed states, later we added both. In fact, we added a command to iterate over a large set of data, returning one item at a time. HTTP/S would’ve made this harder to implement, as the state would have had to be sent with each request. Control messages and heartbeats would’ve been pointless.
In short, HTTP gave us very little and the connectionless nature caused as a lot of trouble. We got rid of both. We exchanged a wrong solution with the right one, hardly reinventing the wheel, if you ask me.

Share
 

Data sharing between threads is a tricky business. Anyone with any kind of experience with multi-threaded code will give you a 1001 synonyms for “tricky,” most of which you probably wouldn’t use in front of your parents. The problem I’m about the present, however, has zero to do with threading and everything with data sharing and leaky abstraction.

This is a pattern that is used very often when one object is used symmetrically at the beginning and end of another’s lifetime. That is, suppose we have a class that needs to get notified when a certain other class is created, and then again when it’s destroyed. One way to achieve this, is to simply set a flag once to true and a second time to false, in the constructor and destructor of the second object, respectively.

This particular example is in C++ but that’s just to illustrate the pattern.

class Object
{
public:

Object(SomeComponent& comp) : m_component(comp)
{
    m_component.setOnline(true); // We’re online.
}

~Object()
{
    m_component.setOnline(false); // Offline.
}
};

This looks fool-proof, as there is no way the flag will not get set, so long that Object is created and destroyed as intended. Typically, our code will be used as follows:

Object* pObject = new Object(component);
// component knows we are online and processing...

delete pObject; // Go offline and cleanup.

Now let’s see how someone might use this class…

// use smart pointer to avoid memory leaks...
std::auto_ptr<object> apObject;

// Recreate a new object...
apObject.reset(new Object(component));

See a problem? The code fails miserably! And it’s not even obvious. Why? Because there are implicit assumptions and a leaky abstraction at work. Let’s dice the last line…

Object* temp_object = new Object(component); // create new Object
  Object::Object();
    component.setOnline(true);  // was already true!
delete apObject.ptr; // new instance passed to auto_ptr
  Object::~Object(); // old instance deleted
    component.setOnline(false); // OUCH!
apObject.ptr = temp_object;

See what happened?

Both authors wrote pretty straightforward code. They couldn’t have done better without making assumptions beyond the scope of their work. This is a pattern that is very easy to run into, and it’s far from fun. Consider how one could have detected the problem in the first place. It’s not obvious. The flag was set correctly, but sometimes would fail! That is, whenever there is an Object instance, and we create another one to replace the first, the flag ends up being false. The first time we create an Object, all works fine. The second time, component seems to be unaware of us setting the flag to true.

Someone noticed the failure, assumed the flag wasn’t always set, or may be incorrectly set, reviewed the class code and sure enough concluded that all was correct. Looking at the use-case of Object we don’t necessarily run through the guts of auto_ptr. After all, it’s a building block; a pattern; an abstraction of a memory block. One would take a quick look, see that an instance of Object is created and stored in an auto_ptr. Again, nothing out of the ordinary.

So why did the code fail?

The answer is on multiple levels. First and foremost we had a shared data that wasn’t reference counted. This is a major failing point. The shared data is a liability because it’s not in the abstraction of object instances. The very same abstraction assumptions that auto_ptr makes; it points to independent memory blocks. What we did is we challenged the assumptions that auto_ptr makes and failed to safe-guard our implicitly-shared data.

In other words, we had two instances of Object at the same time, but the flag we were updating had only two states: true and false. Thereby, it had no way of tracking anything beyond a single piece of information. In our case, we were tracking whether we were online or not. The author of Object made very dangerous assumptions. First and foremost, the assumption that the flag’s state is equivalent to Object’s lifetime proved to be very misleading. Because this raised the question of whether or not more than one instance of Object can exist. That question would have avoided a lot of problems down the road, however it wasn’t obvious and perhaps never occurred to anyone.

Second, even if we assume that there can logically be one instance of Object, without proving that it’s impossible to create second instances by means of language features, we are bound to misuse, as clearly happened here. And we can’t blame the cautious programmer who used auto_ptr either.

If something shouldn’t happen, prevent it by making it impossible to happen.

Solutions

The solutions aren’t that simple. An obvious solution is to take out the flag setting calls from within Object and call them manually. However this defies the point of having them where one couldn’t possibly forget or miss calling them, in case of a bug. Consider the case when we should set the flag to false when Object is destroyed, but this happens due to an exception, which automatically destroys the Object instance. In such a case, we should catch the exception and set the said flag to false. This, of course, is never as straight forward as one would like, especially in complex and mature production code. Indeed, using the automatic guarantees of the language (in this case calling the ctor and dtor automatically) are clearly huge advantages that we can’t afford to ignore.

One possible solution is to prevent the creation of Object more than once at a time. But this can be very problematic. Consider the case when we have multiple component instances, and we are interested in a different Object per component, not a globally unique Object instance.

As I said, no easy solution. The solution that I’d use is the next best thing to instance creation prevention. Namely, to count the number of instances. However, even if we reference count the Objects, or even the calls to setting the flag, in all events, we must redefine the contract. What does it mean to have multiple instance of Object and multiple calls to set the flag to true? Does it mean we still have one responsible object and what guarantees that? What if there are other constraints, might some other code assume only one instance of Object when that flag is set?

All of the questions that flow from our suggested solutions demand us to define, or redefine, the contracts and assumptions of our objects. And whatever solution we agree on, it will have its own set of requirements and perhaps even assumption, if we’re not too careful.

Conclusion

Using design patterns and best practices are without a doubt highly recommended. Yet ironically sometimes they may lead to the most unexpected results. This is no criticism of using such recommendations from experienced specialists and industry leaders, rather, it’s a result of combining abstractions in such a way that not only hides some very fundamental assumptions in our design and/or implementation, but even creates situations where some of the implicit assumptions of our code are challenged. The case presented is a good example. Had the developers not used the ctor/dtor pattern for setting the said flag, or had they not used auto_ptr, no such problem would’ve arisen. Albeit, they would have had other failure points, as already mentioned.

Admittedly, without experience it’s near impossible to catch similar cases simply by reading code or, preferably, while designing. And inexperience has no easy remedy. But if someone figures out a trick, don’t hesitate to contact me.

Share
 

Here is a story from the trenches…

Almost 3 years ago during a job interview one of the interviewers made a comment about how they never hire MBA’s for managerial positions. I didn’t get a chance to ask, when she continued to highlight the advantages (brag, really) of working with a superior who had worked as a programmer and climbed his/her way up the food chain. “We never have to talk to an MBA around here,” she added, and went on to finish with “everybody knows the code.”

The Ideal Project Manager?

You know, she’s talking about the kind of person who would know how to make time estimations and complexity assessment because, well, they had worked on similar things themselves. Not only that, imagine a supervisor who has actually worked on the very same projects that you’d be working on! No more lengthy meetings discussing time estimates and project plans, no more trying to hopelessly convince the manager why we need that extra person to pitch in to finish on time, no more wasted time attempting to find analogies of “spaghetti code” and hacks that must be refactored away before we add the next big thing, and more importantly, you’d have someone up the managerial food chain getting more realistic requirements and deadlines right off the bat before you even hear about the next project. After all, they know as much as the team everything there is to know about the codebase and software engineering… or so I thought!

I got hired alright. At first, I was rather amused by the idea; most everyone had worked on what I was working on at some time and every body seemed to have some idea about it. I imagined talking to the project manager about some project and how he’d point out some tricky part and that I should probably add in an extra week just to make sure it’s covered. I imagined hearing anecdotes and funny stories. In my mind I thought I would never have to talk the manager into anything; s/he would know what I’m talking about, and many times I expected them to tell me they’d extend the deadline because they know more testing would be needed. They know the code!

Enter Mr. MBA

Well, before I tell you what you probably expect anyway, let me tell you about an MBA who got hired at a time when there were no candidates for the position of a PM. They needed a PM, and they needed one immediately. So they hired Mr. MBA. Unlike what I had expected from an MBA, and unlike the typical Mr. MBA that I ran into throughout my career, this one was different. For a start his estimates were very realistic and often more accurate than the team’s! He was also very understanding. He’d get feedback from the team, make his own calculation (almost always to the team’s benefit) and take it up the chain. So often he’d take my estimates and add a 20 or 30% and make a comment to the effect of “QA is too busy, they’ll probably delay this project” or “we have external dependency; you can never know how effective they will communicate”. And almost always he was right on the money! The only times that he’d push for a shorter schedule or demand more aggressive results is when the success of the project directly depends on these factors and he thinks there is a way to optimize the project plan. And he’d find a way to reorganize things around and optimize things to get the right results.

How well do you think the other managers did? Needless to say they were a huge disappointment. They were the typical MBA I had in mind. Extremely rigid in their methods. Very opinionated and argumentative. Snobbish. Typically cutting any estimate by half and exclaiming why it would take that long to get such a trivial task done. Complaining and whining at every barrier or issue. Complete babies. And at the end of the day, they cared more about proving they were right than to get things done.

Why?

Here is why I think things turned out the opposite of what I expected. I think the reason that managers with technical background failed is because they almost always overestimate their abilities and undervalue their peers’ assessment. You see, where the MBA doesn’t try to imagine himself working on the tasks and projects, instead relies on historic data, experience, statistics and plain common sense, the tech manager tries to guess how long it would take him to do it.

Programmers are not the most humble. It’s easy to overestimate one’s abilities especially when you don’t want to look like you’re on the slow side in front of your once-peers, now subordinates. This is exacerbated by the fact that the manager in question probably hasn’t typed a single line of code in months. But even if they do, they still almost always undervalue the feedback of the team. In short, it’s very hard for someone who’s done a job to play a different roll and assume that s/he know nothing about the subject. After all, their advantage was supposed to be the very fact that they have been programmers. They are expected to use their experience! However, those who can’t get over the feeling of competing with their subordinates are doomed to do a bad job. While the tech manager is wasting energy completing with his/her subordinates, Mr. MBA was trying to solve real problems and work with the team.

Do you have a similar story? What’s your take on the subject? Are managers with programming experience better software PM’s or MBA’s know best?

Share
QR Code Business Card