Complex Systems

I hate Windows.  It seems that all my problems at work come from having to deal with Windows.  And Mac OS X, I hate Mac OS X as well for the same reason.

Actually, I don’t really hate those operating systems, but it got your attention.  I actually think they are both perfectly fine operating systems.  But they do cause all my headaches at work.  I’m a Linux user by default and venturing into the realm of Windows and OS X always seems to give me headaches.

And it is not really the operating systems that cause me the headaches.  The real issue is the complexity of the systems that I have to work with.  As the main part of my job, I maintain and (try to) enhance and extend two fairly complex systems.  One is the public data server for data from the primary instrument on a NASA satellite mission, the other is the software build system for the primary instrument team for that mission.

Both of these systems suffer to some extent from the second-system effect as described by Fred Brooks in The Mythical Man-Month, as both are the follow-on systems to earlier systems that worked quite well.  And both second systems were written by the author of the first system.

In the case of the data server, I only have myself to blame, since I am the original author.  I did all the trade studies, wrote the requirements and design documents, and implemented the system.  In fact, knowing about the second-system effect, I tried really hard to avoid suffering from it.  And for the most part, I think I succeeded.  It’s a relatively small, focused system that does one thing really fast.

But it is still complex.  And it still gives me headaches when things go wrong.  And I wrote it.  I understand intuitively what it is supposed to be doing and how it works.  I can only imagine the headaches the guy who was maintaining it for the year I was off working on a different project had.

The other system, on the other hand, was not written by me, and I don’t have the intuitive grasp of the system that the original developer did.  Although I’m getting a better feel for it every day.  And in many ways, this system is much more complex than the data server.  It’s an automated build system.  When a user checks in and tags new code, the build system launches a series of processes that check out the code, build it, run all the associated tests, bundle up user, developer and source distributions and publish all the results (including e-mailing developers about any of their packages that failed to compile or pass their tests).
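
To make the build, test and notify part of that flow concrete, here is a rough Python sketch.  It is not the real build system’s code, and it leaves out the checkout and bundling steps.  The function name `run_build`, the `build`, `test` and `notify` callables, and the mapping of packages to owner addresses are all made up for illustration.

```python
from typing import Callable, Dict


def run_build(
    tag: str,
    packages: Dict[str, str],             # hypothetical: package name -> owner e-mail
    build: Callable[[str], bool],         # returns True if the package compiles
    test: Callable[[str], bool],          # returns True if the package's tests pass
    notify: Callable[[str, str], None],   # notify(address, message)
) -> Dict[str, str]:
    """Build and test every package for a tag, then notify owners of failures.

    Returns a dict mapping each failed package to the stage that failed.
    """
    failures: Dict[str, str] = {}
    for pkg in packages:
        if not build(pkg):
            failures[pkg] = "compile"
        elif not test(pkg):
            # Tests only run for packages that compiled.
            failures[pkg] = "test"

    # Only the owners of failing packages hear about it.
    for pkg, stage in failures.items():
        notify(packages[pkg], f"{tag}: {pkg} failed at the {stage} stage")
    return failures
```

Passing the steps in as callables keeps the sketch independent of any particular compiler, test runner or mail setup, which matters when the same flow has to run on several operating systems.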

It’s a fairly standard build system.  Except that it all has to run on seven different operating systems.  With six different compilers.  And it runs on a batch queuing system and talks to four different databases on two different MySQL servers.  Did I mention it was fairly complex?

Just to enumerate, the operating systems we currently support are 32- and 64-bit Red Hat Enterprise Linux 4 & 5, Mac OS X 10.6 (Snow Leopard), Mac OS X 10.4 (Tiger, going away as soon as the Snow Leopard support is fully functional) and Windows XP (with Windows 7 support looming soon).  The compilers we currently support are four versions of gcc (3.4, 4.0, 4.1 and 4.2) and two versions of Visual Studio (2003 and 2008).  It’s not actually as bad as it sounds.  With the exception of two versions of VS running on Win XP, there is only one compiler supported per *nix style OS.  This variety is actually a good thing as it helps keep the codebase clean since it has to work everywhere.

The real trouble comes from the infrastructure supporting the system and the ways it interacts (or doesn’t) with these different operating systems.

The programs that run the build system were written in C++ using the Qt library.  Now I didn’t know anything about Qt when I acquired the responsibility for the project but after sifting through the code, I think I can understand why this was chosen.  One of the main reasons was the use of the timer and process control functionality, both to launch checks at specific intervals and to kill build or, more importantly, test processes that have hung and are taking too long.  Only the latter doesn’t seem to work on Snow Leopard, as we found out when one of our packages was segfaulting in the tests and instead of dying, it was going into an infinite loop.  And since the build system code didn’t properly kill it, the entire system hung up for that OS.  And right now I can’t tell if the problem is Qt, the underlying OS, how we’re applying it, or some combination of the three.  Complexity.
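
For readers who haven’t seen the pattern, the "kill it if it takes too long" idea looks roughly like the Python sketch below.  This is not the Qt code from the build system, just a minimal illustration using Python’s standard `subprocess` module.  The function name `run_with_timeout` and the timeout value are made up for the example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and kill it if it runs longer than timeout_s seconds.

    Returns (exit_code, timed_out).  exit_code is None when the
    watchdog had to kill the process.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        proc.kill()  # forcibly stop the direct child process
        proc.wait()  # reap it so it does not linger as a zombie
        return None, True


if __name__ == "__main__":
    # A stand-in for a test that has gone into an infinite loop.
    hung_test = [sys.executable, "-c", "while True: pass"]
    print(run_with_timeout(hung_test, timeout_s=1.0))
```

Note that this only kills the direct child.  Anything the child itself spawned may keep running, which is one of the places where behavior can differ between operating systems.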

This build system has a lot of moving parts.  And I think the reason is that it is built around the central batch queuing system at the national laboratory where it runs.  In theory and at the beginning, that was a good idea.  We were sometimes triggering new builds every 15 to 30 minutes and the entire build process takes about an hour or two to run (there’s a lot of code to be compiled and tested).  By using the batch farm, we could have all these builds running and not piling up on one another by leveraging a tiny fraction of the thousands of CPU cores available in the farm.

But that came with tradeoffs.  For example, since the various parts of the process could potentially (and usually do) run on different machines, you can’t use local storage and have to use network disks (via AFS in our case) to hold all the code and test data.  This doesn’t seem to be an issue for the *nix systems but for some reason accessing the network disks from Windows is sloooooow.  A process that takes 10 minutes on the *nix boxes can take 30-40 (or more sometimes) minutes on the Windows boxes, reason unknown.  There are other tradeoffs as well, all increasing the complexity.

And then a couple of things happened.  The lab never really supported Mac OS X in the batch farm, so we had to get our own OS X boxes.  And we somewhat pioneered Windows usage so we had to get those boxes ourselves as well.  And then they dropped Red Hat EL 4, and 32-bit Red Hat EL 5.  So now, the only OS supported in the main farm that we were supposed to use is Red Hat EL 5 64-bit.  Everything else runs on our own project-purchased machines, but we’re still wedded to this complex infrastructure of using the batch farm.

Luckily, we’re starting to move away from that.  But it’s painfully slow, mainly since I seem to spend all my time running around propping up the beast to keep it in production and have little time to work on an alternative.  But at least there is motion in the right direction.  Towards simpler systems.  And away from complexity.

This post originally appeared on my old Programming Space blog.