Chris Oldwood – Page 3 – ACCU World of Code

After being away from the relational database world for a few years it's been interesting coming back and working on a mature system with plenty of SQL code. It's been said that SQL is the assembly language of databases and when SQL code is written only using its primitives (types and tables) it's easy to see why.

Way back in 2011 I wrote "The Public Interface of a Database" which was a distillation of my thoughts at the time about what I felt was generally wrong with much of the database code I saw. One aspect in particular which I felt was sorely underutilised was the use of views to build a logical model over the top of the physical model to allow a more emergent design to unfold. This post documents some of the ways I've found views to be beneficial in supporting a more agile approach to database design.

Views for Code Reuse

The first thing that struck me about the recent SQL code I saw was how much there was of it. Most queries were pretty verbose and as a consequence you had to work hard to comprehend what was going on. Just as you see the same tired examples around Orders => OrderItems => Products so the code had a similar set of 3 table joins over and over again as they formed the basis for so many queries.

One of the primary uses for database views is as a code reuse mechanism. Instead of copy-and-pasting the same bunch of joins everywhere:

FROM Orders o
INNER JOIN OrderItems oi
ON o.Id = oi.OrderId
INNER JOIN Products p
ON oi.ProductId = p.Id

we could simply say:

FROM OrdersOrderItemsProducts

This one simplification reduces a lot of complexity and means that wherever we see that name we instantly recognise it without mentally working through the joins in our head. Views are composable too meaning that we can implement one view in terms of another rather than starting from scratch every time.
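As a rough sketch of what sits behind that name, the view below wraps the three-table join and a second view is then defined in terms of the first. Only the table names and join keys come from the text above; the selected columns (Quantity, Name, oi.Id) and the second view's name are assumptions made for illustration. It is written in a T-SQL flavour, where GO is the client batch separator that CREATE VIEW needs.

```sql
-- Hypothetical columns: only the tables and join keys appear in the article.
CREATE VIEW OrdersOrderItemsProducts
AS
    SELECT o.Id          AS OrderId,
           oi.Id         AS OrderItemId,
           oi.Quantity,
           p.Id          AS ProductId,
           p.Name        AS ProductName
    FROM Orders o
    INNER JOIN OrderItems oi
        ON o.Id = oi.OrderId
    INNER JOIN Products p
        ON oi.ProductId = p.Id;
GO

-- Views compose: this one (name assumed) is built on the view above
-- rather than repeating the joins from scratch.
CREATE VIEW ProductQuantitiesOrdered
AS
    SELECT ProductId,
           ProductName,
           SUM(Quantity) AS TotalQuantity
    FROM OrdersOrderItemsProducts
    GROUP BY ProductId, ProductName;
GO
```

Any query that previously repeated the joins can then select from either view directly.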

Naming

However, if the name OrdersOrderItemsProducts makes you wince then I don't blame you because it's jarring due to its length and unnaturalness. It's a classic attempt at naming based on how it's implemented rather than what it means.

I suspect a difficulty in naming views is part of the reason for their lack of use in some cases. For our classic example above I would probably go with OrderedProducts or ProductsOrdered. The latter is probably preferable as the point of focus is the Products "set" with the use of Orders being a means to qualify which products we're interested in, like "users online". Of course one could just as easily say "unread messages" and therefore we quickly remember why naming is one of the two hardest problems in computer science.

Either way it's important that we do spend the time required to name our views appropriately as they become the foundation on which we base many of our other queries.

Views for Encapsulation

Using views as a code reuse mechanism is definitely highly beneficial but where I think they start to provide more value is as a mechanism for revealing new, derived sets of data. The name ProductsOrdered is not radically different from the more long-winded OrdersOrderItemsProducts and therefore it still heavily reflects the physical relationship of the underlying tables.

Now imagine a cinema ticketing system where you have two core relationships: Venue => Screen => SeatingPlan and Film => Screening => Ticket => Seat. By navigating these two relationships it is possible to determine the occupancy of the venue, screen, showing, etc. and yet the term Occupancy says nothing about how that is achieved. In essence we have revealed a new abstraction (Occupancy) which can be independently queried and therefore elevates our thinking to a higher plane instead of getting bogged down in the lengthy chain of joins across a variety of base tables.
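To make that concrete, here is one way such an Occupancy view might be sketched in T-SQL. Every column name is hypothetical, as are the assumptions that each screen has a single seating plan, that seats belong to a seating plan, and that a seat is sold at most once per screening. The article itself only names the tables and the two chains of relationships.

```sql
-- Hypothetical schema: Screening(Id, ScreenId, FilmId), Screen(Id, VenueId),
-- SeatingPlan(Id, ScreenId), Seat(Id, SeatingPlanId),
-- Ticket(Id, ScreeningId, SeatId).
CREATE VIEW Occupancy
AS
    SELECT sg.Id          AS ScreeningId,
           sc.VenueId,
           sc.Id          AS ScreenId,
           sg.FilmId,
           COUNT(se.Id)   AS TotalSeats,  -- every seat in the screen's plan
           COUNT(t.Id)    AS SeatsSold    -- only seats with a matching ticket
    FROM Screening sg
    INNER JOIN Screen sc
        ON sc.Id = sg.ScreenId
    INNER JOIN SeatingPlan sp
        ON sp.ScreenId = sc.Id
    INNER JOIN Seat se
        ON se.SeatingPlanId = sp.Id
    LEFT JOIN Ticket t
        ON t.ScreeningId = sg.Id
       AND t.SeatId = se.Id
    GROUP BY sg.Id, sc.VenueId, sc.Id, sg.FilmId;
GO

-- Callers ask about occupancy without knowing about any of the joins.
SELECT ScreeningId,
       SeatsSold * 100.0 / TotalSeats AS PercentFull
FROM Occupancy
WHERE VenueId = 1;
```

The consuming query only mentions Occupancy, so the chain of joins can be reworked later without touching it.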

Views for Addressing Uncertainty

We can also turn this thinking upside down, so that rather than creating something new by hiding the underlying existing structure, we can start with something concrete and re-organise how things work underneath. This is the essence of refactoring – changing the design without changing the behaviour.

When databases were used as a point of integration this idea of hiding the underlying schema from "consumers" made sense as it gave you more room to change the schema without breaking a bunch of queries your consumers had already created. But even if you have sole control over your schema there is still a good reason why you might want to hide the schema, nay implementation, even from much of your own code.

Imagine you are developing a system where you need to keep daily versions of your customers' details easily accessible because you regularly perform computations across multiple dates [1] and you need to use the correct version of each customer's data for the relevant date. When you start out you may not know what the most appropriate way to store them is because you do not know how frequently they change, what kinds of changes are made, or how the data will be used in practice.

If you assume that most attributes change most days you may well plump for just storing them daily, in full, e.g.

| Date       | Name      | Valuation | ... |
| 2019-03-01 | Company A | £102m     | ... |
| 2019-03-01 | Company B | £47m      | ... |
| 2019-03-02 | Company A | £105m     | ... |
| 2019-03-02 | Company B | £42m      | ... |
| 2019-03-03 | Company A | £105m     | ... |
| 2019-03-03 | Company B | £42m      | ... |

On the contrary, if the attributes rarely change each day then maybe we can version the data instead:

| Name      | Version | Valuation | ... |
| Company A | 1       | £147m     | ... |
| Company A | 2       | £156m     | ... |
| Company B | 1       | £27m      | ... |

So far so good, but how do we track which version belongs to which date? Once again I can think of two obvious choices. The first is much like the original verbose table and we record it on a daily basis:

| Date       | Name      | Version |
| 2019-03-01 | Company A | 1       |
| 2019-03-01 | Company B | 1       |
| 2019-03-02 | Company A | 1       |
| 2019-03-02 | Company B | 2       |

The second is to coalesce dates with the same version creating a much more compact form:

| From       | To         | Name      | Version |
| 2019-03-01 | (null)     | Company A | 1       |
| 2019-03-01 | 2019-03-01 | Company B | 1       |
| 2019-03-02 | (null)     | Company B | 2       |

Notice how we have yet another design choice to make here – whether to use NULL to represent "the future", or whether to put today's date as the upper bound and bump it on a daily basis [2].

So, with all those choices how do we make a decision? What if we don't need to make a decision, now? What if we Use Uncertainty as a Driver and create a design that is easily changeable when we know more about the shape of the data and how it's used?

What we do know is that we need to process customer data on a per-date basis. Therefore, instead of starting with a Customer table we start with a Customer view which has the shape we're interested in: one row per customer per date.
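A minimal sketch of how that might play out in T-SQL follows. The table names (CustomerDaily, CustomerVersion, CustomerVersionDates, Calendar) are all assumptions, as is the existence of a Calendar table holding one row per date. The column names are taken from the example tables above, with [From] and [To] bracketed because they are reserved words.

```sql
-- Day one: the view simply exposes the "store everything daily" table.
CREATE VIEW Customer
AS
    SELECT [Date], Name, Valuation
    FROM CustomerDaily;
GO

-- Later: switch to versioned storage with coalesced date ranges.
-- The view keeps the same shape, so its callers are unaffected.
ALTER VIEW Customer
AS
    SELECT c.[Date],
           v.Name,
           v.Valuation
    FROM Calendar c
    INNER JOIN CustomerVersionDates d
        ON c.[Date] >= d.[From]
       -- A NULL upper bound means "the future", so treat it as open-ended.
       AND c.[Date] <= COALESCE(d.[To], c.[Date])
    INNER JOIN CustomerVersion v
        ON v.Name = d.Name
       AND v.Version = d.Version;
GO

-- The consuming query is identical under either design.
SELECT Name, Valuation
FROM Customer
WHERE [Date] = '2019-03-02';
```

If the other convention were chosen, where today's date is written as the upper bound and bumped daily, only the COALESCE line inside the view would need to change.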

We can happily use this view wherever we like knowing that the underlying structure could change without us needing to fix up lots of code. Naturally some code will be dependent on the physical structure, but the point is that we've kept it to a bare minimum. If we need to transition from one design to another, but can't take the downtime to rewrite all the data up-front, that can often be hidden behind the view too.

Views as Interfaces

It's probably my background [3] but I can't help but notice a strong parallel in the latter two examples with the use of interfaces in object-oriented code. George Box reminds us that "all models are wrong, but some are useful" and so we should be careful not to strain the analogy too far but I think there is some value in considering the relationship between views and tables as somewhat akin to interfaces and classes, at least for the purposes of encapsulation as described above.

On a similar note we often strive to create and use the narrowest interface that solves our problem and that should be no different in the database world either. Creating narrower interfaces (views) allows us to remain more in control of our implementation by leaking less.

One final type-related comparison that I think worthy of mention is that it's easier to spot structural problems when you have a "richer type system", i.e. many well-named views. For example, if a query joins through ProductsOrdered to get to UserPreferences you can easily see something funky is going on.

Embracing Change

Working alongside a database where the SQL code and schema get refactored almost as heavily as the services that depend on it is a pleasurable experience [4]. Scott Ambler wrote a couple of books over a decade ago (Refactoring Databases: Evolutionary Database Design and Agile Database Techniques) which convinced me long ago that it was possible to design databases that could embrace change. Making judicious use of views certainly helped achieve that in part by keeping the accidental complexity down.

Admittedly performance concerns, still a dark art in the world of databases, get in the way every now and then, but I'd rather try to make the database a better place for my successors than assume it can't be done.

[1] In investment banking it's common to re-evaluate trades and portfolios on historical dates both for regulatory and analytical purposes.

[2] Some interesting scenarios crop up here when repeatability matters and you have an unreliable upstream data source.

[3] I'm largely a self-taught, back-end developer with many years of writing C++ and C# based services.

[4] Having a large suite of database unit tests, also written in T-SQL, really helped as we could use TDD on the database schema too.

March 8, 2019

The Perils of Multi-Phase Construction

Chris Oldwood from The OldWood Thing

I've never really been a fan of C#'s object initializer syntax. Yes, it's a little more convenient to write but it has a big downside which is it forces you to make your types mutable by default. Okay, that's a bit strong, it doesn't force you to do anything, but it does promote that way of thinking and allows people to take advantage of mutability outside the initialisation block [1].

This post is inspired by some buggy code I encountered where my suspicion is that the subtleties of the object initialisation syntax got lost along the way and partially constructed objects eventually found their way into the wild.

No Dragons Yet

The method, which was to get the next message from a message queue, was originally written something like this:

Message result = null;
RawMessage message = queue.Receive();

if (message != null)
{
    result = new Message
    {
        Priority = message.Priority,
        Type = GetHeader(message, "MessageType"),
        Body = message.Body,
    };
}

return result;

This was effectively correct. I say "effectively correct" because it doesn't contain the bug which came later but still relies on mutability which we know can be dangerous.

For example, what would happen if the GetHeader() method threw an exception? At the moment there is no error handling and so the exception propagates out of the method and back up the stack. Because we make no effort to recover we let the caller decide what happens when a duff message comes in.

The Dragons Begin Circling

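Presumably the behaviour when a malformed message arrived was undesirable because the method was changed slightly fairly soon after to include some recovery: the construction was wrapped in a try block whose catch handler logs the error (the same handler that appears in the later listing).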

Still no bug yet, but that catch handler falling through to the return at the bottom is somewhat questionable; we are making the reader work hard to track what happens to result under the happy / sad paths to ensure it remains correct under further change.

Object Initialisation Syntax

Before showing the bug, here's a brief refresher on how the object initialisation syntax works under the covers [2] in the context of our example code. Essentially it invokes the default constructor first and then performs assignments on the various other properties, e.g.

var __m = new Message();
__m.Priority = message.Priority;
__m.Type = GetHeader(message, "MessageType");
__m.Body = message.Body;
result = __m;

Notice how the compiler introduces a hidden temporary variable during the construction which it then assigns to the target at the end? This ensures that any exceptions during construction won't create partially constructed objects that are bound to variables by accident. (This assumes you don't use the constructor or a property setter to attach the object to any global variables either.)

Hence, with respect to our example, if any part of the initialisation fails then result will be left as null and therefore the message is indeed discarded and the caller gets a null reference back.

The Dragons Surface

Time passes and the code is then updated to support a new property which is also passed via a header. And then another, and another. However, being more complicated than a simple string value the logic to parse it is placed outside the object initialisation block, like this:

Message result = null;
RawMessage message = queue.Receive();

if (message != null)
{
    try
    {
        result = new Message
        {
            Priority = message.Priority,
            Type = GetHeader(message, "MessageType"),
            Body = message.Body,
        };

        var str = GetHeader(message, "SomeIntValue");
        if (str != null && TryParseInt(str, out var value))
            result.IntValue = value;

        // ... more of the same ...
    }
    catch (Exception e)
    {
        Log.Error("Invalid message. Skipping.");
    }
}

return result;

Now the problems start. With the latter header parsing code outside the initialisation block result is assigned a partially constructed object while the remaining parsing code runs. Any exceptions that occur [3] mean that result will be left only partially constructed and the caller will be returned the duff object because the exception handler falls out the bottom.

+1 for Tests

The reason I spotted the bug was because I was writing some tests around the code for a new header which also temporarily needed to be optional, like the others, to decouple the deployments. When running the tests there was an error displayed on the console output [4] telling me the message was being discarded, which I didn't twig at first. It was when I added a retrospective test for the previous optional fields and I found my new one wasn't being parsed correctly that I realised something funky was going on.

Alternatives

So, what's the answer? Well, I can think of a number of approaches that would fix this particular code, ranging from small to large in terms of the amount of code that needs changing and our appetite for it.

Firstly we could avoid falling through in the exception handler and make it easier on the reader to comprehend what would be returned in the face of a parsing error:

catch (Exception e)
{
    Log.Error("Invalid message. Skipping.");
    return null;
}

Secondly we could reduce the scope of the result variable and return that at the end of the parsing block so it's also clearer about what the happy path returns:

var result = new Message
{
    // . . .
};

var str = GetHeader(message, "SomeIntValue");
if (str != null && TryParseInt(str, out var value))
    result.IntValue = value;

return result;

We could also short circuit the original check too and remove the longer lived result variable altogether with:

RawMessage message = queue.Receive();

if (message == null)
    return null;

These are all quite simple changes which are also safe going forward should someone add more header values in the same way. Of course, if we were truly perverse and wanted to show how clever we were, we could fold the extra values back into the initialisation block by doing an Extract Function on the logic instead and leave the original dragons in place, e.g.

try
{
    result = new Message
    {
        Priority = message.Priority,
        Type = GetHeader(message, "MessageType"),
        Body = message.Body,
        IntValue = GetIntHeader(message, "SomeIntValue"),
        // ... more of the same ...
    };
}
catch (Exception e)
{
    Log.Error("Invalid message. Skipping.");
}

But we would never do that because the aim is to write code that helps stop people making these kinds of mistakes in the first place. If we want to be clever we should make it easier for the maintainers to fall into The Pit of Success.

Other Alternatives

I said at the beginning that I was not a fan of mutability by default and therefore it would be remiss of me not to suggest that the entire Message type be made immutable and all properties set via the constructor instead:

result = new Message
(
    priority: message.Priority,
    type: GetHeader(message, "MessageType"),
    body: message.Body,
    intValue: GetIntHeader(message, "SomeIntValue"),
    // ... more of the same ...
);

Yes, adding a new property is a little more work but, as always, writing the tests to make sure it all works correctly will dominate here.

I would also prefer to see use of an Optional<> type instead of a null reference for signalling "no message" but that's a different discussion.

Epilogue

While this bug was merely "theoretical" at the time I discovered it [5] it quickly came back to bite. A bug fix I made on the sending side got deployed before the receiving end and so the misleading error popped up in the logs after all.

Although the system appeared to be functioning correctly it had slowed down noticeably which we quickly discovered was down to the receiving process continually restarting. What I hadn't twigged just from reading this nugget of code was that due to the catch handler falling through and passing the message on it was being acknowledged on the queue twice – once in that catch handler, and again after processing it. This second acknowledgment attempt generated a fatal error that caused the process to restart. Deploying the fixed receiver code as well sorted the issue out.

Ironically the impetus for my blog post "Black Hole - The Fail Fast Anti-Pattern" way back in 2012 also came from two-phase construction problems that caused a process to go into a nasty failure mode, but that time it processed messages much too quickly and stayed alive failing them all.

[1] Generally speaking the setting of multiple properties implies it's multi-phase construction. The more common term Two-Phase Construction comes (I presume) from explicit constructor method names like Initialise() or Create() which take multiple arguments, like the constructor, rather than setting properties one-by-one.

[2] This is based on my copy of The C# Programming Language: The Annotated Edition.

[3] When the header was missing it was passing a null byte[] reference into a UTF8 decoder which caused it to throw an ArgumentNullException.

[4] Internally it created a logger on-the-fly so it wasn't an obvious dependency that initially needed mocking.

[5] It's old, so possibly it did bite in the past but nobody knew why or it magically fixed itself when both ends were upgraded close enough together.

March 4, 2019

A Not So Minor Hardware Revision

Chris Oldwood from The OldWood Thing

[These events took place two decades ago, so consider it food for thought rather than a modern tale of misfortune. Naturally some details are hazy and possibly misremembered but the basic premise is still sound.]

Back in the late '90s I was working on a Travelling Salesman style problem (TSP) for a large oil company which had performance improvements as a key element. Essentially we were taking a new rewrite of their existing scheduling product and trying to solve some huge performance problems with it, such as taking many minutes to load, let alone perform any scheduling computations.

We had made a number of serious improvements, such as reducing the load time from minutes to mere seconds, and, given our successes so far, were tasked with continuing to implement the rest of the features that were needed to make it usable in practice. One feature was to import the set of orders from the various customer sites which were scheduled by the underlying TSP engine.

The Catalyst

The importing of orders required reading some reasonably large text files, parsing them (which was implemented using the classic Lex & YACC toolset) and pushing them into the database whereupon the engine would find them and work out a schedule for their delivery.

Initially this importer was packaged as an ActiveX control, written in C and C++, and hosted inside the PowerBuilder (PB) based GUI. Working on the engine side (written entirely in C) we had created a number of native test harnesses (in C++/MFC) to avoid needing to use the PB front-end unless absolutely necessary due to its generally poor performance. Up until this point the importer appeared to work fine on our dev workstations, but when it was passed to the QA a performance problem started showing up.

The entire team (developers and tester) had all been given identical Compaq machines. Given that we needed to run Oracle locally as well as use it for development and testing we had a whopping 256 MB of RAM to play with along with a couple of cores. The workstations were running Windows NT 4.0 and we were using Visual C++ 2 to develop with. As far as we could see they looked and behaved identically too.

The Problem

The initial bug report from the QA was that after importing a fresh set of orders the scheduling engine run took orders of magnitude longer (no pun intended) to find a solution. However, after restarting the product the engine run took the normal amount of time. Hence the conclusion was that the importer ActiveX control, being in-process with the engine, was somehow causing the slowdown. (This was in the days before the low-fragmentation heap in Windows and heap fragmentation was known to be a problem for our kind of application.)

Weirdly though the developer of the importer could not reproduce this issue on their machine, or another developer's machine that they tried, but it was pretty consistently reproducible on the QA's machine. As a workaround the logic was hoisted into a separate command-line based tool instead which was then passed along to the QA to see if matters improved, but they didn't. Restarting the product was the only way to get the engine to perform well after importing new orders and naturally this wasn't a flyer with the client as this would happen in real-life throughout the day.

In the meantime I had started to read up on Windows heaps and found some info that allowed me to write some code which could help analyse the state of the heaps and see if fragmentation was likely to be an issue anyway, even with the importer running out-of-process now. This didn't turn up anything useful at the time but the knowledge did come in handy some years later.

Tests on various other machines were now beginning to show that the problem was most likely with the QA's machine or configuration rather than with the product itself. After checking some basic Windows settings it was posited that it might be a hardware problem, such as a faulty RAM chip. The Compaq machines we had been given weren't cheap and weren't using cheap RAM chips either; the POST was doing a memory check too, but it was worth checking out further. Despite swapping over the RAM (and possibly CPUs) with another machine the problem still persisted on the QA's machine.

Whilst putting the machines back the way they were I somehow noticed that the motherboard revision was slightly different. We double-checked the version numbers and the QA's machine was one minor revision lower. We checked a few other machines we knew worked and lo-and-behold they were all on the newer revision too.

Fortunately, inside the case of one machine was the manual for the motherboard which gave a run down of the different revisions. According to the manual the slightly lower revision motherboard only supported caching of the first 64 MB of RAM! Due to the way the application's memory footprint changed during the order import and subsequent cache reloading it was entirely plausible that the new data could reside outside the cached region [1].

This was enough evidence to get the QA's machine replaced and the problem never surfaced again.

Retrospective

Two decades of experience later and I find the way this issue was handled rather peculiar by today's standards.

Mostly I find the amount of time we devoted to identifying this problem inappropriate. Granted, this problem was weird and one of the most enjoyable things about software development is dealing with "interesting" puzzles. I for one was no doubt guilty of wanting to solve the mystery at any cost. We should have been able to chalk the issue up to something environmental much sooner and been able to move on. Perhaps if a replacement machine had shown similar issues later it would be cause to investigate further [2].

I, along with most of the other devs, only had a handful of years of experience which probably meant we were young enough not to be bored by such issues, but also were likely too immature to escalate the problem and get a "grown-up" to make a more rational decision. While I suspect we had experienced some hardware failures in our time we hadn't experienced enough weird ones (i.e. non-terminal) to suspect a hardware issue sooner.

Given the focus on performance and the fact that the project was acquired from a competing consultancy after they appeared to "drop the ball" I guess there were some political aspects that I would have been entirely unaware of. At the time I was solely interested in finding the cause [3] whereas now I might be far more aware of any ongoing "costs" in this kind of investigation and would no doubt have more clout to short-circuit it even if that means we never get to the bottom of it.

As more of the infrastructure we deal with moves into the cloud there is less need, or even ability, to deal with problems in this way. That's great from a business point of view but I'm left wondering if that takes just a little bit more fun out of the job sometimes.

[1] This suggests to me that the OS was dishing out physical pages from a free-list where address ordering was somehow involved. I have no idea how realistic that is or was at the time.

[2] It's entirely possible that I've forgotten some details here and maybe more than one machine was acting weirdly but we focused on the QA's machine for some reason.

[3] I'm going to avoid using the term "root cause" because we know from How Complex Systems Fail that we still haven't gotten to the bottom of it. For example, where does the responsibility for verifying the hardware was identical lie, etc.?

March 1, 2019 (updated March 2, 2019)

Feature Branches and Package Dependencies

Chris Oldwood from The OldWood Thing

For most of my programming career I have worked directly on the main integration branch (aka trunk / master) for day-to-day development. Release branches have featured occasionally at various clients, mostly to compensate for bureaucracy, and I once had the misfortune to work with project-level branches (see "The Cost of Long-Lived Feature Branches") which was a merge nightmare. More recently I got to work in a team that used much shorter lived feature branches [1] and I got a reminder of the kinds of problems even they cause.

When the branch is confined to a single repository and that repo is for the delivered product then the only people who are affected by the changes are those working on the branch. (We're not talking about the Cost of Delay here for the customer, this is about intra-team delays.) However once we start making changes outside the main repository, such as our library repos (nay packages), things get more complicated.

Changing Packages

Although in theory we could create a similar branch in the library repo the point of integration between the two codebases (product and library) is usually at a binary level, i.e. the package repository. The package manager almost certainly doesn't deal in branches per se, only published versions of a package [2].

Where things get even more complicated is when you have a few packages that all make use of some 3rd party library, e.g. a message queuing product, and you discover you need to upgrade that as part of your feature work too. If you go via the normal channels you'll end up upgrading the dependency in the packages, publishing them, and then pulling those upgraded packages into your feature branch before carrying on where you left off [3].

Dependent Changes

However, anyone working on another feature branch or even the trunk can no longer make an orthogonal change to those packages because pulling them in would likely create an impedance mismatch, unless they also duplicate the integration work done on the feature branch or cherry pick it. Ideally that would be a trivial merge but the very nature of feature branches is to work in isolation and therefore changes tend to get intertwined because the focus is on the feature itself, not the integration steps in-between.

Essentially until the new package versions are fully integrated any changes to them will be delayed, which if you've already started work means a blocker goes up on the board. In this instance, being used to trunk based development, I didn't want to wait so instead reverted the upgrade to the 3rd party library, published it (with my changes), and then integrated it into the main product directly on the trunk.

Unfortunately this creates an extra burden for those on the feature branch as they will need to re-upgrade the package before integrating their changes back into the trunk. Such is the price for working in isolation.

Small is Beautiful

One of the benefits of working in an Agile way using trunk based development is that it teaches you to focus on really small, but nonetheless valuable changes. A user story may be split across a number of commits and so we have to think about the way we'll deliver the changes. Feature toggles help us to hide our work in progress but, as just described, occasionally we may need to make a more sweeping change.

In these scenarios we should be able to push the feature work temporarily onto the "stack", make the package changes, and then pop the feature work and carry on. With trunk based development the work is woven in just like any other, but when using a feature branch you need to switch back to the trunk, perform the upgrade, then merge trunk into the feature branch before carrying on.

I believe that if you are able to make this work smoothly using a feature branch then you are almost certainly capable of making the changes directly on the trunk in the first place. Planning doesn't stop at the release and sprint level, we also need to plan how we evolve the codebase at story and task level too to minimise disruption to others whilst also making progress on our own work.

[1] They only lasted a few days and the team were trying to move away from them where possible.

[2] In theory this is where semantic versioning comes into play as the breaking change demands a new major version. My change would then have been made for both major versions. I say "in theory" because in my experience this is not an approach commonly taken by enterprise development teams for internal libraries – the path of least resistance usually wins.

[3] Alternatively you may be able to hide the library's dependency, assuming it's backwardly compatible, by binding to it directly such as via a static library or merging assemblies. However, as with [2], it's not the usual approach.

December 12, 2018

"Hello World" Stories

Chris Oldwood from The OldWood Thing

I’ve always tried really hard to fight against “technical stories”. These are supposedly user stories but are really framed as a solution to a problem and so are just technical tasks. In “Turning Technical Tasks Into User Stories” I looked at how it’s often possible to elevate these from an obvious solution to a problem back up to a problem which needs to be solved. At this point you may discover there are other, hopefully cheaper, solutions to the problem which have been missed in the original analysis either because things have changed or different people are doing the thinking.

On the flip-side there are occasionally times where, after having looked at a few related stories, it’s apparent that they all require the same underlying mechanism to work. One common solution to this is to bulk up the first story with the technical work and let the rest flow through as normal. This way you have no technical work on your backlog per-se as it’s all hidden in the stories.

Transparency

What I don’t like about this approach is that one story arbitrarily gets hit with a load of extra work, which, if you’re using historical data to stick a finger in the air for estimation of similar work later, skews the average somewhat. It also means that from a visibility perspective one story takes longer while the mechanism is being built.

One way I’ve found to address this has been to pull out the bare bones of the technical work into a “Hello, World!” story [1]. This story is framed around building the skeleton of the mechanism that will be used to drive the implementation of the subsequent features. The aim is to keep the scope minimal enough that we avoid speculating while still delivering something which stands on its own two feet and remains clearly visible on the board.

Value Proposition

While the value to the end-user is in the eventual feature, the value in the mechanism is proving to the development team that the basic approach seems sound. With the skeleton built, the idiosyncrasies around each individual feature can then be dealt with appropriately at the right time and accounted for in the usual way.

To be clear this is not about doing a spike or building a prototype, although that may have happened earlier to gain the knowledge needed to undertake this piece of work. No, here we’re talking about building the bare bones of a real mechanism along with the most basic feature possible.

The reason I’ve called these “Hello World” Stories is probably self-evident: it alludes to the classic program many have chosen as their first – to write “Hello, World!” to the console. In this context the name is intended to conjure up simplicity and remind us that what we’re doing is delivering the minimum required to make the platform viable. We probably won’t literally write “Hello, World!” to the console, but it may be a log message instead that we can then observe and monitor, or a message on a queue that we can see discarded. Essentially whatever we can do to make its effects observable without wasting any real effort or leaving it partially complete.

Based on the classic INVEST acronym we should strive to make every unit of work: Independent, Negotiable, Valuable, Estimable, Small and Testable. By splitting it out from one of the arbitrary features it becomes more independent, negotiable, estimable and small which can be useful should short-term priorities change. And by extending the scope from a pure mechanism just a little bit further to the most trivial feature possible we make it more testable from a technical perspective, even if not from a product viewpoint. Most importantly, however, is it valuable in its own right? I think sometimes splitting the mechanism out gives value by making the I,N,E,S and T more tangible. In particular breaking work down into smaller deliverable units is often the most valuable practice even if occasionally the end-user has nothing initially to show for it.

Ultimately, I guess, I can’t ever remember anyone complaining they had broken their work down into pieces that were so small they were too visible.

[1] I’m sure there is an argument about this not being a “story” per-se but just a “task”. However I prefer to call it a story because our “Hello, World!” realization should have a grounding in the real world, even if it is more abstract than what the end-user will eventually receive.

[2] There is an assumption here that we’ve already decided we cannot or do not want to solve the dependent features in different ways, probably because it would be far more costly (in the long run) than briefly delaying them by building a common pillar.

December 7, 2018

Overthinking is not Overengineering

Chris Oldwood from The OldWood Thing

As the pendulum swings ever closer towards being leaner and focusing on simplicity I grow more concerned about how this is beginning to affect software architecture. By breaking our work down into ever smaller chunks and then focusing on delivering the next most valuable thing, how much of what is further down the pipeline is being factored into the design decisions we make today?

Wasteful Thinking

Part of the idea behind being leaner is an attempt to reduce the waste caused by speculative requirements, which has led many a project in the past into a state of “analysis paralysis” where they can’t decide what to build because the goalposts keep moving. By focusing on delivering something simpler much sooner we begin to receive some return on our investment earlier and also shape the future based on practical feedback from today, rather than trying to guess what we need.

When we’re building those simpler features that sit nicely upon our existing foundations we have much less need to worry about the cost of rework from getting it wrong as it’s unlikely to be expensive. But as we move from independent features to those which are based around, say, a new “concept” or “pillar” we should spend a little more time looking further down the backlog to see how any design choices we make might play out later.

Thinking to Excess

The term “overthinking” implies that we are doing more thinking than is actually necessary; trying to fit everyone’s requirements in and getting bogged down in analysis is definitely an undesirable outcome of spending too much time thinking about a problem. As a consequence we are starting to think less and less up-front about the problems we solve to try and ensure that we only solve the problem we actually have and not the problems we think we’ll have in the future. Solving those problems that we are only speculating about can lead to overengineering if they never manage to materialise or could have been solved more simply when the facts were eventually known.

But how much thinking is “overthinking”? If I have a feature to develop and only spend as much effort thinking as I need to solve that problem then, by definition, any more thinking than that is “overthinking it”. But not thinking about the wider picture is exactly what leads to the kinds of architecture & design problems that begin to hamper us later in the product’s lifetime, and later on might not be measured in years but even in days or weeks if we are looking to build a set of related features that all sit on top of a new concept or pillar.

The Horizon

Hence, it feels to me that some amount of overthinking is necessary to ensure that we don’t prematurely pessimise our solution and paint ourselves into a corner. We should factor work further down the backlog into our thoughts to help us see the bigger picture and work out how we can shape our decisions today to ensure it biases our thinking towards our anticipated future rather than an arbitrary one.

Acting on our impulses prematurely can lead to overengineering if we implement what’s in our thoughts without having a fairly solid backlog to draw on, and overengineering is wasteful. In contrast a small amount of overthinking – thought experiments – is relatively cheap and can go towards helping to maintain the integrity of the system’s architecture.

One has to be careful quoting old adages like “a stitch in time saves nine” or “an ounce of prevention is worth a pound of cure” because they can send the wrong message and lead us back to where we were before – stuck in The Analysis Phase [1]. That said I want us to avoid “throwing the baby out with the bathwater” and forgetting exactly how much thinking is required to achieve sustained delivery in the longer term.

[1] The one phrase I always want to mean this is “think globally, act locally” because it sounds like it promotes big picture thinking while only implementing what we need today, but that’s probably stretching it too far.

November 12, 2018

Feeling Isolated

Chris Oldwood from The OldWood Thing

By and large I think I’ve been fairly lucky with my time as a contract programmer. Virtually all the teams I’ve worked in and systems I’ve worked on have been pretty decent. None of them are going to change the world but they’ve been enjoyable, which is probably why I’ve ended up working on them for a decent length of time [1].

I can only say “virtually all” because one contract sadly fell way short of the mark. Although I was technically part of a team it only really felt that way from a managerial perspective, even though we shared a codebase. I felt somewhat isolated both physically and mentally. Aside from the morning stand-up I could easily have gone the rest of the day without speaking to my teammates if I had chosen to do so.

Physical Isolation

I started the contract on a separate floor from the rest of my team with a couple of other recent joiners [2]. We were the only people on that floor with the air conditioning on full blast so we had to wear our coats in the afternoon to stay warm. None of the rest of my team had an office pass that could access the floor either, should they want to talk face-to-face while getting us up to speed.

Even when they moved us onto the same floor a month later we were still on the opposite side of the room. In the next desk shuffle I got to swap colleagues although they were working on an entirely separate area of the system with a totally different bunch of people so we had little need to collaborate per-se, only to make small talk. Also the two desks next to me only seemed to be used for a game of Tower of Hanoi by the office movers given how the occupants came and went.

Even my “customer”, at least, the one I knew about, because they were paying for the project, was situated in a different country and spoke a different language. Although their English was way better than any knowledge I have of a second language I quickly discovered why most communication was via email or IM instead of vocally.

Project Isolation

Being an enterprise scale organisation the work was all about projects, and who was sponsoring how many “resources”. Nowhere was this more apparent than the Scrum Board with its project-oriented swim-lanes. Each swim-lane had the names of the team members assigned to that project, and as the stand-up proceeded it walked down the board a project at a time with each member of the sub-team providing an update.

It was fairly apparent right from the moment I started, just by reading the body language of the team members, that there was often little real interest in what the rest of the team was doing. Those that did, cut across projects to some degree because they tended to nurse the build system, deployments and monitoring. A couple of team members never attended our stand-up because they already attended a different one that encompassed their project.

To be fair some of the apathy at the stand-up was almost certainly down to its excessive length. And with little reason for attending except to provide a status update for the managers it’s no surprise those mostly on the periphery zoned out. Sometimes the only common goal of the team seemed to be to not break the system.

Code Isolation

During my short stint I effectively had one feature to work on. There were a couple of other minor tweaks to begin with but ultimately my project was one feature (nay, user story) and it took 5 months to deliver. That one feature involved making a change in an area of the codebase that nobody else knew except one of the tech leads who I soon discovered was leaving. In fact, taking away his days off after the announcement of his departure, I effectively had 3 days for any handover.

Not only were there no docs to work from there were no tests either. The only real knowledge about how any of the service was expected to behave had left firmly inside the head of the author. This pretty much just left doing a spot of software archaeology with the VCS in the hope that the commit messages might contain some extra clues. Many features had been tracked in a feature tracking tool but there were not enough licenses to go round so I had to hassle a teammate to look things up. Even then it often wasn’t worth it as there were no useful details; it felt like the ticket was just there to “tick a box”.

The code relied heavily on the caller “doing the right thing” so any understanding only made sense if you already knew what the caller was supposed to do, and that relied heavily on knowledge of the problem domain and the organisation’s other systems. (At the interview I made it perfectly clear that I still knew little about the problem domain, despite the many years I have worked in it [3].)

Methodology Isolation

Ever since I had my epiphany [4] around testing all those years ago I have become a firm believer in TDD and automated testing as the preferred approach to the sustainable delivery of quality software. Being told early in the project that “you won’t have time to write tests”, despite being asked in the interview about what your approach is, did not bode well.

It soon became apparent that the previous approach had been to rush something out and rely on manual, end-to-end testing and the customer doing things “right”. Validation was almost entirely left to the underlying maths library and so bizarre errors manifested and needed investigating by the developers due to a lack of basic error handling and reporting [5].

With no way of knowing if I had broken anything, because I didn’t know for sure what anything was supposed to do, my only recourse was to write new code with tests and then refactor later when someone (potentially me) could be sure that it was safe to do so. For existing code that I had to change or understand I would write a barrage of tests first to try and ensure I didn’t accidentally break anything. In some cases it was hard to know what was “by design” and what was “by accident”.

Clearly not everyone took this approach, as you can see in “It Compiles, Ship It!”. My pessimism paid off though once the edge cases and little extras started appearing as I could turn around a fix or improvement (safely) in minutes due to my suite of automated unit and regression tests.

Environment Isolation

Sadly, despite my ability to push through changes quickly into the integration test environment, it still took weeks for them to actually appear in the production environment. When my first task, a handful of lines of boilerplate code, took 6 weeks to make it into production I assumed continuous delivery was not something they cared about.

On the contrary, for one aspect of the business, releases were very frequent. It was just that I was on the other side and due to some (IMHO) poor architecture and deployment decisions my part of the distributed system was tightly-coupled to another (major) system’s release cycle.

While it might seem great having my own integration test environment to play with, I ran into issues no one else knew about and I had no idea who was really using it and for what. Once again that information pretty much departed with the author.

Parting Thoughts

On reflection I have to look at my own behaviour first and ask myself whether I was at least partly responsible for feeling left out. Once we moved onto the same floor it was definitely easier to wander over and ask people questions, which I did. However when the response is “well I worked all this out by myself originally” and “that’s more than anyone ever gave me” I think it’s not entirely unfair to assume that knowledge sharing isn’t high on some people’s agenda.

I believe I was as welcoming as I normally am and was happy to help out where possible, given the limited knowledge I had acquired. I guess that culturally there was such a large drive for autonomy that the idea of just chatting about stuff to see what improvements in the system or process would be beneficial just wasn’t on the cards. A couple of times what should have been a constructive comment or question definitely came out of me more as a snide remark which is never a good sign. I’ve been trying hard to be more aware of any sarcasm, which unfortunately comes all too easily to me, and so not add to any unnecessary negativity but I know I failed a few times.

Ultimately I think it says a lot about an organisation that rejects your approach because “they are not a start-up” when your application of that approach has only ever been in large enterprises and none of them has ever had an issue with it before. On the contrary they have often been grateful for the insights and improvements that I’ve brought.

Maybe if I was a lot younger I’d not have known any better and stuck it out a bit more but these days I know it’s just not worth the effort. I feel comfortable that I left the place in a better state than I joined it by documenting various things and writing tests for the code I wrote. After a slightly rocky start my customer seemed pretty pleased with everything I delivered, which I guess is largely what matters most.

As ever, my main regret is leaving behind some people that I wish I could have gotten to know better. Maybe I will, in another life, one where the benefits of collaboration are more positively encouraged.

[1] Mostly my tenure has been measured in years, not months.

[2] Only one of which was left when I called it a day – the other two barely lasted a month or so.

[3] See “Problem Domain Expert or Technical Expert or Even Both” for more on this recurring theme.

[4] See “My [Unit] Testing Epiphany” and my more recent ACCU / Agile on the Beach talk “A Test of Strength” for what led to my enlightenment.

[5] Poor error messages are a popular topic of mine; see “Terse Exception Messages”. Also “The Perils of DateTime.Parse()” covers one specific example.

November 8, 2018

Proxy Weirdness – Socket Closed on 404

Chris Oldwood from The OldWood Thing

While investigating the issue that led to the discovery of the strange default behaviour of the .Net HttpClient class which I wrote up in “Surprising Defaults – HttpClient ExpectContinue” we also unearthed some other weirdness in a web proxy that sat between our on-premise adapter and our cloud hosted service.

Web proxies are something I’ve had cause to complain about before (see “The Curse of NTLM Based HTTP Proxies”) as they seem to interfere in unobvious ways and the people you need to consult with to resolve them are almost always out of reach [1]. In this particular instance nobody we spoke to in the company’s networks team knew anything about it, and it’s often hard to establish whether the culprit is your on-premise proxy or one of the other intermediaries that sit between you and the endpoint.

The Symptoms

Whilst trying to track down where the “Expect: 100-Continue” header was coming from, as we didn’t initially believe it was from our code, we ran a WireShark trace to see if we could capture the traffic from, and to, our box. What was weird in the short trace that we captured was that the socket looked like it kept closing after every request. Effectively we sent a PUT request, the response would come back, and immediately afterwards the socket would be closed (RST).

Naturally we put this on the yak stack. Sometime later when checking the number of connections to the TIBCO server I used the Sysinternals’ TCPView tool to see what the service was doing and again I noticed that sockets were being opened and closed repeatedly. As we had 8 threads concurrently processing the message queue it was easy to see 8 sockets open and close again as in TCPView they go green on creation and red briefly on termination.

At least, that appeared to be true for the HTTP requests which went out to the cloud, but not for the HTTP requests that went sideways to the internal authentication service. However they also had an endpoint hosted in the cloud which our cloud service used and we didn’t see that behaviour with them (i.e. cloud-to-cloud), or when we re-configured our on-premise service to use it either (i.e. on-premise-to-cloud). This suggested it was somehow related to our service, but how?

The HttpClient we were using for both sets of requests was the same [2] and so we were pretty sure that it wasn’t our fault, this time, although as the old saying goes “once bitten, twice shy”.

Naturally when it comes to working with HTTP one of the main diagnostic tools you reach for is CURL and so we replayed our requests via that to see if we could reproduce it with a different (i.e. non-.Net based) technology.

Phased Switchover

While the service we were writing was new, it was intended to replace an existing one and so part of the rollout plan was to phase it in slowly. This meant that all reads and writes would go to both versions of the service but only the one where any particular customer’s data resided would succeed. The consumers of the service would therefore get a 404 from us if the data hadn’t been migrated, which in the early stages of development applied to virtually every request.
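The dual-dispatch arrangement described above might look something like the following minimal Python sketch. It is only an illustration under stated assumptions: the `make_service` stand-ins, the `send_to_both` helper and the customer names are all hypothetical, and the real services were .Net based.

```python
from typing import Callable, NamedTuple, Set

NOT_FOUND = 404


class Response(NamedTuple):
    status: int
    body: str


Service = Callable[[str], Response]


def make_service(customers_held: Set[str]) -> Service:
    """Hypothetical stand-in for one version of the service."""
    def handle(customer_id: str) -> Response:
        if customer_id in customers_held:
            return Response(200, "data for " + customer_id)
        # This version does not hold the customer's data.
        return Response(NOT_FOUND, "")
    return handle


def send_to_both(customer_id: str, old: Service, new: Service) -> Response:
    """Send the request to both versions and keep the reply from the owner."""
    old_reply = old(customer_id)
    new_reply = new(customer_id)
    # Prefer the new service once it holds the data; otherwise fall back.
    return new_reply if new_reply.status != NOT_FOUND else old_reply


old_service = make_service({"alice", "bob"})  # not yet migrated
new_service = make_service({"carol"})         # already migrated
```

With most customers still unmigrated, nearly every call to `new_service` in this sketch returns a 404, which mirrors the early-stage traffic pattern described above.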

A few experiments later to compare the behaviour for requests of migrated data versus unmigrated data and we had an answer. For some reason a proxy between our on-premise adapter and our web hosted service endpoint was injecting a “Connection: Close” header when a PUT or DELETE [3] request returned a 404. The HttpClient naturally honoured the response and duly closed the underlying socket.

However it did not have this behaviour for a GET or HEAD request that returned a 404 (I can’t remember about POST). Hence the reason we didn’t see this behaviour with the authentication service was because we only sent GETs, and anyway, they returned a 200 with a JSON error body instead of a 404 for invalid tokens [4].
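To illustrate why an injected header causes so much socket churn, here is a small Python sketch of the client-side rule: an HTTP/1.1 connection is persistent by default and is only dropped when a “Connection” header carries the “close” token. The function names are hypothetical and this is a simulation of the behaviour, not the HttpClient implementation.

```python
from typing import Iterable, Mapping


def must_close(response_headers: Mapping[str, str]) -> bool:
    """Return True if an HTTP/1.1 response asks for the connection to be closed."""
    for name, value in response_headers.items():
        # Header names and the "close" token are both case-insensitive.
        if name.lower() == "connection":
            tokens = [token.strip().lower() for token in value.split(",")]
            if "close" in tokens:
                return True
    return False


def count_sockets_opened(responses: Iterable[Mapping[str, str]]) -> int:
    """Count the sockets a client needs for a series of sequential requests."""
    opened = 0
    connected = False
    for headers in responses:
        if not connected:
            opened += 1       # a new socket is needed for this request
            connected = True
        if must_close(headers):
            connected = False  # the client honours the header and closes
    return opened
```

Three responses that each carry “Connection: Close” need three sockets in this model, whereas three responses without it share one, which is the pattern that showed up in TCPView.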

Epilogue

I wish I could say that we tracked down the source of the behaviour and provide some closure but I can’t. The need for the on-premise adapter flip-flopped between being essential and merely a performance test aid, and then back again. The issue remained as a product backlog item so we wouldn’t forget it, but nothing more happened while I was there.

We informed the network team that we were opening and closing sockets like crazy, which these days with TLS is somewhat more expensive and therefore would generate extra load, but had to leave that with them along with an offer of help if they wanted to investigate it further, as much for our own sanity.

It’s problems like these which cause teams to deviate from established conventions because ultimately one is within their control while the other is outside it and the path of least resistance is nearly always seen as a winner from the business perspective.

[1] I’m sure they’re not hidden on purpose but unless you have a P1 incident it’s hard to get their attention as they’re too busy dealing with actual fires to worry about a bit of smoke elsewhere.

[2] The HttpClient should be treated as a Singleton and not disposed per request, which is a common mistake.

[3] See “PUT vs POST and Idempotency” for more about that particular choice.

[4] The effects of this style of API response on monitoring and how you need to refactor to make the true outcome visible are covered in my recent Overload article “Monitoring: Turning Noise into Signal”.

October 30, 2018

Always Reply Within Your SLA – Succeed or Abort

Chris Oldwood from The OldWood Thing

Way back in 2012 I wrote the blog post “Service Providers Are Interested In Your Timeouts Too” about how you can help service teams understand your intentions so that they can handle requests more efficiently. That was written at a time when I had been working for many years on internal systems where there were no real SLAs per-se, often just a “best efforts” approach with manual intervention required to “unblock” the system when the failures start occurring [1]. In contrast I have always strived to create self-healing systems as much as possible so that only truly remarkable events require any kind of human remediation.

In more recent years I’ve spent far more time working on web services where there is a much stronger notion of an SLA and therefore a much higher probability that if you fail to meet your SLA then the client will attempt to perform some kind of recovery rather than hang around and wait for the reply [2]. Hence what I wrote about wasting resources on dead requests in that earlier blog post has started to become more significant.

Deadlines

A consequence of this ideology is that I’ve started to become far more interested in the approach of always responding within the SLA even if that means aborting mid-request. Often an SLA is seen as an aspiration rather than any kind of hard deadline, something which we hope to achieve more often than not, where “more often” usually involves quoting some (arbitrary) number of “nines”. For those requests that fall outside this magical number all bets are off and you might get an answer in a useful timeframe or you might not. This kind of uncertainty has always bothered me as a client consumer.

Hence, I’ve started moving towards building services that always provide a reply within the SLA whether or not the request has been satisfied. Instead of tying up valuable resources in the hope that when the answer finally arrives the client still has a vested interest in it, I’d prefer to just abandon the request and let the client know the SLA would be violated if it had continued servicing it. In essence the request times-out server-side, where the time-out is the SLA.

What this means for the client is that they have a definitive reply (network issues notwithstanding) to their request within the time limit allowed. More importantly if they want to allow more time to handle the request than the SLA allows for then they need to tell the service that they’re willing to wait. Essentially this creates a priority system and allows the service to decide what to do with requests that are happy to hang around for a bit longer.

Mechanics

Implementation-wise what this mostly boils down to is ensuring that every non-trivial piece of work (think: database query, network call, disk read, etc.) is made with a bounded call time, i.e. one where a timeout can be provided so that the caller always regains control in a timely fashion. Similarly we don’t start any work that we suspect we can’t finish in time either. This generally manifests as aborting on the first timeout because the first call is usually given the entire SLA and therefore, once it expires, you’re never going to recover in time.

Internally the maximum timeout starts with the SLA and as each background query is sent it is timed and the timeout gets progressively shorter [3]. As the load increases and internal queries take longer the chance of a request aborting rises but at least the load on the upstream systems doesn’t keep rising too. Ultimately it’s just a classic negative feedback loop.
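The shrinking timeout described above can be sketched as a small deadline object that is threaded through each step of a request. This is a minimal Python illustration rather than the service's actual code: `Deadline`, `SlaExceeded`, `handle_request`, `ManualClock` and `make_query` are hypothetical names, and the manual clock exists only so that the behaviour can be shown without real waiting.

```python
import time


class SlaExceeded(Exception):
    """Raised when a request cannot be completed within the SLA."""


class Deadline:
    """Tracks how much of the SLA is left for the current request."""

    def __init__(self, sla_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + sla_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())


def handle_request(deadline, steps):
    """Run each step with whatever budget is left; abort on the first timeout."""
    results = []
    for step in steps:
        budget = deadline.remaining()
        if budget <= 0.0:
            # Don't start work we already know we can't finish in time.
            raise SlaExceeded("no time left to start the next step")
        try:
            results.append(step(budget))  # each step must honour its timeout
        except TimeoutError:
            raise SlaExceeded("a step used up the remaining budget")
    return results


class ManualClock:
    """Test clock (an assumption of this sketch) that only moves when told to."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def make_query(clock, cost):
    """Simulated bounded call: takes `cost` seconds or times out first."""
    def run(timeout):
        if cost > timeout:
            clock.now += timeout
            raise TimeoutError()
        clock.now += cost
        return timeout  # report the budget this call was given
    return run
```

With a one second SLA, two simulated queries costing 0.25s and 0.5s are given budgets of 1.0s and 0.75s respectively; swap the first for a 0.75s query and the second one times out and the whole request aborts exactly at the SLA.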

Limitations

Unfortunately what makes implementing this somewhat less than ideal is that we still don’t really have cancellable requests in many frameworks and you’re never entirely sure what happens when the timeout triggers. If the underlying operation is abandoned, but has to complete anyway because it can’t be cancelled, you may not be much better off. The modern async-enhanced programming world is great for avoiding tying up threads in the happy path but once you start considering the failure modes it’s much harder to reason about and, more importantly, control what’s going to happen. Despite the fact that under the covers the world of I/O has practically always been asynchronous the higher layers still assume a synchronous model with syntactic sugar only helping to reinforce that perspective.

So far I don’t have nearly enough production-level data points to know if it’s an idea that is truly worth the effort to implement or not. Being able to reject work outright because you’ve already missed the SLA isn’t too onerous but does mean you need to tap into the processing pipeline early before the request is queued in the background to know when the internal clock has started ticking. What’s harder to determine is whether you really get any benefit out of the additional complexity needed to track your request’s progress and if aborting upstream requests creates a more or equally unstable service due to the way the timeouts leave their underlying requests dangling.
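Rejecting work that has already missed the SLA could be sketched as follows, assuming the arrival time can be stamped before the request is queued. `SlaQueue` is a hypothetical name and the in-memory deque merely stands in for whatever the real pipeline uses.

```python
import time
from collections import deque


class SlaQueue:
    """Queue that stamps requests on arrival and drops those already too late."""

    def __init__(self, sla_seconds, clock=time.monotonic):
        self._sla = sla_seconds
        self._clock = clock
        self._pending = deque()
        self.rejected = []

    def enqueue(self, request):
        # The internal clock starts ticking here, before any queueing delay.
        self._pending.append((self._clock(), request))

    def dequeue(self):
        """Return (request, remaining_seconds), or None if nothing is viable."""
        while self._pending:
            arrived_at, request = self._pending.popleft()
            remaining = self._sla - (self._clock() - arrived_at)
            if remaining > 0:
                return request, remaining
            # Already past the SLA: reject outright instead of doing the work.
            self.rejected.append(request)
        return None
```

The remaining time returned by `dequeue` is what would seed the timeout for the first piece of bounded work.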

I still think it’s an approach worth pursuing but I wouldn’t be surprised to find The Morning Paper covering something from decades ago that shows it’s just a fool’s errand :o).

[1] See “Support-Friendly Tooling” for some other examples about how this can play out if reliability out-of-the-box is “assumed”.

[2] In one instance that would mean abandoning the request and potentially taking on some small financial risk on behalf of the customer.

[3] Naturally for parallel / scatter-gather I/O it’s the time of the longest concurrent request.

October 2, 2018

Technical Debt – Conscious Competence

Chris Oldwood from The OldWood Thing

Once upon a time the term Technical Debt seemed to have a very clear meaning but over the last few years that has been diluted to generally mean any crap code or process which is holding back delivery. I’m sure any scholars of Wittgenstein will be at pains to point out that “meaning is use” and therefore if everyone uses it this way who am I to argue?

For me the canonical source of information on the technical debt metaphor comes from the wiki of the person who coined the phrase in the first place – Ward Cunningham. The entry on Technical Debt there suggests to me that the choice to enter into debt is a wholly conscious one, not the unconscious acts of a less professional bunch of programmers.

By way of example I thought I’d take the opportunity to write up one of those occasions where I’ve been involved with taking debt on (in the original spirit of the term) and how we dealt with it, to show where the distinction lies.

The Bug

Soon after going live with v1.0 of a new calculation system in a large financial organisation we discovered that a number of key counterparties were missing from the daily report. The report generator was a late addition and there were various other issues around its development and testing which muddied the waters somewhat but suffice to say that this wasn’t delivered as cleanly as the core system was. (You might consider the more recent meaning of the term to apply here.)

More importantly what transpired was that due to various mergers in the company’s history a few counterparties had the same “unique” code in different back-end systems. This wasn’t just news to my team (we were all recent hires) but also to quite a few people in the business too. Due to only dealing with a limited set of “books” the codes were always unique to them in their context, but our new system cut right across them all.
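To make the collision concrete, here is a small Python sketch that groups counterparty records by code and flags any code that refers to different counterparties depending on the back-end system. The function name, record layout, system names and counterparties are all invented for illustration.

```python
from collections import defaultdict


def find_colliding_codes(records):
    """Return the codes that identify different counterparties in different systems."""
    owners_by_code = defaultdict(set)
    for system, code, counterparty in records:
        owners_by_code[code].add((system, counterparty))
    return {
        code: sorted(owners)
        for code, owners in owners_by_code.items()
        # A collision is one code naming more than one distinct counterparty.
        if len({name for _, name in owners}) > 1
    }


# Hypothetical data: (back-end system, "unique" code, counterparty name).
records = [
    ("SystemA", "C001", "Acme Ltd"),
    ("SystemB", "C001", "Zenith plc"),
    ("SystemA", "C002", "Bravo Inc"),
    ("SystemB", "C002", "Bravo Inc"),
]
```

Within any one system the code is a perfectly good key; it is only when the data is viewed across systems that the pair of system and code is needed to stay unambiguous.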

The Root Cause

The generation of calculations was ultimately based around a Cartesian product of two counterparties, however given that most of those were pointless there was an optimisation which used another source of data to reduce that by more than an order of magnitude.
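As a rough illustration of the shape of that optimisation (not the real implementation, which lived in SQL), the sketch below builds the full Cartesian product of counterparties and then keeps only the pairs that a second data source says are worth calculating. The counterparty names and the `relevant_pairs` set are assumptions made up for the example.

```python
from itertools import product

# Hypothetical counterparties; the real system had thousands.
counterparties = ["A", "B", "C", "D"]

# Hypothetical second data source listing the pairs worth calculating.
relevant_pairs = {("A", "B"), ("C", "D")}


def all_pairs(parties):
    """The naive approach: every counterparty paired with every other."""
    return list(product(parties, repeat=2))


def pruned_pairs(parties, relevant):
    """Keep only the pairs that the second data source considers relevant."""
    return [pair for pair in all_pairs(parties) if pair in relevant]


print(len(all_pairs(counterparties)))                     # 16
print(len(pruned_pairs(counterparties, relevant_pairs)))  # 2
```

The catch, as the rest of this section describes, is that the pruning is only as good as the lookup behind it: a pair that fails to match is dropped without any error.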

This optimisation should have been fairly simple but due to a need to initially use some existing manually managed counterparty data to ease the cutover (so regression testing should then reconcile exactly) it was somewhat more complicated than first envisaged.

Our system was designed to use the correct source of data eventually, but do a reverse lookup for the time being. It might sound simple but the lookup actually involved multiple lookups using combinations of keys that had to make assumptions about which legacy back-end system might hold the related data. The right person who could explain how we could do what we needed to do correctly also seemed elusive; there were many people with “heuristics”, but nobody who knew for sure.

In total there were ~100 counterparties out of a total of ~15,000 permutations that suffered from this problem. Unfortunately a handful of those 100 had a significant effect on the “bottom line” and therefore the usefulness of the system as a whole was in doubt at that point.

Entering Into Debt

Naturally once we unearthed this clanger we had to decide how to tackle it. After getting our heads around what this all meant and roughly where in the code the missing logic probably needed to go we had to make a decision – do we try and fix the underlying issue right away, or try and put a workaround in place (assuming that’s even possible) to mitigate the problem, at least temporarily?

We were all very aware of going down the dark road of putting a tactical fix in place because we’d all seen where that can lead. We had made a concerted effort over the 12 months required to build the system to refactor relentlessly [1] and squash any bugs as soon as possible. This felt like a backwards step.

On the positive side, by adopting a Design for Testability approach in most parts of the code we had extra switches on our processes [2] that allowed us to make per-counterparty requests, usually for diagnostic purposes. Hence the workaround took the form of sticking the list of missing counterparties in a simple text file, then using a command prompt FOR loop [3] to read the file and invoke the tool in “single counterparty mode”. Yes, it was a little slow due to the constant restarting of the process, but it was easy to surgically insert into the workflow with the minimum of testing or risk.
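The original workaround was a command prompt FOR loop in a batch file; the sketch below is a Python stand-in for the same idea rather than the script we actually ran. The tool name and the `--counterparty` switch are hypothetical, as the real command line is not shown here.

```python
import subprocess


def read_counterparties(path):
    """Return the non-blank lines of the text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_command(tool, counterparty):
    """Build the single-counterparty command line.

    The "--counterparty" switch is a made-up name for illustration.
    """
    return [tool, "--counterparty", counterparty]


def run_missing(path, tool, runner=subprocess.run):
    """Invoke the tool once per listed counterparty, one process at a time."""
    commands = [build_command(tool, cp) for cp in read_counterparties(path)]
    for command in commands:
        # check=True stops the loop if any single invocation fails.
        runner(command, check=True)
    return commands
```

Something like `run_missing("missing-counterparties.txt", "calc-tool")` would then slot into the workflow after the main run. Adding another missing counterparty is just a matter of appending a line to the text file, and the `runner` parameter exists only so the loop can be exercised without launching real processes.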

Paying Back the Debt

With the hole plugged for now, and an easy mechanism in place for adding any other missing counterparties – update the text file – we were in a position to sort out the root problem without feeling under pressure to get the system working correctly 100%, ASAP.

As you can probably imagine the real solution wasn’t easy, not least because it was one of a few areas of SQL code that didn’t have any unit tests and was a tangled web of tables and views which had grown organically in an attempt to graft the old and the new worlds together [4].

What Did it Cost?

If we assume that the gung-ho approach would have been to just jump in and start fixing the real code, then what did we lose by not doing that? It’s possible that the final fix was simple and a little more investigation may have led to that solution instead.

In contrast, the risk is that we end up in one of those “have we fixed it or not” scenarios where we spend an indeterminate amount of time being “real close” to getting towards “done”. The old adage about the last 10% also taking 90% of the time springs immediately to mind. Instead we were almost positive we had a simple workaround that could be deployed and get the system running correctly enough in an estimable amount of time. I believe there is a lot of value in having that degree of confidence.

What I think was critical was being able to remove the pressure on finding the right solution as this gave us time to really consider what needed to be done. Any fix done under pressure is not going to be given the attention to detail that it probably deserves. You then run the risk of making the system worse and having an even deeper hole to dig yourself out of.

The customer does not care about strategic versus tactical decisions per se, they just want the thing to work. We cared about the solution because we knew it would be a burden in the short term as everyone had to remember about the bit “grafted on the side”. The general trust the team had built up by keeping quality at the forefront meant that the business would be more willing to trust us to reconcile the problem appropriately when the time came.

Use Language With Care

I really hope the term Technical Debt doesn’t continue to get watered down even further as it’s a powerful concept which is incredibly useful in the right hands. We already have far too many words for “alternate implementation” that are pejoratives carrying an air of unprofessionalism about them. I would like this one to remain in the hands of the professionals so they can continue to have “grown up” conversations with their customers about when it’s appropriate to consider taking shortcuts for a short term business need without them rolling their eyes, yet again.

[1] See “Relentless Refactoring” for more thoughts around this (unfortunately) contentious topic.

[2] “From Test Harness To Support Tool”, “Building Systems as Toolkits” and “In The Toolbox - Home-Grown Tools” all look at the non-functional side of tooling.

[3] A batch file just wouldn’t be complete without a for loop, see “Every Solution Starts With ‘FOR /F’”.

[4] This one feature seemed destined to plague us forever, see “So Many Wrongs, But No Rights” for another tale of woe.

Author: Chris Oldwood
