Validate in Production

Chris Oldwood from The OldWood Thing

The change was reasonably simple: we had to denormalise some postcode data, currently held in a centralised relational database, into some new fields in every client’s database to remove some cross-database joins that would be unsupported on the new SQL platform we were migrating to [1].

As you might imagine the database schema changes were fairly simple – we just needed to add the new columns as nullable strings into every database. The next step was to update the service code to start populating these new fields as addresses were added or edited by using data from the centralised postcode database [2].

At this point any new data or data that changed going forward would have the correctly denormalised state. However we still needed to fix up any existing data and that’s the focus of this post.

Migration Plan

To fix up all the existing client data we needed to write a tool which would load each client’s address data that was missing its new postcode data, look it up against the centralised list, and then write back any changes. Given we were still using the cross-database joins in live for the time being to satisfy the existing reports, we could roll this out in the background and avoid putting any unnecessary load on the database cluster.

The tool wasn’t throw-away because the postcode dataset gets updated regularly and so the denormalised client data needs to be refreshed whenever the master list is updated. (This would not be that often but enough to make it worth spending a little extra time writing a reusable tool for the job for ops to run.)

Clearly this isn’t rocket science, it just requires loading the centralised data into a map, fetching the client’s addresses, looking them up, and writing back the relevant fields. The tool only took a few hours to write and test and so it was ready to run for the next release during a quiet period.
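Purely for illustration, the core of such a tool might look something like the sketch below – the types, field names and data access are hypothetical, not the actual code:

    using System.Collections.Generic;

    // Hypothetical types, purely to illustrate the shape of the fix-up tool.
    record Postcode(string Code, string County, string Region);
    class Address { public string Code = ""; public string? County; public string? Region; }

    static class PostcodeFixUp
    {
        // Load the centralised postcode list into a map, then fill in the new
        // denormalised fields for any client address that is missing them.
        public static void Run(IEnumerable<Postcode> masterList, IEnumerable<Address> clientAddresses)
        {
            var lookup = new Dictionary<string, Postcode>();
            foreach (var postcode in masterList)
                lookup[postcode.Code] = postcode;

            foreach (var address in clientAddresses)
            {
                if (address.County != null)
                    continue; // already populated

                if (lookup.TryGetValue(address.Code, out var match))
                {
                    address.County = match.County; // write back the relevant fields
                    address.Region = match.Region;
                }
                // Unknown postcodes (typos, stale data) are simply left as they are.
            }
        }
    }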

When that moment arrived the tool was run across the hundreds of client databases and plenty of data was fixed up in the process, so the task appeared to be complete.

Next Steps

With all the existing postcode data now correctly populated too we should have been in a position to switch the report generation feature toggle on so that it used the new denormalised data instead of doing a cross-database join to the existing centralised store.
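The toggle itself need not be anything clever. A minimal sketch of the idea – with all the names invented for illustration, not taken from the real system – is simply a branch between the two data paths:

    // Until the toggle is flipped the reports keep using the legacy
    // cross-database join, so the background migration carries no
    // customer-facing risk; flipping it switches to the new columns.
    public class ReportGenerator
    {
        private readonly bool _useDenormalisedPostcodes;

        public ReportGenerator(bool useDenormalisedPostcodes) =>
            _useDenormalisedPostcodes = useDenormalisedPostcodes;

        public Report Generate() =>
            _useDenormalisedPostcodes
                ? GenerateFromDenormalisedColumns()   // new columns, local data only
                : GenerateViaCrossDatabaseJoin();     // existing behaviour

        private Report GenerateFromDenormalisedColumns() => new Report();
        private Report GenerateViaCrossDatabaseJoin() => new Report();
    }

    public class Report { }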

While the team were generally confident in the changes to date, I suggested we should do a sanity check and make sure that everything was working as intended, as I felt this was a reasonably simple check to run.

An initial SQL query someone knocked up just checked how many of the new fields had been populated and the numbers seemed about right, i.e. very high (we’d expect some addresses to be missing data due to missing postcodes, typos and stale postcode data). However I still felt that we should be able to get a definitive answer with very little effort by leveraging the existing SQL we were about to discard, i.e. use the cross-database join one last time to verify the data population more precisely.

Close, but No Cigar

I massaged the existing report query to show where data from the dynamic join was different to that in the new columns that had been added (again, not rocket science). To our surprise there were quite a significant number of discrepancies.

Fortunately it didn’t take long to work out that those addresses which were missing postcode data all had postcodes which were at least partially written in lowercase whereas the ones that had worked were entirely written in uppercase.

Hence the bug was fairly simple to track down. The tool loaded the postcode data into a dictionary (map) keyed on the string postcode and did a straight lookup which is case-sensitive by default. A quick change to use a case-insensitive comparison and the tool was fixed. The data was corrected soon after and the migration verified.
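In .NET terms, and applied to the earlier sketch, the fix amounts to passing a case-insensitive comparer when the dictionary is built – something along these lines rather than the actual code:

    // The default comparer matches keys case-sensitively, so "sw1a 1aa"
    // never matched "SW1A 1AA". Supplying a case-insensitive comparer
    // when constructing the dictionary is the one-line fix.
    var lookup = new Dictionary<string, Postcode>(StringComparer.OrdinalIgnoreCase);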

Why didn’t this show up in the initial testing? Well, it turned out the tools used to generate the test data sets and to anonymise real client databases were somewhat simplistic, and this helped to provide a false level of confidence in the new tool.

Testing in Production

Whenever we make a change to our system it’s important that we verify we’ve delivered what we intended. Oftentimes the addition of a feature has some impact on the front-end and the customer and therefore it’s fairly easy to see if it’s working or not. (The customer usually has something to say about it.)

However back-end changes can be harder to verify thoroughly, but it’s still important that we do the best we can to ensure they have the expected effect. In this instance we could easily check every migrated address within a reasonable time frame and know for sure, but on large data sets this might not be feasible so you might have to settle for less. Also the use of feature switches and incremental delivery meant that even though there was a bug it did not affect the customers and we were always making forward progress.

Testing does not end with a successful run of the build pipeline or a sign-off from a QA team – it needs to work in real life too. Ideally the work we put in up-front will make that more likely but for some classes of change, most notably where actual customer data is involved, we need to follow through and ensure that practice and theory tie up.

 

[1] Storage limitations and other factors precluded simply moving the entire postcode database into each customer DB before moving platforms. The cost was worth it to de-risk the overall migration.

[2] There was no problem with the web service having two connections to two different databases, we just needed to stop writing SQL queries that did cross-database joins.

Feature Branches and Package Dependencies

Chris Oldwood from The OldWood Thing

For most of my programming career I have worked directly on the main integration branch (aka trunk / master) for day-to-day development. Release branches have featured occasionally at various clients, mostly to compensate for bureaucracy, and I once had the misfortune to work with project-level branches (see “The Cost of Long-Lived Feature Branches”) which was a merge nightmare. More recently I got to work in a team that used much shorter lived feature branches [1] and I got a reminder of the kinds of problems even they cause.

When the branch is confined to a single repository and that repo is for the delivered product then the only people who are affected by the changes are those working on the branch. (We’re not talking about the Cost of Delay here for the customer, this is about intra-team delays.) However once we start making changes outside the main repository, such as in our library repos (nay, packages), things get more complicated.

Changing Packages

Although in theory we could create a similar branch in the library repo the point of integration between the two codebases (product and library) is usually at a binary level, i.e. the package repository. The package manager almost certainly doesn’t deal in branches per-se, only published versions of a package [2].

Where things get even more complicated is when you have a few packages that all make use of some 3rd party library, e.g. a message queuing product, and you discover you need to upgrade that as part of your feature work too. If you go via the normal channels you’ll end up upgrading the dependency in the packages, publishing them, and then pulling those upgraded packages into your feature branch before carrying on where you left off [3].

Dependent Changes

However, anyone working on another feature branch or even the trunk can no longer make an orthogonal change to those packages because pulling them in would likely create an impedance mismatch, unless they also duplicate the integration work done on the feature branch or cherry pick it. Ideally that would be a trivial merge but the very nature of feature branches is to work in isolation and therefore changes tend to get intertwined because the focus is on the feature itself, not the integration steps in-between.

Essentially until the new package versions are fully integrated any changes to them will be delayed, which if you’ve already started work means a blocker goes up on the board. In this instance, being used to trunk based development, I didn’t want to wait so instead reverted the upgrade to the 3rd party library, published it (with my changes), and then integrated it into the main product directly on the trunk.

Unfortunately this creates an extra burden for those on the feature branch as they will need to re-upgrade the package again before integrating their changes back into the trunk. Such is the price for working in isolation.

Small is Beautiful

One of the benefits of working in an Agile way using trunk based development is that it teaches you to focus on really small, but nonetheless valuable changes. A user story may be split across a number of commits and so we have to think about the way we’ll deliver the changes. Feature toggles help us to hide our work in progress but, as just described, occasionally we may need to make a more sweeping change.

In these scenarios we should be able to push the feature work temporarily onto the “stack”, make the package changes, and then pop the feature work and carry on. With trunk based development the work is woven in just like any other, but when using a feature branch you need to switch back to the trunk, perform the upgrade, then merge trunk into the feature branch before carrying on.

I believe that if you are able to make this work smoothly using a feature branch then you are almost certainly capable of making the changes directly on the trunk in the first place. Planning doesn’t stop at the release and sprint level; we also need to plan how we evolve the codebase at the story and task level to minimise disruption to others whilst still making progress on our own work.

 

[1] The branches only lasted a few days and the team was trying to move away from them where possible.

[2] In theory this is where semantic versioning comes into play as the breaking change demands a new major version. My change would then have been made for both major versions. I say “in theory” because in my experience this is not an approach commonly taken by enterprise development teams for internal libraries – the path of least resistance usually wins.

[3] Alternatively you may be able to hide the library’s dependency, assuming it’s backwardly compatible, by binding to it directly such as via a static library or merging assemblies. However, as with [2], it’s not the usual approach.

“Hello World” Stories

Chris Oldwood from The OldWood Thing

I’ve always tried really hard to fight against “technical stories”. These are supposedly user stories but are really framed as the solution to a problem and amount to little more than technical tasks. In “Turning Technical Tasks Into User Stories” I looked at how it’s often possible to elevate these from an obvious solution back up to the problem which needs to be solved. At this point you may discover there are other, hopefully cheaper, solutions to the problem which have been missed in the original analysis, either because things have changed or because different people are doing the thinking.

On the flip-side there are occasionally times where, after having looked at a few related stories, it’s apparent that they all require the same underlying mechanism to work. One common solution to this is to bulk up the first story with the technical work and let the rest flow through as normal. This way you have no technical work on your backlog per-se as it’s all hidden in the stories.

Transparency

What I don’t like about this approach is that one story arbitrarily gets hit with a load of extra work, which, if you’re using historical data to stick a finger in the air for estimation of similar work later, skews the average somewhat. It also means that from a visibility perspective one story takes longer while the mechanism is being built.

One way I’ve found to address this has been to pull out the bare bones of the technical work into a “Hello, World!” story [1]. This story is framed around building the skeleton of the mechanism that will be used to drive the implementation of the subsequent features. The aim is to keep the scope minimal enough that we avoid speculating while still delivering something which stands on its own two feet and remains clearly visible on the board.

Value Proposition

While the value to the end-user is in the eventual feature, the value in the mechanism is proving to the development team that the basic approach seems sound. With the skeleton built, the idiosyncrasies around each individual feature can then be dealt with appropriately at the right time and accounted for in the usual way.

To be clear this is not about doing a spike or building a prototype, although that may have happened earlier to gain the knowledge needed to undertake this piece of work. No, here we’re talking about building the bare bones of a real mechanism along with the most basic feature possible.

The reason I’ve called these “Hello World” Stories is probably self-evident: it alludes to the classic program many have chosen as their first – writing “Hello, World!” to the console. In this context the name is intended to conjure up simplicity and remind us that what we’re doing is delivering the minimum required to make the platform viable. We probably won’t literally write “Hello, World!” to the console, but it may be a log message instead that we can then observe and monitor, or a message on a queue that we can see discarded. Essentially whatever we can do to make its effects observable without wasting any real effort or leaving it partially complete.
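By way of a hypothetical example (none of the names here come from a real codebase), the deliverable for such a story might be no more than the real pipeline wired end-to-end with the most trivial observable behaviour:

    // The skeleton mechanism a "Hello, World!" story might deliver: the real
    // message handling pipeline wired up end-to-end, with the simplest
    // observable behaviour possible. Later stories replace the trivial body
    // with genuine features.
    public interface ILogger
    {
        void Info(string message);
    }

    public class SkeletonMessageHandler
    {
        private readonly ILogger _log;

        public SkeletonMessageHandler(ILogger log) => _log = log;

        public void Handle(string message)
        {
            // Just enough behaviour to prove the message flows through the
            // mechanism and that its effect can be observed (e.g. in the logs).
            _log.Info($"Hello, World! Received: {message}");
        }
    }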

Based on the classic INVEST acronym we should strive to make every unit of work: Independent, Negotiable, Valuable, Estimable, Small and Testable. By splitting it out from one of the arbitrary features it becomes more independent, negotiable, estimable and small which can be useful should short-term priorities change. And by extending the scope from a pure mechanism just a little bit further to the most trivial feature possible we make it more testable from a technical perspective, even if not from a product viewpoint. Most importantly, however, is it valuable in its own right? I think sometimes splitting the mechanism out gives value by making the I, N, E, S and T more tangible. In particular breaking work down into smaller deliverable units is often the most valuable practice even if occasionally the end-user has nothing initially to show for it.

Ultimately, I guess, I can’t ever remember anyone complaining they had broken their work down into pieces that were so small they were too visible.

 

[1] I’m sure there is an argument about this not being a “story” per-se but just a “task”. However I prefer to call it a story because our “Hello, World!” realization should have a grounding in the real world, even if it is more abstract than what the end-user will eventually receive.

[2] There is an assumption here that we’ve already decided we cannot or do not want to solve the dependent features in different ways, probably because it would be far more costly (in the long run) than briefly delaying them by building a common pillar.

Overthinking is not Overengineering

Chris Oldwood from The OldWood Thing

As the pendulum swings ever closer towards being leaner and focusing on simplicity I grow more concerned about how this is beginning to affect software architecture. By breaking our work down into ever smaller chunks and then focusing on delivering the next most valuable thing, how much of what is further down the pipeline is being factored into the design decisions we make today?

Wasteful Thinking

Part of the thinking behind being leaner is an attempt to reduce the waste caused by speculative requirements, which has led many a project in the past into a state of “analysis paralysis” where they can’t decide what to build because the goalposts keep moving. By focusing on delivering something simpler much sooner we begin to receive some return on our investment earlier and also shape the future based on practical feedback from today, rather than trying to guess what we need.

When we’re building those simpler features that sit nicely upon our existing foundations we have much less need to worry about the cost of rework from getting it wrong as it’s unlikely to be expensive. But as we move from independent features to those which are based around, say, a new “concept” or “pillar” we should spend a little more time looking further down the backlog to see how any design choices we make might play out later.

Thinking to Excess

The term “overthinking” implies that we are doing more thinking than is actually necessary; trying to fit everyone’s requirements in and getting bogged down in analysis is definitely an undesirable outcome of spending too much time thinking about a problem. As a consequence we are starting to think less and less up-front about the problems we solve, to try and ensure that we only solve the problem we actually have and not the problems we think we’ll have in the future. Solving problems that we are only speculating about can lead to overengineering if they never manage to materialise or could have been solved more simply when the facts were eventually known.

But how much thinking is “overthinking”? If I have a feature to develop and only spend as much effort thinking as I need to solve that problem then, by definition, any more thinking than that is “overthinking it”. But not thinking about the wider picture is exactly what leads to the kinds of architecture & design problems that begin to hamper us later in the product’s lifetime, and “later” might not be measured in years but in days or weeks if we are looking to build a set of related features that all sit on top of a new concept or pillar.

The Horizon

Hence, it feels to me that some amount of overthinking is necessary to ensure that we don’t prematurely pessimise our solution and paint ourselves into a corner. We should factor work further down the backlog into our thoughts to help us see the bigger picture and work out how we can shape our decisions today to ensure it biases our thinking towards our anticipated future rather than an arbitrary one.

Acting on our impulses prematurely can lead to overengineering if we implement what’s in our thoughts without having a fairly solid backlog to draw on, and overengineering is wasteful. In contrast a small amount of overthinking – thought experiments – is relatively cheap and goes towards helping to maintain the integrity of the system’s architecture.

One has to be careful quoting old adages like “a stitch in time saves nine” or “an ounce of prevention is worth a pound of cure” because they can send the wrong message and lead us back to where we were before – stuck in The Analysis Phase [1]. That said, I want us to avoid “throwing the baby out with the bathwater” and forgetting exactly how much thinking is required to achieve sustained delivery in the longer term.

 

[1] The one phrase I always want to mean this is “think globally, act locally” because it sounds like it promotes big picture thinking while only implementing what we need today, but that’s probably stretching it too far.

Feeling Isolated

Chris Oldwood from The OldWood Thing

By and large I think I’ve been fairly lucky with my time as a contract programmer. Virtually all the teams I’ve worked in and systems I’ve worked on have been pretty decent. None of them are going to change the world but they’ve been enjoyable, which is probably why I’ve ended up working on them for a decent length of time [1].

I can only say “virtually all” because one contract sadly fell way short of the mark. Although I was technically part of a team it only really felt that way from a managerial perspective, even though we shared a codebase. I felt somewhat isolated both physically and mentally. Aside from the morning stand-up I could easily have gone the rest of the day without speaking to my teammates if I had chosen to do so.

Physical Isolation

I started the contract on a separate floor from the rest of my team with a couple of other recent joiners [2]. We were the only people on that floor with the air conditioning on full blast so we had to wear our coats in the afternoon to stay warm. None of the rest of my team had an office pass that could access the floor either, should they want to talk face-to-face while getting us up to speed.

Even when they moved us onto the same floor a month later we were still on the opposite side of the room. In the next desk shuffle I got to swap colleagues although they were working on an entirely separate area of the system with a totally different bunch of people so we had little need to collaborate per-se, only to make small talk. Also the two desks next to me only seemed to be used for a game of Tower of Hanoi by the office movers given how the occupants came and went.

Even my “customer” – at least, the one I knew about, because they were paying for the project – was situated in a different country and spoke a different language. Although their English was way better than any knowledge I have of a second language I quickly discovered why most communication was via email or IM instead of vocally.

Project Isolation

Being an enterprise scale organisation the work was all about projects, and who was sponsoring how many “resources”. Nowhere was this more apparent than the Scrum Board with its project-oriented swim-lanes. Each swim-lane had the names of the team members assigned to that project, and as the stand-up proceeded it walked down the board a project at a time with each member of the sub-team providing an update.

It was fairly apparent right from the moment I started, just by reading the body language of the team members, that there was often little real interest in what the rest of the team was doing. Those who did take an interest tended to cut across projects to some degree because they nursed the build system, deployments and monitoring. A couple of team members never attended our stand-up because they already attended a different one that encompassed their project.

To be fair some of the apathy at the stand-up was almost certainly down to its excessive length. And with little reason for attending except to provide a status update for the managers it’s no surprise those mostly on the periphery zoned out. Sometimes the only common goal of the team seemed to be to not break the system.

Code Isolation

During my short stint I effectively had one feature to work on. There were a couple of other minor tweaks to begin with but ultimately my project was one feature (nay, user story) and it took 5 months to deliver. That one feature involved making a change in an area of the codebase that nobody else knew except one of the tech leads who I soon discovered was leaving. In fact, taking away his days off after the announcement of his departure, I effectively had 3 days for any handover.

Not only were there no docs to work from, there were no tests either. The only real knowledge about how the service was expected to behave had left firmly inside the head of the author. This pretty much just left doing a spot of software archaeology with the VCS in the hope that the commit messages might contain some extra clues. Many features had been tracked in a feature tracking tool but there were not enough licenses to go round so I had to hassle a teammate to look things up. Even then it often wasn’t worth it as there were no useful details; it felt like the ticket was just there to “tick a box”.

The code relied heavily on the caller “doing the right thing” so any understanding only made sense if you already knew what the caller was supposed to do, and that relied heavily on knowledge of the problem domain and the organisation’s other systems. (At the interview I made it perfectly clear that I still knew little about the problem domain, despite the many years I have worked in it [3].)

Methodology Isolation

Ever since I had my epiphany [4] around testing all those years ago I have become a firm believer in TDD and automated testing as the preferred approach to the sustainable delivery of quality software. Being told early in the project that “you won’t have time to write tests”, despite being asked in the interview about what your approach is, did not bode well.

It soon became apparent that the previous approach had been to rush something out and rely on manual, end-to-end testing and the customer doing things “right”. Validation was almost entirely left to the underlying maths library and so bizarre errors manifested and needed investigating by the developers due to a lack of basic error handling and reporting [5].

With no way of knowing if I had broken anything, because I didn’t know for sure what anything was supposed to do, my only recourse was to write new code with tests and then refactor later when someone (potentially me) could be sure that it was safe to do so. For existing code that I had to change or understand I would write a barrage of tests first to try and ensure I didn’t accidentally break anything. In some cases it was hard to know what was “by design” and what was “by accident”.

Clearly not everyone took this approach, as you can see in “It Compiles, Ship It!”. My pessimism paid off though once the edge cases and little extras started appearing as I could turn around a fix or improvement (safely) in minutes due to my suite of automated unit and regression tests.

Environment Isolation

Sadly, despite my ability to push through changes quickly into the integration test environment, it still took weeks for them to actually appear in the production environment. When my first task, a handful of lines of boilerplate code, took 6 weeks to make it into production I assumed continuous delivery was not something they cared about.

On the contrary, for one aspect of the business, releases were very frequent. It was just that I was on the other side and due to some (IMHO) poor architecture and deployment decisions my part of the distributed system was tightly-coupled to another (major) system’s release cycle.

While it might seem great having my own integration test environment to play with, I ran into issues no one else knew about and I had no idea who was really using it and for what. Once again that information pretty much departed with the author.

Parting Thoughts

On reflection I have to look at my own behaviour first and ask myself whether I was at least partly responsible for feeling left out. Once we moved onto the same floor it was definitely easier to wander over and ask people questions, which I did. However when the response is “well I worked all this out by myself originally” and “that’s more than anyone ever gave me” I think it’s not entirely unfair to assume that knowledge sharing isn’t high on some people’s agenda.

I believe I was as welcoming as I normally am and was happy to help out where possible, given the limited knowledge I had acquired. I guess that culturally there was such a large drive for autonomy that the idea of just chatting about stuff to see what improvements in the system or process would be beneficial just wasn’t on the cards. A couple of times what should have been a constructive comment or question definitely came out of me more as a snide remark which is never a good sign. I’ve been trying hard to be more aware of any sarcasm, which unfortunately comes all too easily to me, and so not add to any unnecessary negativity but I know I failed a few times.

Ultimately I think it says a lot about an organisation that rejects your approach because “they are not a start-up” when your application of that approach has only ever been in large enterprises and none of them has ever had an issue with it before. On the contrary they have often been grateful for the insights and improvements that I’ve brought.

Maybe if I was a lot younger I’d not have known any better and stuck it out a bit more but these days I know it’s just not worth the effort. I feel comfortable that I left the place in a better state than I joined it by documenting various things and writing tests for the code I wrote. After a slightly rocky start my customer seemed pretty pleased with everything I delivered, which I guess is largely what matters most.

As ever, my main regret is leaving behind some people that I wish I could have gotten to know better. Maybe I will, in another life, one where the benefits of collaboration are more positively encouraged.

 

[1] Mostly my tenure has been measured in years, not months.

[2] Only one of which was left when I called it a day – the other two barely lasted a month or so.

[3] See “Problem Domain Expert or Technical Expert or Even Both” for more on this recurring theme.

[4] See “My [Unit] Testing Epiphany” and my more recent ACCU / Agile on the Beach talk “A Test of Strength” for what led to my enlightenment.

[5] Poor error messages are a popular topic of mine, see “Terse Exception Messages”. Also “The Perils of DateTime.Parse()” covers one specific example.

Technical Debt – Conscious Competence

Chris Oldwood from The OldWood Thing

Once upon a time the term Technical Debt seemed to have a very clear meaning but over the last few years that has been diluted to generally mean any crap code or process which is holding back delivery. I’m sure any scholars of Wittgenstein will be at pains to point out that “meaning is use” and therefore if everyone uses it this way who am I to argue?

For me the canonical source of information on the technical debt metaphor comes from the wiki of the person who coined the phrase in the first place – Ward Cunningham. The entry on Technical Debt there suggests to me that the choice to enter into debt is a wholly conscious one, not the unconscious acts of a less professional bunch of programmers.

By way of example I thought I’d take the opportunity to write up one of those occasions where I’ve been involved with taking debt on (in the original spirit of the term) and how we dealt with it, to show where the distinction lies.

The Bug

Soon after going live with v1.0 of a new calculation system in a large financial organisation we discovered that a number of key counterparties were missing from the daily report. The report generator was a late addition and there were various other issues around its development and testing which muddied the waters somewhat, but suffice to say that this wasn’t delivered as cleanly as the core system was. (You might consider the more recent meaning of the term to apply here.)

More importantly what transpired was that due to various mergers in the company’s history a few counterparties had the same “unique” code in different back-end systems. This wasn’t just news to my team (we were all recent hires) but also to quite a few people in the business too. Due to only dealing with a limited set of “books” the codes were always unique to them in their context, but our new system cut right across them all.

The Root Cause

The generation of calculations was ultimately based around a Cartesian product of two sets of counterparties; however, given that most of those pairings were pointless there was an optimisation which used another source of data to reduce that by more than an order of magnitude.

This optimisation should have been fairly simple but due to a need to initially use some existing manually managed counterparty data to ease the cutover (so regression testing should then reconcile exactly) it was somewhat more complicated than first envisaged.

Our system was designed to use the correct source of data eventually, but do a reverse lookup for the time being. It might sound simple but the lookup actually involved multiple lookups using combinations of keys that had to make assumptions about which legacy back-end system might hold the related data. The right person who could explain how we could do what we needed to do correctly also seemed elusive; there were many people with “heuristics”, but nobody who knew for sure.

In total ~100 counterparties out of ~15,000 permutations suffered from this problem. Unfortunately a handful of those 100 had a significant effect on the “bottom line” and therefore the usefulness of the system as a whole was in doubt at that point.

Entering Into Debt

Naturally once we unearthed this clanger we had to decide how to tackle it. After getting our heads around what this all meant and roughly where in the code the missing logic probably needed to go, we had to make a decision – do we try and fix the underlying issue right away, or put a workaround in place (assuming that’s even possible) to mitigate the problem, at least temporarily?

We were all very aware of going down the dark road of putting a tactical fix in place because we’d all seen where that can lead. We had made a concerted effort over the 12 months required to build the system to refactor relentlessly [1] and squash any bugs as soon as possible. This felt like a backwards step.

On the positive side, by adopting a Design for Testability approach in most parts of the code we had extra switches on our processes [2] that allowed us to make per-counterparty requests, usually for diagnostic purposes. Hence the workaround took the form of sticking the list of missing counterparties in a simple text file, then using a command prompt FOR loop [3] to read the file and invoke the tool in “single counterparty mode”. Yes it was a little slow due to the constant restarting of the process, but it was easy to surgically insert into the workflow with the minimum of testing or risk.
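Purely as an illustration of the idea (the real workaround was just a command prompt FOR loop, and the tool name and switch below are made up), the driver amounted to little more than this:

    using System.Diagnostics;
    using System.IO;

    // Read the list of affected counterparties and invoke the existing engine
    // once per entry using its diagnostic "single counterparty" switch.
    foreach (var counterparty in File.ReadLines("missing-counterparties.txt"))
    {
        using var process = Process.Start("CalcEngine.exe", $"--counterparty {counterparty}");
        process.WaitForExit();   // slow due to the process restarts, but simple and low risk
    }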

Paying Back the Debt

With the hole plugged for now, and an easy mechanism in place for adding any other missing counterparties – update the text file – we were in a position to sort out the root problem without feeling under pressure to get the system working correctly 100%, ASAP.

As you can probably imagine the real solution wasn’t easy, not least because it was one of a few areas of SQL code that didn’t have any unit tests and was a tangled web of tables and views which had grown organically in an attempt to graft the old and the new worlds together [4].

What Did it Cost?

If we assume that the gung-ho approach would have been to just jump in and start fixing the real code, then what did we lose by not doing that? It’s possible that the final fix was simple and a little more investigation may have led to that solution instead.

In contrast, the risk is that we end up in one of those “have we fixed it or not” scenarios where we spend an indeterminate amount of time being “real close” to getting towards “done”. The old adage about the last 10% also taking 90% of the time springs immediately to mind. Instead we were almost positive we had a simple workaround that could be deployed and get the system running correctly enough in an estimable amount of time. I believe there is a lot of value in having that degree of confidence.

What I think was critical was being able to remove the pressure on finding the right solution as this gave us time to really consider what needed to be done. Any fix done under pressure is not going to be given the attention to detail that it probably deserves. You then run the risk of making the system worse and having an even deeper hole to dig yourself out of.

The customer does not care about strategic versus tactical decisions per-se, they just want the thing to work. We cared about the solution because we knew it would be a burden in the short term as everyone had to remember about the bit “grafted on the side”. The general trust the team had built up by keeping quality at the forefront meant that the business would be more willing to trust us to reconcile the problem appropriately when the time came.

Use Language With Care

I really hope the term Technical Debt doesn’t continue to get watered down even further as it’s a powerful concept which is incredibly useful in the right hands. We already have far too many words for “alternate implementation” that are pejoratives carrying an air of unprofessionalism about them. I would like this one to remain in the hands of the professionals so they can continue to have “grown up” conversations with their customers about when it’s appropriate to consider taking shortcuts for a short term business need without them rolling their eyes, yet again.

 

[1] See “Relentless Refactoring” for more thoughts around this (unfortunately) contentious topic.

[2] “From Test Harness To Support Tool”, “Building Systems as Toolkits” and “In The Toolbox - Home-Grown Tools” all look at the non-functional side of tooling.

[3] A batch file just wouldn’t be complete without a for loop, see “Every Solution Starts With ‘FOR /F’”.

[4] This one feature seemed destined to plague us forever, see “So Many Wrongs, But No Rights” for another tale of woe.

TODO or TODO Not – Redux

Chris Oldwood from The OldWood Thing

One of my most “successful” posts was one I wrote way back in 2011 about my dislike for TODO style comments in code – “TODO or TODO Not - There Is No Later”. The premise of that post, which includes a number of examples from real codebases I’ve worked on, is that they are fundamentally pointless because they are almost certainly too low in value to get done. If they were valuable enough they should either be a proper feature on the backlog or be left to be handled as part of a relevant story, e.g. refactoring.

At the time I wrote that post I was convinced about them not living in the codebase (past the feature’s release) but I suggested that any potentially useful ones should be converted into bona fide user stories so they could be formally considered and prioritised along with the other items. After receiving a reply to a tweet that referenced my aging blog post from a company that tries to quantify technical debt by analysing such TODO style comments I felt it must be time for an update. (The TL;DR of this post was my reply to them.)

// Learn to Let Go

One of the hardest lessons I have learned over the intervening 7 years is how to let go of stuff. What caused me to cling onto some of those TODOs that I would run across was a fear of forgetting something important. My daily use of a log book (see “Pen & Paper”) to record notes as I go along exemplifies my apparent need to keep track of my current state as my brain is like the proverbial sieve. In an era where change was much harder, because the software development QA process was largely manual, this made sense as it was more common to batch up changes. Couple this with a general disregard for anything non-functional, such as an appreciation for refactoring, and it’s all too easy to see why people bury their personal backlog in the code rather than open it up for discussion with non-developers.

There is a familiar ring here and that’s because I’ve walked this path before more recently in 2016’s “Developers Can Be Their Own Worst Enemy”. With a modern development process that puts an emphasis on trust and transparency we can let go of the past and should have more confidence in our peers and the management to see the value in our opinions.

It’s also not acceptable to just leave a cryptic comment in the code about why something should be implemented differently or moved elsewhere – the need to change must be backed up with a reason so that we can understand the value in it, and most TODO comments are throwaway rather than well reasoned arguments.

// Measuring Technical Debt

I’ll be honest, I genuinely cannot see the point of attempting to measure the level of technical debt in a codebase, even if it were possible to do so. For a start you need to decide if you’re taking the original interpretation – a conscious decision to temporarily take a shortcut - or the ever more popular one – crap code. Even if you could do that, what are you ever going to do with the output? I once saw a SonarQube report that said we had £X million of technical debt in a codebase I was working on; how exactly does that inform you?

The reason I find it pointless is because it feels like it’s measuring the wrong thing. Just like the story about the man looking for his keys under the streetlight we apply a tool to measure the quality of the code when what really matters is measuring our ability to deliver. The reason poor quality code affects us is because it makes future change hard and slows us down. Hence if it matters there must be a causal link because if the code is that bad our pace of delivery will slow down and we’ll see that (all other things being equal).

But what is more important in my mind is that even if you did know how much debt you had you would only want to focus on the areas that are subject to change, and you do that already by focusing on the most valuable work in the backlog! And the technique for tackling the debt is refactoring. Hence we already have what we need to address the real problem.

// Trimming the Backlog

Unwinding the stack slightly let me return to the topic of reflecting unfinished work in the product’s backlog.

That sentence should already be making your spider-senses tingle – how can it be unfinished? If we’ve not met the acceptance criteria we’re not done, so carry on. If the acceptance criteria were wrong or incomplete then that’s a discussion to have with the product owner for clarification. Note that “of good quality” is an acceptance criterion inherent in every story we do as it is virtually non-negotiable [1].

Perhaps a better term would be undiscovered work. Woody Zuill has this saying “it’s in the doing of the work that we discover the work that needs doing”. What we often unearth are things which never showed up in our initial analysis and therefore we now have to factor them into our schedule or decide to drop them. When we’re knee deep in the feature it may seem really important to do it now, but given time the need may slowly dissolve, and in the end possibly disappear altogether.

What you might initially put on the backlog (or write as a TODO in the code) could be quite specific, e.g. “Need to handle poisoned messages”. As you continue you might also add other scenarios such as “Should refresh token before it expires” and “Triage malformed messages separately from those well-formed but still invalid”. These are all valuable behaviours which go towards building a reliable service, but the initial focus might not be on reliability; maybe it’s currently just a tactical fix to enable learning about something else. You don’t want to waste time over-engineering but at the same time you also don’t want to forget what else you have learned doing it.

The problem with this mentality is that the backlog just grows and grows, and we’re back to where we were before with TODOs littered all over the code – it’s just an ever growing feature list. As time passes the need to implement some of those features will either happen because they have suddenly become important (it’s no longer tactical) or they will vanish (it remains a mere stop-gap). Leaving the stories on the backlog just makes it harder and harder to prioritise as you keep going over old ground again and again.

Once again we need to let go and rely on the fact that if they are important they will eventually resurface. If you feel really uncomfortable about deleting them entirely then you might consider rolling up related stories back into epic sized ones as part of the backlog grooming. Applying this to our recent example you could easily fold them all into a single “Service Reliability” epic that would be easier to handle. You could turn each card’s title into a checklist item.

// Confidence

That said as I get older I become less tolerant of a never-ending backlog and want to be more aggressive in the backlog grooming. Part of that is down to having more confidence in knowing that issues will be addressed [2] in a more timely fashion and that prioritisation will take into account both the technical and functional features with more or less equal consideration because the team is trusted to do The Right Thing.

Being more experienced there is no doubt more than a grain of truth that my sphere of influence is probably wider, but that shouldn’t really matter. Every story must be valuable and ideally stand up on its own merits; where my experience comes in is probably in being able to express that value more succinctly.

// TODO: Redux

I haven’t seen, heard or read anything in the intervening years that has been able to convince me of the value of using TODO comments in the code as a more successful technique for managing what needs to be done. If anything my appetite for tracking any work outside the next few sprints / weeks has begun to wane, simply because it is now so easy to change direction at only a moment’s notice should the situation deteriorate. This does not mean we should be reckless, far from it, but leaving a TODO in the code as a way of conveying a change in architecture hardly seems optimal either.

As for the notion that a TODO in the code can be equated with an increase in technical debt I don’t buy that either. I would posit it is more likely to indicate a failure within the development process itself as the mismatch between behaviour and acceptance criteria either indicates a bug in the code or a bug in the requirements and neither of these outcomes sounds like something that should be brushed off lightly with a comment in the code.

 

[1] I try not to deal in absolutes as there always seems to be an exception, but either way it must be a conscious decision.

[2] Assuming the organisation is behaving in a moderately agile way and not just paying lip service to a couple of the ceremonies or practices.

Delivery Anti-Pattern: Local Optimisations

Chris Oldwood from The OldWood Thing

The daily stand-up mostly went along as usual. I wasn’t entirely sure why there was a stand-up as it wasn’t so much a team as a bunch of people working on the same codebase but with more-or-less individual goals. Applying the microservices analogy to the team you could say we were a bunch of micro-teams – each person largely acting with autonomy rather than striving to work together to achieve a common goal [1].

Time

But I digress, only slightly. What happened at the stand-up was a brief discussion around the delivery of a feature with the suggestion that it should be hidden behind a feature toggle. The implementer explained that they weren’t going to add a feature toggle because “it was a waste of their time”.

This surprised me. Knowing what I do now about how the team operates it isn’t that surprising but at the time I was used to working in teams where every member works towards common goals. One of those common goals is to try and ensure the delivery of features is a continuous flow and is not disrupted by a bad change which then has to be backed out because rolling back has the potential to create all sorts of disruption, not least the delay of those unaffected changes.

You should note that I’m not disagreeing with their choice of whether or not to use a feature toggle – I did not know anywhere near enough about the change or the system at that time to contribute to that decision. No, what disturbed me was the reason why they chose not to take that approach – that their time is more valuable than that of anyone else in the team, or the business for that matter.

In isolation that paints an unpleasant picture of the individual in question and that simply is not the case. However their choice of words, even if done without real consideration, does appear to reinforce the culture that surrounds them. In essence, with a feeling that the focus is on them and their performance, they are naturally going to behave in a way that favours optimising the delivery of their own features over that of the team at large.

Quality

Another example of favouring a local optimisation over the longer term goal of sustained delivery occurred when I was assigned my first piece of work. This was not so much a story as a couple of epics funded as an entire project (over 4 months solid work in the end). My instinct, after being shown roughly where in the code I needed to dig, was to ask where the existing tests were so that I knew where to add mine. The tech lead’s immediate response was “you won’t have time to write tests”.

My usual response to this statement is a jovially phrased “how will I know if it works then?” which often has the effect of opening a line of dialogue around the testing strategy and where it’s heading. Unfortunately this time around it only succeeded in the tech lead launching into a diatribe about how important delivery was, how much the business trusted us to deliver on time, blah, blah, blah, in fact almost everything that a good test suite enables!

Of course I still went ahead and implemented the entire project TDD-style and easily delivered it on time because I knew the approach was sound and the investment was more than worth it. The subsequent enhancements and repaying of some technical debt also became trivial at that point and meant that anyone, not just me, could safely and quickly make changes to that area of code. It also showed how easy it was to add new tests to cover changes to the older parts of the component when required later.

In the end over 10% of the unit tests in the entire system had been written by me during that project, for a codebase of probably over 1/2 million lines of code. I also added a command line test harness and a regression testing “framework” [2] in that time too, all in an effort to reduce the amount of hoops you needed to go through to diagnose and safely fix any edge cases that showed up later. None of this was rocket science or in the least bit onerous.
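For what it’s worth, the regression “framework” mentioned in [2] boils down to little more than the classic golden master pattern. A rough sketch – the file names and request sending are placeholders, not the real tool – looks like this:

    using System;
    using System.IO;

    static class RegressionRunner
    {
        // Replay each request from a CSV file against the server and diff the
        // response against a previously approved baseline (the golden master).
        // Assumes the two files have the same number of lines.
        public static int Run(Func<string, string> sendRequest)
        {
            var requests = File.ReadAllLines("requests.csv");
            var expected = File.ReadAllLines("expected-results.csv");
            var failures = 0;

            for (var i = 0; i < requests.Length; i++)
            {
                var actual = sendRequest(requests[i]);
                if (actual != expected[i])   // naive line-by-line diff
                {
                    Console.WriteLine($"Mismatch at line {i + 1}: expected '{expected[i]}', got '{actual}'");
                    failures++;
                }
            }

            return failures;   // non-zero means behaviour has drifted from the master
        }
    }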

Knowledge

I would consider a lack of supporting documentation one further local optimisation too. When only a select few have the knowledge to help support a system you have to continually rely on their help to nurse it through the bad times. This is especially true when the system has enough quirks that the cost of taking the wrong action is quite high (in terms of additional noise). If you need to remember a complex set of conditions and actions you’re going to get it wrong eventually without some form of checklist to work from. Relying on tribal knowledge is a great form of optimisation until core members of the team leave and you unearth the gaping holes in the team’s knowledge.

Better yet, design away the problems entirely, but that’s a different can of worms…

Project Before Product

I believe this was another example of how “projects” are detrimental to the development of a complex system. With the team funded by various projects and those projects being used as a very clear division on the task board through swim lanes [3] it killed the desire to swarm on anything but a production incident because you felt beholden to your specific stakeholders.

For example there were a number of conversations about fixing issues with the system that were slowing down delivery through unreliability that ended with “but who’s going to pay for that?” Although improvements were made they had to be so small as to not really affect the delivery of the project work. Hence the only real choice was to find easier ways to treat the symptoms rather than cure the disease.

Victims of Circumstance

Whenever I bump into this kind of culture my gut instinct is not to assume they are “incompetent” people, on the contrary, they’re clearly intelligent so I’ll assume they are shaped by their environment. Of course we all have our differences, that’s what makes diversity so useful, but we have to remember to stop once in a while and reflect on what we’re doing and question whether it’s still the right approach to take. What works for building Fizz Buzz does not work for a real-time, distributed calculation engine. And even if that approach did work once upon a time the world keeps moving on and so now we might be able to do even better.

 

[1] Pairing was only something you did when you’d already been stuck for some time, and when the mistake was found you went your separate ways again.

[2] I say “framework” because it was really just leveraging a classic technique: a command line tool reading CSV format data which fired requests into a server, the results of which are then diff’d against a known set of results (Golden Master Testing).

[3] The stand-up was originally run in project order, led by the PM. Unsurprisingly those not involved in the other projects were rarely engaged in the meeting unless it was their turn to speak.

Treat All Test Environments Like Production

Chris Oldwood from The OldWood Thing

One of the policies I pushed for from the start when working on a greenfield system many years ago was the notion that we were going to treat all test environments (e.g. dev and UAT) like the production environment.

As you can probably imagine this was initially greeted with a heavy dose of scepticism. However all the complaints I could see against the idea were dysfunctional behaviours of the delivery process. All the little workarounds and hacks that were used to back up their reasons for granting unfettered access to the environments seemed to be the result of poorly thought out design, inadequate localised testing or organisational problems. (See “Testing Drives the Need for Flexible Configuration” for how we addressed one of those concerns.)

To be clear, I am not suggesting that you should completely disable all access to the environment; on the contrary I believe that this is required even in production for those rare occasions when you just cannot piece together the problem from your monitoring and source code alone. No, what I was suggesting was that we employ the same speed bumps and privileges in our test environments that we would in production. And that went for the database too.

The underlying principle I was trying to enshrine here was that shared testing environments, by their very nature, should be treated with the utmost care to ensure a smooth delivery of change. In the past I have worked on systems where dev and test environments were a free-for-all. The result is that you waste so much time investigating issues that are orthogonal to your actual problem because someone messed with it for their own use and just left it in a broken state. (This is another example of the “Broken Windows” syndrome.)

A secondary point I was trying to make was that your test environments are also, by definition, your practice runs at getting things right. Many organisations have a lot of rigour around how they deploy to production but very little when it comes to the opportunities leading up to it. In essence your dev and test environments give you two chances to get things right before the final performance – if you’re not doing dress rehearsals beforehand how can you expect it to go right on the day? When production deployments go wrong we get fearful of them, risk aversion kicks in, we do them less often and a downward spiral begins.

The outcome of this seemingly “draconian” approach to managing the development and test environments was that we also got to practice supporting the system in two other environments, and in a way that prepared us for what we needed to do when the fire was no longer just a drill. In particular we quickly learned what diagnostic tools we should already have on the box and, most importantly, what privileges we needed to perform certain actions. It also affected what custom tools we built and what extra features we added to the services and processes to allow safe use for analysis during support (e.g. a --ReadOnly switch).

The Principle of Least Privilege suggests that for our incident analysis we should only require read access to any resource, such as files, the database, OS logs, etc. If you know that you are protected from making accidental mistakes you can be more aggressive in your approach as you feel confident that the outcome of any mistake will not result in breaking the system any further [1][2]. Only at the point at which you need to make a change to the system configuration or data should the speed bumps kick in and you elevate yourself temporarily, make the change and immediately drop back to mere mortal status again.

The database was an area in particular where we had all been bitten before by support issues made worse through the execution of ad-hoc SQL passed around by email or pasted in off the wiki. Instead we added a new schema (i.e. namespace) specifically for admin and support stored procedures that were developed properly, i.e. they were written test-first. (See “You Write Your SQL Unit Tests in SQL” for more on how and why we did it this way.) This meant certain kinds of workarounds were easier to administer because they were essentially part of the production codebase, not just some afterthought that nobody maintained.

On the design front this also started to have an interesting effect: to ensure we avoided violating invariants we found ourselves wanting to leverage our production service code in new ways, hosting the underlying service components inside new containers such as command-line tools, or making them scriptable. (See “Building Systems as Toolkits”.)

The Interface Segregation Principle is your friend here as it pushes you towards having separate interfaces for reading and writing, making it clearer which components you can direct towards a production service if you’re trying to reproduce an issue locally. For example, our calculation engine support tool allowed you to point any “readers” towards real service endpoints whilst redirecting the writers to /dev/null (i.e. using the Null Object pattern) or to some simple in-memory implementation (think Dictionary) to pass data from one internal task to the next.
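A minimal sketch of that split might look something like the following (the interface and class names are my own for illustration, not the real calculation engine’s API):

    using System.Collections.Generic;

    // Illustrative only: segregating reads from writes makes it obvious which
    // dependencies are safe to point at a real (even production) endpoint.
    public interface ITradeReader
    {
        IEnumerable<string> FetchTrades(string client);
    }

    public interface IResultWriter
    {
        void Store(string client, decimal result);
    }

    // The Null Object writer, the /dev/null of the pipeline.
    public sealed class NullResultWriter : IResultWriter
    {
        public void Store(string client, decimal result) { /* intentionally discards */ }
    }

    // A simple in-memory writer used to pass results from one internal task to the next.
    public sealed class InMemoryResultWriter : IResultWriter
    {
        public IDictionary<string, decimal> Results { get; } = new Dictionary<string, decimal>();
        public void Store(string client, decimal result) => Results[client] = result;
    }

The readers can then be constructed from real service endpoints, while the choice of writer depends on whether the run is purely for analysis.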

I find it somewhat annoying that we went to a lot of effort to give ourselves the best chance of designing and building a supportable system that also provided traceability, only for the infrastructure team to turn down our request for personal, per-environment support accounts, saying instead that we needed to share a single one! Even getting them to give us separate accounts for dev, UAT and production was hard work. It sometimes feels like the people who complain most about a lack of transparency and rigour are the same ones that deny you access to exactly that.

I know there were times when it felt as though we could drop our guard in dev or UAT “just this once” but I don’t remember us ever doing that. Instead we always used it as an opportunity to learn more about what the real need was and how it could become a bona fide feature rather than just a hack.

 

[1] That’s not entirely true. A BA once concocted a SQL query during support that ended up “bug checking” SQL Server and brought the entire system to a grinding halt. They then did it again by accident after it was restarted :o).

[2] A second example was where someone left the Sysinternals DebugView tool running overnight on a server whereupon it filled up the log window and locked up a service due to the way OutputDebugString works under the covers.

Good Stories Assure the Architecture

Chris Oldwood from The OldWood Thing

One of the problems a team can run into when they adopt a more agile way of working is that they struggle to frame their backlog in terms of user-focused stories. This is a problem I’ve written about before in “Turning Technical Tasks Into User Stories” which looked at the problem for smaller units of work. Even if the team can buy into that premise for the more run-of-the-mill features it can still be a struggle to see how that works for the big ticket items like the system’s architecture.

The Awkward Silence

What I’ve experienced is that the team can start to regress when faced with discussions around what kind of architecture to aim for. With a backlog chock-full of customer-pleasing functionality the architectural conversations might begin to take a bit of a back seat as the focus is on fleshing out the walking skeleton with features. Naturally, nervousness starts to set in as the engineers begin to wonder when the architecture is going to get the attention it rightly deserves. It’s all very well supporting a handful of “friendly” users, but what about when you have real customers who’ve entrusted you with their data and want to make use of it at a moment’s notice, at any hour of the day?

The temptation, which should be resisted, is to see architectural work as outside the scope of the core backlog – creating a separate backlog for stuff “the business does not understand”. This can lead to a split in the backlog, and potentially even two entirely separate backlogs – a functional one and a non-functional one – which makes prioritisation impossible. Burying the work also kills transparency, eventually erodes trust, and still doesn’t get you the answers you really need.

Instead, the urge should be to frame the architectural concerns in terms the stakeholders do understand, so that the business can be better informed about their actual benefits. In addition, when “The Architecture” is a journey rather than a single destination there is no longer one set of benefits to aim for; there are multiple trade-offs as the architecture evolves over time, changing at each step to satisfy the ongoing needs of the customer(s) along the way. There is, in essence, no “final solution”, only “what we need for the foreseeable future”.

Tell Me a Story

So, what do I mean by “good stories”? Well, the traditional way this goes is for an analyst to solicit some non-functional requirements for some speculative eventual system behaviour. If we’re really lucky it might end up in the right ballpark at one particular point in the future. What’s missing from this scene is a proper conversation, a proper story – one with a beginning, a middle, and an end – covering where we are today, the short term, and the longer-term vision.

But not only do we need to get a feel for their aspirations, we also need quantifiable metrics for how the system needs to perform. Vague statements like “fast enough” are just not helpful. A globally accessible system with an anticipated latency in the tens of milliseconds will need to break the laws of physics unless we trade off something else. We also need to know how exceptional events like Cyber Monday are to be factored into the operational side.
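To make that concrete, here is a rough back-of-envelope calculation; the figures are indicative only, not taken from any particular system:

    distance, London to Sydney        ~  17,000 km
    speed of light in optical fibre   ~ 200,000 km/s
    one-way propagation delay         ~  17,000 / 200,000 s  =  85 ms
    round trip                        ~ 170 ms, before any routing, queueing or processing

Offering tens of milliseconds to users on both sides of the planet therefore means trading something else away, such as replicating data and services closer to each user.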

It’s not just about performance either. In many cases end users care that their data is secure, both in-flight (over the network) and at rest, although they likely have no idea what this actually means in practice. Patching servers is a technical task, but the bigger story is about how the team responds to a vulnerability, which may make the patching itself irrelevant. Similarly, database backups are not really the issue; service availability is – you cannot be highly available if the loss of an entire data centre potentially means waiting for a database to be restored from scratch elsewhere.

Most of the traditional conversations around non-functional requirements focus entirely on the happy path; for me the conversation doesn’t really get going until you start talking about what needs to happen when the system is down. It’s never a case of “if” but “when” it fails, and therefore mitigating these problems features heavily in our architectural choices. It’s an uncomfortable conversation, as we never like discussing failure, but that’s what having “grown up” conversations means.

Incremental Architecture

Although I’ve used the term “story” in this post’s title, many of the issues that need discussing are really in the realm of “epics”. However, we shouldn’t get bogged down in the terminology; the essence is to focus on the outcome from the user’s perspective. Ask yourselves how fast, how secure, how available, etc. the system needs to be now, and how those needs might change in response to the system’s, and the business’s, growth.

With a clearer picture of the potential risks and opportunities we are better placed to design and build in small increments such that the architecture can be allowed to emerge at a sustainable rate.