Cargo Culting GitFlow

Chris Oldwood from The OldWood Thing

A few years back I got to spend a couple of weeks consulting at a small company involved in the production of smart cards. My team had been brought in by the company’s management to cast our critical eye over their software development process and provide a report on what we found along with any recommendations on how it could be improved.

The company only had a few developers and while the hardware side of the business seemed to be running pretty smoothly the software side was seriously lacking. To give you some indication of how bad things used to be, they weren’t even using version control for their source code. Effectively when a new customer came on board they would find the most recent and relevant existing customer’s version (stored in a .zip file), copy their version of the system, and then start hacking out a new one just for the new customer.

As you can imagine in a set-up like this, if a bug is found it would need to be fixed in every version and therefore it only gets fixed if a customer noticed and reported it. This led to more divergence. Also as the software usually went in a kiosk the hardware and OS out in the wild was often ancient (Windows 2000 in some cases) [1].

When I say “how bad things used to be” this was some months before we started our investigation. The company had already brought in a previous consultant to do an “Agile Transformation” and they had recognised these issues and made a number of very sensible recommendations, like introducing version control, automated builds, unit testing, more collaboration with the business, etc.

However, we didn’t think they looked too hard at the way the team were actually working and only addressed the low hanging fruit by using whatever they found in their copy of The Agile Transformation Playbook™, e.g. Scrum. Naturally we weren’t there at the time but through the course of our conversations with the team it became apparent that a cookie-cutter approach had been prescribed despite it being (in our opinion) far too heavyweight for the handful of people in the team.

As the title of this post suggests, and the one choice I found particularly amusing, was the introduction of VSTS (Visual Studio Team Services; rebranded Azure DevOps) and a GitFlow style workflow for the development team. While I applaud the introduction of version control and isolated, repeatable builds to the company, this feels like another heavyweight choice. The fact that they were already using Visual Studio and writing their web service in C# probably means it’s not that surprising if you wanted to pick a Big Iron product.

The real kicker though was the choice of a GitFlow style workflow for the new product team where there were only two developers – one for the front-end and another for the back-end. They were using feature branches and pull requests despite the fact that they were the only people working in their codebase. While the company might have hired another developer at some point in the future they had no immediate plans to to grow the team to any significant size [2] so there would never be any merge conflicts to resolve in the short to medium term! Their project was a greenfield one to create a configurable product instead of the many one-offs to date, so they had no regressions to worry about at this point either – it was all about learning and building a prototype.

It’s entirely possible the previous consultant was working on different information to us but there was nothing in our conversations with the team or management that suggested they previously had different goals to what they were asking from us now. Sadly this is all too common an occurrence – a company hires an agile coach or consultant who may know how to handle the transformation from the business end [3] but they don’t really know the technical side. Adopting an agile mindset requires the XP technical practices too to be successful and so, unless the transformation team really knows its development onions, the practices are going to be rolled out and applied with a cargo cult mentality instead of being taught in a way that the team understands which practices are most pertinent to their situation and why.

In contrast, the plan we put forward was to strip out much of the fat and focus on making it easy to develop something which could be easily demo’d to the stakeholders for rapid feedback. We also proposed putting someone who was “more developer than scrum master” into the team for a short period so they could really grok the XP practices and see why they matter. (This was something I personally pushed quite hard for because I’ve seen how this has played out before when you’re not hands-on, see “The Importance of Leading by Example”.)

 

[1] Luckily these kiosks weren’t connected to a network; upgrades were a site visit with a USB stick.

[2] Sadly there were cultural reasons for this – a topic for another day.

[3] This is debatable but I’m trying to be generous here as my expertise is mostly on the technical side of the fence.

Git is Not the Problem

Chris Oldwood from The OldWood Thing

Git comes in for a lot of stick for being a complicated tool that’s hard to learn, and they’re right, git is a complicated tool. But it’s a tool designed to solve a difficult problem – many disparate people collaborating on a single product in a totally decentralized fashion. However, many of us don’t need to work that way, so why are we using the tool in a way that makes our lives more difficult?

KISS

For my entire professional programming career, which now spans over 25 years, and my personal endeavours, I have used a version control tool (VCS) to manage the source code. In that time, for the most part, I have worked in a trunk-based development fashion [1]. That means all development goes on in one integration branch and the general philosophy for every commit is “always be ready to ship” [2]. As you might guess features toggles (in many different guises) play a significant part in achieving that.

A consequence of this simplistic way of working is that my development cycle, and therefore my use of git, boils down to these few steps [3]:

  • clone
  • edit / build / test
  • diff
  • add / commit
  • pull
  • push

There may occasionally be a short inner loop where a merge conflict shows up during the pull (integration) phase which causes me to go through the edit / diff / commit cycle again, but by-and-large conflicts are rare due to close collaboration and very short change cycles. Ultimately though, from the gazillions of commands that git supports, I mostly use just those 6. As you can probably guess, despite using git for nearly 7 years, I actually know very little about it (command wise). [4]

Isolation

Where I see people getting into trouble and subsequently venting their anger is when branches are involved. This is not a problem which is specific to git though, you see this crop up with any VCS that supports branches whether it be ClearCase, Perforce, Subversion, etc. Hence, the tool is not the problem, the workflow is. And that commonly stems from a delivery process mandated by the organization, meaning that ultimately the issue is one of an organizational nature, not the tooling per-se.

An organisation which seeks to reduce risk by isolating work (and by extension its people) onto branches is increasing the delay in feedback thereby paradoxically increasing the risk of integration, or so-called “merge debt”. A natural side-effect of making it harder to push through changes is that people will start batching up work in an attempt to boost "efficiency”. The trick is to go in the opposite direction and break things down into smaller units of work that are easier to produce and quicker to improve. Balancing production code changes with a solid investment in test coverage and automation reduces that risk further along with collaboration boosting techniques like pair and mob programming.

Less is More

Instead of enforcing a complicated workflow and employing complex tools in the hope that we can remain in control of our process we should instead seek to keep the workflow simple so that our tools remain easy to use. Git was written to solve a problem most teams don’t have as they neither have the volume of distributed people or complexity of product to deal with. Organisations that do have complex codebases cannot expect to dig themselves out of their hole simply by introducing a more powerful version control tool, it will only increase the cost of delay while bringing a false sense of security as programmers work in the dark for longer.

 

[1] My “Branching Strategies” article in ACCU’s Overload covers this topic if you’re looking for a summary.

[2] This does not preclude the use of private branches for spikes and/or release branches for hotfix engineering when absolutely needed. #NoAbsolutes.

[3] See “In The Toolbox - Commit Checklist” for some deeper discussion about what goes through my head during the diff / commit phase.

[4] I pondered including “log” in the list for when doing a spot of software archaeology but that is becoming much rarer these days. I also only use “fetch” when I have to work with feature branches.

Validate in Production

Chris Oldwood from The OldWood Thing

The change was reasonably simple: we had to denormalise some postcode data which was currently held in a centralised relational database into some new fields in every client’s database to remove some cross-database joins that would be unsupported on the new SQL platform we were migrating too [1].

As you might imagine the database schema changes were fairly simple – we just needed to add the new columns as nullable strings into every database. The next step was to update the service code to start populating these new fields as addresses were added or edited by using data from the centralised postcode database [2].

At this point any new data or data that changed going forward would have the correctly denormalised state. However we still needed to fix up any existing data and that’s the focus of this post.

Migration Plan

To fix-up all the existing client data we needed to write a tool which would load each client’s address data that was missing its new postcode data, look it up against the centralised list, and then write back any changes. Given we were still using the cross-database joins in live for the time being to satisfy the existing reports we could roll this out in the background and avoiding putting any unnecessary load on the database cluster.

The tool wasn’t throw-away because the postcode dataset gets updated regularly and so the denormalised client data needs to be refreshed whenever the master list is updated. (This would not be that often but enough to make it worth spending a little extra time writing a reusable tool for the job for ops to run.)

Clearly this isn’t rocket science, it just requires loading the centralised data into a map, fetching the client’s addresses, looking them up, and writing back the relevant fields. The tool only took a few hours to write and test and so it was ready to run for the next release during a quiet period.

When that moment arrived the tool was run across the hundreds of client databases and plenty of data was fixed up in the process, so the task appeared completed.

Next Steps

With all the existing postcode data now correctly populated too we should have been in a position to switch the report generation feature toggle on so that it used the new denormalised data instead of doing a cross-database join to the existing centralised store.

While the team were generally confident in the changes to date I suggested we should just do a sanity check and make sure that everything was working as intended as I felt this was a reasonably simple check to run.

An initial SQL query someone knocked up just checked how many of the new fields had been populated and the numbers seemed about right, i.e. very high (we’d expect some addresses to be missing data due to missing postcodes, typos and stale postcode data). However I still felt that we should be able to get a definitive answer with very little effort by leveraging the existing we SQL we were about to discard, i.e. use the cross-database join one last time to verify the data population more precisely.

Close, but No Cigar

I massaged the existing report query to show where data from the dynamic join was different to that in the new columns that had been added (again, not rocket science). To our surprise there were quite a significant number of discrepancies.

Fortunately it didn’t take long to work out that those addresses which were missing postcode data all had postcodes which were at least partially written in lowercase whereas the ones that had worked were entirely written in uppercase.

Hence the bug was fairly simple to track down. The tool loaded the postcode data into a dictionary (map) keyed on the string postcode and did a straight lookup which is case-sensitive by default. A quick change to use a case-insensitive comparison and the tool was fixed. The data was corrected soon after and the migration verified.

Why didn’t this show up in the initial testing? Well, it turned out the tools used to generate the test data sets and also to anonymize real client databases were somewhat simplistic and this helped to provide a false level of confidence in the new tool.

Testing in Production

Whenever we make a change to our system it’s important that we verify we’ve delivered what we intended. Oftentimes the addition of a feature has some impact on the front-end and the customer and therefore it’s fairly easy to see if it’s working or not. (The customer usually has something to say about it.)

However back-end changes can be harder to verify thoroughly, but it’s still important that we do the best we can to ensure they have the expected effect. In this instance we could easily check every migrated address within a reasonable time frame and know for sure, but on large data sets this might unfeasible so you might have to settle for less. Also the use of feature switches and incremental delivery meant that even though there was a bug it did not affect the customers and we were always making forward progress.

Testing does not end with a successful run of the build pipeline or a sign-off from a QA team – it also needs to work in real life too. Ideally the work we put in up-front will make that more likely but for some classes of change, most notably where actual customer data is involved, we need to follow through and ensure that practice and theory tie up.

 

[1] Storage limitations and other factors precluded simply moving the entire postcode database into each customer DB before moving platforms. The cost was worth it to de-risk the overall migration.

[2] There was no problem with the web service having two connections to two different databases, we just needed to stop writing SQL queries that did cross-database joins.

Feature Branches and Package Dependencies

Chris Oldwood from The OldWood Thing

For most of my programming career I have worked directly on the main integration branch (aka trunk / master) for day-to-day development. Release branches have featured occasionally at various clients, mostly to compensate for bureaucracy, and I once had the misfortune to work with project-level branches (see “The Cost of Long-Lived Feature Branches”) which was a merge nightmare. More recently I got to work in a team that used much shorter lived feature branches [1] and I got a reminder of the kinds of problems even they cause.

When the branch is confined to a single repository and that repo is for the delivered product then the only people who are affected by the changes are those working on the branch. (We’re not talking about the Cost of Delay here for the customer, this is about intra-team delays.) However once we start making changes outside the main repository, such as our library repos (nay packages) things get more complicated.

Changing Packages

Although in theory we could create a similar branch in the library repo the point of integration between the two codebases (product and library) is usually at a binary level, i.e. the package repository. The package manager almost certainly doesn’t deal in branches per-se, only published versions of a package [2].

Where things get even more complicated is when you have a few packages that all make use of some 3rd party library, e.g. a message queuing product, and you discover you need to upgrade that as part of your feature work too. If you go via the normal channels you’ll end up upgrading the dependency in the packages, publish them, and then pull those upgraded packages into your feature branch and carry on where you left off [3].

Dependent Changes

However, anyone working on another feature branch or even the trunk can no longer make an orthogonal change to those packages because pulling them in would likely create an impedance mismatch, unless they also duplicate the integration work done on the feature branch or cherry pick it. Ideally that would be a trivial merge but the very nature of feature branches is to work in isolation and therefore changes tend to get intertwined because the focus is on the feature itself, not the integration steps in-between.

Essentially until the new package versions are fully integrated any changes to them will be delayed, which if you’ve already started work means a blocker goes up on the board. In this instance, being used to trunk based development, I didn’t want to wait so instead reverted the upgrade to the 3rd party library, published it (with my changes), and then integrated it into the main product directly on the trunk.

Unfortunately this creates an extra burden for those on the feature branch as they will need to re-upgrade the package again before integrating their changes back into the trunk. Such is the price for working in isolation.

Small is Beautiful

One of the benefits of working in an Agile way using trunk based development is that it teaches you to focus of really small, but nonetheless valuable changes. A user story may be split across a number of commits and so we have to think about the way we’ll deliver the changes. Feature toggles help us to hide our work in progress but, as just described, occasionally we may need to make a more sweeping change.

In these scenarios we should be able to push the feature work temporarily onto the “stack”, make the package changes, and then pop the feature work and carry on. With trunk based development the work is woven in just like any other, but when using a feature branch you need to switch back to the trunk, perform the upgrade, then merge trunk into the feature branch before carrying on.

I believe that if you are able to make this work smoothly using a feature branch then you are almost certainly capable of making the changes directly on the trunk in the first place. Planning doesn’t stop at the release and sprint level, we also need to plan how we evolve the codebase at story and task level too to minimise disruption to others whilst also making progress on our own work.

 

[1] They only lasted a few days and were trying to move away from that where possible.

[2] In theory this is where sematic versioning comes into play as the breaking change demands a new major version. My change would then have been made for both major versions. I say “in theory” because In my experience this is not an approach commonly taken by enterprise development teams for internal libraries – the path of least resistance usually wins.

[3] Alternatively you may be able to hide the library’s dependency, assuming it’s backwardly compatible, by binding to it directly such as via a static library or merging assemblies. However, as with [2], it’s not the usual approach.

“Hello World” Stories

Chris Oldwood from The OldWood Thing

I’ve always tried really hard to fight against “technical stories”. These are supposedly user stories but which are really framed as a solution to a problem and really just technical tasks. In “Turning Technical Tasks Into User Stories” I looked at how it’s often possible to elevate these from an obvious solution to a problem back up to a problem which needs to be solved. At this point you may discover there are other, hopefully cheaper, solutions to the problem which have been missed in the original analysis either because things have changed or different people are doing the thinking.

On the flip-side there are occasionally times where, after having looked at a few related stories, it’s apparent that they all require the same underlying mechanism to work. One common solution to this is to bulk up the first story with the technical work and let the rest flow through as normal. This way you have no technical work on your backlog per-se as it’s all hidden in the stories.

Transparency

What I don’t like about this approach is that one story arbitrarily gets hit with a load of extra work, which, if you’re using historical data to stick a finger in the air for estimation of similar work later, skews the average somewhat. It also means that from a visibility perspective one story takes longer while the mechanism is being built.

One way I’ve found to address this has been to pull out the bare bones of the technical work into a “Hello, World!” story [1]. This story is framed around building the skeleton of the mechanism that will be used to drive the implementation of the subsequent features. The aim is keep the scope minimal enough that we avoid speculating while still delivering something which stands on its own two feet and remains clearly visible on the board.

Value Proposition

While the value to the end-user is in the eventual feature, the value in the mechanism is proving to the development team that the basic approach seems sound. With the skeleton built, the idiosyncrasies around each individual feature can then be dealt with appropriately at the right time and accounted for in the usual way.

To be clear this is not about doing a spike or building a prototype, although that may have happened earlier to gain the knowledge needed to undertake this piece of work. No, here we’re talking about building the bare bones of a real mechanism along with the most basic feature possible.

The reason I’ve called these “Hello World” Stories is probably self-evident, it alludes to the classic program many have chosen as their first – to write “Hello, World!” to the console. In this context the name is intended to conjure up simplicity and remind us that what we’re doing is delivering the minimum required to make the platform viable. We probably won’t literally write “Hello, World!” to the console, but it may a log message instead that we can then observe and monitor, or be a message on a queue that we can see discarded. Essentially whatever we can do to make its effects observable without wasting any real effort or leaving it partially complete.

Based on the classic INVEST acronym we should strive to make every unit of work: Independent, Negotiable, Valuable, Estimable, Small and Testable. By splitting it out from one of the arbitrary features it becomes more independent, negotiable, estimable and small which can be useful should short-term priorities change. And by extending the scope from a pure mechanism just a little bit further to the most trivial feature possible we make it more testable from a technical perspective, even if not from a product viewpoint. Most importantly, however, is it valuable in its own right? I think sometimes splitting the mechanism out gives value by making the I,N,E,S and T more tangible. In particular breaking work down into smaller deliverable units is often the most valuable practice even if occasionally the end-user has nothing initially to show for it.

Ultimately, I guess, I can’t ever remember anyone complaining they had broken their work down into pieces that were so small they were too visible.

 

[1] I’m sure there is an argument about this not being a “story” per-se but just a “task”. However I prefer to call it a story because our “Hello, World!” realization should have a grounding in the real world, even if it is more abstract than what the end-user will eventually receive.

[2] There is an assumption here that we’ve already decided we cannot or do not want to solve the dependent features in different ways, probably because it would be far more costly (in the long run) than briefly delaying them by building a common pillar.

Overthinking is not Overengineering

Chris Oldwood from The OldWood Thing

As the pendulum swings ever closer towards being leaner and focusing on simplicity I grow more concerned about how this is beginning to affect software architecture. By breaking our work down into ever smaller chunks and then focusing on delivering the next most valuable thing, how much of what is further down the pipeline is being factored into the design decisions we make today?

Wasteful Thinking

Part of the ideas around being leaner is an attempt to reduce waste caused by speculative requirements which has led many a project in the past into a state of “analysis paralysis” where they can’t decide what to build because the goalposts keep moving. By focusing on delivering something simpler much sooner we begin to receive some return on our investment earlier and also shape the future based on practical feedback from today, rather than trying to guess what we need.

When we’re building those simpler features that sit nicely upon our existing foundations we have much less need to worry about the cost of rework from getting it wrong as it’s unlikely to be expensive. But as we move from independent features to those which are based around, say, a new “concept” or “pillar” we should spend a little more time looking further down the backlog to see how any design choices we make might play out later.

Thinking to Excess

The term “overthinking” implies that we are doing more thinking than is actually necessary; trying to fit everyone’s requirements in and getting bogged down in analysis is definitely an undesirable outcome of spending too much time thinking about a problem. As a consequence we are starting to think less and less up-front about the problems we solve to try and ensure that we only solve the problem we actually have and not the problems we think we’ll have in the future. Solving those problems that we are only speculating about can lead to overengineering if they never manage to materialise or could have been solved more simply when the facts where eventually known.

But how much thinking is “overthinking”? If I have a feature to develop and only spend as much effort thinking as I need to solve that problem then, by definition, any more thinking than that is “overthinking it”. But not thinking about the wider picture is exactly what leads to the kinds of architecture & design problems that begin to hamper us later in the product’s lifetime, and later on might not be measured in years but even in days or weeks if we are looking to build a set of related features that all sit on top of a new concept or pillar.

The Horizon

Hence, it feels to me that some amount of overthinking is necessary to ensure that we don’t prematurely pessimise our solution and paint ourselves into a corner. We should factor work further down the backlog into our thoughts to help us see the bigger picture and work out how we can shape our decisions today to ensure it biases our thinking towards our anticipated future rather than an arbitrary one.

Acting on our impulses prematurely can lead to overengineering if we implement what’s in our thoughts without having a fairly solid backlog to draw on, and overengineering is wasteful. In contrast a small amount of overthinking – thought experiments – are relatively cheap and can go towards helping to maintain the integrity of the system’s architecture.

One has to be careful quoting old adages like “a stich in time saves nine” or “an ounce of prevention is worth a pound of cure” because they can send the wrong message and lead us back to where we were before – stuck in The Analysis Phase [1]. That said I want us to avoid “throwing the baby out with the bathwater” and forget exactly how much thinking is required to achieve sustained delivery in the longer term.

 

[1] The one phrase I always want to mean this is think globally, act locally” because it sounds like it promotes big picture thinking while only implementing what we need today, but that’s probably stretching it too far.

Feeling Isolated

Chris Oldwood from The OldWood Thing

By and large I think I’ve been fairly lucky with my time as a contract programmer. Virtually all the teams I’ve worked in and systems I’ve worked on have been pretty decent. None of them are going to change the world but they’ve been enjoyable, which is probably why I’ve ended up working on them for a decent length of time [1].

I can only say “virtually all” because one contract sadly fell way short of the mark. Although I was technically part of a team it only really felt that way from a managerial perspective, even though we shared a codebase. I felt somewhat isolated both physically and mentally. Aside from the morning stand-up I could easily have gone the rest of the day without speaking to my teammates if I had chosen to do so.

Physical Isolation

I started the contract on a separate floor from the rest of my team with a couple of other recent joiners [2]. We were the only people on that floor with the air conditioning on full blast so we had to wear our coats in the afternoon to stay warm. None of the rest of my team had an office pass that could access the floor either, should they want to talk face-to-face while getting us up to speed.

Even when they moved us onto the same floor a month later we were still on the opposite side of the room. In the next desk shuffle I got to swap colleagues although they were working on an entirely separate area of the system with a totally different bunch of people so we had little need to collaborate per-se, only to make small talk. Also the two desks next to me only seemed to be used for a game of Tower of Hanoi by the office movers given how the occupants came and went.

Even my “customer”, at least, the one I knew about, because they were paying for the project, was situated in a different country and spoke a different language. Although their English was way better than any knowledge I have of a second language I quickly discovered why most communication was via email or IM instead of vocally.

Project Isolation

Being an enterprise scale organisation the work was all about projects, and who was sponsoring how many “resources”. Nowhere was this more apparent than the Scrum Board with its project-oriented swim-lanes. Each swim-lane had the names of the team members assigned to that project, and as the stand-up proceeded it walked down the board a project at a time with each member of the sub-team providing an update.

It was fairly apparent right from the moment I started, just by reading the body language of the team members, that there was often little real interest in what the rest of the team was doing. Those that did, cut across projects to some degree because they tended to nurse the build system, deployments and monitoring. A couple of team members never attended our stand-up because they already attended a different one that encompassed their project.

To be fair some of the apathy at the stand-up was almost certainly down to its excessive length. And with little reason for attending except to provide a status update for the managers it’s no surprise those mostly on the periphery zoned out. Sometimes the only common goal of the team seemed to be to not break the system.

Code Isolation

During my short stint I effectively had one feature to work on. There were a couple of other minor tweaks to begin with but ultimately my project was one feature (nay, user story) and it took 5 months to deliver. That one feature involved making a change in an area of the codebase that nobody else knew except one of the tech leads who I soon discovered was leaving. In fact, taking away his days off after the announcement of his departure, I effectively had 3 days for any handover.

Not only were there no docs to work from there were no tests either. The only real knowledge about how any of the service was expected to behave had left firmly inside the head of the author. This pretty much just left doing a spot of software archaeology with the VCS in the hope that the commit messages might contain some extra clues. Many features had been tracked in a feature tracking tool but there were not enough licenses to go round so I had to hassle a teammate to look things up. Even then it often wasn’t worth it as there were no useful details; it felt like the ticket was just there to “tick a box”.

The code relied heavily on the caller “doing the right thing” so any understanding only made sense if you already knew what the caller was supposed to do, and that relied heavily on knowledge of the problem domain and the organisation’s other systems. (At the interview I made it perfectly clear that I still knew little about the problem domain, despite the many years I have worked in it [3].)

Methodology Isolation

Ever since I had my epiphany [4] around testing all those years ago I have become a firm believer in TDD and automated testing as the preferred approach to the sustainable delivery of quality software. Being told early in the project that “you won’t have time to write tests”, despite being asked in the interview about what your approach is, did not bode well.

It soon became apparent that the previous approach had been to rush something out and rely on manual, end-to-end testing and the customer doing things “right”. Validation was almost entirely left to the underlying maths library and so bizarre errors manifested and needed investigating by the developers due to a lack of basic error handling and reporting [5].

With no way of knowing if I had broken anything, because I didn’t know for sure what anything was supposed to do, my only recourse was to write new code with tests and then refactor later when someone (potentially me) could be sure that it was safe to do so. For existing code that I had to change or understand I would write a barrage of tests first to try and ensure I didn’t accidentally break anything. In some cases it was hard to know what was “by design” and what was “by accident”.

Clearly not everyone took this approach, as you can see in “It Compiles, Ship It!”. My pessimism paid off though once the edge cases and little extras started appearing as I could turn around a fix or improvement (safely) in minutes due to my suite of automated unit and regression tests.

Environment Isolation

Sadly, despite my ability to push through changes quickly into the integration test environment, it still took weeks for them to actually appear in the production environment. When my first task, a handful of lines of boilerplate code, took 6 weeks to make it into production I assumed continuous delivery was not something they cared about.

On the contrary, for one aspect of the business, releases were very frequent. It was just that I was on the other side and due to some (IMHO) poor architecture and deployment decisions my part of the distributed system was tightly-coupled to another (major) system’s release cycle.

While it might seem great having my own integration test environment to play with, I ran into issues no one else knew about and I had no idea who was really using it and for what. Once again that information pretty much departed with the author.

Parting Thoughts

On reflection I have to look at my own behaviour first and ask myself whether I was at least partly responsible for feeling left out. Once we moved onto the same floor it was definitely easier to wander over and ask people questions, which I did. However when the response is “well I worked all this out by myself originally” and “that’s more than anyone ever gave me” I think it’s not entirely unfair to assume that knowledge sharing isn’t high on some people’s agenda.

I believe I was as welcoming as I normally am and was happy to help out where possible, given the limited knowledge I had acquired. I guess that culturally there was such a large drive for autonomy that the idea of just chatting about stuff to see what improvements in the system or process would be beneficial just wasn’t on the cards. A couple of times what should have been a constructive comment or question definitely came out of me more as a snide remark which is never a good sign. I’ve been trying hard to be more aware of any sarcasm, which unfortunately comes all too easily to me, and so not add to any unnecessary negativity but I know I failed a few times.

Ultimately I think it says a lot about an organisation that rejects your approach because “they are not a start-up” when your application of that approach has only ever been in large enterprises and none of them has ever had an issue with it before. On the contrary they have often been grateful for the insights and improvements that I’ve brought.

Maybe if I was a lot younger I’d not have known any better and stuck it out a bit more but these days I know it’s just not worth the effort. I feel comfortable that I left the place in a better state than I joined it by documenting various things and writing tests for the code I wrote. After a slightly rocky start my customer seemed pretty pleased with everything I delivered, which I guess is largely what matters most.

As ever, my main regret is leaving behind some people that I wish I could have gotten to know better. Maybe I will, in another life, one where the benefits of collaboration are more positively encouraged.

 

[1] Mostly my tenure has been measured in years, not months.

[2] Only one of which was left when I called it a day – the other two barely lasted a month or so.

[3] See “Problem Domain Expert or Technical Expert or Even Both” for more on this recurring theme.

[4] See “My [Unit] Testing Epiphany” and my more recent ACCU / Agile on the Beach talk “A Test of Strength” for what lead to my enlightenment.

[5] Poor error messages is a popular topic of mine, see “Terse Exception Messages”. Also “The Perils of DateTime.Parse()” covers one specific example.

Technical Debt – Conscious Competence

Chris Oldwood from The OldWood Thing

Once upon a time the term Technical Debt seemed to have a very clear meaning but over the last few years that has been diluted to generally mean any crap code or process which is holding back delivery. I’m sure any scholars of Wittgenstein will be at pains to point out that “meaning is use” and therefore if everyone uses it this way who am I to argue?

For me the canonical source of information on the technical debt metaphor comes from the wiki of the person who coined the phrase in the first place – Ward Cunningham. The entry on Technical Debt there suggests to me that the choice to enter into debt is a wholly conscious one, not the unconscious acts of a less professional bunch of programmers.

By way of example I thought I’d take the opportunity to write up one of those occasions where I’ve been involved with taking debt on (in the original spirit of the term) and how we dealt with it, to show where the distinction lies.

The Bug

Soon after going live with v1.0 of a new calculation system in a large financial organisation we discovered that a number of key counterparties were missing from the daily report. The report generator was a late addition and there were various other issues around it’s development and testing which muddied the waters somewhat but suffice to say that this wasn’t delivered as cleanly as the core system was. (You might consider the more recent meaning of the term to apply here.)

More importantly what transpired was that due to various mergers in the company’s history a few counterparties had the same “unique” code in different back-end systems. This wasn’t just news to my team (we were all recent hires) but also to quite a few people in the business too. Due to only dealing with a limited set of “books” the codes were always unique to them in their context, but our new system cut right across them all.

The Root Cause

The generation of calculations was ultimately based around a Cartesian product of two counterparties, however given that most of those were pointless there was an optimisation which used another source of data to reduce that by more than an order of magnitude.

This optimisation should have been fairly simple but due to a need to initially use some existing manually managed counterparty data to ease the cutover (so regression testing should then reconcile exactly) it was somewhat more complicated than first envisaged.

Our system was designed to use the correct source of data eventually, but do a reverse lookup for the time being. It might sound simple but the lookup actually involved multiple lookups using combinations of keys that had to make assumptions about which legacy back-end system might hold the related data. The right person who could explain how we could do what we needed to do correctly also seemed elusive; there were many people with “heuristics”, but nobody who knew for sure.

In total there were ~100 counterparties out of a total of ~15,000 permutations that suffered from this problem. Unfortunately a handful of those 100 had a significant effect on the “bottom line” and therefore the usefulness of the system as a whole was in doubt at that point.

Entering Into Debt

Naturally once we unearthed this clanger we had to decide how to tackle it. After getting our heads around what this all meant and roughly where in the code the missing logic probably needed to go we had to make a decision – do we try and fix the underlying issue right away or try and put a workaround in place (assuming that’s even possible) to mitigate the problem, at least temporarily.

We were all very aware of going down the dark road of putting a tactical fix in place because we’d all seen where that can lead. We had made a concerted effort over the 12 months required to build the system to refactor relentlessly [1] and squash any bugs as soon as possible. This felt like a backwards step.

On the positive side by adopting a Design for Testability approach in most parts of the code we had extra switches on our processes [2] that allowed us to make per-counterparty requests, usually for diagnostic purposes. Hence the workaround took the form of sticking the list of missing counterparties in a simple text file, then using a command prompt FOR loop [3] to read the file and invoke the tool in “single counterparty mode”. Yes it was a little slow due the constant restarting of the process but it was easy to surgically insert into the workflow with the minimum of testing or risk.

Paying Back the Debt

With the hole plugged for now, and an easy mechanism in place for adding any other missing counterparties – update the text file – we were in a position to sort out the root problem without feeling under pressure to get the system working correctly 100%, ASAP.

As you can probably imagine the real solution wasn’t easy, not least because it was one of a few areas of SQL code that didn’t have any unit tests and was a tangled web of tables and views which had grown organically in an attempt to graft the old and the new worlds together [4].

What Did it Cost?

If we assume that the gung-ho approach would have been to just jump in and start fixing the real code, then what did we lose by not doing that? It’s possible that the final fix was simple and a little more investigation may have lead to that solution instead.

In contrast, the risk is that we end up in one of those “have we fixed it or not” scenarios where we spend an indeterminate amount of time being “real close” to getting towards “done”. The old adage about the last 10% also taking 90% of the time springs immediately to mind. Instead we were almost positive we had a simple workaround that could be deployed and get the system running correctly enough in an estimable amount of time. I believe there is a lot of value in having that degree of confidence.

What I think was critical was being able to remove the pressure on finding the right solution as this gave us time to really consider what needed to be done. Any fix done under pressure is not going to be given the attention to detail that it probably deserves. You then run the risk of making the system worse and having an even deeper hole to dig yourself out of.

The customer does not care about strategic versus tactical decisions per-se, they just want the thing to work. We cared about the solution because we knew it would be a burden in the short term as everyone had to remember about the bit “grafted on the side”. The general trust the team had built up by keeping quality at the forefront meant that the business would be more willing to trust us to reconcile the problem appropriately when the time came.

Use Language With Care

I really hope the term Technical Debt doesn’t continue to get watered down even further as it’s a powerful concept which is incredibly useful in the right hands. We already have far too many words for “alternate implementation” that are pejoratives carrying an air of unprofessionalness about them. I would like this one to remain in the hands of the professionals so they can continue to have “grown up” conversations with their customers about when it’s appropriate to consider taking shortcuts for a short term business need without them rolling their eyes, yet again.

 

[1] See “Relentless Refactoring” for more thoughts around this (unfortunately) contentious topic.

[2] “From Test Harness To Support Tool”, “Building Systems as Toolkits” and “In The Toolbox - Home-Grown Tools” all look at the non-functional side of tooling.

[3] A batch file just wouldn’t be complete without a for loop, see “Every Solution Starts With ‘FOR /F’”.

[4] This one feature seemed destined to plague us forever, see “So Many Wrongs, But No Rights” for another tale of woe.

TODO or TODO Not – Redux

Chris Oldwood from The OldWood Thing

One of my most “successful” posts was one I wrote way back in 2011 about my dislike for TODO style comments in code – “TODO or TODO Not - There Is No Later”. The premise of that post, which includes a number of examples from real codebases I’ve worked on, is that they are fundamentally pointless because they are almost certainly too low in value to get done. If they were valuable enough they either should be a proper feature on the backlog or be left to be handled as part of a relevant story, e.g. refactoring.

At the time I wrote that post I was convinced about them not living in the codebase (past the feature’s release) but I suggested that any potentially useful ones should be converted into bona fide user stories so they could be formally considered and prioritised along with the other items. After receiving a reply to a tweet that referenced my aging blog post from a company that tries to quantify technical debt by analysing such TODO style comments I felt it must be time for an update. (The TL;DR of this post was my reply to them.)

// Learn to Let Go

One of the hardest lessons I have learned over the intervening 7 years is how to let go of stuff. What caused me to cling onto some of those TODOs that I would run across was a fear of forgetting something important. My daily use of a log book (see “Pen & Paper”) to record notes as I go along exemplifies my apparent need to keep track of my current state as my brain is like the proverbial sieve. In an era where change was much harder because the software development QA process was largely manual this makes sense as it was more popular to batch-up changes. Couple this with a general disregard for anything non-functional, such as an appreciation for refactoring, and it’s all too easy to see why people bury their personal backlog in the code rather than open it up for discussion with non-developers.

There is a familiar ring here and that’s because I’ve walked this path before more recently in 2016’s “Developers Can Be Their Own Worst Enemy”. With a modern development process that puts an emphasis on trust and transparency we can let go of the past and should have more confidence in our peers and the management to see the value in our opinions.

It’s also not acceptable to just leave a cryptic comment in the code about why something should be implemented differently or moved elsewhere – the need to change must be backed up with a reason so that we can understand the value in it, and most TODO comments are throwaway rather than well reasoned arguments.

// Measuring Technical Debt

I’ll be honest, I genuinely cannot see the point of attempting to measure the level of technical debt in a codebase, even if it were possible to do so. For a start you need to decide if you’re taking the original interpretation – a conscious decision to temporarily take a shortcut - or the ever more popular one – crap code. Even if you could do that, what are you ever going to do with the output? I once saw a SonarQube report that said we had £X million of technical debt in a codebase I was working on; how exactly does that inform you?

The reason I find it pointless is because it feels like it’s measuring the wrong thing. Just like the story about the man looking for his keys under the streetlight we apply a tool to measure the quality of the code when what really matters is measuring our ability to deliver. The reason poor quality code affects us is because it makes future change hard and slows us down. Hence if it matters there must be a causal link because if the code is that bad our pace of delivery will slow down and we’ll see that (all other things being equal).

But what is more important in my mind is that even if you did know how much debt you had you would only want to focus on the areas that are subject to change, and you do that already by focusing on the most valuable work in the backlog! And the technique for tackling the debt is refactoring. Hence we already have what we need to address the real problem.

// Trimming the Backlog

Unwinding the stack slightly let me return to the topic of reflecting unfinished work in the product’s backlog.

That sentence should already be making your spider-senses tingle – how can it be unfinished? If we’ve not met the acceptance criteria we’re not done, so carry on. If the acceptance criteria was wrong or incomplete then that’s a discussion to have with the product owner for clarification. Note that “of good quality” is an acceptance criteria inherent in every story we do as it is virtually non-negotiable [1].

Perhaps a better term would be undiscovered work. Woody Zuill has this saying “it’s in the doing of the work that we discover the work that needs doing”. What we often unearth are things which never showed up in our initial analysis and therefore we now have to factor them into our schedule or decide to drop them. When we’re knee deep in the feature it may seem really important to do it now, but given time the need may slowly dissolve, and in the end possibly disappear altogether.

What you might initially put on the backlog (or write as a TODO in the code) could be quite specific, e.g. “Need to handle poisoned messages”. As you continue you might also add other scenarios such as “Should refresh token before it expires”  and “Triage malformed messages separately from those well-formed but still invalid”. These are all valuable behaviours which go towards building a reliable service but the initial focus might not be on reliability, maybe its currently just a tactical fix to enable learning about something else. You don’t want to waste time over-engineering but at the same time you also don’t want to forget what else you have learned doing it.

The problem with this mentality is that the backlog just grows and grows, and we’re back to where we were before with TODOs littered all over the code – it’s just an every growing feature list. As time passes the need to implement some of those features will either happen because they have suddenly become important (it’s no longer tactical) or they will vanish (it remains a mere stop-gap). Leaving the stories on the backlog just makes it harder and harder to prioritise as you keep going over old ground again and again.

Once again we need to let go and rely on the fact that if they are important they will eventually resurface. If you feel really uncomfortable about deleting them entirely then you might consider rolling up related stories back into epic sized ones as part of the backlog grooming. Applying this to our recent example you could easily fold them all into a single “Service Reliability” epic that would be easier to handle. You could turn each card’s title into a checklist item.

// Confidence

That said as I get older I become less tolerant of a never-ending backlog and want to be more aggressive in the backlog grooming. Part of that is down to having more confidence in knowing that issues will be addressed [2] in a more timely fashion and that prioritisation will take into account both the technical and functional features with more or less equal consideration because the team is trusted to do The Right Thing.

Being more experienced there is no doubt more than a grain of truth that my sphere of influence is probably wider but that shouldn’t really matter. Every story must be valuable and ideally stand-up on its own merit, where my experience comes in is probably in being able to express that value more succinctly.

// TODO: Redux

I haven’t seen, heard or read anything in the intervening years that has been able to convince me in the value of using TODO comments in the code as a more successful technique of managing what needs to be done. If anything my appetite for tracking any work outside the next few sprints / weeks has begun to wane simply because it is now so easy to change in direction with only a moment’s notice should the situation deteriorate. This does not mean we should be reckless, far from it, but leaving a TODO in the code as a way of conveying a change in architecture hardly seems optimal either.

As for the notion that a TODO in the code can be equated with an increase in technical debt I don’t buy that either. I would posit it is more likely to indicate a failure within the development process itself as the mismatch between behaviour and acceptance criteria either indicates a bug in the code or a bug in the requirements and neither of these outcomes sounds like something that should be brushed off lightly with a comment in the code.

 

[1] I try not to deal in absolutes as there always seems to be an exception, but either way it must be a conscious decision.

[2] Assuming the organisation is behaving in a moderately agile way and not just paying lip service to a couple of the ceremonies or practices.

Delivery Anti-Pattern: Local Optimisations

Chris Oldwood from The OldWood Thing

The daily stand-up mostly went along as usual. I wasn’t entirely sure why there was a stand-up as it wasn’t so much a team as a bunch of people working on the same codebase but with more-or-less individual goals. Applying the microservices analogy to the team you could say we were a bunch of micro-teams – each person largely acting with autonomy rather than striving to work together to achieve a common goal [1].

Time

But I digress, only slightly. What happened at the stand-up was a brief discussion around the delivery of a feature with the suggestion that it should be hidden behind a feature toggle. The implementer explained that they weren’t going to add a feature toggle because “it was a waste of their time”.

This surprised me. Knowing what I do now about how the team operates it isn’t that surprising but at the time I was used to working in teams where every member works towards common goals. One of those common goals is to try and ensure the delivery of features is a continuous flow and is not disrupted by a bad change which then has to be backed out because rolling back has the potential to create all sorts of disruption, not least the delay of those unaffected changes.

You should note that I’m not disagreeing with their choice of whether or not to use a feature toggle – I did not know anywhere near enough about the change or the system at that time to contribute to that decision. No, what disturbed me was the reason why they chose not to take that approach – that their time is more valuable than that of anyone else in the team, or the business for that matter.

In isolation that paints an unpleasant picture of the individual in question and that simply is not the case. However their choice of words, even if done without real consideration, does appear to reinforce the culture that surrounds them. In essence, with a feeling that the focus is on them and their performance, they are naturally going to behave in a way that favours optimising delivering their own features over that of the team at large.

Quality

Another example of favouring a local optimisation over the longer term goal of sustained delivery occurred when I was assigned my first piece of work. This was not so much a story as a couple of epics funded as an entire project (over 4 months solid work in the end). My instinct, after being shown roughly where in the code I needed to dig, was to ask where the existing tests were so that I knew where to add mine. The tech lead’s immediate response was “you won’t have time to write tests”.

My usual response to this statement is a jovially phrased “how will I know if it works then?” which often has the effect of opening a line of dialogue around the testing strategy and where it’s heading. Unfortunately this time around it only succeeded in the tech lead launching into a diatribe about how important delivery was, how much the business trusted us to deliver on time, blah, blah, blah, in fact almost everything that a good test suite enables!

Of course I still went ahead and implemented the entire project TDD-style and easily delivered it on time because I knew the approach was sound and the investment was more than worth it. The subsequent enhancements and repaying of some technical debt also became trivial at that point and meant that anyone, not just me, could safely and quickly make changes to that area of code. It also showed how easy it was to add new tests to cover changes to the older parts of the component when required later.

In the end over 10% of unit tests of the entire system had been written by me during that project for a codebase of probably over 1/2 million lines of code. I also added a command line test harness and a regression testing “framework” [2] in that time too all in an effort to reduce the amount of hoops you needed to go through to diagnose and safely fix any edge cases that showed up later. None of this was rocket science or in the least bit onerous.

Knowledge

I would consider a lack of supporting documentation one further local optimisation too. When only a select few have the knowledge to help support a system you have to continually rely on their help to nurse it through the bad times. This is especially true when the system has enough quirks that the cost of taking the wrong action is quite high (in terms of additional noise). If you need to remember a complex set of conditions and actions you’re going to get it wrong eventually without some form of checklist to work from. Relying on tribal knowledge is a great form of optimisation until core members of the team leave and you unearth the gaping holes in the team’s knowledge.

Better yet, design away the problems entirely, but that’s a different can of worms…

Project Before Product

I believe this was another example of how “projects” are detrimental to the development of a complex system. With the team funded by various projects and those projects being used as a very clear division on the task board through swim lanes [3] it killed the desire to swarm on anything but a production incident because you felt beholden to your specific stakeholders.

For example there were a number of conversations about fixing issues with the system that were slowing down delivery through unreliability that ended with “but who’s going to pay for that?” Although improvements were made they had to be so small as to not really affect the delivery of the project work. Hence the only real choice was to find easier ways to treat the symptoms rather than cure the disease.

Victims of Circumstance

Whenever I bump into this kind of culture my gut instinct is not to assume they are “incompetent” people, on the contrary, they’re clearly intelligent so I’ll assume they are shaped by their environment. Of course we all have our differences, that’s what makes diversity so useful, but we have to remember to stop once in a while and reflect on what we’re doing and question whether it’s still the right approach to take. What works for building Fizz Buzz does not work for a real-time, distributed calculation engine. And even if that approach did work once upon a time the world keeps moving on and so now we might be able to do even better.

 

[1] Pairing was only something you did when you’d already been stuck for some time, and when the mistake was found you went your separate ways again.

[2] I say “framework” because it was really just leveraging a classic technique: a command line tool reading CSV format data which fired requests into a server, the results of which are then diff’d against a known set of results (Golden Master Testing).

[3] The stand-up was originally run in project order, lead by the PM. Unsurprisingly those not involved in the other projects were rarely engaged in the meeting unless it was their turn to speak.