Delivery Anti-Pattern: Local Optimisations

Chris Oldwood from The OldWood Thing

The daily stand-up mostly went along as usual. I wasn’t entirely sure why there was a stand-up as it wasn’t so much a team as a bunch of people working on the same codebase but with more-or-less individual goals. Applying the microservices analogy to the team you could say we were a bunch of micro-teams – each person largely acting with autonomy rather than striving to work together to achieve a common goal [1].

Time

But I digress, only slightly. What happened at the stand-up was a brief discussion around the delivery of a feature with the suggestion that it should be hidden behind a feature toggle. The implementer explained that they weren’t going to add a feature toggle because “it was a waste of their time”.

This surprised me. Knowing what I do now about how the team operates it isn’t that surprising but at the time I was used to working in teams where every member works towards common goals. One of those common goals is to try and ensure the delivery of features is a continuous flow and is not disrupted by a bad change which then has to be backed out because rolling back has the potential to create all sorts of disruption, not least the delay of those unaffected changes.

You should note that I’m not disagreeing with their choice of whether or not to use a feature toggle – I did not know anywhere near enough about the change or the system at that time to contribute to that decision. No, what disturbed me was the reason why they chose not to take that approach – that their time is more valuable than that of anyone else in the team, or the business for that matter.

In isolation that paints an unpleasant picture of the individual in question and that simply is not the case. However their choice of words, even if done without real consideration, does appear to reinforce the culture that surrounds them. In essence, with a feeling that the focus is on them and their performance, they are naturally going to behave in a way that favours optimising delivering their own features over that of the team at large.

Quality

Another example of favouring a local optimisation over the longer term goal of sustained delivery occurred when I was assigned my first piece of work. This was not so much a story as a couple of epics funded as an entire project (over 4 months solid work in the end). My instinct, after being shown roughly where in the code I needed to dig, was to ask where the existing tests were so that I knew where to add mine. The tech lead’s immediate response was “you won’t have time to write tests”.

My usual response to this statement is a jovially phrased “how will I know if it works then?” which often has the effect of opening a line of dialogue around the testing strategy and where it’s heading. Unfortunately this time around it only succeeded in the tech lead launching into a diatribe about how important delivery was, how much the business trusted us to deliver on time, blah, blah, blah, in fact almost everything that a good test suite enables!

Of course I still went ahead and implemented the entire project TDD-style and easily delivered it on time because I knew the approach was sound and the investment was more than worth it. The subsequent enhancements and repaying of some technical debt also became trivial at that point and meant that anyone, not just me, could safely and quickly make changes to that area of code. It also showed how easy it was to add new tests to cover changes to the older parts of the component when required later.

In the end over 10% of unit tests of the entire system had been written by me during that project for a codebase of probably over 1/2 million lines of code. I also added a command line test harness and a regression testing “framework” [2] in that time too all in an effort to reduce the amount of hoops you needed to go through to diagnose and safely fix any edge cases that showed up later. None of this was rocket science or in the least bit onerous.

Knowledge

I would consider a lack of supporting documentation one further local optimisation too. When only a select few have the knowledge to help support a system you have to continually rely on their help to nurse it through the bad times. This is especially true when the system has enough quirks that the cost of taking the wrong action is quite high (in terms of additional noise). If you need to remember a complex set of conditions and actions you’re going to get it wrong eventually without some form of checklist to work from. Relying on tribal knowledge is a great form of optimisation until core members of the team leave and you unearth the gaping holes in the team’s knowledge.

Better yet, design away the problems entirely, but that’s a different can of worms…

Project Before Product

I believe this was another example of how “projects” are detrimental to the development of a complex system. With the team funded by various projects and those projects being used as a very clear division on the task board through swim lanes [3] it killed the desire to swarm on anything but a production incident because you felt beholden to your specific stakeholders.

For example there were a number of conversations about fixing issues with the system that were slowing down delivery through unreliability that ended with “but who’s going to pay for that?” Although improvements were made they had to be so small as to not really affect the delivery of the project work. Hence the only real choice was to find easier ways to treat the symptoms rather than cure the disease.

Victims of Circumstance

Whenever I bump into this kind of culture my gut instinct is not to assume they are “incompetent” people, on the contrary, they’re clearly intelligent so I’ll assume they are shaped by their environment. Of course we all have our differences, that’s what makes diversity so useful, but we have to remember to stop once in a while and reflect on what we’re doing and question whether it’s still the right approach to take. What works for building Fizz Buzz does not work for a real-time, distributed calculation engine. And even if that approach did work once upon a time the world keeps moving on and so now we might be able to do even better.

 

[1] Pairing was only something you did when you’d already been stuck for some time, and when the mistake was found you went your separate ways again.

[2] I say “framework” because it was really just leveraging a classic technique: a command line tool reading CSV format data which fired requests into a server, the results of which are then diff’d against a known set of results (Golden Master Testing).

[3] The stand-up was originally run in project order, lead by the PM. Unsurprisingly those not involved in the other projects were rarely engaged in the meeting unless it was their turn to speak.

Treat All Test Environments Like Production

Chris Oldwood from The OldWood Thing

One of the policies I pushed for from the start when working on a greenfield system many years ago was the notion that we were going to treat all test environments (e.g. dev and UAT) like the production environment.

As you can probably imagine this was initially greeted with a heavy dose of scepticism. However all the complaints I could see against the idea were dysfunctional behaviours of the delivery process. All the little workarounds and hacks that were used to back-up their reasons for granting unfettered access to the environments seemed to be the result of poorly thought out design, inadequate localised testing or organisational problems. (See “Testing Drives the Need for Flexible Configuration” for how we addressed one of those concerns.)

To be clear, I am not suggesting that you should completely disable all access to the environment; on the contrary I believe that this is required even in production for those rare occasions when you just cannot piece together the problem from your monitoring and source code alone. No, what I was suggesting was that we employ the same speed bumps and privileges in our test environments that we would in production. And that went for the database too.

The underlying principle I was trying to enshrine here was that shared testing environments, by their very nature, should be treated with the utmost care to ensure a smooth delivery of change. In the past I have worked on systems where dev and test environments were a free-for-all. The result is that you waste so much time investigating issues that are orthogonal to your actual problem because someone messed with it for their own use and just left it in a broken state. (This is another example of the “Broken Windows” syndrome.)

A secondary point I was trying to make was that your test environments are also, by definition, your practice runs at getting things right. Many organisations have a lot of rigour around how they deploy to production but very little when it comes to the opportunities leading up to it. In essence your dev and test environments give you two chances to get things right before the final performance – if you’re not doing dress rehearsals beforehand how can you expect it to go right on the day? When production deployments go wrong we get fearful of them and then risk aversion kicks in meaning we do them less often and a downward spiral kicks in.

The outcome of this seemingly “draconian” approach to managing the development and test environments was that we also got to practice supporting the system in two other environments, and in a way that prepared us for what we needed to do when the fire was no longer just a drill. In particular we quickly learned what diagnostic tools we should already have on the box and, most importantly, what privileges we needed to perform certain actions. It also affected what custom tools we built and what extra features we added to the services and processes to allow safe use for analysis during support (e.g. a --ReadOnly switch).

The Principle of Least Privilege suggests that for our incident analysis we should only require read access to any resource, such as files, the database, OS logs, etc. If you know that you are protected from making accidental mistakes you can be more aggressive in your approach as you feel confident that the outcome of any mistake will not result in breaking the system any further [1][2]. Only at the point at which you need to make a change to the system configuration or data should the speed bumps kick in and you elevate yourself temporarily, make the change and immediately drop back to mere mortal status again.

The database was an area in particular where we had all been bitten before by support issues made worse through the execution of ad-hoc SQL passed around by email or pasted in off the wiki. Instead we added a new schema (i.e. namespace) specifically for admin and support stored procedures that were developed properly, i.e. they were written test-first. (See “You Write Your SQL Unit Tests in SQL” for more on how and why we did it this way.) This meant applying certain kinds of workarounds were easier to administer because they were essentially part of the production codebase, not just some afterthought that nobody maintained.

On the design front this also started to have an interesting effect as we found ourselves wanting to leverage our production service code in new ways to ensure that we avoided violating invariants by hosting the underlying service components inside new containers, i.e. command line tools or making them scriptable. (See “Building Systems as Toolkits”.)

The Interface Segregation Principle is your friend here as it pushes you towards having separate interfaces for reading and writing making it clearer which components you can direct towards a production service if you’re trying to reproduce an issue locally. For example our calculation engine support tool allowed you to point any “readers” towards real service endpoints whilst redirecting the the writers to /dev/null (i.e. using the Null Object pattern) or to some simple in-memory implementation (think Dictionary) to pass data from one internal task to the next.

I find it somewhat annoying that we went to a lot of effort to give ourselves the best chance of designing and building a supportable system that also provided traceability only for the infrastructure team to disallow our request for personal per-environment support accounts, saying instead that we needed to share a single one! Even getting them to give us a separate account for dev, UAT and production was hard work. It sometimes feel like the people who complain most about a lack of transparency and rigour are the same ones that deny you access to exactly that.

I know there were times when it felt as though we could drop our guard in dev or UAT “just this once” but I don’t remember us ever doing that. Instead we always used it as an opportunity to learn more about what the real need was and how it could become a bona fide feature rather than just a hack.

 

[1] That’s not entirely true. A BA once concocted a SQL query during support that ended up “bug checking” SQL Server and brought the entire system to a grinding halt. They then did it again by accident after it was restarted :o).

[2] A second example was where someone left the Sysinternals DebugView tool running overnight on a server whereupon it filled up the log window and locked up a service due to the way OutputDebugString works under the covers.

Good Stories Assure the Architecture

Chris Oldwood from The OldWood Thing

One of the problems a team can run into when they adopt a more agile way of working is they struggle to frame their backlog in the terms of user focused stories. This is a problem I’ve written about before in “Turning Technical Tasks Into User Stories” which looked at the problem for smaller units of work. Even if the team can buy into that premise for the more run-of-the-mill features it can still be a struggle to see how that works for the big ticket items like the system’s architecture.

The Awkward Silence

What I’ve experienced is that the team can start to regress when faced with discussions around what kind of architecture to aim for. With a backlog chock full of customer pleasing functionality the architectural conversations might begin to take a bit of a back seat as the focus is on fleshing out the walking skeleton with features. Naturally the nervousness starts to set in as the engineers begin to wonder when the architecture is going to get the attention it rightly deserves. It’s all very well supporting a handful of “friendly” users but what about when you have real customers who’ve entrusted you with their data and they want to make use of it without a moments notice at any hour of the day?

The temptation, which should be resisted, can be to see architectural work as outside the scope of the core backlog – creating a separate backlog for stuff “the business does not understand”. This way can lead to a split in the backlog, and potentially even two separate backlogs – a functional and a non-functional one. This just makes prioritisation impossible. Also burying the work kills transparency, eventually erodes trust, and still doesn’t get you the answers you really need.

Instead, the urge should be to frame the architectural concerns in terms the stakeholder does understand, so that the business can be more informed about their actual benefits. In addition, when “The Architecture” is a journey and not a single destination there is no longer one set of benefits to aim for there are multiple trade-offs as the architecture evolves over time, changing at each step to satisfy the ongoing needs of the customer(s) along the way. There is in essence no “final solution” there is only “what we need for the foreseeable future”.

Tell Me a Story

So, what do I mean by “good stories”? Well, the traditional way this goes is for an analyst to solicit some non-functional requirements for some speculative eventual system behaviour. If we’re really lucky it might end up in the right ballpark at one particular point in the future. What’s missing from this scene is a proper conversation, a proper story – one with a beginning, a middle, and an end – where we are today, the short term and the longer term vision.

But not only do we need to get a feel for their aspirations we also need quantifiable metrics about how the system needs to perform. Vague statements like “fast enough” are just not helpful. A globally accessible system with an anticipated latency in the tens of milliseconds will need to break the law of physics unless we trade-off something else. We also need to know how those exceptional events like Cyber Monday are to be factored into the operation side.

It’s not just about performance either. In many cases end users care that their data is secure, both in-flight (over the network) and at rest, although they likely have no idea what this actually means in practice. Patching servers is a technical task, but the bigger story is about how the team responds to a vulnerability which may make patching irrelevant. Similarly database backups are not the issue it’s about service availability – you cannot be highly available if the loss of an entire data centre potentially means waiting for a database to be restored from scratch elsewhere.

Most of the traditional conversations around non-functional requirements focus entirely on the happy path, for me the conversation doesn’t really get going until you start talking about what needs to happen when the system is down. It’s never a case of “if”, but “when” it fails and therefore mitigating these problems features heavily in our architectural choices. It’s an uncomfortable conversation as we never like discussing failure but that’s what having “grown up” conversations mean.

Incremental Architecture

Although I’ve used the term “story” in this post’s title, many of the issues that need discussing are really in the realm of “epics”. However we shouldn’t get bogged down in the terminology, instead the essence is to remember to focus on the outcome from the user’s perspective. Ask yourselves how fast, how secure, how available, etc. it needs to be now, and how those needs might change in response to the system’s, and the business’s growth.

With a clearer picture of the potential risks and opportunities we are better placed to design and build in small increments such that the architecture can be allowed to emerge at a sustainable rate.

Every Commit Needs the Rationale to Support It

Chris Oldwood from The OldWood Thing

Each and every change to a codebase should be performed for a very specific reason – we shouldn’t just change some code because we feel like it. If you follow a checklist (mental or otherwise), such as the one I described in “Commit Checklist”, then each commit should be as cohesive as possible with any unintentional edits reverted to spare our blushes.

However, whilst the code can say what behaviour has changed, we also need to say why it was changed. The old adage “use the source Luke” is great for reminding us that the only source of truth is the code itself, but changes made without any supporting documentation makes software archaeology [1] incredibly difficult in the future.

The Commit Log

Take the following one line change to the JSON serialization settings used when persisting to a database:

DateTimeZoneHandling = DateTimeZoneHandling.Utc;

This single-line edit appeared in a commit all by itself. Now, any change which has the potential to affect the storage or retrieval of the system’s data is something which should not be entered into lightly. Even if the change was done to make what is currently a default setting explicit, this fact still needs to be recorded – the rationale is important.

The first port of call for any documentation around a change is probably the commit message. Given that it lives with the code and is (usually) immutable it stands the best chance of remaining intact over time. In the example above the commit message was simply:

“Bug Fix: added date time zone handling to UTC for database json serialization”

In the same way that poor code comments have a habit of simply stating what the code does, the same malaise can affect commit messages by merely restating what was changed. Our example largely suffers from this, but it also teases us by additionally mentioning that it was done to fix a bug. Suddenly we have so many more unanswered questions about the change.

Code Change Comments

In the dim and distant past it was not unusual to use code comments to annotate changes as well as to describe the behaviour of the code. Before the advent of version control features like “blame” (aka annotate) it was non-trivial to track down the commit where any particular line of code changed. As such it seemed easier to embed the change details in the code itself rather than the VCS tool, especially if the supporting documentation lived in another system; you could just use the Change Request ID as the comment.

As you can imagine this sorta worked okay at first but as the code continued to change and refactoring became more popular these comments became as distracting and pointless as the more traditional kind. It also did nothing to help reduce the overheard of tracking the how-and-why in different places.

Feature Trackers

The situation originally used to be worse than this as new features might be tracked in one place by the business whilst bugs were tracked elsewhere by the development team. This meant that the “why” could be distributed right across time and space without the necessary links to tie them all together.

The desire to track all work in one place in an Enterprise tool like JIRA has at least reduced the number of places you need to look for “the bigger picture”, assuming you use the tool for more than just recording estimates and time spent, but of course there are lightweight alternatives [2]. Hence recording the JIRA number or Trello card number in the commit message is probably the most common approach to linking these two sides of the change.

As an aside, one of the reasons many teams haven’t historically put all their documentation in their source code repo is because it’s often been inaccessible to non-developer colleagues, either due to lack of permissions or technical ability. Fortunately tools like GitHub have started to bridge this divide.

Executable Specifications

One of the oldest problems in software development has been keeping the supporting documentation and code in sync. As features evolve it becomes harder and harder to know what the canonical reason for any change is because the current behaviour may be the sum of all previous related requirements.

An ever-growing technique for combating this has been to express the documentation, i.e. the requirements, in code too, in the form of tests. At a high level these are acceptance tests, with more technical behaviours expressed as unit or integration tests.

This brings me back to my earlier example. It’s incredibly rare that any code change would be committed without some kind of corresponding change to the automated tests. In this instance the bug must have manifested itself in the persistence layer and I’d expect at least one new test to be added (or an existing one fixed) to illustrate what the bug is. Hence the rationale for the change is to fix a bug, and the rationale can largely be described through the use of one or more well written tests rather than in prose.

Exceptions

There are of course no absolutes in life and fixing a spelling mistake should not require pages of notes, although spelling incorrectly on purpose probably does [3].

The point is that there is a balance to be struck if we are to trade-off the short and long term maintenance of the system. It might be tempting to rely on tribal knowledge or the product owner’s notes to avoid thinking about how the rationale is best expressed, but finding a way to encode that information in executable form, such as through tests, provides both the present reviewer and the future software archaeologist with the most usable representation.

 

[1] See my “Software Archaeology” article for more about spelunking a codebase’s history.

[2] I’ve written about the various tools I’ve used in the past in  “Feature Tracking”.

[3] The HTTP “referer” header being a notable exception, See Wikipedia.

Refactoring – Before or After?

Chris Oldwood from The OldWood Thing

I recently worked on a codebase where I had a new feature to implement but found myself struggling to understand the existing structure. Despite paring a considerable amount I realised that without other people to easily guide me I still got lost trying to find where I needed to make the change. I felt like I was walking through a familiar wood but the exact route eluded me without my usual guides.

I reverted the changes I had made and proposed that now might be a good point to do a little reorganisation. The response was met with a brief and light-hearted game of “Ken Beck Quote Tennis” - some suggested we do the refactoring before the feature whilst others preferred after. I felt there was a somewhat superficial conflict here that I hadn’t really noticed before and wondered what the drivers might be to taking one approach over the other.

Refactor After

If you’re into Test Driven Development (TDD) then you’ll have the mantra “Red, Green, Refactor” firmly lodged in your psyche. When practicing TDD you first write the test, then make it pass, and finally finish up by refactoring the code to remove duplication or otherwise simplify it. Ken Beck’s Test Driven Development: By Example is probably the de facto read for adopting this practice.

The approach here can be seen as one where the refactoring comes after you have the functionality working. From a value perspective most of it comes from having the functionality itself – the refactoring step is an investment in the codebase to allow future value to be added more easily later.

Just after adding a feature is the point where you’ve probably learned the most about the problem at hand and so ensuring the design best represents your current understanding is a worthwhile aid to future comprehension.

Refactor Before

Another saying from Kent Beck that I’m particularly fond of is “make the change easy, then make the easy change” [1]. Here he is alluding to a dose of refactoring up-front to mould the codebase into a shape that is more amenable to allowing you to add the feature you really want.

At this point we are not adding anything new but are leaning on all the existing tests, and maybe improving them too, to ensure that we make no functional changes. The value here is about reducing the risk of the new feature by showing that the codebase can safely evolve towards supporting it. More importantly It also gives the earliest visibility to others about the new direction the code will take [2].

We know the least amount about what it will take to implement the new feature at this point but we also have a working product that we can leverage to see how it’s likely to be impacted.

Refactor Before, During & After

Taken at face value it might appear to be contradictory about when the best time to refactor is. Of course this is really a straw man argument as the best time is in fact “all the time” – we should continually keep the code in good shape [3].

That said the act of refactoring should not occur within a vacuum, it should be driven by a need to make a more valuable change. If the code never needed to change we wouldn’t be doing it in the first place and this should be borne in mind when working on a large codebase where there might be a temptation to refactor purely for the sake of it. Seeing stories or tasks go on the backlog which solely amount to a refactoring are a smell and should be heavily scrutinised.

Emergent Design

That said, there are no absolutes and whilst I would view any isolated refactoring task with suspicion, that is effectively what I was proposing back at the beginning of this post. One of the side-effects of emergent design is that you can get yourself into quite a state before a cohesive design finally emerges.

Whilst on paper we had a number of potential designs all vying for a place in the architecture we had gone with the simplest possible thing for as long as possible in the hope that more complex features would arrive on the backlog and we would then have the forces we needed to evaluate one design over another.

Hence the refactoring decision became one between digging ourselves into an even deeper hole first, and then refactoring heavily once we had made the functional change, or doing some up-front preparation to solidify some of the emerging concepts first. There is the potential for waste if you go too far down the up-front route but if you’ve been watching how the design and feature list have been emerging over time it’s likely you already know where you are heading when the time comes to put the design into action.

 

[1] I tend to elide the warning from the original quote about the first part potentially being hard when saying it out loud because the audience is usually well aware of that :o).

[2] See “The Cost of Long-Lived Feature Branches” for a cautionary tale about storing up changes.

[3] See “Relentless Refactoring” for the changes in attitude towards this practice.

Are Refactoring Tools Less Effective Overall?

Chris Oldwood from The OldWood Thing

Prior to the addition of automatic refactoring tools to modern IDEs refactoring was essentially a manual affair. You would make a code change, hit build, and then fix all the compiler errors (at least for statically typed languages). This technique is commonly known as “leaning on the compiler”. Naturally the operation could be fraught with danger if you were far too ambitious about the change, but knowing when you could lean on the compiler was part of the art of refactoring safely back then.

A Hypothesis

Having lived through both eras (manual and automatic) and paired with developers far more skilled with the automatic approach I’ve come up with a totally non-scientific hypothesis that suggests automatic refactoring tools are actually less effective than the manual approach, overall.

I guess the basis of this hypothesis pretty much hinges on what I mean by “effective”. Here I’m suggesting that automatic tools help you easily refactor to a local minima but not to a global minima [1]; consequently the codebase as a whole ends up in a less coherent state.

Shallow vs Deep Refactoring

The goal of an automatic refactoring tool appears to be to not break your code – it will only allow you to use it to perform a simple refactoring that can be done safely, i.e. if the tool can’t fix up all the code it can see [2] it won’t allow you to do it in the first place. The consequence of this is that the tool constantly limits you to taking very small steps. Watching someone refactor with a tool can sometimes seem tortuous as they may need to use so many little refactoring steps to get the code into the desired state because you cannot make the leaps you want in one go unless you switch to manual mode.

This by itself isn’t a bad thing, after all making a safe change is clearly A Good Thing. No, where I see the problem is that by fixing up all the call sites automatically you don’t get to see the wider effects of the refactoring you’re attempting.

For example the reason you’d choose to rename a class or method is because the existing one is no longer appropriate. This is probably because you’re learned something new about the problem domain. However that class or method does not exist in a vacuum, it has dependencies in the guise of variable names and related types. It’s entirely likely that some of these may now be inappropriate too, however you won’t easily see them because the tool has likely hidden them from you.

Hence one of the “benefits” of the old manual refactoring approach was that as you visited each broken call site you got to reflect on your change in the context of where it’s used. This often led to further refactorings as you began to comprehend the full nature of what you had just discovered.

Blue or Red Pill?

Of course what I’ve just described could easily be interpreted as the kind of “black hole” that many, myself included, would see as an unbounded unit of work. It’s one of those nasty rabbit holes where you enter and, before you know it, you’re burrowing close to the Earth’s core and have edited nearly every file in the entire workspace.

Yes, like any change, it takes discipline to stick to the scope of the original problem. Just because you keep unearthing more and more code that no longer appears to fit the new model it does not mean you have to tackle it right now. Noticing the disparity is the first step towards fixing it.

Commit Review

It’s not entirely true that you won’t see the entire outcome of the refactoring – at the very least the impact will be visible when you review the complete change before committing. (For a fairly comprehensive list of the things I go through at the point I commit see my C Vu article “Commit Checklist”.)

This assumes of course that you do a thorough review of your commits before pushing them. However by this point, just as writing tests after the fact are considerably less attractive, so is finishing off any refactoring; perhaps even more so because the code is not broken per-se, it just might not be the best way of representing the solution.

It’s all too easy to justify the reasons why it’s okay to go ahead and push the change as-is because there are more important things to do. Even if you think you’re aware of technical debt it often takes a fresh pair of eyes to see how you’re living in a codebase riddled with inconsistencies that make it hard to see it’s true structure. One is then never quite sure without reviewing the commit logs what is the legacy and what is the new direction.

Blinded by Tools

Clearly this is not the fault of the tool or their vendors. What they offer now is far more favourable than not having them at all. However once again we need to be reminded that we should not be slaves to our tools but that we are the masters. This is a common theme which is regularly echoed in the software development community and something I myself tackled in the past with “Don’t Let Your Tools Pwn You”.

The Boy Scout Rule (popularised by Uncle Bob) says that we should always leave the camp site cleaner than we found it. While picking up a handful of somebody else’s rubbish and putting it in the bin might meet the goal in a literal sense, it’s no good if the site is acquiring rubbish faster than it’s being collected.

Refactoring is a technique for improving the quality of a software design in a piecewise fashion; just be careful you don’t spend so long on your hands and knees cleaning small areas that you fail to spot the resulting detritus building up around you.

 

[1] I wasn’t sure whether to say minima or maxima but I felt that refactoring was about lowering entropy in some way so went with the reduction metaphor.

[2] Clearly there are limits around published APIs which it just has to ignore.

Journey Code

Chris Oldwood from The OldWood Thing

In the move from a Waterfall approach to a more Agile way of working we need to learn to be more comfortable with our software being in a less well defined state. It’s not any less correct per-se, it’s just that we may choose to prioritise our work differently now so that we tackle the higher value features first.

Destination Unknown

A consequence of this approach is that conversations between, say, traditional architects (who are talking about the system as they expect it to end up) and those developing it (who know it’s only somewhere between started and one possible future), need interpretation. Our job as programmers is to break these big features down into much smaller units that can be (ideally) independently scoped and prioritised. Being agile is about adapting to changing requirements and that’s impossible if every feature needs to designed, delivered and tested in it’s entirely.

In an attempt to bridge these two perspectives I’ve somewhat turned into a broken record by frequently saying “but that’s the destination, we only need to start the journey” [1]. This is very much a statement to remind our (often old and habitual) selves that we don’t build software like that anymore.

Build the Right Thing

The knock on effect is that any given feature is likely only partially implemented, especially in the early days when what we are still trying to explore what it is we’re actually trying to build in the first place. For example, data validation is feature we want in the product before going live, but we can probably make do with only touching on the subject lightly whilst we explore what data even needs validating in the first place.

This partially implemented feature has started to go by the name “journey code”. While you can argue that all code is essentially malleable as the system grows from its birth to its eventual demise, what we are really trying to convey here is code which cannot really be described as having been “designed”. As such it does not indicate in any meaningful way the thoughts and intended direction of the original author – they literally did the simplest possible thing to make the acceptance test pass and that’s all.

Destination in Sight

When we finally come to play out the more detailed aspects of the feature that becomes the time at which we intend to replace what we did along the journey with where we think the destination really is. Journey code does not even have to have been written in response to the feature itself, it may just be some infrastructure code required to bootstrap exploring a different one, such as logging or persistence.

What it does mean is that collaboration is even more essential when the story is eventually picked up for fleshing out to ensure that we are not wasting time trying to fit in with some design that never existed in the first place. Unless you are someone who happily ignores whatever code anyone else ever writes anyway, and they do exist, you will no doubt spend at least some time trying to understand what’s gone before you. Hence being able to just say “journey code” has become a nice shorthand for “it wasn’t designed so feel free to chuck it away and do the right thing now we know what really needs doing”.

 

[1] I covered a different side-effect of this waterfall/agile impedance mismatch in “Confusion Over Waste”.

The Cost of Long-Lived Feature Branches

Chris Oldwood from The OldWood Thing

Many moons ago I was working at large financial organisation on one of their back office systems. The ever increasing growth of the business meant that our system, whilst mostly distributed, was beginning to creak under the strain. I had already spent a month tracking down some out-of-memory problems in the monolithic orchestration service [1] and a corporate programme to reduce hardware meant we needed to move to a 3rd party compute platform to save costs by sharing hardware.

Branch Per Project

The system was developed by a team (both on-shore and off-shore) numbering around 50 and the branching strategy was based around the many ongoing projects, each of which typically lasted many months. Any BAU work got done on the tip of whatever the last release branch was.

While this allowed the team to hack around to their heart’s content without bumping into any other projects it also meant merge problems where highly likely when the time came. Most of the file merges were trivial (i.e. automatic) but there were more than a few awkward manual ones. Luckily this was also in a time before “relentless refactoring” too so the changes tended to be more surgical in nature.

Breaking up the Monolith

Naturally any project that involved taking one large essential service apart by splitting it into smaller, more distributable components was viewed as being very risky. Even though I’d managed to help get the UAT environment into a state where parallel running meant regressions were usually picked up, the project was still held at arms length like the others. The system had a history of delays in delivering, which was unsurprising given the size of the projects, and so naturally we would be tarred with the same brush.

After spending a bit of time working out architecturally what pieces we needed and how we were going to break them out we then set about splitting the monolith up. The service really had two clearly distinct roles and some common infrastructure code which could be shared. Some of the orchestration logic that monitored outside systems could also be split out and instead of communicating in-process could just as easily spawn other processes to do the heavy lifting.

Decomposition Approach

The use of an enterprise-grade version control system which allows you to keep your changes isolated means you have the luxury of being able to take the engine to pieces, rebuild it differently and then deliver the new version. This was the essence of the project, so why not do that? As long as at the end you don’t have any pieces left over this probably appears to be the most efficient way to do it, and therefore was the method chosen by some of the team.

An alternative approach, and the one I was more familiar with, was to extract components from the monolith and push them “down” the architecture so they turn into library components. This forces you to create abstractions and decouple the internals. You can then wire them back into both the old and new processes to help verify early on that you’ve not broken anything while you also test out your new ideas. This probably appears less efficient as you will be fixing up code you know you’ll eventually delete when the project is finally delivered.

Of course the former approach is somewhat predicated on things never changing during the life of the project…

Man the Pumps!

Like all good stories we never got to the end of our project as a financial crisis hit which, through various business reorganisations, meant we had to drop what we were doing and make immediate plans to remediate the current version of the system. My protestations that the project we were currently doing would be the answer if we could just finish it, or even just pull in parts of it, were met with rejection.

So we just had to drop months of work (i.e. leave it on the branch in stasis) and look for some lower hanging fruit to solve the impending performance problems that would be caused by a potential three times increase in volumes.

When the knee jerk reaction began to subside the project remained shelved for the foreseeable future as a whole host of other requirements came flooding in. There was an increase in data volume but there were now other uncertainties around the existence of the entire system itself which meant it never got resurrected in the subsequent year, or the year after so I’m informed. This of course was still on our current custom platform and therefore no cost savings could be realised from the project work either.

Epilogue

This project was a real eye opener for me around how software is delivered on large legacy systems. Having come from a background where we delivered our systems in small increments as much out of a lack of tooling (a VCS product with no real branching support) as a need to work out what the users really wanted, it felt criminal to just waste all that effort.

In particular I had tried hard to encourage my teammates to keep the code changes in as shippable state as possible. This wasn’t out of any particular foresight I might have had about the impending economic downturn but just out of the discomfort that comes from trying to change too much in one go.

Ultimately if we had made each small refactoring on the branch next being delivered (ideally the trunk [2]) when the project was frozen we would already have been reaping the benefits from the work already done. Then, with most of the work safely delivered to production, the decision to finish it off becomes easier as the risk has largely been mitigated by that point. Even if a short hiatus was required for other concerns, picking the final work up later is still far easier or could itself even be broken down into smaller deliverable chunks.

That said, it’s easy for us developers to criticise the actions of project managers when they don’t go our way. Given the system’s history and delivery record the decision was perfectly understandable and I know I wouldn’t want to be the one making them under those conditions of high uncertainty both inside and outside the business.

Looking back it seems somewhat ridiculous to think that a team would split up and go off in different directions on different branches for many months with little real coordination and just hope that when the time comes we’d merge the changes [3], fix the niggles and release it. But that’s exactly how some larger teams did (and probably even still do) work.

 

[1] The basis of this work was written up in “Utilising More Than 4GB of Memory in a 32-bit Windows Process”.

[2] See “Branching Strategies”.

[3] One developer on one project spend most of their time just trying to resolve the bugs that kept showing up due to semantic merge problems. They had no automated tests either to reduce the feedback loop and a build & deployment to the test environment took in the order of hours.