Dealing with unplanned but urgent work

Allan Kelly from Allan Kelly Associates


3) Maintenance and Evolution
To keep a product alive, we choose backlog stories that will bring value, and do them one after the other.
But… support of the application may take a huge part of the work. And when the problem is critical, there is nothing you can do but stop what you are doing and fix it. This can blow any estimation.
How do you deal with firefighting in a #NoProjects world?
And techniques to avoid it.
How do #NoProject and DevOps work together?

Let me take the last part of this question first. Operations has never been plagued by the project model the way development has. When does a SysAdmin ever say “The project is finished so I’m not going to restart the server”?

DevOps (aka Continuous Delivery) and Continuous Digital are a natural fit. The team is responsible and accountable: writing the code, deploying it and supporting it thereafter – “You built it, you operate it”, as DevOps people like to say.

Of course the team needs to contain all the skills needed to service this approach. That might mean having an individual specialist on the team or it might mean that team members have multiple skills. A Continuous team is not just a DevOps team, it is also a Business-Technology team – or #BizTech to coin a hashtag. (This week I heard such a team called a BizDevOps team. That is one portmanteau too far for me.)

Which brings us quite nicely to the first part of this question: how do you manage – and perhaps even plan for – unplanned work?

What I would like to happen when unplanned work appears is that it is written on a card and placed in the backlog. It then takes its place with all the other possible work. But… as the questioner states: this work can’t wait, it is urgent.

Unplanned but urgent work simply needs to be done. Quite possibly other work – less valuable work, or work which is not time critical – may even be interrupted.

At this point I was about to refer readers to an old blog post about Yellow Cards. But it turns out that I never wrote that post. Despite talking about Yellow cards for years I’ve never blogged about them. I wrote about them in Xanpan but for some reason or another I never wrote the blog… so here you go…

When a team is mid-sprint and unplanned work appears the team should:

  • First ask “Can this work wait?” – If so then write it on a regular card and put it in the backlog
  • If not then ask: is this more valuable than the work we are doing now? – If it isn’t then someone needs to find the source of the request and explain why it isn’t going to get done.
  • Assuming it is urgent then it gets written on a Yellow card.
  • If it is really really urgent then someone drops what they are doing and works on the yellow card immediately.
  • If it can wait a little while then the next person who finishes their current work picks up the card and does it.
  • Once the yellow card is done mark it as done as with any other card and work continues as it was before.

Accepting unplanned work into a sprint impacts the other work the team is doing. I’m not a big fan of the commitment protocol so to me it is no big deal if this work displaces something else. But if your team are expected to hold fast to hard commitments while dealing with unplanned work then you have a problem, call me, we need to talk more.

At the end of the iteration we can look at the cards and reason about them. Now that we can see the work we can manage it and decide what to do about it.

I count up the yellow cards – and all the planned work cards. That allows me to calculate a ratio of planned versus unplanned work. (Sometimes teams put a retrospective points estimate on a yellow but doing a card count is often sufficient.)
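
As a rough sketch of that bookkeeping – the sprint names and card counts below are invented for illustration – the ratio really is nothing more than a card count:

```python
# Toy example: count planned vs unplanned (yellow) cards per sprint and
# report the share of unplanned work. All numbers are made up.
sprints = {
    "Sprint 12": {"planned": 21, "unplanned": 4},
    "Sprint 13": {"planned": 19, "unplanned": 7},
    "Sprint 14": {"planned": 22, "unplanned": 5},
}

for name, cards in sprints.items():
    total = cards["planned"] + cards["unplanned"]
    share = cards["unplanned"] / total
    print(f"{name}: {cards['unplanned']} of {total} cards unplanned ({share:.0%})")
```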

This can be tracked over time – graph it, make it visible again. Now we can look at the work and the pattern of work, reason about it, maybe do some root-cause analysis. Perhaps:

  • Perhaps much of the urgent work isn’t really so urgent, perhaps the team should push back more. Maybe the team, or one of the team leaders, needs the authority to say No.
  • Perhaps most of the unplanned work comes from a particular person. Maybe this person doesn’t realise the impact of their unplanned requests, or maybe they need to be included in the planning process, or … a million other reasons.
  • Perhaps the unplanned work is coming from the same sub-system, maybe some remedial work on that sub-system could reduce the amount of unexpected work.
  • Perhaps the unplanned work is just the nature of the business and being responsive is valuable.

Looked at this way we can think about reducing the amount of unplanned work. But also, we can plan for unplanned work.

It is likely that over time a pattern will emerge. One team I know found that 20% to 25% of their work in any sprint was unplanned. They simply planned for 20% less work. They now had the capacity to cope with unplanned work. At the very least they could manage stakeholder expectations.

One team found that each sprint they were doing about 20% IT support tasks (new PCs, Word problems, etc.) so they hired a support technician.

Another team who agonised about unplanned work found that actually they only had about one unplanned card a week. Their problem was not excessive unplanned work but the fact that unplanned work tended to have a very high profile in the company.

Teams which find they have very high levels of unplanned work on a regular basis (e.g. over 50% of work for several months) may well decide to adopt a full Kanban system. Indeed, Kanban folk probably recognise my description as a very simple example of quality-of-service and policies.

I say more about Yellow Cards for unplanned but urgent work in Xanpan, so you might like to continue reading there.


This is the third question carried over from the #NoEstimates/#NoProjects August workshop in Zurich.


If you have any questions about Continuous Digital, Project Myopia and #NoProjects please mail them over and I’ll do my best to answer them in this blog.


The Nostradamus argument in software engineering research

Derek Jones from The Shape of Code

The Nostradamus argument in software engineering research goes something like: This idea was proposed in a paper by XX, some years ago.

I regularly encounter the Nostradamus argument when discussing what people in industry are doing, with one or more academics. The same argument is probably made in other fields.

The rules of academic research pretty much guarantee that somebody, at some time, has published a paper containing an idea related to something being discussed today.

The first researcher(s) to publish an idea gets the credit for the idea and ‘uses it up’; that is, somebody else cannot subsequently publish a paper claiming the same idea (it does happen, either through plagiarism or slip-ups during review).

The job of researchers is to find new ideas (well, actually these days it is to quickly find an idea that will get published; researchers are on a publication treadmill). Sometimes a paper will explicitly point out the novel idea they are claiming (usually a sign of a very poor paper; the author(s) obviously don’t feel confident that the reader will see anything of merit). Researchers also talk of gaps in the literature, i.e., some topic where little, if anything, has been published.

Before starting work in an area, researchers are supposed to read all relevant prior publications; this can be an awful lot of work and take a lot of time. In practice people tend to read the papers in the top 10, or so, journals published in the last few years; maybe looking at more journals and going further back in time if the initial search fails to return many results. I have had many conversations with researchers about a paper, or thesis, they are just completing and been told “I’m just finishing off the literature search”, i.e., they are doing the background checks after completing their research, not before (yes, sometimes rather similar work has already been published and some quick footwork is needed).

So the work of prior researchers is venerated in theory, but rarely in practice.

Technical Debt – Conscious Competence

Chris Oldwood from The OldWood Thing

Once upon a time the term Technical Debt seemed to have a very clear meaning but over the last few years that has been diluted to generally mean any crap code or process which is holding back delivery. I’m sure any scholars of Wittgenstein will be at pains to point out that “meaning is use” and therefore if everyone uses it this way who am I to argue?

For me the canonical source of information on the technical debt metaphor comes from the wiki of the person who coined the phrase in the first place – Ward Cunningham. The entry on Technical Debt there suggests to me that the choice to enter into debt is a wholly conscious one, not the unconscious acts of a less professional bunch of programmers.

By way of example I thought I’d take the opportunity to write up one of those occasions where I’ve been involved with taking debt on (in the original spirit of the term) and how we dealt with it, to show where the distinction lies.

The Bug

Soon after going live with v1.0 of a new calculation system in a large financial organisation we discovered that a number of key counterparties were missing from the daily report. The report generator was a late addition and there were various other issues around its development and testing which muddied the waters somewhat, but suffice to say that this wasn’t delivered as cleanly as the core system was. (You might consider the more recent meaning of the term to apply here.)

More importantly what transpired was that due to various mergers in the company’s history a few counterparties had the same “unique” code in different back-end systems. This wasn’t just news to my team (we were all recent hires) but also to quite a few people in the business. Due to only dealing with a limited set of “books” the codes were always unique to them in their context, but our new system cut right across them all.
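
To make the clash concrete, here is a tiny, entirely hypothetical illustration (the codes and entity names are invented): a code that is “unique” within each back-end system stops being unique the moment the systems are combined, so any cross-system lookup really needs a composite key.

```python
# Hypothetical data: the same "unique" counterparty code exists in two
# legacy back-end systems and refers to two different legal entities.
system_a = {"CPTY042": "Acme Bank Ltd"}
system_b = {"CPTY042": "Acme Insurance plc"}

# Keyed on the bare code, a combined view silently loses one entity...
combined_by_code = {**system_a, **system_b}
print(combined_by_code)  # {'CPTY042': 'Acme Insurance plc'}

# ...whereas a composite (system, code) key keeps them distinct.
combined = {
    ("SystemA", "CPTY042"): "Acme Bank Ltd",
    ("SystemB", "CPTY042"): "Acme Insurance plc",
}
```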

The Root Cause

The generation of calculations was ultimately based around a Cartesian product of counterparties (every pairing of two counterparties); however, given that most of those pairings were pointless, there was an optimisation which used another source of data to reduce that by more than an order of magnitude.

This optimisation should have been fairly simple but due to a need to initially use some existing manually managed counterparty data to ease the cutover (so regression testing should then reconcile exactly) it was somewhat more complicated than first envisaged.

Our system was designed to use the correct source of data eventually, but do a reverse lookup for the time being. It might sound simple but the lookup actually involved multiple lookups using combinations of keys that had to make assumptions about which legacy back-end system might hold the related data. The right person who could explain how we could do what we needed to do correctly also seemed elusive; there were many people with “heuristics”, but nobody who knew for sure.

In total there were ~100 counterparties out of a total of ~15,000 permutations that suffered from this problem. Unfortunately a handful of those 100 had a significant effect on the “bottom line” and therefore the usefulness of the system as a whole was in doubt at that point.

Entering Into Debt

Naturally once we unearthed this clanger we had to decide how to tackle it. After getting our heads around what this all meant and roughly where in the code the missing logic probably needed to go we had to make a decision – do we try and fix the underlying issue right away or try and put a workaround in place (assuming that’s even possible) to mitigate the problem, at least temporarily.

We were all very aware of going down the dark road of putting a tactical fix in place because we’d all seen where that can lead. We had made a concerted effort over the 12 months required to build the system to refactor relentlessly [1] and squash any bugs as soon as possible. This felt like a backwards step.

On the positive side, by adopting a Design for Testability approach in most parts of the code we had extra switches on our processes [2] that allowed us to make per-counterparty requests, usually for diagnostic purposes. Hence the workaround took the form of sticking the list of missing counterparties in a simple text file, then using a command prompt FOR loop [3] to read the file and invoke the tool in “single counterparty mode”. Yes, it was a little slow due to the constant restarting of the process, but it was easy to surgically insert into the workflow with the minimum of testing or risk.
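
The team’s actual workaround was the batch FOR loop described above; purely for illustration, the same idea expressed as a small Python wrapper (the file name, tool name and command line switch are all invented) might look like this:

```python
# Hypothetical sketch of the workaround: re-run the calculation tool once per
# missing counterparty listed in a plain text file. Names and flags are invented.
import subprocess

with open("missing_counterparties.txt") as f:
    counterparties = [line.strip() for line in f if line.strip()]

for cpty in counterparties:
    # "--counterparty" stands in for the real "single counterparty mode" switch.
    subprocess.run(["calcengine.exe", "--counterparty", cpty], check=True)
```

The point is not the language but the shape: a dumb, transparent loop that anyone can extend simply by editing the text file.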

Paying Back the Debt

With the hole plugged for now, and an easy mechanism in place for adding any other missing counterparties – update the text file – we were in a position to sort out the root problem without feeling under pressure to get the system working correctly 100%, ASAP.

As you can probably imagine the real solution wasn’t easy, not least because it was one of a few areas of SQL code that didn’t have any unit tests and was a tangled web of tables and views which had grown organically in an attempt to graft the old and the new worlds together [4].

What Did it Cost?

If we assume that the gung-ho approach would have been to just jump in and start fixing the real code, then what did we lose by not doing that? It’s possible that the final fix was simple and a little more investigation may have led to that solution instead.

In contrast, the risk is that we end up in one of those “have we fixed it or not” scenarios where we spend an indeterminate amount of time being “real close” to getting towards “done”. The old adage about the last 10% also taking 90% of the time springs immediately to mind. Instead we were almost positive we had a simple workaround that could be deployed and get the system running correctly enough in an estimable amount of time. I believe there is a lot of value in having that degree of confidence.

What I think was critical was being able to remove the pressure on finding the right solution as this gave us time to really consider what needed to be done. Any fix done under pressure is not going to be given the attention to detail that it probably deserves. You then run the risk of making the system worse and having an even deeper hole to dig yourself out of.

The customer does not care about strategic versus tactical decisions per se, they just want the thing to work. We cared about the solution because we knew it would be a burden in the short term as everyone had to remember about the bit “grafted on the side”. The general trust the team had built up by keeping quality at the forefront meant that the business would be more willing to trust us to reconcile the problem appropriately when the time came.

Use Language With Care

I really hope the term Technical Debt doesn’t continue to get watered down even further as it’s a powerful concept which is incredibly useful in the right hands. We already have far too many words for “alternate implementation” that are pejoratives carrying an air of unprofessionalism about them. I would like this one to remain in the hands of the professionals so they can continue to have “grown up” conversations with their customers about when it’s appropriate to consider taking shortcuts for a short term business need without them rolling their eyes, yet again.

 

[1] See “Relentless Refactoring” for more thoughts around this (unfortunately) contentious topic.

[2] “From Test Harness To Support Tool”, “Building Systems as Toolkits” and “In The Toolbox - Home-Grown Tools” all look at the non-functional side of tooling.

[3] A batch file just wouldn’t be complete without a for loop, see “Every Solution Starts With ‘FOR /F’”.

[4] This one feature seemed destined to plague us forever, see “So Many Wrongs, But No Rights” for another tale of woe.

Beware the Easy Fix

Chris Oldwood from The OldWood Thing

Whenever you get a bug report be sure you can reproduce the problem before you start and check you’ve fixed the bug when you make your change.

This advice might seem blindingly obvious and you’re probably wondering who on earth would try and fix a bug without reproducing the problem first and then without testing the fix works afterwards [1]. I wondered that too, but I was recently involved in a bug report that seemed so cut-and-dried I thought I might have to reconsider my own obsessive desire to stick rigidly to the process. I was of course mistaken…

The Bug

A bug showed up in a new message queue processing service that meant when the message queue broker was down for longer than a minute or so the consumer lost its connection and never reconnected in the background. In turn this meant the queue would slowly back up – the process was still alive and kicking, it just wasn’t servicing the queue.

This bug report came by way of a production incident and an experienced colleague had triaged the problem so the ticket came into the team with some useful details attached. In the ticket the final log message from the service before it went dark told us that the dispatcher thread had shut down due to the failure to reconnect. The ticket also pointed us to the bit of code where the dispatcher thread was configured.

Looking at the service code along with a quick read of the third party library documentation made it seem pretty obvious that the recovery options configured for the dispatcher were insufficient. It was set up with only 3 short retries and a circuit breaker for good measure. As a result of the incident some monitoring had been added to the queue, so there was no reason why we couldn’t just enable infinite connection retries [2] and effectively disable the circuit breaker. Fixing the dispatcher code was a doddle as the message consumer library is well designed and has good documentation.

It almost seemed too easy…

The Shortest Path

The problem with bugs in infrastructure code like this is that they almost certainly don’t have any automated test coverage because writing them is really hard [3]. In fact testing this kind of issue can be arduous even when done manually, as you need to control the middleware, which might be outside your control or just something which sits in the background ticking away and is therefore almost invisible unless you wrote the original code or have had to fix it before. Throw in the fact that the bug wasn’t a showstopper and it’s easy to see how you could apply Sir Tony Hoare’s principle about code being “so simple that there are obviously no bugs” and just push the change out based on the ability to compile the code and the fact that it doesn’t make matters any worse (you can show you’ve not broken the ability to connect to the queue).

Of course when the problem shows up in production again you’ll know that you never really fixed the problem and you’ll have to go around the loop once more, but do what you should have done first time around as the second outage will no doubt have made a few more people annoyed.

Another Bug

Unsurprisingly the simple code change suggested by the ticket actually had no effect at all when we came to test it, and this sudden realisation that we didn’t really understand what was going on was the impetus needed to take a step back and start again from the beginning.

Whilst performing a quick disconnection test (by bouncing the middleware) we noticed that the queue was behaving weirdly and not backing up like it said in the bug report. Another rabbit hole later [4] and we discovered that the queue was not set up to be durable, which in itself turned out to be another bug.

Eventually we find a way to reproduce the problem and in the process we learn a bit more about how the middleware and message consumer library both work. However we still don’t understand why the new dispatcher configuration does not appear to be working. Luckily the library is open source and so we can debug the issue ourselves and see what is going on under the hood.

The Real Fix

Who would have guessed that internally the message consumer library had another retry and circuit breaker policy that was used to control the (re)connection attempts to the message queue broker. Unlike the dispatcher thread error recovery policy, which was configured explicitly, the message queue connection policy was controlled by a couple of defaulted arguments on the connection configuration object constructor [5].

Sadly we couldn’t be explicit and use the “wait and retry forever” policy that was available on the dispatcher, so instead we had to settle for configuring the number of connection attempts to int.MaxValue.
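
The system in the post was built on a .NET message consumer library whose exact API I won’t guess at; but the general shape of the final behaviour – keep reconnecting to the broker rather than giving up after a few attempts – can be sketched with Python’s pika client for RabbitMQ (an illustrative stand-in, not the library the author used):

```python
# Illustrative reconnect-forever consumer loop using pika and RabbitMQ.
# A stand-in for the .NET library discussed in the post, not its actual API.
import time
import pika

def handle(channel, method, properties, body):
    print("processing", body)                       # placeholder for real work
    channel.basic_ack(delivery_tag=method.delivery_tag)

while True:
    try:
        connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue="work", durable=True)   # durable, as the post notes
        channel.basic_consume(queue="work", on_message_callback=handle)
        channel.start_consuming()
    except pika.exceptions.AMQPConnectionError:
        # Broker down or connection dropped: back off and try again forever,
        # rather than the consumer going dark while the queue backs up.
        time.sleep(5)
```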

Problem Solved

Naturally it was far simpler to test the fix because we eventually put the effort into working out how to reproduce the problem in the first place. This can be quite significant from a status reporting perspective because it means you are less likely to be over optimistic about your progress. If you’re struggling to reproduce the problem then you’re going to struggle to prove that you’ve fixed it. If you mistakenly believe that the fix is simple and you then feel under pressure to get the testing done at the end it’s harder to convince yourself to do what needs to be done rather than settle for only potentially being right.

 

[1] This is somewhat disingenuous as there are times where this is not possible, but that’s unusual in the world of mainstream software development.

[2] Without the alert on the queue size we would need to find another way to signal when processing has dropped off. For example the circuit breaker should have triggered some other alert as connection failures are to be expected, but only for a limited time before escalation needs to occur.

[3] See “Automated Integration Testing with TIBCO” for an example of how I’ve done this in the past with a TIBCO message queue.

[4] Yes, the middleware was RabbitMQ but no pun was intended, for once.

[5] I’m not suggesting the library, which is provided for free out of kindness, is at fault. On the contrary the documentation was excellent, as was the support we received on Gitter. I need to help fix this, somehow.

CppQuiz Android App now available!

Anders Schau Knatten from C++ on a Friday

CppQuiz.org is an open source C++ quiz site run by me, with contributions from the C++ community. If you’re unfamiliar with it, you can read more in its “About” section.

A few weeks ago, I was contacted by Sergey Vasilchenko. Without my knowledge, he’d built a prototype Android app for CppQuiz, and asked for my input. Naturally I was delighted, and since then we’ve been collaborating a bit on the project. I’ve been making some adjustments to the backend (basically adding some API functionality to cppquiz.org) and I’ve also done a bit of testing and review of the app. Most of the work here was by Sergey of course, who implemented the app.

The app works very similarly to the site itself, with the big advantage that it also works when you’re offline. It downloads all the questions from cppquiz.org on startup, and updates its database regularly. Perfect for your daily, bad-reception commute!

Currently training mode is the only available mode, but Sergey is working on a quiz mode too. He’s also planning to open source the app in the near future.

So, go download the C++ Quiz App now, and give it a spin!


Naked Element proud sponsors of Inspired Women’s Big Conversation Conference

Paul Grenyer from Paul Grenyer

Naked Element are proud to be the main sponsors of Inspired Women’s Big Conversation Conference. Women are under-represented in business, often especially in tech, which is a fact that needs to change. As well as talking and promoting change, Naked Element decided to offer practical support as a sponsor of the Big Conversation. As a local company it makes sense to get behind a regional effort designed to inspire the next generation, and as a forward-thinking company, supporting the Inspired Women Conference was an obvious choice. Naked Element are honoured to see our name alongside such an important and relevant cause.

The world view of research in software engineering

Derek Jones from The Shape of Code

For a long time I have been trying to figure out why so much research in software engineering is so obviously unconnected to the reality of software development.

As might have been guessed, the answer has been staring me in the face for some time.

Many researchers in software engineering have a modified mathematicians’ world view of research, i.e., investigate things we find interesting (the mathematicians’ view) and some years from now industry will discover our work and apply it (the modification). I have had multiple academics essentially say this to me and I had not appreciated that I need to argue against a world view (not specific points of that view). This mathematician world view also explains why my questions about evidence receive such baffled looks; and, I am regularly told that experiments cannot be done, or are meaningless, in software engineering research.

Which research field’s world view might be closest to software engineering? I would nominate drug discovery.

Claims made by researchers in drug discovery are expected to be backed up with evidence. There are problems to be solved (e.g., diseases to be cured) and researchers try out ideas by running experiments. They don’t put lots of time and effort into creating a new drug, propose this drug as a cure for some disease and then wait for industry to run some experiments, to see if the claims are true. I’m a regular reader of In The Pipeline, an interesting drug discovery blog that is accessible to those outside the field.

How do I argue against a world view? I have no idea; even if I did, I am not looking to start a crusade.

At least I now have a model of the situation that makes sense. Next month, I will be attending some workshops where there will be lots of researchers and I will get to try out my new insight.

How do we organise with a parallel team?

Allan Kelly from Allan Kelly Associates


2) Offshoring – When Project meets #NoProject
In our department, we maintain products, … we deliver small features. In parallel, we have some offshore teams working in traditional project mode
The hard part comes when we need to integrate/merge our code. We have different cycles, schedules and objectives. How do we manage code handover from offshore? – we feel we are inheriting a pack of legacy and tech debt that is added to our own stack.
No matter the handover contract agreement we set, … they don’t make the right choices for a long term maintenance vision.
They deliver the requested features, it works from a business point of view, but they deliver it in a way that can be difficult to maintain in production.
Of course, they do not maintain it, so they don’t have the experience of it.

This is the second question carried over from the #NoEstimates/#NoProjects workshop in Zurich last month.

While the question is phrased as “working with offshoring” I don’t see this as an offshore specific question. I think the question is about working with a second team, a team which does not hold maintenance responsibility and a team which perhaps doesn’t have the same quality standards as the primary team. I am sure that offshoring – and probably outsourcing – complicate matters because issues need to navigate additional boundaries.

One question in my mind is: if the second team impose such additional costs on the primary team, are they actually making more work than they are doing? That is, every hour the primary team spends dealing with the liabilities of handover is an hour not delivering their own work. I’d want to look at that, but let’s assume the second team are worth it.

Something which worries me here: “No matter the handover contract agreement we set”. Are the second team listening? Are they making an effort to work with the primary team? Or are they ignoring the primary team?

If that is the case then it is a big problem because there is little the primary team can do to fix how the second team work. So a question comes into my mind: are those responsible for engaging the second team aware of the issues? Would they like to improve things?

If not then, as the second team and those employing them are not concerned, the problem may be unsolvable. The only solution the primary team have is to insulate themselves from the problems of the second team. In the short run that removes the pain for individuals but in the long run it will make things worse.

To my mind it does fall on the primary team to make the case to both groups that things could be better, and then to work with the second team and management to improve them.

So how do we make things better?

The good news is there is lots that can be done here, there are people changes, process change and technical solutions. The bad news is, there is no silver bullet.

People changes: team members should visit each other.

I know travel budgets get cut but there is a clear case here: if team members could visit each other, understand each other and get to know each other, then they would both work together better and be in a better position to improve things.

Process changes: ask the second team to do smaller pieces of work and deliver more often.

I don’t know the mechanism by which work reaches the second team but someone somewhere is asking them to do things. That person needs to change their requests. Of course, this means moving the second team away from the Project Model and towards a Continuous model.

Technical changes: there are a lot of options here, but each of these options is going to work best if people have visited one another and the process has been changed to lots of little deliveries.

So:

  • Have both teams practice continuous integration (if they are not already).
  • Reduce the number of branches, move towards trunk based development.
  • Have both teams practice automated unit testing (preferably TDD) and automated acceptance testing (ATDD).
  • Add static analysis tools to the build.
  • Do manual live code reviews, i.e. developers at one location talk to developers at the other location while they do the code review.

None of these changes are unique to the scenario described in the question. They are common quality improvement practices. The only addition I’m making is on the code review: I want the second team to review the primary team’s code, not just the primary team reviewing the second. I want this because I want teams to learn from each other.

Hopefully you can spot two themes to my suggestions.

Firstly, I’m treating both teams as equal. That is only fair but it makes sense too. If the onshore team makes the offshore team feel as if they are treated as second class then they will act as if they are second class.

Second, most of my suggestions are straight from the Continuous Delivery playbook: have both teams do lots of small high quality pieces of work and integrate them without delay.

Underlying all this is a problem: the second team don’t do maintenance. They have little incentive to do better work and may be unaware of the problems their style of working is causing.

Now having said all this I might be accused of ignoring the question. The question stated: the offshore team work in project mode. And I’ve just given suggestions which could be considered not project mode. Put it this way, I think changes need to happen one way or another. If the second team want to cling to projects then so be it, but they can still improve their quality as they do so.


If you have any questions about Continuous Digital, Project Myopia and #NoProjects please mail them over and I’ll do my best to answer them in this blog.
