Beware the Easy Fix

Chris Oldwood from The OldWood Thing

Whenever you get a bug report, be sure you can reproduce the problem before you start, and check that you’ve actually fixed the bug when you make your change.

This advice might seem blindingly obvious, and you’re probably wondering who on earth would try to fix a bug without reproducing the problem first, or without testing that the fix works afterwards [1]. I wondered that too, but I was recently involved in a bug report that seemed so cut-and-dried I thought I might have to reconsider my own obsessive desire to stick rigidly to the process. I was of course mistaken…

The Bug

A bug showed up in a new message queue processing service: when the message queue broker was down for longer than a minute or so, the consumer lost its connection and never reconnected in the background. In turn this meant the queue would slowly back up – the process was still alive and kicking, it just wasn’t servicing the queue.

This bug report came by way of a production incident and an experienced colleague had triaged the problem so the ticket came into the team with some useful details attached. In the ticket the final log message from the service before it went dark told us that the dispatcher thread had shut down due to the failure to reconnect. The ticket also pointed us to the bit of code where the dispatcher thread was configured.

Looking at the service code, along with a quick read of the third-party library documentation, made it seem pretty obvious that the recovery options configured for the dispatcher were insufficient. It was set up with only 3 short retries and a circuit breaker for good measure. As a result of the incident some monitoring had been added to the queue, so there was no reason why we couldn’t just enable infinite connection retries [2] and effectively disable the circuit breaker. Fixing the dispatcher code was a doddle, as the message consumer library is well designed and has good documentation.
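
For illustration only: the post never names the resilience library, though the “wait and retry forever” wording later on suggests something Polly-like, so the sketch below assumes Polly; the specific values and type names are mine, not the original code.

    using System;
    using Polly;

    static class DispatcherRecovery
    {
        // Roughly the shape of the original configuration: a handful of
        // short retries wrapped in a circuit breaker.
        public static ISyncPolicy Original() =>
            Policy.Wrap(
                Policy.Handle<Exception>()
                      .CircuitBreaker(2, TimeSpan.FromMinutes(1)),
                Policy.Handle<Exception>()
                      .WaitAndRetry(3, attempt => TimeSpan.FromSeconds(attempt)));

        // The proposed fix: retry forever with a modest back-off and no
        // circuit breaker, relying on the new queue monitoring to raise
        // the alarm if the consumer still cannot reconnect.
        public static ISyncPolicy RetryForever() =>
            Policy.Handle<Exception>()
                  .WaitAndRetryForever(attempt => TimeSpan.FromSeconds(Math.Min(attempt, 30)));
    }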

It almost seemed too easy…

The Shortest Path

The problem with bugs in infrastructure code like this is that they almost certainly don’t have any automated test coverage, because writing such tests is really hard [3]. In fact testing this kind of issue can be arduous even when done manually, as you need to control the middleware, which might be outside your control, or just something that sits in the background ticking away and is therefore almost invisible unless you wrote the original code or have had to fix it before. Throw in the fact that the bug wasn’t a showstopper and it’s easy to see how you could apply Sir Tony Hoare’s principle about code being “so simple that there are obviously no bugs” and just push the change out, based on the fact that it compiles and doesn’t make matters any worse (you can show you’ve not broken the ability to connect to the queue).

Of course, when the problem shows up in production again you’ll know that you never really fixed it, and you’ll have to go around the loop once more, this time doing what you should have done the first time around, as the second outage will no doubt have annoyed a few more people.

Another Bug

Unsurprisingly, the simple code change suggested by the ticket had no effect at all when we came to test it, and the sudden realisation that we didn’t really understand what was going on was the impetus we needed to take a step back and start again from the beginning.

Whilst performing a quick disconnection test (by bouncing the middleware) we noticed that the queue was behaving weirdly and not backing up as the bug report said it should. Another rabbit hole later [4] and we discovered that the queue was not set up to be durable, which in itself turned out to be another bug.
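
The post doesn’t show the queue set-up, but footnote [4] confirms the middleware was RabbitMQ, and there durability is something you have to ask for explicitly, both when declaring the queue and when publishing messages. A minimal sketch using the RabbitMQ .NET client (the queue name and message are made up):

    using System.Text;
    using RabbitMQ.Client;

    var factory = new ConnectionFactory { HostName = "localhost" };
    using (var connection = factory.CreateConnection())
    using (var channel = connection.CreateModel())
    {
        // A queue only survives a broker restart if it is declared durable...
        channel.QueueDeclare(queue: "work-items",
                             durable: true,
                             exclusive: false,
                             autoDelete: false,
                             arguments: null);

        // ...and only keeps its messages across a restart if they are
        // published as persistent.
        var properties = channel.CreateBasicProperties();
        properties.Persistent = true;
        channel.BasicPublish(exchange: "",
                             routingKey: "work-items",
                             basicProperties: properties,
                             body: Encoding.UTF8.GetBytes("hello"));
    }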

Eventually we found a way to reproduce the problem, and in the process learned a bit more about how the middleware and the message consumer library work. However, we still didn’t understand why the new dispatcher configuration did not appear to be working. Luckily the library is open source, so we could debug the issue ourselves and see what was going on under the hood.

The Real Fix

Who would have guessed that, internally, the message consumer library had another retry and circuit breaker policy used to control the (re)connection attempts to the message queue broker? Unlike the dispatcher thread’s error recovery policy, which was configured explicitly, the message queue connection policy was controlled by a couple of defaulted arguments on the connection configuration object’s constructor [5].

Sadly we couldn’t be explicit and use the “wait and retry forever” policy that was available on the dispatcher, so instead we had to settle for setting the number of connection attempts to int.MaxValue.
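
The library is never named in the post, so the sketch below is hypothetical in every detail, but the shape of the trap is worth spelling out: the connection-level retry limit hides behind optional constructor parameters, so code that never mentions them silently gets a finite default.

    using System;

    // Hypothetical sketch only; the real library's types, names and
    // defaults are not given in the post.
    sealed class ConsumerConnectionConfiguration
    {
        public string BrokerUri { get; }
        public int ConnectionAttempts { get; }
        public TimeSpan ReconnectInterval { get; }

        // The easy-to-miss part: the (re)connection policy is controlled by
        // defaulted arguments, so callers who never mention them get a
        // small, finite number of attempts.
        public ConsumerConnectionConfiguration(
            string brokerUri,
            int connectionAttempts = 5,
            double reconnectIntervalSeconds = 5.0)
        {
            BrokerUri = brokerUri;
            ConnectionAttempts = connectionAttempts;
            ReconnectInterval = TimeSpan.FromSeconds(reconnectIntervalSeconds);
        }
    }

    static class ConnectionFix
    {
        // With no "retry forever" option at this level, the workaround is to
        // pass the largest finite value available.
        public static ConsumerConnectionConfiguration EffectivelyForever(string brokerUri) =>
            new ConsumerConnectionConfiguration(brokerUri, connectionAttempts: int.MaxValue);
    }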

Problem Solved

Naturally it was far simpler to test the fix, because we had eventually put the effort into working out how to reproduce the problem in the first place. This can be quite significant from a status reporting perspective, because it means you are less likely to be over-optimistic about your progress. If you’re struggling to reproduce the problem then you’re going to struggle to prove that you’ve fixed it. And if you mistakenly believe that the fix is simple, and then feel under pressure to get the testing done at the end, it’s harder to convince yourself to do what needs to be done rather than settle for only potentially being right.

 

[1] This is somewhat disingenuous as there are times when this is not possible, but that’s unusual in the world of mainstream software development.

[2] Without the alert on the queue size we would need to find another way to signal when processing has dropped off. For example, the circuit breaker should have triggered some other alert, as connection failures are to be expected, but only for a limited time before escalation needs to occur.

[3] See “Automated Integration Testing with TIBCO” for an example of how I’ve done this in the past with a TIBCO message queue.

[4] Yes, the middleware was RabbitMQ but no pun was intended, for once.

[5] I’m not suggesting the library, which is provided for free out of kindness, is at fault. On the contrary, the documentation was excellent, as was the support we received on Gitter. I need to help fix this, somehow.

CppQuiz Android App now available!

Anders Schau Knatten from C++ on a Friday

CppQuiz.org is an open source C++ quiz site run by me, with contributions from the C++ community. If you’re unfamiliar with it, you can read more in its “About” section.

A few weeks ago, I was contacted by Sergey Vasilchenko. Without my knowledge, he’d built a prototype Android app for CppQuiz, and he asked for my input. Naturally I was delighted, and since then we’ve been collaborating a bit on the project. I’ve been making some adjustments to the backend (basically adding some API functionality to cppquiz.org) and I’ve also done a bit of testing and review of the app. Most of the work, of course, was done by Sergey, who implemented the app.

The app works very similarly to the site itself, with the big advantage that it also works when you’re offline. It downloads all the questions from cppquiz.org on startup, and updates its database regularly. Perfect for your daily bad-reception commute!

Currently training mode is the only available mode, but Sergey is working on a quiz mode too. He’s also planning to open source the app in the near future.

So, go download the C++ Quiz App now, and give it a spin!

If you enjoyed this post, you can subscribe to my blog, or follow me on Twitter.

Naked Element proud sponsors of Inspired Women’s Big Conversation Conference

Paul Grenyer from Paul Grenyer

Naked Element are proud to be the main sponsors of Inspired Women’s Big Conversation Conference. Women are under-represented in business, and especially in tech, which is a fact that needs to change. As well as talking about and promoting change, Naked Element decided to offer practical support as a sponsor of the Big Conversation. As a local company it makes sense to get behind a regional effort designed to inspire the next generation, and as a forward-thinking company, supporting the Inspired Women Conference was an obvious choice. Naked Element are honoured to see our name alongside such an important and relevant cause.

The world view of research in software engineering

Derek Jones from The Shape of Code

For a long time I have been trying to figure out why so much research in software engineering is so obviously unconnected to the reality of software development.

As might have been guessed, the answer has been staring me in the face for some time.

Many researchers in software engineering have a modified version of the mathematicians’ world view of research, i.e., investigate things we find interesting (the mathematicians’ view) and some years from now industry will discover our work and apply it (the modification). I have had multiple academics essentially say this to me, and I had not appreciated that I needed to argue against a world view (not against specific points of that view). This mathematicians’ world view also explains why my questions about evidence receive such baffled looks, and why I am regularly told that experiments cannot be done, or are meaningless, in software engineering research.

Which research field’s world view might be closest to software engineering? I would nominate drug discovery.

Claims made by researchers in drug discovery are expected to be backed up with evidence. There are problems to be solved (e.g., diseases to be cured) and researchers try out ideas by running experiments. They don’t put lots of time and effort into creating a new drug, propose this drug as a cure for some disease and then wait for industry to run some experiments to see if the claims are true. I’m a regular reader of In The Pipeline, an interesting drug discovery blog that is accessible to those outside the field.

How do I argue against a world view? I have no idea; even if I did, I am not looking to start a crusade.

At least I now have a model of the situation that makes sense. Next month, I will be attending some workshops where there will be lots of researchers and I will get to try out my new insight.

How do we organise with a parallel team?

Allan Kelly from Allan Kelly Associates


2) Offshoring – When Project meets #NoProject
In our department, we maintain products, … we deliver small features. In parallel, we have some offshore teams working in traditional project mode
The hard part comes when we need to integrate/merge our code. We have different cycles, schedules and objectives. How do we manage the code handover from offshore? – we feel we are inheriting a pack of legacy and tech debt that is added to our own stack.
No matter the handover contract agreement we set, … they don’t make the right choices for a long term maintenance vision.
They deliver the requested features, it works from a business point of view, but they deliver it in a way that can be difficult to maintain in production.
Of course, they do not maintain it, so they don’t have the experience of it.

This is the second question carried over from the #NoEstimates/#NoProjects workshop in Zurich last month.

While the question is phrased as “working with offshoring” I don’t see this as an offshore-specific question. I think the question is about working with a second team: a team which does not hold maintenance responsibility and which perhaps doesn’t have the same quality standards as the primary team. I am sure that offshoring – and probably outsourcing – complicates matters, because issues need to navigate additional boundaries.

One question in my mind is: if the second team impose such additional costs on the primary team, are they actually creating more work than they are doing? That is, every hour the primary team spends dealing with the liabilities of handover is an hour not spent delivering their own work. I’d want to look at that, but let’s assume the second team are worth it.

Something which worries me here: “No matter the handover contract agreement we set”. Are the second team listening? Are they making an effort to work with the primary team? Or are they ignoring the primary team?

If that is the case then it is a big problem, because there is little the primary team can do to fix how the second team work. So a question comes into my mind: are those responsible for engaging the second team aware of the issues? Would they like to improve things?

If not then, as the second team and those employing them are not concerned, the problem may be unsolvable. The only solution the primary team have is to insulate themselves from the problems of the second team. In the short run that removes the pain for individuals but in the long run it will make things worse.

To my mind it does fall on the primary team to make the case to both groups that things could be better, and then to work with the second team and management to make them better.

So how do we make things better?

The good news is there is lots that can be done here: there are people changes, process changes and technical solutions. The bad news is there is no silver bullet.

People changes: team members should visit each other.

I know travel budgets get cut, but there is a clear case here: if team members can visit each other, understand each other and get to know each other, then they will work together better and be in a better position to improve things.

Process changes: ask the second team to do smaller pieces of work and deliver more often.

I don’t know the mechanism by which work reaches the second team but someone somewhere is asking them to do things. That person needs to change their requests. Of course, this means moving the second team away from the Project Model and towards a Continuous model.

Technical changes: there are a lot of options here, but each of them is going to work best if people have visited one another and the process has been changed to lots of little deliveries.

So:

  • Have both teams practice continuous integration (if they are not already).
  • Reduce the number of branches, move towards trunk based development.
  • Have both teams practice automated unit testing (preferably TDD) and automated acceptance testing (ATDD).
  • Add static analysis tools to the build.
  • Do manual live code reviews, i.e. developers at one location talk to developers at the other location while they do the code review.

None of these changes are unique to the scenario described in the question. They are common quality improvement practices. The only addition I’m making is on the code reviews: I want the second team to review the primary team’s code, not just the primary team reviewing the second’s. I want this because I want the teams to learn from each other.

Hopefully you can spot two themes to my suggestions.

Firstly, I’m treating both teams as equal. That is only fair but it makes sense too. If the onshore team makes the offshore team feel as if they are treated as second class then they will act as if they are second class.

Secondly, most of my suggestions are straight from the Continuous Delivery playbook: have both teams do lots of small, high quality pieces of work and integrate them without delay.

Underlying all this is a problem: the second team don’t do maintenance. They have little incentive to do better work and may be unaware of the problems their style of working is causing.

Now, having said all this, I might be accused of ignoring the question. The question stated that the offshore team work in project mode, and I’ve just given suggestions which could be considered not to be project mode. Put it this way: I think changes need to happen one way or another. If the second team want to cling to projects then so be it, but they can still improve their quality as they do so.


If you have any questions about Continuous Digital, Project Myopia and #NoProjects please mail them over and I’ll do my best to answer them in this blog.

Further Still On Natural Analogarithms – student

student from thus spake a.k.

For several months now my fellow students and I have been exploring -space, being the set of infinite dimensional vectors whose elements are the powers of the prime factors of the roots of rational numbers, which we chanced upon whilst attempting to define a rational valued logarithmic function for such numbers.
We have seen how we might define functions of roots of rationals employing the magnitude of their associated -space vectors and that the iterative computation of such functions may yield cyclical sequences, although we conspicuously failed to figure a tidy mathematical rule governing their lengths.
The magnitude is not the only operation of linear algebra that we might bring to bear upon such roots, however, and we have lately busied ourselves investigating another.
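
To make the construction concrete, here is a small worked example of my own (not the student’s), assuming the magnitude in question is the usual Euclidean length of the vector of prime exponents:

    % Worked example (not from the original series):
    %   360 = 2^3 . 3^2 . 5   ->   (3, 2, 1, 0, 0, ...)
    \sqrt{360} = 2^{3/2} \cdot 3^{1} \cdot 5^{1/2}
      \;\mapsto\; \bigl(\tfrac{3}{2},\, 1,\, \tfrac{1}{2},\, 0,\, 0,\, \dots\bigr),
    \qquad
    \bigl\| \bigl(\tfrac{3}{2},\, 1,\, \tfrac{1}{2},\, 0,\, \dots\bigr) \bigr\|
      = \sqrt{\tfrac{9}{4} + 1 + \tfrac{1}{4}}
      = \sqrt{\tfrac{7}{2}}.

Rational numbers themselves fit the same scheme with negative exponents; three halves, for example, maps to (−1, 1, 0, …).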

New design, for improved readability!

Anders Schau Knatten from C++ on a Friday

Hi, and welcome to a new and much cleaner “C++ on a Friday”! The old theme had fonts that were too small and a bit too much “design” going on, making it distracting to read.

The new theme has larger fonts, good responsiveness, a nice colour scheme, and is overall much more pleasant to read. Enjoy the improved reading experience! :)

 

A 1948 viewpoint on developer vs. computer time

Derek Jones from The Shape of Code

For a long time now developer time has been a lot more expensive than computer time. The idea that developers should organize what they do, so as to maximize the efficiency of computer time rather than their own time, is considered to be an echo from a bygone age.

Until recently, I thought the transition from this bygone age, when computer time was considered more important than developer time, started in the late 1960s. Don’t ask me why I thought this; put it down to personal bias.

I was recently reading A Survey of Eniac Operations and Problems: 1946-1952, published in 1952, and what did I find:

“Early in 1948, R. F. Clippinger and some of his associates, in the course of coding the solution of …, were forced to adopt a different method of using the Eniac in order to fit their problem on the machine. …. The experience with this method (first discussed in reference 1), led J. von Neumann to suggest the use of a serial code for control of the Eniac. Such a code was devised and employed with the Eniac beginning in March 1948. Operation of the Eniac with this code was several times slower than either the original method of direct programming or the code for parallel operation. However, the resulting simplification of coding techniques and other advantages far outweighed this disadvantage.”

In other words, in 1948, the people using one of the few computers in the world, which was clocked at 100 kHz, considered developer time to be more important than computer time.