Beware the Easy Fix

Chris Oldwood from The OldWood Thing

Whenever you get a bug report be sure you can reproduce the problem before you start and check you’ve fixed the bug when you make your change.

This advice might seem blindly obvious and you’re probably wondering who on earth would try and fix a bug without reproducing the problem first and then without testing the fix works afterwards [1]. I wondered that too but I was recently involved in a bug report that seemed so cut-and-dried I thought I might have to reconsider my own obsessive desire to stick rigidly to the process. I was of course mistaken…

The Bug

A bug showed up in a new message queue processing service that meant when the message queue broker was down for longer than a minute or so the consumer lost its connection and never reconnected in the background. In turn this meant the queue would slowly back-up – the process was still alive and kicking, it just wasn’t servicing the queue.

This bug report came by way of a production incident and an experienced colleague had triaged the problem so the ticket came into the team with some useful details attached. In the ticket the final log message from the service before it went dark told us that the dispatcher thread had shut down due to the failure to reconnect. The ticket also pointed us to the bit of code where the dispatcher thread was configured.

Looking at the service code along with a quick read of the third party library documentation made it seem pretty obvious that the recovery options configured for the dispatcher were insufficient. It was set-up with only 3 short retries and a circuit breaker for good measure. As a result of the incident some monitoring had been added to the queue so there was no reason why we couldn’t just enable infinite connection retries [2] and effectively disable the circuit breaker. Fixing the dispatcher code was a doddle as the message consumer library is well designed and has good documentation.

It almost seemed too easy…

The Shortest Path

The problem with bugs in infrastructure code like this is that they almost certainly don’t have any automated test coverage because writing them is really hard [3]. In fact testing this kind of issue can be arduous even when done manually as you need to control the middleware which might be outside your control or just something which sits in the background ticking away and therefore is almost invisible unless you wrote the original code or have had to fix it before. Throw in the fact that the bug wasn’t a showstopper and it’s easy to see how you could apply Sir Tony Hoare’s principle about code “being so simple there are no obvious bugs” and just push the change out based on the ability to compile the code and the fact that it doesn’t make matters any worse (you can show you’ve not broken the ability to connect to the queue).

Of course when the problem shows up in production again you’ll know that you never really fixed the problem and you’ll have to go around the loop once more, but do what you should have done first time around as the second outage will no doubt have made a few more people annoyed.

Another Bug

Unsurprisingly the simple code change suggested by the ticket actually had no effect at all when we came to test it, and this sudden realisation that we didn’t really understand what was going on was the impetus needed to take a step back and start again from the beginning.

Whilst performing a quick disconnection test (by bouncing the middleware) we noticed that the queue was behaving weirdly and not backing up like it said in the bug report. Another rabbit hole later [4] and we discover that the queue was not set-up to be durable, which in itself turned out to be another bug.

Eventually we find a way to reproduce the problem and in the process we learn a bit more about how the middleware and message consumer library both work. However we still don’t understand why the new dispatcher configuration does not appear to be working. Luckily the library is open source and so we can debug the issue ourselves and see what is going on under the hood.

The Real Fix

Who would have guessed that internally the message consumer library had another retry and circuit breaker policy that was used to control the (re)connection attempts to the message queue broker. Unlike the dispatcher thread error recovery policy, which was configured explicitly, the message queue connection policy was controlled by a couple of defaulted arguments on the connection configuration object constructor [5].

Sadly we couldn’t be explicit and use the “wait and retry forever” policy that was available on the dispatcher so instead we had to settle for configurating the number of connection attempts to int.MaxValue.

Problem Solved

Naturally it was far simpler to test the fix because we eventually put the effort into working out how to reproduce the problem in the first place. This can be quite significant from a status reporting perspective because it means you are less likely to be over optimistic about your progress. If you’re struggling to reproduce the problem then you’re going to struggle to prove that you’ve fixed it. If you mistakenly believe that the fix is simple and you then feel under pressure to get the testing done at the end it’s harder to convince yourself to do what needs to be done rather than settle for only potentially being right.

 

[1] This is somewhat disingenuous as there are times where this is not possible, but that’s unusual in the world of mainstream software development.

[2] Without the alert on the queue size we would need to find another way to signal when processing has dropped off. For example the circuit breaker should have triggered some other alert as connection failures are to be expected, but only for a limited time before escalation needs to occur.

[3] See “Automated Integration Testing with TIBCO” for an example of how I’ve done this in the past with a TIBCO message queue.

[4] Yes, the middleware was RabbitMQ but no pun was intended, for once.

[5] I’m not suggesting the library, which is provided for free out of kindness, is at fault. On the contrary the documentation was excellent, as was the support we received on Gitter. I need to help fix this, somehow.

Naked Element proud sponsors of Inspired Women’s Big Conversation Conference

Paul Grenyer from Paul Grenyer

Naked Element are proud to be the main sponsors of Inspired Women’s Big Conversation Conference. Women are under-represented in business, often especially in tech, which is a fact that needs to change. As well as talking and promoting change, Naked Element decided to offer practical support as a sponsor of the Big Conversation. As a local company it makes sense to get behind a regional effort designed to inspire the next generation, and as a forward-thinking company, supporting the Inspired Women Conference was an obvious choice. Naked Element are honoured to see our name alongside such an important and relevant cause.

The world view of research in software engineering

Derek Jones from The Shape of Code

For a long time I have been trying to figure out why so much research in software engineering is so obviously unconnected to the reality of software development.

As might have been guessed, the answer has been staring me in the face for some time.

Many researchers in software engineering have a modified mathematicians’ world view of research, i.e., investigate things we find interesting (the mathematicians’ view) and some years from now industry will discover our work and apply it (the modification). I have had multiple academics essentially say this to me and I had not appreciated that I need to argue against a world view (not specific points of that view). This mathematician world view also explains why my questions about evidence receive such baffled looks; and, I am regularly told that experiments cannot be done, or are meaningless, in software engineering research.

Which research field’s world view might be closest to software engineering? I would nominate drug discovery.

Claims made by researchers in drug discovery are expected to be backed up with evidence. There are problems to be solved (e.g., diseases to be cured) and researchers try out ideas by running experiments. They don’t put lots of time and effort into creating a new drug, propose this drug as cure for some disease and then wait for industry to run some experiments, to see if the claims are true. I’m a regular reader of In The Pipeline, an interesting drug discover blog that is accessible to those outside the field.

How do I argue against a world view? I have no idea; even if I did, I am not looking to start a crusade.

At least I now have a model of the situation that makes sense. Next month, I will be attending some workshops where there will be lots of researchers and I will get to try out my new insight.

How do we organise with with a parallel team?

Allan Kelly from Allan Kelly Associates

Question000009042425XSmall-2018-09-25-15-04.jpg

2) Offshoring – When Project meets #NoProject
In our department, we maintain products, … we deliver small features. In parallel, we have some offshore teams working in traditional project mode
The hard part comes when we need to integrate/merge our code. We have different cycles, schedules and objectives. How do we manage code handover code from offshore? – we feel we are inheriting a pack of legacy and tech debt that is added to our own stack.
No matter the handover contract agreement we set, … they don’t make the right choices for a long term maintenance vision.
They deliver the requested features, it works from a business point of view, but they deliver it in a way that can be difficult to maintain in production.
Of course, they do not maintain it, so they don’t have the experience of it.

This is the second question carried over from the #NoEstimates/#NoProjects workshop in Zurich last month.

While the question is phrased as “working with offshoring” I don’t see this as an offshore specific question. I think the question is about working with a second team, a team which does not hold maintenance responsibility and a team which perhaps doesn’t have the same quality standards as the primary team. I am sure that offshoring – and probably outsourcing – complicate matters because issues need to navigate additional boundaries.

One question in my mind is: if the second team impose such additional costs on the primary team are they actually making more work than they are doing? That is, every hour the primary team spending dealing with the liabilities of handover is an hour not delivering their own work. I’d want to look at that but lets assume the second team are worth it.

Something which worries me here: “No matter the handover contract agreement we set”. Are the second team listening? Are they making an effort to work with the primary team? Or are they ignoring the primary team?

If that is the case then it is a big problem because there is little the primary team can do to fix how the second team work. So a question comes into my mind: are those responsible for having the second team aware of the issues? Would they like to improve things?

If not then, as the second team and those employing them are not concerned, the problem may be unsolvable. The only solution the primary team have is to insulate themselves from the problems of the second team. In the short run that removes the pain for individuals but in the long run it will make things worse.

To my mind it does fall on the primary team to make a case to both groups that they would like to make things better and to work with the second team and management to make things better.

So how do we make things better?

The good news is there is lots that can be done here, there are people changes, process change and technical solutions. The bad news is, there is no silver bullet.

People changes: team members should visit each other.

I know travel budgets get cut but there is a clear case here that if team members could visit each other, understand each other, known each other then they will both work together better and be in a better position to improve things.

Process changes: ask the second team to do smaller pieces of work and deliver more often.

I don’t know the mechanism by which work reaches the second team but someone somewhere is asking them to do things. That person needs to change their requests. Of course, this means moving the second team away from the Project Model and towards a Continuous model.

Technical changes: there are a lot of options here but each of these options is gong to work best if people have visited one another and the process has been changed to lots of little.

So:

  • Have both teams practice continuous integration (if they are not already).
  • Reduce the number of branches, move towards trunk based development.
  • Have both teams practice automated unit testing (preferably TDD) and automated acceptance testing (ATDD).
  • Add static analysis tools to the build.
  • Do manual live code reviews, i.e. developers at one location talk to one at the other location while they do the code review.

None of these changes are unique to the scenario described in the question. They are common quality improvement practices. The only addition I’m making is on the code review, I want the second team to review the primary team’s code, not just the primary reviewing the second. I want this because I want teams to learn from each other.

Hopefully you can spot two themes to my suggestions.

Firstly, I’m treating both teams as equal. That is only fair but it makes sense too. If the onshore team makes the offshore team feel as if they are treated as second class then they will act as if they are second class.

Second, most of my suggestions are straight from the Continuous Delivery playbook: have both teams do lots of small high quality pieces of work and integrate them without delay.

Underlying here is a problem: the second team don’t do maintenance. They have little incentive to do better work and maybe unaware of the problems their style of working is causing.

Now having said all this I might be accused of ignoring the question. The question stated: the offshore team work in project mode. And I’ve just given suggestions which could be considered not project mode. Put it this way, I think changes need to happen one way or another. If the second team want to cling to projects then so be it, but they can still improve their quality as they do so.


If you have any questions about Continuous Digital, Project Myopia and #NoProjects please mail them over and I’ll do my best to answer them in this blog.

Receive these posts by e-mail?

Subscribe to my newsletter & receive a free eBook “Xanpan: Team Centric Agile Software Development”

The post How do we organise with with a parallel team? appeared first on Allan Kelly Associates.

Further Still On Natural Analogarithms – student

student from thus spake a.k.

For several months now my fellow students and I have been exploring -space, being the set of infinite dimensional vectors whose elements are the powers of the prime factors of the roots of rational numbers, which we chanced upon whilst attempting to define a rational valued logarithmic function for such numbers.
We have seen how we might define functions of roots of rationals employing the magnitude of their associated -space vectors and that the iterative computation of such functions may yield cyclical sequences, although we conspicuously failed to figure a tidy mathematical rule governing their lengths.
The magnitude is not the only operation of linear algebra that we might bring to bear upon such roots, however, and we have lately busied ourselves investigating another.

A 1948 viewpoint on developer vs. computer time

Derek Jones from The Shape of Code

For a long time now developer time has been a lot more expensive than computer time. The idea that developers should organize what they do, so as to maximize the efficiency of computer time rather than their own time, is considered to be an echo from a bygone age.

Until recently, I thought the transition from this bygone age, when computer time was considered more important than developer time, started in the late 1960s. Don’t ask me why I thought this, put it down to personal bias.

I was recently reading A Survey of Eniac Operations and Problems: 1946-1952, published in 1952, and what did I find:

“Early in 1948, R. F. Clippinger and some of his associates, in the course of coding the solution of …, were forced to adopt a different method of using the Eniac in order to fit their problem on the machine. …. The experience with this method (first discussed in reference 1), led J. von Neumann to suggest the use of a serial code for control of the Eniac. Such a code was devised and employed with the Eniac beginning in March 1948. Operation of the Eniac with this code was several times slower than either the original method of direct programming or the code for parallel operation. However, the resulting simplification of coding techniques and other advantages far outweighed this disadvantage.

In other words, in 1948, the people using one of the few computers in the world, which clocked at 100KHz, considered developer time to be more important than computer time.

How should we organize our teams?

Allan Kelly from Allan Kelly Associates

StartingPoint-2018-09-18-12-23.jpg

Q1: How should we organize our teams?
My team is owner of different trading platforms and the core services around it. But we depend heavily on other products (e.g. financial feeds, client identification, services to send orders to stock markets, etc.). And of course each of the team managing these services have other platforms that are their clients.

When Vasco Duarte and I ran the #NoEstimates/#NoProjects workshop (or #NoNoWorkshop as I think of it) in Switzerland last month the attendees asked some good questions. With Project Myopia done and published, and Continuous Digital almost done it seems like a good time to repeat, and elaborate, the answers publicly. This will take a few blog posts to work through.

(I now have several Continuous Digital workshops and briefings available, please let me know what you think. Vasco and I are looking at repeating the workshop in London later this year, please get in touch if you are interested.)

The picture above is the way I see the question, if you have another interpretation, or another scenario please let me know.

The Continuous Digital model is for stable, long standing, autonomous, value seeking teams staffed with all the skills they need. Much of my thinking derives from Amoeba Management. Importantly each team needs to see how it adds value. In this case the business facing teams can see this – they enable the business do make money. But the back office teams find it hard see how they add value.

Now there are several possible answers to this question most of which involve some sort of re-organizations.

Option 1: Share the value

This solution does not involve reorganisation and comes straight from the pages of Amoeba management: allocate some portion of the value earned by the business facing teams to the teams they depend on. For example, the Trading platforms team might generate $10m each year. It could not do this without the services of the other three teams. Therefore some portion of Trading team’s earned value is passed to those teams.

Think about it, Trading Platforms affectively buys the services of three other teams. If those teams did not exist Trading Platforms would need to do that work themselves. Therefore those teams are contributing and deserve some credit.

This requires a serious conversation and probably needs more senior managers to intervene. Indeed, in Amoeba Management, Kazuo Inamori says that such decisions were among the most difficult ones facing Kyocera and often required more senior managers to make the final decision.

Nor is it always clear who buys from who, does a Sales Amoeba earn the value and pass part of it to the manufacturing team who build the product. Or does the Manufacturing Amoeba hire the Sales Amoeba to get their product to customers and therefore book the revenue and pass some to sales?

In the case above one might find it better to consider the value of the whole trading team including both the traders and the programmers who make the platform. Or perhaps the traders rent the platform from the technologists.

According to Inamori Kyocera standing allocations are set between teams. Alternatively one might create an internal market in which teams bought services from others on a piecemeal basis. On the one hand I like that idea model because it would allow for negotiation and trade-offs. On the other hand I imagine it creating a whole new set of bureaucracy, politics and internal sales. On balance, I’d fix the allocations and review periodically.

Option 2: Vertical slice

If you look at the picture above you might replace the word “team” with “library” or “services” and you would have a module dependency chart. Conway’s Law is at work – the organization and system reflect each other. (Although without knowing the history here it is difficult to say whether this was Conway’s Law or Reverse Conway’s Law at work.)

The services can stay as they are but we just disband the back-office teams and pass their responsibilities to the (enlarged) business teams.

Vertical-2018-09-18-12-23.jpg

The three teams will need to co-operate and co-ordinate with each other as they now have shared responsibilities. This itself can be a problem – two developers changing the same code anyone? But the world has moved on. Technology has improved.

In the days of SCCS, Visual Source-Unsafe, manual testing and monthly deployments it was a pain to have two teams working on the same code. But distributed source code control, automated testing and continuous delivery make this option far more viable than it once was.

On the plus side each team can work at their own pace on their own priorities and knowledge is spread around. On the downside teams can still trip up each other, they may duplicate work and specialist knowledge can get lost. (Note I am not saying “nobody has overall design authority” is a downside because while a single Linus can be an advantage it can also be a liability.)

One more problem here: this solution directly breaks Conway’s Law. In theory it could work but quite possibly the homomorphic force behind Conway’s Law might reassert itself. This might create some problems further down the line so needs monitoring.

Option 3: Independence

Taking option 2 to the extreme you might even separate the teams completely. Again there are plus and minuses.

Indepence-2018-09-18-12-23.jpg

On the one hand the teams are completely independent, they can move at their own pace, with their own priorities, value is clearly attributed and there is now resilience in the system and risk is reduced.

However, there is duplication. Not only does this mean more work it means that there may be inconsistencies, a client recognised by Trading might not be recognised by Yet Another.

Both options 2 and 3 demand larger teams and this option might requires more people overall. One can’t be sure because teams might come up with innovative solutions or come up with some new mechanism for sharing.

I’m sure some readers will discount this option very quickly but there are big benefits to complete independence – particularly when teams are separated geographically (e.g. Trading in London, Some Other in Frankfurt and Yet Another in Singapore) or when they are addressing different markets. One of the dangers of shared modules is that they become bloated by generic features nobody really wants but someone has to pay for.

This approach might also be advantageous when the company is in a growth and innovation mode. Let each team grow as fast as they can and innovate. In time a “winner” might emerge or common elements appear naturally.

Another variation on option 3 would be to have one team take the lead. Say Trading, this would be a larger team who developed the share services as part of their business facing work. But they would not “genericise” those services. The other, smaller teams, would do what they needed, when they needed, to service their own value streams.


That is three options. I could come up with some more, none is perfect. The important things are:

  • Create a clear way for teams to see the effects of their work and share in the value.
  • Allow teams autonomy in decision making and reduce dependencies.
  • Keep it simple so everyone can see cause and effect.
  • And of course, keep the teams stable – don’t break them up.

If you have any questions about Continuous Digital and #NoProjects please mail them over and I’ll do my best to answer them in this blog.

Receive these posts by e-mail?

Subscribe to my newsletter & receive a free eBook “Xanpan: Team Centric Agile Software Development”

The post How should we organize our teams? appeared first on Allan Kelly Associates.