Learning useful stuff from the Reliability chapter of my book

Derek Jones from The Shape of Code

What useful, practical things might professional software developers learn from my evidence-based software engineering book?

Once the book is officially released I need to have good answers to this question (saying: “Well, I decided to collect all the publicly available software engineering data and say something about it”, is not going to motivate people to read the book).

This week I checked the reliability chapter; what useful things did I learn (combined with everything I learned during all the other weeks spent working on this chapter)?

A casual reader skimming the chapter would conclude that little was known about software reliability, and they would be right (I already knew this, but I learned that we know even less than I thought was known), and many researchers continue to dig in unproductive holes.

A reader with some familiarity with reliability research would be surprised to see that some ‘major’ topics are not discussed.

The train wreck that is machine learning has been avoided (not forgetting that the data used is mostly worthless), mutation testing gets mentioned because of some interesting data (the underlying problem is that mutation testing assumes that coding mistakes are local to one line, but in practice coding mistakes often involve multiple lines), and the theory discussions don’t mention non-homogeneous Poisson process as the basis for software fault models (because this process is not capable of solving the questions asked).

What did I learn? My highlights include:

  • Anne Choa‘s work on population estimation. The takeaway from this work is that if people want to estimate the number of remaining fault experiences, based on previous experienced faults, then every occurrence (i.e., not just the first) of a fault needs to be counted,
  • Janet Dunham’s top read work on software testing,
  • the variability in the numeric percentage that people assign to probability terms (e.g., almost all, likely, unlikely) is much wider than I would have thought,
  • the impact of the distribution of input values on fault experiences may be detectable,
  • really a lowlight, but there is a lot less publicly available data than I had expected (for the other chapters there was more data than I had expected).

The last decade has seen fuzzing grow to dominate the headlines around software reliability and testing, and provide data for people who write evidence-based books. I don’t have much of a feel for how widely used it is in industry, but it is a very useful tool for reliability researchers.

Readers might have a completely different learning experience from reading the reliability chapter. What useful things did you learn from the reliability chapter?

Weekend Maintenance as Chaos Engineering

Chris Oldwood from The OldWood Thing

I was working on a new system – a grid based calculation engine for an investment bank – and I was beginning to read about some crazy ideas by Netflix around how they would kill off actual production servers to test their resilience to failure. I really liked this idea as it had that “put your money where your mouth is” feel to it and I felt we were designing a system that should cope with this kind of failure, and if it didn’t, then we had learned something and needed to fix it.

Failure is Expected

We had already had a few minor incidents during its early operation caused by dodgy data flowing down from upstream systems and had tackled that by temporarily remediating the data to get the system working but then immediately fixed the code so that the same kind of problem would not cause an issue in future. The project manager, who had also worked on a sister legacy system to one I’d worked on before, had made it clear from the start that he didn’t want another “support nightmare” like we’d both seen before [1] and pushed the “self-healing” angle which was a joy to hear. Consequently reliability was always foremost in our minds.

Once the system went live and the business began to rely on it the idea of randomly killing off services and servers in production was a hard prospect to sell. While the project manager had fought to help us get a UAT environment that almost brought us parity with production and was okay with us using that for testing the system’s reliability he was less happy about going to whole hog and adopting the Netflix approach. (The organisation was already very reserved and despite our impeccable record some other teams had some nasty failures that caused the organisation to become more risk adverse rather than address then root problems.)

Planned Disruption is Good!

Some months after we had gone live I drew the short straw and was involved with a large-scale DR test. We were already running active/active by making use of the DR facilities during the day and rotated the database cluster nodes every weekend [2] to avoid a node getting stale, hence we had a high degree of confidence that we would cope admirably with the test. Unfortunately there was a problem with one of the bank’s main trade systems such that it wouldn’t start after failover to DR that we never really got to do a full test and show that it was a no-brainer for us.

While the day was largely wasted for me as I sat around waiting for our turn it did give me time to think a bit more about how we would show that the system was working correctly and also when the DR test was finished and failed back over again that it had recovered properly. At that point I realised we didn’t need to implement any form of Chaos Engineering ourselves as the Infrastructure team were already providing it, every weekend!

It’s common for large enterprises to only perform emergency maintenance during the week and then make much more disruptive changes at the weekend, e.g. tearing parts of the network up, patching and rebooting servers, etc. At that time it was common for support teams to shut systems down and carefully bring them back up after the maintenance window to ensure they were operating correctly when the eastern markets opened late Sunday evening [3]. This was the perfect opportunity to do the complete opposite – drive the system hard over the weekend and see what state it was after the maintenance had finished – if it wasn’t still operating normally we’d missed some failure modes.

An Aria of Canaries

We were already pushing through a simple canary request every few minutes which allowed us to spot when things had unusually gone south but we wanted something heavier that might drive out subtler problems so we started pushing through heavy loads during the weekend too and then looked at what state they were in at the end of the weekend. These loads always had a lower priority than any real work so we could happily leave them to finish in the background rather than need to kill them off before the working week started. (This is a nice example of using the existing features of the system to avoid it disrupting the normal workload.)

This proved to be a fruitful idea as it unearthed a couple of places where the system wasn’t quite as reliable as we’d thought. For example we were leaking temporary files when the network was glitching and the calculation was restarted. Also the load pushed the app servers over the edge memory-wise and highlighted a bug in the nanny process when the machine was short of memory. There was also a bug in some exponential back-off code that backed off a little too far as it never expected an outage to last most of the weekend :o).

Order From Chaos

When they finally scheduled a repeat DR test some months later after supposedly ironing out the wrinkles with their key trade capture systems our test was a doddle as it just carried on after being brought back to life in the DR environment and similarly after reverting back to PROD it just picked up where it had left off and retried those jobs that had failed when the switchover started. Rather than shying away from the weekend disruption we had used it to our advantage to help improve its reliability.

 

[1] Eventually the team spends so much time fire-fighting there is no time left to actually fix the system and it turns into an endless soul-destroying job.

[2] Rotating the database cluster primary causes the database to work with an empty cache which is a great way to discover how much your common queries rely on heavily cached data. In one instance a 45-second reporting query took over 15 minutes when faced with no cached pages!

[3] See Arbitrary Cache Timeouts for an example where constant rebooting masked a bug.

Validate in Production

Chris Oldwood from The OldWood Thing

The change was reasonably simple: we had to denormalise some postcode data which was currently held in a centralised relational database into some new fields in every client’s database to remove some cross-database joins that would be unsupported on the new SQL platform we were migrating too [1].

As you might imagine the database schema changes were fairly simple – we just needed to add the new columns as nullable strings into every database. The next step was to update the service code to start populating these new fields as addresses were added or edited by using data from the centralised postcode database [2].

At this point any new data or data that changed going forward would have the correctly denormalised state. However we still needed to fix up any existing data and that’s the focus of this post.

Migration Plan

To fix-up all the existing client data we needed to write a tool which would load each client’s address data that was missing its new postcode data, look it up against the centralised list, and then write back any changes. Given we were still using the cross-database joins in live for the time being to satisfy the existing reports we could roll this out in the background and avoiding putting any unnecessary load on the database cluster.

The tool wasn’t throw-away because the postcode dataset gets updated regularly and so the denormalised client data needs to be refreshed whenever the master list is updated. (This would not be that often but enough to make it worth spending a little extra time writing a reusable tool for the job for ops to run.)

Clearly this isn’t rocket science, it just requires loading the centralised data into a map, fetching the client’s addresses, looking them up, and writing back the relevant fields. The tool only took a few hours to write and test and so it was ready to run for the next release during a quiet period.

When that moment arrived the tool was run across the hundreds of client databases and plenty of data was fixed up in the process, so the task appeared completed.

Next Steps

With all the existing postcode data now correctly populated too we should have been in a position to switch the report generation feature toggle on so that it used the new denormalised data instead of doing a cross-database join to the existing centralised store.

While the team were generally confident in the changes to date I suggested we should just do a sanity check and make sure that everything was working as intended as I felt this was a reasonably simple check to run.

An initial SQL query someone knocked up just checked how many of the new fields had been populated and the numbers seemed about right, i.e. very high (we’d expect some addresses to be missing data due to missing postcodes, typos and stale postcode data). However I still felt that we should be able to get a definitive answer with very little effort by leveraging the existing we SQL we were about to discard, i.e. use the cross-database join one last time to verify the data population more precisely.

Close, but No Cigar

I massaged the existing report query to show where data from the dynamic join was different to that in the new columns that had been added (again, not rocket science). To our surprise there were quite a significant number of discrepancies.

Fortunately it didn’t take long to work out that those addresses which were missing postcode data all had postcodes which were at least partially written in lowercase whereas the ones that had worked were entirely written in uppercase.

Hence the bug was fairly simple to track down. The tool loaded the postcode data into a dictionary (map) keyed on the string postcode and did a straight lookup which is case-sensitive by default. A quick change to use a case-insensitive comparison and the tool was fixed. The data was corrected soon after and the migration verified.

Why didn’t this show up in the initial testing? Well, it turned out the tools used to generate the test data sets and also to anonymize real client databases were somewhat simplistic and this helped to provide a false level of confidence in the new tool.

Testing in Production

Whenever we make a change to our system it’s important that we verify we’ve delivered what we intended. Oftentimes the addition of a feature has some impact on the front-end and the customer and therefore it’s fairly easy to see if it’s working or not. (The customer usually has something to say about it.)

However back-end changes can be harder to verify thoroughly, but it’s still important that we do the best we can to ensure they have the expected effect. In this instance we could easily check every migrated address within a reasonable time frame and know for sure, but on large data sets this might unfeasible so you might have to settle for less. Also the use of feature switches and incremental delivery meant that even though there was a bug it did not affect the customers and we were always making forward progress.

Testing does not end with a successful run of the build pipeline or a sign-off from a QA team – it also needs to work in real life too. Ideally the work we put in up-front will make that more likely but for some classes of change, most notably where actual customer data is involved, we need to follow through and ensure that practice and theory tie up.

 

[1] Storage limitations and other factors precluded simply moving the entire postcode database into each customer DB before moving platforms. The cost was worth it to de-risk the overall migration.

[2] There was no problem with the web service having two connections to two different databases, we just needed to stop writing SQL queries that did cross-database joins.

Changes in the shape of code during the twenties?

Derek Jones from The Shape of Code

At the end of 2009 I made two predictions for the next decade; Chinese and Indian developers having a major impact on the shape of code (ok, still waiting for this to happen), and scripting languages playing a significant role (got that one right, but then they were already playing a large role).

Since this blog has just entered its second decade, I will bring the next decade’s predictions forward a year.

I don’t see any new major customer ecosystems appearing. Ecosystems are the drivers of software development, and no new ecosystems has several consequences, including:

  • No major new languages: Creating a language is a vanity endeavor. Vanity project can take off if they are in the right place at the right time. New ecosystems provide opportunities for new languages to become widely used by being in at the start and growing with the ecosystem. There is another opportunity locus; it is fashionable for companies that see themselves as thought-leaders to have their own language, e.g., Google, Apple, and Mozilla. Invent your language at the right time, while working for a thought-leader company and your language could become well-known enough to take-off.

    I don’t see any major new ecosystems appearing and all the likely companies already have their own language.

    Any new language also faces the problem of not having a large collection packages.

  • Software will be more thoroughly tested: When an ecosystem is new, the incentives drive early and frequent releases (to build a customer base); software just has to be good enough. Once a product is established, companies can invest in addressing issues that customers find annoying, like faulty behavior; the incentive change results in more testing.

    There are other forces at work around testing. Companies are experiencing some very expensive faults (testing may be expensive, but not testing may be more expensive) and automatic test generation is becoming commercially usable (i.e., the cost of some kinds of testing is decreasing).

The evolution of widely used languages.

  • I think Fortran and C will have new features added, with relatively little fuss, and will quietly continue to be widely used (to the dismay of the fashionista).
  • There is a strong expectation that C++ and Java should continue to evolve:

    • I expect the ISO C++ work to implode, because there are too many people pulling in too many directions. It makes sense for the gcc and llvm teams to cooperate in taking C++ in a direction that satisfies developers’ needs, rather than the needs of bored consultants. What are Microsoft’s views? They only have their own compiler for strategic reasons (they make little if any profit selling compilers, compilers are an unnecessary drain on management time; who cares what happens to the language).
    • It is going to be interesting watching the impact of Oracle’s move to charging for runtimes. I have no idea what might happen to Java.

In terms of code volume, the future surely has to be scripting languages, and in particular Python, Javascript and PHP. Ten years from now, will there be a widely used, single language? People have been predicting, for many years, that web languages will take over the world; perhaps there will be a sudden switch and I will see that the choice is obvious.

Moore’s law is now dead, which means researchers are going to have to look for completely new techniques for building logic gates. If photonic computers happen, then ternary notation may reappear again (it was used in at least one early Russian computer); I’m not holding my breath for this to occur.

Distorting the input profile, to stress test a program

Derek Jones from The Shape of Code

A fault is experienced in software when there is a mistake in the code, and a program is fed the input values needed for this mistake to generate faulty behavior.

There is suggestive evidence that the distribution of coding mistakes and inputs generating fault experiences both have an influence of fault discovery.

How might these coding mistakes be found?

Testing is one technique, it involves feeding inputs into a program and checking the resulting behavior. What are ‘good’ input values, i.e., values most likely to discover problems? There is no shortage of advice for manually writing tests, suggesting how to select input values, but automatic generation of inputs is often somewhat random (relying on quantity over quality).

Probabilistic grammar driven test generators are trivial to implement. The hard part is tuning the rules and the probability of them being applied.

In most situations an important design aim, when creating a grammar, is to have one rule for each construct, e.g., all arithmetic, logical and boolean expressions are handled by a single expression rule. When generating tests, it does not always make sense to follow this rule; for instance, logical and boolean expressions are much more common in conditional expressions (e.g., controlling an if-statement), than other contexts (e.g., assignment). If the intent is to mimic typical user input values, then the probability of generating a particular kind of binary operator needs to be context dependent; this might be done by having context dependent rules or by switching the selection probabilities by context.

Given a grammar for a program’s input (e.g., the language grammar used by a compiler), decisions have to be made about the probability of each rule triggering. One way of obtaining realistic values is to parse existing input, counting the number of times each rule triggers. Manually instrumenting a grammar to do this is a tedious process, but tool support is now available.

Once a grammar has been instrumented with probabilities, it can be used to generate tests.

Probabilities based on existing input will have the characteristics of that input. A recent paper on this topic (which prompted this post) suggests inverting rule probabilities, so that common becomes rare and vice versa; the idea is that this will maximise the likelihood of a fault being experienced (the assumption is that rarely occurring input will exercise rarely executed code, and such code is more likely to contain mistakes than frequently executed code).

I would go along with the assumption about rarely executed code having a greater probability of containing a mistake, but I don’t think this is the best test generation strategy.

Companies are only interested in fixing the coding mistakes that are likely to result of a fault being experienced by a customer. It is a waste of resources to fix a mistake that will never result in a fault experienced by a customer.

What input is likely to interact with coding mistakes to be the root cause of faults experienced by a customer? I have no good answer to this question. But, given there are customer input contains patterns (at least in the world of source code, and I’m told in other application domains), I would generate test cases that are very similar to existing input, but with one sub-characteristic changed.

In the academic world the incentive is to publish papers reporting loads-of-faults-found, the more the merrier. Papers reporting only a few faults are obviously using inferior techniques. I understand this incentive, but fixing problems costs money and companies want a customer oriented rationale before they will invest in fixing problems before they are reported.

The availability of tools that automate the profiling of a program’s existing input, followed by the generation of input having slightly, or very, different characteristics make it easier to answer some very tough questions about program behavior.

Beware the Easy Fix

Chris Oldwood from The OldWood Thing

Whenever you get a bug report be sure you can reproduce the problem before you start and check you’ve fixed the bug when you make your change.

This advice might seem blindly obvious and you’re probably wondering who on earth would try and fix a bug without reproducing the problem first and then without testing the fix works afterwards [1]. I wondered that too but I was recently involved in a bug report that seemed so cut-and-dried I thought I might have to reconsider my own obsessive desire to stick rigidly to the process. I was of course mistaken…

The Bug

A bug showed up in a new message queue processing service that meant when the message queue broker was down for longer than a minute or so the consumer lost its connection and never reconnected in the background. In turn this meant the queue would slowly back-up – the process was still alive and kicking, it just wasn’t servicing the queue.

This bug report came by way of a production incident and an experienced colleague had triaged the problem so the ticket came into the team with some useful details attached. In the ticket the final log message from the service before it went dark told us that the dispatcher thread had shut down due to the failure to reconnect. The ticket also pointed us to the bit of code where the dispatcher thread was configured.

Looking at the service code along with a quick read of the third party library documentation made it seem pretty obvious that the recovery options configured for the dispatcher were insufficient. It was set-up with only 3 short retries and a circuit breaker for good measure. As a result of the incident some monitoring had been added to the queue so there was no reason why we couldn’t just enable infinite connection retries [2] and effectively disable the circuit breaker. Fixing the dispatcher code was a doddle as the message consumer library is well designed and has good documentation.

It almost seemed too easy…

The Shortest Path

The problem with bugs in infrastructure code like this is that they almost certainly don’t have any automated test coverage because writing them is really hard [3]. In fact testing this kind of issue can be arduous even when done manually as you need to control the middleware which might be outside your control or just something which sits in the background ticking away and therefore is almost invisible unless you wrote the original code or have had to fix it before. Throw in the fact that the bug wasn’t a showstopper and it’s easy to see how you could apply Sir Tony Hoare’s principle about code “being so simple there are no obvious bugs” and just push the change out based on the ability to compile the code and the fact that it doesn’t make matters any worse (you can show you’ve not broken the ability to connect to the queue).

Of course when the problem shows up in production again you’ll know that you never really fixed the problem and you’ll have to go around the loop once more, but do what you should have done first time around as the second outage will no doubt have made a few more people annoyed.

Another Bug

Unsurprisingly the simple code change suggested by the ticket actually had no effect at all when we came to test it, and this sudden realisation that we didn’t really understand what was going on was the impetus needed to take a step back and start again from the beginning.

Whilst performing a quick disconnection test (by bouncing the middleware) we noticed that the queue was behaving weirdly and not backing up like it said in the bug report. Another rabbit hole later [4] and we discover that the queue was not set-up to be durable, which in itself turned out to be another bug.

Eventually we find a way to reproduce the problem and in the process we learn a bit more about how the middleware and message consumer library both work. However we still don’t understand why the new dispatcher configuration does not appear to be working. Luckily the library is open source and so we can debug the issue ourselves and see what is going on under the hood.

The Real Fix

Who would have guessed that internally the message consumer library had another retry and circuit breaker policy that was used to control the (re)connection attempts to the message queue broker. Unlike the dispatcher thread error recovery policy, which was configured explicitly, the message queue connection policy was controlled by a couple of defaulted arguments on the connection configuration object constructor [5].

Sadly we couldn’t be explicit and use the “wait and retry forever” policy that was available on the dispatcher so instead we had to settle for configurating the number of connection attempts to int.MaxValue.

Problem Solved

Naturally it was far simpler to test the fix because we eventually put the effort into working out how to reproduce the problem in the first place. This can be quite significant from a status reporting perspective because it means you are less likely to be over optimistic about your progress. If you’re struggling to reproduce the problem then you’re going to struggle to prove that you’ve fixed it. If you mistakenly believe that the fix is simple and you then feel under pressure to get the testing done at the end it’s harder to convince yourself to do what needs to be done rather than settle for only potentially being right.

 

[1] This is somewhat disingenuous as there are times where this is not possible, but that’s unusual in the world of mainstream software development.

[2] Without the alert on the queue size we would need to find another way to signal when processing has dropped off. For example the circuit breaker should have triggered some other alert as connection failures are to be expected, but only for a limited time before escalation needs to occur.

[3] See “Automated Integration Testing with TIBCO” for an example of how I’ve done this in the past with a TIBCO message queue.

[4] Yes, the middleware was RabbitMQ but no pun was intended, for once.

[5] I’m not suggesting the library, which is provided for free out of kindness, is at fault. On the contrary the documentation was excellent, as was the support we received on Gitter. I need to help fix this, somehow.

Test Language: Behaviours, Not Examples

Chris Oldwood from The OldWood Thing

Naming is hard, as we know from the old adage about the two hardest problems in Computer Science, and naming in tests is no different. I’ve documented my own journey around how I structure tests in two previous posts: “Unit Testing Evolution Part II – Naming Conventions” and “Other Test Naming Conventions”. I’ve also covered some similar ground before quite recently in “Overly Prescriptive Tests” but that was more about the content of the tests themselves, whereas here I’m trying to focus more on the language aspects.

Describing the Example

Something which I’ve observed, both from reviewing Fizz Buzz submissions with tests [1] and from real tests, is that there is often that missing leap from writing a test which describes a single example to generalising the language to describe the effective behaviour [2]. For example, imagine you’re writing a unit test for a calculator, if you literally encode your example as your test name you might write:

[Test]
public void two_plus_two_is_equal_to_four()

Given that you could accidentally implement it with multiplication and still make the test pass you might add another scenario to be sure you don’t fall into that trap:

[Test]
public void three_plus_seven_is_equal_to_ten()

The problem with these test names is that they only tell you about the specific scenario covered by the test, not about the bigger picture. One potential refactoring might be to parameterise the test thereby forcing you to generalise the name:

[TestCase(2, 2, 4)]
[TestCase(3, 7, 10)]
public void adding_two_numbers_together_returns_their_sum(. . .)

One way this often shows up in FizzBuzz tests is with examples for the various rules, e.g.

[Test]
public void three_returns_the_word_fizz()

[Test]
public void five_returns_the_word_buzz()

The rules of a basic calculator are already known to pretty much everyone but here, unless you know the rules of the game Fizz Buzz, you would not be able to derive them from these examples alone and one very important role of tests are to document, nay specify, the behaviour of our code.

Describing the Behaviour

Hence to encode the rules you need to think more generally:

a_number_divisible_by_three_returns_the_word_fizz

There are a couple of issues here, namely that technically any number is divisible by three (just not wholly), and also that it won’t be true once we start bringing in the more advanced rules. It’s not easy trying to be precise and yet also somewhat vague at the same time, but we can try:

a_number_wholly_divisible_by_three_generally_returns_the_word_fizz

Once we bring in the “divisible by three and divisible by five” rule it becomes much harder to be precise in our test names as we’d have to include the overriding rules too which likely makes them harder to read and comprehend:

a_number_wholly_divisible_by_three_but_not_also_wholly_divisible_by_five_returns_the_word_fizz

You might just get away with it this time but its not really scalable and test names, much like code comments, often have a habit of getting out of sync with reality. Even when they break due to new functionality it’s easy to end up fixing the test and forgetting to check whether the “documentation” aspect still reflects the new behaviour.

Hence I personally prefer to use words in test names that suggest “broad strokes” when necessary and guide the reader (top to bottom) from the more general scenarios to the more specific. This, in my mind, is similar to putting the happy path cases before the various error handling ones.

Validating Collections

These examples might be a little too trivial but the impetus for this post came from similar scenarios where the test language talked about the outcome of the example itself rather than the behaviour of the logic in general. The knock-on effect of doing this, apart from making the intent of the example harder to comprehend in the future, was that it also became brittle as the specific scenario outcome was encoded in the test and any change in logic that might be orthogonal to it could break it unnecessarily. (As mentioned earlier, “Overly Prescriptive Tests” looks at brittle tests from a different angle.)

A common place where this shows up is when asserting behaviours around collections. For example imagine writing tests for querying the seats available in a cinema where there are seats in different price bands. When testing the “seat query” method for an exhausted price band you might be inclined to write:

[TestFixture]
public class when_querying_for_seats_and_none_left_in_band
{
  [Test]
  public void then_the_result_is_empty()
  {
    auditorium.Add(“Posh Seats”, new Seats[0]);

    var seats = auditorium.FindAvailableSeats();

    Assert.That(seats, Is.Empty);
  }
}

The example, being minimal in nature, means that technically in this scenario the result will be empty. However that is an artefact of the way the example is expressed and the test has been written. If I were to change the test set-up and add the following line, the test would break:

auditorium.Add(“Cheap Seats”, new Seats[100]);

While the outcome of the example above might be “empty”, that is not the general behaviour of the logic under test and our test language should be changed to describe that:

[Test]
public void then_no_seats_in_that_band_are_returned()

Now we’re not making a statement about what else might or might not be in that result, only what our expectations are for seats in the band in question. Once we have fixed the test language we can address how we validate that in the example. Instead of looking at what is in the collection we should be looking at what isn’t there as the test name tells us to expect that something should be absent, and the assert should reflect that language:

Assert.That(seats.Where(s => s.Band == “Posh Seats”), Is.Empty);

Now I should only be able to break this test by changing the data or logic specific to the example, orthogonal behaviours should not break it by accident. (See “Manual Mutation Testing” for more on how you can test the quality of your tests.)

Invest in Tests

If you’ve ever worked on a codebase with brittle tests you’ll know how frustrating it can be when your feature mushrooms because you broke a bunch of badly written tests. If we’re lucky we see the failed assertion and if it’s not obvious then we can look back at the test name to see if the scenario rings any bells. If we’re unlucky we have to immediately reach for the debugger and likely add “refactor tests” to the yak stack.

If you “pay it forward” by taking the time to write good tests up front you’ll find it easier to sustain delivery in the future.

 

[1] A company I once worked for used Fizz Buzz in their candidate early screening process. Despite being overkill in practice (as was pointed out to candidates) a suite of tests was requested as part of the submission to help get a feel for what style they used. IMHO the tests said much more about candidates than the production code.

[2] Yes, “property based testing” takes this entire concept a step further so that it exercises the behaviour with multiple examples generated differently each time. That’s the destination, this post is about one possible journey.

Test the Code, Not the Mock

Chris Oldwood from The OldWood Thing

About 18 months or so ago I wrote a post about how I’d seen tests written that were self-reinforcing (“Tautologies in Tests”). The premise was about the use of the same production code to verify the test outcome as that which was supposedly under test. As such any break in the production code would likely not get picked up because the test behaviour would naturally change too.

It’s also possible to see the opposite kind of effect where the test code really becomes the behaviour under test rather than the production code. The use of mocking within tests is a magnet for this kind of situation as a developer mistakenly believes they can save time [1] by writing a more fully featured mock [2] that can be reused across tests. This is a false economy.

Example - Database Querying

I recently saw an example of this in some database access code. The client code (under test) first configured a filter where it calculated an upper and lower bound based on timestamps, e.g.

// non-trivial time based calculations
var minTime = ...
var maxTime = ...

query.Filter[“MinTime”] = minTime;  
query.Filter[“MaxTime”] = maxTime;

The client code then executed the query and performed some additional processing on the results which were finally returned.

The test fixture created some test data in the form of a simple list with a couple of items, presumably with one that lies inside the filter and another that lies outside, e.g.

var orders = new[]
{
  new Order { ..., Timestamp = “2016-05-12 18:00:00” },
  new Order { ..., Timestamp = “2018-05-17 02:15:00” },
};

The mocked out database read method then implemented a proper filter to apply the various criteria to the list of test data, e.g.

{
  var result = orders;

  if (filter[“MinTime”])
    ...
  if (filter[“MaxTime”])
    ...
  if (filter[...])
    ...

  return result;
}

As you can imagine this starts out quite simple for the first test case but as the production code behaviour gets more complex, so does the mock and the test data. Adding new test data to cater for the new scenarios will likely break the existing tests as they all share a single set and therefore you will need to go back and understand them to ensure the test still exercises the behaviour it used to. Ultimately you’re starting to test whether can actually implement a mock that satisfies all the tests rather than write individual tests which independently validate the expected behaviours.

Shared test data (not just placeholder constants like AnyCustomerId) is rarely a good idea as it’s often not obvious which piece of data is relevant to which test. The moment you start adding comments to annotate the test data you have truly lost sight of the goal. Tests are not just about verifying behaviour either they are a form of documentation too.

Roll Back

If we reconsider the feature under test we can see that there are a few different behaviours that we want to explore:

  • Is the filter correctly formed?
  • Are the query results correctly post-processed?

Luckily the external dependency (i.e. the mock) provides us with a seam which allows us to directly verify the filter configuration and also to control the results which are returned for post-processing. Consequently rather than having one test that tries to do everything, or a few tests that try and cover both aspect together we can separate them out, perhaps even into separate test fixtures based around the different themes, e.g.

public static class reading_orders 
{
  [TestFixture]
  public class filter_configuration    
  ...    
  [TestFixture]
  public class post_processing    
  ...
}

The first test fixture now focuses on the logic used to build the underlying query filter by asserting the filter state when presented to the database. It then returns, say, an empty result set as we wish to ignore what happens later (by invoking as little code as possible to avoid false positives).

The following example attempts to define what “yesterday” means in terms of filtering:

[Test]
public void filter_for_yesterday_is_midnight_to_midnight()
{
  DateTime? minTime = null;
  DateTime? maxTime = null;

  var mockDatabase = CreateMockDatabase((filter) =>
  {
    minTime = filter[“MinTime”];
    maxTime = filter[“MaxTime”];
  });
  var reader = new OrderReader(mockDatabase);
  var now = new DateTime(2001, 2, 3, 9, 32, 47);

  reader.FindYesterdaysOrders(now);

  Assert.That(minTime, Is.EqualTo(
                new DateTime(2001, 2, 2, 0, 0, 0)));
  Assert.That(maxTime, Is.EqualTo(
                new DateTime(2001, 2, 3, 0, 0, 0)));
}

As you can hopefully see the mock in this test is only configured to extract the filter state which we then verify later. The mock configuration is done inside the test to make it clear that the only point of interest is the the filter’s eventual state. We don’t even bother capturing the final output as it’s superfluous to this test.

If we had a number of tests to write which all did the same mock configuration we could extract it into a common [SetUp] method, but only if we’ve already grouped the tests into separate fixtures which all focus on exactly the same underlying behaviour. The Single Responsibility Principle applies to the design of tests as much as it does the production code.

One different approach here might be to use the filter object itself as a seam and sense the calls into that instead. Personally I’m very wary of getting too specific about how an outcome is achieved. Way back in 2011 I wrote “Mock To Test the Outcome, Not the Implementation” which showed where this rabbit hole can lead, i.e. to brittle tests that focus too much on the “how” and not enough on the “what”.

Mock Results

With the filtering side taken care of we’re now in a position to look at the post-processing of the results. Once again we only want code and data that is salient to our test and as long as the post-processing is largely independent of the filtering logic we can pass in any inputs we like and focus on the final output instead:

[Test]
public void upgrade_objects_to_latest_schema_version()
{
  var anyTime = DateTime.Now;
  var mockDatabase = CreateMockDatabase(() =>
  {
    return new[]
    {
      new Order { ..., Version = 1, ... },
      new Order { ..., Version = 2, ... },
    }
  });
  var reader = new OrderReader(mockDatabase);

  var orders = reader.FindYesterdaysOrders(anyTime);

  Assert.That(orders.Count, Is.EqualTo(2));
  Assert.That(orders.Count(o => o.Version == 3),
              Is.EqualTo(2));
}

Our (simplistic) post-processing example here ensures that all re-hydrated objects have been upgraded to the latest schema version. Our test data is specific to verifying that one outcome. If we expect other processing to occur we use different data more suitable to that scenario and only use it in that test. Of course in reality we’ll probably have a set of “builders” that we’ll use across tests to reduce the burden of creating and maintaining test data objects as the data models grow over time.

Refactoring

While reading this post you may have noticed that certain things have been suggested, such as splitting out the tests into separate fixtures. You may have also noticed that I discovered “independence” between the pre and post phases of the method around the dependency being mocked which allows us to simplify our test setup in some cases.

Your reaction to all this may well be to suggest refactoring the method by splitting it into two separate pieces which can then be tested independently. The current method then just becomes a simple composition of the two new pieces. Additionally you might have realised that the simplified test setup probably implies unnecessary coupling between the two pieces of code.

For me those kind of thoughts are the reason why I spend so much effort on trying to write good tests; it’s the essence of Test Driven Design.

 

[1] My ACCU 2017 talk “A Test of Strength” (shorter version) shows my own misguided attempts to optimise the writing of tests.

[2] There is a place for “heavier” mocks (which I still need to write up) but it’s not in unit tests.

Top, must-read paper on software fault analysis

Derek Jones from The Shape of Code

What is the top, must read, paper on software fault analysis?

Software Reliability: Repetitive Run Experimentation and Modeling by Phyllis Nagel and James Skrivan is my choice (it’s actually a report, rather than a paper). Not only is this report full of interesting ideas and data, but it has multiple replications. Replication of experiments in software engineering is very rare; this work was replicated by the original authors, plus Scholz, and then replicated by Janet Dunham and John Pierce, and then again by Dunham and Lauterbach!

I suspect that most readers have never heard of this work, or of Phyllis Nagel or James Skrivan (I hadn’t until I read the report). Being published is rarely enough for work to become well-known, the authors need to proactively advertise the work. Nagel, Dunham & co worked in industry and so did not have any students to promote their work and did not spend time on the academic seminar circuit. Given enough effort it’s possible for even minor work to become widely known.

The study run by Nagel and Skrivan first had three experienced developers independently implement the same specification. Each of these three implementations was then tested, multiple times. The iteration sequence was: 1) run program until fault experienced, 2) fix fault, 3) if less than five faults experienced, goto step (1). The measurements recorded were fault identity and the number of inputs processed before the fault was experienced.

This process was repeated 50 times, always starting with the original (uncorrected) implementation; the replications varied this, along with the number of inputs used.

For a fault to be experienced, there has to be a mistake in the code and the ‘right’ input values have to be processed.

How many input values need to be processed, on average, before a particular fault is experienced? Does the average number of inputs values needed for a fault experience vary between faults, and if so by how much?

The plot below (code+data) shows the numbers of inputs processed, by one of the implementations, before individual faults were experienced, over 50 runs (sorted by number of inputs):

Number of inputs processed before particular fault experienced

Different faults have different probabilities of being experienced, with fault a being experienced on almost any input and fault e occurring much less frequently (a pattern seen in the replications). There is an order of magnitude variation in the number of inputs processed before particular faults are experienced (this pattern is seen in the replications).

Faults were fixed as soon as they were experienced, so the technique for estimating the total number of distinct faults, discussed in a previous post, cannot be used.

A plot of number of faults found against number of inputs processed is another possibility. More on that another time.

Suggestions for top, must read, paper on software faults, welcome (be warned, I think that most published fault research is a waste of time).