Top, must-read paper on software fault analysis

Derek Jones from The Shape of Code

What is the top, must read, paper on software fault analysis?

Software Reliability: Repetitive Run Experimentation and Modeling by Phyllis Nagel and James Skrivan is my choice (it’s actually a report, rather than a paper). Not only is this report full of interesting ideas and data, but it has multiple replications. Replication of experiments in software engineering is very rare; this work was replicated by the original authors, plus Scholz, and then replicated by Janet Dunham and John Pierce, and then again by Dunham and Lauterbach!

I suspect that most readers have never heard of this work, or of Phyllis Nagel or James Skrivan (I hadn’t until I read the report). Being published is rarely enough for work to become well-known; the authors need to proactively advertise it. Nagel, Dunham & co worked in industry, and so had no students to promote their work and did not spend time on the academic seminar circuit. Given enough effort it’s possible for even minor work to become widely known.

The study run by Nagel and Skrivan first had three experienced developers independently implement the same specification. Each of these three implementations was then tested, multiple times. The iteration sequence was: 1) run program until fault experienced, 2) fix fault, 3) if less than five faults experienced, goto step (1). The measurements recorded were fault identity and the number of inputs processed before the fault was experienced.

This process was repeated 50 times, always starting with the original (uncorrected) implementation; the replications varied this, along with the number of inputs used.

For a fault to be experienced, there has to be a mistake in the code and the ‘right’ input values have to be processed.

How many input values need to be processed, on average, before a particular fault is experienced? Does the average number of input values needed for a fault experience vary between faults, and if so by how much?

The plot below (code+data) shows the numbers of inputs processed, by one of the implementations, before individual faults were experienced, over 50 runs (sorted by number of inputs):

Number of inputs processed before particular fault experienced

Different faults have different probabilities of being experienced: fault a is experienced on almost any input, while fault e occurs much less frequently. There is an order of magnitude variation in the number of inputs processed before particular faults are experienced. Both patterns are also seen in the replications.

Faults were fixed as soon as they were experienced, so the technique for estimating the total number of distinct faults, discussed in a previous post, cannot be used.

A plot of number of faults found against number of inputs processed is another possibility. More on that another time.

Suggestions for top, must read, paper on software faults, welcome (be warned, I think that most published fault research is a waste of time).

Statement sequence length for error/non-error paths

Derek Jones from The Shape of Code

One of the folk truisms of the compiler/source code analysis business is that error paths are short, i.e., when an error situation is detected (such as failing to open a file), few statements are executed before the function returns.

Having repeated this truism for many decades, figure 2 from the paper APEx: Automated Inference of Error Specifications for C APIs jumped off the page at me; thanks to Yuan Kang, I now have a copy of the data.

The plots below (code+data) show two representations of the non-error/error path lengths (measured in statements within individual functions of libc; counting starts at a library call that could return an error value). The upper plot shows statement sequence lengths for error/non-error paths, and the lower is a kernel density plot of the error/non-error sequence lengths.

Statements contained in error and non-error paths

Another truism is that people tend to write positive tests, i.e., tests that do not involve error handling (some evidence).

Code coverage measurements (e.g., number of statements or branches that are executed by a test suite) often show the pattern seen in the plot below (code+data; thanks to the authors of the paper Code Coverage for Suite Evaluation by Developers for making the data available). The data was obtained by measuring the coverage of 1,043 Java programs executing their associated test suite (circles denote program size). Lines are fitted regression models for different sized programs.

Statement coverage against decision coverage

If people are preferentially writing positive tests, test suites with low coverage would be expected to execute a greater percentage of statements than branches (an if-statement has two branches, taken/not-taken), i.e., the behavior seen in the plot above (grey line shows equal statement/branch coverage). Once the low hanging fruit is tested (i.e., the longer, non-error, cases), tests have to be written for the shorter, more likely to be error handling, cases.
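To make the arithmetic concrete, here is a small invented C# sketch (not taken from the study’s Java programs): a single positive test executes most of the statements, but takes only one of the two branches at each if-statement, so statement coverage runs ahead of branch coverage.

public static decimal AverageOrderValue(decimal[] orders)
{
  if (orders == null)        // only the false side is ever taken by the happy-path test
    throw new ArgumentNullException(nameof(orders));

  if (orders.Length == 0)    // likewise, only the false side is taken
    throw new ArgumentException("no orders", nameof(orders));

  var total = 0m;
  foreach (var order in orders)
    total += order;

  return total / orders.Length;
}

[Test]
public void average_is_total_divided_by_count()
{
  // Positive test: covers every statement except the two throws,
  // but only half of the if-statement branches.
  Assert.That(AverageOrderValue(new[] { 10m, 20m }), Is.EqualTo(15m));
}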

The plot would also be explained by typical execution paths favoring longer basic blocks, but I don’t have any data that could show this one way or another.

Manual Mutation Testing

Chris Oldwood from The OldWood Thing

One of the problems when making code changes is knowing whether there is good test coverage around the area you’re going to touch. In theory, if a rigorous test-first approach is taken, no production code should be written without first being backed by a failing test. Of course we all know the old adage about how theory turns out in practice [1]. Even so, just because a test has been written, you don’t know how good it, or any related tests, actually are.

Mutation Testing

The practice of mutation testing is one way to answer the perennial question: how do you test the tests? How do you know if the tests which have been written adequately cover the behaviours the code should exhibit? Alternatively, as I described recently in “Overly Prescriptive Tests”, are the tests too brittle because they demand too exacting a behaviour?

There are tools out there which will perform mutation testing automatically that you can include as part of your build pipeline. However I tend to use them in a more manual way to help me verify the tests around the small area of functionality I’m currently concerned with [2].

The principle is actually very simple: you tweak the production code in a small way that mimics a plausible real change, and then see which tests fail. If no tests fail at all then you probably have a gap in your spec that needs filling.

Naturally the changes you make to the production code should be sensible and functional in behaviour; there’s no point in randomly setting a reference to null if that scenario is impossible to achieve through the normal course of events. What we’re aiming for here is the simulation of an accidental breaking change by a developer. By tweaking the boundaries of any logic we can also check that our edge cases have adequate coverage.
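For example, a boundary tweak on some invented order-validation code might look like the following; apply one mutation at a time, run the tests, and see what goes red:

// Original (hypothetical) production code: reject orders over the credit limit.
if (order.Total > customer.CreditLimit)
  return OrderResult.Rejected;

// Hand-applied mutations, tried one at a time:
//   if (order.Total >= customer.CreditLimit)   // does any test pin down the boundary itself?
//   if (order.Total < customer.CreditLimit)    // does any test exercise this check at all?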

There is of course the possibility that this will unearth some dead code paths too, or at least lead you to simplify the production code further while still achieving the same expected behaviour.

Example

Imagine you’re working on a service and you spy some code that appears to format a DateTime value using the default formatter. You have a hunch this might be wrong but there is no obvious unit test for the formatting behaviour. It’s possible the value is observed and checked in an integration or acceptance test elsewhere but you can’t obviously [3] find one.

Naturally if you break the production code a corresponding test should break. But how badly do you break it? If you go too far all your tests might fail because you broke something fundamental, so you need to do it in varying degrees and observe what happens at each step.

If you tweak the date format, say, from the US to the UK format nothing may happen. That might be because the tests use a value like 1st January which is 01/01 in both schemes. Changing from a local time format to an ISO format may provoke something new to fail. If the test date is particularly well chosen and loosely verified this could well still be inside whatever specification was chosen.

Moving away from a purely numeric form to a more natural, wordy one should change the length and value composition even further. If we reach this point and no tests have failed it’s a good chance nothing will. We can then try an empty string, nonsense strings and even a null string reference to see if someone only cares that some arbitrary value is provided.
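In code the progression might look something like this (the message property and the format strings are purely illustrative):

// Original code under suspicion: default, culture-sensitive formatting.
message.RequestTime = requestTime.ToString();

// Increasingly severe hand-applied mutations, tried one at a time:
// message.RequestTime = requestTime.ToString("dd/MM/yyyy HH:mm");  // US to UK ordering
// message.RequestTime = requestTime.ToString("o");                 // ISO 8601
// message.RequestTime = requestTime.ToString("D");                 // wordy, natural form
// message.RequestTime = string.Empty;                              // empty string
// message.RequestTime = null;                                      // null reference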

But what if after all that effort still no lights start flashing and the klaxon continues to remain silent?

What Does a Test Pass or Fail Really Mean?

In the ideal scenario, as you slowly make more and more severe changes, you would eventually hope for one or maybe a couple of tests to start failing. When you inspect them it should be obvious from their name and structure what was being expected, and why. If the test name and assertion clearly specify that some arbitrary value is required then it’s probably intentional. Of course it may still be undesirable for other reasons [4] but the test might express its intent well (to document and to verify).

If we only make a very small change and a lot of tests go red we’ve probably got some brittle tests that are highly dependent on some unrelated behaviour, or are duplicating behaviours already expressed (probably better) elsewhere.

If the tests stay green this does not necessarily mean we’re still operating within the expected behaviour. It’s entirely possible that the behaviour has been left completely unspecified because it was overlooked or forgotten about. It might be that not enough was known at the time and the author expected someone else to “fill in the blanks” at a later date. Or maybe the author just didn’t think a test was needed because the intent was so obvious.

Plugging the Gaps

Depending on the testing culture in the team and your own appetite for well defined executable specifications you may find mutation testing leaves you with more questions than you are willing to take on. You only have so much time and so need to find a way to plug any holes in the most effective way you can. Ideally you’ll follow the Boy Scout Rule and at least leave the codebase in a better state than you found it, even if that isn’t entirely to your own satisfaction.

The main thing I get out of using mutation testing is a better understanding of what it means to write good tests. Seeing how breaks are detected and reasoned about from the resulting evidence gives you a different perspective on how to express your intent. My tests definitely aren’t perfect but by purposefully breaking code up front you get a better feel for how to write less brittle tests than you might by using TDD alone.

With TDD you are the author of both the tests and production code and so are highly familiar with both from the start. Making a change to existing code by starting with mutation testing gives you a better orientation of where the existing tests are and how they perform before you write your own first new failing test.

Refactoring Tests

Refactoring is about changing the code without changing the behaviour. This applies to tests too, in which case mutation testing provides the technique: you start by creating failing production code, change the test, and then “fix” the production code so that the bar goes green again. You can then commit the refactored tests before starting on the change you originally intended to make.

 

[1] “In theory there is no difference between theory and practice; in practice there is.” - Jan L. A. van de Snepscheut.

[2] Just as with techniques like static code analysis you really need to adopt this from the beginning if you want to keep the noise level down and avoid exploring too large a rabbit hole.

[3] How you organise your tests is a subject in its own right but suffice to say that it’s usually easier to find a unit test than an acceptance test that depends on any given behaviour.

[4] The author may have misunderstood the requirement, or the requirement was never clear originally, and so the behaviour was left loosely specified in the short term.

Overly Prescriptive Tests

Chris Oldwood from The OldWood Thing

In my recent post “Tautologies in Tests” I adapted one of Einstein’s apocryphal sayings and suggested that tests should be “as precise as possible, but not too precise”. But what did I mean by that? How can you be too precise? Isn’t precision the point?

Mocking

One way is to be overly specific when tracking the interactions with mocks. It’s very easy when using a mocking framework to go overboard with your expectations, just because you can. My personal preference (detailed before in “Mock To Test the Outcome, Not the Implementation”) is to keep the details of any interactions loose, but be specific about the outcomes. In other words what matters most is (usually) the observable behaviour, not necessarily how it’s achieved.

For example, rather than set up detailed instructions on a mock that cover all the expected parameters and call counts, I’ll mostly use simple hand-crafted mocks [1] where the method maps to a delegate and I capture only the salient details. Then in the assertions at the end I verify whatever I need to in the same style as the rest of the test. Usually the canned response is test case specific and so rarely needs any actual logic.
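A hand-crafted mock in that style might look something like the sketch below (the IOrderStore interface and the Order types are invented for illustration); the delegate supplies the canned response and the fake captures only the salient detail the test wants to assert on later:

public class FakeOrderStore : IOrderStore
{
  // The test plugs in whatever canned response it needs.
  public Func<Order, OrderId> OnSave = order => new OrderId(1);

  // Captured so the assertions can verify the outcome at the end.
  public Order LastSaved;

  public OrderId Save(Order order)
  {
    LastSaved = order;
    return OnSave(order);
  }
}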

In essence what I’m creating is what some people prefer to call stubs, as they reserve the term “mocks” for meatier test fakes that record interactions for you. I’d argue that using the more complex form of mock is largely unnecessary and will hurt in the long run. To date (anecdotally speaking) I’ve wasted too much time “fixing” broken tests that overused mocks by specifying every little detail and were never written to give the implementation room to manoeuvre, e.g. during refactoring. In fact an automated refactoring tool is mandatory on code like this because the methods are referenced in so many tests it would take forever to fix them up manually.

I often feel that some of the interactions with dependencies I’ve seen in the past have felt analogous to testing private methods. Another of my previous posts that was inspired by mocking hell is “Don’t Pass Factories, Pass Workers”. Naturally there is a fine line here and maybe I’ve just not seen enough of it done well to appreciate how this particular tool can be used effectively.

White-Box Testing 

The other form of overly specific test I’ve seen comes from what I believe is relying too much on a white-box testing approach so that the tests express the output exactly.

The problem with example based tests is that they are often taken literally, which I guess is kind of the point, but as software engineers we should try and see past the rigid examples and verify the underlying behaviour instead, which is what we’re really after.

For example, consider a pool of numbers [2] up to some predefined limit, say, 10. A naïve approach to the problem might test the pool by asserting a very specific sequence, i.e. the starting one:

[Test]
public void returns_sequence_up_to_limit()
{
  var pool = new NumberPool(10);
  var expected = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

  foreach (var number in expected)
    Assert.That(pool.Acquire(), Is.EqualTo(number));
}

From a white-box testing approach we can look inside the NumberPool and probably see that it’s initially generating numbers using the ++ operator. The implementation might eagerly generate that sequence in the constructor, add them to the end of a queue, and then divvy out the front of the queue.

From a “programmer’s test” point of view (aka unit test) it does indeed verify that, if my expectation is that the implementation should return the exact sequence 1..10, then it will. But how useful is that for the maintainer of this code? I’d argue that we’ve over-specified the way this unit should be allowed to behave.

Verify Behaviours

And that, I think, lies at the heart of the problem. For tests to be truly effective they should not describe exactly what the code does, but how it needs to behave. Going back to our example above, the NumberPool class does not need to return the exact sequence 1..10; it needs to satisfy some looser constraints, such as not returning a duplicate value (until re-acquired), and limiting the range of numbers to between 1 and 10.

[Test]
public void sequence_will_be_unique()
{
  var pool = new NumberPool(10);
  var sequence = new List<int>();

  foreach (var i in Enumerable.Range(1, 10))
    sequence.Add(pool.Acquire());

  Assert.That(sequence.Distinct().Count(),
              Is.EqualTo(10)); 
}

[Test]
public void sequence_only_contains_one_to_limit()
{
  var pool = new NumberPool(10);
  var sequence = new List<int>();

  foreach (var i in Enumerable.Range(1, 10))
    sequence.Add(pool.Acquire());

  Assert.That(sequence.Where(n => (n < 1) || (n > 10)),
              Is.Empty);
}

With these two tests we are free to change the implementation to generate a random sequence in the constructor instead if we wanted, and they would still pass, because it conforms to the looser, albeit still well defined, behaviour. (It may have unpredictable performance characteristics but that is a different matter.)
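For example, one hypothetical implementation that the looser tests would happily accept is to shuffle the numbers up front rather than hand them out in order (a sketch of my own, not the original code):

public class NumberPool
{
  private readonly Queue<int> _available;

  public NumberPool(int limit)
  {
    // Same contract, different order: shuffle 1..limit before queuing.
    var random = new Random();
    var numbers = Enumerable.Range(1, limit)
                            .OrderBy(n => random.Next())
                            .ToList();
    _available = new Queue<int>(numbers);
  }

  public int Acquire()
  {
    return _available.Dequeue();
  }
}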

Once again we are beginning to enter the realm of property based testing which forces us to think harder about what behaviours our code exhibits rather than what it should do in one single scenario.

This does not mean there is no place for tests that take a specific set of inputs and validate the result against a known set of outputs. On the contrary they are an excellent starting point for thinking about what the real test should do. They are also important in scenarios where you need some smoke tests that “kick the tyres” or you are naturally handling a very specific scenario.

Indicative Inputs

Sometimes we don’t intend to make our test look specific but it just turns out that way to the future reader. For example in our NumberPool tests above what is the significance of the number “10”? Hopefully in this example it is fairly obvious that it is an arbitrary value as the test names only talk about “a limit”. But what about a test for code that handles, say, an HTTP error?

[Test]
public void client_throws_when_service_unavailable()
{
  using (FakeServer.Returns(InternalServerError))
  {
    var client = new RestClient(. . .);

    Assert.That(client.SendRequest(. . .),
                Throws.InstanceOf<RequestException>());
  }
}

In this test we have a mock (nay stub) HTTP server that will return a non-2XX style result code. Now, what is the significance of the InternalServerError result code returned by the stub? Is it a specific result code we’re handling here, or an indicative one in the 5XX range? The test name uses the term “service unavailable” which maps to the more specific HTTP code 503, so is this in fact a bug in the code or test?

Unless the original author is around to ask (and even remembers) we don’t know. We can surmise what they probably meant by inspecting the production code and seeing how it processes the result code (e.g. a direct comparison or a range based one). From there we might choose to see how we can avoid the ambiguity by refactoring the test. In the case where InternalServerError is merely indicative we can use a suitably named constant instead, e.g.

[Test]
public void throws_when_service_returns_5xx_code()
{
  const int CodeIn5xxRange = InternalServerError;

  using (FakeServer.Returns(CodeIn5xxRange))
  {
    var client = new RestClient(. . .);

    Assert.That(client.SendRequest(. . .),
                Throws.InstanceOf<RequestException>());
  }
}

A clue that there is a disconnect is when the language used in the test name isn’t correctly reflected in the test body itself. So if the name isn’t specific then the test shouldn’t be either, and vice-versa: if the name is specific then expect the test to be too. A corollary to this is that if your test name is vague, don’t be surprised when the test itself turns out to be equally vague.

Effective Tests

For a suite of tests to be truly effective you need them to remain quietly in the background until you change the code in a way that raises your awareness around some behaviour you didn’t anticipate. The fact that you didn’t anticipate it means that you’ll be relying heavily on the test rather than the code you just changed to make sense of the original intended behaviour.

When it comes under the spotlight (fails) a test needs to convince you that it was well thought out and worthy of your consideration. To be effective a guard dog has to learn the difference between friend and foe and when we write tests we need to learn how to leave enough room for safe manoeuvring without forgetting to bark loudly when we exceed our remit.

 

[1] When you keep your interfaces simple and focused this is pretty easy given how much a modern IDE can generate for you when using a statically typed language.

[2] This example comes from a real one where the numbers were identifiers used to distinguish compute engines in a grid.

Automated Integration Testing with TIBCO

Chris Oldwood from The OldWood Thing

In the past few years I’ve worked on a few projects where TIBCO has been the message queuing product of choice within the company. Naturally being a test-oriented kind of guy I’ve used unit and component tests for much of the donkey work, but initially had to shy away from writing any automated integration tests due to the inherent difficulties of getting the system into a known state in isolation.

Organisational Barriers

For any automated integration tests to run reliably we need to control the whole environment, which ideally is our development workstations but also our CI build environment (see “The Developer’s Sandbox”). The main barriers to this with a commercial product like TIBCO are often technological, but also more often than not, organisational too.

In my experience middleware like this tends to be proprietary, very expensive, and owned within the organisation by a dedicated team. They will configure the staging and production queues and manage the fault-tolerant servers, which is probably what you’d expect as you near production. A more modern DevOps friendly company would recognise the need to allow teams to test internally first and would help them get access to the product and tools so they can build their test scaffolding that provides the initial feedback loop.

Hence just being given the client access libraries to the product is not enough; we need a way to bring up and tear down the service endpoint, in isolation, so that we can test connectivity, failover scenarios and message interoperability. We also need to be able to develop and test our logic around poisoned messages and dead-letter queues. And all this needs to be automatable so that as we develop and refactor we can be sure that we’ve not broken anything; manually testing this stuff in a shared test environment is just not scalable at the pace modern software is developed.

That said, the TIBCO EMS SDK I’ve been working with (v6.3.0) has all the parts I needed to do this stuff, albeit with some workarounds to avoid needing to run the tests with administrator rights which we’ll look into later.

The only other thorny issue is licensing. You would hope that software product companies would do their utmost to get developers on their side and make it easy for them to build and test their wares, but it is often hard to get clarity around how the product can be used outside of the final production environment. For example trying to find out if the TIBCO service can be run on a developer’s workstation or in a cloud hosted VM solely for the purposes of running some automated tests has been a somewhat arduous task.

This may not be solely the fault of the underlying product company, although the old fashioned licensing agreements often do little to distinguish production use from modern development use [1]. No, the real difficulty is finding the right person within the client’s company to talk to about such matters. Unless they are au fait with the role modern automated integration testing plays in the development process, you will struggle to convince them that your intended use is in the interests of the 3rd party product vendor, not an attempt to steal revenue from them.

Okay, time to step down from the soap box and focus on the problems we can solve…

Hosting TIBEMSD as a Windows Service

From an automated testing perspective what we need access to is the TIBEMSD.EXE console application. This provides us with one or more TIBCO message queues that we can host on our local machine. Owning this process means we can create, publish to and delete queues on demand, and therefore tightly control the environment.

If you only want to do basic integration testing around the sending and receiving of messages you can configure it as a Windows service and just leave it running in the background. Then your tests can just rely on it always being there like a local database or the file-system. The build machine can be configured this way too.

Unfortunately because it’s a console application and not written to be hosted as a service (at least v6.3 isn’t), you need to use a shim like SRVANY.EXE from the Windows 2003 Resource Kit or something more modern like NSSM. These tools act as an adaptor to the console application so that the Windows SCM can control them.

One thing to be careful of when running TIBEMSD in this way is that it will stick its data files in the CWD (Current Working Directory), which for a service is %SystemRoot%\System32, unless you configure the shim to change it. Putting them in a separate folder makes them a little more obvious and easier to delete when having a clear out [2].
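For example, with NSSM the set-up might look something like this (the service name and paths are purely illustrative); the second command points the working directory, and hence the data files, somewhere more sensible than System32:

> nssm install TibcoEmsTest C:\Tools\TIBCO\tibemsd.exe -config C:\Tools\TIBCO\localhost.conf
> nssm set TibcoEmsTest AppDirectory C:\Tools\TIBCO\data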

Running TIBEMSD On Demand

Running the TIBCO server as a service makes certain kinds of tests easier to write as you don’t have to worry about starting and stopping it, unless that’s exactly the kind of test you want to write.

I’ve found it’s all too easy when adding new code or during a refactoring to accidentally break the service so that it doesn’t behave as intended when the network goes up and down, especially when you’re trying to handle poisoned messages.

Hence I prefer to have the TIBEMSD.EXE binary included in the source code repository, in a known place, so that it can be started and stopped on demand to verify the connectivity side is working properly. For those classes of integration tests where you just need it to be running you can add it to your fixture-level setup, and even keep it running across fixtures to ensure the tests run at an adequate pace.

If, like me, you don’t run as an Administrator all the time (or use elevated command prompts by default) you will find that TIBEMSD doesn’t run out-of-the-box in this way. Fortunately it’s easy to overcome these two issues and run in a LUA (Limited User Account).

Only Bind to the Localhost

One of the problems is that by default the server will try and listen for remote connections from anywhere which means it wants a hole in the firewall for its default port. This of course means you’ll get that firewall popup dialog which is annoying when trying to automate stuff. Whilst you could grant it permission with a one-off NETSH ADVFIREWALL command I prefer components in test mode to not need any special configuration if at all possible.

Windows will allow sockets that only listen for connections from the local host to avoid generating the annoying firewall popup dialog (and this was finally extended to include HTTP too). However we need to tell the TIBCO server to do just that, which we can achieve by creating a trivial configuration file (e.g. localhost.conf) with the following entry:

listen=tcp://127.0.0.1:7222

Now we just need to start it with the -config switch:

> tibemsd.exe -config localhost.conf

Suppressing the Need For Elevation

So far so good but our other problem is that when you start TIBEMSD it wants you to elevate its permissions. I presume this is a legacy thing and there may be some feature that really needs it but so far in my automated tests I haven’t hit it.

There are a number of ways to control elevation for legacy software that doesn’t have a manifest, like using an external one, but TIBEMSD has its own manifest and that takes priority. Luckily for us there is a solution in the form of the __COMPAT_LAYER environment variable [3]. Setting this, either through a batch file or within our test code, suppresses the need to elevate the server and it runs happily in the background as a normal user, e.g.

> set __COMPAT_LAYER=RunAsInvoker
> tibemsd.exe -config localhost.conf

Spawning TIBEMSD From Within a Test

Once we know how to run TIBEMSD without it causing any popups we are in a position to do that from within an automated test running as any user (LUA), e.g. a developer or the build machine.

In C#, the language where I have been doing this most recently, we can either hard-code a relative path [4] to where TIBEMSD.EXE resides within the repo, or read it from the test assembly’s app.config file to give us a little more flexibility.

<appSettings>
  <add key="tibemsd.exe"
       value="..\..\tools\TIBCO\tibemsd.exe" />
  <add key="conf_file"
       value="..\..\tools\TIBCO\localhost.conf" />
</appSettings>
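Reading those settings back out needs nothing more exotic than the stock ConfigurationManager (assuming the test project references System.Configuration):

var serverPath = ConfigurationManager.AppSettings["tibemsd.exe"];
var configPath = ConfigurationManager.AppSettings["conf_file"];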

We can also add our special .conf file to the same folder and therefore find it in the same way. Whilst we could generate it on-the-fly it never changes so I see little point in doing this extra work.

Something to be wary of if you’re using, say, NUnit to write your integration tests is that it (and ReSharper) can copy the test assemblies to a random location to aid in ensuring your tests have no accidental dependencies. In this instance we do have one, and a rather large one at that, so we need the relative distance between where the test assemblies are built and run (XxxIntTests\bin\Debug) and the TIBEMSD.EXE binary to remain fixed. Hence we need to disable this copying behaviour with the /noshadow switch (or “Tools | Unit Testing | Shadow-copy assemblies being tested” in ReSharper).
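With the NUnit 2.x console runner that boils down to something along these lines (the assembly name is illustrative):

> nunit-console.exe XxxIntTests.dll /noshadow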

Given that we know where our test assembly resides we can use Assembly.GetExecutingAssembly() to create a fully qualified path from the relative one like so:

private static string GetExecutingFolder()
{
  var codebase = Assembly.GetExecutingAssembly()
                         .CodeBase;
  var folder = Path.GetDirectoryName(codebase);
  return new Uri(folder).LocalPath;
}
. . .
var thisFolder = GetExecutingFolder();
var tibcoFolder = @"..\..\tools\TIBCO";
var serverPath = Path.Combine(
            thisFolder, tibcoFolder, "tibemsd.exe");
var configPath = Path.Combine(
            thisFolder, tibcoFolder, "localhost.conf");

Now that we know where the binary and config lives we just need to stop the elevation by setting the right environment variable:

Environment.SetEnvironmentVariable("__COMPAT_LAYER", "RunAsInvoker");

Finally we can start the TIBEMSD.EXE console application in the background (i.e. no distracting console window) using Diagnostics.Process:

var process = new System.Diagnostics.Process
{
  StartInfo = new ProcessStartInfo(path, args)
  {
    UseShellExecute = false,
    CreateNoWindow = true,
  }
};
process.Start();

Stopping the daemon involves calling Kill(). There are more graceful ways of remotely stopping a console application which you can try first, but Kill() is always the fall-back approach and of course the TIBCO server has been designed to survive such abuse.

Naturally you can wrap this up with the Dispose pattern so that your test code can be self-contained:

// Arrange
using (RunTibcoServer())
{
  // Act
}

// Assert
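For completeness, here is one possible shape for that RunTibcoServer() helper, stitched together from the earlier snippets; the janitor class and its name are my own invention, not part of any TIBCO API:

private static IDisposable RunTibcoServer()
{
  // Resolve the binary and config paths relative to the test assembly, as before.
  var thisFolder = GetExecutingFolder();
  var tibcoFolder = @"..\..\tools\TIBCO";
  var serverPath = Path.Combine(thisFolder, tibcoFolder, "tibemsd.exe");
  var configPath = Path.Combine(thisFolder, tibcoFolder, "localhost.conf");

  // Suppress elevation, then start the console application with no window.
  Environment.SetEnvironmentVariable("__COMPAT_LAYER", "RunAsInvoker");

  var process = new System.Diagnostics.Process
  {
    StartInfo = new ProcessStartInfo(serverPath, "-config " + configPath)
    {
      UseShellExecute = false,
      CreateNoWindow = true,
    }
  };
  process.Start();

  return new ProcessJanitor(process);
}

// Kills the server when the test (or fixture) disposes it.
private class ProcessJanitor : IDisposable
{
  private readonly Process _process;

  public ProcessJanitor(Process process) { _process = process; }

  public void Dispose()
  {
    if (!_process.HasExited)
      _process.Kill();
    _process.Dispose();
  }
}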

Or if you want to amortise the cost of starting it across your tests you can use the fixture-level set-up and tear down:

private IDisposable _server;

[FixtureSetUp]
public void GivenMessageQueueIsAvailable()
{
  _server = RunTibcoServer();
}

[FixtureTearDown]
public void StopMessageQueue()
{
  _server?.Dispose();
  _server = null;
}

One final issue to be aware of, and it’s a common one with integration tests like this which start a process on demand, is that the server might still be running unintentionally across test runs. This can happen when you’re debugging a test and you kill the debugger whilst still inside the test body. The solution is to ensure that the server definitely isn’t already running before you spawn it, and that can be done by killing any existing instances of it:

Process.GetProcessesByName("tibemsd")
       .ToList()
       .ForEach(p => p.Kill());

Naturally this is a sledgehammer approach and assumes you aren’t using separate ports to run multiple disparate instances, or anything like that.

Other Gotchas

This gets us over the biggest hurdle, control of the server process, but there are a few other little things worth noting.

Due to the asynchronous nature and potential for residual state I’ve found it’s better to drop and re-create any queues at the start of each test to flush them. I also use the Assume.That construct in the arrangement to make it doubly clear I expect the test to start with empty queues.

Also if you’re writing tests that cover background connect and failover be aware that the TIBCO reconnection logic doesn’t trigger unless you have multiple servers configured. Luckily you can specify the same server twice, e.g.

var connection = "tcp://localhost,tcp://localhost";

If you expect your server to shutdown gracefully, even in the face of having no connection to the queue, you might find that calling Close() on the session and/or connection blocks whilst it’s trying to reconnect (at least in EMS v6.3 it does). This might not be an expected production scenario, but it can hang your tests if something goes awry, hence I’ve used a slightly distasteful workaround where the call to Close() happens on a separate thread with a timeout:

Task.Run(() => _connection.Close()).Wait(1000);

Conclusion

Writing automated integration tests against a middleware product like TIBCO is often an uphill battle that I suspect many don’t have the appetite or patience for. Whilst this post tackles the technical challenges, as they are at least surmountable, the somewhat harder problem of tackling the organisation is sadly still left as an exercise for the reader.

 

[1] The modern NoSQL database vendors appear to have a much simpler model – use it as much as you like outside production.

[2] If the data files get really large because you leave test messages in them by accident they can cause your machine to really grind after a restart as the service goes through recovery.

[3] How to Run Applications Manifested as Highest Available With a Logon Script Without Elevation for Members of the Administrators Group

[4] A relative path means the repo can then exist anywhere on the developer’s file-system and also means the code and tools are then always self-consistent across revisions.

Tautologies in Tests

Chris Oldwood from The OldWood Thing

Imagine you’re writing a test for a simple function like abs(). You would probably write something like this:

[Test]
public void abs_returns_the_magnitude_of_the_value()
{
  Assert.That(Math.Abs(-1), Is.EqualTo(1));
}

It’s a simple function, we can calculate the expected output in our head and just plug the expectation (+1) directly in. But what if I said I’ve seen this kind of thing written:

[Test]
public void abs_returns_the_magnitude_of_the_value()
{
  Assert.That(Math.Abs(-1), Is.EqualTo(Math.Abs(-1)));
}

Of course in real life it’s not nearly as obvious as this, the data is lifted out into variables and there is more distance between the action and the way the expectation is derived:

[Test]
public void abs_returns_the_magnitude_of_the_value()
{
  const int negativeValue = -1;

  var expectedValue = Math.Abs(-1);

  Assert.That(Math.Abs(negativeValue),
              Is.EqualTo(expectedValue));
}

I still doubt anyone would actually write this and a simple function like abs() is not what’s usually under test when this crops up. A more realistic scenario would need much more distance between the test and production code, say, a component-level test:

[Test]
public void processed_message_contains_the_request_time()
{
  var requestTime = new DateTime(. . .);
  var input = BuildTestMessage(requestTime, . . . );
  var expectedTime = Processor.FormatTime(requestTime);

  var output = Processor.Process(input, . . .);

  Assert.That(output.RequestTime,
              Is.EqualTo(expectedTime));
}

What Does the Test Say?

If we mentally inline the derivation of the expected value what the test is saying is “When a message is processed the output contains a request time which is formatted by the processor”. This is essentially a tautology because the test is describing its behaviour in terms of the thing under test, it’s self-reinforcing [1].

Applying the advice from Antoine de Saint-Exupéry [2] about perfection being achieved when there is nothing left to take away, let’s implement FormatTime() like this:

public string FormatTime(DateTime value)
{
  return null;
}

The test will still pass. I know this change is perverse and nobody would ever make that ridiculous a mistake, but the point is that the test is not really doing its job. Also as we rely more heavily on refactoring tools we have to work harder to verify that we have not silently broken a different test that was also inadvertently relying on some aspect of the original behaviour.

Good Duplication

We duplicate work in the test for a reason, as a cross-check that we’ve got it right. This “duplication” is often just performed mentally, e.g. formatting a string, but for a more complex behaviour could be done in code using an alternate algorithm [3]. In fact one of the advantages of a practice like TDD is that you have to work it out beforehand and therefore are not tempted to paste the output from running the test on the basis that you’re sure it’s already correct.

If we had duplicated the work of deriving the output in the example above my little simplification would not have worked as the test would then have failed. Once again, adopting the TDD practice of starting with a failing test and transitioning to green by putting the right implementation in proves that the test will fail if the implementation changes unexpectedly.
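Going back to the earlier example, that duplication might be as simple as writing the expected string out by hand; the date and the format below are purely illustrative:

[Test]
public void processed_message_contains_the_request_time()
{
  var requestTime = new DateTime(2016, 11, 22, 10, 33, 44);
  var input = BuildTestMessage(requestTime, . . . );

  var output = Processor.Process(input, . . .);

  // The expectation is derived by hand, not by calling
  // Processor.FormatTime(), so a broken formatter cannot
  // silently agree with itself.
  Assert.That(output.RequestTime,
              Is.EqualTo("2016-11-22T10:33:44"));
}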

This is a sign to watch out for – if you’re not changing the key part of the implementation to make the test pass you might have overly-coupled the test and production code.

What is the Test Really Saying?

The problem with not being the person that wrote the test in the first place is that it may not be telling you what you think it is. For example the tautology may be there because what I just described is not what the author intended the reader to deduce.

The test name only says that the output will contain the time value, the formatting of that value may well be the responsibility of another unit test somewhere else. This is a component level test after all and so I would need to drill into the tests further to see if that were true. A better approach might be to make the breaking change above and see what actually fails. Essentially I would be doing a manual form of Mutation Testing to verify the test coverage.

Alternatively the author may be trying to avoid creating a brittle test which would fail if the formatting was tweaked and so decided the best way to do that would be to reuse the internal code. The question is whether the format matters or not (is it a published API?), and with no other test to specifically answer that question one has to work on an assumption.

This is a noble cause (not writing brittle tests) but there is a balance between the test telling you about a fault in the code and it just being overly specific and annoying by failing on unimportant changes. Sometimes we just need to work a little harder to express the true specification in looser terms. For example maybe we only need to assert that a constituent part of the date is included, say, the year as that is usually the full 4 digits these days:

Assert.That(output.RequestTime,
            Is.StringContaining("2016"));

If we are careful about the values we choose we can ensure that multiple formats can still conform to a looser contract. For example 10:33:44 on 22/11/2016 contains no individual fields that could naturally be formatted in a way where a simple substring search could give a false positive (e.g. the hour being mistaken for the day of the month).

A Balancing Act

Like everything in software engineering there is a trade-off. Whilst we’d probably prefer to be working with a watertight specification that leaves as little room for ambiguity as possible, we often have details that are pretty loose. When that happens we have to decide how we want to trigger a review of this lack of clarity in the future. If we make the test overly restrictive it runs the risk of becoming brittle, whilst making it overly vague could allow breaking changes to go unnoticed until too late.

Borrowing (apocryphally) from Einstein we should strive to make our tests as precise as possible, but not overly precise. In the process we need to ensure we do not accidentally reuse production code in the test such that we find ourselves defining the behaviour of it, with itself.

 

[1] I’ve looked at the self-reinforcing nature of unit tests before in “Man Cannot Live by Unit Testing Alone”.

[2] See “My Favourite Quotes” for some of the other programming related quotes I find particularly inspiring.

[3] Often one that is slower as correctness generally takes centre stage over performance.

A Game of Tag

Phil Nash from level of indirection

One of the tent-pole features of Catch is the ability to write test names as free-form strings. When you run a Catch executable from the command line you can specify a test case by name, to run just that one:

./MyTestExe "a very nice test case"

or you can use wildcards to run a group of test cases (or just one with less typing):

./MyTestExe "*very nice*"

If you want to use wildcards but you're not sure what they'll match you can combine this with the listing option, -l, to see which test cases match the pattern:

./MyTestExe "*very nice*" -l
Matching test cases:
  a very nice test case
  a not very nice test case
2 matching test cases

This is already quite a powerful way to group test cases into ad-hoc "suites". However we don't want to twist our test names into artificial schemes for this purpose (although, early on, that's exactly what I proposed). Instead Catch allows you to add "tags" to test cases.

TEST_CASE( "a very nice test case", "[nice][good]" ) { /* ... */ }
TEST_CASE( "a not very nice test case", "[nice][bad]" ) { /* ... */ }

Now we can run all tests with a certain tag:

./MyTestExe [good]

or combination of tags:

./MyTestExe [nice][good]

also with exclusions:

./MyTestExe [nice]~[bad]

unions are supported with ,:

./MyTestExe [nice],[pleasant]

Very powerful! And this functionality has been around for a while.

More recent, and less well known (mostly because they weren't documented until recently) are a set of "special tags": Instruction Tags, Hiding Tags, Tag Aliases and some automatically generated tags.

Let's see what they're all about.

Instruction Tags

In general all tags that start with a symbol are reserved by Catch (or, put another way, user defined tag names must start with an alpha-numeric character). This allows a nice rich range of namespaces for special tags. Tags that start with the ! character are Instruction tags. They tell Catch something about the test case they apply to. At time of writing the following are defined:

  • [!hide] This "hides" the test from the default run (i.e. if you run the test executable without specifying any names or tags). This feature was originally introduced with the [hide] tag (note, no: !) - and is still supported, though deprecated. There is also a shortcut form, [.] which we'll revisit in a moment.
  • [!throws] This tells Catch that an exception may be thrown in the course of executing the test - even if it is caught and dealt with. If you've ever tried to track down a rogue exception in your debugger - and so have set the debugger to break on exceptions as they're thrown - you'll know how frustrating all the false positives coming from such tests are! So Catch provides a way to suppress exceptions it is expecting - through the -e or --nothrow options on the command line. This already skips over REQUIRE_THROWS... or CHECK_THROWS... assertions. The [!throws] tag covers you for cases where the exception is caught and handled in the code under test (or your test code).
  • [!shouldfail] This tells Catch that you're expecting this test to fail! Furthermore, if it does fail then it should treat that as a pass!
  • [!mayfail] Rather than explicitly inverting the pass/ fail logic as the previous tag does, this tag just says that the test may fail but that's ok (although it is still reported). It's also ok if it passes.

Hiding Tags

We already looked at [!hide] (and the deprecated [hide]) above, and mentioned that [.] was a shortcut for the same.

It turns out that when one of these tags is used it is often combined with another tag that is used when you do want to run the test. The classic example is where you write integration tests in the same executable as unit tests. By default you don't want the integration tests to run as you want the shortest possible path to running just unit tests. So you hide them but also tag them [integration], or something similar (the word "integration" has no significance to Catch). So pairings like [.][integration] or [.][performance] are frequently found together.

So, as a convenience, Catch now supports . as a tag prefix. The rest of the tag can be completely custom and works exactly like any other normal tag - except that the test is also hidden. Our examples would, thus, be written as [.integration] and [.performance]

One final point to mention about hiding tags is that, due to the way they have evolved through a number of forms (including the severely deprecated "./" name prefix) whichever form is used will not only hide the test, but any of the other forms will match it in a tag pattern. e.g. if you tag a test with [.] you can match it with [!hide].

Tag Aliases

As we saw earlier, tags can be combined in fairly complex ways. While this is powerful and flexible, it can be a bit awkward if you often want to use the same tag expression. Wouldn't it be nice if there was a way of writing the expression once then getting Catch to remember it for you - and associate it with an easier to remember name?

Well there is! You can associate any tag pattern with a name that you can use just like any normal tag - except that it must begin with the @ character.

You create a tag alias, in code, using the CATCH_REGISTER_TAG_ALIAS macro. E.g.

CATCH_REGISTER_TAG_ALIAS( "[@not nice]", "~[nice]~[!hide]" );

This registers a tag alias, [@not nice] which, when expanded will match all tests that are not tagged [nice] but also are not hidden. The second part is important because if you have any hidden tests then they will usually be included any time you use a not expression (~) because the rule is that tests are only hidden if no pattern is specified!

Also did you notice that we had a space in the tag name? Surprised? I never said that tags could not include spaces. Of course they can.

You can register as many aliases as you like and you can put them anywhere you like (as long as catch.hpp is #included). However I recommend keeping them all in your main source file (the one you #define CATCH_CONFIG_MAIN, or equivalent) - simply so you only have to look in one place for them.

Filenames As Tags

The newest special tag form is the result of automatically generating a set of tags. The tags all begin with the # character (I've resisted the urge to call them "hash tags"). The rest of the tag is generated from the name of the source file that the test is implemented in. The full path (as reported by __FILE__) is stripped of its directories and extension - so all tests in /Development/Tests/SquirrelTests.cpp would be tagged, [#SquirrelTests].

At time of writing this feature is only available on the develop branch on GitHub - and must be specifically enabled running with the --filenames-as-tags or -# command line options. It's possible that situation may change by the time it makes it onto master.

The Tag Line

So tags not only provide a rich grouping mechanism in Catch - they also allow you to control some aspects of how Catch runs and treats test cases. Some tags can be generated for you - and some tags can be expanded from simpler forms. We've covered here the complete set of special tags at the time of writing. If you're reading this in the future there may be more - I'll try and be better at keeping the docs up-to-date there. Also any stock price tips you might have from the future would be welcome too.

Modern C++ Testing

Phil Nash from level of indirection

Back in August I took my family to stay for a week at my brother's house in (Old) South Wales. I've not been to Wales for a long time and it was great to be back there - but that's not what this post is about, of course.

The thing about Wales is that it has mountains (and where there are no mountains there are plenty of hills and valleys and cliffs). Mobile cell coverage is non-existent in much of the country - particularly where my brother lives.

So the timing was particularly bad when, just as we were driving along the south coast (somewhere between Cardiff and Swansea, I think), I started getting emails and tweets from people pointing out that Catch was riding high on HackerNews! Someone had recently discovered Catch and was enjoying it enough that they wanted to share it with the community. Which is awesome!

Except that, between lack of mobile reception and spending time with my family, I didn't have opportunity to join the discussion.

When I got back home a week later I read through the comments. One of them stuck out because it called me out on describing Catch as a "Modern C++" framework (the commenter recommended another framework, Bandit, as being "more modern").

When I first released Catch, back in 2010, C++11 was still referred to as C++1x (or even C++0x!) and the final release date was still uncertain. So Catch was written to target C++03. It used a few "modern" idioms of the time - but the modernity was intended more as being a break from the past - where most C++ frameworks were just reimplementations of JUnit in C++. So I think the label was somewhat justified at the time.

Of course since then C++11 has not only been standardised but is fully, or nearly fully, implemented by many leading, mainstream, compilers. I think adoption is still not high enough, at this point, that I'd be willing to drop support for C++03 in Catch (there is even an actively maintained fork for VC6!). But it is enough that the baseline for what constitutes "modern C++" has definitely moved on. And now C++14 is here too - pushing it even further forward.

"Modern" is not what it used to be

What does it mean to be a "Modern C++ Test Framework" these days anyway? Well the most obvious thing for the user is probably the use of lambdas. Along with a few other features, lambdas allow for a lot of what previously required macros to be done in pure C++. I'm usually the first to hold this up as A Good Thing. In a moment I'll get to why I don't think it's necessarily as good a step as you might think.

But before I get to that; one other thing: For me, as a framework author, the biggest difference C++11/14 would make to something like Catch would be in the internals. Large chunks of code could be removed, reduced or at least cleaned up. The "no dependencies" policy means that Catch has complete implementations of things like shared pointers, optional types and function objects - as well as many things that must be done the long way round (such as iterating collections - I long for range for loops - or at least BOOST_FOREACH).

The competition

I've come across three frameworks that I'd say qualify as truly trying to be "modern C++ test frameworks". I'm sure there are others - and I've not really even used these ones extensively - but these are the ones I'll reference in this discussion. The three frameworks are:

  • Lest - by Martin Moene, an active contributor to Catch - and partly based on some Catch ideas - re-imagined for a C++11 world.
  • Bandit - this is the one mentioned in the Hacker News comment I kicked off with
  • Mettle - I saw this in a tweet from @MeetingCpp last week and it's what kicked off the train of thought that led me to this post

The case for test case macros

But why did I say that the use of lambdas is not such a good idea? Actually I didn't quite say that. I think lambdas are a very good idea - and in many ways they would certainly clean up at least the mechanics of defining and registering test cases and sections.

Before lambdas C++ had only one place you could write a block of imperative code: in a function (or method). That means that, in Catch, test cases are really just functions - which must have a function signature - including a name (which we hide - because in Catch the test name is a string). Those functions must be captured somehow. This is done by passing a pointer to the function to the constructor of a small class - whose sole purpose is to forward the function pointer onto a global registry. Later, when the tests are being run, the registry is iterated and the function pointers invoked.

So a test case like this:

    TEST_CASE( "test name", "[tags]" )
    {
        /* ... */
    }

...written out in full (after macro expansion) looks something like:

    static void generatedFunctionName();
    namespace{ ::Catch::AutoReg generatedNameAutoRegistrar
        (   &generatedFunctionName, 
       	    ::Catch::SourceLineInfo( __FILE__, static_cast<std::size_t>( __LINE__ ) ), 
            ::Catch::NameAndDesc( "test name", "[tags]") ); 
    }
    static void generatedFunctionName()
    {
        /* .... */
    }

(generatedFunctionName is generated by yet another macro, which combines a root name with the current line number. Because the function is declared static the identifier is only visible in the current translation unit (cpp file), so this should be unique enough)

So there's a lot of boilerplate here - you wouldn't want to write this all by hand every time you start a new test case!

With lambdas, though, blocks of code are now first class entities, and you can introduce them anonymously. So you could write them like:

    Catch11::TestCase( "test name", "[tags]", []() 
    {
        /* ... */
    } );

This is clearly far better than the expanded macro. But it's still noisier than the version that uses the macro. Most of the C++11/14 test frameworks I've looked at tend to group tests together at a higher level. The individual tests are more like Catch's sections - but the pattern is still the same - you get noise from the lambda syntax in the form of the []() or [&]() to introduce the lambda and an extra ); at the end.

Is that really worth worrying about?

Personally I find it's enough extra noise that I think I'd prefer to continue to use a macro - even if it used lambdas under the hood. But it's also small enough that I can certainly see the case for going macro free here.

Assert yourself

But that's just test cases (and sections). Assertions have traditionally been written using macros too. In this case the main reasons are twofold:

  1. It allows the expression evaluation to be wrapped in an exception handler.
  2. It allows us the capture the file and line number to report on.

(1) can arguably be handled in whatever is holding the current lambda (e.g. it or describe in Bandit, suite, subsuite or expect in Mettle). If these blocks are small enough we should get sufficient locality of exception handling - but it's not as tight as the per-expression handling with the macro approach.

(2) simply cannot be done without involving the preprocessor in some way (whether it's to pass __FILE__ and __LINE__ manually, or to encapsulate that with a macro). How much does that matter? Again it's a matter of taste but you get several benefits from having that information. Whether you use it to manually locate the failing assertion or if you're running the reporter in an IDE window that automatically allows you to double-click the failure message to take you to the line - it's really useful to be able to go straight to it. Do you want to give that up in order to go macro free? Perhaps. Perhaps not.

Interestingly, lest still uses a macro for assertions.

Weighing up

So we've seen that a truly modern C++ test framework, using lambdas in particular, can allow you to write tests without the use of macros - but at a cost!

So the other side of the equation must be: what benefit do you get from eschewing the macros?

Personally I've always striven to minimise or eliminate the use of macros in C++. In the early days that was mostly about using const, inline and templates. Now lambdas allow us to address some of the remaining cases and I'm all for that.

But I also tend to associate a much higher "cost" with macro usage when it generates imperative code. This is code that you're likely to find yourself needing to step through in a debugger at runtime - and macros really obfuscate that process. When I use macros it tends to be in declarative code: code that generates purely declarative statements, or effectively declarative statements (such as the test case function registration code). It tends to always generate the exact same machinery, so it should not be sensitive to its inputs in ways that will require debugging.

How do Catch's macros play out in that regard? Well, the test case registration macros get a pass. Sections are a grey area: they are on the path of code that needs to be stepped over - and, worse, they hide a conditional (a section is really just an if statement on a global variable!). So score a few points down there. Assertions are also very much runtime-executable code - and are frequently on the debugging path! In fact, stepping into the expressions being asserted on in Catch tests can be quite a pain, as you end up stepping into some of the "hidden" calls before you get to the expression you supplied (in Visual Studio, at least, this can be mitigated by excluding the Catch namespace using the StepOver registry key).
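
To illustrate the point about sections, here's a crude sketch of the idea - not Catch's real SECTION macro, and the tracking logic is deliberately simplistic - showing how a section boils down to an if statement over some shared tracking state:

    #include <set>
    #include <string>

    // Very crude stand-in for the shared tracking state a real framework keeps.
    struct SectionTracker {
        std::set<std::string> alreadyRun;
        bool shouldRun(std::string const& name) {
            return alreadyRun.insert(name).second;   // here: run each section only once
        }
    };

    inline SectionTracker& exampleTracker() {
        static SectionTracker tracker;
        return tracker;
    }

    // The macro hides an if on that shared state - which is why stepping through
    // a test in a debugger lands you on a conditional you never wrote.
    #define EXAMPLE_SECTION( name ) if( exampleTracker().shouldRun( name ) )

Catch's real section tracking is considerably more involved (it re-runs the test case so that each leaf section executes in isolation), but the debugger-visible shape is the same: a hidden conditional.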

Now, interestingly, the use of macros for the assertions was never really about C++03 vs C++11. It was about capturing extra information (file/line) and wrapping the expression in a try-catch. So if you're willing to make that trade-off there's no reason you can't have non-macro assertions even in C++03!

Back to the future

One of my longer arcs of development on Catch (that I edge towards on each refactoring) is to decouple the assertion mechanism from the guts of the test runner. You should be able to provide your own assertions that work with Catch. Many other test frameworks work this way and it allows them to be much more flexible. In particular it will allow me to decouple the matcher framework (and maybe allow third-party matchers to work with Catch).

Of course this would also allow macro-less assertions to be used (as it happens, the assertions in Bandit and Mettle are both matcher-like already).
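
As a rough illustration of what a matcher-like, macro-free assertion could look like - the names here are invented for the example and are not the Bandit, Mettle or Catch API:

    #include <sstream>
    #include <stdexcept>
    #include <string>

    template <typename T>
    struct EqualTo {
        T expected;
        bool matches(T const& actual) const { return actual == expected; }
        std::string description() const {
            std::ostringstream out;
            out << "equal to " << expected;
            return out.str();
        }
    };

    template <typename T>
    EqualTo<T> equal_to(T expected) { return EqualTo<T>{ expected }; }

    // No macro involved - but also no file/line information to report.
    template <typename T, typename Matcher>
    void expect(T const& actual, Matcher const& matcher) {
        if (!matcher.matches(actual)) {
            std::ostringstream message;
            message << "expected " << matcher.description() << " but got " << actual;
            throw std::runtime_error(message.str());
        }
    }

    int main() {
        expect(2 + 2, equal_to(4));
    }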

So, while I think Catch is committed to supporting C++03 for some time yet, that doesn't mean there is no scope for modernising it and keeping it relevant. And, modern or not, I still believe it is the simplest C++ test framework to get up and running with, and the least noisy to work with.

The Primacy of Testability: Modularity

Austin Bingham from Good With Computers

In the first post in this series I set the stage for a discussion of how testability can serve as a proxy or enabler for other, more directly desirable qualities in a software system. In this post I'd like to look at the first such quality, modularity.

Modularity is perhaps the most obvious quality to correlate with testability. Modularity is a bit like motherhood and apple pie, and we generally recognize it as a "good thing." Most good developers instinctively foster modularity and reject non-modular designs, both because they've been taught that modularity is good and because they've learned through experience that it's important for many, many reasons.

What do we mean by modularity? A simple definition of modularity is "breaking designs into pieces", but that obviously misses important points. We don't just pull a system into arbitrary pieces, but instead we create modules with specific qualities in order to support and foster the goals of whatever it is we're building. A typical (and important) goal with modules is to have high internal cohesion and low external coupling. A highly cohesive module is one in which all of the module elements are related, work together, and are necessary for performing the work that the module does. A module with low coupling is one that doesn't rely unnecessarily or too extensively on other modules.

There are plenty of ways to discuss and even measure modularity in software designs [1], but for our purposes the basic concepts of cohesion and coupling suffice. So how does modular code - code structured as modules with high internal cohesion and low external coupling - correlate with testability? The short answer is that modular code is generally very testable, so let's see how and why.

One way to see the relationship between testability and modularity is to recognize that modular code, by virtue of its low coupling, is easier to instantiate and execute in isolation. The fewer things (classes, services, whatever) that a module needs, the less work it is to write tests for it. That is, you simply have to type less for each test. Likewise, tests for modules with fewer external dependencies often execute more quickly - at least marginally so - because there's less stuff to do for each test. The result is that developers can run the tests more often and will be more inclined to do so.

If you use mocks in your testing system, low coupling means that you generally have to design fewer mock objects (because your modules have fewer dependencies that need mocking). Moreover, your mocks will often be simpler because they won't have to simulate interactions between far-flung dependencies.
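
A tiny sketch of that point, with made-up types: a module whose only dependency is one small interface needs only one small, simple fake in its tests.

    #include <cassert>
    #include <string>

    struct Clock {                       // the module's single dependency
        virtual ~Clock() {}
        virtual int hour() const = 0;
    };

    struct Greeter {                     // the module under test
        explicit Greeter(Clock const& clock) : clock_(clock) {}
        std::string greeting() const {
            return clock_.hour() < 12 ? "good morning" : "good afternoon";
        }
    private:
        Clock const& clock_;
    };

    struct FakeClock : Clock {           // the one, trivial mock the test needs
        int fixedHour;
        explicit FakeClock(int h) : fixedHour(h) {}
        int hour() const override { return fixedHour; }
    };

    int main() {
        FakeClock nineInTheMorning(9);
        assert(Greeter(nineInTheMorning).greeting() == "good morning");
    }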

The relationship between modularity and testability goes even deeper, though. Beyond the somewhat mechanical questions of instantiation times, test length, and so forth, there are serious cognitive problems associated with testing poorly-modularized code. The first of these problems is that your tests can end up testing the wrong thing. If a test for some bit of functionality has to invoke code in other modules, the success or failure of the test depends not just on the code under test but on all of those other modules. On some level this is unavoidable [2], but unnecessary dependencies unnecessarily increase the risk that the results you're seeing for your tests are caused by faults in other modules. These kinds of external failures take time to debug, can distort tests as developers try to work around them, or - in the case of incorrectly passing tests - can mask actual defects. [3] Ideally, the output from your tests should give you precise and unambiguous information about just the element under test, and reducing coupling helps developers approach that ideal.

A second way that poorly modularized code increases cognitive load for test developers is that it increases the state space with which a test developer needs to contend. For every external element that plays a role in a module's functionality, the state space that needs to be considered for testing grows to include the potential states of that element. For example, if a test relies on some global state managed by an external module - perhaps something like an application object managed by the GUI library - then test developers need to ensure that this external object is in the correct state when the test is executed. This means that no matter what other tests have been run, what order they've been run in, or their results, the test developer needs to ensure that this external state is acceptable. As external test dependencies grow, this management of external state can become a significant or even dominant aspect of test development. In other words, the more unnecessary dependencies a module has, the more time a developer will need to spend thinking about those dependencies rather than focusing on testing the functionality at hand. Babysitting unnecessary dependencies doesn't generate information or value. [4]

So we can see that poorly modularized code is generally harder to test, and that's good to know. But the goal of this series is to explore how testability reflects other qualities, not the other way around. How, then, can the testability of our code inform us about its modularity?

The important point is this: if you find that your code is easy to test, then that's a good indication that your code is modular! If you sit down to write some tests and you find that your fixtures are easy to write, the extent of the tests is easy to define, and you don't find that you're constantly needing to consider distant side-effects, then you're probably testing modular code. To carry on with the barometer metaphor, you've got high pressure, and you can see blue skies and little white puffy clouds. It's time for a picnic.

On the other hand, if your testing efforts involve lots of ceremony - instantiating objects, wiring them together, and bootstrapping subsystems - or if you feel like you're really just shotgunning tests at the code rather than pinpointing the "obvious" testing points, then chances are that you've got some modularity issues.

Like I said, modularity is perhaps the most obvious quality that we can associate with testability. And the way to leverage testability in this case is to pay attention to what your testing efforts tell you. There's a lot of information in the work involved in developing tests, and drawing insight from this work is a valuable doubling-up of that effort.

A corollary to this insight is that, when you find yourself having difficulty writing some tests, you should step back and ask yourself if these problems stem from poor modularity. Testing can be a bit repetitive - often the same shape with minor tweaks - and this is a wonderful environment in which to spot repeated bad patterns. And sometimes consideration of these patterns will show where you might have an opportunity for better modularization of your code.

[1]See for example some of Martin Fowler's thoughts on the topic, discussions about the interaction between type-systems and modularity, and even some information-theoretic approaches.
[2]Operating systems and standard libraries are "other modules" by most measures.
[3]Consider a test for a specific exception. If some dependency is throwing that exception and the function under test is, in fact, not throwing the exception, then the test is passing erroneously.
[4]And as Gerald Weinberg says, tests are about producing information.

The Primacy of Testability

Austin Bingham from Good With Computers

The job of a software architect [1] is difficult, just like almost every role in software development. They have to keep track of many subtly interacting quality attributes, often on multiple projects, any one of which may be too big or evolving too quickly to meaningfully keep in mental cache. To make matters worse, architects don't have anywhere near the level of tool support - compilers, static analysis tools, auto-completion - available to developers. They are much more reliant on experience, awareness, intuition, and heuristics.

In light of this, it's interesting and useful to consider what tools are available to help architects. In particular, I want to look at the role of testability in the architect's job, and to try to show how it can serve as a meaningful proxy for other, perhaps more important qualities in a software system. Testability is a quality that can promote the health of other desirable qualities, and it can serve as an indicator of whether those qualities are being achieved. The metaphor I like to use is that testability is a kind of barometer for software architects. A barometer only really tells you the air pressure, but you can often use this to determine if there's going to be rain. Testability only really tells you how amenable your code is to useful testing, but you can often use this to help determine if your system is modular, organizationally scalable, and so forth.

What is testability anyway?

To meaningfully discuss testability as a tool, we need to establish some definition of what it means. As with "software architect", there is no perfect answer. On some level all software is testable, in that you can test it. By hook or by crook you can write some code that verifies the behavior of pretty much anything with a specification. So clearly just "being testable" isn't a sufficient definition.

At the same time, it's also pretty clear that some software is simpler to test than other software. It may be easy to test for a number of reasons. Perhaps it's easy to understand, so that you have a clear understanding of how to test it thoroughly and properly. Perhaps the chunk of code is easy to instantiate without requiring a whole bunch of scaffolding and support objects. This not only saves on keystrokes but it also has other big benefits: it isolates behavior, it may mean your tests are faster, and it generally means that your tests are easier to understand and thus maintain.

If you do a little poking around you'll find that people have hit upon certain code qualities that generally influence the testability of a piece of code. [2] But in the end we don't really have a "testability-o-meter" that we can point at a piece of code. There's no accepted way to assign a "testability rating" to software that tells you if code is more or less testable than other code, or even if it's "easy to test" or "hard to test". We can sometimes get these kinds of numbers for other qualities like modularity or complexity, and things like "scalability" also lend themselves to being measured, but testability isn't (yet) in that realm.

Instead, determining if something is testable is a decision that people need to make, and it's a decision that you can only make in an informed way if you understand code. And this is why my definition of "software architect" - from a practical standpoint - includes being able to understand code well at many levels. You have to be able to recognize when, say, dependency injection could replace local object construction to reduce coupling in a system. You need to be able to spot - or at least know to be on the lookout for - circular dependencies between modules. And in general you're going to need to be able to do this not only with code that you're writing but with code that you only see in reviews or maybe only see described in documents.
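
As a concrete (if contrived) illustration of that first point - the types below are invented for the example - here's the same small module before and after dependency injection replaces local object construction:

    #include <string>

    struct Database {
        virtual ~Database() {}
        virtual std::string lookup(int id) const = 0;
    };

    struct ProductionDatabase : Database {
        std::string lookup(int id) const override {
            return "row " + std::to_string(id);   // imagine a real query here
        }
    };

    // Before: the report constructs its own dependency, so every test drags in
    // ProductionDatabase whether it wants to or not.
    struct TightlyCoupledReport {
        std::string titleFor(int id) const { return ProductionDatabase().lookup(id); }
    };

    // After: the dependency is injected, so a test can hand in any Database it
    // likes - including a trivial in-memory fake.
    struct InjectedReport {
        explicit InjectedReport(Database const& db) : db_(db) {}
        std::string titleFor(int id) const { return db_.lookup(id); }
    private:
        Database const& db_;
    };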

Why testability?

So I've just told you that testability is hard to measure or even to define. In fact, I've told you that to make heads or tails of it you need to be an experienced programmer. On its face, then, it sounds like the cure is worse than the disease: yes, you've got complexity in your projects to deal with, but now I want you to do something even harder to make those problems go away.

On some level that's true! Gauging testability isn't simple and it's not perfect, but by targeting testability we get a couple of important benefits, because testability is special.

Testability represents your first customer, your first users: your tests! Tests are very often the first place your code is used outside of your head. This means that this is where you'll first spot difficult APIs or awkward relationships that slipped through your design.

Tests force us to use code, and they force us to consider it at many different zoom levels - from unit tests to functional tests to integration tests, we get to see it all. And tests can - and should - happen early and often in the development process. This is how you get maximum benefit from them.

If you're paying attention you'll notice that I just made a significant shift in terminology. I went from talking about "testability" to "tests", from "code that can be tested" to "code with tests". I guess it's arguable that you can have testable code without actually having tests, but that seems a bit academic to me. I've gone on and on about how difficult it is to measure testability, but one of the most effective and practical ways to assess testability is to simply test your code!

So for my purposes, testable code is also tested code. I won't quibble about precisely how much testing is enough, or at what level it should be done; there are plenty of other people who are happy to tell you that. [3]

But if your tests add value to your software system, then I'd wager that they exercise your code enough to highlight a lot of the software qualities for which testability is a barometer.

Qualities correlated with testability

This article lays the foundation for the rest of this series in which we'll look at various software qualities that correlate with testability. Some of these qualities, such as modularity, are directly reflected in the testability of a system. Other qualities, like performance, are supported or enabled by testability but aren't directly related to it.

[1]This is an ill-defined term, to be sure, but I'm essentially talking about the person tasked with shepherding the so-called non-functional requirements...whether their job title is "software architect" or not.
[2]Wikipedia's got a nice, non-controversial list of things like "observability", "heterogeneity", and "understandability", and all of these things certainly would influence how easy or hard it is to test a piece of code.
[3]See for example TDD, BDD, or James Coplien's thoughts.