One of the problems when making code changes is knowing whether there is good test coverage around the area you’re going to touch. In theory, if a rigorous test-first approach is taken no production code should be written without first being backed by a failing test. Of course we all know the old adage about how theory turns out in practice . Even so, just because a test has been written, you don’t know what the quality of it and any related ones are.
The practice of mutation testing is one way to answer the perennial question: how do you test the tests? How do you know if the tests which have been written adequately cover the behaviours the code should exhibit? Alternatively, as a described recently in “Overly Prescriptive Tests”, are the tests too brittle because they require too exacting a behaviour?
There are tools out there which will perform mutation testing automatically that you can include as part of your build pipeline. However I tend to use them in a more manual way to help me verify the tests around the small area of functionality I’m currently concerned with .
The principle is actually very simple, you just tweak the production code in a small way that would likely mimic a real change and you see what tests fail. If no tests fail at all then you probably have a gap in your spec that needs filling.
Naturally the changes you make on the production code should be sensible and functional in behaviour; there’s no point in randomly setting a reference to null if that scenario is impossible to achieve through the normal course of events. What we’re aiming for here is the simulation of an accidental breaking change by a developer. By tweaking the boundaries of any logic we can also check that our edge cases have adequate coverage too.
There is of course the possibility that this will also unearth some dead code paths too, or at least lead you to further simplify the production code to achieve the same expected behaviour.
Imagine you’re working on a service and you spy some code that appears to format a DateTime value using the default formatter. You have a hunch this might be wrong but there is no obvious unit test for the formatting behaviour. It’s possible the value is observed and checked in an integration or acceptance test elsewhere but you can’t obviously  find one.
Naturally if you break the production code a corresponding test should break. But how badly do you break it? If you go too far all your tests might fail because you broke something fundamental, so you need to do it in varying degrees and observe what happens at each step.
If you tweak the date format, say, from the US to the UK format nothing may happen. That might be because the tests use a value like 1st January which is 01/01 in both schemes. Changing from a local time format to an ISO format may provoke something new to fail. If the test date is particularly well chosen and loosely verified this could well still be inside whatever specification was chosen.
Moving away from a purely numeric form to a more natural, wordy one should change the length and value composition even further. If we reach this point and no tests have failed it’s a good chance nothing will. We can then try an empty string, nonsense strings and even a null string reference to see if someone only cares that some arbitrary value is provided.
But what if after all that effort still no lights start flashing and the klaxon continues to remain silent?
What Does a Test Pass or Fail Really Mean?
In the ideal scenario as you slowly make more and more severe changes you would eventually hope for one or maybe a couple of tests to start failing. When you inspect them it should be obvious from their name and structure what was being expected, and why. If the test name and assertion clearly specifies that some arbitrary value is required then its probably intentional. Of course It may still be undesirable for other reasons  but the test might express its intent well (to document and to verify).
If we only make a very small change and a lot of tests go red we’ve probably got some brittle tests that are highly dependent on some unrelated behaviour, or are duplicating behaviours already expressed (probably better) elsewhere.
If the tests stay green this does not necessary mean we’re still operating within the expected behaviour. It’s entirely possible that the behaviour has been left completely unspecified because it was overlooked or forgotten about. It might be that not enough was known at the time and the author expected someone else to “fill in the blanks” at a later date. Or maybe the author just didn’t think a test was needed because the intent was so obvious.
Plugging the Gaps
Depending on the testing culture in the team and your own appetite for well defined executable specifications you may find mutation testing leaves you with more questions than you are willing to take on. You only have so much time and so need to find a way to plug any holes in the most effective way you can. Ideally you’ll follow the Boy Scout Rule and at least leave the codebase in a better state than you found it, even if that isn’t entirely to your own satisfaction.
The main thing I get out of using mutation testing is a better understanding of what it means to write good tests. Seeing how breaks are detected and reasoned about from the resulting evidence gives you a different perspective on how to express your intent. My tests definitely aren’t perfect but by purposefully breaking code up front you get a better feel for how to write less brittle tests than you might by using TDD alone.
With TDD you are the author of both the tests and production code and so are highly familiar with both from the start. Making a change to existing code by starting with mutation testing gives you a better orientation of where the existing tests are and how they perform before you write your own first new failing test.
Refactoring is about changing the code without changing the behaviour. This can also apply to tests too in which case mutation testing can provide the technique for which you start by creating failing production code that you “fix” when the test is changed and the bar goes green again. You can then commit the refactored tests before starting on the change you originally intended to make.
 “In theory there is no difference between theory and practice; in practice there is.” –- Jan L. A. van de Snepscheut.
 Just as with techniques like static code analysis you really need to adopt this from the beginning if you want to keep the noise level down and avoid exploring too large a rabbit hole.
 How you organise your tests is a subject in its own right but suffice to say that it’s usually easier to find a unit test than an acceptance test that depends on any given behaviour.
 The author may have misunderstood the requirement or the requirement was never clear originally and so the the behaviour was left loosely specified in the short term.