Writing: Becoming a Better Programmer

Pete Goodliffe from Pete Goodliffe


It's finally here!

My new book, Becoming a Better Programmer, is fully edited, laid out, and is now available as a final product for your reading pleasure, published by O'Reilly. You can purchase it in printed form or as a digital version for your e-reader of choice.

Find out more about the book from the O'Reilly product page. You can view the full table of contents there. Or head over to Amazon to purchase. If you are a Safari subscriber, you can read it here. Grab your iBook here.

The cover image is a flying fish. I'll leave it to your imagination to work out the significance.

It's great to finally see this labour of love come to fruition, and I do hope it stands as a useful resource for programmers today.

One of my favourite parts of the book is a family in-joke in the "advance praise" at the front. Nestled amongst the luminaries and expert programmers who graciously contributed their honest thoughts on the book is another very honest opinion:


How Visual Lint parses projects and makefiles

Products, the Universe and Everything from Products, the Universe and Everything

Code analysis tools can require a lot of configuration to be useful. Whilst some (e.g. Vera++ or cpplint.py) need very little configuration to be used effectively, others such as PC-lint (and even, to a lesser extent, CppCheck) may need to be fed pretty much the same configuration as the compiler itself to be useful. As a result the command lines you need to use with some analysis tools are pretty complex in practice (which is of course a disincentive to using them - and is where Visual Lint comes in. But I digress...).

In fact, more capable code analysis tools may need to be given even more information than the compiler itself to generate meaningful results - for example they may need to be configured with the built-in preprocessor settings of the compiler itself. In a PC-lint analysis configuration, this is part of the job that compiler indirect files such as co-msc110.lnt do.
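
By way of illustration, the options in a compiler indirect file look something like the following (these lines are my own illustration of the general form, not an excerpt from the real co-msc110.lnt):

-D_MSC_VER=1700    // built-in version symbol of the Visual C++ 11.0 (VS 2012) compiler
-D_WIN32           // built-in platform symbol
+fdi               // emulate Microsoft's #include search order (start from the
                   // directory of the including file)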

As a result of the above, a product such as Visual Lint which effectively acts as a "front end" to complex (and almost infinitely configurable) code analysis tools like PC-lint needs to care about the details of how your compiler actually sees your code every bit as much as the analysis tool itself does.

What this means in practice is that Visual Lint needs to be able to determine the properties of each project it sees and understand how they are affected by the properties of project platforms, configurations and whatever compiler(s) a project uses. When it generates an analysis command line, it may need to reflect those properties so that the analysis tool sees the details of the preprocessor symbols, include folders etc. each file in the project would normally be compiled with - including not only the properties exposed in the corresponding project or makefile, but also built-in symbols etc. normally provided by the corresponding compiler.

That's a lot of data to get right, and inevitably sometimes there will be edge cases where we don't quite get it right the first time. It goes without saying that if you find one of those edge cases - please tell us!

So, background waffle over - in this post I'm going to talk about one of the things Visual Lint does - parsing project files to identify (among other things) the preprocessor symbols and include folders for each file in order to be able to reflect this information in the analysis tool configuration.

When analysing with CppCheck, preprocessor and include folder data read in this way can be included on the generated command line as -D and -I directives. Although we could do the same with PC-lint, it is generally better to write the preprocessor and include folder configuration to an indirect ("project.lnt") file which also includes details of which implementation (.c, .cpp, .cxx etc.) files are included in the project - as well as any other project-specific options. For example:

// Generated by Visual Lint 4.5.6.233 from file: SourceVersioner_vs90.vcproj
// -dConfiguration=Release|Win32

//
-si4 -sp4 // Platform = "Win32"
//
+ffb // ForceConformanceInForLoopScope = "TRUE"
-D_UNICODE;UNICODE // CharacterSet = "1"
-DWIN32;NDEBUG;_CONSOLE // PreprocessorDefinitions = "WIN32;NDEBUG;_CONSOLE"
-D_CPPRTTI // RuntimeTypeInfo = "TRUE"
-D_MT // RuntimeLibrary = "0"
//
// AdditionalIncludeDirectories = "..\Include"
-save -e686 //
-i"..\Include" //
-restore //
//
// SystemIncludeDirectories = "
// F:\Projects\Libraries\boost\boost_1_55_0;
// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\include;
// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\atlmfc\include;
// C:\Program Files\
// Microsoft SDKs\Windows\v6.0A\include;
// F:\Projects\Libraries\Wtl\8.1\Include"
//
-save -e686 //
+libdir(F:\Projects\Libraries\boost\boost_1_55_0)
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include")
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir(F:\Projects\Libraries\Wtl\8.1\Include)
-iF:\Projects\Libraries\boost\boost_1_55_0
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include"
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include"
-i"C:\Program Files\Microsoft SDKs\Windows\v6.0A\include"
-iF:\Projects\Libraries\Wtl\8.1\Include
-restore //
//
SourceVersioner.cpp // RelativePath = "SourceVersioner.cpp"
SourceVersionerImpl.cpp // RelativePath = "SourceVersionerImpl.cpp"
stdafx.cpp // RelativePath = "stdafx.cpp"
Shared\FileUtils.cpp // RelativePath = "Shared\FileUtils.cpp"
Shared\FileVersioner.cpp // RelativePath = "Shared\FileVersioner.cpp"
Shared\PathUtils.cpp // RelativePath = "Shared\PathUtils.cpp"
Shared\ProjectConfiguration.cpp // RelativePath = "Shared\ProjectConfiguration.cpp"
Shared\ProjectFileReader.cpp // RelativePath = "Shared\ProjectFileReader.cpp"
Shared\SolutionFileReader.cpp // RelativePath = "Shared\SolutionFileReader.cpp"
Shared\SplitPath.cpp // RelativePath = "Shared\SplitPath.cpp"
Shared\StringUtils.cpp // RelativePath = "Shared\StringUtils.cpp"
Shared\XmlUtils.cpp // RelativePath = "Shared\XmlUtils.cpp"

A file like this is written for every project configuration we analyse, but the mechanism used to actually read the configuration data from projects varies slightly depending on the project type and structure.

In the case of the Visual Studio 2002, 2003, 2005 and 2008 .vcproj files Visual Lint was originally designed to work with, this is straightforward, as the project file contains virtually all of the information needed in a simple-to-read, declarative form. Hence to parse a .vcproj file we simply load its contents into an XML DOM object and query the properties in a straightforward manner. Built-in preprocessor symbols such as _CPPUNWIND are defined by reading the corresponding properties (EnableExceptions in the case above) or inferred from the active platform etc.
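
Just to illustrate how declarative the format is, here is a minimal sketch of reading those properties from a .vcproj file. It uses the open source pugixml parser for brevity - not necessarily the XML DOM implementation Visual Lint itself uses:

#include <iostream>
#include <string>
#include "pugixml.hpp"

int main()
{
    pugi::xml_document doc;
    if (!doc.load_file("SourceVersioner_vs90.vcproj"))
        return 1;

    // Each <Configuration> element (e.g. "Release|Win32") carries its own settings
    for (pugi::xml_node config : doc.child("VisualStudioProject")
                                    .child("Configurations")
                                    .children("Configuration"))
    {
        std::cout << config.attribute("Name").value() << '\n';

        // The compiler settings are attributes of the VCCLCompilerTool <Tool> element
        for (pugi::xml_node tool : config.children("Tool"))
        {
            if (std::string(tool.attribute("Name").value()) == "VCCLCompilerTool")
            {
                std::cout << "  PreprocessorDefinitions: "
                          << tool.attribute("PreprocessorDefinitions").value() << '\n'
                          << "  AdditionalIncludeDirectories: "
                          << tool.attribute("AdditionalIncludeDirectories").value() << '\n';
            }
        }
    }
}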

Similarly, for Visual C++ 6.0 and eMbedded Visual C++ 4.0 project files we just parse the compiler command lines in the .dsp or .vcp file directly. This is pretty straightforward as well: although .dsp and .vcp files are really makefiles, they have a very predictable structure. Some other development environments (e.g. Green Hills MULTI, CodeVisionAVR) have bespoke project file formats which are generally easy to parse using conventional techniques.

Visual Studio 2010, 2012 and 2013 .vcxproj project files are far more tricky, as the MSBuild XML format they use is effectively a scripting language rather than a declarative format. To load them, we effectively have to execute them (but obviously without running the commands they contain).

As you can imagine this is rather tricky! We basically had to write a complete MSBuild parsing engine* for this task, which effectively executes the MSBuild script in memory with defined parameters to identify its detailed properties.

* Although there are MSBuild APIs which could conceivably help, they are implemented in managed code - which we can't use in a plug-in environment due to .NET framework versioning restrictions. Instead, our MSBuild parsing engine is written natively in C++.

To work out the parameters to apply to the MSBuild script, we prescan the contents of the .vcxproj file to identify the configurations and platforms available. Once we have those, we can run the script in memory and inspect the properties it generates for the required build configuration. This process is also used with CodeGear C++ and Atmel Studio projects, both of which are MSBuild based.
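
The prescan itself can stay declarative even though the rest of the file is script, because the ProjectConfiguration items are plain XML. A sketch of that step (pugixml again, and again illustrative rather than the production code):

#include <iostream>
#include "pugixml.hpp"

int main()
{
    pugi::xml_document doc;
    if (!doc.load_file("Example.vcxproj"))
        return 1;

    // <ProjectConfiguration Include="Release|Win32"> items enumerate the
    // configuration/platform pairs the MSBuild script can be evaluated for
    for (pugi::xml_node group : doc.child("Project").children("ItemGroup"))
        for (pugi::xml_node config : group.children("ProjectConfiguration"))
            std::cout << config.attribute("Include").value() << '\n';
}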

Eclipse projects (.project and .cproject files) are also XML based and are not too difficult to parse, but whereas it is straightforward to work out how things are defined in a .vcproj file, Eclipse project files are much less clearly defined and more variable (not surprising as they effectively represent serialised Java object data).

To make matters worse, each compiler toolchain can have its own sub-format, so things can change in very subtle ways between projects for different Eclipse based environments such as Eclipse/CDT, QNX Momentics and CodeWarrior. In addition, Eclipse C/C++ projects come in two flavours - managed builder and makefile.

Of the two, managed builder projects are easy to deal with - for example, to determine the built-in settings of the compiler in a managed builder project we read the scanner (*.sc) file produced by the IDE when it builds the project, and add that data to the configuration data we have been able to read from the project file.

The scanner (.sc) file is a simple XML file located within the Eclipse workspace .metadata folder. Here's an extract to give you an idea what it looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?scdStore version="2"?>
<scannerInfo id="org.eclipse.cdt.make.core.discoveredScannerInfo">
<instance id="cdt.managedbuild.config.gnu.mingw.exe.debug.116390618;
cdt.managedbuild.config.gnu.mingw.exe.debug.116390618.;
cdt.managedbuild.tool.gnu.cpp.compiler.mingw.exe.debug.1888704458;
cdt.managedbuild.tool.gnu.cpp.compiler.input.391761176">
<collector id="org.eclipse.cdt.make.core.PerProjectSICollector">
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/mingw32"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/backward"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include-fixed"/>
<definedSymbol symbol="__STDC__=1"/>
<definedSymbol symbol="__cplusplus=1"/>
<definedSymbol symbol="__STDC_HOSTED__=1"/>
<definedSymbol symbol="__GNUC__=4"/>
<definedSymbol symbol="__GNUC_MINOR__=5"/>
<definedSymbol symbol="__GNUC_PATCHLEVEL__=2"/>
<definedSymbol symbol="__GNUG__=4"/>
<definedSymbol symbol="__SIZE_TYPE__=unsigned int"/>
.
.
.
</collector>
</instance>
</scannerInfo>
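
Pulling the discovered settings out of a file like that is straightforward XML work. A sketch of reading the includePath and definedSymbol entries (pugixml once more; Visual Lint's actual reader will differ):

#include <iostream>
#include "pugixml.hpp"

int main()
{
    pugi::xml_document doc;
    if (!doc.load_file("example.sc"))
        return 1;

    pugi::xml_node collector =
        doc.child("scannerInfo").child("instance").child("collector");

    for (pugi::xml_node inc : collector.children("includePath"))
        std::cout << "include: " << inc.attribute("path").value() << '\n';
    for (pugi::xml_node sym : collector.children("definedSymbol"))
        std::cout << "symbol:  " << sym.attribute("symbol").value() << '\n';
}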

Unfortunately scanner files are not generated for makefile projects, and the enclosing project files do not give details of the preprocessor symbols and include folders needed to analyse them. So how do we do it?

The answer is similar to the way we load MSBuild projects - by running the makefile without executing the commands within. In this case, however, the make tools themselves can (fortunately!) lend a hand, as both NMake and GNU Make have an option which runs the makefile and echoes the commands which would be executed, without actually running them. For NMake this is /n; for GNU Make it is -n or --just-print.

The /B (NMake) or -B (GNU Make) switch can also be used to ensure that all targets are built regardless of the build status of the project (otherwise we wouldn't be able to read anything if the project build was up to date).

If we run a makefile with these switches and capture the output, we can then parse the compiler command lines themselves and extract preprocessor symbols etc. from them. As so many compilers have similar command line switches for things like preprocessor symbols and include folders, this is generally a pretty straightforward matter. Many variants of GCC also have command line options which can report details of built-in preprocessor symbols and system include folders, which we can obviously make use of - so the fact that many embedded C/C++ compilers today are based on GCC is very useful indeed.
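
As a toy example of the kind of command line scraping involved (a sketch, not Visual Lint's actual parser - real command lines also contain quoted arguments, response files and separated forms such as "-D FOO", all of which need handling):

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    // A captured GCC-style compile command, as echoed by 'make -n'
    const std::string commandLine =
        "gcc -c -O2 -DNDEBUG -D_REENTRANT -I../Include -Isrc main.c";

    std::istringstream tokens(commandLine);
    std::string token;
    while (tokens >> token)
    {
        if (token.compare(0, 2, "-D") == 0)
            std::cout << "symbol:  " << token.substr(2) << '\n';
        else if (token.compare(0, 2, "-I") == 0)
            std::cout << "include: " << token.substr(2) << '\n';
    }
}

Incidentally, with GCC itself the built-in symbols can be dumped with echo | gcc -dM -E -, and gcc -E -v prints the system include search path along the way.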

There's a lot of detail I've not covered here, but I hope you get the general idea. It follows from the above that adding support for new project file types to Visual Lint is generally not difficult - provided we have access to representative project files and (preferably) a test installation of the IDE in question. If there is a project file type you would like us to look at, please tell us.


A More Full-Featured Emacs company-mode Backend

Austin Bingham from Good With Computers

In the first article in this series we looked at how to define the simplest company-mode backend. [1] This backend drew completion candidates from a predefined list of options, and allowed you to do completion in buffers in fundamental mode. The main purpose of that article was to introduce the essential plumbing of a company-mode backend.

In this article we'll expand upon the work of the first, adding some useful UI elements like annotations and metadata. We'll also implement a rough form of fuzzy matching, wherein candidates will be presented to the user when they mostly match the prefix. After this article you'll know almost everything you need to know about writing company-mode backends, and you'll be in a great position to learn the rest on your own.

Most of what we'll be doing in the article revolves around handling completion candidate "metadata", data associated in some way with our completion candidates. In practice this kind of data covers things like documentation strings, function signatures, symbol types, and so forth, but for our purposes we'll simply associate some biographical data with the names in our completion set sample-completions.

company-mode provides affordances for displaying metadata as part of the completion process. For example, if your backend is showing completions for function names, you could display the currently-selected function's signature in the echo area. We'll develop a backend that displays a sentence about the selected candidate in the echo area, and we'll also display their initials as an annotation in the candidate selection popup menu.

Adding more data to our completion candidates

First we need to add some metadata to our existing completion candidates. To do this we'll use Emacs text properties. (Text properties allow you to associate arbitrary data with strings. You can read about them here.) For each completion candidate we define an :initials property containing their initials and a :summary property containing a one-sentence summary of the candidate. [2] To attach these properties we build each candidate with propertize, which also ensures the concat calls producing the summaries are actually evaluated; update sample-completions to look like this:

(defconst sample-completions
  (list
   (propertize "alan"
               :initials "AMT"
               :summary (concat "Alan Mathison Turing, OBE, FRS (/ˈtjʊərɪŋ/ "
                                "tewr-ing; 23 June 1912 – 7 June 1954) was a "
                                "British mathematician, logician, cryptanalyst, "
                                "philosopher, pioneering computer scientist, "
                                "mathematical biologist, and marathon and ultra "
                                "distance runner."))
   (propertize "john"
               :initials "JVN"
               :summary (concat "John von Neumann (/vɒn ˈnɔɪmən/; December 28, "
                                "1903 – February 8, 1957) was a Hungarian and "
                                "American pure and applied mathematician, physicist, "
                                "inventor and polymath."))
   (propertize "ada"
               :initials "AAK"
               :summary (concat "Augusta Ada King, Countess of Lovelace (10 December "
                                "1815 – 27 November 1852), born Augusta Ada Byron "
                                "and now commonly known as Ada Lovelace, was an "
                                "English mathematician and writer chiefly known for "
                                "her work on Charles Babbage's early mechanical "
                                "general-purpose computer, the Analytical Engine."))
   (propertize "don"
               :initials "DEK"
               :summary (concat "Donald Ervin Knuth (/kəˈnuːθ/ kə-nooth; born "
                                "January 10, 1938) is an American computer "
                                "scientist, mathematician, and Professor Emeritus "
                                "at Stanford University."))))

Attaching properties like this is a very convenient way to store metadata for completion candidates. Of course in a real backend you probably wouldn't have a hard-coded list of candidates, and you'd be fetching them dynamically from a server, database, or external process. In that case, you'd need to also dynamically fetch the metadata you want and attach it to the candidate strings you serve through your backend. In the end, text properties work well in this context because they transparently transport the metadata - which company-mode doesn't know about - with the completion strings that company-mode definitely knows about.

Adding completion menu annotations

This change by itself doesn't really do anything, of course. All we've done is add properties to some strings; we need to instruct company-mode on how to actually use them for display. The first way we'll use this metadata, then, is to add a small annotation to each entry in the popup menu used for candidate selection. To add this annotation, we need to update company-sample-backend to respond to the annotation command. This command should resolve to the annotation you want to use for the given candidate. Typically this means calling a function that takes the completion candidate string and returns the annotation string.

First let's define a function that takes a completion candidate string and returns an annotation. Remember that our candidate strings store their metadata as text properties, so fundamentally this function simply needs to extract a property. For the annotation, we'll extract the :initials property and return it (prefixed with a space). That function looks like this:

(defun sample-annotation (s)
  (format " [%s]" (get-text-property 0 :initials s)))

Next we need to update our backend to respond to the annotation command like this:

(require 'cl)  ; provides `case' and `remove-if-not'

(defun company-sample-backend (command &optional arg &rest ignored)
  (interactive (list 'interactive))

  (case command
    (interactive (company-begin-backend 'company-sample-backend))
    (prefix (and (eq major-mode 'fundamental-mode)
                 (company-grab-symbol)))
    (candidates
     (remove-if-not
      (lambda (c) (string-prefix-p arg c))
      sample-completions))
    (annotation (sample-annotation arg))))

In the last line we tell the backend to call sample-annotation with the candidate string to produce an annotation.

Now when we do completion we see the candidates' initials in the popup menu:

[screenshot: the completion popup showing each candidate's initials]

Displaying metadata in the echo area

Where the annotation command adds a small annotation to the completion popup menu, the meta backend command produces text to display in the echo area. [3] The process for producing the metadata string is almost exactly like that of producing the annotation string. First we write a function that extracts the string from the candidate text properties. Then we wire that function into the backend through the meta command.

As you've probably guessed, the function for extracting the metadata string will simply read the :summary property from a candidate string. It looks like this:

(defun sample-meta (s)
  (get-text-property 0 :summary s))

The changes to the backend look like this:

(defun company-sample-backend (command &optional arg &rest ignored)
  (interactive (list 'interactive))

  (case command
    (interactive (company-begin-backend 'company-sample-backend))
    (prefix (and (eq major-mode 'fundamental-mode)
                 (company-grab-symbol)))
    (candidates
     (remove-if-not
      (lambda (c) (string-prefix-p arg c))
      sample-completions))
    (annotation (sample-annotation arg))
    (meta (sample-meta arg))))

As before, in the last line we associate the meta command with our sample-meta function.

Here's how the metadata looks when displayed in the echo area:

[screenshot: the candidate's summary displayed in the echo area]

Fuzzy matching

As a final improvement to our backend, let's add support for fuzzy matching. This will let us do completion on prefixes which don't exactly match a candidate, but which are close enough. [4] For our purposes we'll implement a very crude form of fuzzy matching wherein a prefix matches a candidate if the set of letters in the prefix is a subset of the set of letters in the candidate. The function for performing fuzzy matching looks like this:

(defun sample-fuzzy-match (prefix candidate)
  (cl-subsetp (string-to-list prefix)
              (string-to-list candidate)))

Now we just need to modify our backend a bit. First we need to modify our response to the candidates command to use our new fuzzy matcher. Then we need to respond to the no-cache command by returning true. [5] Here's how that looks:

(defun company-sample-backend (command &optional arg &rest ignored)
  (interactive (list 'interactive))

  (case command
    (interactive (company-begin-backend 'company-sample-backend))
    (prefix (and (eq major-mode 'fundamental-mode)
                 (company-grab-symbol)))
    (candidates
     (remove-if-not
      (lambda (c) (sample-fuzzy-match arg c))
      sample-completions))
    (annotation (sample-annotation arg))
    (meta (sample-meta arg))
    (no-cache 't)))

As you can see, we've replaced string-prefix-p in the candidates response with sample-fuzzy-match, and we've added (no-cache 't).

Here's how our fuzzy matching looks in action:

[screenshot: fuzzy-matched candidates in the completion popup]

That's all, folks

We've seen how to use Emacs' text properties to attach metadata to candidate strings. This is a really useful technique to use when developing company-mode backends, and one that you'll see used in real-world backends. With that metadata in place, we've also seen that it's very straightforward to tell your backend to display annotations in popup menus and metadata in the echo area. Once you've got the basic techniques under your belt, you can display anything you want as part of completion.

There are still more aspects to developing company-mode backends, but with what we've covered in this series you can get very far. More importantly, you know the main concepts and infrastructure for the backends, so you can learn the rest on your own. If you want to delve into all of the gory details, you'll need to read the company-mode source code, and specifically the documentation for company-backends. [6]

For an example of a fairly full-featured backend implementation that's currently in use (and under active development), you can see the emacs-ycmd project. [7] Happy hacking!

[1] The first article in this series.
[2] The summaries are simply the first sentences of the respective Wikipedia articles.
[3] The Emacs manual entry on the echo area.
[4] Fuzzy matching is commonly used for completion tools because it addresses common cases where users transpose characters, accidentally leave characters out, or consciously leverage fuzzy matching for increased speed.
[5] The details of why this is the case are murky, but the company-mode source code specifically states this. The same source also says that we technically should be implementing a response to match, but that doesn't seem to affect this implementation.
[6] The company-mode project page.
[7] The emacs-ycmd project is on github. In particular, see company-ycmd.el.

Prove it – factorial is bigger than 2^n

Frances Buontempo from BuontempoConsulting

I've been doing the Scala Coursera course and wanted to write down the proof that

factorial(n) ≥ 2^n when n ≥ 4

since it uses induction and I am out of practice.

Base case

For n = 4

factorial(4) = 4*3*2*1 = 24
and 2^4 = 2*2*2*2 = 16
and 24 ≥ 16

so factorial(n) ≥ 2^n when n = 4

Induction step

For n ≥ 4 assume we have

factorial(n) ≥ 2^n

and consider

factorial(n+1) = factorial(n) × (n+1)
 ≥ 2^n × (n+1)
 ≥ 2^n × 2 since (n+1) ≥ 2 when n ≥ 4
 = 2^(n+1)

QED
(I also wanted to learn about writing maths in HTML.)
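
As a quick numerical sanity check of the result, here's a small program (in C++ here rather than the course's Scala, purely for illustration) which tabulates both sides of the inequality:

#include <cstdint>
#include <iostream>

int main()
{
    std::uint64_t factorial = 24; // factorial(4)
    std::uint64_t power = 16;     // 2^4
    for (int n = 4; n <= 20; ++n) // 20! still fits in 64 bits
    {
        std::cout << "n=" << n << ": " << factorial << " >= " << power
                  << (factorial >= power ? "  ok" : "  FAIL") << "\n";
        if (n < 20)
        {
            factorial *= n + 1; // factorial(n+1) = factorial(n) * (n+1)
            power *= 2;         // 2^(n+1) = 2^n * 2
        }
    }
}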

Acid Air Pollution on the Cowley Road, Oxford, UK

Tim Pizey from Tim Pizey

The Long Wall which Long Wall Street is named for always has a light line of white sand at its base, and nearly always has ongoing masonry works to repair it.

These pictures are taken at another Oxford road junction.

On the eastern edge of Oxford another set of traffic lights, at the junction of Cowley Road and Between Towns Road, again causes stationary traffic.

The local limestone is not very hard, and is constantly eroded, with the damage to the mortar being worse than that to the stone.

The damage is worst at ground level up to one metre.

The damage is worst near the ground and is caused by acidic exhaust fumes.

Here all the lichen has been killed and the brick looks as though it has been cleaned with brick acid.

We know that breathing in small particles from diesel exhaust is dangerous, as the particle size is small enough to pass deep into the lungs. The pattern of the effect on the walls shows that the concentration is particularly strong under one metre in height. There are two primary schools, one on either side of this junction.

Iffley Village Oxford Rag Stone

The quality of Oxford stone can be pretty shaky, and the stone that was used for field and road boundary walls was presumably of lower quality than that used for houses as good quality stone is scarce on the clay. This wall in Iffley is a charming example, but note what is happening to the bottom metre.

Who should we be claiming financial damages from? Shell? BP? Exxon?

Modern C++ Testing

Phil Nash from level of indirection

Back in August I took my family to stay for a week at my brother's house in (Old) South Wales. I've not been to Wales for a long time and it was great to be back there - but that's not what this post is about, of course.

The thing about Wales is that it has mountains (and where there are no mountains there are plenty of hills and valleys and cliffs). Mobile cell coverage is non-existent in much of the country - particularly where my brother lives.

So the timing was particularly bad when, just as we were driving along the south coast (somewhere between Cardiff and Swansea, I think), I started getting emails and tweets from people pointing out that Catch was riding high on HackerNews! Someone had recently discovered Catch and was enjoying it enough that they wanted to share it with the community. Which is awesome!

Except that, between lack of mobile reception and spending time with my family, I didn't have opportunity to join the discussion.

When I got back home a week later I read through the comments. One of them stuck out because it called me out on describing Catch as a "Modern C++" framework (the commenter recommended another framework, Bandit, as being "more modern").

When I first released Catch, back in 2010, C++11 was still referred to as C++1x (or even C++0x!) and the final release date was still uncertain. So Catch was written to target C++03. It used a few "modern" idioms of the time - but the modernity was intended more as being a break from the past - where most C++ frameworks were just reimplementations of JUnit in C++. So I think the label was somewhat justified at the time.

Of course since then C++11 has not only been standardised but is fully, or nearly fully, implemented by many leading, mainstream, compilers. I think adoption is still not high enough, at this point, that I'd be willing to drop support for C++03 in Catch (there is even an actively maintained fork for VC6!). But it is enough that the baseline for what constitutes "modern C++" has definitely moved on. And now C++14 is here too - pushing it even further forward.

"Modern" is not what it used to be

What does it mean to be a "Modern C++ Test Framework" these days anyway? Well the most obvious thing for the user is probably the use of lambdas. Along with a few other features, lambdas allow for a lot of what previously required macros to be done in pure C++. I'm usually the first to hold this up as A Good Thing. In a moment I'll get to why I don't think it's necessarily as good a step as you might think.

But before I get to that; one other thing: For me, as a framework author, the biggest difference C++11/14 would make to something like Catch would be in the internals. Large chunks of code could be removed, reduced or at least cleaned up. The "no dependencies" policy means that Catch has complete implementations of things like shared pointers, optional types and function objects - as well as many things that must be done the long way round (such as iterating collections - I long for range for loops - or at least BOOST_FOREACH).

The competition

I've come across three frameworks that I'd say qualify as truly trying to be "modern C++ test frameworks". I'm sure there are others - and I've not really even used these ones extensively - but these are the ones I'll reference in this discussion. The three frameworks are:

  • Lest - by Martin Moene, an active contributor to Catch - and partly based on some Catch ideas - re-imagined for a C++11 world.
  • Bandit - this is the one mentioned in the Hacker News comment I kicked off with
  • Mettle - I saw this in a tweet from @MeetingCpp last week and it's what kicked off the train of thought that led me to this post

The case for test case macros

But why did I say that the use of lambdas is not such a good idea? Actually I didn't quite say that. I think lambdas are a very good idea - and in many ways they would certainly clean up at least the mechanics of defining and registering test cases and sections.

Before lambdas, C++ had only one place you could write a block of imperative code: in a function (or method). That means that, in Catch, test cases are really just functions - which must have a function signature - including a name (which we hide, because in Catch the test name is a string). Those functions must be captured somehow. This is done by passing a pointer to the function to the constructor of a small class - whose sole purpose is to forward the function pointer onto a global registry. Later, when the tests are being run, the registry is iterated and the function pointers invoked.

So a test case like this:

    TEST_CASE( "test name", "[tags]" )
    {
        /* ... */
    }

...written out in full (after macro expansion) looks something like:

    static void generatedFunctionName();
    namespace{ ::Catch::AutoReg generatedNameAutoRegistrar
        (   &generatedFunctionName, 
       	    ::Catch::SourceLineInfo( __FILE__, static_cast<std::size_t>( __LINE__ ) ), 
            ::Catch::NameAndDesc( "test name", "[tags]") ); 
    }
    static void generatedFunctionName()
    {
        /* .... */
    }

(generatedFunctionName is generated by yet another macro, which combines a root name with the current line number. Because the function is declared static, the identifier is only visible in the current translation unit (cpp file), so this should be unique enough.)

So there's a lot of boilerplate here - you wouldn't want to write this all by hand every time you start a new test case!

With lambdas, though, blocks of code are now first class entities, and you can introduce them anonymously. So you could write them like:

    Catch11::TestCase( "test name", "[tags]", []() 
    {
        /* ... */
    } );

This is clearly far better than the expanded macro. But it's still noisier than the version that uses the macro. Most of the C++11/14 test frameworks I've looked at tend to group tests together at a higher level. The individual tests are more like Catch's sections - but the pattern is still the same - you get noise from the lambda syntax in the form of the []() or [&]() to introduce the lambda and an extra ); at the end.

Is that really worth worrying about?

Personally I find it's enough extra noise that I think I'd prefer to continue to use a macro - even if it used lambdas under the hood. But it's also small enough that I can certainly see the case for going macro free here.

Assert yourself

But that's just test cases (and sections). Assertions have traditionally been written using macros too. In this case the main reasons are twofold:

  1. It allows the expression evaluation to be wrapped in an exception handler.
  2. It allows us to capture the file and line number to report on.

(1) can arguably be handled in whatever is holding the current lambda (e.g. it or describe in Bandit, suite, subsuite or expect in Mettle). If these blocks are small enough we should get sufficient locality of exception handling - but it's not as tight as the per-expression handling with the macro approach.

(2) simply cannot be done without involving the preprocessor in some way (whether it's to pass __FILE__ and __LINE__ manually, or to encapsulate that with a macro). How much does that matter? Again it's a matter of taste but you get several benefits from having that information. Whether you use it to manually locate the failing assertion or if you're running the reporter in an IDE window that automatically allows you to double-click the failure message to take you to the line - it's really useful to be able to go straight to it. Do you want to give that up in order to go macro free? Perhaps. Perhaps not.
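
To make the trade-off concrete, here's a sketch of what a macro-free assertion tends to look like (my own illustration, not the API of any of the frameworks mentioned) - the location has to be supplied by hand at every call site:

    #include <iostream>

    // Sketch only: a macro-free assertion. The cost is that the caller must
    // duplicate the expression text and pass __FILE__/__LINE__ explicitly.
    inline void require(bool result, const char* expression,
                        const char* file, int line)
    {
        if (!result)
            std::cerr << file << ":" << line
                      << ": FAILED: " << expression << std::endl;
    }

    // usage:
    //     require(answer == 42, "answer == 42", __FILE__, __LINE__);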

Interestingly, Lest still uses a macro for assertions.

Weighing up

So we've seen that a truly modern C++ test framework, using lambdas in particular, can allow you to write tests without the use of macros - but at a cost!

So the other side of the equation must be: what benefit do you get from eschewing the macros?

Personally I've always striven to minimise or eliminate the use of macros in C++. In the early days that was mostly about using const, inline and templates. Now lambdas allow us to address some of the remaining cases and I'm all for that.

But I also tend to associate a much higher "cost" to macro usage when it generates imperative code. This is code that you're likely to find yourself needing to step through in a debugger at runtime - and macros really obfuscate this process. When I use macros it tends to be in declarative code. Code that generates purely declarative statements, or effectively declarative statements (such as the test case function registration code). It tends to always generate the exact same machinery - so should not be sensitive to its inputs in ways that will require debugging.

How do Catch's macros play out in that regard? Well the test case registration macros get a pass. Sections are a grey area - they are on the path of code that needs to be stepped over - and, worse, hide a conditional (a section is really just an if statement on a global variable!). So score a few points down there. Assertions are also very much runtime executable - and are frequently on the debugging path! In fact stepping into expressions being asserted on in Catch tests can be quite a pain as you end up stepping into some of the "hidden" calls before you get to the expression you supplied (in Visual Studio, at least, this can be mitigated by excluding the Catch namespace using the StepOver registry key).

Now, interestingly, the use of macros for the assertions was never really about C++03 vs C++11. It was about capturing extra information (file/ line) and wrapping in a try-catch. So if you're willing to make that trade-off there's no reason you can't have non-macro assertions even in C++03!

Back to the future

One of my longer arcs of development on Catch (that I edge towards on each refactoring) is to decouple the assertion mechanism from the guts of the test runner. You should be able to provide your own assertions that work with Catch. Many other test frameworks work this way and it allows them to be much more flexible. In particular it will allow me to decouple the matcher framework (and maybe allow third-party matchers to work with Catch).

Of course this would also allow macro-less assertions to be used (as it happens the assertions in bandit and mettle are both matcher-like already).

So, while I think Catch is committed to supporting C++03 for some time yet, that doesn't mean there is no scope for modernising it and keeping it relevant. And, modern or not, I still believe it is the simplest C++ test framework to get up and running with, and the least noisy to work with.

Add new lines to end of files with missing line ends

Tim Pizey from Tim Pizey

A Sonar rule, the "Files should contain an empty new line at the end" convention:
Some tools such as Git work better when files end with an empty line.

To add a new line to all files without one, place the following in a file called newlines:


FILES="$@"
for f in $FILES
do
c=tail -c 1 $f
if [ "$c" != "" ];
then
echo "$f No new line"
echo "" >> $f
continue
fi
done

Then invoke:

$ chmod +x newlines
$ find . -name '*.java' | xargs ./newlines

Writing: Testing Times

Pete Goodliffe from Pete Goodliffe

My latest Becoming a Better Programmer column is published in the September issue of C Vu magazine (26.4). It's called Testing Times and surveys the world of developer testing, covering the what, why, and how of programmer-driven testing. We look at feedback loops, TDD, unit testing, integration testing, system testing and more.

C Vu is a magazine produced by the ACCU - an excellent organisation for programmers. It has a great community, great publications, and an awesome conference. Check it out.

Meanwhile, my book: Becoming a Better Programmer, is nearing print. It's gone through tech review, copy edit, and layout is almost complete. You can still access the early release at http://shop.oreilly.com/product/0636920033929.do.

Predictive Models of Development Teams and the Systems They Build

Rob Smallshire from Good With Computers

In 1968 Melvin Conway pointed out a seemingly inevitable symmetry between organisations and the software systems they construct. Organisations today are more fluid than 40 years ago, with short developer tenure, and frequent migration of individuals between projects and employers. In this article we’ll examine data on the tenure and productivity of programmers and use this to gain insight into codebases, by simulating their growth with simple stochastic models. From such models, we can make important predictions about the maintainability and long-term viability of software systems, with implications for how we approach software design, documentation and how we assemble teams.

Legacy systems

I've always been interested in legacy software systems, primarily because legacy software systems are those which have proven to be valuable over time. The reason they become legacy – and get old – is because they continue to be useful.

I have an interest in that as a software engineer, having worked on legacy systems as an employee for various software product companies, and more recently as a consultant with Sixty North, helping out with the problems that such systems inevitably raise.

I called myself a "software engineer", although I use the term somewhat loosely. To call what many developers do "engineering" is a bit of a stretch. Engineer or not, my academic training was as a scientist, which is perhaps reflected in the content of this article. Most readers will be familiar with the structure of the scientific method: We ask questions. We formulate hypotheses which propose answers to those questions. We design experiments to test our hypotheses. We collect data from the experiments. And we draw conclusions from the data. This done, having learned something about how our world works, we go round again.

I would like to be able to apply this powerful tool to what we do as "software engineers" or developers. Unfortunately for our industry, it's very difficult to do experimental science - still less randomised controlled trials – on the process of software development, for a whole host of reasons: Developers don't like to be watched. We can't eliminate extraneous factors. The toy problems we use in experiments aren't realistic. No two projects are the same. The subjects are often students who have little experience.

Kids in a science experiment.

Even on the rare occasions we do perform experiments, there are many threats to the validity of such experiments, so the results tend not to be taken very seriously. Addressing the weaknesses of the experimental design would be prohibitively expensive, if possible at all.

The role of models

Fortunately, there's another way of doing science, which doesn't rely on the version of the scientific method just mentioned. It's the same type of science we do in astronomy, or geology where we can't run experiments because we don't have enough time, we don't have enough money, or we just don't have anywhere big enough to perform the experiment. Experimentally colliding galaxies, or experimenting with the initiation of plate tectonics are simply in the realms of science fiction on the money, time and space axes.

In such cases, we have to switch to a slightly different version of the scientific method, which looks like this: We make a prediction about how the universe works, where our 'universe' could be galactic collisions, or the more prosaic world of software development. We then make a model of that situation either through physical analogy or in a computer. By executing this model we can predict the outcome based on the details of a specific scenario. Lastly, we compare the results from the model with reality and either reject the model completely if it just doesn't work, or tune the model by refining, updating or tweaking it, until we have a model that is a good match for reality.

The aim here, is to come up with a useful model. Models are by their very nature simplifications or abstractions of reality. So long as we bear this in mind, even a simple (although not simplistic) model can have predictive power.

Essentially, all models are wrong, but some are useful

—George E. P. Box [Box, G. E. P., and Draper, N. R., (1987), Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY.]

A model of software development

A key factor in modelling the development of legacy software systems is the fact that although such systems may endure for decades - we developers tend not to endure them for decades. In other words, the tenure of software developers is typically much shorter than the life span of software systems.

But how much shorter?

Whenever I speak publicly on this topic with an audience of developers, I like to perform a simple experiment with my audience. My assumption is that the turnover of developers can be modelled as if developers have a half-life within organizations. The related concept of residence time [1] is probably a better approach, but most people have a grasp of half-life, and it avoids a tedious digression into explaining something that is ultimately tangential to the main discussion. In any case, a catchy hook is important when you're going for audience participation, so half-life it is.

I start by asking everyone in the audience who has moved from working on one codebase – for example a product – to another (including the transition to working on their first codebase), at any time in the preceding 32 years to raise their hands. This starting point is intended to catch the vast majority of typical tech conference audience members, and indeed it does, although arguably in the spirit of inclusiveness I should start with 64 years. Now all of the audience have raised hands.

Next I ask the audience to keep their hands raised if this is still true for 16 years: Most of the hands are still raised. Now eight years: Some hands are going down. Now four years: A small majority of hands are still raised. Then two years: At this point, a large minority still have raised hands, but we have crossed the half-way threshold and established that the 'half-life' of developers is somewhere between two and four years. This result has been consistent on over 90% of the occasions I've performed this experiment. The only notable deviation was a large Swedish software consultancy where we established the half-life was around six months!

In fact, what little research there has been into developer tenure indicates that the half-life across the industry is about 3.2 years, which fits nicely with what I see out in the field.

One way to think about this result concretely is as follows: If you work on a team that numbers ten developers in total, you can expect half of them - five in this case - to leave at some point in the next 3.2 years. Obviously, if the team size is to remain constant, they will need to be replaced.

Note that saying that turnover of half of a population of developers will take 3.2 years is not the same as claiming that the average tenure of a developer is 3.2 years. In fact, mean tenure will be \(3.2 / \ln 2\), which is about 4.6 years. You might want to compare that figure against your own career so far.

If you're concerned that developers don't behave very much like radionuclides, then rest assured that the notion of half-life follows directly from an assumption that the decay of a particle (or the departure of a developer) follows exponential decay, which in turn follows from the notion of a constant probability density with respect to time. All we're saying is that in a given time interval there is a fixed probability that a particle will decay (or a developer will depart), so it is actually a very simple model.
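For those who like the link made explicit, this is just the standard half-life algebra (nothing here is specific to developers):

\[ P(\text{still present at time } t) = e^{-\lambda t} = 2^{-t/t_{1/2}}, \qquad \lambda = \frac{\ln 2}{t_{1/2}}, \qquad \text{mean tenure} = \frac{1}{\lambda} = \frac{t_{1/2}}{\ln 2} = \frac{3.2}{\ln 2} \approx 4.6 \text{ years} \]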

Notably, the half-life of developers is shorter than the half-life of almost anything else in our industry, including CEOs, lines of code, megacorps or classes.

Half-lives of various entities

Half-lives in years of various entities in and around the software industry. Developers are one of the most short-lived.

Productivity

If we're going to have a stab at modelling software developers as part of the software development process, we're going to need some measure of productivity. I'm going to use - and you can save your outrage for later - lines of code. To repurpose a famous phrase of Winston Churchill's: "Lines of code is the worst programmer productivity metric, except for all the others". Now I know, as well as you do, that what ultimately matters to the customers of software systems is value for money, return on investment, and all those other good things. The problem is that it's notoriously hard to tie any of those things back in a rigorous way to what individual developers do on a day-to-day basis, which should be designing and coding software systems based on an understanding of the problem at hand. On the other hand, I think I'm on fairly safe ground in assuming that software systems with zero lines of code deliver no value, and that proportionally more complex problems can be solved (and hence more value delivered) by larger software systems.

Furthermore, there's some evidence that the number of lines of code cut by a particular developer per day is fairly constant irrespective of which programming language they're working in. So five lines of F# might do as much 'work' as ten lines of Python or twenty lines of C++. This is an alternative phrasing of the notion of 'expressiveness' in programming languages, and it is why we tend to feel that expressiveness - or semantic density - is important: we can often deliver as much value with five lines of F# as with twenty lines of C++, yet it will take a quarter of the effort to put together.

Now, however you choose to measure productivity, not all developers are equally productive on the same code base, and the same developer will demonstrate different productivity on different code bases, even if they are in the same programming language. In fact, as I'm sure many of us have experienced, the principal control on our productivity is simply the size of the code base at hand. Writing a hundred lines of code for a small script is usually much less work than adding a hundred lines to a one-million-line system.

We can capture this variance by looking to what little literature there is on the topic [2], and using this albeit sparse data to build some simple developer productivity distributions.

For example, we know that on a small 10,000 line code base, the least productive developer will produce about 2000 lines of debugged and working code in a year, the most productive developer will produce about 29,000 lines of code in a year, and the typical (or average) developer will produce about 3200 lines of code in a year. Notice that the distribution is highly skewed toward the low productivity end, and the multiple between the typical and most productive developers corresponds to the fabled 10x programmer.

Given only these three numbers, and in the absence of any more information on the shape of the distribution, we'll follow a well-trodden path and use them to erect a triangular probability density function (PDF) characterised by the minimum, modal and maximum productivity values. From this PDF it's straightforward to compute the corresponding cumulative distribution function (CDF), which we can use to construct simulated "teams" of developers: the CDF transforms uniformly distributed samples on the cumulative probability axis into samples on the productivity axis (there's a code sketch of this after the next figure). In a real simulation where we wanted to generate many typical teams, we would generate uniform random numbers between zero and one and transform them into productivity values using the CDF, although for clarity in the illustration that follows, I've used evenly spaced samples from which to generate the productivity values.

Programmer productivity

Programmer productivity in lines of code per year for a team of ten developers on a 10,000 line project.

As you can see, the resulting productivity values for a team of ten developers cluster around the modal productivity value, with comparatively few developers of very high productivity.
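As a concrete sketch of that construction, the following Python inverts the triangular CDF built from the three figures quoted above. The evenly spaced quantiles mirror the illustration, while random.triangular shows what a real simulation would draw instead; the helper name and the team size are mine, not anything canonical:

    import random

    # Triangular productivity distribution for a 10,000-line code base,
    # using the min / mode / max figures quoted above (LOC per year).
    LOW, MODE, HIGH = 2_000, 3_200, 29_000

    def productivity_from_quantile(q, low=LOW, mode=MODE, high=HIGH):
        """Invert the triangular CDF: map a quantile q in [0, 1] to a
        productivity in lines of code per year."""
        split = (mode - low) / (high - low)       # CDF value at the mode
        if q < split:
            return low + ((high - low) * (mode - low) * q) ** 0.5
        return high - ((high - low) * (high - mode) * (1.0 - q)) ** 0.5

    # Evenly spaced quantiles for a team of ten, as in the illustration...
    team = [productivity_from_quantile((i + 0.5) / 10) for i in range(10)]

    # ...whereas a real simulation would draw at random, e.g. with the
    # standard library's triangular sampler:
    random_team = [random.triangular(LOW, HIGH, MODE) for _ in range(10)]

    print([round(p) for p in sorted(team)])
    print(f"combined annual output: {sum(team):,.0f} LOC")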

Perhaps more intuitively, a software development team comprising ten developers looks like this:

Programmer productivity as circles

A typical team of ten developers would look like this, if their contributions in lines of code were represented as circular areas.

In this typical team, only a couple of people are responsible for the majority of the output. Again, it might be interesting to compare this to your own situation. At the very least, it shows how the 'right' team of two developers can be competitive with a much larger team; a phenomenon you may have witnessed for yourselves.

Overall, this team produces about 90,000 lines of code in a year.

Incorporating growth of software

Of course, the story doesn't end there. Once our team has written 90,000 lines of code, they're no longer working on a 10,000 line code base, they're working on a 100,000 line code base! This causes their productivity to drop, so we now have a modified description of their productivities, and a new distribution from which to draw a productivity if somebody new joins the team. But more of that in a moment. We don't have much in the way of published data for productivity on different sizes of code base, but we can interpolate between and extrapolate from the data we do have, without any of the assumptions involved in such extrapolation looking too outlandish. As you can see, we can put three straight lines through the minimums, modes and maximums respectively to facilitate determination of a productivity distribution for any code size up to about 10 million lines of code; a code sketch of this interpolation follows the chart below. (Note that we shouldn't infer from these straight lines anything about the mathematical nature of how productivity declines with increasing code size - there be dragons! [3])

Productivity against codebase size

Productivity is a function of codebase size. Developers are dramatically less productive on larger bodies of code.
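Here is one way that interpolation might be coded. Note the assumptions: the 10,000-line figures are the ones quoted above, but the 10-million-line anchor values are hypothetical placeholders (the real data set isn't reproduced here), and interpolating linearly in log-log space is a modelling choice, not a claim about the underlying law - exactly the dragon the note above warns about:

    import math

    # Anchor points: code base size (LOC) -> productivity (LOC/year).
    # The 10,000-line figures come from the text above; the 10-million-line
    # figures are hypothetical placeholders, not the data behind the chart.
    ANCHORS = {
        "min":  [(10_000,  2_000), (10_000_000,   100)],
        "mode": [(10_000,  3_200), (10_000_000,   200)],
        "max":  [(10_000, 29_000), (10_000_000, 2_000)],
    }

    def interpolate(size, points):
        """Linear interpolation between two anchors in log-log space."""
        (x0, y0), (x1, y1) = points
        t = (math.log10(size) - math.log10(x0)) / (math.log10(x1) - math.log10(x0))
        return 10 ** (math.log10(y0) + t * (math.log10(y1) - math.log10(y0)))

    def distribution_for(size):
        """Triangular (low, mode, high) parameters for a given code base size."""
        return tuple(interpolate(size, ANCHORS[k]) for k in ("min", "mode", "max"))

    print(distribution_for(100_000))   # parameters for a 100,000-line code base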

When performing a simulation of the growth of software in the computer, we can get more accurate results by reducing the time-step on which we adjust programmer productivity downwards, from once per year as in the example above to once per day: at the end of every simulated day, we know how much code we have, so we can predict the productivity of our developers on the following day, and so on.

Incorporating turnover

We've already stated our assumption that the probability of a developer departing is constant per unit time, together with our half-life figure of 3.2 years. Given this, it's straightforward to compute the probability of a developer leaving on any given day, which is about 0.001, or once in every thousand days. As we all know, when a particular developer leaves an organisation and is replaced by a new recruit, there's no guarantee that the replacement will have the same level of productivity. In this event, our simulation draws a new developer at random from the distribution of developer productivity for the current code base size, so it's likely that a very low productivity developer will be replaced with a more productive one, and that a very high productivity developer will be replaced with a less productive one; an example of regression to mediocrity. [4]
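The conversion from half-life to a daily departure probability is one line of algebra; the sketch below shows it, and also shows that the 'about 0.001' figure depends on counting working days rather than calendar days (an assumption on my part about how the figure was derived):

    HALF_LIFE_YEARS = 3.2

    def daily_departure_probability(days_per_year):
        """Constant daily hazard consistent with 2**(-t/half_life) survival."""
        half_life_days = HALF_LIFE_YEARS * days_per_year
        return 1 - 0.5 ** (1 / half_life_days)

    print(daily_departure_probability(365))  # ~0.00059 per calendar day
    print(daily_departure_probability(250))  # ~0.00087 per working day - roughly 0.001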

Simulating a project

With components of variance in developer productivity, its relationship to code base size, and a simple model of developer turnover, we're ready to run a simulation of a project. To do so, we initialize the model with the number of developers in the development team, and set it running. The simulator starts by randomly drawing a team of developers of the required size from the productivity distribution for a zero-size code base, and computes how much code they will have produced after one day. At the end of the time step, the developer productivities are updated to the new distribution: each developer's quantile within the distribution remains fixed, but the underlying CDF is updated to yield a new productivity value. The next time step, for day two, then begins, with each developer producing a little less code than on the previous day.

On each day, there is a fixed probability that a developer will leave the team. When this occurs, they are replaced the following day by a new hire whose productivity is drawn anew from the productivity distribution. For small teams, this can shift the overall team productivity significantly, and more often than not towards the mean.
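Pulling those pieces together, here is a minimal sketch of the daily loop just described, reusing the distribution_for and productivity_from_quantile helpers from the earlier sketches. Each developer is represented by a fixed quantile pushed through a CDF that shifts as the code base grows; all the numeric details remain the placeholder assumptions flagged above:

    import random

    DAYS_PER_YEAR = 250                                # working days; an assumption
    P_LEAVE = 1 - 0.5 ** (1 / (3.2 * DAYS_PER_YEAR))   # daily departure probability

    def simulate_project(team_size, days, seed=None):
        """Simulate `days` working days of development; return total LOC
        and the total number of people who contributed."""
        rng = random.Random(seed)
        quantiles = [rng.random() for _ in range(team_size)]  # one per developer
        loc = 0.0
        contributors = team_size
        for _ in range(days):
            # Clamp to the range covered by the anchor data.
            size = min(max(loc, 10_000), 10_000_000)
            low, mode, high = distribution_for(size)
            for i, q in enumerate(quantiles):
                # Fixed quantile, shifting CDF: productivity falls as the
                # code base grows.
                loc += productivity_from_quantile(q, low, mode, high) / DAYS_PER_YEAR
                if rng.random() < P_LEAVE:
                    quantiles[i] = rng.random()   # replacement is a fresh draw
                    contributors += 1
        return loc, contributors

    loc, people = simulate_project(team_size=7, days=5 * DAYS_PER_YEAR, seed=42)
    print(f"{loc:.0f} LOC written by {people} contributors over five years")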

Let's look at an example: If we configure a simulation with a team of seven developers, and let it run for five years, we get something like this:

Streamed code contributions

Streamed code contributions of a team of seven developers over five years. A total of 19 people contribute to this codebase.

This figure has time running from left to right, and the coloured streams show the growing contributions over time of individual developers. We start on the left with no code and the original seven developers in the team, from top to bottom sporting the colours brown, orange, green, red, blue, purple and yellow. The code base grows quickly at first, but soon slows. About 180 days into the project the purple developer quits, indicated by a black, vertical bar across their stream. From this point on, their contribution remains static and is shown in a lighter shade. Vertically below this terminator we see a new stream beginning, coloured pink, which represents the contribution of the recruit who is purple's replacement. As you can see, they are about three times more productive (measured in lines of code at least) than their predecessor, although pink only sticks around for about 200 days before moving on and being replaced by the upper blue stream.

In this particular scenario, at the end of the five year period, we find our team of seven has churned through a grand total of 19 developers. In fact, the majority of the extant code was written by people no longer with the organisation; only 37% of the code was programmed by people still present at the end. This is perhaps motivation for getting documentation in place as systems are developed, while the people doing the development are still around, rather than at the end of the effort - if at all - as is all too common.

Monte Carlo simulation

Being drawn randomly, each scenario such as the one outlined above is different, although in aggregate they vary in a predictable way according to the distributions we are using. The above scenario was typical insofar as it produced, compared to all identically configured simulations, an average amount of code, although it did happen to get through a rather high number of developers. Of course, individual scenarios such as this, although interesting, can never be indicative of what will actually happen. For that, we need to turn to Monte Carlo modelling: run many thousands of simulations - all with configurations drawn randomly from identical distributions - and look at the results in aggregate, either graphically or using various statistical tools.

When we run 1000 simulations of a seven-person project over three years, the following statistics emerge: we can expect our team of seven to see four people leave and be replaced during the project. In fact, the total number of contributors will be 11 ± 2 at one standard deviation (1σ). The total body of code produced in three years will be 157,000 ± 23,000 lines @ 1σ. The proportion of the code written by contributors still present at the end will be 70% ± 14% @ 1σ.
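A sketch of that aggregation step, reusing simulate_project from above (the printed statistics will only resemble the quoted figures to the extent that my placeholder anchor values resemble the real data):

    import statistics

    # One simulation per seed, so the runs are independent and repeatable.
    runs = [simulate_project(team_size=7, days=3 * DAYS_PER_YEAR, seed=s)
            for s in range(1000)]
    locs = [loc for loc, _ in runs]
    heads = [people for _, people in runs]

    print(f"LOC after 3 years:  {statistics.mean(locs):,.0f} "
          f"+/- {statistics.stdev(locs):,.0f} @ 1 sigma")
    print(f"Total contributors: {statistics.mean(heads):.1f} "
          f"+/- {statistics.stdev(heads):.1f} @ 1 sigma")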

A more useful question to ask might be "How long is it likely to take to produce 100,000 lines of code?" By answering this question for each simulation, we can build a histogram (actually a kernel density estimate here, to give a smooth, rather than binned, result).

Time to deliver 100,000 lines of code

How long does it take a team of seven to deliver one-hundred thousand lines of code?

Although this gives a good intuitive sense of when the team will reach the 100 k threshold, a more useful chart is the cumulative distribution of finishing time, which allows us to easily see that while there is a 20% probability of finishing within 330 days, for a much more secure 80% probability we should allow for 470 days - some 42% longer, and correspondingly more costly.

Cumulative distribution of 100,000 LOC delivery times

Cumulative distribution function showing the probability of delivering one-hundred thousand lines of code before a particular day. Based on 10,000 simulations.
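The 20%/80% reading can be reproduced by turning each simulation around: run until the target is reached, record the day, then read off quantiles. A sketch, again reusing the helpers above:

    import random
    import statistics

    def days_to_target(target_loc, team_size, seed):
        """Run the daily loop until the code base reaches target_loc."""
        rng = random.Random(seed)
        quantiles = [rng.random() for _ in range(team_size)]
        loc, day = 0.0, 0
        while loc < target_loc:
            day += 1
            low, mode, high = distribution_for(min(max(loc, 10_000), 10_000_000))
            for i, q in enumerate(quantiles):
                loc += productivity_from_quantile(q, low, mode, high) / DAYS_PER_YEAR
                if rng.random() < P_LEAVE:
                    quantiles[i] = rng.random()
        return day

    samples = [days_to_target(100_000, team_size=7, seed=s) for s in range(1000)]
    cuts = statistics.quantiles(samples, n=5)   # 20th/40th/60th/80th percentiles
    print(f"20% chance of finishing by day {cuts[0]:.0f}")
    print(f"80% chance of finishing by day {cuts[3]:.0f}")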

Finally, looking at the proportion of the code base that was, at any time, written by the current team, we see an exponential decline in this fraction, leaving us with a headline figure of 20% after 20 years.

Proportion of code written by current team

The proportion of code written by the current team. In other words, for how much of your code can you easily talk to the author? Blue and green lines show plus and minus one standard deviation around the mean. Based on 10,000 simulations.

That's right, on a 20 year old code base only one fifth of the code will have been created by the current team. This resonates with my own experience, and quantitatively explains why working on large legacy systems can be a lonely, disorienting and confusing experience.

A refinement of Conway's Law?

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure

—Melvin Conway [5]

This remark, which has become known as Conway's Law, was later interpreted, a little more playfully, by Eric Raymond as "If you have four groups working on a compiler, you'll get a 4-pass compiler". My own experience is that Conway's Law rings true, and I've often said that "Conway's Law is the one thing we know about software engineering that will still be true 1000 years from now".

However, in the long-term development efforts which lead to large legacy software systems, the structure and organisation of the system isn't necessarily congruent with the organisation as it is at present. After all, we all know that reorganisations of people are all too frequent compared to major reorganisations of software! Rather, the state of a system reflects not only the organisation, but the organisational history and the flow of people through the organisation over the long term. What I mean is that the structure of the software reflects the organisational structure integrated over time.

Simulations such as those presented in this article allow us to get a sense of how the large software systems we see today are the fossilised footprints of developers past. Perhaps we can use this improved, and quantitative, understanding to improve the planning, costing and ongoing guidance of large software projects.

[1]Residence time article on Wikipedia.
[2]COCOMO II: COnstructive COst MOdel II
[3]Contrary to popular conception, straight lines on log-log plots don't necessarily indicate power-law relationships. See the excellent So You Think You Have a Power Law — Well Isn't That Special?
[4]Francis Galton (1886). "Regression towards mediocrity in hereditary stature". The Journal of the Anthropological Institute of Great Britain and Ireland (The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15) 15: 246–263.
[5]Melvin Conway (1968) How do Committees Invent?