Understanding Transducers through Python

Rob Smallshire from Good With Computers

In this series we take an in-depth look at transducers. Transducers - a portmanteau of "transform reducers" - are a new functional programming concept introduced into the Clojure programming language. Although transducers are actually pretty straightforward in retrospect, wrapping your brain around them, especially if you're not already a competent Clojureist, can be challenging.

In this series, we introduce transducers by implementing them from scratch in everybody's favourite executable pseudocode, Python. We'll start with the familiar staples of functional programming, map(), filter() and reduce(), and derive transducers from first principles. We'll work towards a set of general tools which work with eager collections, lazy "pull" sequences, and "push" event streams. Along the way we'll cover stateful transducers and transducer composition, demonstrating that transducers are both more general, and more fundamental, than the functional programming tools baked into Python and many other languages.

By the end of this series, not only should transducers make sense to you, but you'll have a recipe for implementing transducers in your own favourite programming language.

Deriving Transducers from First Principles

Rob Smallshire from Good With Computers

What is a transducer?

Transducers - a portmanteau of ‘transform reducers’ - are a new functional programming concept introduced into the Clojure programming language. Although transducers are actually pretty straightforward in retrospect, wrapping your brain around them, especially if you’re not already a competent Clojureist, can be challenging. In this series of articles we'll explain transducers by implementing them from first principles in Python.

To understand transducers, we must first understand reducing functions. In a nutshell, a reducing function is any function which takes a partial result and a new piece of information to produce a new result. Reducers are what we pass to the reduce() function in Python and can be thought of as having the following structure:

(result, input) -> result
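
For example, here's a simple reducing function of our own which combines a running total with a new number:

>>> def add(result, item):
...     return result + item
...
>>> add(10, 5)
15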

A transducer is a function which accepts a reducing function and returns a new reducing function:

((result, input) -> result)  ->  ((result, input) -> result)

A transducer function can be used to augment an existing reducing function with some additional reducing behaviour, producing a new reducing function which applies both reducing behaviours in a specific order. In this way, we can use transducers to chain arbitrary reducing functions together. This in turn is useful, because it allows us to perform complex reduction operations using only a single call to reduce(). In other words, transducers are primarily useful because they provide a means of composing reducers.

As we will see, transducers also facilitate a clear separation of concerns between the algorithmic essence of a particular reduction and the details of where the input data come from and where the output data go.

In this series, we'll introduce transducers by implementing them from scratch in everybody’s favourite executable pseudocode, Python. We’ll start with the familiar staples of functional programming, map(), filter() and reduce(), and derive transducers from first principles. We’ll work towards a set of general tools which work with eager collections, lazy ‘pull’ sequences, and ‘push’ event streams. Along the way we’ll cover stateful transducers and transducer composition, demonstrating that transducers are both more general, and more fundamental, than the functional programming tools baked into Python and many other languages.

By the end of this series, not only should transducers make sense to you, but you’ll have a recipe for implementing transducers in your own favourite programming language.

The main objective of this exercise is to understand transducers. A secondary objective is to see how some more advanced functional programming techniques can be realised in a language like Python. We're not particularly advocating that you should widely adopt transducers in Python instead of, say, generators. For the time being, we expect transducers in Python to remain something of a curiosity. That said, if you do find any interesting applications of transducers in Python, do please let us know.

The origin of transducers

Transducers were introduced by Rich Hickey, creator of the Clojure programming language, in a blog post "Transducers are Coming" [1] on August 6th, 2014.

The best way to understand transducers is to implement them yourself. Unfortunately from an educational point of view, the Clojure implementations are mixed up with some other details of Clojure which are pretty much irrelevant to transducers, and Clojure transducers are heavily implemented using anonymous functions which makes them excessively difficult to discuss.

How does this relate to Python?

Python is a pretty straightforward language - at least on the surface. Although Python supports lambdas and other forms of anonymous functions, in this presentation we have striven to use named functions to make the intent of the code more obvious, rather than writing Python in the style of Clojure.

Both Clojure and Python happen to be dynamically typed – or uni-typed – languages, and although that's not particularly important here, it does mean that the Python version can stay fairly close to the Clojure original, and we don't need to get sidetracked into a discussion of how transducers are typed.

Reviewing Python's functional programming tools

Let's briefly review three key functional programming tools included with Python, chiefly for the benefit of those people new to Python, and specifically Python 3. We'll look at map(), filter() and reduce() in turn.

map()

First, the map() function:

>>> help(map)

Help on class map in module builtins:

class map(object)
 |  map(func, *iterables) --> map object
 |
 |  Make an iterator that computes the function using arguments from
 |  each of the iterables.  Stops when the shortest iterable is exhausted.

The map() function accepts a function - or more strictly in Python, any callable object – as its first argument, and then any number of iterable series. In Python an iterable is a sort of lazy sequence which supports only forward iteration. For now, we'll pass the built-in function len as the first argument, and we'll stick to using one iterable, which will be a list of words, produced by calling split() on a string object:

>>> map(len, "Compute the length of each word".split())
<map object at 0x102beaf28>

In fact, in Python 3, which we're using here, map() is lazy and produces an iterator. To force evaluation, we'll pass this iterator to the list constructor:

>>> list(map(len, "Compute the length of each word".split()))
[7, 3, 6, 2, 4, 4]

As expected, we get the length of each word.

filter()

Secondly, we'll look at the filter() function:

>>> help(filter)

Help on class filter in module builtins:

class filter(object)
 |  filter(function or None, iterable) --> filter object
 |
 |  Return an iterator yielding those items of iterable for which function(item)
 |  is true. If function is None, return the items that are true.

The filter() function accepts a predicate function – a function which returns True or False – and an iterable series. Each item from the series is passed in turn to the predicate, and filter() returns an iterator over a series which contains only those items from the source series which match the predicate:

>>> list(filter(lambda w: 'o' in w,
... "the quick brown fox jumps over the lazy dog".split()))
['brown', 'fox', 'over', 'dog']

Neither map() nor filter() see much use in contemporary Python owing to the capabilities of list comprehensions, which provide a more natural syntax for mapping functions over, and filtering items from, iterable series:

>>> [len(w) for w in "jackdaws love my big sphinx of quartz".split()
... if 'a' in w]
[8, 6]

reduce()

The built-in reduce() function has had a troubled life [2] in the history of Python, and in the transition from Python 2 to Python 3, it was moved out of the language proper into the Python standard library. In fact, Guido was ready to throw lambda out with the bathwater!

Let's take a closer look at reduce(), which will set us firmly on the road to understanding and implementing transducers:

>>> from functools import reduce
>>> help(reduce)

Help on built-in function reduce in module _functools:

reduce(...)
    reduce(function, sequence[, initial]) -> value

    Apply a function of two arguments cumulatively to the items of a sequence,
    from left to right, so as to reduce the sequence to a single value.
    For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
    ((((1+2)+3)+4)+5).  If initial is present, it is placed before the items
    of the sequence in the calculation, and serves as a default when the
    sequence is empty.

When we execute reduce(), we can see that it evaluates its arguments eagerly, immediately producing a result:

>>> reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])
15

The most common uses of reduction in Python, such as summing a series of items, have been replaced by the sum(), any() and all() functions built into the language, so reduce() doesn't see much use either.
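
For instance, the reduction above is more idiomatically written as:

>>> sum([1, 2, 3, 4, 5])
15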

We can also supply an initial value to be used in front of the supplied sequence, which is especially useful if the sequence is empty:

>>> reduce(lambda x, y: x+y, [1, 2, 3, 4, 5], 10)
25
>>> reduce(lambda x, y: x+y, [], 10)
10
>>> reduce(lambda x, y: x+y, [])
Traceback (most recent call last):
  File "", line 1, in
TypeError: reduce() of empty sequence with no initial value
>>>

Notice that with an empty series, the initial value becomes necessary to avoid an error.

It can sometimes be helpful to think of the initial value as being the identity of an operation. This value depends on the operation itself; for example, the additive identity is zero, whereas the multiplicative identity is one.
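
For example, using one as the initial value for a product, with the standard library's operator.mul as the reducer:

>>> from operator import mul
>>> reduce(mul, [1, 2, 3, 4, 5], 1)
120
>>> reduce(mul, [], 1)
1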

You might be tempted into thinking that the initial value should be of the same type as the values in the data series, but this isn't necessarily so. For example, using the following function we can accumulate a histogram into a dictionary item by item. In this case the identity value is an empty dictionary (an empty histogram), not an integer:

>>> def count(hist, item):
...     if item not in hist:
...         hist[item] = 0
...     hist[item] += 1
...     return hist
...
>>> reduce(count, [1, 1, 5, 6, 5, 2, 1], {})
{1: 3, 2: 1, 5: 2, 6: 1}

Note that in production code, this use of reduce() with count() would be much better handled using the Counter type [3] from the collections module in the Python standard library!
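
For comparison, here's a sketch of the Counter equivalent (the ordering of entries in the repr may vary between Python versions):

>>> from collections import Counter
>>> Counter([1, 1, 5, 6, 5, 2, 1])
Counter({1: 3, 5: 2, 6: 1, 2: 1})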

Our lambda expression for adding two numbers and our count() function serve as so-called "reducing functions", or simply 'reducers' for short. In fact, from now on we'll use the term 'reducer' for any function that can be passed to reduce() in this way.

Those of you familiar with functional programming techniques in other languages will recognise Python's reduce() as being equivalent to Haskell's foldl, Clojure's reduce, F#'s reduce or fold, Ruby's inject, C++'s std::accumulate(), or the Aggregate() extension method from .NET's LINQ.

Reduce is more fundamental than map or filter

Reduce is interesting because it is more fundamental than map() or filter(). This is because we can implement map() and filter() in terms of reduce(), but not vice-versa. To do so, we just need to cook up the right reducing functions and supply a suitable initial value. Here is map() implemented in terms of reduce():

"""A module t.py in which we'll do live coding."""

from functools import reduce

def my_map(transform, iterable):

    def map_reducer(sequence, item):
        sequence.append(transform(item))
        return sequence

    return reduce(map_reducer, iterable, [])

Then in the REPL:

>>> my_map(str.upper, "Reduce can perform mapping too!".split())
['REDUCE', 'CAN', 'PERFORM', 'MAPPING', 'TOO!']

Similarly we can make an implementation of filter() in terms of reduce():

def my_filter(predicate, iterable):

    def filter_reducer(sequence, item):
        if predicate(item):
            sequence.append(item)
        return sequence

    return reduce(filter_reducer, iterable, [])

Then in the REPL:

>>> my_filter(lambda x: x % 2 != 0, [1, 2, 3, 4, 5, 6, 7, 8])
[1, 3, 5, 7]

Reducers

In the two previous examples, the local functions map_reducer() and filter_reducer() were used as the reducing functions with reduce(). Our my_map() and my_filter() functions reduce one collection to another, transforming values and collecting the results together. For the time being, we'll stick with the notion that a reducer is any operation you can pass to the reduce() function, although we'll extend and refine this notion later.

Composability is perhaps the most important quality of any programming abstraction. How composable are my_map() and my_filter()? Well, of course, given a predicate function is_prime() and transforming function square() defined like this:

from math import sqrt

def is_prime(x):
    if x < 2:
        return False
    for i in range(2, int(sqrt(x)) + 1):
        if x % i == 0:
            return False
    return True


def square(x):
    return x*x

We can compose this filtering expression,

>>> my_filter(is_prime, range(100))
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61,
67, 71, 73, 79, 83, 89, 97]

with a mapping expression using regular function composition:

>>> my_map(square, my_filter(is_prime, range(100)))
[4, 9, 25, 49, 121, 169, 289, 361, 529, 841, 961, 1369, 1681, 1849,
2209, 2809, 3481, 3721, 4489, 5041, 5329, 6241, 6889, 7921, 9409]

One issue here is that since my_map() and my_filter() are implemented in terms of reduce() they are eager, and so a full intermediate list of elements is created. Notably, if we'd used the built-in Python 3 versions of map() and filter(), evaluation would have been lazy, and consequently more memory efficient, so we've lost an important quality here.

Furthermore, these sorts of functional pipelines can get hard to read in a language like Python, as the arguments to the outermost (last applied) function become ever more widely separated. Other "real" functional languages such as Haskell and F# have better support for partial function application and pipelining results through a chain of functions. To some extent, this can be solved in Python by using named intermediate results, as shown below, although this trades off the readability problem against the difficulty of coming up with meaningful names for the intermediate values.
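
For example, the pipeline above could be rewritten with named intermediate results like this:

>>> primes = my_filter(is_prime, range(100))
>>> squares_of_primes = my_map(square, primes)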

Progressively abstracting reducers away from iterables

Notice that although reduce() abstracts away some of the details of the operation - specifically, how the items are transformed and combined - it is fundamentally coupled to the notion of iterable series. Furthermore, our my_map() and my_filter() functions are closely tied to the notion of mutable sequences, since they both call append().

Iterables in Python can be lazy, but our my_filter() and my_map() functions eagerly return fully realised list sequences.

Let's return to our function definitions to see why:

def my_map(transform, iterable):
    def map_reducer(sequence, item):
        sequence.append(transform(item))
        return sequence
    return reduce(map_reducer, iterable, [])

def my_filter(predicate, iterable):
    def filter_reducer(sequence, item):
        if predicate(item):
            sequence.append(item)
        return sequence
    return reduce(filter_reducer, iterable, [])

The my_map() and my_filter() functions have quite a lot in common:

  • Both my_map() and my_filter() call reduce()
  • Both my_map() and my_filter() supply a mutable sequence – specifically an empty list – as the initial value to reduce()
  • Both map_reducer() and filter_reducer() call append(), thereby expecting a mutable sequence

Let's progressively refactor these functions, extracting the common elements as we go. We'll start by moving the use of reduce() out of the functions completely. Instead we'll simply return the inner reducing function object, which can then be used externally. Our outer my_map() and my_filter() functions now become what are essentially reducer factories; we'll reflect this in the code by renaming them to make_mapper() and make_filterer():

def make_mapper(transform):

    def map_reducer(sequence, item):
        sequence.append(transform(item))
        return sequence

    return map_reducer

def make_filterer(predicate):

    def filter_reducer(sequence, item):
        if predicate(item):
            sequence.append(item)
        return sequence

    return filter_reducer

These can be used individually with reduce() like this:

>>> from t import *
>>> reduce(make_mapper(square), range(10), [])
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> reduce(make_filterer(is_prime), range(100), [])
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61,
67, 71, 73, 79, 83, 89, 97]

Unfortunately, it's not possible to compose the mapper and filterer into a single function which can be used with a single application of reduce(). If we want to perform both mapping and filtering we must still call reduce() twice.
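
For example, to square the primes below twenty we have to nest one reduction inside another, materialising an intermediate list of primes along the way:

>>> reduce(make_mapper(square),
...        reduce(make_filterer(is_prime), range(20), []),
...        [])
[4, 9, 25, 49, 121, 169, 289, 361]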

We've pointed out that both of our inner reducing functions depend on their first argument supporting an append() method, so we must pass a suitable mutable sequence type - such as a list - as the seed item for reduce(). In this case, append() is just a tool for combining an existing partial result (the list) with a new piece of data (the new item).

Let's abstract away that dependency on sequences by arranging to pass in this "combining function", which, in the case of appending to a sequence, will look like this:

def appender(result, item):
    result.append(item)
    return result

Look carefully at our combining function and recall that 'reducing' is just a synonym for 'combining'. This innocuous function, which does nothing more than wrap a method call and return the modified sequence, is more interesting than it might first appear, because it too is a reducing function: it takes an intermediate result (the sequence) and the next piece of input (the item) and returns the next result for any subsequent iterations. Our combining function appender() is itself a reducer! We can prove this to ourselves by using this roundabout way of constructing a list from a range:

>>> from t import *
>>> reduce(appender, range(5), [])
[0, 1, 2, 3, 4]

Having identified appender() as a reducer, we'll add another layer of abstraction, by using another layer of nested functions. The new layer affords us the opportunity to pass in a reducing function, such as appender(). We use a separate layer of inner functions, rather than adding an argument to the outer function, because the point in our client code at which we want to specify the transform or predicate is different from the point at which we want to specify the reducer.

We've made the following changes to make_mapper():

def mapping(transform):

    def make_mapping_reducer(reducer):

        def map_reducer(result, item):
            return reducer(result, transform(item))

        return map_reducer

    return make_mapping_reducer

  1. Renamed make_mapper() to mapping().
  2. Added a new inner function around map_reducer() called make_mapping_reducer(). This function accepts the 'combining' reducer and returns the map_reducer function.
  3. Adjusted the outer mapping() function to return the make_mapping_reducer function.
  4. Reworked the implementation of map_reducer() to be in terms of the passed-in combining reducer instead of append(), thereby breaking the dependency on sequence behaviour.
  5. Renamed the sequence argument of map_reducer() to result, since it contains the partial result and is no longer necessarily a sequence.

And we've made the following changes to make_filterer():

def filtering(predicate):

    def make_filtering_reducer(reducer):

        def filter_reducer(result, item):
            return reducer(result, item) if predicate(item) else result

        return filter_reducer

    return make_filtering_reducer

  1. Renamed make_filterer() to filtering().
  2. Added a new inner function around filter_reducer() called make_filtering_reducer(). This function accepts the combining reducer and returns the filter_reducer function.
  3. Adjusted the outer filtering() function to return the make_filtering_reducer function.
  4. Renamed the sequence argument of filter_reducer() to result.
  5. Reworked the implementation of filter_reducer() to be in terms of the passed-in combining reducer instead of append(), thereby breaking the dependency on sequence behaviour.
  6. Used a conditional expression rather than a conditional statement.

Finally: Transducers!

Recall that the word 'transducer' is a convenience (and contrivance) for functions which "transform reducers". Look closely at our inner functions make_mapping_reducer() and make_filtering_reducer(). Both of these functions accept a reducer and return a new reducer based on the one passed in. These inner functions are therefore our transducers; let's rename them as such to make this completely clear:

def mapping(transform):

    def mapping_transducer(reducer):

        def map_reducer(result, item):
            return reducer(result, transform(item))

        return map_reducer

    return mapping_transducer

def filtering(predicate):

    def filtering_transducer(reducer):

        def filter_reducer(result, item):
            return reducer(result, item) if predicate(item) else result

        return filter_reducer

    return filtering_transducer

With the transducers identified, the outermost functions mapping() and filtering() are exposed for what they are: transducer factories.

The wonderful thing about such transducers is that they can be composed in such a way that the operations proceed element by element, without the need for intermediate sequences. This is achievable because the combining reducer passed to a transducer needn't be our appender() - it could be any reducing function - with appender() only being used at the "bottom" of the composite reducer to place final results into an actual data structure, such as a list. Transducer composition allows us to chain multiple reducing operations together into a single composite reducer.

Let's give our transducers a whirl, using named function arguments for clarity:

>>> from t import *
>>> reduce(filtering(predicate=is_prime)(reducer=appender), range(20), [])
[2, 3, 5, 7, 11, 13, 17, 19]

Note that we pass the last reduction operation we want to perform on each item (appending) as a reducer to the first reduction operation we want to perform (primality testing). This is the opposite of normal function composition.

Instead of appender() we can pass a different reducer - in this case one produced by the transducer made by the mapping() transducer factory:

>>> reduce(filtering(
... predicate=is_prime)(
... reducer=mapping(
... transform=square)(
... reducer=appender)),
... range(20),
... [])
[4, 9, 25, 49, 121, 169, 289, 361]

What's important to recognise here, is that we have composed the filtering and mapping operations to produce the effects of both using a single pass over the input sequence with a single application of reduce().
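
We can convince ourselves of this by tracing the calls - here's a quick sketch using a noisy variant of square() (our own addition) which reports each item it processes:

>>> def noisy_square(x):
...     print("squaring", x)
...     return x * x
...
>>> reduce(filtering(is_prime)(mapping(noisy_square)(appender)), range(6), [])
squaring 2
squaring 3
squaring 5
[4, 9, 25]

Note that only the items which pass the filter ever reach the mapping stage.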

In the next article in this series, we'll look at improving on this effective, but somewhat clunky, way of composing transducers.

[1] Transducers are Coming: the blog post introducing transducers to the world.
[2] The fate of reduce() in Python 3000: the Artima Developer posting in which Guido van Rossum explains his "reasons for dropping lambda, map() and filter()" from what was to become Python 3. He goes on: "I expect tons of disagreement in the feedback, all from ex-Lisp-or-Scheme folks". Ultimately, though, the proposal was watered down.
[3] Python documentation for Counter.

Writing: Becoming a Better Programmer

Pete Goodliffe from Pete Goodliffe


It's finally here!

My new book, Becoming a Better Programmer, is fully edited, laid out, and is now available as a final product for your reading pleasure, published by O'Reilly. You can purchase it in printed form or as a digital version for your e-reader of choice.

Find out more about the book from the O'Reilly product page. You can view the full table of contents there. Or head over to Amazon to purchase. If you are a Safari subscriber, you can read it here. Grab your iBook here.

The cover image is a flying fish. I'll leave it to your imagination to work out the significance.

It's great to finally see this labour of love come to fruition, and I do hope that it stands as a useful resource for programmers today.

One of my favourite parts of the book is a family in-joke in the "advance praise" at the front. Nestled amongst the luminaries and expert programmers who graciously contributed their honest thoughts on the book is another very honest opinion.


How Visual Lint parses projects and makefiles

Products, the Universe and Everything from Products, the Universe and Everything

Code analysis tools can require a lot of configuration to be useful. Whilst some (e.g. Vera++ or cpplint.py) need very little configuration to be used effectively, others such as PC-lint (and even, to a lesser extent, CppCheck) may need to be fed pretty much the same configuration as the compiler itself to be useful. As a result, the command lines you need to use with some analysis tools are pretty complex in practice (which is of course a disincentive to using them, and is where Visual Lint comes in. But I digress...).

In fact, more capable code analysis tools may need to be given even more information than the compiler itself to generate meaningful results - for example, they may need to be configured with the built-in preprocessor settings of the compiler itself. In a PC-lint analysis configuration, this is part of the job that compiler indirect files such as co-msc110.lnt do.

As a result of the above, a product such as Visual Lint which effectively acts as a "front end" to complex (and almost infinitely configurable) code analysis tools like PC-lint needs to care about the details of how your compiler actually sees your code every bit as much as the analysis tool itself does.

What this means in practice is that Visual Lint needs to be able to determine the properties of each project it sees and understand how they are affected by the properties of project platforms, configurations and whatever compiler(s) a project uses. When it generates an analysis command line, it may need to reflect those properties so that the analysis tool sees the details of the preprocessor symbols, include folders etc. each file in the project would normally be compiled with - including not only the properties exposed in the corresponding project or makefile, but also built-in symbols etc. normally provided by the corresponding compiler.

That's a lot of data to get right, and inevitably sometimes there will be edge cases where we don't quite get it right the first time. It goes without saying that if you find one of those edge cases - please tell us!

So, background waffle over - in this post I'm going to talk about one of the things Visual Lint does - parsing project files to identify (among other things) the preprocessor symbols and include folders for each file in order to be able to reflect this information in the analysis tool configuration.

When analysing with CppCheck, preprocessor and include folder data read in this way can be included on the generated command line as -D and -I directives. Although we could do the same with PC-lint, it is generally better to write the preprocessor and include folder configuration to an indirect ("project.lnt") file which also includes details of which implementation (.c, .cpp, .cxx etc.) files are included in the project - as well as any other project-specific options. For example:

// Generated by Visual Lint 4.5.6.233 from file: SourceVersioner_vs90.vcproj
// -dConfiguration=Release|Win32

//
-si4 -sp4 // Platform = "Win32"
//
+ffb // ForceConformanceInForLoopScope = "TRUE"
-D_UNICODE;UNICODE // CharacterSet = "1"
-DWIN32;NDEBUG;_CONSOLE // PreprocessorDefinitions = "WIN32;NDEBUG;_CONSOLE"
-D_CPPRTTI // RuntimeTypeInfo = "TRUE"
-D_MT // RuntimeLibrary = "0"
//
// AdditionalIncludeDirectories = ..\Include"
-save -e686 //
-i"..\Include" //
-restore //
//
// SystemIncludeDirectories = "
// F:\Projects\Libraries\boost\boost_1_55_0;
// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\include;

// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\atlmfc\include;

// C:\Program Files\
// Microsoft SDKs\Windows\v6.0A\include;

// F:\Projects\Libraries\Wtl\8.1\Include"
//
-save -e686 //
+libdir(F:\Projects\Libraries\boost\boost_1_55_0)
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include")
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir(F:\Projects\Libraries\Wtl\8.1\Include)
-iF:\Projects\Libraries\boost\boost_1_55_0
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include"
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include"
-i"C:\Program Files\Microsoft SDKs\Windows\v6.0A\include"
-iF:\Projects\Libraries\Wtl\8.1\Include
-restore //
//
SourceVersioner.cpp // RelativePath = "SourceVersioner.cpp"
SourceVersionerImpl.cpp // RelativePath = "SourceVersionerImpl.cpp"
stdafx.cpp // RelativePath = "stdafx.cpp"
Shared\FileUtils.cpp // RelativePath = "Shared\FileUtils.cpp"
Shared\FileVersioner.cpp // RelativePath = "Shared\FileVersioner.cpp"
Shared\PathUtils.cpp // RelativePath = "Shared\PathUtils.cpp"
Shared\ProjectConfiguration.cpp
// RelativePath = "Shared\ProjectConfiguration.cpp"
Shared\ProjectFileReader.cpp // RelativePath = "Shared\ProjectFileReader.cpp"
Shared\SolutionFileReader.cpp // RelativePath = "Shared\SolutionFileReader.cpp"
Shared\SplitPath.cpp // RelativePath = "Shared\SplitPath.cpp"
Shared\StringUtils.cpp // RelativePath = "Shared\StringUtils.cpp"
Shared\XmlUtils.cpp // RelativePath = "Shared\XmlUtils.cpp"

A file like this is written for every project configuration we analyse, but the mechanism used to actually read the configuration data from projects varies slightly depending on the project type and structure.

In the case of the Visual Studio 2002, 2003, 2005 and 2008 .vcproj files Visual Lint was originally designed to work with, this is straightforward as the project file contains virtually all of the information needed in a simple to read, declarative form. Hence to parse a .vcproj file we simply load its contents into an XML DOM object and query the properties in a straightforward manner. Built-in preprocessor symbols such as _CPPUNWIND are defined by reading the corresponding properties (EnableExceptions in the case above) or inferred from the active platform etc.
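
To illustrate the idea (Visual Lint itself is native C++, so this Python fragment is just a sketch, with the element and attribute names taken from the VS2005/2008 .vcproj schema), reading the compiler settings for a configuration is little more than walking the DOM:

import xml.etree.ElementTree as ET

def read_compiler_settings(vcproj_path, configuration_name):
    # Find the named configuration (e.g. "Release|Win32") and return the
    # preprocessor symbols and include folders its compiler tool would use.
    tree = ET.parse(vcproj_path)
    for configuration in tree.iter("Configuration"):
        if configuration.get("Name") != configuration_name:
            continue
        for tool in configuration.iter("Tool"):
            if tool.get("Name") == "VCCLCompilerTool":
                defines = (tool.get("PreprocessorDefinitions") or "").split(";")
                includes = (tool.get("AdditionalIncludeDirectories") or "").split(";")
                return [d for d in defines if d], [i for i in includes if i]
    return [], []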

Similarly, for Visual C++ 6.0 and eMbedded Visual C++ 4.0 project files we just parse the compiler command lines in the .dsp or .vcp file directly. This is pretty straightforward as well: although .dsp and .vcp files are really makefiles, they have a very predictable structure. Some other development environments (e.g. Green Hills MULTI, CodeVisionAVR) have bespoke project file formats which are generally easy to parse using conventional techniques.

Visual Studio 2010, 2012 and 2013 .vcxproj project files are far more tricky, as the MSBuild XML format they use is effectively a scripting language rather than a declarative format. To load them, we effectively have to execute them (but obviously without running the commands they contain).

As you can imagine this is rather tricky! We basically had to write a complete MSBuild parsing engine* for this task, which effectively executes the MSBuild script in memory with defined parameters to identify its detailed properties.

* Although there are MSBuild APIs which could conceivably help, they are implemented in managed code - which we can't use in a plug-in environment due to .NET framework versioning restrictions. Instead, our MSBuild parsing engine is written natively in C++.

To work out the parameters to apply to the MSBuild script, we prescan the contents of the .vcxproj file to identify the configurations and platforms available. Once we have those, we can run the script in memory and inspect the properties it generates for the required build configuration. This process is also used with CodeGear C++ and Atmel Studio projects, both of which are MSBuild based.
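
As a sketch of that prescan step (again in Python for brevity; the namespace URI is the standard MSBuild one):

import xml.etree.ElementTree as ET

MSBUILD_NS = {"msb": "http://schemas.microsoft.com/developer/msbuild/2003"}

def list_project_configurations(vcxproj_path):
    # Each ProjectConfiguration item carries an Include attribute
    # such as "Release|Win32".
    root = ET.parse(vcxproj_path).getroot()
    return [pc.get("Include")
            for pc in root.findall(".//msb:ProjectConfiguration", MSBUILD_NS)]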

Eclipse projects (.project and .cproject files) are also XML based and are not too difficult to parse, but whereas it is straightforward to work out how things are defined in a .vcproj file, Eclipse project files are much less clearly defined and more variable (not surprising as they effectively represent serialised Java object data).

To make matters worse, each compiler toolchain can have its own sub-format, so things can change in very subtle ways between projects for different Eclipse-based environments such as Eclipse/CDT, QNX Momentics and CodeWarrior. In addition, Eclipse C/C++ projects come in two flavours - managed builder and makefile.

Of the two, managed builder projects are easy to deal with - to determine the built-in settings of the compiler in a managed builder project we read the scanner (*.sc) file produced by the IDE when it builds the project, and add that data to the configuration data we have been able to read from the project file.

The scanner (.sc) file is a simple XML file located within the Eclipse workspace .metadata folder. Here's an extract to give you an idea what it looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?scdStore version="2"?>
<scannerInfo id="org.eclipse.cdt.make.core.discoveredScannerInfo">
<instance id="cdt.managedbuild.config.gnu.mingw.exe.debug.116390618;
cdt.managedbuild.config.gnu.mingw.exe.debug.116390618.;
cdt.managedbuild.tool.gnu.cpp.compiler.mingw.exe.debug.1888704458;
cdt.managedbuild.tool.gnu.cpp.compiler.input.391761176">
<collector id="org.eclipse.cdt.make.core.PerProjectSICollector">
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/mingw32"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/backward"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include-fixed"/>
<definedSymbol symbol="__STDC__=1"/>
<definedSymbol symbol="__cplusplus=1"/>
<definedSymbol symbol="__STDC_HOSTED__=1"/>
<definedSymbol symbol="__GNUC__=4"/>
<definedSymbol symbol="__GNUC_MINOR__=5"/>
<definedSymbol symbol="__GNUC_PATCHLEVEL__=2"/>
<definedSymbol symbol="__GNUG__=4"/>
<definedSymbol symbol="__SIZE_TYPE__=unsigned int"/>
.
.
.
</collector>
</instance>
</scannerInfo>
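
Reading a file like this is straightforward - for example (a Python sketch rather than our actual C++ implementation):

import xml.etree.ElementTree as ET

def read_scanner_info(sc_path):
    # Collect the include paths and predefined symbols the IDE discovered.
    root = ET.parse(sc_path).getroot()
    includes = [e.get("path") for e in root.iter("includePath")]
    symbols = [e.get("symbol") for e in root.iter("definedSymbol")]
    return includes, symbols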

Unfortunately scanner files are not generated for makefile projects, and the enclosing project files do not give details of the preprocessor symbols and include folders needed to analyse them. So how do we do it?

The answer is similar to the way we load MSBuild projects - by running the makefile without executing the commands within. In this case, however, the make tools themselves can (fortunately!) lend a hand, as both NMake and GNU Make have an option which runs the makefile and echoes the commands which would be executed, without actually running them. For NMake this is /n; for GNU Make it is -n or --just-print.

The /B (NMake) or -B (GNU Make) switch can also be used to ensure that all targets are processed regardless of the build status of the project (otherwise we wouldn't be able to read anything if the project build was up to date).

If we run a makefile with these switches and capture the output, we can then parse the compiler command lines themselves and extract preprocessor symbols etc. from them. As so many compilers have similar command line switches for things like preprocessor symbols and include folders, this is generally a pretty straightforward matter. Many variants of GCC also have command line options which can report details of built-in preprocessor symbols and system include folders, which we can obviously make use of - so the fact that many embedded C/C++ compilers today are based on GCC is very useful indeed.
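
Putting that together, a rough sketch of the dry-run approach for GNU Make (in Python for illustration; the real implementation is rather more involved) might look like this:

import re
import subprocess

def scan_makefile_project(directory):
    # Ask make to print the commands it would run (--just-print) for all
    # targets (-B), then pull -D and -I options out of the captured output.
    result = subprocess.run(["make", "--just-print", "-B"],
                            cwd=directory, capture_output=True,
                            text=True, check=True)
    defines = re.findall(r"-D\s*(\S+)", result.stdout)
    includes = re.findall(r"-I\s*(\S+)", result.stdout)
    return defines, includes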

There's a lot of detail I've not covered here, but I hope you get the general idea. It follows from the above that adding support for new project file types to Visual Lint is generally not difficult - provided we have access to representative project files and (preferably) a test installation of the IDE in question. If there is a project file type you would like us to look at, please tell us.

How Visual Lint parses projects and makefiles

Products, the Universe and Everything from Products, the Universe and Everything

Code analysis tools can require a lot of configuration to be useful. Whilst some (e.g. Vera++ or cpplint.py) need very little configuration to make use if effectively, others such as PC-lint (and even, to a lesser extent, CppCheck) may need to be fed pretty much the same configuration as the compiler itself to be useful. As a result the command lines you need to use with some analysis tools are pretty complex in practice (which is of course a disincentive to using them, and so where Visual Lint comes in. But I digress...).

In fact, more capable code analysis tools may need to be given even more information than the compiler iteself to generate meaningful results - for example they may need to be configured with the built-in preprocessor settings of the compiler itself. In a PC-lint analysis configuration, this is part of the job compiler indirect files such as co-msc110.lnt do.

As a result of the above, a product such as Visual Lint which effectively acts as a "front end" to complex (and almost infinitely configurable) code analysis tools like PC-lint needs to care about the details of how your compiler actually sees your code every bit as much as the analysis tool itself does.

What this means in practice is that Visual Lint needs be able to determine the properties of each project it sees and understand how they are affected by the properties of project platforms, configurations and whatever compiler(s) a project uses. When it generates an analysis command line, it may need to reflect those properties so that the analysis tool sees the details of the preprocessor symbols, include folders etc. each file in the project would normally be compiled with - including not only the properties exposed in the corresponding project or makefile, but also built-in symbols etc. normally provided by the corresponding compiler.

That's a lot of data to get right, and inevitably sometimes there will be edge cases where don't quite get it right the first time. It goes without saying that if you find one of those edge cases - please tell us!

So, background waffle over - in this post I'm going to talk about one of the things Visual Lint does - parsing project files to identify (among other things) the preprocessor symbols and include folders for each file in order to be able to reflect this information in the analysis tool configuration.

When analysing with CppCheck, preprocessor and include folder data read in this way can be included on the generated command line as -D and -I directives. Although we could do the same with PC-lint, it is generally better to write the preprocessor and include folder configuration to an indirect ("project.lnt") file which also includes details of which implementation (.c, .cpp, .cxx etc.) files are included in the project -as well as any other project specific options. For example:

// Generated by Visual Lint 4.5.6.233 from file: SourceVersioner_vs90.vcproj
// -dConfiguration=Release|Win32

//
-si4 -sp4 // Platform = "Win32"
//
+ffb // ForceConformanceInForLoopScope = "TRUE"
-D_UNICODE;UNICODE // CharacterSet = "1"
-DWIN32;NDEBUG;_CONSOLE // PreprocessorDefinitions = "WIN32;NDEBUG;_CONSOLE"
-D_CPPRTTI // RuntimeTypeInfo = "TRUE"
-D_MT // RuntimeLibrary = "0"
//
// AdditionalIncludeDirectories = ..\Include"
-save -e686 //
-i"..\Include" //
-restore //
//
// SystemIncludeDirectories = "
// F:\Projects\Libraries\boost\boost_1_55_0;
// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\include;

// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\atlmfc\include;

// C:\Program Files\
// Microsoft SDKs\Windows\v6.0A\include;

// F:\Projects\Libraries\Wtl\8.1\Include"
//
-save -e686 //
+libdir(F:\Projects\Libraries\boost\boost_1_55_0)
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include")
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir(F:\Projects\Libraries\Wtl\8.1\Include)
-iF:\Projects\Libraries\boost\boost_1_55_0
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include"
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include"
-i"C:\Program Files\Microsoft SDKs\Windows\v6.0A\include"
-iF:\Projects\Libraries\Wtl\8.1\Include
-restore //
//
SourceVersioner.cpp // RelativePath = "SourceVersioner.cpp"
SourceVersionerImpl.cpp // RelativePath = "SourceVersionerImpl.cpp"
stdafx.cpp // RelativePath = "stdafx.cpp"
Shared\FileUtils.cpp // RelativePath = "Shared\FileUtils.cpp"
Shared\FileVersioner.cpp // RelativePath = "Shared\FileVersioner.cpp"
Shared\PathUtils.cpp // RelativePath = "Shared\PathUtils.cpp"
Shared\ProjectConfiguration.cpp
// RelativePath = "Shared\ProjectConfiguration.cpp"
Shared\ProjectFileReader.cpp // RelativePath = "Shared\ProjectFileReader.cpp"
Shared\SolutionFileReader.cpp // RelativePath = "Shared\SolutionFileReader.cpp"
Shared\SplitPath.cpp // RelativePath = "Shared\SplitPath.cpp"
Shared\StringUtils.cpp // RelativePath = "Shared\StringUtils.cpp"
Shared\XmlUtils.cpp // RelativePath = "Shared\XmlUtils.cpp"

A file like this is written for every project configuration we analyse, but the mechanism used to actually read the configuration data from projects varies slightly depending on the project type and structure.

In the case of the Visual Studio 2002, 2003, 2005 and 2008 .vcproj files Visual Lint was originally designed to work with, this is straightforward as the project file contains virtually all of the information needed in a simple to read, declarative form. Hence to parse a .vcproj file we simply load its contents into an XML DOM object and query the properties in a straightforward manner. Built-in preprocessor symbols such as _CPPUNWIND are defined by reading the corresponding properties (EnableExceptions in the case above) or inferred from the active platform etc.

Similarly, for Visual C++ 6.0 and eMbedded Visual C++ 4.0 project files we just parse the compiler command lines in the .dsp or .vcp file directly. This is pretty straightforward as well as although .dsp and .vcp files are really makefiles they have a very predictable structure. Some other development environments (e.g. Green Hills MULTI, CodeVisionAVR) have bespoke project file formats which are generally easy to parse using conventional techniques.

Visual Studio 2010, 2012 and 2013 .vcxproj project files are far more tricky, as the MSBuild XML format they use is effectively a scripting language rather than a declarative format. To load them, we effectively have to execute them (but obviously without running the commands they contain).

As you can imagine this is rather tricky! We basically had to write a complete MSBuild parsing engine* for this task, which effectively executes the MSBuild script in memory with defined parameters to identifiy its detailed properties.

* Although there are MSBuild APIs which could conceivably help, there are implemented in managed code - which we can't use in a plug-in environment due to .NET framework versioning restrictions. Instead, our MSBuild parsing engine is written natively in C++.

To work out the parameters to apply to the MSBuild script, we prescan the contents of the .vcxproj file to identify the configurations and platforms available. Once we have those, we can run the script in memory and inspect the properties it generates for the required build configuration. This process is also used with CodeGear C++ and Atmel Studio projects, both of which are MSBuild based.

Eclipse projects (.project and .cproject files) are also XML based and are not too difficult to parse, but whereas it is straightforward to work out how things are defined in a .vcproj file, Eclipse project files are much less clearly defined and more variable (not surprising as they effectively represent serialised Java object data).

To make matters worse, each compiler toolchain can have its own sub-format, so things can change in very subtle ways between based projects for different Eclipse based environments such as Eclipse/CDT, QNX Momentics and CodeWarrior. In addition, Eclipse C/C++ projects come in two flavours - managed builder and makefile.

Of the two, managed builder projects are easy to deal with - for example to determine the built-in settings of the compiler in a managed builder we read the scanner (*.sc) file produced by the IDE when it builds the project, and add that data to the configuration data we have been able to read from the project file.

The scanner (.sc) file is a simple XML file located within the Eclipse workspace .metadata folder. Here's an extract to give you an idea what it looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?scdStore version="2"?>
<scannerInfo id="org.eclipse.cdt.make.core.discoveredScannerInfo">
<instance id="cdt.managedbuild.config.gnu.mingw.exe.debug.116390618;
cdt.managedbuild.config.gnu.mingw.exe.debug.116390618.;
cdt.managedbuild.tool.gnu.cpp.compiler.mingw.exe.debug.1888704458;
cdt.managedbuild.tool.gnu.cpp.compiler.input.391761176">
<collector id="org.eclipse.cdt.make.core.PerProjectSICollector">
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/mingw32"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/backward"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include-fixed"/>
<definedSymbol symbol="__STDC__=1"/>
<definedSymbol symbol="__cplusplus=1"/>
<definedSymbol symbol="__STDC_HOSTED__=1"/>
<definedSymbol symbol="__GNUC__=4"/>
<definedSymbol symbol="__GNUC_MINOR__=5"/>
<definedSymbol symbol="__GNUC_PATCHLEVEL__=2"/>
<definedSymbol symbol="__GNUG__=4"/>
<definedSymbol symbol="__SIZE_TYPE__=unsigned int"/>
.
.
.
</collector>
</instance>
</scannerInfo>

Unfortunately scanner files are not generated for makefile projects, and the enclosing project files do not give details of the preprocessor symbols and include folders needed to analyse them. So how do we do it?

The answer is similar to the way we load MSBuild projects - by running the makefile without executing the commands within. In this case however the make tools themselves can (fortunately!) lend a hand, as both NMake and GNU Make have an option which runs the makefile and echoes the commands which would be executed without actually running them. For NMake this is /n and GNU Make -n or --just-print.

The /B (NMake) -B (GNU Make) switch can also be used to ensure that all targets are executed regardless of the build status of the project (otherwise we wouldn't be able to read anything if the project build was up to date).

If we run a makefile with these switches and capture the output we can then parse the compiler command lines themselves and extract preprocessor symbols etc. from it. As so many compilers have similar command line switches for things like preprocessor symbols and include folders this is generally a pretty straightforward matter. Many variants of GCC also have command line options which can report details of built-in preprocessor symbols and system include folders, which we can obviously make use of - so the fact that many embedded C/C++ compilers today are based on GCC is very useful indeed.

There's a lot of detail I've not covered here, but I hope you get the general idea. It follows from the above that adding support for new project file types to Visual Lint is generally not difficult - provided we have access to representative project files and (preferably) a test installation of the IDE in question. If there is a project file type you would like us to look at, please tell us.

How Visual Lint parses projects and makefiles

Products, the Universe and Everything from Products, the Universe and Everything

Code analysis tools can require a lot of configuration to be useful. Whilst some (e.g. Vera++ or cpplint.py) need very little configuration to make use if effectively, others such as PC-lint (and even, to a lesser extent, CppCheck) may need to be fed pretty much the same configuration as the compiler itself to be useful. As a result the command lines you need to use with some analysis tools are pretty complex in practice (which is of course a disincentive to using them, and so where Visual Lint comes in. But I digress...). In fact, more capable code analysis tools may need to be given even more information than the compiler iteself to generate meaningful results - for example they may need to be configured with the built-in preprocessor settings of the compiler itself. In a PC-lint analysis configuration, this is part of the job compiler indirect files such as co-msc110.lnt do. As a result of the above, a product such as Visual Lint which effectively acts as a "front end" to complex (and almost infinitely configurable) code analysis tools like PC-lint needs to care about the details of how your compiler actually sees your code every bit as much as the analysis tool itself does. What this means in practice is that Visual Lint needs be able to determine the properties of each project it sees and understand how they are affected by the properties of project platforms, configurations and whatever compiler(s) a project uses. When it generates an analysis command line, it may need to reflect those properties so that the analysis tool sees the details of the preprocessor symbols, include folders etc. each file in the project would normally be compiled with - including not only the properties exposed in the corresponding project or makefile, but also built-in symbols etc. normally provided by the corresponding compiler. That's a lot of data to get right, and inevitably sometimes there will be edge cases where don't quite get it right the first time. It goes without saying that if you find one of those edge cases - please tell us! So, background waffle over - in this post I'm going to talk about one of the things Visual Lint does - parsing project files to identify (among other things) the preprocessor symbols and include folders for each file in order to be able to reflect this information in the analysis tool configuration. When analysing with CppCheck, preprocessor and include folder data read in this way can be included on the generated command line as -D and -I directives. Although we could do the same with PC-lint, it is generally better to write the preprocessor and include folder configuration to an indirect ("project.lnt") file which also includes details of which implementation (.c, .cpp, .cxx etc.) files are included in the project -as well as any other project specific options. For example:
// Generated by Visual Lint 4.5.6.233 from file: SourceVersioner_vs90.vcproj
// -dConfiguration=Release|Win32
//
-si4 -sp4 // Platform = "Win32"
//
+ffb // ForceConformanceInForLoopScope = "TRUE"
-D_UNICODE;UNICODE // CharacterSet = "1"
-DWIN32;NDEBUG;_CONSOLE // PreprocessorDefinitions = "WIN32;NDEBUG;_CONSOLE"
-D_CPPRTTI // RuntimeTypeInfo = "TRUE"
-D_MT // RuntimeLibrary = "0"
//
// AdditionalIncludeDirectories = ..\Include"
-save -e686 //
-i"..\Include" //
-restore //
//
// SystemIncludeDirectories = "
// F:\Projects\Libraries\boost\boost_1_55_0;
// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\include;
// C:\Program Files (x86)\
// Microsoft Visual Studio 9.0\VC\atlmfc\include;
// C:\Program Files\
// Microsoft SDKs\Windows\v6.0A\include;
// F:\Projects\Libraries\Wtl\8.1\Include"
//
-save -e686 //
+libdir(F:\Projects\Libraries\boost\boost_1_55_0)
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include")
+libdir("C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir("C:\Program Files\Microsoft SDKs\Windows\v6.0A\include")
+libdir(F:\Projects\Libraries\Wtl\8.1\Include)
-iF:\Projects\Libraries\boost\boost_1_55_0
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include"
-i"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\atlmfc\include"
-i"C:\Program Files\Microsoft SDKs\Windows\v6.0A\include"
-iF:\Projects\Libraries\Wtl\8.1\Include
-restore //
//
SourceVersioner.cpp // RelativePath = "SourceVersioner.cpp"
SourceVersionerImpl.cpp // RelativePath = "SourceVersionerImpl.cpp"
stdafx.cpp // RelativePath = "stdafx.cpp"
Shared\FileUtils.cpp // RelativePath = "Shared\FileUtils.cpp"
Shared\FileVersioner.cpp // RelativePath = "Shared\FileVersioner.cpp"
Shared\PathUtils.cpp // RelativePath = "Shared\PathUtils.cpp"
Shared\ProjectConfiguration.cpp
// RelativePath = "Shared\ProjectConfiguration.cpp"
Shared\ProjectFileReader.cpp // RelativePath = "Shared\ProjectFileReader.cpp"
Shared\SolutionFileReader.cpp // RelativePath = "Shared\SolutionFileReader.cpp"
Shared\SplitPath.cpp // RelativePath = "Shared\SplitPath.cpp"
Shared\StringUtils.cpp // RelativePath = "Shared\StringUtils.cpp"
Shared\XmlUtils.cpp // RelativePath = "Shared\XmlUtils.cpp"
A file like this is written for every project configuration we analyse, but the mechanism used to actually read the configuration data from projects varies slightly depending on the project type and structure. In the case of the Visual Studio 2002, 2003, 2005 and 2008 .vcproj files Visual Lint was originally designed to work with, this is straightforward as the project file contains virtually all of the information needed in a simple to read, declarative form. Hence to parse a .vcproj file we simply load its contents into an XML DOM object and query the properties in a straightforward manner. Built-in preprocessor symbols such as _CPPUNWIND are defined by reading the corresponding properties (EnableExceptions in the case above) or inferred from the active platform etc. Similarly, for Visual C++ 6.0 and eMbedded Visual C++ 4.0 project files we just parse the compiler command lines in the .dsp or .vcp file directly. This is pretty straightforward as well as although .dsp and .vcp files are really makefiles they have a very predictable structure. Some other development environments (e.g. Green Hills MULTI, CodeVisionAVR) have bespoke project file formats which are generally easy to parse using conventional techniques. Visual Studio 2010, 2012 and 2013 .vcxproj project files are far more tricky, as the MSBuild XML format they use is effectively a scripting language rather than a declarative format. To load them, we effectively have to execute them (but obviously without running the commands they contain). As you can imagine this is rather tricky! We basically had to write a complete MSBuild parsing engine* for this task, which effectively executes the MSBuild script in memory with defined parameters to identifiy its detailed properties. * Although there are MSBuild APIs which could conceivably help, there are implemented in managed code - which we can't use in a plug-in environment due to .NET framework versioning restrictions. Instead, our MSBuild parsing engine is written natively in C++. To work out the parameters to apply to the MSBuild script, we prescan the contents of the .vcxproj file to identify the configurations and platforms available. Once we have those, we can run the script in memory and inspect the properties it generates for the required build configuration. This process is also used with CodeGear C++ and Atmel Studio projects, both of which are MSBuild based. Eclipse projects (.project and .cproject files) are also XML based and are not too difficult to parse, but whereas it is straightforward to work out how things are defined in a .vcproj file, Eclipse project files are much less clearly defined and more variable (not surprising as they effectively represent serialised Java object data). To make matters worse, each compiler toolchain can have its own sub-format, so things can change in very subtle ways between based projects for different Eclipse based environments such as Eclipse/CDT, QNX Momentics and CodeWarrior. In addition, Eclipse C/C++ projects come in two flavours - managed builder and makefile. Of the two, managed builder projects are easy to deal with - for example to determine the built-in settings of the compiler in a managed builder we read the scanner (*.sc) file produced by the IDE when it builds the project, and add that data to the configuration data we have been able to read from the project file. The scanner (.sc) file is a simple XML file located within the Eclipse workspace .metadata folder. Here's an extract to give you an idea what it looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?scdStore version="2"?>
<scannerInfo id="org.eclipse.cdt.make.core.discoveredScannerInfo">
<instance id="cdt.managedbuild.config.gnu.mingw.exe.debug.116390618;
cdt.managedbuild.config.gnu.mingw.exe.debug.116390618.;
cdt.managedbuild.tool.gnu.cpp.compiler.mingw.exe.debug.1888704458;
cdt.managedbuild.tool.gnu.cpp.compiler.input.391761176">
<collector id="org.eclipse.cdt.make.core.PerProjectSICollector">
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/mingw32"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include/c++/backward"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include"/>
<includePath path="c:\mingw\bin\../lib/gcc/mingw32/4.5.2/include-fixed"/>
<definedSymbol symbol="__STDC__=1"/>
<definedSymbol symbol="__cplusplus=1"/>
<definedSymbol symbol="__STDC_HOSTED__=1"/>
<definedSymbol symbol="__GNUC__=4"/>
<definedSymbol symbol="__GNUC_MINOR__=5"/>
<definedSymbol symbol="__GNUC_PATCHLEVEL__=2"/>
<definedSymbol symbol="__GNUG__=4"/>
<definedSymbol symbol="__SIZE_TYPE__=unsigned int"/>
.
.
.
</collector>
</instance>
</scannerInfo>
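
The includePath and definedSymbol entries are exactly what we're after. As a quick illustration - this sketch is mine, not Visual Lint's actual code, and "project.sc" is a hypothetical path - pulling them out with an XML library such as pugixml takes only a few lines:

// Sketch: extract the include paths and predefined symbols from an
// Eclipse CDT scanner (.sc) file using pugixml. Illustrative only.
#include <iostream>
#include <pugixml.hpp>

int main()
{
    pugi::xml_document doc;
    if (!doc.load_file("project.sc"))   // hypothetical path
        return 1;

    pugi::xml_node instance = doc.child("scannerInfo").child("instance");
    for (pugi::xml_node collector : instance.children("collector"))
    {
        for (pugi::xml_node inc : collector.children("includePath"))
            std::cout << "include: " << inc.attribute("path").value() << '\n';

        for (pugi::xml_node sym : collector.children("definedSymbol"))
            std::cout << "define:  " << sym.attribute("symbol").value() << '\n';
    }
}
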
Unfortunately scanner files are not generated for makefile projects, and the enclosing project files do not give details of the preprocessor symbols and include folders needed to analyse them. So how do we do it?

The answer is similar to the way we load MSBuild projects - by running the makefile without executing the commands within it. In this case, however, the make tools themselves can (fortunately!) lend a hand, as both NMake and GNU Make have an option which runs the makefile and echoes the commands which would be executed without actually running them. For NMake this is /n; for GNU Make it is -n or --just-print. The /B (NMake) and -B (GNU Make) switches can also be used to ensure that all targets are processed regardless of the build status of the project (otherwise we wouldn't be able to read anything if the project build was up to date).

If we run a makefile with these switches and capture the output, we can then parse the compiler command lines themselves and extract preprocessor symbols, include folders and so on from them (a sketch of this step appears at the end of this post). As so many compilers have similar command line switches for things like preprocessor symbols and include folders, this is generally a pretty straightforward matter. Many variants of GCC also have command line options which can report details of built-in preprocessor symbols and system include folders, which we can obviously make use of - so the fact that many embedded C/C++ compilers today are based on GCC is very useful indeed.

There's a lot of detail I've not covered here, but I hope you get the general idea. It follows from the above that adding support for new project file types to Visual Lint is generally not difficult - provided we have access to representative project files and (preferably) a test installation of the IDE in question. If there is a project file type you would like us to look at, please tell us.
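
As promised, here is a rough sketch of the capture-and-scan step - my illustration, not Visual Lint's actual code, and it ignores real-world complications such as quoting, response files and options split across tokens:

// Sketch: run GNU Make in "just print" mode (-n) with all targets
// considered out of date (-B), then harvest -D and -I options from the
// echoed compiler command lines.
#include <cstdio>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    FILE* pipe = popen("make -n -B", "r");  // _popen() on Windows
    if (!pipe)
        return 1;

    char buffer[4096];
    while (std::fgets(buffer, sizeof buffer, pipe))
    {
        std::istringstream line(buffer);
        std::string token;
        while (line >> token)
        {
            if (token.compare(0, 2, "-D") == 0 && token.size() > 2)
                std::cout << "define:  " << token.substr(2) << '\n';
            else if (token.compare(0, 2, "-I") == 0 && token.size() > 2)
                std::cout << "include: " << token.substr(2) << '\n';
        }
    }
    pclose(pipe);
}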

A More Full-Featured Emacs company-mode Backend

Austin Bingham from Good With Computers

In the first article in this series we looked at how to define the simplest company-mode backend. [1] This backend drew completion candidates from a predefined list of options, and allowed you to do completion in buffers in fundamental mode. The main purpose of that article was to introduce the essential plumbing of a company-mode backend.

In this article we'll expand upon the work of the first, adding some useful UI elements like annotations and metadata. We'll also implement a rough form of fuzzy matching, wherein candidates will be presented to the user when they mostly match the prefix. After this article you'll know almost everything you need to know about writing company-mode backends, and you'll be in a great position to learn the rest on your own.

Most of what we'll be doing in this article revolves around handling completion candidate "metadata" - data associated in some way with our completion candidates. In practice this kind of data covers things like documentation strings, function signatures, symbol types, and so forth, but for our purposes we'll simply associate some biographical data with the names in our completion set sample-completions.

company-mode provides affordances for displaying metadata as part of the completion process. For example, if your backend is showing completions for function names, you could display the currently-selected function's signature in the echo area. We'll develop a backend that displays a sentence about the selected candidate in the echo area, and we'll also display their initials as an annotation in the candidate selection popup menu.

Adding more data to our completion candidates

First we need to add some metadata to our existing completion candidates. To do this we'll use Emacs text properties. (Text properties allow you to associate arbitrary data with strings; you can read about them in the Emacs Lisp manual. Here we attach them with propertize as the candidate list is built.) For each completion candidate we define an :initials property containing their initials and a :summary property containing a one-sentence summary of the candidate. [2] To add these properties, update sample-completions to look like this:

(defconst sample-completions
  (list
   (propertize
    "alan"
    :initials "AMT"
    :summary (concat "Alan Mathison Turing, OBE, FRS (/ˈtjʊərɪŋ/ "
                     "tewr-ing; 23 June 1912 – 7 June 1954) was a "
                     "British mathematician, logician, cryptanalyst, "
                     "philosopher, pioneering computer scientist, "
                     "mathematical biologist, and marathon and ultra "
                     "distance runner."))
   (propertize
    "john"
    :initials "JVN"
    :summary (concat "John von Neumann (/vɒn ˈnɔɪmən/; December 28, "
                     "1903 – February 8, 1957) was a Hungarian and "
                     "American pure and applied mathematician, physicist, "
                     "inventor and polymath."))
   (propertize
    "ada"
    :initials "AAK"
    :summary (concat "Augusta Ada King, Countess of Lovelace (10 December "
                     "1815 – 27 November 1852), born Augusta Ada Byron "
                     "and now commonly known as Ada Lovelace, was an "
                     "English mathematician and writer chiefly known for "
                     "her work on Charles Babbage's early mechanical "
                     "general-purpose computer, the Analytical Engine."))
   (propertize
    "don"
    :initials "DEK"
    :summary (concat "Donald Ervin Knuth (/kəˈnuːθ/ kə-nooth; born "
                     "January 10, 1938) is an American computer "
                     "scientist, mathematician, and Professor Emeritus "
                     "at Stanford University."))))

Attaching properties like this is a very convenient way to store metadata for completion candidates. Of course in a real backend you probably wouldn't have a hard-coded list of candidates, and you'd be fetching them dynamically from a server, database, or external process. In that case, you'd need to also dynamically fetch the metadata you want and attach it to the candidate strings you serve through your backend. In the end, text properties work well in this context because they transparently transport the metadata - which company-mode doesn't know about - with the completion strings that company-mode definitely knows about.

Adding completion menu annotations

This change by itself doesn't really do anything, of course. All we've done is add properties to some strings, and we need to instruct company-mode on how to actually use them for display. The first way we'll use this metadata, then, is to add a small annotation to each entry in the popup menu used for candidate selection. To add this annotation, we need to update company-sample-backend to respond to the annotation command. This command should resolve to the annotation you want to use for the given candidate. Typically this means calling a function taking the completion candidate string arg and returning the annotation string.

First let's define a function that takes a completion candidate string and returns an annotation. Remember that our candidate strings store their metadata as text properties, so fundamentally this function simply needs to extract a property. For the annotation, we'll extract the :initials property and return it (prefixed with a space). That function looks like this:

(defun sample-annotation (s)
  (format " [%s]" (get-text-property 0 :initials s)))

Next we need to update our backend to respond to the annotation command like this:

(require 'cl-lib)  ; for cl-case and cl-remove-if-not

(defun company-sample-backend (command &optional arg &rest ignored)
  (interactive (list 'interactive))
  (cl-case command
    (interactive (company-begin-backend 'company-sample-backend))
    (prefix (and (eq major-mode 'fundamental-mode)
                 (company-grab-symbol)))
    (candidates
     (cl-remove-if-not
      (lambda (c) (string-prefix-p arg c))
      sample-completions))
    (annotation (sample-annotation arg))))

In the last line we tell the backend to call sample-annotation with the candidate string to produce an annotation.

Now when we do completion we see the candidates' initials in the popup menu:

[screenshot: the completion popup showing each candidate's initials as an annotation]

Displaying metadata in the echo area

Where the annotation command adds a small annotation to the completion popup menu, the meta backend command produces text to display in the echo area. [3] The process for producing the metadata string is almost exactly like that of producing the annotation string. First we write a function that extracts the string from the candidate text properties. Then we wire that function into the backend through the meta command.

As you've probably guessed, the function for extracting the metadata string will simply read the :summary property from a candidate string. It looks like this:

(defun sample-meta (s)
  (get-text-property 0 :summary s))

The changes to the backend look like this:

(defun company-sample-backend (command &optional arg &rest ignored)
  (interactive (list 'interactive))
  (cl-case command
    (interactive (company-begin-backend 'company-sample-backend))
    (prefix (and (eq major-mode 'fundamental-mode)
                 (company-grab-symbol)))
    (candidates
     (cl-remove-if-not
      (lambda (c) (string-prefix-p arg c))
      sample-completions))
    (annotation (sample-annotation arg))
    (meta (sample-meta arg))))

As before, in the last line we associate the meta command with our sample-meta function.

Here's how the metadata looks when displayed in the echo area:

[screenshot: the selected candidate's summary displayed in the echo area]

Fuzzy matching

As a final improvement to our backend, let's add support for fuzzy matching. This will let us do completion on prefixes which don't exactly match a candidate, but which are close enough. [4] For our purposes we'll implement a very crude form of fuzzy matching wherein a prefix matches a candidate if the set of letters in the prefix is a subset of the set of letters in the candidate. The function for performing fuzzy matching looks like this:

(defun sample-fuzzy-match (prefix candidate)
  (cl-subsetp (string-to-list prefix)
              (string-to-list candidate)))

Now we just need to modify our backend a bit. First we need to modify our response to the candidates command to use our new fuzzy matcher. Then we need to respond to the no-cache command by returning true. [5] Here's how that looks:

(defun company-sample-backend (command &optional arg &rest ignored)
  (interactive (list 'interactive))
  (cl-case command
    (interactive (company-begin-backend 'company-sample-backend))
    (prefix (and (eq major-mode 'fundamental-mode)
                 (company-grab-symbol)))
    (candidates
     (cl-remove-if-not
      (lambda (c) (sample-fuzzy-match arg c))
      sample-completions))
    (annotation (sample-annotation arg))
    (meta (sample-meta arg))
    (no-cache 't)))

As you can see, we've replaced string-prefix-p in the candidates response with sample-fuzzy-match, and we've added (no-cache 't).

Here's how our fuzzy matching looks in action:

[screenshot: fuzzy matching in action in the completion popup]

That's all, folks

We've seen how to use Emacs' text properties to attach metadata to candidate strings. This is a really useful technique to use when developing company-mode backends, and one that you'll see used in real-world backends. With that metadata in place, we've also seen that it's very straightforward to tell your backend to display annotations in popup menus and metadata in the echo area. Once you've got the basic techniques under your belt, you can display anything you want as part of completion.

There are still more aspects to developing company-mode backends, but with what we've covered in this series you can get very far. More importantly, you know the main concepts and infrastructure for the backends, so you can learn the rest on your own. If you want to delve into all of the gory details, you'll need to read the company-mode source code, and specifically the documentation for company-backends. [6]

For an example of a fairly full-featured backend implementation that's currently in use (and under active development), you can see the emacs-ycmd project. [7] Happy hacking!

[1] The first article in this series.
[2] The summaries are simply the first sentences of the respective Wikipedia articles.
[3] The Emacs manual entry on the echo area.
[4] Fuzzy matching is commonly used for completion tools because it addresses common cases where users transpose characters, accidentally leave characters out, or consciously leverage fuzzy matching for increased speed.
[5] The details of why this is the case are murky, but the company-mode source code specifically states this. The same source also says that we technically should be implementing a response to match, but that doesn't seem to affect this implementation.
[6] The company-mode project page.
[7] The emacs-ycmd project is on GitHub. In particular, see company-ycmd.el.

Prove it – factorial is bigger than 2^n

Frances Buontempo from BuontempoConsulting

I've been doing the Scala Coursera course and wanted to write down the proof that

factorial(n) ≥ 2^n when n ≥ 4

since it uses induction and I am out of practice.

Base case

For n = 4

factorial(4) = 4*3*2*1 = 24
and 2^4 = 2*2*2*2 = 16
and 24 ≥ 16

so factorial(n) ≥ 2^n when n = 4

Induction step

For n ≥ 4 assume we have

factorial(n) ≥ 2^n

and consider

factorial(n+1) = factorial(n) × (n+1)
 ≥ 2^n × (n+1)
 ≥ 2^n × 2    since (n+1) ≥ 2 when n ≥ 4
 = 2^(n+1)

QED
(I also wanted to learn about writing maths in html)
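
This isn't in the original post, but a brute-force check of the inequality over a range of n is a nice sanity test of the algebra (stopping at 20! to stay within 64 bits):

// Verify factorial(n) >= 2^n for n = 4..20.
#include <cassert>
#include <cstdint>

int main()
{
    std::uint64_t factorial = 24;  // 4!
    std::uint64_t power = 16;      // 2^4
    for (int n = 4; n < 20; ++n)
    {
        assert(factorial >= power);
        factorial *= n + 1;        // now (n+1)!
        power *= 2;                // now 2^(n+1)
    }
    assert(factorial >= power);    // n = 20
}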

Acid Air Pollution on the Cowley Road Oxford Uk

Tim Pizey from Tim Pizey

The Long Wall which Long Wall Street is named for always has a light line of white sand at its base, and nearly always has ongoing masonry works to repair it.

These pictures were taken at another Oxford road junction.

On the eastern edge of Oxford another set of traffic lights, at the junction of Cowley Road and Between Towns Road, again causes stationary traffic.

The local limestone is not very hard, and is constantly eroded, with the damage to the mortar being worse than that to the stone.

The damage is worst at ground level up to one metre.

The damage is worst near the ground and is caused by acidic exhaust fumes.

Here all the lichen has been killed and the brick looks as though it has been cleaned with brick acid.

We know that breathing in small particles from diesel exhaust is dangerous, as the particle size is small enough to pass deep into the lungs. The pattern of the effect on the walls shows that the concentration is particularly strong under one metre in height. There are two primary schools, one on either side of this junction.

Iffley Village Oxford Rag Stone

The quality of Oxford stone can be pretty shaky, and the stone that was used for field and road boundary walls was presumably of lower quality than that used for houses, as good quality stone is scarce on the clay. This wall in Iffley is a charming example, but note what is happening to the bottom metre.

Who should we be claiming financial damages from? Shell? BP? Exxon?

Modern C++ Testing

Phil Nash from level of indirection

Back in August I took my family to stay for a week at my brother's house in (Old) South Wales. I've not been to Wales for a long time and it was great to be back there - but that's not what this post is about, of course.

The thing about Wales is that it has mountains (and where there are no mountains there are plenty of hills and valleys and cliffs). Mobile cell coverage is non-existent in much of the country - particularly where my brother lives.

So the timing was particularly bad when, just as we were driving along the south coast (somewhere between Cardiff and Swansea, I think), I started getting emails and tweets from people pointing out that Catch was riding high on Hacker News! Someone had recently discovered Catch and was enjoying it enough that they wanted to share it with the community. Which is awesome!

Except that, between lack of mobile reception and spending time with my family, I didn't have opportunity to join the discussion.

When I got back home a week later I read through the comments. One of them stuck out because it called me out on describing Catch as a "Modern C++" framework (the commenter recommended another framework, Bandit, as being "more modern").

When I first released Catch, back in 2010, C++11 was still referred to as C++1x (or even C++0x!) and the final release date was still uncertain. So Catch was written to target C++03. It used a few "modern" idioms of the time - but the modernity was intended more as a break from the past - where most C++ frameworks were just reimplementations of JUnit in C++. So I think the label was somewhat justified at the time.

Of course since then C++11 has not only been standardised but is fully, or nearly fully, implemented by many leading, mainstream, compilers. I think adoption is still not high enough, at this point, that I'd be willing to drop support for C++03 in Catch (there is even an actively maintained fork for VC6!). But it is enough that the baseline for what constitutes "modern C++" has definitely moved on. And now C++14 is here too - pushing it even further forward.

"Modern" is not what it used to be

What does it mean to be a "Modern C++ Test Framework" these days anyway? Well the most obvious thing for the user is probably the use of lambdas. Along with a few other features, lambdas allow for a lot of what previously required macros to be done in pure C++. I'm usually the first to hold this up as A Good Thing. In a moment I'll get to why I don't think it's necessarily as good a step as you might think.

But before I get to that; one other thing: For me, as a framework author, the biggest difference C++11/14 would make to something like Catch would be in the internals. Large chunks of code could be removed, reduced or at least cleaned up. The "no dependencies" policy means that Catch has complete implementations of things like shared pointers, optional types and function objects - as well as many things that must be done the long way round (such as iterating collections - I long for range for loops - or at least BOOST_FOREACH).

The competition

I've come across three frameworks that I'd say qualify as truly trying to be "modern C++ test frameworks". I'm sure there are others - and I've not really even used these ones extensively - but these are the ones I'll reference in this discussion. The three frameworks are:

  • Lest - by Martin Moene, an active contributor to Catch - and partly based on some Catch ideas - re-imagined for a C++11 world.
  • Bandit - this is the one mentioned in the Hacker News comment I kicked off with
  • Mettle - I saw this in a tweet from @MeetingCpp last week and it's what kicked off the train of thought that led me to this post

The case for test case macros

But why did I say that the use of lambdas is not such a good idea? Actually I didn't quite say that. I think lambdas are a very good idea - and in many ways they would certainly clean up at least the mechanics of defining and registering test cases and sections.

Before lambdas, C++ had only one place you could write a block of imperative code: in a function (or method). That means that, in Catch, test cases are really just functions - which must have a function signature - including a name (which we hide, because in Catch the test name is a string). Those functions must be captured somehow. This is done by passing a pointer to the function to the constructor of a small class - whose sole purpose is to forward the function pointer onto a global registry. Later, when the tests are being run, the registry is iterated and the function pointers invoked.

So a test case like this:

    TEST_CASE( "test name", "[tags]" )
    {
        /* ... */
    }

...written out in full (after macro expansion) looks something like:

    static void generatedFunctionName();
    namespace{ ::Catch::AutoReg generatedNameAutoRegistrar
        (   &generatedFunctionName,
            ::Catch::SourceLineInfo( __FILE__, static_cast<std::size_t>( __LINE__ ) ),
            ::Catch::NameAndDesc( "test name", "[tags]" ) );
    }
    static void generatedFunctionName()
    {
        /* ... */
    }

(generatedFunctionName is generated by yet another macro, which combines a root name with the current line number. Because the function is declared static, the identifier is only visible in the current translation unit (cpp file), so this should be unique enough.)

So there's a lot of boilerplate here - you wouldn't want to write this all by hand every time you start a new test case!

With lambdas, though, blocks of code are now first class entities, and you can introduce them anonymously. So you could write them like:

    Catch11::TestCase( "test name", "[tags]", []() 
    {
        /* ... */
    } );

This is clearly far better than the expanded macro. But it's still noisier than the version that uses the macro. Most of the C++11/14 test frameworks I've looked at tend to group tests together at a higher level. The individual tests are more like Catch's sections - but the pattern is still the same - you get noise from the lambda syntax in the form of the []() or [&]() to introduce the lambda and an extra ); at the end.
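
To make that concrete, here's a rough sketch of how such a TestCase helper might work - my illustration using the hypothetical Catch11 name from above, not actual Catch code:

    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    namespace Catch11 {

        // One registered test: its name, tags and body.
        struct RegisteredTest {
            std::string name;
            std::string tags;
            std::function<void()> body;
        };

        // The global registry, populated during static initialisation.
        std::vector<RegisteredTest>& registry() {
            static std::vector<RegisteredTest> tests;
            return tests;
        }

        // Constructing a TestCase registers the test; the object itself
        // exists only for its side effect.
        struct TestCase {
            TestCase( std::string name, std::string tags, std::function<void()> body ) {
                registry().push_back( { std::move( name ), std::move( tags ), std::move( body ) } );
            }
        };

        // The runner walks the registry and invokes each body.
        void runAll() {
            for ( auto& test : registry() )
                test.body();
        }
    }

One wrinkle: at namespace scope the construction in the earlier snippet would actually need a named object - e.g. static Catch11::TestCase t1( ... ); - which is exactly the kind of boilerplate a macro can hide for you.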

Is that really worth worrying about?

Personally I find it's enough extra noise that I think I'd prefer to continue to use a macro - even if it used lambdas under the hood. But it's also small enough that I can certainly see the case for going macro free here.

Assert yourself

But that's just test cases (and sections). Assertions have traditionally been written using macros too. In this case the main reasons are twofold:

  1. It allows the expression evaluation to be wrapped in an exception handler.
  2. It allows us to capture the file and line number to report on.

(1) can arguably be handled in whatever is holding the current lambda (e.g. it or describe in Bandit, suite, subsuite or expect in Mettle). If these blocks are small enough we should get sufficient locality of exception handling - but it's not as tight as the per-expression handling with the macro approach.

(2) simply cannot be done without involving the preprocessor in some way (whether it's to pass __FILE__ and __LINE__ manually, or to encapsulate that with a macro). How much does that matter? Again it's a matter of taste but you get several benefits from having that information. Whether you use it to manually locate the failing assertion or if you're running the reporter in an IDE window that automatically allows you to double-click the failure message to take you to the line - it's really useful to be able to go straight to it. Do you want to give that up in order to go macro free? Perhaps. Perhaps not.
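
As a minimal sketch of what the macro approach buys you - this is my illustration, not Catch's actual implementation, which does considerably more - consider:

    #include <exception>
    #include <iostream>

    // The macro - and only the macro - can capture __FILE__ and __LINE__
    // at the call site automatically, and it wraps evaluation of the
    // expression in a per-assertion exception handler.
    #define CHECK( expr ) \
        try { \
            if( !(expr) ) \
                std::cout << __FILE__ << ":" << __LINE__ \
                          << ": FAILED: " << #expr << "\n"; \
        } \
        catch( std::exception const& ex ) { \
            std::cout << __FILE__ << ":" << __LINE__ \
                      << ": EXCEPTION in " << #expr << ": " << ex.what() << "\n"; \
        }

    int main() {
        CHECK( 1 + 1 == 2 );  // passes silently
        CHECK( 1 + 1 == 3 );  // reports the file, line and expression text
    }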

Interestingly, lest still uses a macro for assertions.

Weighing up

So we've seen that a truly modern C++ test framework, using lambdas in particular, can allow you to write tests without the use of macros - but at a cost!

So the other side of the equation must be: what benefit do you get from eschewing the macros?

Personally I've always striven to minimise or eliminate the use of macros in C++. In the early days that was mostly about using const, inline and templates. Now lambdas allow us to address some of the remaining cases and I'm all for that.

But I also tend to associate a much higher "cost" to macro usage when it generates imperative code. This is code that you're likely to find yourself needing to step through in a debugger at runtime - and macros really obfuscate this process. When I use macros it tends to be in declarative code. Code that generates purely declarative statements, or effectively declarative statements (such as the test case function registration code). It tends to always generate the exact same machinery - so should not be sensitive to its inputs in ways that will require debugging.

How do Catch's macros play out in that regard? Well the test case registration macros get a pass. Sections are a grey area - they are on the path of code that needs to be stepped over - and, worse, hide a conditional (a section is really just an if statement on a global variable!). So score a few points down there. Assertions are also very much runtime executable - and are frequently on the debugging path! In fact stepping into expressions being asserted on in Catch tests can be quite a pain as you end up stepping into some of the "hidden" calls before you get to the expression you supplied (in Visual Studio, at least, this can be mitigated by excluding the Catch namespace using the StepOver registry key).
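
On the sections point, a crude sketch of the hidden conditional (mine, not Catch's real code):

    // Hypothetical tracker query: should this section run on the current
    // pass through the enclosing test case?
    bool sectionShouldRun( char const* name );

    // The section macro hides an if statement - which is why sections sit
    // on the path you step through in a debugger.
    #define SECTION( name ) if( sectionShouldRun( name ) )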

Now, interestingly, the use of macros for the assertions was never really about C++03 vs C++11. It was about capturing extra information (file/ line) and wrapping in a try-catch. So if you're willing to make that trade-off there's no reason you can't have non-macro assertions even in C++03!

Back to the future

One of my longer arcs of development on Catch (that I edge towards on each refactoring) is to decouple the assertion mechanism from the guts of the test runner. You should be able to provide your own assertions that work with Catch. Many other test frameworks work this way and it allows them to be much more flexible. In particular it will allow me to decouple the matcher framework (and maybe allow third-party matchers to work with Catch).

Of course this would also allow macro-less assertions to be used (as it happens the assertions in bandit and mettle are both matcher-like already).

So, while I think Catch is committed to supporting C++03 for some time yet, that doesn't mean there is no scope for modernising it and keeping it relevant. And, modern or not, I still believe it is the simplest C++ test framework to get up and running with, and the least noisy to work with.