Guaranteed Copy Elision Does Not Elide Copies

Simon Brand from Simon Brand

This post is also available at the Microsoft Visual C++ Team Blog

C++17 merged in a paper called Guaranteed copy elision through simplified value categories. The changes mandate that no copies or moves take place in some situations where they were previously allowed, e.g.:

struct non_moveable {
    non_moveable() = default;
    non_moveable(non_moveable&&) = delete;
};
non_moveable make() { return {}; }
non_moveable x = make(); //compiles in C++17, error in C++11/14

You can see this behavior in compiler versions Visual Studio 2017 15.6, Clang 4, GCC 7, and above.

Despite the name of the paper and what you might read on the Internet, the new rules do not guarantee copy elision. Instead, the new value category rules are defined such that no copy exists in the first place. Understanding this nuance gives a deeper understanding of the current C++ object model, so I will explain the pre-C++17 rules, what changes were made, and how they solve real-world problems.

Value Categories

To understand the before-and-after, we first need to understand what value categories are (I’ll explain copy elision in the next section). Continuing the theme of C++ misnomers, value categories are not categories of values. They are characteristics of expressions. Every expression in C++ has one of three value categories: lvalue, prvalue (pure rvalue), or xvalue (eXpiring value). There are then two parent categories: all lvalues and xvalues are glvalues, and all prvalues and xvalues are rvalues.

diagram expressing the taxonomy described above

For an explanation of what these are, we can look at the standard (C++17 [basic.lval]/1):

  • A glvalue ([generalised lvalue]) is an expression whose evaluation determines the identity of an object, bit-field, or function.
  • A prvalue is an expression whose evaluation initializes an object or a bit-field, or computes the value of an operand of an operator, as specified by the context in which it appears.
  • An xvalue is a glvalue that denotes an object or bit-field whose resources can be reused (usually because it is near the end of its lifetime).
  • An lvalue is a glvalue that is not an xvalue.
  • An rvalue is a prvalue or an xvalue.

Some examples:

std::string s; 
s //lvalue: identity of an object 
s + " cake" //prvalue: could perform initialization/compute a value 

std::string f(); 
std::string& g(); 
std::string&& h(); 

f() //prvalue: could perform initialization/compute a value 
g() //lvalue: identity of an object 
h() //xvalue: denotes an object whose resources can be reused 

struct foo { 
    std::string s; 
}; 

foo{}.s //xvalue: denotes an object whose resources can be reused

C++11

What are the properties of the expression std::string{"a pony"}?

It’s a prvalue. Its type is std::string. It has the value "a pony". It names a temporary.

That last one is the key point I want to talk about, and it’s the real difference between the C++11 rules and C++17. In C++11, std::string{"a pony"} does indeed name a temporary. From C++11 [class.temporary]/1:

Temporaries of class type are created in various contexts: binding a reference to a prvalue, returning a prvalue, a conversion that creates a prvalue, throwing an exception, entering a handler, and in some initializations.

Let’s look at how this interacts with this code:

struct copyable {
    copyable() = default;
    copyable(copyable const&) { /*...*/ }
};
copyable make() { return {}; }
copyable x = make();

make() results in a temporary. This temporary will be moved into x. Since copyable has no move constructor, this calls the copy constructor. However, this copy is unnecessary since the object constructed on the way out of make will never be used for anything else. The standard allows this copy to be elided by constructing the return value at the call-site rather than in make (C++11 [class.copy]). This is called copy elision.

The unfortunate part is this: even if every copy and move is elided, the copy or move constructor which would have been used must still exist and be accessible.

This means that if we instead have:

struct non_moveable {
    non_moveable() = default;
    non_moveable(non_moveable&&) = delete;
};
non_moveable make() { return {}; }
auto x = make();

then we get a compiler error:

<source>(7): error C2280: 'non_moveable::non_moveable(non_moveable &&)': attempting to reference a deleted function
<source>(3): note: see declaration of 'non_moveable::non_moveable'
<source>(3): note: 'non_moveable::non_moveable(non_moveable &&)': function was explicitly deleted

Aside from returning non-moveable types by value, this presents other issues:

  • auto x = non_moveable{}; is a compiler error for the same reason.
  • The language makes no guarantees that the constructors won’t be called (in practice this isn’t too much of a worry, but guarantees are more convincing than optional optimizations).
  • If we want to support some of these use-cases, we need to write copy/move constructors for types for which they don’t make sense (and do what? Throw? Abort? Linker error?).
  • You can’t pass non-moveable types to functions by value, in case you have some use-case which that would help with.

What’s the solution? Should the standard just say “oh, if you elide all copies, you don’t need those constructors”? Maybe, but then all this language about constructing temporaries is really a lie and building an intuition about the object model becomes even harder.

C++17

C++17 takes a different approach. Instead of guaranteeing that copies will be elided in these cases, it changes the rules such that the copies were never there in the first place. This is achieved through redefining when temporaries are created.

As noted in the value category descriptions earlier, prvalues exist for purposes of initialization. C++11 creates temporaries eagerly, eventually using them in an initialization and cleaning up copies after the fact. In C++17, the materialization of temporaries is deferred until the initialization is performed.

That’s a better name for this feature. Not guaranteed copy elision. Deferred temporary materialization.

Temporary materialization creates a temporary object from a prvalue, resulting in an xvalue. The most common places it occurs are when binding a reference to or performing member access on a prvalue. If a reference is bound to the prvalue, the materialized temporary’s lifetime is extended to that of the reference (this is unchanged from C++11, but worth repeating). If a prvalue is used to initialize an object of the same class type, then the destination object is initialized directly; no temporary is required.

Some examples:

struct foo {
    int i;
};

foo make();
auto const& a = make(); //temporary materialized and lifetime-extended
auto&& b = make(); //ditto

foo{}.i //temporary materialized

auto c = make(); //no temporary materialized

That covers the most important points of the new rules. Now on to why this is actually useful past terminology bikeshedding and trivia to impress your friends.

Who cares?

I said at the start that understanding the new rules would grant a deeper understanding of the C++17 object model. I’d like to expand on that a bit.

The key point is that in C++11, prvalues are not “pure” in a sense. That is, the expression std::string{"a pony"} names some temporary std::string object with the contents "a pony". It’s not the pure notion of the list of characters “a pony”. It’s not the Platonic ideal of “a pony”.

In C++17, however, std::string{"a pony"} is the Platonic ideal of “a pony”. It’s not a real object in C++’s object model, it’s some elusive, amorphous idea which can be passed around your program, only being given form when initializing some result object, or materializing a temporary. C++17’s prvalues are purer prvalues.

If this all sounds a bit abstract, that’s okay, but internalising this idea will make it easier to reason about aspects of your program. Consider a simple example:

struct foo {};
auto x = foo{};

In the C++11 model, the prvalue foo{} creates a temporary which is used to move-construct x, but the move is likely elided by the compiler.

In the C++17 model, the prvalue foo{} initializes x.

A more complex example:

std::string a() {
    return "a pony";
}

std::string b() {
    return a();
}

int main() {
    auto x = b();
}

In the C++11 model, return "a pony"; initializes the temporary return object of a(), which move-constructs the temporary return object of b(), which move-constructs x. All the moves are likely elided by the compiler.

In the C++17 model, return "a pony"; initializes the result object of a(), which is the result object of b(), which is x.

In essence, rather than an initializer creating a series of temporaries which in theory move-construct a chain of return objects, the initializer is teleported to the eventual result object. In C++17, the code:

T a() { return /* expression */ ; }
auto x = a();

is identical to auto x = /* expression */;. For any T.
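To make this concrete, here is a minimal sketch (assuming a C++17 compiler) showing that the chain works even for a type which can neither be copied nor moved:

struct non_moveable {
    non_moveable() = default;
    non_moveable(non_moveable&&) = delete;
};

non_moveable a() { return {}; }
non_moveable b() { return a(); }

int main() {
    auto x = b(); //fine in C++17: the prvalue initializes x directly
}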

Closing

The “guaranteed copy elision” rules do not guarantee copy elision; instead they purify prvalues such that the copy doesn’t exist in the first place. Next time you hear or read about “guaranteed copy elision”, think instead about deferred temporary materialization.

Spaceship Operator

Simon Brand from Simon Brand

You write a class. It has a bunch of member data. At some point, you realise that you need to be able to compare objects of this type. You sigh and resign yourself to writing six operator overloads for every type of comparison you need to make. Afterwards your fingers ache and your previously clean code is lost in a sea of functions which do essentially the same thing. If this sounds familiar, then C++20’s spaceship operator is for you. This post will look at how the spaceship operator allows you to describe the strength of relations, write your own overloads, have them be automatically generated, and how correct, efficient two-way comparisons are automatically rewritten to use them.

Relation strength

The spaceship operator looks like <=> and its official C++ name is the “three-way comparison operator”. It is so called because it works by comparing two objects, then comparing that result to 0, like so:

(a <=> b) < 0  //true if a < b
(a <=> b) > 0  //true if a > b
(a <=> b) == 0 //true if a is equal/equivalent to b

One might think that <=> will simply return -1, 0, or 1, similar to strcmp. This being C++, the reality is a fair bit more complex, but it’s also substantially more powerful.

Not all equality relations were created equal. C++’s spaceship operator doesn’t just let you express orderings and equality between objects, but also the characteristics of those relations. Let’s look at two examples.

Example 1: Ordering rectangles

We have a rectangle class and want to define comparison operators for it based on its size. But what does it mean for one rectangle to be smaller than another? Clearly if one rectangle is 1cm by 1cm, it is smaller than one which is 10cm by 10cm. But what about one rectangle which is 1cm by 5cm and another which is 5cm by 1cm? These are clearly not equal, but they’re also not less than or greater than each other. Practically speaking, though, having all of <, <=, ==, >, and >= return false for this case is not particularly useful, and it breaks some common assumptions about these operators, such as (a == b || a < b || a > b) == true. Instead, we can say that == in this case models equivalence rather than true equality. This is known as a weak ordering.
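As a sketch of how this might look in code (my illustration, not a definitive implementation; assuming C++20 and the <compare> header), a rectangle ordered by area can return std::weak_ordering, since two distinct rectangles can compare equivalent:

#include <compare>

struct rect {
    int w, h;

    std::weak_ordering operator<=>(rect const& rhs) const {
        //comparing ints yields std::strong_ordering, which implicitly
        //converts to the weaker return type
        return w * h <=> rhs.w * rhs.h;
    }
};

The square type in the next example could instead return std::strong_ordering.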

Example 2: Ordering squares

Similar to above, we have a square type which we want to define comparison operators for with respect to size. In this case, we don’t have the issue of two objects being equivalent, but not equal: if two squares have the same area, then they are equal. This means that == models equality rather than equivalence. This is known as a strong ordering.

Describing relations

Three-way comparisons allow you to express the strength of your relation, and whether it allows just equality or also ordering. This is achieved through the return type of operator<=>. Five types are provided, and stronger relations can implicitly convert to weaker ones, like so:

relation conversion

Strong and weak orderings are as described in the above examples. Strong and weak equality means that only == and != are valid. Partial ordering is weaker than weak_ordering in that it also allows unordered values, like NaN for floats.
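For instance (a small sketch, again assuming C++20 and <compare>), comparing a double against NaN yields an unordered result:

#include <compare>
#include <limits>

int main() {
    double nan = std::numeric_limits<double>::quiet_NaN();
    std::partial_ordering res = (1.0 <=> nan);
    bool unordered = (res == std::partial_ordering::unordered); //true
}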

These types have a number of uses. Firstly they indicate to users of the types what kind of relation is modelled by the comparisons, so that their behaviour is more clear. Secondly, algorithms could potentially have more efficient specializations for when the orderings are stronger, e.g. if two strongly-ordered objects are equal, calling a function on one of them will give the same result as the other, so it would only need to be carried out once. Finally, they can be used in defining language features, such as class type non-type template parameters, which require the type to have strong equality so that equal values always name the same template instantiation.

Writing your own three-way comparison

You can provide a three-way comparison operator for your type in the usual way, by writing an operator overload:

struct foo {
  int i;

  std::strong_ordering operator<=>(foo const& rhs) const {
    return i <=> rhs.i;
  }
};

Note that whereas two-way comparisons should be non-member functions so that implicit conversions are done on both sides of the operator, this is not necessary for operator<=>; we can make it a member and it’ll do the right thing.

Sometimes you may find that you want to do <=> on types which don’t have an operator<=> defined. The compiler won’t try to work out a definition for <=> based on the other operators, but you can use std::compare_3way instead, which will fall back on two-way comparisons if there is no three-way version available. For example, we could write a three-way comparison operator for a pair type like so:

template<class T, class U>
struct pair {
  T t;
  U u;

  auto operator<=>(pair const& rhs) const
    -> std::common_comparison_category_t<
         decltype(std::compare_3way(t, rhs.t)),
         decltype(std::compare_3way(u, rhs.u))> {
    if (auto cmp = std::compare_3way(t, rhs.t); cmp != 0) return cmp;
    return std::compare_3way(u, rhs.u);
  }
};

std::common_comparison_category_t there determines the weakest relation given a number of them. E.g. std::common_comparison_category_t<std::strong_ordering, std::partial_ordering> is std::partial_ordering.

If you found the previous example a bit too complex, you might be glad to know that C++20 will support automatic generation of comparison operators. All we need to do is =default our operator<=>:

auto operator<=>(x const&) = default;

Simple! This will carry out a lexicographic comparison for each base, then member of the type, in order of declaration.
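For example, here is a sketch (assuming a C++20 compiler) of a type which gets all of its comparisons from one defaulted declaration:

#include <compare>

struct point {
    int x;
    int y;
    auto operator<=>(point const&) const = default;
};

int main() {
    point a{1, 2};
    point b{1, 3};
    bool lt = (a < b);  //true: compares x, then y, lexicographically
    bool eq = (a == b); //false
}

The use of < here compiles thanks to the rewriting rules described in the next section.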

Automatically-rewritten two-way comparisons

Although <=> is very powerful, most of the time we just want to know if some object is less than another, or equal to another. To facilitate this, two-way comparisons like < can be rewritten by the compiler to use <=> if a better match is not found.

The basic idea is that for some operator @, an expression a @ b can be rewritten as (a <=> b) @ 0. It can even be rewritten as 0 @ (b <=> a) if that is a better match, which means we get symmetry for free.

In some cases, this can actually provide a performance benefit. Quite often comparison operators are implemented by writing == and <, then writing the other operators in terms of those rather than duplicating the code. This can lead to situations where we want to check <= and end up doing an expensive comparison twice. This automatic rewriting can avoid that cost, since it will only call the one operator<=> rather than both operator< and operator==.

Conclusion

The spaceship operator is a very welcome addition to C++. It gives us more expressiveness in how we define our relations, lets us write less code to define them (sometimes even just a defaulted declaration), and avoids some of the performance pitfalls of manually implementing some comparison operators in terms of others.

For more details, have a look at the cppreference articles for comparison operators and the compare header, and Herb Sutter’s standards proposal. For a good example on how to write a spaceship operator, see Barry Revzin’s post on implementing it for std::optional. For more information on the mathematics of ordering with a C++ slant, see Jonathan Müller’s blog post series on the mathematics behind comparison.

Super Simple Named Boolean Parameters

Simon Brand from Simon Brand

Quite often we come across interfaces with multiple boolean parameters, like this:

cake make_cake (bool with_dairy, bool chocolate_sauce, bool poison);

A call to this function might look like:

auto c = make_cake(true, true, false);

Unfortunately, this is not very descriptive. We need to look at the declaration of the function to know what each of those bools mean, and it would be way too easy to accidentally flip the wrong argument during maintenance and end up poisoning ourselves rather than having a chocolatey treat.

There are many solutions which people use, from just adding comments to each argument, to creating tagged types. I recently came across a very simple trick which I had not seen before:

#define K(name) true

Yep, it’s a macro which takes an argument and gets replaced by true. This may look strange, but it enables us to write this:

auto c = make_cake(K(with_dairy), !K(chocolate_sauce), !K(poison));

// or using `not` if you prefer
auto c = make_cake(K(with_dairy), not K(chocolate_sauce), not K(poison));

Of course, it’s up to you whether you think using a macro for this kind of thing is worth it, or if you think that macro should have a name which is less likely to collide. Personally I think it’s quite a neat trick to solve this issue in a low-overhead manner.

I guess I need to come up with a marketing name for this. Since we already have X Macros, I hereby dub this trick “K Macros”.

Function Template Partial Ordering: Worked Examples

Simon Brand from Simon Brand

C++ function overloading rules are complex. C++ template rules are complex. Put the two together, and you unfortunately do not get something simple; you get a hideous monster of standardese which requires great patience and knowledge to overcome. However, since C++ is mostly corner-cases, it can pay to understand how the rules apply for those times where you just can’t work out why your code won’t compile. This post will present a few step-by-step examples of how partial ordering of function templates works in order to arm you for these times of need.

Partial ordering of function templates is a step of overload resolution. It occurs when you call a function template which is overloaded and the compiler needs to decide which one is more specialized than the other. Consider this code:

template<class T> void f(T);        //(1)
template<class T> void f(T const*); //(2)

int const* p = nullptr;
f(p);

We expect f(p) to call (2), because p is an int const*. In order to decide that (2) is more specialized than (1), the compiler needs to follow the function template partial ordering rules. Let’s see what the standard has to say.

Partial ordering selects which of two function templates is more specialized than the other by transforming each template in turn (see next paragraph) and performing template argument deduction using the function type.

This is not so complicated, even if the terms may be unfamiliar. Unfortunately, the “next paragraph” and the sections which it references are almost impossible to parse without a lot of background knowledge and re-readings, so I shall step through the algorithm rather than pasting the rules for you to cry over.

There are four steps which we have to work through:

  1. Transforming (1).
  2. Performing deduction on (2) with the transformed template from step 1.
  3. Transforming (2).
  4. Performing deduction on (1) with the transformed template from step 3.

If one and only one of the deductions succeeds, then the template with which the deduction was performed is more specialized than the other.

Step 1: The rules state that for each template parameter, we create some unique type to use in its stead1. Let’s call this unique type type_0. You can pretend that this was defined somewhere like class type_0{};. Now we take our function template template <class T> void f(T) and substitute in type_0 for T. This gives us void f(type_0). The transformation is complete.

Step 2: Now that we have transformed template <class T> void f(T) into void f(type_0), we will perform deduction on (2) using the transformed function type. To do this, we imagine a call to (2) where the arguments have the type of the parameters for (1). Concretely, it would look like this:

template <class T> void func_2(T const*);
func_2(type_0{}); //derived from void f(type_0)

Would this call succeed? We can put it into our compiler to find out. GCC 8.1 says:

<source>: In function 'int main()':
<source>:4:18: error: no matching function for call to 'func_2(type_0)'
   func_2(type_0{});
                  ^
<source>:1:25: note: candidate: 'template<class T> void func_2(const T*)'
 template <class T> void func_2(T const*);
                         ^~~~~~
<source>:1:25: note:   template argument deduction/substitution failed:
<source>:4:18: note:   mismatched types 'const T*' and 'type_0'
   func_2(type_0{});
                  ^   

So deduction from (1) to (2) fails, because the invented type type_0 cannot be used to deduce const T*.

Step 3: Let’s try from (2) to (1). Again, we’ll transform (2) from template <class T> void f(T const*) to void f(type_0 const*).

Step 4: Now we attempt deduction:

template <class T> void func_1(T);
type_0 const* arg = nullptr;
func_1(arg);

This succeeds because a type_0 const* can be used to deduce T. Since deduction from (1) to (2) fails, but deduction from (2) to (1) succeeds, (2) is more specialized than (1) and will be chosen by overload resolution.


Let’s try a different example. How about:

template<class T> void g(T);  //(1)
template<class T> void g(T&); //(2)
int i = 0;
g(i);

(1) transforms to void g(type_0). Before we try deduction, we need to apply one of the numerous additional rules from the standard, which says we need to replace references with the type being referred to. So template <class T> void g(T&) becomes template <class T> void g(T). Deduction time:

template<class T> void func_2(T);
func_2(type_0{});

This succeeds.

Now the other direction. template<class T> void g(T&) transforms to void g(type_0&), then we remove the reference to get void g(type_0). Our second deduction:

template<class T> void func_1(T);
func_1(type_0{});

This is effectively identical to the previous one, so of course it succeeds.

Since deduction succeeded in both directions, the call is ambiguous. Sure enough, GCC diagnoses:

<source>: In function 'int main()':
<source>:5:8: error: call of overloaded 'g(int&)' is ambiguous
     g(i);
        ^
<source>:1:24: note: candidate: 'void g(T) [with T = int]'
 template<class T> void g(T);  //(1)
                        ^
<source>:2:24: note: candidate: 'void g(T&) [with T = int]'
 template<class T> void g(T&); //(2)
                        ^ 

This is why the algorithm is a partial ordering: sometimes two function templates are not ordered.


I’ll give one more example. This one has multiple parameters and is a bit more subtle.

template<class T>struct identity { using type = T; };
template<class T>struct A{};

template<class T, class U> void h(typename identity<T>::type, U); //(1)
template<class T, class U> void h(T, A<U>);                       //(2)
h<int>(0,A<void>{});

identity here just evaluates to its template argument, but the important thing to note is that typename identity<T>::type is a non-deduced context, so T cannot be deduced from the argument for that parameter.

(1) transforms to void h(typename identity<type_0>::type, type_0), which is void h(type_0, type_0). Attempt deduction on (2):

template<class T, class U> void func_2(T, A<U>);
func_2(type_0{}, type_0{});

This fails because we can’t match type_0 against A<U>.

(2) transforms to void h(type_0, A<type_1>). Try deduction against (1):

template<class T, class U> void func_1(typename identity<T>::type, U);
func_1(type_0{}, A<type_1>{});

This fails because typename identity<T>::type is a non-deduced context, so we can’t deduce T.

In the example from the last section deduction succeeded both ways so the call was ambiguous. In this example, deduction fails both ways, which is also an ambiguous call.


That’s the last of the examples. Of course, there are a bunch of rules which I didn’t cover here, like Concepts, parameter packs, non-type/template template parameters, and cases where both the argument and parameter types are references. Hopefully you now have enough of an intuition that you can understand what the standard says when you inevitably hit those corner cases. If you have any partial ordering conundrums, drop them down in the comments below or send them to me on Twitter.


  1. I’ll ignore non-type template parameters and template template parameters for simplicity, but the rules are essentially the same. 

std::accumulate vs. std::reduce

Simon Brand from Simon Brand

std::accumulate has been a part of the standard library since C++98. It provides a way to fold a binary operation (such as addition) over an iterator range, resulting in a single value. std::reduce was added in C++17 and looks remarkably similar. This post will explain the difference between the two and when to use one or the other.

Let’s start by looking at their interfaces, beginning with std::accumulate.

template< class InputIt, class T >
T accumulate( InputIt first, InputIt last, T init );

template< class InputIt, class T, class BinaryOperation >
T accumulate( InputIt first, InputIt last, T init,
              BinaryOperation op );

std::accumulate takes an iterator range and an initial value for the accumulation. You can optionally give it a binary operation to do the reduction, which will default to addition. It will call this operation on the initial value and the first element of the range, then on the result and the second element of the range, etc. Here are two equivalent calls:

auto sum1 = std::accumulate(begin(vec), end(vec), 0);
auto sum2 = std::accumulate(begin(vec), end(vec), 0, std::plus<>{});

std::reduce has a fair few more overloads to get your head round, but has a very similar interface once you understand them:

template<class InputIt>
typename std::iterator_traits<InputIt>::value_type reduce(
    InputIt first, InputIt last);

template<class ExecutionPolicy, class ForwardIt>
typename std::iterator_traits<ForwardIt>::value_type reduce(
    ExecutionPolicy&& policy,
    ForwardIt first, ForwardIt last);

template<class InputIt, class T>
T reduce(InputIt first, InputIt last, T init);

template<class ExecutionPolicy, class ForwardIt, class T>
T reduce(ExecutionPolicy&& policy,
         ForwardIt first, ForwardIt last, T init);

template<class InputIt, class T, class BinaryOp>
T reduce(InputIt first, InputIt last, T init, BinaryOp binary_op);

template<class ExecutionPolicy, class ForwardIt, class T, class BinaryOp>
T reduce(ExecutionPolicy&& policy,
         ForwardIt first, ForwardIt last, T init, BinaryOp binary_op);

The differences here are:

  • There are overloads which take an execution policy.
  • The execution policy overloads require forward iterators, whereas the others accept input iterators.
  • The initial value is optional; if omitted, a default-constructed value is used.

I’ll talk about these points in turn.

Execution policies

Execution policies are a C++17 feature which allows programmers to ask for algorithms to be parallelised. There are three execution policies in C++17:

  • std::execution::seq – do not parallelise
  • std::execution::par – parallelise
  • std::execution::par_unseq – parallelise and vectorise (requires that the operation can be interleaved, so no acquiring mutexes and such)

The idea behind execution policies is that you can change a serial algorithm to a parallel algorithm simply by passing an additional argument to the function:

auto sum1 = std::reduce(begin(vec), end(vec));                      //sequential
auto sum2 = std::reduce(std::execution::seq, begin(vec), end(vec)); //sequential
auto sum3 = std::reduce(std::execution::par, begin(vec), end(vec)); //parallel

Allowing parallelisation is the main reason for the addition of std::reduce. Let’s look at an example where we want to sum up all the elements in an array. With std::accumulate it looks like this:

accumulate plus

Note that each step of the computation relies on the previous computation, i.e. this algorithm will execute serially and we make no use of hardware parallel processing capabilities. If we use std::reduce with the std::execution::par policy then it could look like this:

reduce plus

This is a trivial amount of data for processing in parallel, but the benefit gained when the data size is scaled up should be clear: some of the operations can be executed independently of others, so they can be done in parallel.
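To illustrate the figures above, take elements {a, b, c, d} with an initial value of 0 and operation +:

//std::accumulate must fold strictly left-to-right:
//  (((0 + a) + b) + c) + d
//std::reduce may regroup and reorder, e.g.:
//  (0 + (a + b)) + (c + d) //the two inner sums can run in parallel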

A common question is: why do we need an entirely new algorithm for this? Why can’t we just overload std::accumulate? For an example of why we can’t do this, let’s use std::minus<> instead of std::plus<> as our reduction operation. With std::accumulate we get this:

accumulate minus

However, if we try to use std::reduce, we could get something like:

reduce minus

Uh oh. We just broke our code.

We got the wrong answer because of the mathematical properties of subtraction. You can’t arbitrarily reorder the operands, or compute the operations out of order when doing subtraction. This is formalised in the properties of commutativity and associativity.

A binary operation ∗ on a set S is associative if the following equation holds for all x, y, and z in S:

(x ∗ y) ∗ z = x ∗ (y ∗ z)

An operation is commutative if:

x ∗ y = y ∗ x

Unfortunately there’s no way for the compiler to reliably check that some operation is associative and commutative, so we’re stuck with having std::reduce as a separate algorithm. In concepts speak, this is called an axiom: a requirement imposed on semantics which cannot generally be statically verified.
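Here is a concrete sketch of the problem (the grouping shown for std::reduce is just one possibility; the result is unspecified for a non-associative operation):

#include <functional>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};

    //well-defined left fold: ((100 - 1) - 2) - 3 == 94
    auto acc = std::accumulate(v.begin(), v.end(), 100, std::minus<>{});

    //may regroup, e.g. (100 - 1) - (2 - 3) == 100
    auto red = std::reduce(v.begin(), v.end(), 100, std::minus<>{});
}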

Input vs. Forward Iterators

I couldn’t find any reasoning in the original standards papers for the separate overloads for input and forward iterators. However, my educated guess is that this is to allow library implementors to chunk up the data in order to distribute it to a thread pool. Indeed, this appears to be what MSVC’s implementation does.

The idea is that input iterators are single-pass, whereas forward iterators can be iterated through multiple times. An algorithm couldn’t chunk up data indexed by input iterators because by the time it had gone through the range to work out the sub-range boundaries, the data from the iterator could have already been consumed.
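To make the chunking idea concrete, here is a toy sketch (my illustration, not MSVC’s actual implementation) of a two-chunk sum. Note that it traverses the range multiple times, which forward iterators permit but input iterators do not:

#include <iterator>
#include <numeric>
#include <thread>

template <class FwdIt>
auto chunked_sum(FwdIt first, FwdIt last) {
    using T = typename std::iterator_traits<FwdIt>::value_type;
    auto const size = std::distance(first, last); //one pass to find the size
    auto const mid = std::next(first, size / 2);  //another to find the boundary

    T left{};
    std::thread worker{[&] { left = std::accumulate(first, mid, T{}); }};
    T right = std::accumulate(mid, last, T{});    //reduce both chunks in parallel
    worker.join();
    return left + right;
}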

Optional Initial Element

std::reduce lets you not bother passing an initial element, in which case it will default-construct one using typename std::iterator_traits<InputIt>::value_type{}. I think this was a mistake. A default-constructed value is not always the identity element (such as in multiplication), and it’s very easy for a programmer to miss out the initial element. The code will still compile, it will just give the wrong answer. I suspect that this choice will result in some hard-to-find bugs when this interface comes into heavier use.
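For example (a small sketch of the pitfall):

#include <functional>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4};

    //fine: int{} == 0 is the identity for addition
    auto sum = std::reduce(v.begin(), v.end()); //10

    //for multiplication the identity is 1; relying on a default-constructed
    //value would give 0, so it must be passed explicitly
    auto product = std::reduce(v.begin(), v.end(), 1, std::multiplies<>{}); //24
}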

Finishing Up

That covers the differences between std::reduce and std::accumulate. My three point guide to std::reduce is:

  • Use std::reduce when you want your accumulation to run in parallel
  • Ensure that the operation you want to use is both associative and commutative
  • Remember that the default initial value is produced by default construction, and that this may not be correct for your operation

Custom Alias Analysis in LLVM

Simon Brand from Simon Brand

At Codeplay I currently work on a compiler backend for an embedded accelerator. The backend is the part of the compiler which takes some representation of the source code and translates it to machine code. In my case I’m working with LLVM, so this representation is LLVM IR.

It’s very common for accelerators like GPUs to have multiple regions of addressable memory – each with distinct properties. One important optimisation I’ve implemented recently is extending LLVM’s alias analysis functionality to handle the different address spaces for our target architecture.

This article will give an overview of the aim of alias analysis – along with an example based around address spaces – and show how custom alias analyses can be expressed in LLVM. You should be able to understand this article even if you don’t work with compilers or LLVM; hopefully it gives some insight into what compiler developers do to make the tools you use generate better code. LLVM developers may find the implementation section helpful, but may want to read the documentation or examples linked at the bottom for more details.

What is alias analysis

Alias analysis is the process of determining whether two pointers can point to the same object (alias) or not. This is very important for some valuable optimisations.

Take this C function as an example:

int foo (int __attribute__((address_space(0)))* a,
         int __attribute__((address_space(1)))* b) {
    *a = 42;
    *b = 20;
    return *a;
}

Those __attribute__s specify that a points to an int in address space 0 and b points to an int in address space 1. An important detail of the target architecture for this code is that address spaces 0 and 1 are completely distinct: modifying memory in address space 0 can never affect memory in address space 1. Here’s some LLVM IR which could be generated from this function:

define i32 @foo(i32 addrspace(0)* %a, i32 addrspace(1)* %b) #0 {
entry:
  store i32 42, i32 addrspace(0)* %a, align 4
  store i32 20, i32 addrspace(1)* %b, align 4
  %0 = load i32, i32* %a, align 4
  ret i32 %0
}

For those unfamiliar with LLVM IR, the first store is storing 42 into *a, the second storing 20 into *b. The %0 = ... line is like loading *a into a temporary variable, which is then returned in the final line.

Optimising foo

Now we want foo to be optimised. Can you see an optimisation which could be made?

What we really want is for that load from a (the line beginning %0 = ...) to be removed and for the final statement to instead return 42. We want the optimised code to look like this:

define i32 @foo(i32 addrspace(0)* %a, i32 addrspace(1)* %b) #0 {
entry:
  store i32 42, i32 addrspace(0)* %a, align 4
  store i32 20, i32 addrspace(1)* %b, align 4
  ret i32 42
}

However, we have to be very careful, because this optimisation is only valid if a and b do not alias, i.e. they must not point at the same object. Forgetting about the address spaces for a second, consider this call to foo where we pass pointers which do alias:

int i = 0;
int result = foo(&i, &i);

Inside the unoptimised version of foo, i will be set to 42, then to 20, then 20 will be returned. However, if we carry out desired optimisation then the two stores will occur, but 42 will be returned instead of 20. We’ve just broken the behaviour of our function.

The only way that a compiler can reasonably carry out the above optimisation is if it can prove that the two pointers cannot possibly alias. This reasoning is carried out through alias analysis.

Custom alias analysis in LLVM

As I mentioned above, address spaces 0 and 1 for our target architecture are distinct. However, this may not hold for some systems, so LLVM cannot assume that it holds in general: we need to make it explicit.

One way to achieve this is to extend llvm::AAResultBase. If our target is called TAR then we can create a class called TARAAResult which inherits from AAResultBase<TARAAResult>1:

class TARAAResult : public AAResultBase<TARAAResult> {
public:
  explicit TARAAResult() : AAResultBase() {}
  TARAAResult(TARAAResult &&Arg) : AAResultBase(std::move(Arg)) {}

  AliasResult alias(const MemoryLocation &LocA, const MemoryLocation &LocB);
};

The magic happens in the alias member function, which takes two MemoryLocations and returns an AliasResult. The result indicates whether the locations can never alias, may alias, partially alias, or precisely alias. We want our analysis to say “If the address spaces for the two memory locations are different, then they can never alias”. The resulting code is surprisingly close to this English description:

AliasResult TARAAResult::alias(const MemoryLocation &LocA,
                               const MemoryLocation &LocB) {
  auto AsA = LocA.Ptr->getType()->getPointerAddressSpace();
  auto AsB = LocB.Ptr->getType()->getPointerAddressSpace();

  if (AsA != AsB) {
    return NoAlias;
  }

  // Forward the query to the next analysis.
  return AAResultBase::alias(LocA, LocB);
}

Alongside this you need a bunch of boilerplate for creating a pass out of this analysis (I’ll link to a full example at the end), but after that’s done you just register the pass and ensure that the results of it are tracked:

void TARPassConfig::addIRPasses() {
  addPass(createTARAAWrapperPass());
  auto AnalysisCallback = [](Pass &P, Function &, AAResults &AAR) {
    if (auto *WrapperPass = P.getAnalysisIfAvailable<TARAAWrapper>()) {
      AAR.addAAResult(WrapperPass->getResult());
    }
  }; 
  addPass(createExternalAAWrapperPass(AnalysisCallback));
  TargetPassConfig::addIRPasses();
}

We also want to ensure that there is an optimisation pass which will remove unnecessary loads and stores which our new analysis will find. One example is Global Value Numbering.

  addPass(createNewGVNPass());

After all this is done, running the optimiser on the LLVM IR from the start of the post will eliminate the unnecessary load, and the generated code will be faster as a result.

You can see an example with all of the necessary boilerplate in LLVM’s test suite. The AMDGPU target has a full implementation which is essentially what I presented above extended for more address spaces.

Hopefully you have got a taste of what alias analysis does and the kinds of work involved in writing compilers, even if I have not gone into a whole lot of detail. I may write some more short posts on other areas of compiler development if readers find this interesting. Let me know on Twitter or in the comments!


  1. This pattern of inheriting from a class and passing the derived class as a template argument is known as the Curiously Recurring Template Pattern. 

No more pointers

Simon Brand from Simon Brand

One of the major changes at the most recent C++ standards meeting in Jacksonville was the decision to deprecate raw pointers in C++20, moving to remove them completely in C++23. This came as a surprise to many, with a lot of discussion as to how we’ll get by without this fundamental utility available any more. In this post I’ll look at how we can replace some of the main use-cases of raw pointers in C++20.

Three of the main reasons people use raw pointers are:

  • Dynamic allocation & runtime polymorphism
  • Nullable references
  • Avoiding copies

I’ll deal with these points in turn, but first, an answer to the main question people ask about this change.

The elephant in the room

What about legacy code? Don’t worry, the committee have come up with a way to boldly move the language forward without breaking all the millions of lines of C++ which people have written over the years: opt-in extensions.

If you want to opt-in to C++20’s no-pointers feature, you use #feature.

#feature <no_pointers> //opt-in to no pointers
#feature <cpp20>       //opt-in to all C++20 features

This is a really cool new direction for the language. Hopefully with this we can slowly remove features like std::initializer_list so that new code isn’t bogged down with legacy as much as it is today.

Dynamic allocation & runtime polymorphism

I’m sure most of you already know the answer to this one: smart pointers. If you need to dynamically allocate some resource, that resource’s lifetime should be managed by a smart pointer, such as std::unique_ptr or std::shared_ptr. These types are now special compiler-built-in types rather than normal standard library types. In fact, std::is_fundamental<std::unique_ptr<int>>::value now evaluates to true!

Nullable references

Since references cannot be rebound and cannot be null, pointers are often used to fulfil this purpose. However, with C++20, we have a new type to fulfil this purpose: std::optional<T&>. std::optional was first introduced in C++17, but was plagued with no support for references and no monadic interface. C++20 has fixed both of these, so now we have a much more usable std::optional type which can fill the gap that raw pointers have left behind.

Avoiding copies

Some people like to use raw pointers to avoid copies at interface boundaries, such as returning some resource from a function. Fortunately, we have much better options, such as (Named) Return Value Optimization. C++17 made some forms of copy elision mandatory, which gives us even more guarantees for the performance of our code.

Wrapping up

Of course there are more use-cases for raw pointers, but this covers three of the most common ones. Personally, I think this is a great direction to see the language going in, and I look forward to seeing other ways we can slowly make C++ into a simpler, better language.

emBO++ 2018 Trip Report

Simon Brand from Simon Brand

emBO++ is a conference focused on C++ on embedded systems in Bochum, Germany. This was its second year of operation, but the first that I’ve been along to. It was a great conference, so I’m writing a short report to hopefully convince more of you to attend next year!

Format

The conference took place over four days: an evening of lightning talks and burgers, a workshop day, a day of talks (what you might call the conference proper), and finally an unofficial standards meeting for those interested in SG14. This made for a lot of variety, and each day was valuable.

Venue

One thing I really enjoyed about emBO++ was that the different tech and social events were dotted around the city. This meant that I actually got to see some of Bochum, get lost navigating the train system, walk around town at night, etc., which made a nice change from being cooped up in a hotel for a few days.

The main conference venue was at the Zentrum für IT-Sicherheit (Centre for IT Security). It was a spacious building with a lot of light and large social areas, so it suited the conference environment well. The only problem was that it used to be a military building and was lined with copper, making the thing into one huge Faraday cage. This meant that WiFi was an issue for the first few hours of the day, but it got sorted eventually.

zits

Food and Drink

The catering at the main conference location was really excellent: a variety of tasty food with healthy options and large quantities. Even better was the selection of drinks available, which mostly consisted of interesting soft drinks which I’d never seen before, such as bottled Matcha with lime and a number of varieties of Mate. All the locations we went to for food and drinks were great – especially the speakers’ dinner. A lot of thought was obviously put into this area of the conference, and it showed.

Workshops

There were four workshops on the first day of the conference with two running in parallel. The two which I attended were very interesting and instructive, but I wish that they had been more hands-on.

Jörn Seger – Getting Started with Yocto

I was in two minds about attending this workshop. We need to use Yocto a little bit in my current project, so I could attend the workshop in order to gain more knowledge about it. On the other hand, I’d then be the most experienced member of my team in Yocto and would be forced to fix all the issues!

In the end I decided to go along, and it was definitely worthwhile. Previously I’d mostly muddled along without an understanding of the fundamentals of the system; this workshop provided those.

Kris Jusiak – Embedding a Compile-Time-State-Machine

Kris gave a workshop on Boost.SML, which is an embedded domain specific language (EDSL) for encoding expressive, high-performance state machines in C++. The library is very impressive, and it was valuable to see all the different use-cases it supports and how it supports switching out the frontend and backend of the system. I was particularly interested in this session as my talk the next day was on EDSLs, so it was an opportunity to steal some things to mention in my talk.

You can find Boost.SML here.

Talks

There were two tracks for most of the day, with the first and final ones being plenary sessions. There was a strong variety of talks, and I felt that my time was well-spent at all of them.

Simon Brand – Embedded DSLs for Embedded Programming

My talk! I think it went down well. I got some good engagement and questions from the audience, although not much feedback from the attendees later on in the day. I guess I’ll need to wait for it to get torn apart on YouTube.

me

Klemens Morgenstern – Developing high-performance Coroutines for ARMs

Klemens gave an excellent talk about an ARM coroutine library which he implemented. This talk has nothing to do with the C++ Coroutines TS, instead focusing on how coroutines can be implemented in a very low-overhead manner. In Klemens’ library, the user provides some memory to be used as the stack for the coroutine, then there are small pieces of ARM assembly which perform the context switch when you suspend or resume that coroutine. The talk went into the performance considerations, implementation, and use in just the right amount of detail, so I would definitely recommend watching if you want an overview of the ideas.

The library and presentation can be found here.

Emil Fresk – The compile-time, reactive scheduler: CRECT

CRECT is a task scheduler which carries out its work at compile time, therefore almost entirely disappearing from the generated assembly. Emil’s lofty goal for the talk was to present all of the necessary concepts such that those viewing the talk would feel like they could go off and implement something similar afterwards. I think he mostly succeeded in this – although a fair amount of metaprogramming skills would be required! He showed how to use the library to specify the jobs which need to be executed, the resources which they require, and when they should be scheduled to run. After we understood the fundamentals of the library, we learned how this actually gets executed at compile-time in order to produce the final scheduled output. Highly recommended for those who work with embedded systems and want a better way of scheduling their work.

You can find CRECT here.

Ben Craig – Standardizing an OS-less subset of C++

If you watch one talk from the conference it should be this one. C++ has had a “freestanding” variant for a long time, and it’s been neglected for the same amount of time. Ben talked about all the things which should not be available in freestanding mode but are, and those which should be but are not. He presented his vision for what should be standards-mandated facilities available in freestanding C++ implementations, and a tentative path to making this a reality. Particularly of interest were the odd edge cases which I hadn’t considered. For example, it turns out that std::array has to #include <string> somewhere down the line, because my_array.at(n) can throw an exception (std::out_of_range), and that exception has a constructor which takes std::string as an argument. These tiny issues will make getting a solid standard for freestanding difficult to pin down and agree on, but I think it’s a worthy cause.

Ben’s ISO C++ paper on a freestanding standard library can be found here.

Jacek Galowicz – Scalable test infrastructure for advanced bare-metal software development

In Jacek’s team, they have many different hardware versions to support. This creates a problem of creating regressions in some versions and not others when changes are made. This talk showed how they developed the testing infrastructure to enable them to test all hardware versions they needed on each merge request to ensure that bad commits aren’t merged in to the master branch. They wrote a simple testing framework in Haskell which was fine-tuned to their use case rather than using an existing solution like Jenkins (that’s what we use at Codeplay for solving the same problem). Jacek spoke about issues they faced and the creative solutions they put in place, such as putting a light detector over the CAPS LOCK button of a keyboard and making it blink in Morse code in order to communicate out from machines with no usable ports.

Odin Holmes – Bare-Metal-Roadmap

Odin’s talk summed up some current major problems that are facing the embedded community, and roped together all of the talks which had come before. It was cool to see the overlap in all the talks in areas of abstraction, EDSLs, making choices at compile time, etc.

Closing

I had a great time at emBO++ and would whole-heartedly recommend attending next year. The talks should be online in the next few months, so I look forward to watching those which I didn’t attend. The conference is mostly directed at embedded C++ developers, but I think it would be valuable to anyone doing low-latency programming on non-embedded systems, or those writing C/Rust/whatever for embedded platforms.

Thank you to Marie, Odin, Paul, Stephan, and Tabea for inviting me to talk and organising these great few days!

embo

Do compilers take inline as a hint?

Simon Brand from Simon Brand

If you’ve spent any time in C or C++ communities online, you’ve probably seen someone say this:

inline used to be a hint for compilers to inline the definition, but no compilers actually take that into account any more.

You shouldn’t believe everything you see on the internet.

Of course, inline has mandatory effects on linkage and whatnot, but this post is only interested in whether or not compilers might change the decision to inline a function or not based on whether you write inline in the declaration.

However, the point of this article isn’t to give you good rules for when to use inline, because as much as I like to debate technical minutiae, your compiler will likely be a better judge than you anyway. Instead I want to show that if you want to know how compilers or standard libraries implement things, you can just go look! It might be daunting the first time, but getting a knack for finding things in large code bases and understanding your tools can pay off. I’ll be diving into the codebases for Clang and GCC to see how they handle inlining hints and hopefully convince you that you can do the same for questions which you have about your tools.


Clang

Let’s do this bottom-up. Clang is a compiler frontend for LLVM, which means that it takes C-family languages, translates them to LLVM Intermediate Representation, and LLVM handles generating assembly/binaries from that. So we can dig around in LLVM to see if we can find code which deals with inlining. I armed myself with a text editor and grep and found the following code in llvm/lib/Analysis/InlineCost.cpp:

// Adjust the threshold based on inlinehint attribute and profile based
// hotness information if the caller does not have MinSize attribute.
if (!Caller->optForMinSize()) {
  if (Callee.hasFnAttribute(Attribute::InlineHint))
    Threshold = MaxIfValid(Threshold, Params.HintThreshold);
  if (PSI) {
    uint64_t TotalWeight;
    if (CS.getInstruction()->extractProfTotalWeight(TotalWeight) &&
        PSI->isHotCount(TotalWeight)) {
      Threshold = MaxIfValid(Threshold, Params.HotCallSiteThreshold);
    } else if (PSI->isFunctionEntryHot(&Callee)) {
      // If callsite hotness can not be determined, we may still know
      // that the callee is hot and treat it as a weaker hint for threshold
      // increase.
      Threshold = MaxIfValid(Threshold, Params.HintThreshold);
    } else if (PSI->isFunctionEntryCold(&Callee)) {
      Threshold = MinIfValid(Threshold, Params.ColdThreshold);
    }
  }
}

By just looking at this code without knowledge of all the other functions, we won’t be able to totally understand what it’s doing. But there’s certainly some information we can glean here:

if (Callee.hasFnAttribute(Attribute::InlineHint))
  Threshold = MaxIfValid(Threshold, Params.HintThreshold);

So now we know that LLVM takes the presence of an inlinehint attribute into account in its inlining cost model. But does Clang produce that attribute? Let’s look for that attribute in the Clang sources:

// Otherwise, propagate the inline hint attribute and potentially use its
// absence to mark things as noinline.
if (auto *FD = dyn_cast<FunctionDecl>(D)) {
  if (any_of(FD->redecls(), [&](const FunctionDecl *Redecl) {
        return Redecl->isInlineSpecified();
      })) {
    B.addAttribute(llvm::Attribute::InlineHint);
  } else if (CodeGenOpts.getInlining() ==
                 CodeGenOptions::OnlyHintInlining &&
             !FD->isInlined() &&
             !F->hasFnAttribute(llvm::Attribute::AlwaysInline)) {
    B.addAttribute(llvm::Attribute::NoInline);
  }
}

The above code from clang/lib/CodeGen/CodeGenModule.cpp shows Clang setting the inlinehint attribute. It will also possibly mark declarations as noinline if inline was not supplied (depending on compiler flags). Now we can look for code which affects isInlineSpecified:

// clang/include/clang/Sema/DeclSpec.h
bool isInlineSpecified() const {
  return FS_inline_specified | FS_forceinline_specified;
}

// clang/lib/Sema/DeclSpec.cpp
bool DeclSpec::setFunctionSpecInline(SourceLocation Loc, const char *&PrevSpec,
                                     unsigned &DiagID) {
  // 'inline inline' is ok. However, since this is likely not what the user
  // intended, we will always warn, similar to duplicates of type qualifiers.
  if (FS_inline_specified) {
    DiagID = diag::warn_duplicate_declspec;
    PrevSpec = "inline";
    return true;
  }
  FS_inline_specified = true;
  FS_inlineLoc = Loc;
  return false;
}

// Parse/ParseDecl.cpp:ParseDeclarationSpecifiers
case tok::kw_inline:
  isInvalid = DS.setFunctionSpecInline(Loc, PrevSpec, DiagID);
  break;

The above snippets show the functions which set and retrieve whether something was specified as inline, and the code in the parser which calls the setter.

So now we know that Clang propagates the presence of the inline keyword through to LLVM using the inlinehint attribute, and that LLVM takes this into account in its cost model for inlining. Our detective work has been successful!


GCC

How about GCC? The source for GCC is a fair bit less accessible than that of LLVM, but we can still make an attempt. After a bit of searching, I found that the DECL_DECLARED_INLINE_P macro seemed to be used lots of places which were relevant. So we can look for where that’s set:

// c/c-decl.c:grokdeclarator

/* Record presence of `inline' and `_Noreturn', if it is
   reasonable.  */

if (flag_hosted && MAIN_NAME_P (declarator->u.id))
  {
    if (declspecs->inline_p)
      pedwarn (loc, 0, "cannot inline function %<main%>");
    if (declspecs->noreturn_p)
      pedwarn (loc, 0, "%<main%> declared %<_Noreturn%>");
  }
else
  {
    if (declspecs->inline_p)
      /* Record that the function is declared `inline'.  */
      DECL_DECLARED_INLINE_P (decl) = 1;
    if (declspecs->noreturn_p)
      {
        if (flag_isoc99)
          pedwarn_c99 (loc, OPT_Wpedantic,
                       "ISO C99 does not support %<_Noreturn%>");
        else
          pedwarn_c99 (loc, OPT_Wpedantic,
                       "ISO C90 does not support %<_Noreturn%>");
        TREE_THIS_VOLATILE (decl) = 1;
      }
  }

We can then look for uses of this macro which seem to affect the behaviour of the inliner. Here are a few:

// ipa-inline.c:want_inline_small_function_p

/* Do fast and conservative check if the function can be good
   inline candidate.  At the moment we allow inline hints to
   promote non-inline functions to inline and we increase
   MAX_INLINE_INSNS_SINGLE 16-fold for inline functions.  */

else if ((!DECL_DECLARED_INLINE_P (callee->decl)
          && (!e->count.ipa ().initialized_p () || !e->maybe_hot_p ()))
         && ipa_fn_summaries->get (callee)->min_size
              - ipa_call_summaries->get (e)->call_stmt_size
            > MAX (MAX_INLINE_INSNS_SINGLE, MAX_INLINE_INSNS_AUTO))
  {
    e->inline_failed = CIF_MAX_INLINE_INSNS_AUTO_LIMIT;
    want_inline = false;
  }


// ipa-inline.c:edge_badness

/* ... and do not overwrite user specified hints.  */
&& (!DECL_DECLARED_INLINE_P (edge->callee->decl)
    || DECL_DECLARED_INLINE_P (caller->decl)))))


// ipa-inline.c:early_inline_small_functions

/* Do not consider functions not declared inline.  */
if (!DECL_DECLARED_INLINE_P (callee->decl)
    && !opt_for_fn (node->decl, flag_inline_small_functions)
    && !opt_for_fn (node->decl, flag_inline_functions))
  continue;

That’s some evidence that the keyword is being taken into account. If we want to get a better idea of all of the effects, then we now have a starting point from which to conduct our investigations.


Conclusion

I hope I’ve managed to convince you of two things:

  1. Some modern compilers still take inline hints into account.
  2. If you want to understand how your compiler works and it’s open source, you can just go and look.

If you’re actually going to try to optimize your code using inline, then Do The Right Thing And Measure It1. See what your compiler actually generates. Profile your code to make sure you’re optimizing something that needs optimizing. Don’t guess.


  1. DTRTAMI (dee-tee-arr-tamee) kinda has a ring to it. 

Passing overload sets to functions

Simon Brand from Simon Brand

Passing functions to functions is becoming increasingly prevalent in C++. With common advice being to prefer algorithms to loops, new library features like std::visit, lambdas being incrementally beefed up1,2, and C++ functional programming talks consistently being given at conferences, it’s something that almost all C++ programmers will need to do at some point. Unfortunately, passing overload sets or function templates to functions is not very well supported by the language. In this post I’ll discuss a few solutions and show how C++ still has a way to go in supporting this well.

An example

We have some generic operation called foo. We want a way of specifying this function which fulfils two key usability requirements.

1- It should be callable directly without requiring manually specifying template arguments:

auto a = foo(42);           //good
auto b = foo("hello");      //good
auto c = foo<double>(42.0); //bad
auto d = foo{}(42.0);       //bad

2- Passing it to a higher-order function should not require manually specifying template arguments:

std::transform(first, last, target, foo);      //good
std::transform(first, last, target, foo<int>); //bad
std::transform(first, last, target, foo{});    //okay I guess

A simple first choice would be to make it a function template:

template <class T>
T foo(T t) { /*...*/ }

This fulfils the first requirement, but not the second:

//compiles, but not what we want
std::transform(first, last, target, foo<int>);

//uh oh
std::transform(first, last, target, foo);

<source>:7:5: error: no matching function for call to 'transform'
    std::transform(first, last, target, foo);
    ^~~~~~~~~~~~~~
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/../../../../include/c++/7.2.0/bits/stl_algo.h:4295:5: note: candidate template ignored: couldn't infer template argument '_UnaryOperation'
    transform(_InputIterator __first, _InputIterator __last,
    ^
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/../../../../include/c++/7.2.0/bits/stl_algo.h:4332:5: note: candidate function template not viable: requires 5 arguments, but 4 were provided
    transform(_InputIterator1 __first1, _InputIterator1 __last1,
    ^
1 error generated.

That’s no good.

A second option is to write foo as a function object with a call operator template:

struct foo {
    template<class T>
    T operator()(T t) { /*...*/ }
};

We are now required to create an instance of this type whenever we want to use the function, which is okay for passing to other functions, but not great if we want to call it directly:

//this looks okay
std::transform(first, last, target, foo{});

//this looks strange
auto x = foo{}(42.0);
auto x = foo()(42.0);

We have similar problems when we have multiple overloads, even when we’re not using templates:

int foo (int);
float foo (float);

std::transform(first, last, target, foo); //doesn't compile
// ew ew ew ew ew ew ew
std::transform(first, last, target, static_cast<int(*)(int)>(foo));

We’re going to need a different solution.


Lambdas and LIFT

As an intermediate step, we could use the normal function template approach, but wrap it in a lambda whenever we want to pass it to another function:

std::transform(first, last, target,
               [](const auto&... xs) { return foo(xs...); });

That’s not great. It’ll work in some contexts where we don’t know what template arguments to supply, but it’s not yet suitable for all cases. One improvement would be to add perfect forwarding:

[](auto&&... xs) { return foo(std::forward<decltype(xs)>(xs)...); }

But wait, we want to be SFINAE friendly, so we’ll add a trailing return type:

[](auto&&... xs) -> decltype(foo(std::forward<decltype(xs)>(xs)...)) {
    return foo(std::forward<decltype(xs)>(xs)...);
}

Okay, it’s getting pretty crazy and expert-only at this point. And we’re not even done! Some contexts will care about noexcept:

[](auto&&... xs)
    noexcept(noexcept(foo(std::forward<decltype(xs)>(xs)...)))
    -> decltype(foo(std::forward<decltype(xs)>(xs)...)) {
    return foo(std::forward<decltype(xs)>(xs)...);
}

So the solution is to write this every time we want to pass an overloaded function to another function. That’s probably a good way to make your code reviewer cry.

What would be nice is if P0573: Abbreviated Lambdas for Fun and Profit and P0644: Forward without forward were accepted into the language. That’d let us write this:

[](xs...) => foo(>>xs...)

The above is functionally equivalent to the triplicated monstrosity in the example before. Even better, if P0834: Lifting overload sets into objects were accepted, we could write:

[]foo

That lifts the overload set into a single function object which we can pass around. Unfortunately, all of those proposals have been rejected. Maybe they can be renewed at some point, but for now we need to make do with other solutions. One such solution is to approximate []foo with a macro (I know, I know).

#define FWD(...) std::forward<decltype(__VA_ARGS__)>(__VA_ARGS__)

#define LIFT(X) [](auto &&... args)                  \
    noexcept(noexcept(X(FWD(args)...)))              \
    -> decltype(X(FWD(args)...))                     \
{                                                    \
    return X(FWD(args)...);                          \
}

Now our higher-order function call becomes:

std::transform(first, last, target, LIFT(foo));

Okay, so there’s a macro in there, but it’s not too bad (you know we’re in trouble when I start trying to justify the use of macros for this kind of thing). So LIFT is at least some solution.
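
To see the whole thing work end-to-end, here is a self-contained sketch (foo is the overload set from earlier; bar is my addition) showing LIFT coping with both an overload set and a function template, neither of which could be passed to std::transform directly:

#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

#define FWD(...) std::forward<decltype(__VA_ARGS__)>(__VA_ARGS__)
#define LIFT(X) [](auto &&... args)                  \
    noexcept(noexcept(X(FWD(args)...)))              \
    -> decltype(X(FWD(args)...))                     \
    { return X(FWD(args)...); }

int foo(int i) { return i + 1; }
float foo(float f) { return f + 1.0f; }

template <class T>
T bar(T t) { return t; }

int main() {
    std::vector<int> in{1, 2, 3};
    std::vector<int> out;

    // Both of these fail to compile if foo or bar is passed directly.
    std::transform(in.begin(), in.end(), std::back_inserter(out), LIFT(foo));
    std::transform(in.begin(), in.end(), std::back_inserter(out), LIFT(bar));
}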


Making function objects work for us

You might recall from a number of examples ago that the problem with using function object types was the need to construct an instance whenever we needed to call the function. What if we make a global instance of the function object?

struct foo_impl {
    //template
    template<class T>
    T operator()(T t) const { /*...*/ }

    //overloads
    int operator()(int) const { /*...*/ }
    float operator()(float) const { /*...*/ }
};

extern const foo_impl foo;

// in some .cpp file (which includes the declaration above)
const foo_impl foo{};

This works if you’re able to have a single translation unit with the definition of the global object. If you’re writing a header-only library then you don’t have that luxury, so you need to do something different.

struct foo_impl {
    template<class T>
    T operator()(T t) const { /*...*/ }
};

static constexpr foo_impl foo{};

This might look innocent, but it can lead to One-Definition Rule (ODR) violations3:

//test.h header
struct foo_impl {
    template<class T>
    T operator()(T t) const { return t; }
};

static constexpr foo_impl foo{};

template <class T>
int oh_no(T t) {
    auto* foop = &foo;
    return (*foop)(t);
}

//cpp1
#include "test.h"
int sad() {
    return oh_no(42);
}

//cpp2
#include "test.h"
int also_sad() {
    return oh_no(24);
}

Since foo is declared static, each Translation Unit (TU) gets its own definition of the variable. However, sad and also_sad each instantiate oh_no, and in each TU that instantiation’s &foo refers to a different object, so the program contains two differing definitions of oh_no<int>. This is undefined behaviour by [basic.def.odr]/12.2.

In C++17 the solution is simple:

inline constexpr foo_impl foo{};

The inline allows the variable to be multiply-defined, and the linker will throw away all but one of the definitions.
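
Applied to the earlier header, that is a one-line change (a sketch; fine_now is just oh_no renamed to reflect its improved circumstances):

//test.h header, now ODR-safe
struct foo_impl {
    template<class T>
    T operator()(T t) const { return t; }
};

inline constexpr foo_impl foo{}; // a single object shared by every TU

template <class T>
int fine_now(T t) {
    auto* foop = &foo; // the same address in every instantiation
    return (*foop)(t);
}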

If you can’t use C++17, there are a few solutions given in N4424: Inline Variables. The Ranges V3 library uses a reference to a static member of a template class:

template <class T>
struct static_const {
    static constexpr T value{};
};

template <class T>
constexpr T static_const<T>::value;

constexpr auto& foo = static_const<foo_impl>::value;

This works because static_const<foo_impl>::value is a static data member of a class template: it has external linkage, so every TU that instantiates it refers to the same object. foo itself is internal to each TU, but since it is just a reference to that shared object, &foo is identical everywhere.

An advantage of the function object approach is that function objects designed carefully make for much better customisation points than the traditional techniques used in the standard library. See Eric Niebler’s blog post and standards paper for more information.
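
To sketch the idea (this is my own minimal example in the spirit of those posts, not code taken from them): a function object can perform the using std::swap; two-step internally, so every caller gets ADL-aware dispatch without having to remember the dance themselves.

#include <utility>

namespace lib {
    namespace detail {
        struct swap_fn {
            template <class T>
            void operator()(T& a, T& b) const {
                using std::swap; // fall back to std::swap...
                swap(a, b);      // ...but prefer overloads found by ADL
            }
        };
    }
    // Callers just write lib::swap(a, b).
    inline constexpr detail::swap_fn swap{};
}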

A disadvantage is that now we need to write all of the functions we want to use this way as function objects, which is not great at the best of times, and even worse if we want to use external libraries. One possible solution would be to combine the two techniques we’ve already seen:

// This could be in an external library
namespace lib {
    template <class T>
    T foo(T t) { /*...*/ }
}

namespace lift {
    inline constexpr auto foo = LIFT(lib::foo);
}

Now we can use lift::foo instead of lib::foo and it fits the requirements I laid out at the start of the post. Unfortunately, I think it’s still possible to hit ODR violations with this approach, because the closure types produced by the macro may differ across TUs. I’m not sure what the best workaround for this is, so input is appreciated.


Conclusion

I’ve given you a few solutions to the problem I showed at the start, so what’s my conclusion? C++ still has a way to go to support this paradigm of programming, and teaching these ideas is a nightmare. If a beginner or even intermediate programmer asks how to pass overloaded functions around – something which sounds like it should be fairly easy – it’s a real shame that the best answers I can come up with are “Copy this macro which you have no chance of understanding”, or “Make function objects, but make sure you do it this way for reasons which I can’t explain unless you understand the subtleties of ODR4”. I feel like the language could be doing more to support these use cases.

Maybe for some people “Do it this way and don’t ask why” is an okay answer, but that’s not very satisfactory to me. Maybe I lack imagination and there’s a better way to do this with what’s already available in the language. Send me your suggestions or heckles on Twitter @TartanLlama.


Thanks to Michael Maier for the motivation to write this post; Jayesh Badwaik, Ben Craig, Michał Dominiak and Kévin Boissonneault for discussion on ODR violations; and Eric Niebler, Barry Revzin, Louis Dionne, and Michał Dominiak (again) for their work on the libraries and standards papers I referenced.


  1. P0315: Lambdas in unevaluated contexts 

  2. P0624: Default constructible and assignable stateless lambdas 

  3. Example lovingly stolen from n4381. 

  4. Disclaimer: I don’t understand all the subtleties of ODR. 