Replication: not always worth the effort

Derek Jones from The Shape of Code

Replication is the means by which mistakes get corrected in science. A researcher does an experiment and gets a particular result, but unknown to them one or more unmeasured factors (or just chance) had a significant impact. Another researcher does the same experiment and fails to get the same result, and eventually, many experiments later, people figure out what is going on and what the actual answer is.

In practice, replication has become a low-status activity: journals want to publish papers containing new results, not papers backing up or refuting the results of previously published papers. The dearth of replication has led to questions being raised about large swathes of published results. Most journals only publish papers that contain positive results, i.e., something was shown to some level of statistical significance; only publishing positive results produces publication bias (there have been calls for journals that publish negative results).

Sometimes, repeating an experiment does not seem worth the effort. One such example is An Explicit Strategy to Scaffold Novice Program Tracing. It looks like the authors ran a proper experiment and did everything they were supposed to do; but I think the reason it got a positive result was luck.

The experiment involved 24 subjects and these were randomly assigned to one of two groups. Looking at the results (figures 4 and 5), it appears that two of the subjects had much lower ability than the other subjects (the authors did discuss the performance of these two subjects). Both of these subjects were assigned to the control group (there is roughly a 25% chance of this happening, but nobody knew what the situation was until the experiment was run), pulling down the average of the control group and making the other (strategy) group appear to show an improvement (i.e., the teaching strategy improved student performance).
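The exact probability is easy to compute (a quick sketch; the 25% figure above is the approximation that treats each assignment as an independent coin flip):

```python
from math import comb

# 24 subjects split at random into two groups of 12.
# P(both of two particular subjects land in the control group):
# fix those two in control, then choose the other 10 control
# members from the remaining 22 subjects.
p = comb(22, 10) / comb(24, 12)
print(round(p, 3))  # → 0.239
```

So the exact figure is 12/24 × 11/23 ≈ 24%, close enough to the 25% quoted.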

Had one, or both, low performers been assigned to the other (strategy) group, no experimental effect would have shown up in the results, significantly reducing the probability that the paper would have been accepted for publication.

Why did the authors submit the paper for publication? Well, academic performance is based on papers published (quality of the journals they appear in, number of citations, etc.); a positive result is reason enough to submit for publication. The researchers did what they have been incentivized to do.

I hope the authors of the paper continue with their experiments. Life is full of chance effects and the only way to get a solid result is to keep on trying.

ACCU Conference 2018

Products, the Universe and Everything from Products, the Universe and Everything

We had an absolute blast at this year's ACCU Conference, and if you were there we imagine you did too. For us the highlight had to be the launch of #include <C++>, a new global, inclusive, and diverse community for developers interested in C++.
This is an initiative that has been brewing for a while, and one we're very happy to be a part of. Above all else, #include <C++> is designed to be a safe place for developers irrespective of their background, ethnicity, gender identity or sexuality. The group runs a Discord server, which is moderated to ensure that it remains a safe space, and which you are welcome to join.
On the technical front, one unexpected highlight of the conference was Benjamin Missel's wonderful short talk on writing a C compiler for a BBC Micro, during which he demonstrated SSHing into a BBC Model B through the serial port! Most conference sessions were recorded so even if you weren't there you can still watch them. See you at ACCU 2019!

ResOrg 2.0.7.27 has been released

Products, the Universe and Everything from Products, the Universe and Everything

ResOrg 2.0.7.27 has just been released. This is a maintenance update for ResOrg 2.0, and is compatible with all ResOrg 2.0 licence keys. The following changes are included:
  • Fixed a bug in the Symbols Display which could cause some "OK" symbols to be incorrectly shown in the "Problem Symbols Only" view.
  • Corrected the upper range limit for control symbols from 28671 (0x6FFF) to 57343 (0xDFFF).
  • ResOrg binaries are now dual signed with both SHA1 and SHA256.
  • Added support for Visual Studio 2017.
  • Corrected the File Save Dialog filters used by the ResOrgApp "File | Export" command.
  • The ResOrgApp "File | Export", "File | Save", "File | Save As" and "File | Properties" commands (which apply only to symbol file views) are now disabled when the active view is a report.
  • Fixed a crash in the Symbol File Properties Dialog.
  • Fixed a typo on the Symbol File "Next Values" page.
  • Various minor improvements to the installer.
Download ResOrg 2.0.7.27

The Product Owner is dead, long live the Product Owner!

Allan Kelly from Allan Kelly Associates


For years I have been using this picture to describe the Product Owner role. For years I have been saying:

“The title Product Owner is really an alias. Nobody should have Product Owner on their business cards. Product Owner is a Scrum defined role which is usually filled by a Product Manager or Business Analyst, sometimes it is filled by a Domain Expert (also known as a Subject Matter Expert, SME) and sometimes by someone else.”

Easy right?

In telling us about the Product Owner, Scrum tells us what one of these people will be doing within the Scrum setting. Scrum doesn’t tell us how the Product Owner knows what they need to know to make those decisions – that comes by virtue of the fact that underneath they are really a Product Manager, BA or expert in the field.

In the early descriptions of Scrum there was a tangible feel that the Product Owner really had the authority to make decisions – they were the OWNER. I still hope that is true, but these days the person playing Product Owner is more likely to be a proxy for one or more real customers.

I go on to say:

“In a software company, like Microsoft or Adobe, Product Managers normally fill the role of Product Owner. The defining feature of the Product Manager role is that their customers are not in the building. The first task facing a new Product Manager is to work out who their customers are – or should be – and then get out to meet them. By definition customers are external.”

“Conversely, in a corporate setting, like HSBC, Lufthansa or Procter & Gamble, a Product Owner is probably a Business Analyst. Their job is to analyse some aspect of the business and make it better. By definition their customers are in the building.”

With me so far?

Next I point out that, having set up this nice model, these roles are increasingly confused, because software product companies increasingly sell their software as a service, and corporates interact with their customers online, which means customer contact is now through the computer.

Consider the airline industry: twenty years ago the only people who interacted with airline systems from United, BA, Lufthansa, etc. were airline employees. If you wanted to book a flight you went to a travel agent and a nice lady used a green screen to tell you what was available.

Today, whether you book with Lufthansa, SouthWest or Norwegian may well come down to which has the best online booking system.

Business Analysts need to be able to think like Product Managers, and Product Managers need to be able to think like Business Analysts.

I regularly see online posts proclaiming “Product Managers are not Product Owners” or “Business Analysts are not Product Owners.” I’ve joined in with this; my alias argument says “they might be, but there is so much more to those roles.”

It makes me sad to see the Product Manager role reduced to a Product Owner: the Product Owner role as defined by Scrum is a mere shadow of what a good Product Manager should be.

But the world has moved on, things have changed.

The world has decided that Product Owner is the role: the person who deals with the demand side, the person who decides what is needed and what is to be built.

I think it’s time to change my model. The collision between the world of Business Analysts and Product Managers is now complete. The result is an even bigger mess and a new role has appeared: “Digital Business Analyst” – the illegitimate love child of Business Analysis and Product Management.

The Product Owner is now a superset of Product Manager and Business Analyst.


Product Owners today may well need the skills of business analysis. They are even more likely to need the skills of Product Management. And they are frequently expected to know about the domain.

Today’s Product Owner may well have a Subject Matter Expert background, in which case they quickly need to learn about Product Ownership, Product Management and Business Analysis.

Or they may have a Business Analysis background and need to absorb Product Management skills. Conversely, Product Owners may come from a Product Management background and may quickly need to learn some Business Analysis. In either case they will learn about the domain but they may want to bring in a Subject Matter Expert too.

To make things harder, exactly which skills they need, and which skills are most important is going to vary from team to team and role to role.

The post The Product Owner is dead, long live the Product Owner! appeared first on Allan Kelly Associates.

Examples of SQL join types (LEFT JOIN, INNER JOIN etc.)

Andy Balaam from Andy Balaam's Blog

I have 2 tables like this:

> SELECT * FROM table_a;
+------+------+
| id   | name |
+------+------+
|    1 | row1 |
|    2 | row2 |
+------+------+

> SELECT * FROM table_b;
+------+------+------+
| id   | name | aid  |
+------+------+------+
|    3 | row3 |    1 |
|    4 | row4 |    1 |
|    5 | row5 | NULL |
+------+------+------+

INNER JOIN cares about both tables

INNER JOIN cares about both tables, so you only get a row if both tables have one. If there is more than one matching pair, you get multiple rows.

> SELECT * FROM table_a a INNER JOIN table_b b ON a.id=b.aid;
+------+------+------+------+------+
| id   | name | id   | name | aid  |
+------+------+------+------+------+
|    1 | row1 |    3 | row3 | 1    |
|    1 | row1 |    4 | row4 | 1    |
+------+------+------+------+------+

It makes no difference to INNER JOIN if you reverse the order, because it cares about both tables:

> SELECT * FROM table_b b INNER JOIN table_a a ON a.id=b.aid;
+------+------+------+------+------+
| id   | name | aid  | id   | name |
+------+------+------+------+------+
|    3 | row3 | 1    |    1 | row1 |
|    4 | row4 | 1    |    1 | row1 |
+------+------+------+------+------+

You get the same rows, but the columns are in a different order because we mentioned the tables in a different order.

LEFT JOIN only cares about the first table

LEFT JOIN cares about the first table you give it, and doesn’t care much about the second, so you always get the rows from the first table, even if there is no corresponding row in the second:

> SELECT * FROM table_a a LEFT JOIN table_b b ON a.id=b.aid;
+------+------+------+------+------+
| id   | name | id   | name | aid  |
+------+------+------+------+------+
|    1 | row1 |    3 | row3 | 1    |
|    1 | row1 |    4 | row4 | 1    |
|    2 | row2 | NULL | NULL | NULL |
+------+------+------+------+------+

Above you can see all rows of table_a, even though some of them do not match anything in table_b, but not all rows of table_b – only the ones that match something in table_a.

If we reverse the order of the tables, LEFT JOIN behaves differently:

> SELECT * FROM table_b b LEFT JOIN table_a a ON a.id=b.aid;
+------+------+------+------+------+
| id   | name | aid  | id   | name |
+------+------+------+------+------+
|    3 | row3 | 1    |    1 | row1 |
|    4 | row4 | 1    |    1 | row1 |
|    5 | row5 | NULL | NULL | NULL |
+------+------+------+------+------+

Now we get all rows of table_b, but only matching rows of table_a.

RIGHT JOIN only cares about the second table

a RIGHT JOIN b gets you exactly the same rows as b LEFT JOIN a. The only difference is the default order of the columns.

> SELECT * FROM table_a a RIGHT JOIN table_b b ON a.id=b.aid;
+------+------+------+------+------+
| id   | name | id   | name | aid  |
+------+------+------+------+------+
|    1 | row1 |    3 | row3 | 1    |
|    1 | row1 |    4 | row4 | 1    |
| NULL | NULL |    5 | row5 | NULL |
+------+------+------+------+------+

This is the same rows as table_b LEFT JOIN table_a, which we saw in the LEFT JOIN section.

Similarly:

> SELECT * FROM table_b b RIGHT JOIN table_a a ON a.id=b.aid;
+------+------+------+------+------+
| id   | name | aid  | id   | name |
+------+------+------+------+------+
|    3 | row3 | 1    |    1 | row1 |
|    4 | row4 | 1    |    1 | row1 |
| NULL | NULL | NULL |    2 | row2 |
+------+------+------+------+------+

This gives the same rows as table_a LEFT JOIN table_b.

No join at all gives you copies of everything

If you write your tables with no JOIN clause at all, just separated by commas, you get every row of the first table written next to every row of the second table, in every possible combination:

> SELECT * FROM table_b b, table_a;
+------+------+------+------+------+
| id   | name | aid  | id   | name |
+------+------+------+------+------+
|    3 | row3 | 1    |    1 | row1 |
|    3 | row3 | 1    |    2 | row2 |
|    4 | row4 | 1    |    1 | row1 |
|    4 | row4 | 1    |    2 | row2 |
|    5 | row5 | NULL |    1 | row1 |
|    5 | row5 | NULL |    2 | row2 |
+------+------+------+------+------+
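If you want to experiment, the tables and queries above are easy to reproduce with Python's built-in sqlite3 module (a sketch; note that SQLite only gained RIGHT JOIN support in version 3.39, so just the INNER, LEFT and comma joins are shown):

```python
import sqlite3

# Build the two example tables in an in-memory database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table_a (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE table_b (id INTEGER, name TEXT, aid INTEGER)")
cur.executemany("INSERT INTO table_a VALUES (?, ?)", [(1, "row1"), (2, "row2")])
cur.executemany("INSERT INTO table_b VALUES (?, ?, ?)",
                [(3, "row3", 1), (4, "row4", 1), (5, "row5", None)])

inner = cur.execute(
    "SELECT * FROM table_a a INNER JOIN table_b b ON a.id = b.aid").fetchall()
left = cur.execute(
    "SELECT * FROM table_a a LEFT JOIN table_b b ON a.id = b.aid").fetchall()
cross = cur.execute("SELECT * FROM table_b b, table_a a").fetchall()

# 2 matched rows, 2 matched + 1 unmatched row, 3 x 2 combinations.
print(len(inner), len(left), len(cross))  # → 2 3 6
```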

Another way to use Emacs to convert DOS/Unix line endings

Timo Geusch from The Lone C++ Coder's Blog

I’ve previously blogged about using Emacs to convert line endings and use it as an alternative to the dos2unix/unix2dos tools. Using set-buffer-file-coding-system works well and has been my go-to conversion method. That said, there is another way to do the same conversion by using M-x recode-region. As the name implies, recode-region works on a region. […]

The post Another way to use Emacs to convert DOS/Unix line endings appeared first on The Lone C++ Coder's Blog.

Custom Alias Analysis in LLVM

Simon Brand from Simon Brand

At Codeplay I currently work on a compiler backend for an embedded accelerator. The backend is the part of the compiler which takes some representation of the source code and translates it to machine code. In my case I’m working with LLVM, so this representation is LLVM IR.

It’s very common for accelerators like GPUs to have multiple regions of addressable memory – each with distinct properties. One important optimisation I’ve implemented recently is extending LLVM’s alias analysis functionality to handle the different address spaces for our target architecture.

This article will give an overview of the aim of alias analysis – along with an example based around address spaces – and show how custom alias analyses can be expressed in LLVM. You should be able to understand this article even if you don’t work with compilers or LLVM; hopefully it gives some insight into what compiler developers do to make the tools you use generate better code. LLVM developers may find the implementation section helpful, but may want to read the documentation or examples linked at the bottom for more details.

What is alias analysis?

Alias analysis is the process of determining whether two pointers can point to the same object (alias) or not. This is very important for some valuable optimisations.

Take this C function as an example:

int foo (int __attribute__((address_space(0)))* a,
         int __attribute__((address_space(1)))* b) {
    *a = 42;
    *b = 20;
    return *a;
}

Those __attribute__s specify that a points to an int in address space 0, and b points to an int in address space 1. An important detail of the target architecture for this code is that address spaces 0 and 1 are completely distinct: modifying memory in address space 0 can never affect memory in address space 1. Here’s some LLVM IR which could be generated from this function:

define i32 @foo(i32 addrspace(0)* %a, i32 addrspace(1)* %b) #0 {
entry:
  store i32 42, i32 addrspace(0)* %a, align 4
  store i32 20, i32 addrspace(1)* %b, align 4
  %0 = load i32, i32* %a, align 4
  ret i32 %0
}

For those unfamiliar with LLVM IR, the first store is storing 42 into *a, the second storing 20 into *b. The %0 = ... line is like loading *a into a temporary variable, which is then returned in the final line.

Optimising foo

Now we want foo to be optimised. Can you see an optimisation which could be made?

What we really want is for that load from a (the line beginning %0 = ...) to be removed and for the final statement to instead return 42. We want the optimised code to look like this:

define i32 @foo(i32 addrspace(0)* %a, i32 addrspace(1)* %b) #0 {
entry:
  store i32 42, i32 addrspace(0)* %a, align 4
  store i32 20, i32 addrspace(1)* %b, align 4
  ret i32 42
}

However, we have to be very careful, because this optimisation is only valid if a and b do not alias, i.e. they must not point at the same object. Forgetting about the address spaces for a second, consider this call to foo where we pass pointers which do alias:

int i = 0;
int result = foo(&i, &i);

Inside the unoptimised version of foo, i will be set to 42, then to 20, then 20 will be returned. However, if we carry out the desired optimisation then the two stores will occur, but 42 will be returned instead of 20. We’ve just broken the behaviour of our function.

The only way that a compiler can reasonably carry out the above optimisation is if it can prove that the two pointers cannot possibly alias. This reasoning is carried out through alias analysis.

Custom alias analysis in LLVM

As I mentioned above, address spaces 0 and 1 for our target architecture are distinct. However, this may not hold for some systems, so LLVM cannot assume that it holds in general: we need to make it explicit.

One way to achieve this is to use llvm::AAResultBase. If our target is called TAR, then we can create a class called TARAAResult which inherits from AAResultBase<TARAAResult>¹:

class TARAAResult : public AAResultBase<TARAAResult> {
public:
  explicit TARAAResult() : AAResultBase() {}
  TARAAResult(TARAAResult &&Arg) : AAResultBase(std::move(Arg)) {}

  AliasResult alias(const MemoryLocation &LocA, const MemoryLocation &LocB);
};

The magic happens in the alias member function, which takes two MemoryLocations and returns an AliasResult. The result indicates whether the locations can never alias, may alias, partially alias, or precisely alias. We want our analysis to say “If the address spaces for the two memory locations are different, then they can never alias”. The resulting code is surprisingly close to this English description:

AliasResult TARAAResult::alias(const MemoryLocation &LocA,
                               const MemoryLocation &LocB) {
  auto AsA = LocA.Ptr->getType()->getPointerAddressSpace();
  auto AsB = LocB.Ptr->getType()->getPointerAddressSpace();

  if (AsA != AsB) {
    return NoAlias;
  }

  // Forward the query to the next analysis.
  return AAResultBase::alias(LocA, LocB);
}

Alongside this you need a bunch of boilerplate for creating a pass out of this analysis (I’ll link to a full example at the end), but after that’s done you just register the pass and ensure that the results of it are tracked:

void TARPassConfig::addIRPasses() {
  addPass(createTARAAWrapperPass());
  auto AnalysisCallback = [](Pass &P, Function &, AAResults &AAR) {
    if (auto *WrapperPass = P.getAnalysisIfAvailable<TARAAWrapper>()) {
      AAR.addAAResult(WrapperPass->getResult());
    }
  }; 
  addPass(createExternalAAWrapperPass(AnalysisCallback));
  TargetPassConfig::addIRPasses();
}

We also want to ensure that there is an optimisation pass which will remove unnecessary loads and stores which our new analysis will find. One example is Global Value Numbering.

  addPass(createNewGVNPass());

After all this is done, running the optimiser on the LLVM IR from the start of the post will eliminate the unnecessary load, and the generated code will be faster as a result.

You can see an example with all of the necessary boilerplate in LLVM’s test suite. The AMDGPU target has a full implementation which is essentially what I presented above extended for more address spaces.

Hopefully you have got a taste of what alias analysis does and the kinds of work involved in writing compilers, even if I have not gone into a whole lot of detail. I may write some more short posts on other areas of compiler development if readers find this interesting. Let me know on Twitter or in the comments!


  1. This pattern of inheriting from a class and passing the derived class as a template argument is known as the Curiously Recurring Template Pattern

A Python Data Science hackathon

Derek Jones from The Shape of Code

I was at the Man AHL Hackathon this weekend. The theme was improving the Python Data Science ecosystem. Around 15 or so project titles had been distributed around the tables in the Man AHL cafeteria, and the lead person for each project gave a brief presentation. Stable laws in SciPy sounded interesting to me, and their room location included comfy seating (avoiding a numb bum is an under-appreciated aspect of choosing a hackathon team, and wooden bench seating is numbing after a while).

Team Stable laws consisted of Andrea, Rishabh, Toby and yours truly. Our aim was to implement the Stable distribution as a Python module, to be included in the next release of SciPy (the availability had been announced a while back, and there has been one attempt at an implementation, which seems to contain a few mistakes).

We were well-fed and watered by Man AHL, including fancy cream buns and late night sushi.

A probability distribution is stable if linear combinations of independent copies of it have the same distribution, up to location and scale; the Gaussian, or Normal, distribution is the best-known stable distribution, and the central limit theorem leads many to think that that is that.
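The stability property can be checked numerically for the Gaussian case: convolving the N(0,1) density with itself should reproduce the N(0,√2) density (a quick sketch using a plain Riemann sum, no SciPy needed; the grid limits and tolerance are my own choices):

```python
import math

def normal_pdf(x, sigma=1.0):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

# Density of the sum of two independent N(0,1) variables,
# computed by numerical convolution on a grid over [-10, 10].
h = 0.01
grid = [i * h for i in range(-1000, 1001)]

def sum_pdf(z):
    return sum(normal_pdf(t) * normal_pdf(z - t) for t in grid) * h

# Stability: the sum is again Gaussian, with scale sqrt(1 + 1) = sqrt(2).
for z in (0.0, 0.5, 1.0, 2.0):
    assert abs(sum_pdf(z) - normal_pdf(z, math.sqrt(2.0))) < 1e-9
print("N(0,1) + N(0,1) is N(0, sqrt(2))")
```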

Two other named stable distributions are the Cauchy distribution and, most interestingly (from my perspective this weekend), the Lévy distribution. Both distributions have very fat tails; the mean and variance of the Cauchy distribution are undefined (i.e., the sample values jump around as the sample size increases, never converging to a fixed value), while they are both infinite for the Lévy distribution.

Analytic expressions exist for various characteristics of the Stable distribution (e.g., probability distribution function), with the Gaussian, Cauchy and Lévy distributions as special cases. While solutions for implementing these expressions have been proposed, care is required; the expressions are ill-behaved in different ways over some intervals of their parameter values.

Andrea has spent several years studying the Stable distribution analytically and was keen to create an implementation. My approach for complicated stuff is to find an existing implementation and adopt it. While everybody else worked their way through the copious papers that Andrea had brought along, I searched for existing implementations.

I found several implementations, but they all suffered from using approaches that delivered discontinuities and poor accuracies over some range of parameter values.

Eventually I got lucky and found a paper by Royuela-del-Val, Simmross-Wattenberg and Alberola-López, which described their implementation in C: Libstable (licensed under the GPL, perfect for SciPy); they also provided lots of replication material from their evaluation. An R package was available, but no Python support.

No other implementations were found. Team Stable laws decided to create a new implementation in Python and to create a Python module to interface to the C code in libstable (the bit I got to do). Two implementations would allow performance and accuracy to be compared (accuracy checks really need three implementations to get some idea of which might be incorrect, when output differs).

One small fix was needed to build libstable under OS X (change Makefile to link against .so library, rather than .a) and a different fix was needed to install the R package under OS X (R patch; Windows and generic Unix were fine).

Python’s ctypes module looked after loading the C shared library I built, along with converting the NumPy arrays. My PyStable module will not win any Python beauty contest; it is a means of supporting the comparison of multiple implementations.
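The ctypes pattern is short enough to sketch. libstable itself is not generally available, so the snippet below loads the C maths library instead; this is an illustration of the mechanism, not the actual PyStable code, but the key steps of loading a shared library and declaring argument/return types are the same:

```python
import ctypes
import ctypes.util

# Load a C shared library; for libstable this would be the .so built earlier.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature explicitly: without this, ctypes assumes
# int arguments and an int return value.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

assert abs(libm.cos(0.0) - 1.0) < 1e-12
```

For functions taking arrays (as the libstable entry points do), NumPy's numpy.ctypeslib provides the corresponding pointer conversions.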

Some progress was made towards creating a new implementation, more than 24 hours is obviously needed (libstable contains over 4,000 lines of code). I had my own problems with an exception being raised in calls to stable_pdf; libstable used the GNU Scientific Library and I tracked the problem down to a call into GSL, but did not get any further.

We all worked overnight, my first 24-hour hack in a very long time (I got about 4-hours sleep).

After Sunday lunch around 10 teams presented and after a quick deliberation, Team Stable laws were announced as the winners; yea!

Hopefully, over the coming weeks a usable implementation will come into being.

On Quaker’s Dozen – student

student from thus spake a.k.

The Baron's latest wager set Sir R----- the task of rolling a higher score with two dice than the Baron should with one twelve sided die, giving him a prize of the difference between them should he have done so. Sir R-----'s first roll of the dice would cost him two coins and twelve cents and he could elect to roll them again as many times as he desired for a further cost of one coin and twelve cents each time, after which the Baron would roll his.
The simplest way to reckon the fairness of this wager is to re-frame its terms; to wit, that Sir R----- should pay the Baron one coin to play and thereafter one coin and twelve cents for each roll of his dice, including the first. The consequence of this is that before each roll of the dice Sir R----- could have expected to receive the same bounty, provided that he wrote off any losses that he had made beforehand.