Evaluating estimation performance

Derek Jones from The Shape of Code

What is the best way to evaluate the accuracy of an estimation technique, given that the actual values are known?

Estimates are often given as point values, and accuracy scoring functions (for a sequence of estimates) have the form $S = \frac{1}{n}\sum_{i=1}^{n} S(E_i, A_i)$, where $n$ is the number of estimated values, $E_i$ the estimates, and $A_i$ the actual values; smaller $S$ is better.

Commonly used scoring functions include:

  • $S(E, A) = (E - A)^2$, known as squared error (SE)
  • $S(E, A) = |E - A|$, known as absolute error (AE)
  • $S(E, A) = |E - A| / A$, known as absolute percentage error (APE)
  • $S(E, A) = |E - A| / E$, known as relative error (RE)

APE and RE are special cases of $S(E, A) = \left|1 - (A/E)^{\beta}\right|$, with $\beta = -1$ and $\beta = 1$ respectively.
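The scoring functions above can be sketched in Python as follows. This is a minimal illustration, not the code used for the post; the function names are made up for this example, and all actual and estimated values are assumed to be positive.

```python
def squared_error(e, a):
    """SE: (E - A)^2."""
    return (e - a) ** 2

def absolute_error(e, a):
    """AE: |E - A|."""
    return abs(e - a)

def absolute_percentage_error(e, a):
    """APE: |E - A| / A (error relative to the actual)."""
    return abs(e - a) / a

def relative_error(e, a):
    """RE: |E - A| / E (error relative to the estimate)."""
    return abs(e - a) / e

def general_score(e, a, beta):
    """General form |1 - (A/E)^beta|; beta=-1 gives APE, beta=1 gives RE."""
    return abs(1 - (a / e) ** beta)

def mean_score(score, estimates, actuals):
    """Average a per-item scoring function over paired estimates and actuals."""
    pairs = list(zip(estimates, actuals))
    return sum(score(e, a) for e, a in pairs) / len(pairs)

# Made-up values, purely for illustration.
estimates = [8.0, 5.0, 3.0]
actuals = [10.0, 4.0, 3.0]
print(mean_score(absolute_percentage_error, estimates, actuals))
print(mean_score(relative_error, estimates, actuals))
```

Note that APE and RE differ only in whether the error is divided by the actual or by the estimate, so swapping the argument order of one gives the other.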

Let’s compare three techniques for estimating the time needed to implement some tasks, using these four functions.

Assume that the mean time taken to implement previous project tasks is known, E_m. When asked to implement a new task, an optimist might estimate 20% lower than the mean, E_o=E_m*0.8, while a pessimist might estimate 20% higher than the mean, E_p=E_m*1.2. Data shows that the distribution of the number of tasks taking a given amount of time to implement is skewed, looking something like one of the lines in the plot below (code):

Two example distributions of number of tasks taking a given amount of time to implement.

We can simulate task implementation time by randomly drawing values from a distribution having this shape, e.g., zero-truncated Negative binomial or zero-truncated Weibull. The values of E_o and E_p are calculated from the mean, E_m, of the distribution used (see code for details). Below is each estimator’s score for each of the scoring functions (the best performing estimator for each scoring function is marked with an asterisk; 10,000 values were used to reduce small sample effects):

     SE     AE     APE    RE
E_o  2.73   1.29   0.51   0.56*
E_m  2.31*  1.23*  0.39   0.68
E_p  2.70   1.37   0.36*  0.86
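A simulation along these lines might look like the following sketch. It is not the author's code: the Weibull scale (3.0) and shape (1.5), the rounding to whole time units, and the seed are all assumptions made for illustration. The printed scores depend on these choices, so they are not expected to match the table, and which estimator wins under APE and RE is also sensitive to whether the error is divided by the actual or by the estimate.

```python
import random

# Scoring functions as defined earlier; e is the estimate, a the actual.
SCORES = {
    "SE": lambda e, a: (e - a) ** 2,
    "AE": lambda e, a: abs(e - a),
    "APE": lambda e, a: abs(e - a) / a,
    "RE": lambda e, a: abs(e - a) / e,
}

def zero_truncated_weibull(rng, scale, shape):
    """Draw a whole number of time units, redrawing values that round to zero.

    Discretising by rounding is an assumption of this sketch.
    """
    while True:
        value = round(rng.weibullvariate(scale, shape))
        if value > 0:
            return value

def compare_estimators(actuals):
    """Mean score of the optimist, mean, and pessimist constant estimates."""
    mean_actual = sum(actuals) / len(actuals)
    estimates = {
        "E_o": 0.8 * mean_actual,
        "E_m": mean_actual,
        "E_p": 1.2 * mean_actual,
    }
    return {
        name: {
            label: sum(score(estimate, a) for a in actuals) / len(actuals)
            for label, score in SCORES.items()
        }
        for name, estimate in estimates.items()
    }

rng = random.Random(42)  # fixed seed so runs are repeatable
actuals = [zero_truncated_weibull(rng, scale=3.0, shape=1.5) for _ in range(10_000)]
results = compare_estimators(actuals)
for name, scores in results.items():
    print(name, {label: round(value, 2) for label, value in scores.items()})
```

One property that holds whatever the distribution: the mean of the actuals minimises squared error, so E_m always has the lowest SE of the three.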

Surprisingly, the identity of the best performing estimator (i.e., optimist, mean, or pessimist) depends on the scoring function used. What is going on?

The analysis of scoring functions is very new. A 2010 paper by Gneiting showed that it does not make sense to select the scoring function after the estimates have been made (he uses the term forecasts). The scoring function needs to be known in advance, to allow an estimator to tune their responses to minimise the value that will be calculated to evaluate performance.

The mathematics involves Bregman functions (new to me), which provide a measure of distance between two points, where the points are interpreted as probability distributions.

Which, if any, of these scoring functions should be used to evaluate the accuracy of software estimates?

In software estimation, perhaps the two most commonly used scoring functions are APE and RE. If management selects one or the other as the scoring function to rate developer estimation performance, what estimation technique should employees use to deliver the best performance?

Assuming that information is available on the actual time taken to implement previous project tasks, we can work out the distribution of actual times. Assuming this distribution does not change, we can calculate APE and RE for various estimation techniques, and pick the technique that produces the lowest score.

Let’s assume that the distribution of actual times is zero-truncated Negative binomial in one project and zero-truncated Weibull in another (purely for convenience of analysis, reality is likely to be more complicated). Management has chosen either APE or RE as the scoring function, and it is now up to team members to decide the estimation technique they are going to use, with the aim of optimising their estimation performance evaluation.

A developer seeking to minimise the effort invested in estimating could specify the same value for every estimate. Knowing the scoring function (top row) and the distribution of actual implementation times (first column), the minimum-effort developer would always estimate the known mean of the actual times multiplied by the value listed:

                   APE   RE
Negative binomial  1.4   0.5
Weibull            1.2   0.6

For instance, if management specifies APE, and the actual times of previous tasks have a Weibull distribution, then always estimate the value 1.2*E_m.
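One way to find such a multiplier is a simple grid search over candidate values, as in the sketch below. The function names, the 0.1 grid step, and the toy sample are assumptions for illustration; the result depends on the sample of actuals supplied and on how each scoring function is defined, so it is not expected to reproduce the table.

```python
def ape(e, a):
    """Absolute percentage error: |E - A| / A."""
    return abs(e - a) / a

def relative_error(e, a):
    """Relative error: |E - A| / E."""
    return abs(e - a) / e

def mean_score(score, estimate, actuals):
    """Mean score when the same estimate is given for every task."""
    return sum(score(estimate, a) for a in actuals) / len(actuals)

def best_constant_multiplier(actuals, score, multipliers):
    """Multiplier of the mean actual that gives the lowest mean score."""
    mean_actual = sum(actuals) / len(actuals)
    return min(
        multipliers,
        key=lambda m: mean_score(score, m * mean_actual, actuals),
    )

# Candidate multipliers 0.1, 0.2, ..., 3.0 (grid resolution is an assumption).
MULTIPLIERS = [i / 10 for i in range(1, 31)]

# A small made-up right-skewed sample, purely for illustration.
toy_actuals = [1, 1, 2, 4, 12]
for label, score in (("APE", ape), ("RE", relative_error)):
    print(label, best_constant_multiplier(toy_actuals, score, MULTIPLIERS))
```

In practice the actuals would be drawn from the fitted distribution of previous task times, with a large enough sample to smooth out noise.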

What mean multiplier should Esta Pert, an expert estimator, aim for? Esta’s estimates can be modelled by the equation Act*U(0.5, 2.0), i.e., the actual implementation time multiplied by a random value uniformly distributed between 0.5 and 2.0; in other words, Esta is an unbiased estimator. Esta’s table of multipliers is:

                   APE   RE
Negative binomial  1.0   0.7
Weibull            1.0   0.7
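A short derivation suggests why the two rows of this table are identical. Assume Esta's adjusted estimate is $E = m \cdot A \cdot U$, where $A$ is the actual time, $U \sim U(0.5, 2.0)$, and $m$ is the multiplier she aims for (the symbol $m$ is introduced here for illustration, with $m > 0$ and $A > 0$). The actual time then cancels out of both scoring functions:

```latex
\begin{align*}
S_{\mathrm{APE}}(E, A) &= \frac{|mAU - A|}{A} = |mU - 1| \\
S_{\mathrm{RE}}(E, A)  &= \frac{|mAU - A|}{mAU} = \left|1 - \frac{1}{mU}\right|
\end{align*}
```

Under this model the expected score depends only on $m$ and $U$, so the multiplier that minimises it does not depend on whether the actuals follow a Negative binomial or a Weibull distribution.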

A company wanting to win contracts by underbidding the competition could evaluate Esta’s performance using the RE scoring function (to motivate her to estimate low), or they could use APE and multiply her answers by some fraction.

In many cases, developers are biased estimators, i.e., individuals consistently either underestimate or overestimate. How does an implicit bias (i.e., something a person does unconsciously) change the multiplier they should consciously aim for (having analysed their own performance to learn their personal percentage bias)?

The following table shows the impact of particular under and over estimate factors on multipliers:

                 0.8 underestimate bias   1.2 overestimate bias
Score function          APE   RE            APE   RE
Negative binomial       1.3   0.9           0.8   0.6
Weibull                 1.3   0.9           0.8   0.6
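One way to read this table, assuming that bias acts as a constant factor on every estimate (an assumption of this sketch, not something stated in the post), is that the conscious multiplier simply has to cancel the bias: the bias factor times the multiplier should equal the multiplier that is best for an unbiased estimator. The function name below is made up for illustration.

```python
def compensating_multiplier(unbiased_multiplier, bias_factor):
    """Multiplier to aim for so that bias_factor * multiplier equals
    the multiplier that is best for an unbiased estimator."""
    return unbiased_multiplier / bias_factor

# Esta's (unbiased) multipliers, taken from her table of multipliers.
UNBIASED = {"APE": 1.0, "RE": 0.7}

for bias in (0.8, 1.2):
    for label, multiplier in UNBIASED.items():
        value = compensating_multiplier(multiplier, bias)
        print(f"bias {bias}, {label}: {value:.2f}")
```

The values this produces (1.25, 0.88, 0.83, 0.58) agree with the table to within rounding to one decimal place.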

Let’s say that one-third of those on a team underestimate, one-third overestimate, and the rest show no bias. What scoring function should a company use to motivate the best overall team performance?

The following table shows that neither of the scoring functions motivates team members to aim for the actual value when the distribution is Negative binomial:

                    APE   RE
Negative binomial   1.1   0.7
Weibull             1.0   0.7

One solution is to create a bespoke scoring function for this case. Both APE and RE are special cases of a more general scoring function (see top). Setting $\beta = -0.7$ in this general form creates a scoring function that produces a multiplication factor of 1 for the Negative binomial case.

A Review: Incineration Fest 2022 – Metal is back!

Paul Grenyer from Paul Grenyer

Overall I really enjoyed Incineration Fest and would go again if the line up is right for me. What was really great was seeing metallers back at a gig with no restrictions and doing what we do best!


I completely fell in love with Winterfylleth when they played Bloodstock on the mainstage and even more so when they released the set as a live album. They are incredible and totally deserved to be opening proceedings at the Roundhouse for Incineration Fest. Actually, they deserved to be much higher up the bill. They’re a solid outfit, played what I wanted to hear and ended, as I always think of them ending from the live album, with Chris saying this is the last song “as time is short and our songs are long!” I need to see them do a headline set in a venue with a great PA soon.


Tsjuder was the wildcard for me. I didn’t really know them and had heard only a few things on Spotify before, although what I heard was really good. I had no idea I was going to be blown away. They sounded incredible from the first note, which was even more impressive given that they are only a three piece and the PA in the Roundhouse wasn’t turning out to be great for definition.


Bloodbath was really the reason I was at Incineration Fest. I’d missed them at Bloodstock ten years before as one of my sons was being born and I hadn’t had a chance to see them until now. Of course now Nick Holmes (Paradise Lost) rather than Mikael Åkerfeldt (Opeth) was on lead vocals.

I was very, very excited and from the moment I heard that trademark crunching guitar sound I was even more excited. They played for a full hour. Unlike the Black Metal bands on the bill there was more riffing and solos and a slightly different drum sound.

They’re an odd band to watch. For reasons I don’t understand, the bass player and two guitarists would often turn their backs to the audience to face the drummer. The band didn’t seem to interact much with each other on stage and even less so with Nick.

Nick’s deadpan humour was present when he did speak to the audience. He introduced the band as being from Sweden, then added from Halifax almost as an afterthought! During the set he admitted he couldn’t see and dispensed with his sun glasses as they’d apparently been a good idea backstage. After breaking the microphone he enquired if it would be added to his bill at the end of the night.


Emperor hasn't released any new material (that I know of) since 2001 and, if I’m honest, I barely listen to them beyond the live album these days. I’ve seen them at least three times before, the first time being in 1999 in a small club in Bradford on my birthday - it doesn’t get much better than that. I’m more of a fan of Ihsahn’s solo stuff these days and I still really enjoy Samoth’s Zyklon whenever I play it. Emperor, not so much anymore.

They played for the full ninety and for the most part were solid as you might expect. Whether or not Faust plays with them is of no consequence to me and I certainly didn’t need the covers they played with him towards the end of the set. There was lots I knew and lots I enjoyed, but I wouldn't make an effort to see Emperor again.

The Middle Way – a.k.

a.k. from thus spake a.k.

A few years ago we spent some time implementing a number of the sorting, searching and set manipulation algorithms from the standard C++ library in JavaScript. Since the latter doesn't support the former's abstraction of container access via iterators we were compelled to restrict ourselves to using native Array objects following the conventions of its methods, such as slice and sort.
In this post we shall take a look at an algorithm for finding the centrally ranked element, or median, of an array, which is strongly related to the ak.nthElement function, and then at a particular use for it.
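For readers unfamiliar with the idea, selecting the n-th ranked element and using it to find the median can be sketched with quickselect. The sketch below is in Python rather than JavaScript and is not the ak library's implementation; the function names, the random pivot choice, and the convention of averaging the two central elements of an even-length array are all assumptions made for illustration.

```python
import random

def nth_element(values, n, rng=random):
    """Return the element that would be at index n if values were sorted."""
    items = list(values)  # work on a copy so the caller's list is untouched
    if not 0 <= n < len(items):
        raise IndexError("rank out of range")
    while True:
        pivot = rng.choice(items)
        lower = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if n < len(lower):
            # The wanted element is among those smaller than the pivot.
            items = lower
        elif n < len(lower) + equal_count:
            # The wanted rank falls on the pivot itself.
            return pivot
        else:
            # Skip everything up to and including the pivot's copies.
            n -= len(lower) + equal_count
            items = [x for x in items if x > pivot]

def median(values):
    """Centrally ranked element; mean of the two central ones if even length."""
    size = len(values)
    if size == 0:
        raise ValueError("median of an empty sequence")
    upper = nth_element(values, size // 2)
    if size % 2 == 1:
        return upper
    lower = nth_element(values, size // 2 - 1)
    return (lower + upper) / 2

print(median([7, 1, 5, 3, 9]))
print(median([4, 1, 3, 2]))
```

Unlike a full sort, each pass discards the part of the array that cannot contain the wanted rank, which is what makes selection cheaper on average.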

Twitter and evidence-based software engineering

Derek Jones from The Shape of Code

This year’s quest for software engineering data has led me to sign up to Twitter (all the software people I know, or know-of, have been contacted, and discovery through articles found on the Internet is a very slow process).

@evidenceSE is my Twitter handle. If you get into a discussion and want some evidence-based input, feel free to get me involved. Be warned that the most likely response, to many kinds of questions, is that there is no data.

My main reason for joining is to try and obtain software engineering data. Other reasons include trying to introduce an evidence-based approach to software engineering discussions and finding new (to me) problems that people want answers to (that are capable of being answered by analysing data).

The approach I’m taking is to find software engineering tweets discussing a topic for which some data is available, and to jump in with a response relating to this data. Appropriate tweets are found using the search pattern: (agile OR software OR "story points" OR "story point" OR "function points") (estimate OR estimates OR estimating OR estimation OR estimated OR #noestimates OR "evidence based" OR empirical OR evolution OR ecosystems OR cognitive). Suggestions for other keywords or patterns welcome.
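Keeping the keywords in lists makes it easy to try out suggested additions. The sketch below only builds the search string; the helper names are made up for illustration, and it makes no calls to Twitter.

```python
def quote_phrase(term):
    """Wrap multi-word terms in double quotes so they match as a phrase."""
    return f'"{term}"' if " " in term else term

def or_group(terms):
    """Join terms into a parenthesised OR group."""
    return "(" + " OR ".join(quote_phrase(term) for term in terms) + ")"

# Keyword lists taken from the search pattern quoted above.
TOPICS = ["agile", "software", "story points", "story point", "function points"]
ASPECTS = [
    "estimate", "estimates", "estimating", "estimation", "estimated",
    "#noestimates", "evidence based", "empirical", "evolution",
    "ecosystems", "cognitive",
]

# A space between groups means a tweet must match one term from each group.
query = " ".join([or_group(TOPICS), or_group(ASPECTS)])
print(query)
```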

My experience is that the only effective way to interact with developers is via meaningful discussion, i.e., cold-calling with a tweet is likely to be unproductive. Also, people with data often don’t think that anybody else would be interested in it; they have to be convinced that it can provide valuable insight.

You never know who has data to share. At a minimum, I aim to have a brief tweet discussion with everybody on Twitter involved in software engineering. At a minute per tweet (when I get a lot more proficient than I am now, and have workable templates in place), I could spend two hours per day to reach 100 people, which is 35,000 per year; say 20K by the end of this year. Over the last three days I have managed around 10 per day, and obviously need to improve a lot.

How many developers are on Twitter? Waving arms wildly, say 50 million developers and 1 in 1,000 have a Twitter account, giving 50K developers (of which an unknown percentage are active). A lower bound estimate is the number of followers of popular software related Twitter accounts: CompSciFact has 238K, Unix tool tips has 87K; perhaps 1 in 200 developers have a Twitter account, or some developers have multiple accounts, or there are lots of bots out there.
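The back-of-envelope figures in the last two paragraphs can be checked with a few lines of arithmetic. The variable and function names are made up for illustration, and the inputs are the arm-waving guesses from the text, not measured values.

```python
def account_holders(developers, one_in):
    """Developers with an account if one in every `one_in` has one."""
    return developers // one_in

DEVELOPERS = 50_000_000  # guessed worldwide developer count from the text

low = account_holders(DEVELOPERS, 1_000)
high = account_holders(DEVELOPERS, 200)
print(low)   # 50000
print(high)  # 250000

PEOPLE_PER_DAY = 100  # target outreach rate from the text
print(PEOPLE_PER_DAY * 365)   # 36500, roughly the 35,000 per year quoted
print(low / PEOPLE_PER_DAY)   # 500.0 days to reach the lower guess
print(high / PEOPLE_PER_DAY)  # 2500.0 days to reach the higher guess
```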

I need some tools to improve the search process and help track progress and responses. Twitter has an API and a developer program. No need to worry about them blocking me or taking over my business; my usage is small fry and I’m not building a business to take over. I was at Twitter’s London developer meetup in the week (the first in-person event since Covid) and the youngsters present looked a lot younger than usual. I suspect this is because the slightly older youngsters remember how Twitter cut developers off at the knee a few years ago by shutting down some useful API services.

The Twitter version-2 API looks interesting, and the Twitter developer evangelists are keen to attract developers (having ‘wiped out’ many existing API users), and I’m happy to jump in. There is a Twitter API sandbox for trying things out, and there are lots of example projects on GitHub. Pointers to interesting tools welcome.