Unneeded requirements implemented in Waterfall & Agile

Derek Jones from The Shape of Code

Software does not wear out, but the world in which it runs evolves. Time and money are lost when, after a feature has been implemented, customer feedback reveals that the feature is not needed.

How do the Waterfall and Agile implementation processes compare in the number of unneeded features/requirements that they implement?

In a Waterfall process, a list of requirements is created and then implemented. The identity of ‘dead’ requirements is not known until customers start using the software, which is not until it is released at the end of development.

In an Agile process, a list of requirements is used to create a Minimal Viable Product, which is released to customers. An iterative development process, driven by customer feedback, implements requirements and makes frequent releases to customers, which reduces the likelihood of implementing requirements that are known to be ‘dead’. Previously implemented requirements may be discovered to have become ‘dead’.

An analysis of the number of ‘dead’ requirements implemented by the two approaches appears at the end of this post.

The plot below shows the number of ‘dead’ requirements implemented in a project lasting a given number of working days (blue/red), along with the difference between them (green). It assumes that one requirement is implemented per working day, that after 100 working days a given fraction of the implemented requirements are discovered to be unneeded, and that the number of requirements in the MVP is small (fractions 0.5, 0.1, and 0.05 shown; code):

Dead requirements for Waterfall and Agile projects running for a given number of days, along with difference between them.

The values calculated using one requirement implemented per day scale linearly with the number of requirements implemented per day.

By implementing fewer ‘dead’ requirements, an Agile project will finish earlier (assuming it only implements all the needed requirements of a Waterfall approach, and some subset of the ‘dead’ requirements). However, unless a project is long-running, or has a high requirements’ ‘death’ rate, the difference may not be compelling.

I’m not aware of any data on rate of discovery of ‘dead’ implemented requirements (there is some on rate of discovery of new requirements); as always, pointers to data most welcome.

The Waterfall projects I am familiar with, plus those where data is available, include some amount of requirement discovery during implementation. This has the potential to reduce the number of ‘dead’ implemented requirements, but who knows by how much.

As the size of the Minimal Viable Product increases to become a significant fraction of the final software system, the fraction of ‘dead’ requirements will approach that of the Waterfall approach.

There are other factors that favor either Waterfall or Agile, which are left to be discussed in future posts.

The following is an analysis of Waterfall/Agile requirements’ implementation.

Define:

F_{live} is the fraction of implemented requirements that remain relevant to customers from one day to the next (the per-day survival rate). This value is likely to be very close to one, e.g., 0.999.
R_{done} is the number of requirements implemented per working day.

Waterfall

The implementation of R_{total} requirements takes I_{days}=R_{total}/R_{done} days, and the number of implemented ‘dead’ requirements is (assuming that no ‘dead’ requirements were present at the end of the requirements gathering phase):

R_{Wdead}=R_{total}*(1-{F_{live}}^{I_{days}})

As I_{days} \rightarrow \infty, effectively all implemented requirements are ‘dead’.
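
For concreteness, the Waterfall formula can be evaluated with a few lines of Python; this is a sketch, and the F_{live} and R_{done} values used are illustrative assumptions, not data:

    # Sketch: evaluate R_Wdead = R_total*(1 - F_live^I_days) for a few project sizes.
    # The f_live and r_done values are illustrative assumptions, not data.

    def waterfall_dead(r_total, r_done=1.0, f_live=0.999):
        i_days = r_total / r_done          # implementation takes I_days working days
        return r_total * (1 - f_live ** i_days)

    for r_total in (100, 500, 1000):
        print(r_total, round(waterfall_dead(r_total), 1))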

Agile

The number of implemented ‘live’ requirements on day n is given by:

R_n=F_{live}*R_{n-1}+R_{done}

with the initial condition that the number of implemented requirements at the start of the first day of iterative development is the number of requirements implemented in the Minimum Viable Product, i.e., R_0=R_{mvp}.

Solving this difference equation gives the number of ‘live’ requirements on day n:

R_n=R_{mvp}*{F_{live}}^n+{n*R_{done}}/{n(1-F_{live})+F_{live}}

as n \rightarrow \infty, R_n approaches its maximum value of {R_{done}}/{1-F_{live}}

Subtracting the number of ‘live’ requirements from the total number of requirements implemented gives:

R_{Adead}=R_{mvp}+n*R_{done}-R_n

or

R_{Adead}=R_{mvp}(1-{F_{live}}^n)+n*R_{done}(1-1/{n(1-F_{live})+F_{live}})

or

R_{Adead}=R_{mvp}(1-{F_{live}}^n)+n*R_{done}*{n-1}/{n+F_{live}/(1-F_{live})}

As n \rightarrow \infty, effectively all implemented requirements are ‘dead’, because the number of ‘live’ requirements cannot exceed a known maximum.
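
The model can also be illustrated by iterating the Agile difference equation directly and comparing against the Waterfall figure for the same number of working days. The following Python sketch does this; the parameter values are illustrative assumptions (one requirement per day, a small MVP), not data:

    # Sketch of the 'dead' requirements model described above.
    # The parameter values below are illustrative assumptions, not data.

    F_LIVE = 0.999   # fraction of implemented requirements still 'live' after each day
    R_DONE = 1.0     # requirements implemented per working day
    R_MVP  = 10.0    # requirements implemented in the Minimal Viable Product

    def waterfall_dead(days):
        # R_Wdead = R_total*(1 - F_live^I_days), with R_total = days*R_done and I_days = days
        r_total = days * R_DONE
        return r_total * (1 - F_LIVE ** days)

    def agile_dead(days):
        # Iterate the difference equation R_n = F_live*R_{n-1} + R_done, with R_0 = R_mvp,
        # then subtract the surviving 'live' count from the total number implemented.
        live = R_MVP
        for _ in range(days):
            live = F_LIVE * live + R_DONE
        return R_MVP + days * R_DONE - live

    for days in (100, 500, 1000, 2000):
        w, a = waterfall_dead(days), agile_dead(days)
        print(f"{days:5} days: Waterfall {w:7.1f}  Agile {a:7.1f}  difference {w - a:6.1f}")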

Air-Source Heat Pump – 1 year later

Andy Balaam from Andy Balaam's Blog

10 months ago I wrote a blog post, “Air-Source Heat Pump – our experience so far, 2 months in”, about our new air source heat pump. Have a look back at that for photos of the device itself and more detail about installation etc.

Less energy

We used a lot less energy this year than last year. Here’s the graph for 2 years:

Graph showing 2 years of energy usage on gas and electricity. Gas usage stops halfway through because a heat pump was installed. The second year's gas usage is much lower than the first, especially during the heavy use, cold months.

As you can see, we used a lot less energy in kWh this year than last year. Air source heat pumps work!

More money

Our energy cost more this year than last year. I’ve calculated this graph based on fixed prices, and I used 2021 prices to keep consistency with the last blog post, but the real-prices story is similar. Here is the graph for the last 2 years:

Graph showing energy cost per day over 2 years. The first half shows gas usage, which drops to zero in the middle when a heat pump was installed. The second half is higher, showing that the cost of the electricity this year was more than last year. Towards the end, solar panels were installed and the cost drops below last year.

Why did our cost go up when our energy usage went down so dramatically?

Because electricity is too expensive!

Electricity is the right way to power our cars and homes, because it can be sourced sustainably, and in fact much of it really is being sourced sustainably right now.

Artificially-high electricity prices are preventing people switching to a better way.

Solar panels help!

In September this year we had solar panels installed. They look great and are working incredibly well. We were getting 10kWh per day from them in September. We don’t have a battery yet, but when it arrives we think we will be able to cover most of our summer energy costs using these panels.

Since the panels were installed, our reduced use of grid electricity meant that our energy costs for this year dropped below last year. Obviously it’s too soon to say for sure what the full impact is, but I can say we are very happy with our solar panels.

Our house is cozy

Our new heat pump heats our leaky house very well, and we are nice and cozy, even when temperatures outside drop below zero. The heat pump is less efficient when it’s cold outside, but still way better than a gas boiler.

Installed by Your Energy Your Way

[My wife used to be director of the company, so I declare an interest.]

Our heat pump, radiators and panels were installed by Your Energy Your Way and I can recommend them for good communication, service and quality.

Stochastic rounding reemerges

Derek Jones from The Shape of Code

Just like integer types, floating-point types are capable of representing a finite number of numeric values. An important difference between integer and floating types is that the result of arithmetic and relational operations using integer types is exactly representable in an integer type (provided they don’t overflow), while the result of arithmetic operations using floating types may not be exactly representable in the corresponding floating type.

When the result of a floating-point operation cannot be exactly represented, it is rounded to a value that can be represented. Rounding modes include: round to nearest (the default for IEEE-754), round towards zero (i.e., truncated), round up (i.e., towards +\infty), round down (i.e., towards -\infty), and round to even. The following is an example of round to nearest:

      123456.7    = 1.234567    × 10^5
         101.7654 = 0.001017654 × 10^5
Adding
                  = 1.235584654 × 10^5
Round to nearest
                  = 1.235585    × 10^5
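
For readers who want to experiment, Python's decimal module (decimal, rather than binary, floating-point) allows the working precision and rounding mode to be set explicitly; a small sketch repeating the addition above under several rounding modes:

    # Sketch: repeat the 7-significant-digit addition above under different
    # rounding modes, using Python's decimal module.
    from decimal import Decimal, Context, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR

    x, y = Decimal("123456.7"), Decimal("101.7654")

    modes = [("nearest (ties to even)", ROUND_HALF_EVEN),
             ("towards zero",           ROUND_DOWN),
             ("towards +infinity",      ROUND_CEILING),
             ("towards -infinity",      ROUND_FLOOR)]

    for name, mode in modes:
        ctx = Context(prec=7, rounding=mode)   # 7 significant digits
        print(f"{name:24} {ctx.add(x, y)}")
    # round to nearest gives 123558.5, i.e., 1.235585 x 10^5, matching the example above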

There is another rounding mode, one implemented in the 1950s, which faded away but could now be making a comeback: stochastic rounding. As the name suggests, every round up/down decision is randomly chosen (a Google patent makes some claims about where the entropy needed for randomness can be obtained, and Nvidia also make some patent claims).

From the developer perspective, stochastic rounding has a very surprising behavior, which is not present in the other IEEE rounding modes; stochastic rounding is not monotonic. For instance: z < x+y does not imply that 0<(x+y)-z, because x+y may be close enough to z to have a 50% chance of being rounded to one of z or the next representable value greater than z, and in the comparison against zero the rounded value of (x+y) has an uncorrelated probability of being equal to z (rather than the next representable greater value).

For some problems, stochastic rounding avoids undesirable behaviors that can occur when round to nearest is used. For instance, round to nearest can produce correlated rounding errors that cause systematic error growth (by definition, stochastic rounding errors are uncorrelated); a behavior that has long been known to occur when numerically solving differential equations. The benefits of stochastic rounding are obtained for long chains of calculations; the rounding error in the result of n operations grows in proportion to \sqrt{n}, i.e., just like a 1-D random walk, which is not guaranteed for round to nearest.
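
There is no mainstream language support for stochastic rounding, but the idea is easy to sketch. The Python code below is an illustration only (it works in decimal digits, not on the binary significand as hardware does): it rounds to two decimal places, choosing the upper or lower candidate with probability proportional to proximity, and shows the systematic error (stagnation) that round to nearest produces when repeatedly adding a value smaller than half the rounding granularity:

    # Sketch: stochastic rounding vs round to nearest, rounding to 2 decimal places.
    # Illustrative only; hardware implementations operate on the binary significand.
    import math, random

    def round_nearest(x, ndigits=2):
        return round(x, ndigits)               # nearest, ties to even

    def round_stochastic(x, ndigits=2):
        scale = 10 ** ndigits
        low = math.floor(x * scale)
        frac = x * scale - low                 # distance above the lower candidate
        return (low + (1 if random.random() < frac else 0)) / scale

    n, increment = 10_000, 0.001               # increment is below half the granularity (0.005)
    exact = n * increment                      # 10.0

    s_nearest = s_stochastic = 0.0
    for _ in range(n):
        s_nearest = round_nearest(s_nearest + increment)        # always rounds back down: stays 0.0
        s_stochastic = round_stochastic(s_stochastic + increment)

    print(f"exact {exact}, round to nearest {s_nearest}, stochastic {s_stochastic:.2f}")

In this run-on chain of additions, the stochastic sum ends close to the exact value, with an error that behaves like a random walk, while the round to nearest sum never gets off the ground.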

While stochastic rounding has been supported by some software packages for a while, commercial hardware support is still rare, with Graphcore's Intelligence Processing Unit being one. There are some research chips supporting stochastic rounding, e.g., Intel's Loihi.

What applications, other than solving differential equations, involve many long chain calculations?

Training of machine learning models can consume many CPU hours/days; the calculation chains just go on and on.

Machine learning is considered to be a big enough market for hardware vendors to support half-precision floating-point. The performance advantages of half-precision floating-point are large enough to attract developers to reworking code to make use of them.

Is the accuracy advantage of stochastic rounding a big enough selling point that hardware vendors will provide the support needed to attract a critical mass of developers willing to rework their code to take advantage of improved accuracy?

It's possible that the intrinsically fuzzy nature of many machine learning applications swamps the accuracy advantage that stochastic rounding can have over round to nearest, out-weighing the costs of supporting it.

The ecosystem of machine learning based applications is still evolving rapidly, and we will have to wait and see whether stochastic rounding becomes widely used.

Deleted my Twitter account

Andy Balaam from Andy Balaam's Blog

This evening I deleted my Twitter account. I’m feeling surprisingly unsettled by doing it, to be honest.

The Twitter mobile interface showing that my account has been deactivated.

I can’t participate in a platform that gives a voice to people who incite violence. I said I would leave if Trump were re-instated, and that is what I am doing. I know Twitter has failed in the past in many ways, but this deliberate decision is the last straw.

I have downloaded the archive of my tweets, and I will process it and upload it to my web site as soon as possible.

You can find me at @andybalaam@mastodon.social.

If you’d like help finding your way onto Mastodon, feel free to ask me for help via Mastodon itself, or by email on andybalaam at artificialworlds.net

Some human biases in conditional reasoning

Derek Jones from The Shape of Code

Tracking down coding mistakes is a common developer activity (for which training is rarely provided).

Debugging code involves reasoning about differences between the actual and expected output produced by particular program input. The goal is to figure out the coding mistake, or at least narrow down the portion of code likely to contain the mistake.

Interest in human reasoning dates back to at least ancient Greece, e.g., Aristotle and his syllogisms. The study of the psychology of reasoning is very recent; the field was essentially kick-started in 1966 by the surprising results of the Wason selection task.

Debugging involves a form of deductive reasoning known as conditional reasoning. The simplest form of conditional reasoning involves an input that can take one of two states, along with an output that can take one of two states. Using coding notation, this might be written as:

    if (p) then q       if (p) then !q
    if (!p) then q      if (!p) then !q

The notation used by the researchers who run these studies is a 2×2 contingency table (or conditional matrix):

          OUTPUT
          1    0
   
      1   A    B
INPUT
      0   C    D

where: A, B, C, and D are the number of occurrences of each case; in code notation, p is the input and q the output.

The fertilizer-plant problem is an example of the kind of scenario subjects answer questions about in studies. Subjects are told that a horticultural laboratory is testing the effectiveness of 31 fertilizers on the flowering of plants; they are told the number of plants that flowered when given fertilizer (A), the number that did not flower when given fertilizer (B), the number that flowered when not given fertilizer (C), and the number that did not flower when not given any fertilizer (D). They are then asked to evaluate the effectiveness of the fertilizer on plant flowering. After the experiment, subjects are asked about any strategies they used to make judgments.

Needless to say, subjects do not make use of the available information in a way that researchers consider to be optimal, e.g., Allan’s \Delta p index: \Delta p = A/{A+B} - C/{C+D}, i.e., the probability of the output occurring when the input is present minus the probability of it occurring when the input is absent.
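
Expressed as code, the index is a one-liner; here is a small Python sketch using made-up cell counts:

    # Sketch: Allan's Delta-p index computed from the four cells of a 2x2
    # contingency table (cell labels follow the table above).
    def delta_p(a, b, c, d):
        # P(output occurs | input present) - P(output occurs | input absent)
        return a / (a + b) - c / (c + d)

    # Made-up fertilizer-plant counts: flowered/did-not-flower, with/without fertilizer.
    print(delta_p(a=15, b=5, c=5, d=15))   # 0.75 - 0.25 = 0.5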

What do we know after 40+ years of active research into this basic form of conditional reasoning?

The results consistently find, for this and other problems, that the information A is given more weight than B, which is given more weight than C, which is given more weight than D.

That information provided by A and B is given more weight than C and D is an example of a positive test strategy, a well-known human characteristic.

Various models have been proposed to ‘explain’ the relative ordering of information weighting, w(A) > w(B) > w(C) > w(D), e.g., that subjects have a bias towards sufficiency information compared to necessary information.

Subjects do not always analyse separate contingency tables in isolation. The term blocking is given to the situation where the predictive strength of one input is influenced by the predictive strength of another input (this process is sometimes known as the cue competition effect). Debugging is an evolutionary process, often involving multiple test inputs. I’m sure readers will be familiar with the situation where the output behavior from one input motivates a misinterpretation of the behaviour produced by a different input.

The use of logical inference is a commonly used approach to the debugging process (my suggestions that a statistical approach may at times be more effective tend to attract odd looks). Early studies of contingency reasoning were dominated by statistical models, with inferential models appearing later.

Debugging also involves causal reasoning, i.e., searching for the coding mistake that is causing the current output to be different from that expected. False beliefs about causal relationships can be a huge waste of developer time, and research on the illusion of causality investigates, among other things, how human interpretation of the information contained in contingency tables can be ‘de-biased’.

The apparently simple problem of human conditional reasoning over two variables, each having two states, has proven surprisingly difficult to model. It is tempting to think that the performance of professional software developers would be closer to the ideal, compared to the typical experimental subject (e.g., psychology undergraduates or Mturk workers), but I’m not sure whether I would put money on it.

IETF115 Trip Report (Can Matrix help messaging standardisation through MIMI?)

Andy Balaam from Andy Balaam's Blog

Geeks don’t like to be formal, but we do like to be precise. That is the
contrast that comes to mind as I attend my first IETF meeting, IETF 115 in
London in November 2022.

Like most standards bodies, IETF appears to have been formed as a reaction to
stuffy, process-encumbered standards bodies. The introductory material
emphasises a rough-consensus style, with an emphasis on getting things done
and avoiding being a talking shop. This is backed up by the hackathon organised
on the weekend before the conference, allowing attendees to get a taste of
really doing stuff, writing code to solve those thorny problems.

But the IETF has power, and, rightly, with power come checks and balances, so
although there is no formal secretary for the meeting I was in, a volunteer was
found to take minutes, and they made sure to store them in the official system.
It really matters what happens here, and the motley collection of (mostly
slightly aging) participants know that.

There is an excitement about coming together to solve technical problems, but
there is also an understandable caution about the various agendas being
represented.

We were at the meeting to discuss something incredibly exciting: finding a
practical solution for how huge messaging providers (e.g. WhatsApp) can make
their systems “interoperable”. In other words, making it possible for people
using one instant messenger to talk to people using another one.

Because of the EU’s new Digital Markets Act (DMA), the largest messaging
providers are going to be obliged to provide ways for outside systems to link
into them, and the intention of the meeting I went to (“More Instant Messaging
Interoperability”, codenamed MIMI) is to agree on a way they can link together
that makes the experience good for users.

The hardest part of this is preserving end-to-end encryption across different
instant messengers. If the content of a message is encrypted, then you can’t
have a translation layer (“bridge”, in Matrix terminology) that converts those
contents from one format into another, because the translator would need to
decrypt the message first. Instead, either the client (the app on the user’s
device) needs to understand all possible formats, or we need to agree on a
common format.

The meeting was very interesting, and conversation on the day revolved around
what exactly should be in scope or out of scope for the group that will
continue this work. In particular, we discussed whether the group should work
on questions of how to identify and find people on other messaging networks,
and to what extent the problems of spam and abuse should be considered. Mostly
the conclusion was that the “charter” of the “working group”, when/if it
is formed, should leave these topics open, so that the group can decide the
details when it is up and running.

From a Matrix point of view, this is very
exciting, because we have been working on exactly how to do this stuff for
years, and our goal from the beginning has been to allow interoperability
between other systems. So we have a lot to offer on how to design a system that
allows rich interactions to work between very different systems, while
providing effective systems to fight abuse. One part that we can’t tackle
without the co-operation of the dominant messengers is end-to-end encryption,
because we can’t control the message formats that are used on the clients, so
it’s really exciting that MIMI’s work might improve that situation.

I personally feel a little skeptical that the “gatekeepers” will implement DMA
in a really good way, like what we were discussing at IETF. I think it’s more
likely that they will do the bare minimum, and make it difficult to use so
that the experience is so bad that no-one uses it. However, if one or two major
players do adopt a proper interoperable standard, it could pick up all the
minor players, and become something genuinely useful.

I really enjoyed going to the IETF meeting, and am incredibly grateful to the
dedicated people doing this work. I really hope that the MIMI group forms and
successfully creates a useful standard. I’m also really optimistic that the
things we’ve learned while creating Matrix can help with this process.

Whether this standard begins to solve the horrible problems we have with closed
messaging silos and user lock-in to potentially exploitative or harmful
platforms is probably out of our hands.

If you’d like to get involved with MIMI, check out the page here:
datatracker.ietf.org/group/mimi/about.

Evidence-based Software Engineering book: two years later

Derek Jones from The Shape of Code

Two years ago, my book Evidence-based Software Engineering: based on the publicly available data was released. The first two weeks saw 0.25 million downloads, and 0.5 million after six months. The paperback version on Amazon has sold perhaps 20 copies.

How have the book contents fared, and how well has my claim to have discussed all the publicly available software engineering data stood up?

The contents have survived almost completely unscathed. This is primarily because reader feedback has been almost non-existent, and I have hardly spent any time rereading it.

In the last two years I have discovered maybe a dozen software engineering datasets that would have been included, had I known about them, and maybe another dozen non-software related datasets that could have been included in the Human behavior/Cognitive capitalism/Ecosystems/Reliability chapters. About half of these have been the subject of blog posts (links below), with the others waiting to be covered.

Each dataset provides a sliver of insight into the much larger picture that is software engineering; joining the appropriate dots, by analyzing multiple datasets, can provide a larger sliver of insight into the bigger picture. I have not spent much time attempting to join dots, but have joined a few tiny ones, and a few that are not so small, e.g., Estimating using a granular sequence of values and Task backlog waiting times are power laws.

I spent the first year, after the book came out, working through the backlog of tasks that had built up during the 10-years of writing. The second year was mostly dedicated to trying to find software project data (including joining Twitter), and reading papers at a much reduced rate.

The plot below shows the number of monthly downloads of the A4 and mobile friendly pdfs, along with the average kbytes per download (code+data):

Downloads of A4 and mobile friendly pdfs over two years, along with the average kbytes per download.

The monthly averages for 2022 are around 6K A4 and 700 mobile friendly pdfs.

I have been averaging one in-person meetup per week in London. Nearly everybody I tell about the book has not previously heard of it.

The following is a list of blog posts either analyzing existing data or discussing/analyzing new data.

Introduction
analysis: Software effort estimation is mostly fake research
analysis: Moore’s law was a socially constructed project

Human behavior
data (reasoning): The impact of believability on reasoning performance
data: The Approximate Number System and software estimating
data (social conformance): How large an impact does social conformity have on estimates?
data (anchoring): Estimating quantities from several hundred to several thousand
data: Cognitive effort, whatever it might be

Ecosystems
data: Growth in number of packages for widely used languages
data: Analysis of a subset of the Linux Counter data
data: Overview of broad US data on IT job hiring/firing and quitting

Projects
analysis: Delphi and group estimation
analysis: The CESAW dataset: a brief introduction
analysis: Parkinson’s law, striving to meet a deadline, or happenstance?
analysis: Evaluating estimation performance
analysis: Complex software makes economic sense
analysis: Cost-effectiveness decision for fixing a known coding mistake
analysis: Optimal sizing of a product backlog
analysis: Evolution of the DORA metrics
analysis: Two failed software development projects in the High Court

data: Pomodoros worked during a day: an analysis of Alex’s data
data: Multi-state survival modeling of a Jira issues snapshot
data: Over/under estimation factor for ‘most estimates’
data: Estimation accuracy in the (building|road) construction industry
data: Rounding and heaping in non-software estimates
data: Patterns in the LSST:DM Sprint/Story-point/Story ‘done’ issues
data: Shopper estimates of the total value of items in their basket

Reliability
analysis: Most percentages are more than half

Statistical techniques
Fitting discontinuous data from disparate sources
Testing rounded data for a circular uniform distribution

Post 2020 data
Pomodoros worked during a day: an analysis of Alex’s data
Impact of number of files on number of review comments
Finding patterns in construction project drawing creation dates