Three books discuss three small data sets

Derek Jones from The Shape of Code

During the early years of a new field, experimental data relating to important topics can be very thin on the ground. Ever since the first computer was built, there has been a lot of data on the characteristics of the hardware. Data on the characteristics of software, and the people who write it, has been (and often continues to be) very thin on the ground.

Books are sometimes written by the researchers who produce the first data associated with an important topic, even if the data set is tiny; being first often generates enough interest for a book length treatment to be considered worthwhile.

As a field progresses, lots more data becomes available, and the discussion in subsequent books can be based on findings from more experiments and lots more data.

Software engineering is a field where a few ‘first’ data books have been published, followed by silence, or rather lots of arm waving and little new data. The fall of Rome has been followed by a 40-year dark-age, from which we are slowly emerging.

Three of these ‘first’ data books are:

  • “Man-Computer Problem Solving” by Harold Sackman, published in 1970, relating to experimental data from 1966. The experiments investigated the impact of two different approaches to developing software, on programmer performance (i.e., batch processing vs. on-line development; code+data). The first paper on this work appeared in an obscure journal in 1967, and was followed in the same issue by a critique pointing out the wide margin of uncertainty in the measurements (the critique agreed that running such experiments was a laudable goal).

    Failing to deal with experimental uncertainty is nothing compared to what happened next. A 1968 paper in a widely read journal, the Communications of the ACM, contained the following table (extracted from a higher quality scan of a 1966 report by the same authors, and available online).

    Developer performance ratios.

    The tale of a 1:28 ratio of programmer performance, found in an experiment by Grant/Sackman, took off (the technical detail that a lot of the difference was down to the techniques subjects used, and not the people themselves, got lost). The Grant/Sackman ‘finding’ used to be frequently quoted in some circles (or at least it was when I moved in them; I don’t know how often it is cited today). In 1999, Lutz Prechelt wrote an exposé on the sorry tale.

    Sackman’s book is very readable, and contains lots of details and data not present in the papers, including survey data and a discussion of the intrinsic uncertainties associated with the experiment; it also contains the table above.

  • “Software Engineering Economics” by Barry W. Boehm, published in 1981. I wrote about the poor analysis of the data contained in this book a few years ago.

    The rest of this book contains plenty of interesting material, and even sounds modern (because books moving the topic forward have not been written).

  • “Program Evolution: Process of Software Change” edited by M. M. Lehman and L. A. Belady, published in 1985, relating to experimental data from 1977 and before. Lehman and Belady managed to obtain data relating to 19 releases of an IBM software product (yes, 19, not nineteen-thousand); the data was primarily the date and number of modules contained in each release, plus less specific information about number of statements. This data was sliced and diced every which way, and the book contains many papers with the same data appearing in the same plot with different captions (had the book not been a collection of papers it would have been considerably shorter).

    With a lot less data than Isaac Newton had available to formulate his three laws, Lehman and Belady came up with five, six, seven… “laws of software evolution” (which themselves evolved with the publication of successive papers).

    The availability of open source repositories means there is now a lot more software system evolution data available. Lehman’s laws have not stood the test of more data, although people still cite them every now and again.

On The Octogram Of Seth LaPod – student

student from thus spake a.k.

The latest wager that the Baron put to Sir R----- had them competing to first chalk a triangle between three of eight coins, with Sir R----- having the prize if neither of them managed to do so. I immediately recognised this as the game known as Clique and consequently that Sir R-----'s chances could be reckoned by applying the pigeonhole principle and the tactic of strategy stealing. Indeed, I said as much to the Baron but I got the distinct impression that he wasn't really listening.
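
One pigeonhole-style fact that bears on games like this one is that any way of colouring every line between six points with two colours must contain a single-coloured triangle, while five points can avoid one. The sketch below is a brute-force check of that fact only. It assumes this is the kind of result the student has in mind, and it does not settle the wager itself. The function names are made up for illustration.

```python
from itertools import combinations, product

def has_mono_triangle(n, colouring):
    """colouring maps each edge (i, j), i < j, to colour 0 or 1."""
    return any(
        colouring[(a, b)] == colouring[(a, c)] == colouring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_colouring_has_triangle(n):
    """True if every two-colouring of all edges between n points
    contains a triangle whose three edges share one colour."""
    edges = list(combinations(range(n), 2))
    for colours in product((0, 1), repeat=len(edges)):
        if not has_mono_triangle(n, dict(zip(edges, colours))):
            return False  # found a colouring with no such triangle
    return True

six_points_forced = every_colouring_has_triangle(6)   # 2**15 colourings
five_points_forced = every_colouring_has_triangle(5)  # 2**10 colourings
```

With eight coins there are more than six points, so a board on which every line has been chalked by one player or the other cannot be free of single-coloured triangles.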

Comparing expression usage in mathematics and C source

Derek Jones from The Shape of Code

Why does a particular expression appear in source code?

One reason is that the expression is the coded form of a formula from the application domain, e.g., E=mc^2.

Another reason is that the expression calculates an algorithm/housekeeping related address, or offset, to where a value of interest is held.

Most people (including me, many years ago) think that the majority of source code expressions relate to the application domain, in one way or another.

Work on a compiler related optimizer, and you will soon learn the truth; most expressions are simple and calculate addresses/offsets. Optimizing compilers would not have much to do, if they only relied on expressions from the application domain (my numbers tool throws something up every now and again).

What are the characteristics of application domain expressions?

I like to think of them as being complicated, but that’s because it used to be in my interest for them to be complicated (I used to work on optimizers, which have the potential to make big savings if things are complicated).

Measurements of expressions in scientific papers are needed, but who is going to be interested in measuring the characteristics of mathematical expressions appearing in papers? I’m interested, but not enough to do the work. Then, a few weeks ago I discovered: An Analysis of Mathematical Expressions Used in Practice, by Clare So; an analysis of 20,000 mathematical papers submitted to arXiv between 2000 and 2004.

The following discussion uses the measurements made for my C book as the representative source code (I keep suggesting that detailed measurements of other languages are needed, but nobody has jumped in and made them, yet).

The table below shows percentage occurrence of operators in expressions. Minus is much more common than plus in mathematical expressions, the opposite of C source; the ‘popularity’ of the relational operators is also reversed.

Operator  Mathematics   C source
=         0.39          3.08
-         0.35          0.19 
+         0.24          0.38
<=        0.06          0.04
>         0.041         0.11
<         0.037         0.22
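
The comparisons drawn from this table can be checked with a few lines of Python. This is only a sketch: the figures are copied from the table as given, and the helper names (`ratio`, `ranking`) are made up for illustration.

```python
# Occurrence figures copied from the table: (mathematics, C source).
usage = {
    "=":  (0.39, 3.08),
    "-":  (0.35, 0.19),
    "+":  (0.24, 0.38),
    "<=": (0.06, 0.04),
    ">":  (0.041, 0.11),
    "<":  (0.037, 0.22),
}
MATHS, C_SOURCE = 0, 1

def ratio(op_a, op_b, column):
    """How many times more common op_a is than op_b in one column."""
    return usage[op_a][column] / usage[op_b][column]

def ranking(ops, column):
    """Operators ordered from most to least common in one column."""
    return sorted(ops, key=lambda op: usage[op][column], reverse=True)

minus_over_plus_maths = ratio("-", "+", MATHS)   # roughly 1.46
plus_over_minus_c = ratio("+", "-", C_SOURCE)    # 2.0

relational = ["<=", ">", "<"]
maths_order = ranking(relational, MATHS)         # ['<=', '>', '<']
c_order = ranking(relational, C_SOURCE)          # ['<', '>', '<=']
```

On these figures the relational operators appear in exactly the opposite order in the two columns.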

The most common single binary operator expression in mathematics is n-1 (the data counts expressions using different variable names as different expressions; yes, n is the most popular variable name, and adding up other uses does not change relative frequency by much). In C source, var+int_constant is around twice as common as var-int_constant.

The plot below shows the percentage of expressions containing a given number of operators (I've made a big assumption about exactly what Clare So is counting; code+data). The operator count starts at two because that is where the count starts for the mathematics data. In C source, around 99% of expressions have fewer than two operators, so the simple case completely dominates.

Percentage of expressions containing a given number of operators.

For expressions containing between two and five operators, frequency of occurrence is roughly the same in mathematics and C, with C frequency decreasing more rapidly. The data disagrees with me again...

[HOWTO] Installing Emacs 26.3 on Ubuntu or XUbuntu 19.04

Timo Geusch from The Lone C++ Coder's Blog

My previous instructions for installing a newer Emacs version on Ubuntu still work. Ubuntu (and in my case, XUbuntu) 19.04 ships with Emacs 26.1 out of the box. As usual I want to run the latest version – Emacs 26.3 – as I run that on my other Linux, FreeBSD and macOS machines. I only […]


kmscube Running on Orange Pi PC with Mainline Kernel

Christof Meerwald from blog

Managed to get kmscube running on my Orange Pi PC with a mainline Linux 5.3 kernel and an updated mesa package from Ubuntu's ubuntu-x-swat PPA. The amazing thing is that it's all just mainline now, no board-specific patches needed. Interestingly, a Raspberry Pi 3 still needs a kernel built from its own branch to get that level of hardware support.

On a slightly related note I have been looking at what level of support I get for my ODROID-C1 now. Unfortunately, there is still no HDMI output and no USB OTG support. Interestingly, HDMI output does work on NetBSD.

Cost ratio for bespoke hardware+software

Derek Jones from The Shape of Code

What percentage of the budget for a bespoke hardware/software system is spent on software, compared to hardware?

The plot below has become synonymous with this question (without the red line, which highlights 1973), and is often used to claim that software costs are many times more than hardware costs.

USAF bespoke hardware/software cost ratio from 1955 to 1980.

The paper containing this plot was published in 1973 (the original source is a Rome period report), and is an extrapolation of data I assume was available in 1973, into what was then the future. The software and hardware costs are for bespoke command and control systems delivered to the U.S. Air Force, not commercial off-the-shelf solutions or even bespoke commercial systems.

Does bespoke software cost many times more than the hardware it runs on?

I don’t have any data that might be used to answer this question to any worthwhile degree of accuracy. I know of situations where I believe the bespoke software did cost a lot more than the hardware, and I know of some where the hardware cost more (I have never been privy to exact numbers on large projects).

Where did the pre-1973 data come from?

The USAF funded the creation of lots of source code, and the reports cite hardware and software figures from 1972.

To summarise: the above plot is for USAF spending on bespoke command and control hardware and software, and is extrapolated from 1973 into the future.

A Little Bit Slinky – a.k.

a.k. from thus spake a.k.

For several months we've been taking a look at cluster analysis, which seeks to partition sets of data into subsets of similar data, known as clusters. Most recently we have focused our attention on hierarchical clusterings, which are sequences of sets of clusters in which pairs of data that belong to the same cluster at one step belong to the same cluster in the next step.
A simple way of constructing them is to initially place each datum in its own cluster and then iteratively merge the closest pairs of clusters in each clustering to produce the next one in the sequence, stopping when all of the data belong to a single cluster. We have considered three ways of measuring the distance between pairs of clusters: the average distance between their members, the distance between their closest members and the distance between their farthest members, known as average linkage, single linkage and complete linkage respectively. We have also implemented a reasonably efficient algorithm for generating hierarchical clusterings defined with them, using a min-heap structure to cache the distances between clusters.
Finally, I claimed that there is a more efficient algorithm for generating single linkage hierarchical clusterings, one that would make the sorting of clusters by size in our ak.clustering type too expensive. So last time we implemented the ak.rawClustering type to represent clusterings without sorting their clusters, which we shall now use in the implementation of that algorithm.
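
The simple construction described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the ak library's implementation: it recomputes every cluster distance at each step rather than caching them in a min-heap, it merges one closest pair per step, and it is not the more efficient single linkage algorithm that the post goes on to implement. The names `cluster_distance` and `hierarchical_clustering` are made up for this sketch.

```python
from itertools import combinations

def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b under the named linkage."""
    pairwise = [dist(x, y) for x in a for y in b]
    if linkage == "single":      # closest members
        return min(pairwise)
    if linkage == "complete":    # farthest members
        return max(pairwise)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def hierarchical_clustering(data, dist, linkage="single"):
    """Return the sequence of clusterings, from singletons to one cluster."""
    clusters = [[x] for x in data]
    sequence = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the indices of the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(
                clusters[ij[0]], clusters[ij[1]], dist, linkage),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        sequence.append([list(c) for c in clusters])
    return sequence

# Hypothetical one-dimensional data, using absolute difference as distance.
points = [0.0, 1.0, 5.0, 6.5, 20.0]
steps = hierarchical_clustering(points, lambda x, y: abs(x - y), "single")
```

Each entry of `steps` is one clustering in the sequence, so a pair of points that shares a cluster in one entry shares a cluster in every later entry.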

CppCon 2019 Trip Report and Slides

Anthony Williams from Just Software Solutions Blog

Having been back from CppCon 2019 for over a week, I thought it was about time I wrote up my trip report.

The venue

This year, CppCon was at a new venue: the Gaylord Rockies Resort near Denver, Colorado, USA. This is a huge conference centre, currently surrounded by vast tracts of empty space, though people told me there were many plans for developing the surrounding area.

They were hosting multiple conferences and events alongside CppCon; it was quite amusing to emerge from the conference rooms and find oneself surrounded by people in ballgowns and fancy evening wear for an event in the nearby ballroom!

There was a choice of eating establishments, but they all had one thing in common: they were overpriced, taking advantage of the captive nature of the hotel clientele. The food was reasonably nice though.

The size of the venue did make for a fair amount of walking around between sessions.

Overall the venue was nice, and the staff were friendly and helpful.

Pre-conference Workshop

I ran a 2-day pre-conference class, entitled More Concurrent Thinking in C++: Beyond the Basics, which was for those looking to move beyond the basics of threads and locks to the next level: high level library and application design, as well as lock-free programming with atomics. This was well attended, and I had interesting discussions with people over lunch and in the evening.

If you would like to book this course for your company, please see my training page.

The main conference

Bjarne Stroustrup kicked off the main conference with his presentation on "C++20: C++ at 40". Bjarne reiterated his vision for C++, and outlined some of the many nice language and library features we have to make development easier, and code clearer and less error-prone.

Matt Godbolt's presentation on "Compiler Explorer: Behind the Scenes" was good and entertaining. Matt showed how he'd evolved Compiler Explorer from a simple script to the current website, and demonstrated some nifty things about it along the way, including features you might not have known about such as the LLVM instruction cost view, or the new "run your code" facility.

In "If You Can't Open It, You Don't Own It", Matt Butler talked about security and trust, and how bad things can happen if something you trust is compromised. Mostly this was obvious if you thought about it, but not something we necessarily do think about, so it was nice to be reminded, especially with the concrete examples. His advice on what we can do to build more secure systems, and on the existing and proposed C++ features that help, was also good.

Barbara Geller and Ansel Sermersheim made an enthusiastic duo presenting "High performance graphics and text rendering on the GPU for any C++ application". I am excited about the potential for their Copperspice wrapper for the Vulkan rendering library: rendering 3D graphics portably is hard, and text more so.

Andrew Sutton's presentation on "Reflections: Compile-time Introspection of Source Code" was an interesting end to Monday. There is a lot of scope for eliminating boilerplate if we can use reflection, so it is good to see the progress being made on it.

Tuesday morning began with a scary question posed by Michael Wong, Paul McKenney and Maged Michael: "Will Your Code Survive the Attack of the Zombie Pointers?" Currently, if you delete an object or call free, all copies of the pointer to that memory immediately become invalid across all threads. Since invalid pointers can't even be compared, this can result in zombies eating your brains. Michael, Paul and Maged looked at what we can do in our code to avoid this, and what they are proposing for the C++ Standard to fix the problem.

Andrei Alexandrescu's presentation on "Speed is found in the minds of people" was an insightful look at optimizing sort. Andrei showed how compiler and processor features mean that performance can be counter-intuitive, and code with a higher algorithmic complexity can run faster in the right conditions. Always use infinite loops (except for most cases).

I love the interactive slides in Hana Dusikova's presentation "A State of Compile Time Regular Expressions". She is pushing the boundaries of compile-time coding to make our code perform better at runtime. std::regex can be slow compared to other regular expression libraries, but ctre can be much better. I am excited to see how this can be extended to compile-time parsing of other DSLs.

In "Applied WebAssembly: Compiling and Running C++ in Your Web Browser", Ben Smith showed the use of WebAssembly as a target to allow you to write high-performance C++ code that will run in a suitable web browser on any platform, much like the "Write once, run anywhere" promise of Java. I am interested to see where this can lead.

Samy Al Bahra and Paul Khuong presented the final session I attended: "Abusing Your Memory Model for Fun and Profit". They discussed how they have written code that relies on the stronger memory ordering requirements imposed by x86 CPUs, over and above the standard C++ memory model, in order to write high-performance concurrent data structures. I am intrigued to see if any of their techniques can be used in a portable fashion, or used to improve Just::Thread Pro.

Whiteboard code

This year there were a few whiteboards around the conference area for people to use for impromptu discussions. One of them had a challenge written on it:

"Can you write a requires expression that ensures a class has a member function with a specified signature?"

This led to a lot of discussion, which Arthur O'Dwyer wrote up as a blog post. Though the premise of the question is wrong (we shouldn't want to constrain on such specifics), it was fun, interesting and enlightening trying to think how one might do it — it allows you to explore the corner cases of the language in ways that might turn out to be useful later.

My presentation

As well as the workshop, I presented a talk on "Concurrency in C++20 and beyond", which was on Tuesday afternoon. It was in an intermediate-sized room, and I believe was well attended, though it was hard to see the audience with the bright stage lighting. There were a number of interesting questions from the audience addressing the issues raised in my presentation, which is always good, though the acoustics did make it hard to hear some of them.

Slides are available here.


So that was an overview of another awesome CppCon. I love the in-person interactions with so many people involved in using C++ for such a wide variety of things. Everyone has their own perspective, and I always learn something.

The videos are being uploaded incrementally to the CppCon YouTube channel, so hopefully the video of my presentation and the ones above that aren't already available will be uploaded soon.
