Dimensional analysis of the Halstead metrics

Derek Jones from The Shape of Code

One of the driving forces behind the Halstead complexity metrics was physics envy; the early reports by Halstead use the terms software physics and software science.

One very simple, and effective technique used by scientists and engineers to check whether an equation makes sense, is dimensional analysis. The basic idea is that when performing an operation between two variables, their measurement units must be consistent; for instance, two lengths can be added, but a length and a time cannot be added (a length can be divided by time, returning distance traveled per unit time, i.e., velocity).

Let’s run a dimensional analysis check on the Halstead equations.

The input variables to the Halstead metrics are: eta_1, the number of distinct operators, eta_2, the number of distinct operands, N_1, the total number of operators, and N_2, the total number of operands. These quantities can be interpreted as units of measurement in tokens.

The formula are:

  • Program length: N = N_1 + N_2
    There is a consistent interpretation of this equation: operators and operands are both kinds of tokens, and number of tokens can be interpreted as a length.
  • Calculated program length: hat{N} = eta_1 log_2 eta_1 + eta_2 log_2 eta_2
    There is a consistent interpretation of this equation: the operand of a logarithm has to be dimensionless, and the convention is to treat the operand as a ratio (if no denominator is specified, the value 1 is taken), the value returned is dimensionless, which can be multiplied by a variable having any kind of dimension; so again two (token) lengths are being added.
  • Volume: V = N * log_2 eta
    A volume has units of length^3 (i.e., it is created by multiplying three lengths). There is only one length in this equation; the equation is misnamed, it is a length.
  • Difficulty: D = {eta_1 / 2 } * {N_2 / eta_2}
    Here the dimensions of eta_1 and eta_2 cancel, leaving the dimensions of N_2 (a length); now Halstead is interpreting length as a difficulty unit (whatever that might be).
  • Effort: E =  D * V
    This equation multiplies two variables, both having a length dimension; the result should be interpreted as an area. In physics work is force times distance, and power is work per unit time; the term effort is not defined.

Halstead is claiming that a single dimension, program length, contains so much unique information that it can be used as a measure of a variety of disparate quantities.

Halstead’s colleagues at Purdue were rather damming in their analysis of these metrics. Their report Software Science Revisited: A Critical Analysis of the Theory and Its Empirical Support points out the lack of any theoretical foundation for some of the equations, that the analysis of the data was weak and that a more thorough analysis suggests theory and data don’t agree.

I pointed out in an earlier post, that people use Halstead’s metrics because everybody else does. This post is unlikely to change existing herd behavior, but it gives me another page to point people at, when people ask why I laugh at their use of these metrics.

C2X and undefined behavior

Derek Jones from The Shape of Code

The ISO C Standard is currently being revised by WG14, to create C2X.

There is a rather nebulous clustering of people who want to stop compilers using undefined behaviors to generate what these people (and probably most other developers) consider to be very surprising code. For instance, always printing p is truep is false, when executing the code: bool p; if ( p ) printf("p is true"); if ( !p ) printf("p is false"); (possible because p is uninitialized, and accessing an uninitialized value is undefined behavior).

This sounds like a good thing; nobody wants compilers generating surprising code.

All the proposals I have seen, so far, involve doing away with constructs that can produce undefined behavior. Again, this sounds like a good thing; nobody likes undefined behaviors.

The problem is, there is a reason for labeling certain constructs as producing undefined behavior; the behavior is who-knows-what.

Now the C Standard could specify the who-knows-what behavior; for instance, it could specify that the result of dividing by zero is 42. Standard’s conforming compilers would then have to generate code to check whether the denominator was zero, and return 42 for this case (until Intel, ARM and other processor vendors ‘updated’ the behavior of their divide instructions). Way-back-when a design decision was made, the behavior of divide by zero is undefined, not 42 or any other value; this was a design decision, code efficiency and compactness was considered to be more important.

I have not seen anybody arguing that the behavior of divide by zero should be specified. But I have seen people arguing that once C’s integer representation is specified as being twos-compliment (currently it can also be ones-compliment or signed-magnitude), then arithmetic overflow becomes defined. Wrong.

Twos-compliment is a specification of a representation, not a specification of behavior. What is the behavior when the result of adding two integers cannot be represented? The result might be to wrap (the behavior expected by many developers), to saturate at the maximum value (frequently needed in image and signal processing), to raise a signal (overflow is not usually supposed to happen), or something else.

WG14 could define the behavior, for when the result of an arithmetic operation is not representable in the number of bits available. Standard’s conforming compilers targeting processors whose arithmetic instructions did not behave as required would have to generate code, for any operation that could overflow, to do what was necessary. The embedded market are heavy users of C; in this market memory is limited, and processor performance is never fast enough, the overhead of supporting a defined behavior could just be too high (a more attractive solution is to code review, to make sure the undefined behavior cannot occur).

Is there another way of addressing the issue of compiler writers’ use/misuse of undefined behavior? Yes, offer them money. Compiler writing is a business, at least at the level at which gcc and llvm operate. If people really are keen to influence the code generated by gcc and llvm, money is the solution. Wot, no money? Then stop complaining.

Build with a different Java version (e.g. 11) using Docker

Andy Balaam from Andy Balaam's Blog

To spin up a temporary environment with a different Java version without touching your real environment, try this Docker command:

docker run -i -t --mount "type=bind,src=$PWD,dst=/code" openjdk:11-jdk bash

(Change “11-jdk” to the version you want as listed on the README.)

Then you can build the code inside the current directory something like this:

cd code
./gradlew test

Or similar for other build tools, although you may need to install them first.

ACCU 2019 Slides and Trip Report

Anthony Williams from Just Software Solutions Blog

I attended ACCU 2019 a couple of weeks ago, where I was presenting my session Here's my number; call me, maybe. Callbacks in a multithreaded world.

The conference proper started on Wednesday, after a day of pre-conference workshops on the Tuesday, and continued until Saturday. I was only there Wednesday to Friday.

Wednesday

I didn't arrive until Wednesday lunchtime, so I missed the first keynote and morning sessions. I did, however get to see Ivan Čukić presenting his session on Ranges for distributed and asynchronous systems. This was an interesting talk that covered similar ground to things I've thought about before. It was good to see Ivan's take, and think about how it differed to mine. It was was also good to see how modern C++ techniques can produce simpler code than I had when I thought about this a few years ago. Ivan's approach is a clean design for pipelined tasks that allows implicit parallelism.

After the break I then went to Gail Ollis's presentation and workshop on Helping Developers to Help Each Other . Gail shared some of her research into how developers feel about various aspects of software development, from the behaviour of others to the code that they write. She then got us to try one of the exercises she talked about in small groups. By picking developer behaviours from the cards she provided to each group, and telling stories about how that behaviour has affected us, either positively or negatively, we can share our experiences, and learn from each other.

Thursday

First up on Thursday was Herb Sutter's keynote: De-fragmenting C++: Making exceptions more affordable and usable . Herb was eloquent as always, talking about his idea for making exceptions in C++ lower cost, so that they can be used in all projects: a significant number of projects currently ban exceptions from at least some of their code. I think this is a worthwhile aim, and hope to see something like Herb's ideas get accepted for C++ in a future standard.

Next up was my session, Here's my number; call me, maybe. Callbacks in a multithreaded world. It was well attended, with interesting questions from the audience. My slides are available here, and the video is available on youtube. Several people came up to me later in the conference to say that they had enjoyed my talk, and that they thought it would be useful for them in their work, which pleased me no end: this is what I always hope to achieve from my presentations.

Thursday lunchtime was taken up with book signings. I was one of four authors of recently-published programming books set up in the conservatory area of the hotel to sell copies of our books, and sign books for people. I sold plenty, and signed more, which was great.

Kate Gregory's talk on What Do We Mean When We Say Nothing At All? was after lunch. She discussed the various places in C++ where we can choose to specify something (such as const, virtual, or explicit), but we don't have to. Can we interpret meaning from the lack of an annotation? If your codebase uses override everywhere, except in one place, is that an accidental omission, or is it a flag to say "this isn't actually an override of the base class function"? Is it a good or bad idea to omit the names of unused parameters? There was a lot to think about with this talk, but the key takeaway for me is Consistency is Key: if you are consistent in your use of optional annotations, then deviation from your usual pattern can convey meaning to the reader, whereas if you are inconsistent then the reader cannot infer anything.

The final session I attended on Thursday was the C++ Pub Quiz, which was hosted by Felix Petriconi. The presented code was intended to confuse, and elicit exclamations of "WTF!", and succeeded on both counts. However, it was fun as ever, helped by the free drinks, and the fact that my team "Ungarian Notation" were the eventual winners.

Friday

Friday was the last day of the conference for me (though there the conference had another full day on Saturday). It started with Paul Grenyer's keynote on the trials and tribulations of trying to form a "community" for developers in Norwich, with meet-ups and conferences. Paul managed to be entertaining, but having followed Paul's blog for a few years, there wasn't anything that was new to me.

Interactive C++ : Meet Jupyter / Cling - The data scientist's geeky younger sibling was the next session I attended, presented by Neil Horlock. This was an interesting session about cling, a C++ interpreter, complete with a REPL, and how this can be combined with Jupyter notebooks to create a wiki with embedded code that you can edit and run. Support for various libraries allows to write code to plot graphs and maps and things, and have the graphs appear right there in the web page immediately. This is an incredibly powerful tool, and I had discussions with people afterwards about how this could be used both as an educational tool, and for "live" documentation and customer-facing tests: "here is sample code, try it out right now" is an incredibly powerful thing to be able to say.

After lunch I went to see Andreas Weis talk about Taming Dynamic Memory - An Introduction to Custom Allocators. This was a good introduction to various simple allocators, along with how and why you might use them in your C++ code. With John Lakos in the front row, Andreas had to field many questions. I had hoped for more depth, but I thought the material was well-paced, and so there wouldn't have been time; that would have been quite a different presentation, and less of an "introduction".

The final session I attended was Elsewhere Memory by Niall Douglas. Niall talked about the C++ object model, and how that can cause difficulties for code that wants to serialize the binary representation of objects to disk, or over the network, or wants to directly share memory with another process. Niall is working on a standardization proposal which would allow creating objects "fully formed" from a binary representation, without running a constructor, and would allow terminating the lifetime of an object without running its destructor. This is a difficult area as it interacts with compilers' alias analysis and the normal deterministic lifetime rules. However, this is an area where people currently do have "working" code that violates the strict lifetime rules of the standard, so it would be good to have a way of making such code standards-conforming.

Between the Sessions

The sessions at a conference at ACCU are great, and I always enjoy attending them, and often learn things. However, you can often watch these on Youtube later. One of the best parts of physically attending a conference is the discussions had in person before and after the sessions. It is always great to chat to people in person who you primarily converse with via email, and it is exciting to meet new people.

The conference tries to encourage attendees to be open to new people joining discussions with the "Pacman rule" — don't form a closed circle when having a discussion, but leave a space for someone to join. This seemed to work well in practice.

I always have a great time at ACCU conferences, and this one was no different.

Posted by Anthony Williams
[/ news /] permanent link
Tags: , , , ,
Stumble It! stumbleupon logo | Submit to Reddit reddit logo | Submit to DZone dzone logo

Comment on this post

Follow me on Twitter

On Pennies From Heaven – student

student from thus spake a.k.

Recall that the Baron and Sir R-----'s most recent wager first had the Baron place three coins upon the table, choosing either heads or tails for each in turn, after which Sir R----- would follow suit. They then set to tossing coins until a run of three matched the Baron's or Sir R-----'s coins from left to right, with the Baron having three coins from Sir R----- if his made a match and Sir R----- having two from the Baron if his did.

When the Baron described the manner of play to me I immediately pointed out to him that it was Penney-Ante, which I recognised because one of my fellow students had recently employed it to enjoy a night at the tavern entirely at the expense of the rest of us! He was able to do so because it's an example of an intransitive wager in which the second player can always contrive to make a choice that will best the first player's.

OSI licenses: number and survival

Derek Jones from The Shape of Code

There is a lot of source code available which is said to be open source. One definition of open source is software that has an associated open source license. Along with promoting open source, the Open Source Initiative (OSI) has a rigorous review process for open source licenses (so they say, I have no expertise in this area), and have become the major licensing brand in this area.

Analyzing the use of licenses in source files and packages has become a niche research topic. The majority of source files don’t contain any license information, and, depending on language, many packages don’t include a license either (see Understanding the Usage, Impact, and Adoption of Non-OSI Approved Licenses). There is some evolution in license usage, i.e., changes of license terms.

I knew that a fair-few open source licenses had been created, but how many, and how long have they been in use?

I don’t know of any other work in this area, and the fastest way to get lots of information on open source licenses was to scrape the brand leader’s licensing page, using the Wayback Machine to obtain historical data. Starting in mid-2007, the OSI licensing page kept to a fixed format, making automatic extraction possible (via an awk script); there were few pages archived for 2000, 2001, and 2002, and no pages available for 2003, 2004, or 2005 (if you have any OSI license lists for these years, please send me a copy).

What do I now know?

Over the years OSI have listed 110 different open source licenses, and currently lists 81. The actual number of license names listed, since 2000, is 205; the ‘extra’ licenses are the result of naming differences, such as the use of dashes, inclusion of a bracketed acronym (or not), license vs License, etc.

Below is the Kaplan-Meier survival curve (with 95% confidence intervals) of licenses listed on the OSI licensing page (code+data):

Survival curve of OSI licenses.

How many license proposals have been submitted for review, but not been approved by OSI?

Patrick Masson, from the OSI, kindly replied to my query on number of license submissions. OSI don’t maintain a count, and what counts as a submission might be difficult to determine (OSI recently changed the review process to give a definitive rejection; they have also started providing a monthly review status). If any reader is keen, there is an archive of mailing list discussions on license submissions; trawling these would make a good thesis project :-)

ACCUConf 2019

Fran from BuontempoConsulting

The ACCU conference happened in Bristol again this year. For my first time ever, I was at a workshop. In fact, I ran a work shop with Chris Simons. We talked about Evolutionary Algorithms in practice. We gave a 90 minute talk later in the week, using the same Java framework (JCLEC), with slides here. High level summary: can your computer solve problems using a few random guesses and iteratively improve, using crossover (merging together previous attempts) and mutation (nudge things up or down, or flip bits). Answer yes: we managed to turn on all the bits in an array, code our way out of a paper bag (by firing virtual cannon balls based on an old Overload article), and finally in the 90 minute workshop, we generated code for Fizz Buzz.   

The full schedule is here. Having five tracks meant I missed lots, but talks will appear on the YouTube channel over time. I'll just give brief notes on a handful of talks I attended. The opening keynote was "Delivering software that is secure and usable - who’s job is it?" by  M Angela Sasse. Angela called out StackOverflow being functionally great but the security advice being bad, in contrast to using an official manual, wherein the security advice is great but it's functionally worse. This was based on measuring several developers attempting to use a software product. How can you actually measure security or usability? How are you currently measuring it? Mention was made of hard to follow security rules, which people tend to work around. Angela called for a way to reprogram the security experts. How good are they at conflict resolution? Do they have social marketing skills? Twitter devolved into quips about social engineering at that point.My final note says,
Programmers are tribal and seek approval. Try to trust and collaborate instead.

Next I'll mention "10 Techniques to Understand Code You Don’t Know" by Jonathan Boccara. He's written a book, which I've seen several people recommend. The 10 techniques fell into three groups: explore, speed read, and detail.
Exploring covered

  • using and finding the I/O frameworks, 
  • performing local analysis - getting the hang of one or two important functions, 
  • analysing call stacks to join the dots between modules. 

Speed reading covered

  • reading the end first - don't be put off by a long function, find the output or returns and worry about the rest later,
  • find frequent words, both count and span (total and lines with words)
  • filter on flow control - giving something like a table of contents for a book
  • scan for the main action - feel free to ignore catch blocks or elses, focus on the happy path

Finally, you sometimes need to start going into detail

  • try scratch refactoring
  • practice writing functions in the code
  • team up - strive for pair understanding

There was a discussion about flame graphs at the end, and he mentioned "How to read a book: the classic guide to intelligent reading" by Alder and van Doren. This points out you don't need to read a non-fiction book in order. Jump around, follow back links, jump straight in to what you want to learn. Very non-linear.

Next, I'll talk about "The anatomy of an exploit" by Patricia Aas. She started by mentioning the weird machine. You can see most programs as a finite state machine. An exploit jumps out of the finite states into other, unintended states. She looked at CWE-242; a list of potentially dangerous functions. The CWE is the common weakest enumeration, available online, listing things to avoid. Her talked pulled on things that might go wrong with gets or std::cin. Surprisingly, you get more warnings from C than C++. By disabling warnings, one at a time, we ended up with code to get a prompt. Once you have a shell on another machine, you can then do a variety of nefarious things.  She covered loads of things including ASLR; address randomisation, heap grooming and use after free. Security was a definite theme at this conference, and many developers understand far too little about it.

Herb Sutter gave Thursday's keynote on "De-fragmenting C++: Making exceptions more affordable and usable". He called out a divide between teams who can and who cannot use exceptions. Many libraries have a mix of exceptions and return code. He said "Pick a lane". C++ is supposed to be zero overhead

  • a feature only costs if it's used
  • it's better than coding it yourself

This is not true for exceptions. He considered the difference between program recoverable and non-recoverable errors. What can you do about stack exhaustion, for example? Who do you report problems to? Humans or code? Exceptions are automatic (a good thing) and invisible (a bad thing). He sketched out ways we could make exceptions have zero overhead. What this space.

Anthony Williams then talked about aysnc, executors and callbacks: "Here’s my number; call me, maybe. Callbacks in a multithreaded world". He called out a few things to be aware of. Does the order of your callbacks matter? Can you deregister them? He encouraged us to capture by value, rather than reference, unless we have a really good excuse.

At lunchtime, there was a book signing. I sold several copies of my book; "Genetic Algorithms and machine learning for programmers." Three others,  Anthony Williams, Ivan Cukic, and Jonathan Boccarra were also selling books, but I didn't get a chance to go talk to them. Thanks to ACCU for the chance to do this. I put mine in paper bags, and even wrote a receipt on one. The chapters in mine show how to code your way out of a paper bag, so it seemed sensible.



I gave a session with Chris Simons about how to teach your computer to code Fizz Buzz. We plan to write this up for ACCU's Overload magazine shortly.

On Friday, Paul Grenyer gave the keynote. He reminisced about people he'd met when he was an ACCU member, and all the things he'd done, some that worked and some that didn't, in Norwich, to grow the tech network. There's now a background discussion on accu-general and accu-members email lists about how to revive some things we used to do, and find new things to do, that will be valuable to the group. I'd love to see the mentored developers reboot.

Next, I went to "Interactive C++ : Meet Jupyter / Cling - The data scientist’s geeky younger sibling" by Neil Horlock. He talked about Code Club, and teaching people. This led nicely into using Cling/Jupyter to have notebooks for C++. Cling is an interactive (JITted) version of clang. I can't do his talk justice here. It was amazing. It managed to cope with templates, and a variety of things that blew my mind. He demonstrated using RISE to make RevealJS slides from a notebook, so I think I was watching a talk in a talk in a talk.

My notes have run out at this point. At the speakers' dinner we met EchoBorg. An actor (an echoborg) voiced the words of a chatbot. People volunteered to be interviewed to become an echoborg themselves. This set of cyberpunk style SciFi in my head. Again, I won't do it justice, but watching the conversation develop was incredibly interesting. Have a look at their websites:



I went to two talks on Saturday: "Windows Native API" by Roger Orr and
"Best practices when accessing Big Data or any other data!" by  Rosemary Francis. I was too tired to make notes by that point, and we left early, since we had a three hour drive home. Rog considered several ways to return 42 from a program and showed several steps that happen before main; something people don't always consider. He touched on security too. I didn't stay to the end of Rosemary's, but she was talking about tooling her company has developed to watch programs using big data and tracing bottlenecks. In my opinion, many data scientists make mistakes some programmers might avoid. Her first example was opening and closing a file in a loop. I wish machine learners and programmers could talk to each other more and help each other out.

I had a great conference, and the includecpp crew were there. I dipped in and out of the discord chat. It's lovely to see people supporting each other. Simple things, like chats about where to go for dinner.

Echoborg has left dystopian Sci-Fi short stories brewing in my head, and the Jupyter/cling talk left me with lots to explore. Thanks ACCUConf. Hope to be there next year.