OSI licenses: number and survival

Derek Jones from The Shape of Code

There is a lot of source code available which is said to be open source. One definition of open source is software that has an associated open source license. Along with promoting open source, the Open Source Initiative (OSI) has a rigorous review process for open source licenses (so they say, I have no expertise in this area), and have become the major licensing brand in this area.

Analyzing the use of licenses in source files and packages has become a niche research topic. The majority of source files don’t contain any license information, and, depending on language, many packages don’t include a license either (see Understanding the Usage, Impact, and Adoption of Non-OSI Approved Licenses). There is some evolution in license usage, i.e., changes of license terms.

I knew that a fair-few open source licenses had been created, but how many, and how long have they been in use?

I don’t know of any other work in this area, and the fastest way to get lots of information on open source licenses was to scrape the brand leader’s licensing page, using the Wayback Machine to obtain historical data. Starting in mid-2007, the OSI licensing page kept to a fixed format, making automatic extraction possible (via an awk script); there were few pages archived for 2000, 2001, and 2002, and no pages available for 2003, 2004, or 2005 (if you have any OSI license lists for these years, please send me a copy).

What do I now know?

Over the years OSI have listed 110 different open source licenses, and currently lists 81. The actual number of license names listed, since 2000, is 205; the ‘extra’ licenses are the result of naming differences, such as the use of dashes, inclusion of a bracketed acronym (or not), license vs License, etc.

Below is the Kaplan-Meier survival curve (with 95% confidence intervals) of licenses listed on the OSI licensing page (code+data):

Survival curve of OSI licenses.

How many license proposals have been submitted for review, but not been approved by OSI?

Patrick Masson, from the OSI, kindly replied to my query on number of license submissions. OSI don’t maintain a count, and what counts as a submission might be difficult to determine (OSI recently changed the review process to give a definitive rejection; they have also started providing a monthly review status). If any reader is keen, there is an archive of mailing list discussions on license submissions; trawling these would make a good thesis project :-)

ACCUConf 2019

Fran from BuontempoConsulting

The ACCU conference happened in Bristol again this year. For my first time ever, I was at a workshop. In fact, I ran a work shop with Chris Simons. We talked about Evolutionary Algorithms in practice. We gave a 90 minute talk later in the week, using the same Java framework (JCLEC), with slides here. High level summary: can your computer solve problems using a few random guesses and iteratively improve, using crossover (merging together previous attempts) and mutation (nudge things up or down, or flip bits). Answer yes: we managed to turn on all the bits in an array, code our way out of a paper bag (by firing virtual cannon balls based on an old Overload article), and finally in the 90 minute workshop, we generated code for Fizz Buzz.   

The full schedule is here. Having five tracks meant I missed lots, but talks will appear on the YouTube channel over time. I'll just give brief notes on a handful of talks I attended. The opening keynote was "Delivering software that is secure and usable - who’s job is it?" by  M Angela Sasse. Angela called out StackOverflow being functionally great but the security advice being bad, in contrast to using an official manual, wherein the security advice is great but it's functionally worse. This was based on measuring several developers attempting to use a software product. How can you actually measure security or usability? How are you currently measuring it? Mention was made of hard to follow security rules, which people tend to work around. Angela called for a way to reprogram the security experts. How good are they at conflict resolution? Do they have social marketing skills? Twitter devolved into quips about social engineering at that point.My final note says,
Programmers are tribal and seek approval. Try to trust and collaborate instead.

Next I'll mention "10 Techniques to Understand Code You Don’t Know" by Jonathan Boccara. He's written a book, which I've seen several people recommend. The 10 techniques fell into three groups: explore, speed read, and detail.
Exploring covered

  • using and finding the I/O frameworks, 
  • performing local analysis - getting the hang of one or two important functions, 
  • analysing call stacks to join the dots between modules. 

Speed reading covered

  • reading the end first - don't be put off by a long function, find the output or returns and worry about the rest later,
  • find frequent words, both count and span (total and lines with words)
  • filter on flow control - giving something like a table of contents for a book
  • scan for the main action - feel free to ignore catch blocks or elses, focus on the happy path

Finally, you sometimes need to start going into detail

  • try scratch refactoring
  • practice writing functions in the code
  • team up - strive for pair understanding

There was a discussion about flame graphs at the end, and he mentioned "How to read a book: the classic guide to intelligent reading" by Alder and van Doren. This points out you don't need to read a non-fiction book in order. Jump around, follow back links, jump straight in to what you want to learn. Very non-linear.

Next, I'll talk about "The anatomy of an exploit" by Patricia Aas. She started by mentioning the weird machine. You can see most programs as a finite state machine. An exploit jumps out of the finite states into other, unintended states. She looked at CWE-242; a list of potentially dangerous functions. The CWE is the common weakest enumeration, available online, listing things to avoid. Her talked pulled on things that might go wrong with gets or std::cin. Surprisingly, you get more warnings from C than C++. By disabling warnings, one at a time, we ended up with code to get a prompt. Once you have a shell on another machine, you can then do a variety of nefarious things.  She covered loads of things including ASLR; address randomisation, heap grooming and use after free. Security was a definite theme at this conference, and many developers understand far too little about it.

Herb Sutter gave Thursday's keynote on "De-fragmenting C++: Making exceptions more affordable and usable". He called out a divide between teams who can and who cannot use exceptions. Many libraries have a mix of exceptions and return code. He said "Pick a lane". C++ is supposed to be zero overhead

  • a feature only costs if it's used
  • it's better than coding it yourself

This is not true for exceptions. He considered the difference between program recoverable and non-recoverable errors. What can you do about stack exhaustion, for example? Who do you report problems to? Humans or code? Exceptions are automatic (a good thing) and invisible (a bad thing). He sketched out ways we could make exceptions have zero overhead. What this space.

Anthony Williams then talked about aysnc, executors and callbacks: "Here’s my number; call me, maybe. Callbacks in a multithreaded world". He called out a few things to be aware of. Does the order of your callbacks matter? Can you deregister them? He encouraged us to capture by value, rather than reference, unless we have a really good excuse.

At lunchtime, there was a book signing. I sold several copies of my book; "Genetic Algorithms and machine learning for programmers." Three others,  Anthony Williams, Ivan Cukic, and Jonathan Boccarra were also selling books, but I didn't get a chance to go talk to them. Thanks to ACCU for the chance to do this. I put mine in paper bags, and even wrote a receipt on one. The chapters in mine show how to code your way out of a paper bag, so it seemed sensible.



I gave a session with Chris Simons about how to teach your computer to code Fizz Buzz. We plan to write this up for ACCU's Overload magazine shortly.

On Friday, Paul Grenyer gave the keynote. He reminisced about people he'd met when he was an ACCU member, and all the things he'd done, some that worked and some that didn't, in Norwich, to grow the tech network. There's now a background discussion on accu-general and accu-members email lists about how to revive some things we used to do, and find new things to do, that will be valuable to the group. I'd love to see the mentored developers reboot.

Next, I went to "Interactive C++ : Meet Jupyter / Cling - The data scientist’s geeky younger sibling" by Neil Horlock. He talked about Code Club, and teaching people. This led nicely into using Cling/Jupyter to have notebooks for C++. Cling is an interactive (JITted) version of clang. I can't do his talk justice here. It was amazing. It managed to cope with templates, and a variety of things that blew my mind. He demonstrated using RISE to make RevealJS slides from a notebook, so I think I was watching a talk in a talk in a talk.

My notes have run out at this point. At the speakers' dinner we met EchoBorg. An actor (an echoborg) voiced the words of a chatbot. People volunteered to be interviewed to become an echoborg themselves. This set of cyberpunk style SciFi in my head. Again, I won't do it justice, but watching the conversation develop was incredibly interesting. Have a look at their websites:



I went to two talks on Saturday: "Windows Native API" by Roger Orr and
"Best practices when accessing Big Data or any other data!" by  Rosemary Francis. I was too tired to make notes by that point, and we left early, since we had a three hour drive home. Rog considered several ways to return 42 from a program and showed several steps that happen before main; something people don't always consider. He touched on security too. I didn't stay to the end of Rosemary's, but she was talking about tooling her company has developed to watch programs using big data and tracing bottlenecks. In my opinion, many data scientists make mistakes some programmers might avoid. Her first example was opening and closing a file in a loop. I wish machine learners and programmers could talk to each other more and help each other out.

I had a great conference, and the includecpp crew were there. I dipped in and out of the discord chat. It's lovely to see people supporting each other. Simple things, like chats about where to go for dinner.

Echoborg has left dystopian Sci-Fi short stories brewing in my head, and the Jupyter/cling talk left me with lots to explore. Thanks ACCUConf. Hope to be there next year.



The Algorithmic Accountability Act of 2019

Derek Jones from The Shape of Code

The Algorithmic Accountability Act of 2019 has been introduced to the US congress for consideration.

The Act applies to “person, partnership, or corporation” with “greater than $50,000,000 … annual gross receipts”, or “possesses or controls personal information on more than— 1,000,000 consumers; or 1,000,000 consumer devices;”.

What does this Act have to say?

(1) AUTOMATED DECISION SYSTEM.—The term ‘‘automated decision system’’ means a computational process, including one derived from machine learning, statistics, or other data processing or artificial intelligence techniques, that makes a decision or facilitates human decision making, that impacts consumers.

That is all encompassing.

The following is what the Act is really all about, i.e., impact assessment.

(2) AUTOMATED DECISION SYSTEM IMPACT ASSESSMENT.—The term ‘‘automated decision system impact assessment’’ means a study evaluating an automated decision system and the automated decision system’s development process, including the design and training data of the automated decision system, for impacts on accuracy, fairness, bias, discrimination, privacy, and security that includes, at a minimum—

I think there is a typo in the following: “training, data” -> “training data”

(A) a detailed description of the automated decision system, its design, its training, data, and its purpose;

How many words are there in a “detailed description of the automated decision system”, and I’m guessing the wording has to be something a consumer might be expected to understand. It would take a book to describe most systems, but I suspect that a page or two is what the Act’s proposers have in mind.

(B) an assessment of the relative benefits and costs of the automated decision system in light of its purpose, taking into account relevant factors, including—

Whose “benefits and costs”? Is the Act requiring that companies do a cost benefit analysis of their own projects? What are the benefits to the customer, compared to a company not using such a computerized approach? The main one I can think of is that the customer gets offered a service that would probably be too expensive to offer if the analysis was done manually.

The potential costs to the customer are listed next:

(i) data minimization practices;

(ii) the duration for which personal information and the results of the automated decision system are stored;

(iii) what information about the automated decision system is available to consumers;

This act seems to be more about issues around data retention, privacy, and customers having the right to find out what data companies have about them

(iv) the extent to which consumers have access to the results of the automated decision system and may correct or object to its results; and

(v) the recipients of the results of the automated decision system;

What might the results be? Yes/No, on a load/job application decision, product recommendations are a few.

Some more potential costs to the customer:

(C) an assessment of the risks posed by the automated decision system to the privacy or security of personal information of consumers and the risks that the automated decision system may result in or contribute to inaccurate, unfair, biased, or discriminatory decisions impacting consumers; and

What is an “unfair” or “biased” decision? Machine learning finds patterns in data; when is a pattern in data considered to be unfair or biased?

In the UK, the sex discrimination act has resulted in car insurance companies not being able to offer women cheaper insurance than men (because women have less costly accidents). So the application form does not contain a gender question. But the applicants first name often provides a big clue, as to their gender. So a similar Act in the UK would require that computer-based insurance quote generation systems did not make use of information on the applicant’s first name. There is other, less reliable, information that could be used to estimate gender, e.g., height, plays sport, etc.

Lots of very hard questions to be answered here.

Wrapping up the Emacs on Mac OS X saga

Timo Geusch from The Lone C++ Coder's Blog

In a previous post I mentioned that I upgraded my homebrew install of Emacs after Emacs 26.2 was released, and noticed that I had lost its GUI functionality. That’s a pretty serious restriction for me as I usually end up with multiple frames across my desktop. I did end up installing the homebrew Emacs for […]

The post Wrapping up the Emacs on Mac OS X saga appeared first on The Lone C++ Coder's Blog.

Emacs 26.2 on WSL with working X-Windows UI

Timo Geusch from The Lone C++ Coder's Blog

I’ve blogged about building Emacs 26 on WSL before. The text mode version of my WSL build always worked for me out of the box, but the last time I tried running an X-Windows version, I ran into rendering issues.  Those rendering issues unfortunately made the GUI version of Emacs unusable on WSL. Nothing like […]

The post Emacs 26.2 on WSL with working X-Windows UI appeared first on The Lone C++ Coder's Blog.

And now, an Emacs with a working org2blog installation again

Timo Geusch from The Lone C++ Coder's Blog

I mentioned in my previous post that I somehow had ended up with a non-working org2blog installation. My suspicion is that this was triggered by my pinning of the htmlize package to the “wrong” repo. I had it pinned to marmalade rather than melpa-stable, and marmalade had an old version of htmlize (1.39, from memory). […]

The post And now, an Emacs with a working org2blog installation again appeared first on The Lone C++ Coder's Blog.

Unwelcome surprise – homebrew Emacs has no GUI after OS X Mojave update

Timo Geusch from The Lone C++ Coder's Blog

I finally got around to upgrading my OS X installation from Mojave to High Sierra – my OS update schedule is usually based on the old pilot wisdom of “don’t fly the A model of anything”. As part of the upgrade, I ended up reinstalling all homebrew packages including Emacs to make sure I was […]

The post Unwelcome surprise – homebrew Emacs has no GUI after OS X Mojave update appeared first on The Lone C++ Coder's Blog.

In The Neighbourhood – a.k.

a.k. from thus spake a.k.

A little under four years ago we saw how we could use the k means algorithm to divide sets of data into distinct subsets, known as clusters, whose members are in some sense similar to each other. The interesting thing about clustering is that even though we find it easy to spot clusters, at least in two dimensions, it's incredibly difficult to give them a firm mathematical definition and so clustering algorithms invariably define them implicitly as the subsets identified by this algorithm.
The k means algorithm, for example, does so by first picking k different elements of the data as cluster representatives and then places every element in the cluster whose representative is nearest to it. The cluster representatives are then replaced by the means of the elements assign to it and the process is repeated iteratively until the clusters stop changing.
Now I'd like to introduce some more clustering algorithms but there are a few things that we'll need first.

Scheduling a task in Java within a CompletableFuture

Andy Balaam from Andy Balaam's Blog

When we want to do something later in our Java code, we often turn to the ScheduledExecutorService. This class has a method called schedule(), and we can pass it some code to be run later like this:

    ScheduledExecutorService executor =
        Executors.newScheduledThreadPool(4);
    executor.schedule(
        () -> {System.out.println("..later");},
        1,
        TimeUnit.SECONDS
    );
    System.out.println("do...");
    // (Don't forget to shut down the executor later...)

The above code prints “do…” and then one second later it prints “…later”.

We can even write code that does some work and returns a result in a similar way:

    // (Make the executor as above.)
    ScheduledFuture future = executor.schedule(
        () -> 10 + 25, 1, TimeUnit.SECONDS);
    System.out.println("answer=" + future.get())

The above code prints “answer=35”. When we call get() it blocks waiting for the scheduler to run the task and mark the ScheduledFuture as complete, and then returns the answer to the sum (10 + 25) when it is ready.

This is all very well, but you may note that the Future returned from schedule() is a ScheduledFuture, and a ScheduledFuture is not a CompletableFuture.

Why do you care? Well, you might care if you want to do something after the scheduled task is completed. Of course, you can call get(), and block, and then do something, but if you want to react asynchronously without blocking, this won’t work.

The normal way to run some code after a Future has completed is to call one of the “then*” or “when*” methods on the Future, but these methods are only available on CompletableFuture, not ScheduledFuture.

Never fear, we have figured this out for you. We present a small wrapper for schedule that transforms your ScheduledFuture into a CompletableFuture. Here’s how to use it:

    CompletableFuture future =
        ScheduledCompletable.schedule(
            executor,
            () -> 10 + 25,
            1,
            TimeUnit.SECONDS
         );
    future.thenAccept(
        answer -> {System.out.println(answer);}
    );
    System.out.println("Answer coming...")

The above code prints “Answer coming…” and then “35”, so we can see that we don’t block the main thread waiting for the answer to come back.

So far, we have scheduled a synchronous task to run in the background after a delay, and wrapped its result in a CompletableFuture to allow us to chain more tasks after it.

In fact, what we often want to do is schedule a delayed task that is itself asynchronous, and already returns a CompletableFuture. In this case it feels particularly natural to get the result back as a CompletableFuture, but with the current ScheduledExecutorService interface we can’t easily do it.

It’s OK, we’ve figured that out too:

    Supplier> asyncTask = () ->
        CompletableFuture.completedFuture(10 + 25);
    CompletableFuture future =
        ScheduledCompletable.scheduleAsync(
            executor, asyncTask, 1, TimeUnit.SECONDS);
    future.thenAccept(
        answer -> {System.out.println(answer);}
    );
    System.out.println("Answer coming...")

The above code prints “Answer coming…” and then “35”, so we can see that our existing asynchronous task was scheduled in the background, and we didn’t have to block the main thread waiting for it. Also, under the hood, we are not blocking the ScheduledExecutorService‘s thread (from its pool) while the async task is running – that task just runs in whatever thread it was assigned when it was created. (Note: in our example we don’t really run an async task at all, but just immediately return a completed Future, but this does work for real async tasks.)

I know you’re wondering how we achieved all this. First, here’s how we run a simple blocking task in the background and wrap it in a CompletableFuture:

    public static  CompletableFuture schedule(
        ScheduledExecutorService executor,
        Supplier command,
        long delay,
        TimeUnit unit
    ) {
        CompletableFuture completableFuture = new CompletableFuture<>();
        executor.schedule(
            (() -> {
                try {
                    return completableFuture.complete(command.get());
                } catch (Throwable t) {
                    return completableFuture.completeExceptionally(t);
                }
            }),
            delay,
            unit
        );
        return completableFuture;
    }

And here’s how we delay execution of an async task but still return its result in a CompletableFuture:

    public static  CompletableFuture scheduleAsync(
        ScheduledExecutorService executor,
        Supplier> command,
        long delay,
        TimeUnit unit
    ) {
        CompletableFuture completableFuture = new CompletableFuture<>();
        executor.schedule(
            (() -> {
                command.get().thenAccept(
                    t -> {completableFuture.complete(t);}
                )
                .exceptionally(
                    t -> {completableFuture.completeExceptionally(t);return null;}
                );
            }),
            delay,
            unit
        );
        return completableFuture;
    }

Note that this should all work to run methods like exceptionally(), thenAccept(), whenComplete() etc.

Feedback and improvements welcome!