Does machine learning really involve data?

Frances Buontempo from BuontempoConsulting

Many definitions of machine learning start by proclaiming that it uses data to learn. I want to challenge this, or at least remind us where the term originally came from and consider why its meaning has shifted.

For a long time machine learning seemed to be a new technology, but I notice we're starting to use AI and machine learning interchangeably. Job postings often sneak the word scientist in there too. What is a data scientist? What do any of these words mean?

Current trends often come with an air of mystery. I suspect a lot of data science roles involve data entry, in order to clean input data. Not as appealing as the headline role suggests. Several day-to-day techniques described as machine learning could equally be described as statistics. In fact, look at the table of contents of a statistics book, such as An Introduction to Statistical Learning, and consider a small selection of the topics:

  • accuracy
  • k-means clustering
  • making predictions
  • cross-validation
  • support vector machines, SVM
  • principal component analysis, PCA


Most, if not all, of these topics are covered in an average machine learning course and included in ML software packages. Yet, to many people, statistics doesn't sound as exciting as machine learning.

Wikipedia defines statistics as "a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation." No mention of learning, though each of these activities forms an essential part of data science. The article goes on to discuss descriptive and inferential statistics. Inference involves making predictions: many people use the term machine learning to mean the very same thing. Can you spot patterns in purchases automatically and suggest other items a customer might be interested in? Can you detect unusual or anomalous behaviour, indicating fraud or something similar? Again, these are now labelled as AI or machine learning, but usually rely on well-established statistical techniques. Admittedly, today's faster machines mean number crunching can happen quickly, and this has contributed to the resurgence of machine learning.

Many problem-solving algorithms are not about numbers. Some techniques, such as evolutionary computing, including genetic algorithms, don't fit comfortably into a data-driven view of learning. Do these methods count as machine learning? I'll leave that for you to think about. My book explores genetic algorithms and several other areas that do not need numbers to learn.

Arthur Samuel coined the phrase "machine learning", by which he meant something along the lines of a "field of study that gives computers the ability to learn without being explicitly programmed." The abstract of his 1959 paper, "Some studies in machine learning using the game of checkers", states:

Two machine-learning procedures have been investigated in some detail using the game of checkers. Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program. Furthermore, it can learn to do this in a remarkably short period of time (8 or 10 hours of machine-playing time) when given only the rules of the game, a sense of direction, and a redundant and incomplete list of parameters which are thought to have something to do with the game, but whose correct signs and relative weights are unknown and unspecified. The principles of machine learning verified by these experiments are, of course, applicable to many other situations.

AI and machine learning are both very old terms. I think they encompass a much broader field than data analysis. As a final thought, Turing designed an algorithm to play chess. In effect, he was trying to make an artificial brain before the term AI was invented, and before computers, in their modern sense, existed.

I think machine learning is much broader than investigating data. Its history involves attempting to get computers to learn, and specifically to learn to play games. Let the games continue.


Read my book and see what you think.


Foundations for Evidence-Based Policymaking Act of 2017

Derek Jones from The Shape of Code

The Foundations for Evidence-Based Policymaking Act of 2017 was enacted by the US Congress on 21st December.

A variety of US Federal agencies are responsible for ensuring the safety of US citizens; in some cases this safety is dependent on the behavior of software. The FDA is responsible for medical device safety and the FAA publishes various software safety handbooks relating to aviation (the Department of Transportation has a wider remit).

Where do people go to learn about the evidence for software related issues?

The book Evidence-based software engineering: based on the publicly available evidence sounds like a good place to start.

Quickly skimming this (currently draft) book shows that no public evidence is available on lots of issues. Oops.

Another issue is the evidence pointing to some suggested practices being at best useless and sometimes fraudulent, e.g., McCabe’s cyclomatic complexity metric.

The initial impact of evidence-based policymaking will be companies pushing back against pointless government requirements, in particular requirements that cost money to implement. In some cases this is a good thing, e.g., no more charades about software being more testable because its code has a low McCabe complexity.

In the slightly longer term, people are going to have to get serious about collecting and analyzing software related evidence.

The Open, Public, Electronic, and Necessary Government Data Act or the OPEN Government Data Act (which is about to become law) will be a big help in obtaining evidence. I think there is a lot of software related data sitting on disks and tapes, waiting to be analysed (NASA appears to have loads of data that they have done almost nothing with, including not making it publicly available).

Interesting times ahead.

Postmortem of the unexpected blog outage

Timo Geusch from The Lone C++ Coder's Blog

Straight from the “make work for yourself because there aren’t enough hours in the day already” files.

I’ve mentioned before that I am self-hosting this blog rather than using a hosted instance. I hosted the WordPress instance on FreeBSD and it had been running quite well for a while, but a double FreeBSD port upgrade to WordPress 5.0.1 and PHP 7.2 – after the PHP 7.0 port had been discontinued – broke the blog. php-fpm failed regularly with a signal 10, but I wasn’t able to figure out why in a hurry, so I started looking at alternatives.

Sorry for the disruption, normal service will resume shortly

Timo Geusch from The Lone C++ Coder's Blog

My apologies for the sudden instability of my blog. I managed to make a hash of an update on the main WordPress site when trying to update to a newer PHP version, and had to switch to the Jekyll “backup” site that isn’t quite production ready yet. Comments will be available in due course; at the moment I’m trying to get the static site fully functioning and the various RSS feeds going again.

New home page design

Andy Balaam from Andy Balaam's Blog

After years of getting around to it, I have redesigned my home page at artificialworlds.net.

It’s basically intended to make me look clever or productive or interesting or something. Alternatively, it gives you somewhere to find that thing you know I made but can’t find the link for:

[Screenshot of artificialworlds.net – colourful boxes with a rich-coloured picture of drawers behind.]

The background image is “Read where you are” by delaram bayat.

I am pleased with the page’s responsive design, clarity, fast page-load, and colourfulness.

What do you think?

Distorting the input profile, to stress test a program

Derek Jones from The Shape of Code

A fault is experienced in software when there is a mistake in the code, and a program is fed the input values needed for this mistake to generate faulty behavior.

There is suggestive evidence that the distribution of coding mistakes and the distribution of inputs that generate fault experiences both have an influence on fault discovery.

How might these coding mistakes be found?

Testing is one technique: it involves feeding inputs into a program and checking the resulting behavior. What are ‘good’ input values, i.e., values most likely to discover problems? There is no shortage of advice for manually writing tests, suggesting how to select input values, but automatic generation of inputs is often somewhat random (relying on quantity over quality).

Probabilistic grammar driven test generators are trivial to implement. The hard part is tuning the rules and the probability of them being applied.
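
To make that concrete, here is a minimal sketch of such a generator in Python; the toy grammar, rule names, and weights are all invented for illustration:

```python
import random

# Toy probabilistic grammar: each nonterminal maps to a list of
# (expansion, probability) pairs; symbols not in the grammar are
# treated as terminals. All rules and weights are invented.
GRAMMAR = {
    "expr": [(["term", "+", "expr"], 0.3), (["term"], 0.7)],
    "term": [(["num"], 0.8), (["(", "expr", ")"], 0.2)],
    "num":  [([str(n)], 0.1) for n in range(10)],
}

def generate(symbol="expr"):
    """Expand a symbol by picking productions at random, by weight."""
    if symbol not in GRAMMAR:            # terminal symbol
        return symbol
    expansions, weights = zip(*GRAMMAR[symbol])
    chosen = random.choices(expansions, weights=weights)[0]
    return "".join(generate(s) for s in chosen)

print(generate())   # e.g. "3+(7+2)"
```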

In most situations an important design aim, when creating a grammar, is to have one rule for each construct, e.g., all arithmetic, logical and boolean expressions are handled by a single expression rule. When generating tests, it does not always make sense to follow this rule; for instance, logical and boolean expressions are much more common in conditional expressions (e.g., controlling an if-statement), than other contexts (e.g., assignment). If the intent is to mimic typical user input values, then the probability of generating a particular kind of binary operator needs to be context dependent; this might be done by having context dependent rules or by switching the selection probabilities by context.
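
As a hypothetical sketch of the second option, the selection weights can simply be keyed by context (the operators and weights below are invented):

```python
import random

# Invented weights: inside an if-condition, favour logical and
# relational operators; on the right of an assignment, arithmetic.
OPERATOR_WEIGHTS = {
    "condition":  {"&&": 0.4, "||": 0.3, "<": 0.2, "+": 0.1},
    "assignment": {"&&": 0.05, "||": 0.05, "<": 0.1, "+": 0.8},
}

def pick_operator(context):
    """Choose a binary operator with context-dependent probability."""
    ops, weights = zip(*OPERATOR_WEIGHTS[context].items())
    return random.choices(ops, weights=weights)[0]
```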

Given a grammar for a program’s input (e.g., the language grammar used by a compiler), decisions have to be made about the probability of each rule triggering. One way of obtaining realistic values is to parse existing input, counting the number of times each rule triggers. Manually instrumenting a grammar to do this is a tedious process, but tool support is now available.
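
The counting step might look something like this sketch (the record hook is hypothetical; the tool support mentioned above automates this wiring):

```python
from collections import Counter, defaultdict

# Count how often each alternative of each nonterminal fires while
# parsing a corpus of existing inputs.
counts = defaultdict(Counter)

def record(nonterminal, alternative):
    """Hypothetical hook, called from each grammar rule's action."""
    counts[nonterminal][alternative] += 1

# ... run the instrumented parser over the corpus here ...

# Normalise within each nonterminal, so its alternatives sum to 1.
probabilities = {
    nt: {alt: n / sum(alts.values()) for alt, n in alts.items()}
    for nt, alts in counts.items()
}
```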

Once a grammar has been instrumented with probabilities, it can be used to generate tests.

Probabilities based on existing input will have the characteristics of that input. A recent paper on this topic (which prompted this post) suggests inverting rule probabilities, so that common becomes rare and vice versa; the idea is that this will maximise the likelihood of a fault being experienced (the assumption is that rarely occurring input will exercise rarely executed code, and such code is more likely to contain mistakes than frequently executed code).
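
One plausible reading of inverting the probabilities, though the paper may define it differently, is to take reciprocals and renormalise:

```python
def invert(probs, eps=1e-9):
    """Make common alternatives rare and vice versa by taking
    reciprocals and renormalising into a distribution."""
    inverse = {alt: 1.0 / (p + eps) for alt, p in probs.items()}
    total = sum(inverse.values())
    return {alt: w / total for alt, w in inverse.items()}

print(invert({"a": 0.9, "b": 0.1}))   # roughly {'a': 0.1, 'b': 0.9}
```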

I would go along with the assumption about rarely executed code having a greater probability of containing a mistake, but I don’t think this is the best test generation strategy.

Companies are only interested in fixing the coding mistakes that are likely to result in a fault being experienced by a customer. It is a waste of resources to fix a mistake that will never result in a fault experienced by a customer.

What input is likely to interact with coding mistakes to become the root cause of faults experienced by a customer? I have no good answer to this question. But, given that customer input contains patterns (at least in the world of source code, and I’m told in other application domains), I would generate test cases that are very similar to existing input, but with one sub-characteristic changed.
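
A minimal sketch of that strategy, reusing the kind of per-rule profile built in the counting step above: boost (or damp) one alternative and renormalise, leaving the rest of the profile realistic:

```python
def perturb_one(probs, target, factor=3.0):
    """Copy a measured profile, scale one alternative's probability,
    and renormalise so the result is still a distribution."""
    tweaked = dict(probs)
    tweaked[target] *= factor
    total = sum(tweaked.values())
    return {alt: p / total for alt, p in tweaked.items()}
```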

In the academic world the incentive is to publish papers reporting loads-of-faults-found, the more the merrier. Papers reporting only a few faults are obviously using inferior techniques. I understand this incentive, but fixing problems costs money, and companies want a customer-oriented rationale before they will invest in fixing problems that customers have not yet reported.

The availability of tools that automate the profiling of a program’s existing input, followed by the generation of input having slightly, or very, different characteristics, makes it easier to answer some very tough questions about program behavior.

Alastair Reynolds is on form with this steampunk meets pirates space opera.

Paul Grenyer from Paul Grenyer

Revenger
Alastair Reynolds
ISBN-13: 978-0575090552

Alastair Reynolds is on form with this steampunk meets pirates space opera.

The story is clichéd and almost totally predictable, but very enjoyable at the same time. I’ve started wondering a lot recently whether the body count in such stories is worth the life of the person being rescued, and I think that remains to be seen in the sequel.

Fura Ness is extremely driven and I struggled to understand a lot of her decisions.

As with much of Reynolds’ work, there is no explanation for why this universe is the way it is, and it feels as strange as when the clock strikes 13 in 1984, but it makes me want to read more in the hope of understanding.

While most of the story is linear and complete, there’s a large chunk towards the end which feels missing. The climax is a little brief and, just like in Terminal World, there is suddenly a lot of new plot in the final chapter.

Where the story goes next will be very interesting.

Finally On Natural Analogarithms – student

student from thus spake a.k.

Over the course of the year my fellow students and I have spent much of our spare time investigating the properties of the set of infinite dimensional vectors associated with the roots of rational numbers, by way of the former's elements being the powers to which the latter's prime factors are raised, which we have dubbed -space.

We proceeded to define functions of such numbers by applying operations of linear algebra to their -space vectors; firstly with their magnitudes and secondly with their inner products. This time, I shall report upon our explorations of the last operation that we have taken into consideration: the products of matrices and vectors.
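
For illustration, here is a minimal sketch of the underlying construction for the rationals themselves; taking an n-th root simply scales these integer exponents by 1/n. The function names are my own:

```python
from fractions import Fraction

def factorise(n):
    """Trial-division prime factorisation: n -> {prime: exponent}."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def exponent_vector(q):
    """Map a positive rational to its prime-exponent vector, so that
    12/5 = 2^2 * 3 / 5 becomes {2: 2, 3: 1, 5: -1}; multiplying
    rationals corresponds to adding these vectors element-wise."""
    q = Fraction(q)
    vec = dict(factorise(q.numerator))
    for p, e in factorise(q.denominator).items():
        vec[p] = vec.get(p, 0) - e
    return {p: e for p, e in vec.items() if e != 0}

print(exponent_vector("12/5"))   # {2: 2, 3: 1, 5: -1}
```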