Racing Up The Hierarchy – a.k.

a.k. from thus spake a.k.

In the previous post we saw how to identify subsets of a set of data that are in some sense similar to each other, known as clusters, by constructing sequences of clusterings starting with each datum in its own cluster and ending with all of the data in the same cluster, subject to the constraint that if a pair of data are in the same cluster in one clustering then they must also be in the same cluster in the next, which are known as hierarchical clusterings.
We did this by selecting the closest pairs of clusters in one clustering and merging them to create the next, using one of three different measures of the distance between a pair of clusters; the average distance between their members, the distance between their nearest members and the distance between their farthest members, known as average linkage, single linkage and complete linkage respectively.
Unfortunately our implementation came in at a rather costly O(n3) operations and so in this post we shall look at how we can improve its performance.

A Place In The Hierarchy – a.k.

a.k. from thus spake a.k.

Last time we implemented the clusterings type to store a set of clustering objects in order to represent hierarchical clusterings, which are sequences of clusterings having the property that if a pair of data are in the same cluster in one clustering then they will be in the same cluster in the next, where clusters are subsets of a set of data that are in some sense similar to each other.
We then went on to define the ak.clade type to represent hierarchical clusterings as trees, so named because that's what they're called in biology when they are used to show the relationships between species and their common ancestors.
Now that we have those structures in place we're ready to see how to create hierarchical clusterings and so in this post we shall start with a simple, general purpose, but admittedly rather inefficient, way to do so.

Making The Clade – a.k.

a.k. from thus spake a.k.

We have so far seen a couple of schemes for identifying clusters in sets of data which are subsets whose members are in some sense similar to each other. Specifically, we have looked at the k means and the shared near neighbours algorithms which implicitly define clusters by the closeness of each datum to the average of each cluster and by their closeness to each other respectively.
Note that both of these algorithms use a heuristic, or rule of thumb, to assign data to clusters, but there's another way to construct clusterings; define a heuristic to measure how close to each other a pair of clusters are and then, starting with each datum in a cluster of its own, progressively merge the closest pairs until we end up with a single cluster containing all of the data. This means that we'll end up with a sequence of clusterings and so before we can look at such algorithms we'll need a structure to represent them.

Nearest And Dearest – a.k.

a.k. from thus spake a.k.

Last time we saw how we could use the list of the nearest neighbours of a datum in a set of clustered data to calculate its strengths of association with their clusters. Specifically, we used the k nearest neighbours algorithm to define those strengths as the proportions of its k nearest neighbours that were members of each cluster or with a generalisation of it that assigned weights to the neighbours according to their positions in the list.
This time we shall take a look at a clustering algorithm that uses nearest neighbours to identify clusters, contrasting it with the k means clustering algorithm that we covered about four years ago.

In The Neighbourhood – a.k.

a.k. from thus spake a.k.

A little under four years ago we saw how we could use the k means algorithm to divide sets of data into distinct subsets, known as clusters, whose members are in some sense similar to each other. The interesting thing about clustering is that even though we find it easy to spot clusters, at least in two dimensions, it's incredibly difficult to give them a firm mathematical definition and so clustering algorithms invariably define them implicitly as the subsets identified by this algorithm.
The k means algorithm, for example, does so by first picking k different elements of the data as cluster representatives and then places every element in the cluster whose representative is nearest to it. The cluster representatives are then replaced by the means of the elements assign to it and the process is repeated iteratively until the clusters stop changing.
Now I'd like to introduce some more clustering algorithms but there are a few things that we'll need first.