Frances Buontempo from BuontempoConsulting
Yesterday I attended the Karen Spärck Jones
lecture at the BCS in London Dr Cordelia Schmid
talked about computer vision, giving an overview of it's history through to the current state of the art. This is a tall order to get into an hour or so.
Let's see if I can summarise what she covered.
Still pictures and moving pictures need different techniques. For still pictures, we start with attempting to recognise objects or classes of objects. For moving pictures we might be spotting actions, as well as objects; maybe tuning a stringed instrument or celebrating a birthday. For still pictures, spotting a known chair in pictures is slightly easier than getting a program to spot any chair in pictures. How do you generalise the definition of chair anyway?
For the simpler case of a specific chair, or other object, you still need to deal with problems such as the different viewpoints, or different scales. The pixels of a bridge/chair/object close up will be completely different to the same bridge further away, or at a slightly different angle. Techniques started with edge detection, then moved on to projective invariants (and geometric and photometric invariants - light levels affect the pixel). I regarded this as akin to the difference between bitmaps and scalable vector graphics.
A milestone in the move away from edge detection to feature selection came with "Eigenfaces" - see Turk and Pentland
. This uses principal component analysis
.. In essence you find the line of best fit through the points, plotted in n-dimensional space, if you have n features. This is the first eigenvector. It's a vector, as it has direction. It's "eigen" as it is a peculiar, singular or *characteristic* - etymology slightly uncertain
. If you project the data onto this, you will have lost lots of information. You then find a perpendicular line - the 2nd best fit line. And continue until you've captured enough information. This allows you to summarise datasets and is sometimes known as a feature reduction technique. Have you ever wonder how facebook recognises faces? Or how football programmes track how far a footballer has run? Actually the latter is more moving pictures, so I am ahead of myself.
These approaches look at the global scale - the whole picture. Next came local greyscale invariants, using a voting system to spot things. This can deal with photometric problems - varying light levels. Next we have SIFT
- scale-invariant feature transform.
Mention was then made of wavelet filters and boosting feature selection, trained on positive and negative examples, such as pictures with a given object, say a car, and pictures without the object. The code is in OpenCV. I wonder if this is similar to AdaBoost.
Mention was then made of histograms of orientation - see Datal and Triggs
. This is related to support vector machines, SVM, which finds a hyperplane between positive and negative examples. Some example still pictures were shown wherein this technique could be used to detect a, and I quote, "Zebra/non-zebra decision boundary."
This may not seem like a day-to-day problem many of us face, but made the important point that you need training data near the boundary - for example other animals with stripes, and other things with a similar profile, like a motorbike. In a more general setting I was thinking about flushing out edge-cases in unit tests. The importance of a good set of representative training data was made - you need more than just edge cases, you want many cases away from the edges too. This also applies to automated testing. But I digress.
Finally we move on to the current state of the art - convolution neural networks (CNN), which I have not met before. The find "high-dimensional aggregated descriptors" - they have a huge number of nodes and several layers and require some serious computing power - GPU etc etc. As always there is a trade-off between speed and accuracy. I presume the hand-tuned network may be incomprehensible afterwards. I have worked on "feature extraction" from feed-forward neural networks before which represent a trained network as a decision tree so a human can understand what the program has discovered. I presume for CNNs this is neither possible nor desirable. It just needs to get the job done and find Wally^H^H^H^H^H zebras. I previous mentioned python to find Wally on El Reg
. Aren't computers amazing?
Could a machine automatically tag things in a still picture? "Dog 1: Terror", "Man: John Smith". I wonder if we end up with CCTV automatically sending out Robocop to arrest people. Big brother is watching you and figuring out what you're doing.
This leads to the action recognition, mentioned at the start. Having got to a point where we tag things in a still picture, can we set the machines lose to do "weak supervised learning" - find an interesting thing in this video. We were shown examples of a programming picking out a bird or person etc moving in a video, Sometimes it worked, sometimes it didn't. Supervised learning involves giving training examples as input and getting the trained algo to find the same things in other inputs. For moving pictures describing the data - giving positive and negative examples would take hours. Would you go through frame by frame and label features? It would take far too long. Instead let it learn as it goes, setting it off with a few clues - here's a robin. Is there one in this movie? Or spot and label a moving thing - which happened to be a car moving very quickly so seeming to get much smaller - it didn't find that. It seems slow movements are easier to track than fast jerky ones. Though an algo did manage to draw a rectangle around a cat rolling about in another video. The two main techniques involved were dense trajectory features (Wang
) and CNN features for optical flow (Simonyan
). These made the front page in the last year or so.
A compelling throw away comment at the end was that hand-crafted models are NOT machine learning
(ML).Most ML I have attempted before has left me to chose some parameters - how many iterations, how fast to move towards a solution, how many layers in my neural network. The machine has learnt nothing - it just did what it was told. True ML would let the machine find its own parameters. Of course, I have seen a few people trying to do this. It's all very exciting.
Somebody asked, "How come I don't see any of this in my day to day life?" I presume the usual - this is all so academic. Pay attention at the back, I say...
- Have you ever been issued with an automatic speeding ticket? How did it find you?
- Have you ever uploaded a picture to facebook and found little boxes around faces (and the odd random tree, but what do you expect?)
My Dad once asked me how on earth the sports program he was watching could tell him how far a specific footballer had run in the course of a football match. This involves image recognition, including the optical flow - tracking an individual player over the course of a game, from various different angles, so captures many of the specific problems we mentioned above. Unless they just use a pedometer.
Fascinating stuff. I wonder if the machines could spot things we haven't spotted. For example, speckles or shadows in medical scans or even x-ray machines at passport control/baggage checks, that people might miss. Or imagine facebook looked at your holiday snaps and sent you an advert for a clinic dealing with skin cancer, having spotted the stirrings of a carcinoma in your holiday tan. Would you want this?
Further extensions including pairing up audio information, so we can find youtube videos of tuning a guitar - made much easier if the spoken commentary says "tuning" and "guitar" as well as just having the pictures to go on. Combine this with smell and haptics and the machines will soon be writing their own drivel all over the internet. Welcome Skynet.