I worked with Alexandre Sablayrolles on a Face Recognition project for a Computer Vision course at Ecole Polytechnique. Our goal was to download a lot of pictures from Google Image of famous people and use them to train a model capable to identify those people on new images. As this class was dealing with computer vision, the main focus of the project was on image processing rather than machine learning tools. In particular, there are several state-of-the-art Deep Learning methods to classify faces (see eg. here and here) but our goal was to use only a simple machine learning pipeline, to better approach the structure of images. That’s why the classification accuracies we get are not so important but should better be compared between methods to evaluate each approach.
In classic computer vision, a super famous method to compare images is to use SIFT keypoints, which consist in local descriptors of informative regions. For example, two images can be merged into a panorama by finding such keypoints in each image, comparing their descriptors, and trying to match them in order to superpose the pictures. These descriptors are very powerful and even though Deep Learning is making the buzz currently, they are still used. I actually attended an interesting talk given by A9.com where they discussed how they were using both classic computer vision approaches (like matching SIFTs) and Deep Learning. Anyway since SIFTs are powerful, our starting point was to build a Bag-of-word model with SIFTs descriptors, and classify with SVMs.
We tested different derived methods but basically our final approach was more to focus on key zones of the faces. In particular, there are pretty accurate methods to detect the eyes, the nose and the mouth on a picture. We leveraged this and decided to compute a single SIFT descriptor for a given zone (we actually ended up with a model a bit more complex, but the idea is the same). So for each picture we obtained a descriptor per zone. We clustered all the descriptors per zone to get a low dimensional representation of a picture, and finally performed multi-class classification through SVMs.