Are all SIFTs created equal? (part 1)

The general pipeline of finding image correspondences.

I am starting a series of posts about local image features. I assume that you know what they are in general and probably use them in your work. The posts will cover nuances and details that are often overlooked.

If you are not familiar with local features, I suggest starting from here and here.

In turn, the detector and descriptor stages can be summarized as follows:

  1. A local feature detector (DoG (used in SIFT), MSER, Hessian-Affine, FAST) finds repeatable structures and estimates their scale.
  2. Then, usually by some heuristic, a local oriented (circular, affine, etc.) region to be described is selected.
  3. The region is then extracted and/or resampled from the image to some canonical size, such as 41x41 or 32x32.
  4. The patch is described, i.e. converted to a vector by some descriptor, e.g. SIFT.
The classical local image feature extraction pipeline. The measurement region (red) of a detected feature (blue) is warped from image I to a patch P, normalizing the region based on the local affine shape of the feature described by matrix A. The image data in the patch is used to compute the SIFT descriptor. (from Lenc et al.)
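
Step 3 of the pipeline (warping the measurement region into a canonical patch) can be sketched in a few lines of numpy. This is only an illustration under my own assumptions: the function name `extract_patch` and the convention that `A` maps patch coordinates in [-1, 1] to pixel offsets around the feature center are hypothetical, and a real implementation would also smooth the image before downsampling to avoid aliasing.

```python
import numpy as np

def extract_patch(image, center, A, patch_size=32):
    """Resample an affine region of `image` into a square patch.

    center: (x, y) position of the feature in image coordinates.
    A: 2x2 matrix mapping patch coordinates in [-1, 1]^2 to pixel
       offsets around the center (a convention assumed for this sketch).
    """
    # Regular grid over the canonical patch, in [-1, 1] along both axes.
    lin = np.linspace(-1.0, 1.0, patch_size)
    u, v = np.meshgrid(lin, lin)               # u runs along x, v along y
    grid = np.stack([u.ravel(), v.ravel()])    # shape (2, N)

    # Map patch coordinates to image coordinates and clamp to the image.
    xy = A @ grid + np.asarray(center, dtype=float).reshape(2, 1)
    x = np.clip(xy[0], 0, image.shape[1] - 1)
    y = np.clip(xy[1], 0, image.shape[0] - 1)

    # Bilinear interpolation between the four neighbouring pixels.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, image.shape[1] - 1)
    y1 = np.minimum(y0 + 1, image.shape[0] - 1)
    wx = x - x0
    wy = y - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return ((1 - wy) * top + wy * bottom).reshape(patch_size, patch_size)

# Usage: a circular (isotropic) region of radius 10 px around (x=60, y=50).
image = np.random.default_rng(0).random((100, 120))
patch = extract_patch(image, center=(60, 50), A=10.0 * np.eye(2))
```

A non-diagonal `A` gives an affine (elliptical) measurement region, and a rotation folded into `A` gives an oriented one.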

Here and below, by SIFT I mean only the patch descriptor, not the whole system.

The SIFT descriptor is a spatial histogram of the image gradient. (image credit: http://www.vlfeat.org/sandbox/api/sift.html#sift-tech-descriptor)
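
To make the "spatial histogram of the image gradient" idea concrete, here is a deliberately simplified numpy sketch. It is not any of the implementations tested below: it uses hard binning with no interpolation between bins and no weighting window, and the function name `sift_like_descriptor` is mine. The 4x4 cells, 8 orientation bins, and the normalize, clip at 0.2, renormalize step follow the usual SIFT description.

```python
import numpy as np

def sift_like_descriptor(patch, n_spatial=4, n_orient=8):
    """Simplified SIFT-style descriptor: 4x4 cells x 8 orientations = 128-D."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                 # derivatives along rows, cols
    mag = np.hypot(gx, gy)                      # gradient magnitude
    ang = np.arctan2(gy, gx) % (2 * np.pi)      # orientation in [0, 2*pi)

    h, w = patch.shape
    # Hard-assign every pixel to a spatial cell and an orientation bin.
    rows = np.minimum(np.arange(h) * n_spatial // h, n_spatial - 1)
    cols = np.minimum(np.arange(w) * n_spatial // w, n_spatial - 1)
    rr = np.broadcast_to(rows[:, None], (h, w))
    cc = np.broadcast_to(cols[None, :], (h, w))
    obin = np.minimum((ang / (2 * np.pi) * n_orient).astype(int), n_orient - 1)

    # Accumulate gradient magnitudes into the (cell_y, cell_x, orientation) bins.
    desc = np.zeros((n_spatial, n_spatial, n_orient))
    np.add.at(desc, (rr, cc, obin), mag)

    # L2-normalize, clip large values, and renormalize.
    desc = desc.ravel()
    desc /= np.linalg.norm(desc) + 1e-12
    desc = np.minimum(desc, 0.2)
    desc /= np.linalg.norm(desc) + 1e-12
    return desc

patch = np.random.default_rng(0).random((32, 32))
desc = sift_like_descriptor(patch)
print(desc.shape)  # (128,)
```

The real implementations differ exactly in the parts this sketch leaves out, such as how gradients are weighted and how votes are interpolated between neighbouring bins.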

There are multiple implementations of the SIFT algorithm, which differ in language and in details. We will test the following:

  1. OpenCV. It is arguably the most popular computer vision library, commonly used from C++ and Python, and of course it has an implementation of one of the most cited computer vision papers ever, i.e. SIFT.
  2. VLFeat, a Matlab and C library from Andrea Vedaldi's group, which was very popular in the past.
  3. numpy-sift, my lightweight pure-numpy implementation, which resembles Michal Perdoch's C++ implementation (which was, and still is, the state-of-the-art version for image retrieval).
  4. pytorch-sift: same as above, but in PyTorch, i.e. differentiable and with GPU support.

I have tested the SIFT implementations on the HPatches benchmark (CVPR 2017, Balntas et al.).

Benchmark

VLFeat provides an explicit code snippet for describing a patch without running detection, which is used as a baseline in HPatches:

https://github.com/hpatches/hpatches-benchmark/blob/master/matlab/%2Bdesc/%2Bfeats/sift.m

OpenCV does not provide anything like this, and it was not obvious how to apply OpenCV to just a patch. Luckily, one can use the same trick:

https://github.com/hpatches/hpatches-benchmark/blob/master/python/extract_opencv_sift.py

Finally, pytorch-sift is implemented as a PyTorch model and expects a tensor of shape (bs, 1, patch_size, patch_size):

https://github.com/ducha-aiki/pytorch-sift/edit/master/hpatches_extract_pytorchsift.py
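
For readers unfamiliar with that layout, here is a small numpy sketch of how a list of grayscale patches could be packed into a (bs, 1, patch_size, patch_size) array. The helper name `to_batch` and the 65x65 patch size are assumptions for illustration only; the linked script is the actual reference.

```python
import numpy as np

def to_batch(patches):
    """Stack equally sized grayscale patches into a (bs, 1, ps, ps) float32 array."""
    batch = np.stack(patches).astype(np.float32)   # (bs, ps, ps)
    return batch[:, None, :, :]                    # add the single channel axis

# Usage: eight dummy 65x65 patches (hypothetical data).
patches = [np.zeros((65, 65), dtype=np.uint8) for _ in range(8)]
batch = to_batch(patches)
print(batch.shape)  # (8, 1, 65, 65)
```

From there, `torch.from_numpy(batch)` should give a tensor in the same layout that can be passed to the model.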

So here are the results: OpenCV is the leader, VLFeat and the VLFeat mode of pytorch-sift are the runners-up, and the MP (Michal Perdoch style) versions are behind.

Results on HPatches dataset, “full” split, matching task (mAP)

Implementation          Easy      Hard      Tough      mean
----------------------  --------  --------  ---------  --------
OPENCV-SIFT             0.47788   0.20997   0.0967711  0.26154
VLFeat-SIFT             0.466584  0.203966  0.0935743  0.254708
PYTORCH-SIFT-VLFEAT-65  0.472563  0.202458  0.0910371  0.255353
NUMPY-SIFT-VLFEAT-65    0.449431  0.197918  0.0905395  0.245963
PYTORCH-SIFT-MP-65      0.430887  0.184834  0.0832707  0.232997
NUMPY-SIFT-MP-65        0.417296  0.18114   0.0820582  0.226832
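
As a quick sanity check on these numbers, the short script below recomputes the "mean" column and ranks the implementations. It assumes the mean is the plain average of the Easy, Hard and Tough scores, which the reported figures are consistent with. The values are copied from the results above.

```python
# (Easy, Hard, Tough) mAP per implementation, copied from the results above.
results = {
    "OPENCV-SIFT": (0.47788, 0.20997, 0.0967711),
    "VLFeat-SIFT": (0.466584, 0.203966, 0.0935743),
    "PYTORCH-SIFT-VLFEAT-65": (0.472563, 0.202458, 0.0910371),
    "NUMPY-SIFT-VLFEAT-65": (0.449431, 0.197918, 0.0905395),
    "PYTORCH-SIFT-MP-65": (0.430887, 0.184834, 0.0832707),
    "NUMPY-SIFT-MP-65": (0.417296, 0.18114, 0.0820582),
}

# Unweighted average over the three difficulty levels.
means = {name: sum(scores) / len(scores) for name, scores in results.items()}

# Rank from best to worst by mean mAP.
ranking = sorted(means, key=means.get, reverse=True)
for name in ranking:
    print(f"{name:24s} {means[name]:.6f}")
```

The gap between the best and the worst implementation is roughly 0.035 mAP, while the gap between VLFeat and the VLFeat mode of pytorch-sift is below 0.001.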

Where does the difference come from? The weighting window

Gaussian (left) and circular cropped Gaussian (right) windows. The first is used in the VLFeat library, the second in Michal Perdoch's implementation. Besides the cropping, the sigma differs a lot.

pytorch-sift-mp weights gradients with a clipped circular window. This means that everything in the corners is completely ignored, so we are in fact describing a circular patch, not a rectangular one. This makes the descriptor less sensitive to occlusions. The VLFeat and OpenCV implementations both use a Gaussian window instead.
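
The two kinds of window can be sketched in numpy as follows. This is an illustration only: the function names are mine and the sigma values are placeholders, not the ones used in VLFeat, OpenCV or Michal Perdoch's code.

```python
import numpy as np

def gaussian_window(size, sigma):
    """Plain Gaussian weights over the whole square patch."""
    c = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    r2 = (x - c) ** 2 + (y - c) ** 2            # squared distance to the center
    return np.exp(-r2 / (2.0 * sigma ** 2))

def clipped_circular_window(size, sigma):
    """Gaussian weights, set to zero outside the inscribed circle."""
    c = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    r2 = (x - c) ** 2 + (y - c) ** 2
    w = np.exp(-r2 / (2.0 * sigma ** 2))
    w[r2 > c ** 2] = 0.0                        # corners contribute nothing
    return w

# Placeholder sigmas, chosen only to make the two windows visibly different.
g = gaussian_window(65, sigma=20.0)
c = clipped_circular_window(65, sigma=10.0)
print(g[0, 0] > 0, c[0, 0] == 0)  # True True
```

Multiplying the gradient magnitudes by one of these windows before building the histogram is what changes how much the patch corners influence the descriptor.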

I am going to benchmark the implementations on the non-planar subset of the CAIP-2019 contest when the ground truth becomes available, to check whether there is any benefit from using a clipped circular window.

Does patch size matter? Yes, to some extent

SIFT matching performance depends on patch size quite significantly, but beyond the commonly used size of 41x41 pixels the returns diminish. Here are the results for OpenCV SIFT.

Matching performance depending on patch size, OpenCV implementation

Computer Vision researcher and consultant. Co-founder of Ukrainian Research group “Szkocka”.