Are all SIFTs created equal? (part 1)
I am starting a series of posts about local image features. It is assumed that you know what they are in general and probably use them in your work. The posts will be about nuances and details which are often overlooked.
If you are not familiar with local features, I suggest starting from here and here.
In turn, the detector and descriptor stages can be summarized as follows:
- A local feature detector (DoG as used in SIFT, MSER, Hessian-Affine, FAST) finds repeatable structures and their scale.
- Then, usually by some heuristic, a local oriented (circular/affine/etc.) region to be described is selected.
- Then it is extracted and/or resampled from the image to some canonical size, like 41x41 or 32x32 pixels.
- The patch is described, i.e. converted to a vector, by some descriptor, e.g. SIFT (a code sketch of the pipeline follows below).
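To make the pipeline concrete, here is a minimal sketch of the detect-and-extract part with OpenCV; the 32x32 patch size and the scale multiplier are illustrative choices, not values from any particular implementation:

```python
import cv2

img = cv2.imread('graffiti.png', cv2.IMREAD_GRAYSCALE)  # any grayscale test image

# 1) Detection: DoG keypoints give position, scale and orientation.
det = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() in older OpenCV builds
keypoints = det.detect(img, None)

# 2)-3) For each keypoint, warp out a canonical oriented 32x32 patch.
# "mag" controls how much context around the keypoint is included;
# the value 6.0 is only an illustrative choice.
PATCH_SIZE, mag = 32, 6.0
patches = []
for kp in keypoints:
    scale = PATCH_SIZE / (mag * kp.size)
    M = cv2.getRotationMatrix2D(kp.pt, kp.angle, scale)
    M[0, 2] += PATCH_SIZE / 2 - kp.pt[0]
    M[1, 2] += PATCH_SIZE / 2 - kp.pt[1]
    patches.append(cv2.warpAffine(img, M, (PATCH_SIZE, PATCH_SIZE)))

# 4) Description: each canonical patch is then converted to a 128-d vector by a descriptor.
```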
Here and in what follows, I refer to SIFT not as the whole system, but only as the patch-descriptor part.
There are multiple implementations of the SIFT algorithm, which differ in language and details. We will test the following:
- OpenCV. The most popular computer vision library ever, commonly used from C++ and Python; of course, it has an implementation of the most cited computer vision paper ever, i.e. SIFT.
- VLFeat. A Matlab/C library from Andrea Vedaldi's group, popular in the past.
- numpy-sift. A lightweight pure-numpy implementation by myself, which resembles Michal Perdoch's C++ implementation (which was, and still is, the state-of-the-art version for image retrieval).
- pytorch-sift — same as above, but in PyTorch, i.e. differentiable and with GPU support.
I have tested the SIFT implementations on the HPatches benchmark (CVPR 2017, Balntas et al.).
Benchmark
VLFeat provides an explicit code snippet for describing a patch without running detection, which is used as the baseline in HPatches:
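The idea is to call vl_sift with a single user-supplied frame (patch center, a fixed scale, zero orientation) instead of running the detector. A rough Python equivalent might look as follows; it assumes the cyvlfeat bindings, that their frames argument mirrors vl_sift's Frames option, and an illustrative choice of the frame scale:

```python
import numpy as np
from cyvlfeat.sift import sift  # assumes the cyvlfeat Python bindings to VLFeat

patch = np.random.rand(65, 65).astype(np.float32)  # stand-in for one grayscale patch

# One fixed frame at the patch center: (y, x, sigma, orientation).
# sigma is picked so that the 4x4 spatial bins (bin size = magnif * sigma, magnif = 3)
# roughly cover the whole 65x65 patch; the exact value is an assumption here.
center = (patch.shape[0] - 1) / 2.0
sigma = patch.shape[0] / (3.0 * 4.0)
frame = np.array([[center, center, sigma, 0.0]])

frames, descriptors = sift(patch, frames=frame,
                           compute_descriptor=True, float_descriptors=True)
# descriptors: a (1, 128) SIFT vector for the patch
```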
OpenCV provides nothing of the kind, and it was not straightforward how to apply OpenCV SIFT to just a patch. Luckily, one can use the same trick:
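A minimal sketch of this trick in Python: treat the patch as a tiny image and pass one hand-made keypoint at the patch center to compute(). The keypoint size below is an illustrative value, chosen so that the measurement region roughly covers the patch:

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() in older OpenCV builds

def describe_patch(patch):
    """patch: 65x65 uint8 grayscale patch -> (128,) SIFT descriptor."""
    center = (patch.shape[1] - 1) / 2.0
    # A single keypoint at the patch center with zero orientation; its size
    # controls the measurement region. The value 16.0 is illustrative only.
    kp = cv2.KeyPoint(center, center, 16.0, 0)
    _, descs = sift.compute(patch, [kp])
    return descs[0]

patch = (np.random.rand(65, 65) * 255).astype(np.uint8)  # stand-in for a benchmark patch
print(describe_patch(patch).shape)  # (128,)
```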
Finally, pytorch-sift is implemented as a PyTorch model and expects a tensor of shape (bs, 1, patch_size, patch_size):
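A minimal usage sketch, assuming the SIFTNet module from the pytorch-sift repository (check the repository for the exact constructor arguments):

```python
import torch
from pytorch_sift import SIFTNet  # module and class names as in the pytorch-sift repo

sift = SIFTNet(patch_size=65)           # descriptor for 65x65 patches
patches = torch.rand(16, 1, 65, 65)     # (bs, 1, patch_size, patch_size), grayscale
with torch.no_grad():
    descs = sift(patches)               # -> (bs, 128) SIFT descriptors
print(descs.shape)
```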
So here are the results: OpenCV is the leader, VLFeat and the pytorch-sift VLFeat mode are the runners-up, and the MP versions are below them.
HPatches mAP:

Descriptor              Easy      Hard      Tough      mean
----------------------  --------  --------  ---------  --------
OPENCV-SIFT             0.47788   0.20997   0.0967711  0.26154
VLFEAT-SIFT             0.466584  0.203966  0.0935743  0.254708
PYTORCH-SIFT-VLFEAT-65  0.472563  0.202458  0.0910371  0.255353
NUMPY-SIFT-VLFEAT-65    0.449431  0.197918  0.0905395  0.245963
PYTORCH-SIFT-MP-65      0.430887  0.184834  0.0832707  0.232997
NUMPY-SIFT-MP-65        0.417296  0.18114   0.0820582  0.226832
Where does the difference come from? The weighting window
The MP version (pytorch-sift-mp) weights gradients with a clipped circular window. This means that the patch corners are completely ignored and we are, in fact, describing a circular rather than a rectangular patch. This makes the descriptor less sensitive to occlusions. The VLFeat and OpenCV implementations both use a Gaussian window instead.
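To illustrate the difference, here is a small numpy sketch of the two weighting schemes: a plain Gaussian window versus a Gaussian clipped to zero outside the inscribed circle. The sigma value is illustrative and not taken from any of the implementations:

```python
import numpy as np

def gaussian_window(patch_size, sigma=0.5):
    """Gaussian weighting (VLFeat/OpenCV style): smooth falloff, corners still contribute."""
    xs = np.linspace(-1.0, 1.0, patch_size)
    x, y = np.meshgrid(xs, xs)
    return np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))

def clipped_circular_window(patch_size, sigma=0.5):
    """Clipped circular weighting (MP style): zero weight outside the inscribed circle."""
    w = gaussian_window(patch_size, sigma)
    xs = np.linspace(-1.0, 1.0, patch_size)
    x, y = np.meshgrid(xs, xs)
    w[x ** 2 + y ** 2 > 1.0] = 0.0
    return w

# Corner pixels get a small but non-zero weight with the Gaussian window
# and exactly zero with the clipped circular one.
print(gaussian_window(65)[0, 0], clipped_circular_window(65)[0, 0])
```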
I am going to benchmark the implementations on the non-planar subset of the CAIP 2019 contest once the ground truth becomes available, to check whether the clipped circular window brings any benefit.
Does patch size matter? Yes, up to a point
SIFT matching performance depends quite significantly on patch size, but beyond the commonly used 41x41 pixels, returns diminish. Here are the results for OpenCV SIFT.
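Here is a sketch of how such an experiment can be run: rescale the same patch to several sizes and describe each version with OpenCV SIFT via a center keypoint; the keypoint size scales with the patch and is again an illustrative choice:

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def describe_at_size(patch, size):
    """Resize a patch to size x size and compute an OpenCV SIFT descriptor for it."""
    p = cv2.resize(patch, (size, size), interpolation=cv2.INTER_LINEAR)
    center = (size - 1) / 2.0
    # The keypoint size scales with the patch; the 0.25 factor is an illustrative choice.
    kp = cv2.KeyPoint(center, center, 0.25 * size, 0)
    _, descs = sift.compute(p, [kp])
    return descs[0]

patch = (np.random.rand(65, 65) * 255).astype(np.uint8)  # stand-in for a benchmark patch
for size in (19, 25, 33, 41, 51, 65):
    print(size, describe_at_size(patch, size)[:4])
```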
The next post, about local feature matching, is here: https://medium.com/@ducha.aiki/how-to-match-to-learn-or-not-to-learn-part-2-1ab52ede2022