14th place solution for Kaggle Google Landmark Retrieval Challenge

Dmytro Mishkin
3 min read · Jun 3, 2018

The Landmark Retrieval Challenge is a large-scale competition for image search algorithms: given a query image with some landmark, the algorithm should output the top-100 images containing the same landmark out of a 1M-image database.

Example of a query and the top-5 output from the Landmark Retrieval Challenge

We (Anastasiia Mishchuk and I) entered this competition to test two things we had created before: our local features, i.e. the HesAffNet detector and HardNet descriptor (step 3 below), and the hard-in-batch triplet margin loss used to train the global descriptor (step 1).

The second was quite successful, while the first was more of a failure. That failure, though, is not specific to our local features; it applies to the local-feature-based approach in general.

Let's go. Our approach consists of four main steps: global CNN descriptor, nearest neighbor search, re-ranking, and query expansion.

  1. Global descriptor. We fine-tuned an ImageNet-pretrained PyTorch ResNet50 on the dataset from the sister competition, https://www.kaggle.com/c/landmark-recognition-challenge . We replaced the global average pooling in ResNet with the GeM pooling layer from “Fine-tuning CNN Image Retrieval with No Human Annotation” by Radenovic et al. (a PyTorch sketch of GeM follows the list).
    Image size: 256x256, with random 224x224 crops. We used standard augmentations: random crop and random horizontal flip.
    Loss: hard-in-batch triplet margin (sketched after the list).
    Batch size: 60
    We also pre-cleaned the dataset by removing irrelevant hard positives, based on intra-class distances. Example: the query is a photo of a house from the outside, while its “hard positive” is a photo taken inside a room. We kept 3 positive examples per class id.
    Validation: on the ROxford5k dataset.
  2. Fast nearest neighbor search, done with faiss on GPU; it takes only ~10 minutes to compute the top-100 results for 100k queries (see the faiss sketch after the list).
    This baseline model achieved a 0.309 public score.
    We also tried the MatConvNet ResNet101-GeM from the GeM paper, trained on the 120k SfM dataset, but its score was much lower: 0.269 on the public leaderboard.
  3. We re-ranked the top-100 images by performing classic image matching: detect keypoints -> extract local patches -> describe -> match -> RANSAC geometric verification. We used the HesAffNet detector and the HardNet descriptor, followed by LO-RANSAC geometric verification (see the verification sketch after the list). For faster matching, we clustered the descriptors into 65k visual words and just compared visual word ids.
    OpenCV RANSAC is awful: most of the results were false positives. Instead, we wrapped LO-RANSAC so that it can be used directly from Python. The results became much cleaner visually, but the score improvement was really minor: +0.07 on the public leaderboard.
    A possible explanation of why local features help on the 1M dataset of “Revisiting Oxford and Paris” but fail in this competition: traditionally, queries are cropped to contain only the relevant object, while in this competition query images contain lots of cars, faces and other distractors, which are easily matched against other random images with cars, etc.
  4. The final stage is query expansion by diffusion, following Iscen et al., “Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations”. We implemented it in Python here:
    https://github.com/ducha-aiki/manifold-diffusion
    That boosted the performance from 0.309 to 0.471 public score.
    For comparison, simple query expansion gives only a 0.376 score (a sketch of it also follows the list).
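
For reference, GeM pooling is only a few lines of PyTorch. The snippet below is a minimal sketch following the Radenovic et al. paper, not our exact training code; the initial exponent p=3 and the eps clamp are the paper's defaults, and plugging it into torchvision's ResNet50 is shown purely as an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GeM(nn.Module):
    """Generalized-mean (GeM) pooling from Radenovic et al.
    p=1 gives average pooling; p -> inf approaches max pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) feature map from the CNN trunk
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.avg_pool2d(x, (x.size(-2), x.size(-1)))  # spatial mean of x^p
        return x.pow(1.0 / self.p).flatten(1)          # (B, C) global descriptor

# Illustration: replace the global average pooling in a torchvision ResNet50.
model = torchvision.models.resnet50(pretrained=True)
model.avgpool = GeM()
model.fc = nn.Identity()  # use the pooled features as the descriptor (L2-normalize afterwards)
```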
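
The hard-in-batch triplet margin loss picks, for each matching (anchor, positive) pair, the hardest non-matching descriptor within the batch as the negative. A minimal sketch, assuming row i of `anchors` and `positives` form the matching pairs and the descriptors are L2-normalized; the margin value is illustrative:

```python
import torch
import torch.nn.functional as F

def hard_in_batch_triplet_loss(anchors, positives, margin=1.0):
    """anchors, positives: (B, d) L2-normalized descriptors; row i of each
    is a matching pair, and every other row serves as a negative."""
    dist = torch.cdist(anchors, positives)   # (B, B) pairwise L2 distances
    pos = dist.diagonal()                    # distances of the matching pairs
    # Mask the diagonal so matching pairs are never picked as negatives.
    dist = dist + 1e6 * torch.eye(dist.size(0), device=dist.device)
    # Hardest negative per pair, checking both directions (anchor swap).
    hardest_neg = torch.min(dist.min(dim=1).values, dist.min(dim=0).values)
    return F.relu(margin + pos - hardest_neg).mean()
```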
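
The nearest neighbor search itself is exact brute-force inner-product search over L2-normalized descriptors. A minimal faiss sketch with toy array sizes; the 2048-D descriptor dimensionality and the random stand-in data are assumptions of the sketch, while the real setup was ~1M database images and ~100k queries:

```python
import numpy as np
import faiss

d = 2048  # descriptor dimensionality (assumed)
db = np.random.randn(10000, d).astype('float32')      # stand-in for the 1M database
queries = np.random.randn(1000, d).astype('float32')  # stand-in for the queries

faiss.normalize_L2(db)       # for unit vectors, inner product = cosine similarity
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(d)                # exact inner-product index
index = faiss.index_cpu_to_all_gpus(index)  # move the index to the available GPUs
index.add(db)
sims, ids = index.search(queries, 100)      # top-100 neighbors for every query
```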
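
For the re-ranking step, tentative correspondences are pairs of keypoints with identical visual word ids, and the re-ranking score is the LO-RANSAC inlier count. The sketch below uses the pydegensac package as a stand-in for our wrapper; the function names, the 3 px threshold, and the homography model choice are assumptions, not the exact competition code.

```python
import numpy as np
import pydegensac  # pip install pydegensac; a stand-in for our LO-RANSAC wrapper

def tentative_matches(words1, xy1, words2, xy2):
    """Tentative correspondences: keypoints sharing the same visual word id.
    words*: (N,) visual word ids; xy*: (N, 2) keypoint coordinates."""
    by_word = {}
    for j, w in enumerate(words2):
        by_word.setdefault(w, []).append(j)
    pairs = [(i, j) for i, w in enumerate(words1) for j in by_word.get(w, [])]
    if not pairs:
        return np.empty((0, 2)), np.empty((0, 2))
    idx1, idx2 = map(list, zip(*pairs))
    return xy1[idx1], xy2[idx2]

def verification_score(src, dst, px_th=3.0):
    """Re-ranking score: number of LO-RANSAC homography inliers."""
    if len(src) < 4:  # a homography needs at least 4 correspondences
        return 0
    H, inlier_mask = pydegensac.findHomography(src, dst, px_th)
    return 0 if H is None else int(inlier_mask.sum())
```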
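
And for contrast with diffusion, the “simple query expansion” baseline from the last line just averages each query descriptor with its top-k retrieved neighbors and searches again. A minimal sketch; k=10 is an assumed, typical value:

```python
import numpy as np

def simple_query_expansion(queries, db, ids, k=10):
    """Average query expansion (AQE).
    queries: (Q, d) L2-normalized query descriptors; db: (N, d) database
    descriptors; ids: (Q, >=k) neighbor indices from the first search."""
    expanded = queries + db[ids[:, :k]].sum(axis=1)  # add the top-k neighbors
    expanded /= np.linalg.norm(expanded, axis=1, keepdims=True)
    return expanded  # search again with these to produce the final ranking
```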

That's all :) The CNN-related stuff was done by Anastasiia, while the local features and diffusion were done by me.
