Thanks Radek: 7th place solution to HWI 2019 competition


Task: open-set fine-grained recognition. You need to correctly identify a humpback whale (out of 5004 “persons”) from a photo of its tail, or say “it is an unknown one”.

It is not a pure metric learning task, because you know the identities beforehand, but it is not pure classification either, because of the “new_whale” class. It is not few-shot recognition, because you have quite massive training data: 25k images. On the other hand, lots of classes have only one photo.

So it is hard.

Originally I entered this competition to try out a library and its out-of-the-box magic: one-cycle policy, fancy augmentation and no boilerplate. Even so, I was too lazy to start until Radek posted his starter pack.

I will later devote a post to my experience with it (spoiler: it is cool, but has its drawbacks as well), and for now I will try to be concise.

Original solution.

ResNet50 -> global concat (max, avg) pool -> BN -> Dropout -> Linear(2048) -> ReLU -> BN -> Dropout -> clf(5004).

First, the head is trained for X epochs, then the whole network is fine-tuned for Y epochs.

Validation set is 1 photo per whale for all whales with ≥2 photos + 1000 new_whales.
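A minimal sketch of such a split, assuming the labels come as a plain dict from image id to whale id. The function name, the dict layout and the fixed seed are illustrative assumptions, not taken from the original pipeline:

```python
import random
from collections import defaultdict


def make_validation_split(labels, n_new_whale=1000, seed=0):
    """Split image ids into (train_ids, val_ids).

    labels: dict mapping image id -> whale id, with "new_whale" for unknowns
    (hypothetical layout, chosen for illustration).
    """
    rng = random.Random(seed)

    # Group images by whale identity.
    by_whale = defaultdict(list)
    for image_id, whale in labels.items():
        by_whale[whale].append(image_id)

    val = []
    for whale, images in by_whale.items():
        if whale == "new_whale":
            continue
        # Hold out one photo only if at least one other stays in train.
        if len(images) >= 2:
            val.append(rng.choice(sorted(images)))

    # Add a fixed number of new_whale images to validation.
    new_whales = sorted(by_whale.get("new_whale", []))
    val.extend(rng.sample(new_whales, min(n_new_whale, len(new_whales))))

    val_set = set(val)
    train = [image_id for image_id in labels if image_id not in val_set]
    return train, val
```

Whales with a single photo never reach validation here, so every validation identity still has at least one training example.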

new_whale is inserted into the softmaxed predictions with a constant score (the threshold), which is set on the validation set by brute-force search in the range from 0 to 0.95. After the threshold is found, the network is retrained on the full train set.
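The threshold trick can be sketched roughly as follows. The grid step of 0.05 and the use of MAP@5 as the validation score are assumptions (the text above names neither), and all function names are hypothetical:

```python
def top5_with_new_whale(probs, threshold):
    """Insert new_whale with a constant score and return the top-5 labels.

    probs: dict mapping known whale id -> softmax probability.
    """
    scored = dict(probs)
    scored["new_whale"] = threshold
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[:5]


def map_at_5(predictions, targets):
    """Mean average precision at 5 with a single true label per image."""
    total = 0.0
    for ranked, target in zip(predictions, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)


def search_threshold(all_probs, targets, step=0.05):
    """Brute-force search over thresholds 0, step, ..., 0.95 (step is assumed)."""
    best_t, best_score = 0.0, -1.0
    n_steps = int(round(0.95 / step))
    for i in range(n_steps + 1):
        t = i * step
        preds = [top5_with_new_whale(p, t) for p in all_probs]
        score = map_at_5(preds, targets)
        if score > best_score:  # keep the smallest threshold among ties
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation set: one confident known whale, one unknown whale.
val_probs = [{"w1": 0.9, "w2": 0.1}, {"w1": 0.3, "w2": 0.2}]
val_targets = ["w1", "new_whale"]
best_t, best_score = search_threshold(val_probs, val_targets)
```

A low threshold pushes new_whale down the ranking and a high one pushes it above confident known-whale predictions, so the search is just looking for the balance point on held-out data.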

I thought that the proper way to solve this problem was some kind of triplet/Siamese loss, but I wanted to check how far I could go with pure classification. It turned out, quite far.

The basic solution worked great with fast training, but with more epochs it overfitted a lot. Here is the chain of things that improved the basic solution:

  • Center loss. It allowed training longer w/o severe overfitting. Basic ResNet50, trained for 32+64 epochs on 384x384 grayscale: 0.813 lb
Center loss implementation. Sorry for the screenshot, I will publish a repo soon.
  • Add temperature scaling before softmax: 0.834 lb. It is a simple coefficient to multiply the logits by, found on the validation set. For me 2.2 worked well.
  • Train on public bboxes: 0.872 lb
  • Switch to ResNet152: 0.877 lb. But ResNet152 is unstable and slow :(. So I continued experiments with ResNet50.
  • Add a 1-NN distance classifier on the penultimate L2Norm(Linear(2048)) features. I transformed the distance to a similarity by (2-distance)/2 and merged it with the classifier predictions. New whale is represented by another threshold, as well as by the nearest image from the “new_whale” class.
    ResNet50 0.883 lb, ResNet152 0.899 lb, ensemble ResNet50 + ResNet152 0.904 lb. Trained for 100+100 epochs.
  • ResNet50 with 4 heads, EnsembleNet-like, each with its own pooling and 2048 features (4*2048 after concat): 0.897 lb. Center loss is applied to each head.
Figure from Ensemble Feature for Person Re-Identification
  • SE-ResNeXt-50: 0.901 lb
  • Decrease each head's dimensionality from 2048 to 512, so the final concat is 2048. This hurt the softmax classifier, but 1-NN on the features received a huge boost: 0.920 lb
  • I was wondering why center loss helps so much, if it is basically another (random) classifier + a feature norm constraint. Maybe it is enough to just constrain the feature norm? It turned out that yes. And it had already been invented under the name of Ring Loss: 0.934 lb
Ring Loss implementation
  • Change backbone to VGG16-BN: 0.942 lb
  • Change pooling to constant GeM(3.74) pooling: 0.944 lb.
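For readers who have not met GeM (generalized mean) pooling from the last item: it raises activations to a power p, averages them over the spatial positions and takes the p-th root. Below is a rough pure-Python sketch with p fixed to 3.74 as in the list; the input layout (a list of channels, each a flat list of activations) and the eps clamp value are assumptions made for illustration:

```python
def gem_pool(feature_map, p=3.74, eps=1e-6):
    """Generalized-mean pooling with a constant (not learned) exponent p.

    feature_map: list of channels, each a flat list of activations
    (hypothetical layout; a real model would pool a tensor).
    Returns one pooled value per channel.
    """
    pooled = []
    for channel in feature_map:
        # Clamp so that non-positive activations can be raised to a real power.
        clamped = [max(v, eps) for v in channel]
        mean_pow = sum(v ** p for v in clamped) / len(clamped)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


# p = 1 is average pooling; large p approaches max pooling.
avg_like = gem_pool([[0.0, 2.0]], p=1.0)
gem_374 = gem_pool([[0.0, 2.0]])
max_like = gem_pool([[0.0, 2.0]], p=50.0)
```

With p between 1 and infinity the result sits between the average and the maximum, which is why GeM can stand in for the concat (max, avg) pooling used earlier.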

These are the best results I was able to get from a single ImageNet-pretrained network.
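The feature norm constraint from the Ring Loss step in the list above fits in a few lines. This is only a sketch of the loss value, not the implementation used in the competition: in the Ring Loss paper the target radius R is a learned parameter, while here it is passed in as a plain number, and the function and argument names are hypothetical:

```python
import math


def ring_loss(features, radius, weight=1.0):
    """Penalize feature vectors whose L2 norm deviates from `radius`.

    features: list of feature vectors (lists of floats).
    radius:   target norm R (learned during training in the paper; fixed here).
    weight:   loss weight lambda (assumed default).
    """
    penalties = []
    for f in features:
        norm = math.sqrt(sum(v * v for v in f))
        penalties.append((norm - radius) ** 2)
    # lambda / 2 times the mean squared deviation of the norm from R
    return 0.5 * weight * sum(penalties) / len(penalties)


# Both vectors have norm 5, so a ring of radius 5 costs nothing.
on_ring = ring_loss([[3.0, 4.0], [0.0, 5.0]], radius=5.0)
off_ring = ring_loss([[3.0, 4.0]], radius=4.0)
```

The term is added to the usual classification loss, so it only nudges all features toward a common norm and carries no class information of its own.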

By that time I had already teamed up with Anastasiia Mishchuk and Igor. They were training networks with hard-negative mining triplet loss, and have described their part here: . One cool thing we discovered: if I initialized my models not from ImageNet, but from their models, it boosted the score even more.

So our best single network is SE-ResNeXt-50, pre-trained with hard-mining triplet loss and then trained as usual by my pipeline: 0.955 lb

Unfortunately, other networks did not do as well with this strategy:

  • Inception v4: 0.925 lb
  • DenseNet121: 0.945 lb
  • VGG16-BN: 0.944 lb

Ensembling gave us 0.961 lb, verification by local features (Hessian-AffNet + HardNet) -> 0.963. When local features say there is a match, it is probably a match. If not, it means nothing, unfortunately. We verified only the top-5 predictions for each class.

Some post-processing magic by Anastasiia — 0.966 lb.

Things that did not work for me:

  • TTA. It is really sad :(
  • ArcFace, CosFace, *Face, LGM losses
  • Focal loss
  • BCE loss
  • Making any use of “new_whale” images, except for 1-NN prediction
  • oversampling
  • pseudo-labeling
  • RandomErase aka CutOut augmentation, RGB or RGB2Gray augmentation. Default augmentation w/o flipping worked the best
  • Contour detection
  • mixup

I wish I had tried:

That's all. Thanks to my teammates, the organizers and, of course, Radek ;)

Upd.: Cleaned-up code is here:
Upd2.: Metric learning part:

Computer Vision researcher and consultant. Co-founder of Ukrainian Research group “Szkocka”.