Bilinear CNNs for Fine-grained Visual Recognition
People
Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji
Abstract
We present a simple and effective architecture for fine-grained visual recognition called Bilinear Convolutional Neural Networks (B-CNNs). These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs belong to the class of orderless texture representations, but unlike prior work they can be trained in an end-to-end manner. Our most accurate model obtains 84.1%, 79.4%, 86.9%, and 91.3% per-image accuracy on the Caltech-UCSD Birds, NABirds, FGVC Aircraft, and Stanford Cars datasets respectively, and runs at 30 frames per second on an NVIDIA Titan X GPU. We then present a systematic analysis of these networks and show that (1) the bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, (2) they are also effective for other image classification tasks such as texture and scene recognition, and (3) they can be trained from scratch on the ImageNet dataset, offering consistent improvements over the baseline architecture. Finally, we present visualizations of these models on various datasets using top activations of neural units and gradient-based inversion techniques.
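The pooling step is compact enough to state in code. Below is a minimal PyTorch sketch of the sum-pooled outer product followed by the signed square-root and l2 normalization applied at the descriptor level; the function name and the 1e-12 stabilizer are our own choices, and the released implementation (linked under Publications) should be treated as authoritative.

import torch
import torch.nn.functional as F

def bilinear_pool(fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
    """Sum-pool the outer product of two CNN feature maps.

    fa: (N, C1, H, W) conv features from stream A
    fb: (N, C2, H, W) conv features from stream B (same H and W)
    Returns an (N, C1*C2) orderless descriptor: the spatial layout is
    summed out, which makes the representation translationally invariant.
    """
    n, c1, h, w = fa.shape
    c2 = fb.shape[1]
    fa = fa.reshape(n, c1, h * w)                      # (N, C1, HW)
    fb = fb.reshape(n, c2, h * w)                      # (N, C2, HW)
    x = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)    # pooled outer products
    x = x.reshape(n, c1 * c2)
    x = torch.sign(x) * torch.sqrt(x.abs() + 1e-12)    # signed square root
    return F.normalize(x, dim=1)                       # l2 normalization

Every step above is differentiable, which is what allows a linear classifier placed on top of this descriptor to be trained jointly with both CNN streams end-to-end.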
Publications
-
Improved Bilinear Pooling with CNNs,
Tsung-Yu Lin and Subhransu Maji
British Machine Vision Conference (BMVC), 2017
pdf, bibtex, code for matrix sqrt (sketched after this list), code for classification experiments, talk slides
-
Bilinear CNNs for Fine-grained Visual Recognition,
Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017
pdf preprint, bibtex, code
-
Visualizing and Understanding Deep Texture Representations,
Tsung-Yu Lin and Subhransu Maji
Computer Vision and Pattern Recognition (CVPR), 2016
project page
-
Bilinear CNN Models for Fine-grained Visual Recognition,
Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji
International Conference on Computer Vision (ICCV), 2015
pdf, pdf-supp, bibtex, code
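The "Improved B-CNN" results in the table below come from replacing the element-wise signed square root with a matrix square root of the pooled C×C bilinear matrix (see the BMVC'17 paper and the "code for matrix sqrt" link above). As an illustration only, here is one standard differentiable way to compute it, the coupled Newton-Schulz iteration; the function name, iteration count, and pre-scaling are ours and need not match the released solver.

import torch

def newton_schulz_sqrt(a: torch.Tensor, iters: int = 5) -> torch.Tensor:
    """Approximate sqrtm(a) for a batch of symmetric PSD matrices.

    a: (N, C, C) pooled bilinear matrices (symmetric PSD when both
    streams come from the same CNN). The iteration uses only matrix
    products, so it stays GPU-friendly and autograd can differentiate
    through it.
    """
    n, c, _ = a.shape
    norm = a.norm(dim=(1, 2), keepdim=True) + 1e-12
    y = a / norm                                   # pre-scale for convergence
    z = torch.eye(c, device=a.device, dtype=a.dtype).repeat(n, 1, 1)
    eye3 = 3.0 * torch.eye(c, device=a.device, dtype=a.dtype)
    for _ in range(iters):
        t = 0.5 * (eye3 - z.bmm(y))
        y = y.bmm(t)                               # y -> sqrt of scaled input
        z = t.bmm(z)                               # z -> inverse sqrt
    return y * norm.sqrt()                         # undo the pre-scaling

Exact alternatives based on an SVD or eigendecomposition also work; the sketch above just shows the shape of the computation that replaces the element-wise normalization.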
Results
Per-image accuracy (%) on various fine-grained recognition datasets is shown below. See Table 2 in the PAMI paper for a detailed comparison.
Model                   | Caltech-UCSD Birds | FGVC Aircraft | Stanford Cars | NABirds
Improved B-CNN (vgg-m)  | 81.3               | 84.0          | 88.5          | -
Improved B-CNN (vgg-d)  | 85.8               | 88.5          | 92.1          | -
B-CNN (vgg-m)           | 78.1               | 79.5          | 86.5          | -
B-CNN (vgg-d)           | 84.0               | 86.9†         | 90.6          | -
B-CNN (vgg-m+vgg-d)     | 84.1               | 86.6†         | 91.3          | 79.4
† Improvements over the ICCV'15 numbers are due to improved cropping (a central 448×448 crop taken from a 512×512 image).
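For concreteness, the cropping described in the footnote looks roughly like the PIL-based sketch below; the helper name and resize filter are our assumptions.

from PIL import Image

def central_crop_448(path: str) -> Image.Image:
    """Resize to 512x512, then take the central 448x448 crop."""
    im = Image.open(path).convert("RGB")
    im = im.resize((512, 512), Image.BILINEAR)     # filter choice is assumed
    off = (512 - 448) // 2                         # 32-pixel border on each side
    return im.crop((off, off, off + 448, off + 448))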
Talk slides
Acknowledgements
This research was supported in part by NSF grant IIS-1617917, a faculty gift from Facebook, and IARPA IAR2014-14071600010. The experiments were performed using high-performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative and GPUs donated by NVIDIA.