The following comparative evaluation appears in
Fred Rothganger, Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce.
3D Object Modeling and Recognition Using Local Affine-Invariant
Image Descriptors and Multi-View Spatial Constraints.
International Journal of Computer Vision, accepted.

In order to obtain a quantitative comparison of our method with other
state-of-the-art object recognition systems, we have provided our
dataset to several other research groups. The algorithms proposed
by Ferrari, Tuytelaars & Van Gool [2004], Lowe [2004],
Mahamud & Hebert [2003], and Moreels, Maire &
Perona [2004] have been tested by their authors in this
comparative study. As shown by the above figure, all the
algorithms perform well on our dataset, achieving recognition rates
of 90% and above for false detection rates below 10%. In this
experiment, the color version of our algorithm and
Lowe's program perform best for very low false
detection rates, followed by the black-and-white version of our
algorithm. The technique proposed by Ferrari et
al. achieves an extremely high recognition rate
at the cost of a somewhat higher false detection rate. Although all
five algorithms use multiple views to form object models, only Lowe's
algorithm and ours actually combine the information associated with
multiple views in the recognition process. Lowe's algorithm
does not construct an explicit 3D model, but it allows multiple
training views sharing common patches to vote for the same
object. The other methods consider all training
pictures independently, which essentially reduces object recognition
to image matching.
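
The recognition and false detection rates quoted above are operating points obtained by varying a decision threshold. The following is a minimal sketch of how such points could be computed, not the evaluation code used in the study. It assumes that each (test image, object) pair receives a scalar confidence score, that the recognition rate is the fraction of pairs where the object is present and the score passes the threshold, and that the false detection rate is the fraction of pairs where the object is absent and the score passes the threshold. The function name and the score lists are hypothetical.

```python
def operating_points(scores_present, scores_absent):
    """Return (false_detection_rate, recognition_rate) pairs, one per threshold.

    scores_present: scores for (image, object) pairs where the object is present.
    scores_absent:  scores for pairs where the object is absent.
    Both definitions are assumptions made for this sketch.
    """
    # Sweep every distinct score as a threshold, from strictest to loosest.
    thresholds = sorted(set(scores_present) | set(scores_absent), reverse=True)
    points = []
    for t in thresholds:
        recognition_rate = sum(s >= t for s in scores_present) / len(scores_present)
        false_detection_rate = sum(s >= t for s in scores_absent) / len(scores_absent)
        points.append((false_detection_rate, recognition_rate))
    return points


# Illustrative scores only; these are not results from the study.
points = operating_points([0.9, 0.8, 0.4], [0.5, 0.1])
```

Plotting the recognition rate against the false detection rate for each method gives curves that can be compared at a fixed false detection rate, such as the 10% level mentioned above.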
The five algorithms use different geometric
constraints to reject inconsistent matches. We exploit the global 3D
(affine and Euclidean) rigidity of our object models. Ferrari et
al. instead use a set of local 2D affine
rigidity constraints, which are somewhat weaker but allow the
recognition of deformable objects such as magazines. The
remaining authors exploit global 2D (affine or Euclidean)
rigidity constraints, which are best suited to situations where the training and
test views are close to each other, or where the relief of the scene is
small compared to the distance separating it from the observer. To
test the power of these constraints, we have included in our
comparative study a baseline recognition method where the pairwise
image matching part of our modeling algorithm is used as a simple
recognition engine, an object being declared as recognized when a
sufficient percentage of the patches found in a training view are
matched to the test image. The geometric constraints used in this
case are quite weak, and amount to exploiting the epipolar geometry
conventionally used in wide-baseline stereo. As shown by the figure,
although this simple method gives reasonable results
(over 50% true positive rate with no false positives), it gives the
worst recognition rates of all methods tested.
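
The decision rule of the baseline described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: the function names are hypothetical, the threshold value of 0.3 is arbitrary, and the counts of matched patches are assumed to come from a pairwise matcher that has already applied the epipolar consistency check.

```python
def matched_fraction(num_matched, num_training_patches):
    """Fraction of a training view's patches matched to the test image."""
    if num_training_patches <= 0:
        return 0.0
    return num_matched / num_training_patches


def is_recognized(matches_per_view, min_fraction=0.3):
    """Declare the object recognized if any single training view has a
    sufficient fraction of its patches matched to the test image.

    matches_per_view: list of (num_matched, num_training_patches) tuples,
    one per training view. min_fraction is an assumed threshold.
    """
    return any(
        matched_fraction(matched, total) >= min_fraction
        for matched, total in matches_per_view
    )
```

Because each training view is tested independently, this rule never pools evidence across views. Raising `min_fraction` trades recognition rate for fewer false detections.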
These results should not be interpreted as a conclusive ranking of the
tested algorithms, since our test dataset is quite small, and it is
probably biased in favor of our method. However, they provide some
evidence (and this should not be particularly surprising) that
combining multiple views improves recognition performance, and so does
the inclusion of geometric constraints in the matching process. Of
course, there is a price to pay for the integration of multiple images
into a single model: First, this makes modeling more costly and
complicated. Second, this requires the use of training views with
sufficient overlap, as confirmed by our experiments with the data of
Ferrari et al., where the input images have too
few patches in common to allow us to construct any meaningful model.