An Experimental Comparison of 3D Object Recognition Systems

The following comparative evaluation appears in
Fred Rothganger, Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce.
3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints. International Journal of Computer Vision, accepted.

To obtain a quantitative comparison of our method with other state-of-the-art object recognition systems, we have provided our dataset to several other research groups. The algorithms proposed by Ferrari, Tuytelaars & Van Gool [2004], Lowe [2004], Mahamud & Hebert [2003], and Moreels, Maire & Perona [2004] have been tested by their authors in this comparative study. As shown by the above figure, all the algorithms perform well on our dataset, achieving recognition rates of 90% and above for false detection rates below 10%. In this experiment, the color version of our algorithm and Lowe's program perform best for very low false detection rates, followed by the black-and-white version of our algorithm. The technique proposed by Ferrari et al. achieves an extremely high recognition rate at the cost of a somewhat higher false detection rate. Although all five algorithms use multiple views to form object models, only Lowe's algorithm and ours actually combine the information associated with multiple views in the recognition process. Lowe's algorithm does not construct an explicit 3D model, but it allows multiple training views sharing common patches to vote for the same object. The other methods consider all training pictures independently, which essentially reduces object recognition to image matching.
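
The trade-off between recognition rate and false detection rate discussed above can be illustrated with a short sketch. It assumes each test trial yields a single matching score and a ground-truth flag saying whether the object is actually present; the function name `roc_points`, the score representation, and the exact rate definitions (recognized positives over all positives, accepted negatives over all negatives) are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def roc_points(scores: List[float], present: List[bool]) -> List[Tuple[float, float]]:
    """Sweep a decision threshold over the scores.

    Returns one (false_detection_rate, recognition_rate) pair per distinct
    threshold, from the strictest threshold to the most permissive one.
    A trial counts as a detection when its score is >= the threshold.
    """
    if len(scores) != len(present):
        raise ValueError("scores and present must have the same length")
    n_pos = sum(present)
    n_neg = len(present) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative trial")

    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Detections on trials where the object is present (true positives).
        tp = sum(1 for s, p in zip(scores, present) if p and s >= threshold)
        # Detections on trials where the object is absent (false positives).
        fp = sum(1 for s, p in zip(scores, present) if not p and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


if __name__ == "__main__":
    # Hypothetical scores, for illustration only (not data from the study).
    demo_scores = [0.9, 0.8, 0.4, 0.3]
    demo_present = [True, True, False, True]
    for fdr, rr in roc_points(demo_scores, demo_present):
        print(f"false detection rate {fdr:.2f}, recognition rate {rr:.2f}")
```

Plotting such pairs for each algorithm would give curves of the kind the figure summarizes, under the stated assumptions about how the rates are defined.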

The five algorithms use different geometric constraints to reject inconsistent matches. We exploit the global 3D (affine and Euclidean) rigidity of our object models. Ferrari et al. instead use a set of local 2D affine rigidity constraints, which are somewhat weaker but allow the recognition of deformable objects such as magazines. The remaining authors exploit global 2D (affine or Euclidean) rigidity constraints, which are best suited to situations where the training and test views are close to each other, or where the relief of the scene is small compared to the distance separating it from the observer. To test the power of these constraints, we have included in our comparative study a baseline recognition method in which the pairwise image matching part of our modeling algorithm is used as a simple recognition engine: an object is declared recognized when a sufficient percentage of the patches found in a training view are matched to the test image. The geometric constraints used in this case are quite weak, and amount to exploiting the epipolar geometry conventionally used in wide-baseline stereo. As shown by the figure, although this simple method gives reasonable results (over 50% true positive rate with no false positives), it gives the worst recognition rates of all methods tested.
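
The decision rule of the baseline method can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the pairwise matching stage (including its epipolar consistency check) has already run and reports, for each training view, how many of its patches were matched to the test image. The names `matched_fraction` and `baseline_recognize` and the parameter `min_fraction` are hypothetical, and the text does not state the value of the "sufficient percentage" threshold.

```python
from typing import Dict, Optional, Tuple


def matched_fraction(n_matched: int, n_total: int) -> float:
    """Fraction of a training view's patches that were matched to the test image."""
    if n_total <= 0:
        raise ValueError("a training view must contain at least one patch")
    if not 0 <= n_matched <= n_total:
        raise ValueError("n_matched must lie between 0 and n_total")
    return n_matched / n_total


def baseline_recognize(
    views: Dict[str, Tuple[int, int]],
    min_fraction: float,
) -> Optional[Tuple[str, float]]:
    """Declare the object recognized if some training view passes the threshold.

    views maps a training-view id to (patches matched, patches in the view).
    min_fraction is the hypothetical "sufficient percentage", as a fraction.
    Returns (best view id, its matched fraction), or None if no view passes.
    """
    best: Optional[Tuple[str, float]] = None
    for view_id, (n_matched, n_total) in views.items():
        frac = matched_fraction(n_matched, n_total)
        if frac >= min_fraction and (best is None or frac > best[1]):
            best = (view_id, frac)
    return best


if __name__ == "__main__":
    # Hypothetical match counts, for illustration only.
    demo_views = {"view_a": (12, 40), "view_b": (30, 50)}
    print(baseline_recognize(demo_views, min_fraction=0.5))
    print(baseline_recognize(demo_views, min_fraction=0.7))
```

Because each training view is scored independently, this rule treats recognition as image matching, which is consistent with the baseline's comparatively weak results.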

These results should not be interpreted as a conclusive ranking of the tested algorithms, since our test dataset is quite small, and it is probably biased in favor of our method. However, they provide some evidence (and this should not be particularly surprising) that combining multiple views improves recognition performance, and so does the inclusion of geometric constraints in the matching process. Of course, there is a price to pay for the integration of multiple images into a single model: First, this makes modeling more costly and complicated. Second, this requires the use of training views with sufficient overlap, as confirmed by our experiments with the data of Ferrari et al., where the input images have too few patches in common to allow us to construct any meaningful model.