Object Recognition

The goal is to do 3D modeling and recognition. We use a novel and simple representation for 3D objects that consists of a collection of planar texture patches. For each patch, the model contains the 3D location, the slope of the plane, and the shape of the patch. The shape is always a parallelogram.

We are working in two domains. One is still pictures from widely separated viewpoints. The other is video image sequences. Each domain requires some specialization, but the underlying technique is the same. We will summarize that technique below. However, rather than making you read to the end, here are the results now...

The Technique

Theory (short version)

We assume an affine projection model. Although objects are generally not flat in the large, if we take a small enough region of the surface, we can treat it as flat. Suppose we see two different projections of such a flat surface patch (ie: in two different pictures of the object). There exists an affine transformation between the two image patches.

We assume that there is enough texture in the image patches that we can measure the transformation between them. In fact, what we do is determine the rectifying tranformation R between an image patch and its normalized form, which we will describe below. The normalization removes all relevant ambiguity, so we can determine the affine transformation T between the two image patches by multiplying one transformation by the inverse of the other. If S=R-1, then we have T=R1S2.

If we have the inverse rectification matrix S for each patch in a matched pair, then the information they contain is equivalent to three point matches between a pair of images. Given at least two matches (two or more pairs of patches), we can perform Tomasi-Kanade style factorization on the collected S matrices to get the object structure and camera positions (structure from motion). This extends to any number of views.

The resulting 3D information for a patch gives its position in space and two vectors in the plane that contains the patch. The vectors describe the size and shape of the surface patch as a parallelogram. It is a simple matter at this point to grab a copy of the texture from one of the images to finish constructing the model described in the introduction above.

The normalized residual of the factorization gives a measure of how rigidly connected the set of patches is. The multiview constraint says that a single set of 3D patch information and a single set of projection matrices should explain all the measured S matrices if the set of patches is indeed rigidly connected. This provides us a powerful tool for both modeling and recognition.

(Please see the CVPR paper for a more complete discussion of the theory.)

Still Images

Modeling

For widely separated still images, we assume there is only one rigid object being modelled. The goal is to extract patches from all the input images that belong to the surface of the object, and assemble a single rigid model from them.

The first step is to identify interest points in each of the input images. We use the affine-invariant interest point process developed by Mikolajczyk and Schmid. It involves finding Harris interest points, their characteristic scale, and an affine transformation for the patch around each point to make the variance of the intensity equal in all directions. In addition, we determine a rotation such that the gradient angle is zero.

The result is a collection of patches that are invariant to affine projection. Currently, all our work assumes affine cameras, and we assume that the patches cover a small enough portion of the surface of the object that within the patch the surface is approximately flat. (We also assume the object is opaque, so there is no front/back ambiguity.)

Given a set of affine-invariant patches for each image, the next step is to find matches between image pairs. The relationships between the images is given as part of the input. We find matches between each given pair using the matching algorithm described on the algorithm page.

The result of the matching process is a collection of image pairs, each containing a set of matched patches. Each patch in each image has its own identity, so we can form a graph using the patches as vertices and the connections between them as edges. If there are no matching errors, then each connected component of the graph represents all the views of one surface patch.

We use Tomasi and Kanade style factorization to construct the 3D model. However, not every surface patch is visible in every view, so we have missing data. We do not hallucinate data. Instead, we form dense blocks of views x surface patches and factorize them individually. Then we register all the blocks into a common frame and use a non-linear method to refine the camera and surface patch matrices. The error for the refinement is the difference between the measured image patch S matrices and the back-projected surface patch matrices.

Recognition

Recognition is similar to the beginning stages of modeling. We first find interest points and their affine invariant patches in a test image. Then we find a set of matches between the patches in the image and the patches in a given model using the matching algorithm. As a byproduct of the matching algorithm, we get an estimated pose for the model.

Video

Here we discuss how to make 3D models from video shots and how to find matches between these models. Video can contain multiple independently moving objects. We approximate this with multiple rigid components, and motion segmentation is a key problem we try to solve.

Modeling

For video, we can assume a strict linear sequence of images, as opposed to a more general graph of image relationships (as in the case of still images above). It is possible for a surface patch to disappear and reappear in the sequence, however we currently don't handle this case. Instead, we assume the same surface patch never reappears. (Also, presently we do not add new points after the first frame.)

The first step is (as usual) to find interest points and their affine-invariant patches in the first frame of the shot. From there, we use the Kanade-Lucas-Tomasi (KLT) feature tracker implemented by Birchfield to follow the interest points thru the subsequent frames. At each frame, for each point in the frame, we update the alignment between the image patch in the first frame and the current frame. This preserves the affine-invariance of each patch and gives the necessary S matrices for 3D modeling.

Given the resulting S matrices, the modeling proceeds exactly as in the still image case. The structure of the missing data is simpler in the video case, so we use a different approach to form dense blocks. The views for each surface patch are contiguous, so we can find the maximal dense blocks without solving the full NP-complete Clique problem.

Where the modeling really departs from the still image case is motion segmentation. We use the grouping algorithm to find groups of surface patches that move together rigidly. However, rather than simply merging all sufficiently large groups together, we use the following method to form maximal rigid groups:

  1. Find the cost for merging each possible pair of distinct groups. The cost function is essentially the normalized residual for factorizing the surface patches in the union of the two groups.
  2. Determine a threshold from statistics on the cost of each each individual group.
  3. If the lowest cost pair of groups falls below the threshold, merge them and go to step 1. Otherwise, stop.
Each maximal group becomes a separate rigid component in the final model.

Recognition

There are two possible approaches for matching two video shots. One is to form a model from one shot and match it to an individual frame in the other shot. The other approach is to form models of both shots and compare the models.

We currently take the second approach. For each possible pair of rigid components between the two models, we apply the matching algorithm. The result is a set of matched patches. We assign a confidence score to the match as follows: confidence = M / (A + B - M), where M is the number of matches, and A and B are the number of patches in the respective rigid components. For an overall confidence score between two models, we use the Earth Mover's Distance (EMD).

Future Work

The modeling is currently based on the affine camera assumption. We would like to generalize this to handle perspective cameras where the image patches remain locally affine. That is, the depth of the surface patch can affect the scale of the image patch, but the image patches remain parallelograms.

The video work is still in its beginnings. We need to work more on motion segmentation. We also need to actually implement efficient clique finding for stitching. Finally, we need to test shot matching on challenging scenes. We would like to handle humans in motion using articulated rigid models, albeit this is not a perfect fit.

Most algorithms used in the process can be replaced by other approaches. For instance, we could try RANSAC rather than grouping. We could also try other descriptors rather than normalized correlation of the actual rectified patches. (That would make the initial stage of matching less selective and therefore more robust.) The list of possible variations is endless; we will try to select the most promising options and experiment with them.

Contribution / Related Work

In wide-baseline matching, the affine-invariant patch around an interest point is usually used just to find the match in the other image. Thus, every interest point provides only 1 point match for reconstruction. We contribute the idea of using all the information available from an aligned pair of patches, which gives 3 points for reconstruction. The two additional points are relative to the patch center, and upon reconstruction provide us with relative information in 3D: the tangent plane of the surface patch.

The normalized residual of the factorization on these point triples provides two novel and useful things: a matching constraint in the case of widely separated views, and a measure of rigidity in the case of motion segmentation.

These ideas are a natural extension of the large amount of work that has been done in wide-baseline matching over the last few years. Only in the last couple of years have practical methods for finding affine-invariant patches arisen.

We also profit from all the work in structure from motion.