We are working in two domains. One is still pictures from widely separated viewpoints. The other is video image sequences. Each domain requires some specialization, but the underlying technique is the same. We will summarize that technique below. However, rather than making you read to the end, here are the results now...
We assume that there is enough texture in the image patches that we can measure the transformation between them. In fact, what we do is determine the rectifying tranformation R between an image patch and its normalized form, which we will describe below. The normalization removes all relevant ambiguity, so we can determine the affine transformation T between the two image patches by multiplying one transformation by the inverse of the other. If S=R-1, then we have T=R1S2.
If we have the inverse rectification matrix S for each patch in a matched pair, then the information they contain is equivalent to three point matches between a pair of images. Given at least two matches (two or more pairs of patches), we can perform Tomasi-Kanade style factorization on the collected S matrices to get the object structure and camera positions (structure from motion). This extends to any number of views.
The resulting 3D information for a patch gives its position in space and two vectors in the plane that contains the patch. The vectors describe the size and shape of the surface patch as a parallelogram. It is a simple matter at this point to grab a copy of the texture from one of the images to finish constructing the model described in the introduction above.
The normalized residual of the factorization gives a measure of how rigidly connected the set of patches is. The multiview constraint says that a single set of 3D patch information and a single set of projection matrices should explain all the measured S matrices if the set of patches is indeed rigidly connected. This provides us a powerful tool for both modeling and recognition.
(Please see the CVPR paper for a more
complete discussion of the theory.)
Still Images
The first step is to identify interest points in each of the input images. We use the affine-invariant interest point process developed by Mikolajczyk and Schmid. It involves finding Harris interest points, their characteristic scale, and an affine transformation for the patch around each point to make the variance of the intensity equal in all directions. In addition, we determine a rotation such that the gradient angle is zero.
The result is a collection of patches that are invariant to affine projection. Currently, all our work assumes affine cameras, and we assume that the patches cover a small enough portion of the surface of the object that within the patch the surface is approximately flat. (We also assume the object is opaque, so there is no front/back ambiguity.)
Given a set of affine-invariant patches for each image, the next step is to find matches between image pairs. The relationships between the images is given as part of the input. We find matches between each given pair using the matching algorithm described on the algorithm page.
The result of the matching process is a collection of image pairs, each containing a set of matched patches. Each patch in each image has its own identity, so we can form a graph using the patches as vertices and the connections between them as edges. If there are no matching errors, then each connected component of the graph represents all the views of one surface patch.
We use Tomasi and Kanade style factorization to construct the 3D model. However, not every surface patch is visible in every view, so we have missing data. We do not hallucinate data. Instead, we form dense blocks of views x surface patches and factorize them individually. Then we register all the blocks into a common frame and use a non-linear method to refine the camera and surface patch matrices. The error for the refinement is the difference between the measured image patch S matrices and the back-projected surface patch matrices.
The first step is (as usual) to find interest points and their affine-invariant patches in the first frame of the shot. From there, we use the Kanade-Lucas-Tomasi (KLT) feature tracker implemented by Birchfield to follow the interest points thru the subsequent frames. At each frame, for each point in the frame, we update the alignment between the image patch in the first frame and the current frame. This preserves the affine-invariance of each patch and gives the necessary S matrices for 3D modeling.
Given the resulting S matrices, the modeling proceeds exactly as in the still image case. The structure of the missing data is simpler in the video case, so we use a different approach to form dense blocks. The views for each surface patch are contiguous, so we can find the maximal dense blocks without solving the full NP-complete Clique problem.
Where the modeling really departs from the still image case is motion segmentation. We use the grouping algorithm to find groups of surface patches that move together rigidly. However, rather than simply merging all sufficiently large groups together, we use the following method to form maximal rigid groups:
We currently take the second approach. For each possible pair of rigid components between the two models, we apply the matching algorithm. The result is a set of matched patches. We assign a confidence score to the match as follows: confidence = M / (A + B - M), where M is the number of matches, and A and B are the number of patches in the respective rigid components. For an overall confidence score between two models, we use the Earth Mover's Distance (EMD).
The video work is still in its beginnings. We need to work more on motion segmentation. We also need to actually implement efficient clique finding for stitching. Finally, we need to test shot matching on challenging scenes. We would like to handle humans in motion using articulated rigid models, albeit this is not a perfect fit.
Most algorithms used in the process can be replaced by other approaches. For instance, we could try RANSAC rather than grouping. We could also try other descriptors rather than normalized correlation of the actual rectified patches. (That would make the initial stage of matching less selective and therefore more robust.) The list of possible variations is endless; we will try to select the most promising options and experiment with them.
The normalized residual of the factorization on these point triples provides two novel and useful things: a matching constraint in the case of widely separated views, and a measure of rigidity in the case of motion segmentation.
These ideas are a natural extension of the large amount of work that has been done in wide-baseline matching over the last few years. Only in the last couple of years have practical methods for finding affine-invariant patches arisen.
We also profit from all the work in structure from motion.