Algorithms

We use certain algorithms repeatedly in various contexts in the modeling and recognition processes. Here, we summarize them in a somewhat abstract form. The actual form of each algorithm depends on choices for certain components.

Grouping

Grouping is the process of finding subsets G of the input which are consistent according to a cost function d(G). The cost function measures the inconsistency of a given subset. Grouping also uses model based cost function d(G,x) which measures the inconsistency of the element x versus a model made from G. The procedure is:
  1. For every possible pair {a,b} of distinct elements, determine the cost d({a,b}).
  2. For each element a:
  3. Keep groups that are larger than a threshold T2.
We usually determine the thresholds T1 and T2 from statistics on the data itself. For T1, we use the average cost found in step 1. For T2, we use statistics on the sizes of the groups.

Matching

Matching is the process of finding a set of consistent pairings between two different sets of patches. "Patches" are either the affine-invariant patches in an image, or the the surface patches stored in a model. We use all three possible cases: image-image, model-image, model-model. The procedure is:
  1. Find patch pairs with high normalized correlation.
  2. Use grouping to find a maximal set of rigidly connected patches. The cost function varies. Merge all sufficiently large groups into a single big set.
  3. Find more patch pairs guided by a model reconstructed from the set.
  4. Repeat from step 2 until no more new patch pairs are found.
The cost function in step 2 is generally based on the normalized residual of the structure-from-motion factorization. In the case of model-model matching, the cost is the normalized residual of the least-squares estimation of the transformation between the two models.

In step 3, we use a different model for each of the three different types of matching:

Alignment

Alignment is really just an application of Levenberg-Marquardt (LM) to the problem of finding a consistent set of S matrices. To review, an S matrix is a 2 by 3 matrix that describes the transformation from a normalized form of a patch to the actual image patch. The transformation is between homogenous 2D coordinates and non-homogenous pixel coordinates. All the normalized patches are exactly the same size. What we do is fix one patch as a reference and adjust the other patche to maximize the normalized correlation it and the reference patch. For the purpose of LM, the parameters are the 6 elements of the S matrix, and the function values are the differences between each pixel in normalized patches.

The normalized patches must be discretized into pixels for the purpose of this process. We typically use 16x16 to 64x64 pixel grids, depending on the scale of the source patches in the image and the specific application.

If there are more than two patches to be aligned, we simply elect one as reference and align all the others to it.

The normalized form itself removes all relevant ambiguity, so we should be able to directly transform one image patch to another by multiplying one S matrix by the inverse of the other. However, it is not possible to determine the normalized form perfectly from image data due to noise and discretization. Using LM to refine the alignment helps to keep this error low.

Note that it isn't important that the reference patch be exactly in the normalized form in order to do 3D reconstruction. It is only important that all the S matrices associated with a surface patch are consistent with each other. Again, LM helps us achieve this.

Affine-adaptation

The process of adapting to local texture is called "affine adaptation" because under affine assumptions (planar surface, affine camera) the deformation in the texture between two views of the same object is a 2x2 linear transformation. The process is a search for a synthetic view of the local texture that fulfills certain criteria. An affine-invariant interest point then has 6 parameters: the (x,y) position and the four values of the 2x2 transformation matrix. It is convenient to think of uniform scaling as a seventh parameter and factor it out of the 2x2 matrix.

Mikolajczyk and Schmid extended the state of the art by proposing that affine-adaptation change not only the shape (2x2 transformation matrix) and scale, but also the location of the interest point. Here is a summary of the algorithm: