Algorithms
We use certain algorithms repeatedly in various contexts in the modeling and
recognition processes. Here, we summarize them in a somewhat abstract form.
The actual form of each algorithm depends on choices for certain components.
Grouping
Grouping is the process of finding subsets G of the input which are consistent
according to a cost function d(G). The cost function measures the
inconsistency of a given subset. Grouping also uses model based cost function
d(G,x) which measures the inconsistency of the element x versus a model made
from G. The procedure is:
- For every possible pair {a,b} of distinct elements, determine the cost
d({a,b}).
- For each element a:
- Use the partner b found in step 1 that gave the smallest cost to seed a
group: G={a,b}
- Find the element x that minimizes d(G,x).
- If d(G,x) is less than a threshold T1, add x to G and repeat the previous
step.
- Keep groups that are larger than a threshold T2.
We usually determine the thresholds T1 and T2 from statistics on the data
itself. For T1, we use the average cost found in step 1. For T2, we use
statistics on the sizes of the groups.
Matching
Matching is the process of finding a set of consistent pairings between two
different sets of patches. "Patches" are either the affine-invariant patches
in an image, or the the surface patches stored in a model. We use all three
possible cases: image-image, model-image, model-model. The procedure is:
- Find patch pairs with high normalized correlation.
- Use grouping to find a maximal set of rigidly
connected patches. The cost function varies. Merge all sufficiently large
groups into a single big set.
- Find more patch pairs guided by a model reconstructed from the set.
- Repeat from step 2 until no more new patch pairs are found.
The cost function in step 2 is generally based on the normalized residual
of the structure-from-motion factorization. In the case of model-model
matching, the cost is the normalized residual of the least-squares
estimation of the transformation between the two models.
In step 3, we use a different model for each of the three different types
of matching:
- image-image -- estimate the fundamental matrix and measure the symmetric
"epipolar distance" between points.
- image-model -- estimate the projection matrix from the model to the image,
and measure distances in the image between back-projected points and image
points.
- model-model -- estimate the transformation between the two models, and
measure the Euclidean distances between points in one model and the transformed
points from the other model.
Alignment
Alignment is really just an application of Levenberg-Marquardt (LM) to the
problem of finding a consistent set of S matrices. To review, an S matrix
is a 2 by 3 matrix that describes the transformation from a normalized form
of a patch to the actual image patch. The transformation is between
homogenous 2D coordinates and non-homogenous pixel coordinates. All the
normalized patches are exactly the same size. What we do is fix one patch
as a reference and adjust the other patche to maximize the
normalized correlation it and the reference patch. For the
purpose of LM, the parameters are the 6 elements of the S matrix, and the
function values are the differences between each pixel in normalized patches.
The normalized patches must be discretized into pixels for the purpose of
this process. We typically use 16x16 to 64x64 pixel grids, depending on the
scale of the source patches in the image and the specific application.
If there are more than two patches to be aligned, we simply elect one as
reference and align all the others to it.
The normalized form itself removes all relevant ambiguity, so we should
be able to directly transform one image patch to another by multiplying one
S matrix by the inverse of the other. However, it is not possible to
determine the normalized form perfectly from image data due to noise and
discretization. Using LM to refine the alignment helps to keep this error
low.
Note that it isn't important that the reference patch be exactly in the
normalized form in order to do 3D reconstruction. It is only important that
all the S matrices associated with a surface patch are consistent with each
other. Again, LM helps us achieve this.
Affine-adaptation
The process of adapting to local texture is called "affine adaptation"
because under affine assumptions (planar surface, affine camera) the
deformation in the texture between two views of the same object is a
2x2 linear transformation. The process is a search for a synthetic
view of the local texture that fulfills certain criteria.
An affine-invariant interest point then has 6 parameters: the
(x,y) position and the four values of the 2x2 transformation matrix.
It is convenient to think of uniform scaling as a seventh parameter and
factor it out of the 2x2 matrix.
Mikolajczyk and Schmid extended the state of the art by proposing that
affine-adaptation change not only the shape (2x2 transformation matrix)
and scale, but also the location of the interest point. Here is a summary
of the algorithm:
- Input: The image L and a point x in L.
- Output: The six parameters of the affine-invariant patch.
- Initialize the shape matrix U to the identity. Perform all subsequent
steps on the neighborhood around x deformed by the current value of U.
- Repeat
- Determine the characteristic scale s of x by finding the scale of
normalized Laplacian with strongest response at x.
- Update the position x by finding the nearest maximum of the interest
point function. (If s and U did not change, then the nearest point would
be exactly x.)
- Estimate the squared gradient matrix A in the neighborhood of x.
- Update U to make the current neighborhood isotropic:
U=A1/2U.
- Keep the uniform scaling component of U near 1.
- Until very little change in U.