An
Evaluation of Critical Points Filter with the Epipolar Constraint with
Occluded Points Hallucination
Project
Report for ECE497-YM:
An
Invitation to 3D Vision: From Images to Models
By
Jerome
Durand
James
Davidson
Foreword
This project was created as to satisfy a course requirement necessary for completion of ECE497-YM fall semester, 2002, at the University of Illinois at Urbana-Champaign.
The authors would like to thank the instructor, Professor Ma, for the opportunity to be part of his class.
To build a robust completely automatic system for
reconstruction or object tracking, two main goals needed to be
accomplished. First, correspondence
must be found between image frames, by far the most important aspect of the
system. To accomplish this we use an
adapted version of the Critical Points Filters to find matching points between
image pairs. Secondly, for robustness
and visualization, points that fail to find correspondence in the tracking
system needed to be extrapolated by other means. This was done by taking into account the geometric relationship
between different points and views. By
satisfying the epipolar and multiview constraints missing points could be
hallucinated into image frames where they did not exist. First addressed by Kanade [9], the
implementation focused on the orthographic case. The limitations to this will be discussed in the results.
The method
proposed is for finding the point correspondence for an eight point algorithm.
Primarily based on the results from Prof. Shinagawa [6][7] the method uses
image matching, with the main feature being that it is not affected by the
movements (within a certain reasonable range) of the camera, such as
translation, rotation or scaling. As apposed to most prior methods, the method proposed herein does not require
any prior information about the object.
The previous work
by the authors focused on the unconstrained matching of an object between two
images. The algorithm formerly developed enables the matching of objects from 2
images by mapping some points and then reconstructing the object however, the
mapping of the points can be done independently, and is what is being used in Figure 1.
Example:
|
|
|
In essence, to map
a point the distance is minimized between the points in the source image and the
points in the destination image: the point in the destination image that
minimizes the distance is said to be the correspondence point. Refer to Jerome Durand’s (Thesis) for
complete information concerning point mapping.
We computed a multiresolution hierarchy of
size 2l*2l (1
£ l £ n) images. At each level of the hierarchy,
four images were calculated based on the four images from the previous level of
hierarchy. The image size is divided by two compared with the ones at the
previous level, and the lowest level of hierarchy, containing the source image,
is reached. Each image corresponds to the extraction of “points of interest” we
will now be define. Let us call p(l,m)(i,j), the point (i, j) of the image m, from the
source image, at level l. The images are recursively computed as follows:

Where
![]()
which is a point in the original image. To
make this clearer, consider the following Figure 2:
Figure 2 Hierarchal representation in 2 dimensions

The reason why this is called a critical
point filters is because it extracts the minimum, maximum, and saddle points at
each level of the hierarchy. Computing of the multiresolution hierarchy is
achieved on both the source and destination images. We will call q(l,m)(i,j) the point (i, j), of the image number m, from the destination image, at
level l. An example of the images computed is shown in Figure 3:


Once the hierarchy was created, we determined
the matching between the points, which is done recursively. Let us take an
example, Figure 4, suppose we want to map a point p (at level m), from
the source image, to a point q, of the destination image. First, we
determine the four nearest neighbors of point p, which we call a,
b, c, and d. In our algorithm, we take the four diagonal
nearest points, but four other neighbors could have been taken without any
significant change in the result. For each point a, b, c, d,
we map the parents (A,B,C,D) recursively to A’,B’,C’,D’. If one point is located out of the image (which may be possible if
point p belongs to the border), the point is then mapped to the nearest
border. For each of the parents, we take one child (if b is the top left
child of B, then b’
will be the top left child of B’).
The four children define what we will call the inherited quadrilateral. The
point p will be mapped to one of the points inside thisquadrilateral. We
will then choose q so that it minimizes a certain distance (between p
and q), which is described below.
Figure 4
Point Mapping

The method to map
points used before starting this project was completely image-based. There was no
geometric information recovered out of the images. Inorder to use the geometric
information it was decided to recover the matrices R and T from the images
(uncalibrated case). The final goal of recovering R and T is to
refine the mapping.
Let p be a
point in the source image and t a point in the destination image
The distance is
defined by: D(p,t)=a*Di(p,t)+b*De(p,t)+c*Dg(p,t) where:
- Di(p,t) is the
distance related to the intensity of the pixels
- De(p,t) is the
distance related to the edges on the source and destination images
- Dg(p,t) is the
distance related to the geometric factors of the image
- A,b, and c are
parameters to adjust
First, R
and T are estimated from the two images. This is done using an eight
point algorithm. For full details about this algorithm refer to the book An
invitation to 3-D vision : From images to Models.. A correspondence needs
to be found between eight points in the source and destination image. The first
question addressed is: How to choose the points in the source image?
- The point cannot be chosen if it is less
than N/16 pixels from the edge of the image (N is the height of the square
image). This is to prevent points from not being correctly mapped because of
camera zooming or camera translation.
- The points
chosen correspond to the maximum of intensity for the edge detector. Like this,
only the ``key points" will be mapped, and these points are more likely to
be mapped correctly, as there is less ambiguity in their positions. We first
choose the point that corresponds to
the maximum of intensity, then the one corresponding to the second highest
intensity (if it meets the other constraints), and so on.
- Two chosen
points must be more then N/16 pixels away from each other. It is important not
to have the points chosen in the same area. This gives more accuracy to the
algorithm.
- If a point does not
meet any of the previous requirements, we then divide the minimum distance
required between two points by two (see the third requirement), or until the
point meets all the requirements.
The points correspondence
Remember that in
D(p,t)=a*Di(p,t)+b*De(p,t)+c*Dg(p,t), Dg(p,t) is not defined yet, as Dg
is based on R and T and we are estimating R and T.
Hence, for the mapping, we only use the first two terms of the distance.
|
|
|
From these
results, we extract R and T using the eight point algorithm.
The influence
of R and T on the distance (the term Dg(p,t))
Let p(i,j) be
the point in the source image (to be mapped) and t(i',j') the point in
the destination image (to test with), and let tgeom(i'',j'',1)=R*p(i,j,1)+T
in homogeneous coordinates be the point where p(i,j) should be mapped
to, using the geometric factors R and T.
Of course, this
assumes that we have an estimation on the depth, which is done manually right
now, but we expect to use some automatic methods to do it. The distance is
defined between the point where p(i,j) should be mapped and t(i',j').
As:Dg(p(i,j),t(i',j'))=|i''-i'|^2+|j''-j'|^2
Now with the Critical Points Filters, a robust method of tracking features from frame to frame exists. However, what happens when a point becomes occluded by another object in one frame or if for some reason correspondence between frames cannot be established?
2.1 Mathematical Foundations
To answer this question the mathematical properties of the two-view and multiview geometry will be examined. Only a cursory explanation will be given. For those interested please refer to [8] for a complete description. To start imagine a set of pictures taken from several views. Now assume for each view we have an even set of points:
![]()
Where j is the number of points. The projection of a point in 3D to a viewing plane is given by:
![]()
![]()
![]()
![]()
Where is the view and is the total number of views. If we substitute
into
the equation and then stack all points into a matrix and then stack the equations into matrix form for all views we obtain:
![]()
Or

![]()
![]()
![]()
![]()
For there to be a solution to this equation,
the columns of and must be linearly dependant and thus rank
deficient. By joining and into a matrix

it can clearly be seen that since the matrix
is dependant (and thus rank deficient) and that the rank must be
To
arrive at a more compact representation that is in terms of images and
co-images, Dr. Ma suggests annihilating
with the co-image of each
point placed on the diagonal a by
matrix:
![]()
The
annihilation occurs because
or more precisely:
for all i. The resulting
equation after the left-hand side multiplication becomes:
![]()
![]()
![]()
Obviously,
from the above equation, is the
null space of .. Therefore, we define the matrix as:
![]()
It is important to note that .
Now to be a bit tricky, we take advantage of the rank reduction lemma‡ and the property that multiplication of a matrix by another full rank matrix results in a matrix of the rank of the original matrix. To start, we define a matrix that is full rank:

As can be verified, the output is indeed in the form necessary to apply the rank reduction lemma. The resulting matrix:

Has
rank 1. This is important as it implies most of the factors that impose the
constraints for the algorithm in general.
Additionally,
the Epipolar geometry (two views) is stated in the above equation when is chosen to be two.
![]()
Which with a little manipulation yields the epipolar constraint:
![]()
![]()
So
now that an expression of the relationship of points across
views exists, what happens when points are lost between views? When this happens “holes” in the matrix
exist. These “holes” are manifested as
missing columns corresponding to the missing points in the matrix Figure 7.

In essence when a correspondence between points is not found, the epipolar to trilinear constraint can be used to extrapolate the location of the missing point. With R, T calculated between views, the point in the projected view must fall on the epipolar line defined by:
Since
the depth for the point in another frame is known, along with the corresponding
R and T, then the new point can simply be projected from one frame
to another by satisfying the equation:
Now that a basic understanding of the problem has been
explored the question arises, how, algorithmically, do we apply the equations
in the previous section. First, it
helps to understand what cases can occur when the matrix Mp is not full. Figure 8 shows the
connection between view and points for one real example. There are three basic cases for points:
not enough, enough, and more than enough.
For the first case if a point only exists in one or no image views, then
there is no way to hallucinate it onto other views. What can be recovered is the epipolar line for each view. The second case, exists when there are
exactly two views of the same point. If
this occurs the multiview method is no better than the bilinear method. The last case is most desirable because,
with more than the minimum number of points, optimization can be performed to
minimize error. Just as thee are
critical cases for points, there are also critical cases for views. If less than two views exist then, obviously
nothing can be done (except in the case of symmetry [8]). For two views, no hallucination can occur
for the same reasons discussed above:
if points exists in both views it does not need to be hallucinated and if it
exists in only one view it can not be hallucinated. With more than three views, hallucination can be performed given
that there is minimal 6 point overlap between pairs of views so that the
complete view list can be traversed. In
the two-view case the number has to 8, do to constrains arising from the
8-point algorithm, not the math.
Below two methods, multiview and two-view, are discussed for addressing the hallucination problem.. The two-view case was evaluated because it contains enough information to hallucinate, but lacks the elegance of the multiview method. More importantly, as the results indicate, the authors were unable to find satisfactory sets of points to gain convergence from the multiview factorization algorithm. This was unfortunate because most of the error reduction optimization lies in the evaluation of the multiview matrix Mp.
Ø The multiview method
o First, select set of reference views that satisfy the coverage of the most number of points across all views. Then select the view that each reference view shares the most number of points with and perform two-view initialization. Next create a sub matrix that contains all the alphas obtained in each reference frame and perform multiview factorization. Once this is done, each point missing in each frame can be hallucinated by jumping R, T for each reference frame until the point with a depth value is found.
Ø The
two-view method
o
Similar to the first approach a the bilinear
relationship is calculated between each view that satisfies the minimum shared
points criteria- in this case because of the use of the 8 point algorithm. Next, R, T are recovered from
each view and used to estimate lambda.
Then a recursive algorithm searches through the list of rotations and
translations of each view with the coinciding points that exist in each view to
find a suitable relationship between the reference view and the missing
point. One major drawback of this
technique is that eight points are needed to calculate R, T
whereas only six are needed in the multiview case.
Algorithm (multiview)
lso, because some points go missing between
frames, the depth value for each frame may not consist of every point. Much of the algorithm is designed around
dealing with this case.
1. Choose the reference views based on points coverage and minimum number needed to completely contain all data points with overlap between views.
2. Perform two-view initialization for each one of the reference views
3. Perform multiview factorization algorithm for the subset of the data set: Multiview factorization Algorithm†
4. Hallucinate the missing data points
a.
Go through list of absent
points (fig. ???) and find all points that are missing
b.
Then by
performing an R, T translation from reference frame to reference
frame, perform a recursive algorithm that searches through the matrix of known
(R, T)’s and points in each view until the missing point is found
in one of the views The relationship between the rotation and translations
between frames is:
.
Where g is defined as:

By multiplying through we obtain the equations in terms of R and T:
![]()
![]()
Then perform the inverse of agglomerated rotation and translation of the desired point by performing: gji = gji-1. Like with the previous equation when we multiply through we get two equations in terms of R and T:
![]()
![]()
c. Then normalize the z coordinate (or the homogeneous coordinate) to be 1.
Although the first step of the two methods are the same there are some major differences in approach.
1. Choose the reference views based on points coverage and minimum number needed to completely contain all data points with overlap between views.
2. Perform two-view initialization for each one of the reference views.
3. perform two-view initialization for every view that has 8 or more points shared with the reference view.
4. H
a. allucinate the missing data point
a. Go Through a list of absent points and finds all points
that are missing.
b. Then by performing an R, T translation from frame to frame, perform a recursive algorithm that searches through the matrix of known (R, T)’s and points in each view until the missing point is found in one of the views.
c. Then perform the inverse of agglomerated rotation and translation of the desired point then normalize the z coordinate (or the homogeneous coordinate) to be 1.
d.
Remember that
D(p,t)=a*Di(p,t)+b*De(p,t)+c*Dg(p,t).
Once the first
matrices R and T are obtained using the 8 point algorithm, it is
possible to refine the mapping by using this information to compute a new
correspondence between the eight points. This time, it is possible to use the
three terms of the distance to compute the mapping for the eight points. The
process can be iterated to find R and T and use them again for
the mapping of the eight points. Usually, the algorithm converges in less
than 10 steps. This gives much more accuracy to the overall results.
However, it multiplies the time of computation by about 10.
The algorithm is accurate
and robust because it is a combination of an image-based method and a
geometrical method. Here are the differences in the results, between the
image-based method, and the one using geometrical information.
|
|
|
|
|
|
Analysis of the
performance of this mix method was evaluated using five different sets of
images Table 1.

To test the
validity of the results of the hallucination experiments, a comparison to a
reference was necessary. Therefore,
simulated data sets were used in the analysis.
Multiview
![]()
Because of an early recognition that the mutliview factorization technique
failed in many cases, the two view approach discussed above was adopted. Below list the results from the
factorization method for the second dataset (lab example) with one point
removed from each view. The “C’s”
correspond to convergence and the “D’s” when it diverged.
With such unpredictable and error prone results, diverging 25% of the time with just one point missing, no further testing was done for the multiview case.
Two View
In a sequence to test the error of the results and the legitimacy of the implementation, several stages of experiments were performed:
1. Reconstruction of two views was performed on the dataset 2. This was done because the legitimacy of the dataset had already been verified by using this method.
v As was expected the algorithm worked. It recovered all the points in both shared views without hallucination. Additionally, it was unable to hallucinate onto the missing view, which also was expected.
2. Reconstruction of all complete dataset (no points omitted).
v Like the two-view case, the method had complete knowledge of all points before hand and thus did not hallucinate or diverge.
v These were compared to results from the generic factorization algorithm to verify result.
3. Reconstruction of partial dataset 2 with hallucination
Because the complete set of these data points was already known and because the multiview case verified that these sets of points were good candidates for analysis this set became the focus of evaluation the two-view algorithm. To test the algorithm, several repetitions were run. Each repetition attempted to represent likely scenarios that could occur. Additionally, several repetitions were run to test the limits of the method in general. The following Table 2 shows which points were omitted for each of the test cases.

The total error in pixels values of hallucinated points according the Euclidian distance for each test case is reported in Table 3.