An Evaluation of Critical Points Filter with the Epipolar Constraint with
Occluded Points Hallucination

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Project Report for ECE497-YM:

An Invitation to 3D Vision: From Images to Models

By

Jerome Durand

James Davidson


Foreword

 

This project was created as to satisfy a course requirement necessary for completion of ECE497-YM fall semester, 2002, at the University of Illinois at Urbana-Champaign. 

 

The authors would like to thank the instructor, Professor Ma, for the opportunity to be part of his class. 

 

 


 

 

 

 

Introduction

 

 

3D reconstruction of objects has become one of the most popular and important areas in Computer Vision.  The heart of 3D reconstruction is the ‘re-projection’ of 2D images into 3D space. We will attempt to work with the heart of 3D reconstruction. The main issue addressed herein is the 3D reconstruction of an object, using a sequence of images, with re-projection of the coordinates of occluded elements.   Up until recently no unified technique existed for reconstruction of multiple projective (non rectilinear) views greater than 4.  Nor did any of the available techniques take advantage of imposed geometry’s eloquent nature.   This changed with the development of the multiview rank based algorithm [8]. 

 

While the multiview rank based algorithm  method is robust for instances where all data points exist in all views, the implementation of the algorithm needs to be modified to deal with incomplete data sets.  Such incomplete data sets arise when points become occluded between views or when correspondence between points in different views cannot be established.  The former situation is commonly referred to as occlusion and is  one of the banes of Computer Vision with no exceptional technique developed to deal with the problems it creates.

 

In this paper we present a method that takes advantage of the rank condition of the multiview matrix Mp to solve for incomplete datasets and to re-project the coordinates of the points into the missing frames, therefore hallucinate the missing points.  This has a broad range of applications ranging from robust reconstruction to robust feature tracking. However, to do reconstruction we first need correspondence points.  Thus, we split the work into two sections: 

 

 

 

 

 


Methods

 

To build a robust completely automatic system for reconstruction or object tracking, two main goals needed to be accomplished.  First, correspondence must be found between image frames, by far the most important aspect of the system.  To accomplish this we use an adapted version of the Critical Points Filters to find matching points between image pairs.  Secondly, for robustness and visualization, points that fail to find correspondence in the tracking system needed to be extrapolated by other means.  This was done by taking into account the geometric relationship between different points and views.  By satisfying the epipolar and multiview constraints missing points could be hallucinated into image frames where they did not exist.  First addressed by Kanade [9], the implementation focused on the orthographic case.  The limitations to this will be discussed in the results.  

 

1 The Correspondence Between Points

 

1.1 Critical Points Filters

 

The method proposed is for finding the point correspondence for an eight point algorithm. Primarily based on the results from Prof. Shinagawa [6][7] the method uses image matching, with the main feature being that it is not affected by the movements (within a certain reasonable range) of the camera, such as translation, rotation or scaling. As apposed to  most prior methods, the method proposed herein does not require any prior information about the object.

 

The previous work by the authors focused on the unconstrained matching of an object between two images. The algorithm formerly developed enables the matching of objects from 2 images by mapping some points and then reconstructing the object however, the mapping of the points can be done independently, and is what is being used in  Figure 1.

 

Example:

Figure 1  From left to right a source and destination image

 

 

 

 

The matching algorithm

In essence, to map a point the distance is minimized between the points in the source image and the points in the destination image: the point in the destination image that minimizes the distance is said to be the correspondence point.  Refer to Jerome Durand’s (Thesis) for complete information concerning point mapping.

 

Algorithm Details

 

The Critical Points Filters in Dimension 2

We computed a multiresolution hierarchy of size 2l*2l (1 £ l £ n) images. At each level of the hierarchy, four images were calculated based on the four images from the previous level of hierarchy. The image size is divided by two compared with the ones at the previous level, and the lowest level of hierarchy, containing the source image, is reached. Each image corresponds to the extraction of “points of interest” we will now be define. Let us call p(l,m)(i,j), the point (i, j) of the image m, from the source image, at level l. The images are recursively computed as follows:

 

               

 

Where

 

                                       

which is a point in the original image. To make this clearer, consider the following Figure 2:

Figure 2  Hierarchal representation in 2 dimensions

 
Text Box:

 

 


The reason why this is called a critical point filters is because it extracts the minimum, maximum, and saddle points at each level of the hierarchy. Computing of the multiresolution hierarchy is achieved on both the source and destination images. We will call q(l,m)(i,j) the point (i, j), of the image number m, from the destination image, at level l. An example of the images computed is shown in Figure 3:

 

Text Box: Figure 3  The original image and the four images computed at level 5 of the hierarchy (original image is 256*256). From left to right: min(min,min), max(min,min), min(max,max), and max(max,max),using notation of previous figure

 

 

 

 

 

Determining the mapping from one point to another

Once the hierarchy was created, we determined the matching between the points, which is done recursively. Let us take an example, Figure 4, suppose we want to map a point p (at level m), from the source image, to a point q, of the destination image. First, we determine the four nearest neighbors of point p, which we call a, b, c, and d. In our algorithm, we take the four diagonal nearest points, but four other neighbors could have been taken without any significant change in the result. For each point a, b, c, d, we map the parents (A,B,C,D) recursively to A,B,C,D. If one point is located out of the image (which may be possible if point p belongs to the border), the point is then mapped to the nearest border. For each of the parents, we take one child (if b is the top left child of B, then b’ will be the top left child of B’). The four children define what we will call the inherited quadrilateral. The point p will be mapped to one of the points inside thisquadrilateral. We will then choose q so that it minimizes a certain distance (between p and q), which is described below.

 

Figure 4   Point Mapping

 

 

 

1.2 The use of geometric information

 

Information and new issues

The method to map points used before starting this project was completely image-based. There was no geometric information recovered out of the images. Inorder to use the geometric information it was decided to recover the matrices R and T from the images (uncalibrated case). The final goal of recovering R and T is to refine the mapping.

 

The distance between points

Let p be a point in the source image and t a point in the destination image

The distance is defined by: D(p,t)=a*Di(p,t)+b*De(p,t)+c*Dg(p,t) where:

- Di(p,t) is the distance related to the intensity of the pixels

- De(p,t) is the distance related to the edges on the source and destination images

- Dg(p,t) is the distance related to the geometric factors of the image

- A,b, and c are parameters to adjust

 

The different stages to recover R and T

First, R and T are estimated from the two images. This is done using an eight point algorithm. For full details about this algorithm refer to the book An invitation to 3-D vision : From images to Models.. A correspondence needs to be found between eight points in the source and destination image. The first question addressed is: How to choose the points in the source image?

 

Points selection

The points are chosen sequentially in the source image using the following requirements :

- The point cannot be chosen if it is less than N/16 pixels from the edge of the image (N is the height of the square image). This is to prevent points from not being correctly mapped because of camera zooming or camera translation.

- The points chosen correspond to the maximum of intensity for the edge detector. Like this, only the ``key points" will be mapped, and these points are more likely to be mapped correctly, as there is less ambiguity in their positions. We first choose  the point that corresponds to the maximum of intensity, then the one corresponding to the second highest intensity (if it meets the other constraints), and so on.

- Two chosen points must be more then N/16 pixels away from each other. It is important not to have the points chosen in the same area. This gives more accuracy to the algorithm.

- If a point does not meet any of the previous requirements, we then divide the minimum distance required between two points by two (see the third requirement), or until the point meets all the requirements.

 

 

 

 

 

 

 

 

 

 

 

The points correspondence

Remember that in D(p,t)=a*Di(p,t)+b*De(p,t)+c*Dg(p,t), Dg(p,t) is not defined yet, as Dg is based on R and T and we are estimating R and T. Hence, for the mapping, we only use the first two terms of the distance.

 

Figure 6  The mapping between 8 points  in source and destination images

 

 

 

 

From these results, we extract R and T using the eight point algorithm.

 

The influence of R and T on the distance (the term Dg(p,t))

Let p(i,j) be the point in the source image (to be mapped) and t(i',j') the point in the destination image (to test with), and let tgeom(i'',j'',1)=R*p(i,j,1)+T in homogeneous coordinates be the point where p(i,j) should be mapped to, using the geometric factors R and T.

Of course, this assumes that we have an estimation on the depth, which is done manually right now, but we expect to use some automatic methods to do it. The distance is defined between the point where p(i,j) should be mapped and t(i',j'). As:Dg(p(i,j),t(i',j'))=|i''-i'|^2+|j''-j'|^2

 

 

2. Occlusion Hallucination

 

Now with the Critical Points Filters, a robust method of tracking features from frame to frame exists.  However, what happens when a point becomes occluded by another object in one frame or if for some reason correspondence between frames cannot be established?

 

2.1 Mathematical Foundations

 

To answer this question the mathematical properties of the two-view and multiview geometry will be examined. Only a cursory explanation will be given.  For those interested please refer to [8] for a complete description.  To start imagine a set of pictures taken from several views.  Now assume for each view we have an even  set of points:

 

 


Where j is the number of points.  The projection of a point in 3D to a viewing plane is given by:

 

 


Where     is the view and        is the total number of views.  If we substitute                                

into the equation and then stack all points into a matrix     and then stack the equations into matrix form for all         views we obtain:

 

 


                                                                    Or

 

 

 

 

 

 

 

 

 

 

 

 


 For there to be a solution to this equation, the columns of     and      must be linearly dependant and thus rank deficient.  By joining      and       into a matrix       

 

 

 

 

 

 

 

 

 

 

 


 it can clearly be seen that since the matrix is dependant (and thus rank deficient) and that the rank must be

To arrive at a more compact representation that is in terms of images and co-images, Dr. Ma suggests annihilating         with the co-image of each point  placed on the diagonal a  by   matrix:

 

 

 

 

 

 

 

 

 

 

 

 


The annihilation occurs because                    or more precisely:                      for all i.  The resulting equation after the left-hand side multiplication becomes:

 

 


Obviously, from the above equation,       is the null space of       ..  Therefore, we define the matrix          as:

 

 

 

 

 

 

 

 


It is important to note that                               .

 

Now to be a bit tricky, we take advantage of the rank reduction lemma and the property that multiplication of a matrix by another full rank matrix results in a matrix of the rank of the original matrix.  To start, we define a matrix that is full rank:

 

 

 

 


Next, we multiply         by          to obtain:

 

 

 

 

 

 

 

 

 

 

 


As can be verified, the output is indeed in the form necessary to apply the rank reduction lemma.  The resulting matrix:

 

 

 

 

 

 

 

 

 


Has rank      1.  This is important as it implies most of the factors that impose the constraints for the algorithm in general.

 

Additionally, the Epipolar geometry (two views) is stated in the above equation when      is chosen to be two.

 

 


Which with a little manipulation yields the epipolar constraint:

 

 


So now that an expression of the relationship of       points across       views exists, what happens when points are lost between views?  When this happens “holes” in the matrix exist.  These “holes” are manifested as missing columns corresponding to the missing points in the matrix Figure 7.

Text Box:  

Figure 7  Effects on Matiz Mp by occluded points

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

In essence when a correspondence between points is not found, the epipolar to trilinear constraint can be used to extrapolate the location of the missing point.  With R, T calculated between views, the point in the projected view must fall on the epipolar line defined by:

 

 


Since the depth for the point in another frame is known, along with the corresponding R and T, then the new point can simply be projected from one frame to another by satisfying the equation:

 

 

 

 

 

 

2.2 Algorithms

 

 

Now that a basic understanding of the problem has been explored the question arises, how, algorithmically, do we apply the equations in the previous section.  First, it helps to understand what cases can occur when the matrix Mp is not full.  Figure 8 shows the connection between view and points for one real example.    There are three basic cases for points: not enough, enough, and more than enough.  For the first case if a point only exists in one or no image views, then there is no way to hallucinate it onto other views.  What can be recovered is the epipolar line for each view.  The second case, exists when there are exactly two views of the same point.  If this occurs the multiview method is no better than the bilinear method.  The last case is most desirable because, with more than the minimum number of points, optimization can be performed to minimize error.  Just as thee are critical cases for points, there are also critical cases for views.  If less than two views exist then, obviously nothing can be done (except in the case of symmetry [8]).  For two views, no hallucination can occur for the same reasons  discussed above: if points exists in both views it does not need to be hallucinated and if it exists in only one view it can not be hallucinated.  With more than three views, hallucination can be performed given that there is minimal 6 point overlap between pairs of views so that the complete view list can be traversed.  In the two-view case the number has to 8, do to constrains arising from the 8-point algorithm, not the math. 

 

Below two methods, multiview and two-view, are discussed for addressing the hallucination problem..  The two-view case was evaluated because it contains enough information to hallucinate, but lacks the elegance of the multiview method.  More importantly,  as the results indicate, the authors were unable to find satisfactory sets of points to gain convergence from the multiview factorization algorithm.  This was unfortunate because most of the error reduction optimization lies in the evaluation of the multiview matrix Mp.

 

 

Ø      The multiview method

o       First, select set of reference views that satisfy the coverage of the most number of points across all views.  Then select the view that each reference view shares the most number of points with and perform two-view initialization.  Next create a sub matrix that contains all the alphas obtained in each reference frame and perform multiview factorization.  Once this is done, each point missing in each frame can be hallucinated by jumping R, T for each reference frame until the point with a depth value is found.

Ø     The two-view method

o       Similar to the first approach a the bilinear relationship is calculated between each view that satisfies the minimum shared points criteria- in this case because of the use of the 8 point algorithm.  Next, R, T are recovered from each view and used to estimate lambda.  Then a recursive algorithm searches through the list of rotations and translations of each view with the coinciding points that exist in each view to find a suitable relationship between the reference view and the missing point.  One major drawback of this technique is that eight points are needed to calculate R, T whereas only six are needed in the multiview case.

 

Algorithm (multiview)

 

 

lso, because some points go missing between frames, the depth value for each frame may not consist of every point.  Much of the algorithm is designed around dealing with this case.

 

           

1.      Choose the reference views based on points coverage and minimum number needed to completely contain all data points with overlap between views.

2.      Perform two-view initialization for each one of the reference views

3.      Perform multiview factorization algorithm for the subset of the data set: Multiview factorization Algorithm

 

4.      Hallucinate the missing data points

 

a.       Go through list of absent points (fig. ???) and find all points that are missing

 

b.      Then by performing an R, T translation from reference frame to reference frame, perform a recursive algorithm that searches through the matrix of known (R, T)’s and points in each view until the missing point is found in one of the views The relationship between the rotation and translations between frames is:                             . 

 

Where g is defined as:

 

 

 

 

 

 

 

 


By multiplying through we obtain the equations in terms of R and T:

 

 

 

 

 


Then perform the inverse of agglomerated rotation and translation of the desired point by performing: gji = gji-1.  Like with the previous equation when we multiply through we get two equations in terms of R and T:

 

 

 


 

c. Then normalize the z coordinate (or the homogeneous coordinate) to be 1. 

 

 

 

 

 

 

Two-View method

 

Although the first step of the two methods are the same there are some major differences in approach.

 

1.      Choose the reference views based on points coverage and minimum number needed to completely contain all data points with overlap between views.

2.      Perform two-view initialization for each one of the reference views.

3.      perform two-view initialization for every view that has 8 or more points shared with the reference view.

 

4.      H

 

a.       allucinate the missing data point

 

a.       Go Through a list of absent points and finds all points that are missing.

b.      Then by performing an R, T translation from frame to frame, perform a recursive algorithm that searches through the matrix of known (R, T)’s and points in each view until the missing point is found in one of the views. 

 

 

c.       Then perform the inverse of agglomerated rotation and translation of the desired point then normalize the z coordinate (or the homogeneous coordinate) to be 1.

d.       


Results

 

As both the hallucination and Critical Points Filters were developed as two separate entities, the experimental data is presented in two different sections:  Critical Points Filters Experiments and Occlusion Hallucination.

 

1. Critical Points Filters Experiments

 

The influence of the iteration

Remember that D(p,t)=a*Di(p,t)+b*De(p,t)+c*Dg(p,t).

Once the first matrices R and T are obtained using the 8 point algorithm, it is possible to refine the mapping by using this information to compute a new correspondence between the eight points. This time, it is possible to use the three terms of the distance to compute the mapping for the eight points. The process can be iterated to find R and T and use them again for the mapping of the eight points. Usually, the algorithm converges in less than 10 steps. This gives much more accuracy to the overall results. However, it multiplies the time of computation by about 10.

 

Accuracy of the mapping

The algorithm is accurate and robust because it is a combination of an image-based method and a geometrical method. Here are the differences in the results, between the image-based method, and the one using geometrical information.

 

Figure 9  Results using image-based method only (89.2 % of the object is correctly recovered)

 

 

 

 

Figure 10  Results using both methods (99.4 % of the object is correctly recovered)

 

 

 

              

Analysis of the performance of this mix method was evaluated using five different sets of images Table 1.

 

Text Box: Table 1

Number	No Geometric Information	With Geometric Information
1	86.5	98.6
2	88.2	99.0
3	89.0	99.7
4	82.4	97.6
6	87.4	99.2
Mean	86.7	98.82

 

 


2. Occlusion Hallucination Experiments

 

To test the validity of the results of the hallucination experiments, a comparison to a reference was necessary.  Therefore, simulated data sets were used in the analysis. 

 

2.1 Dataset

 

The dataset for this experiment consisted of a set of 14 points representing the corners and centers of each face of a square box.  The box coordinates, in 3D Euclidian space, were then projected to 8 different camera reference frames.  Figure 11 shows an example of the results.  The results of the projection are the coordinates of each of the 14 points relative to the viewing plane.  It is important to note to those unfamiliar with the subject that the projected points are homogeneous coordinates in 2D that lie on the viewing plane.

Text Box:  
Figure 11  Simulated datasets of a box with 8 different frames

In addition to the simulated dataset, two sets of real data were used.  The first came from an in class lab experiment.  This set consisted of 16 data points shared across three views. To simulate occlusion for this set, points were arbitrarily removed from the views. The other set was generated by the authors, which like the simulated data emulated encircling the object.  At most ten points existed between frames for this data set with a total of 14 points.

 

 

 

 

 

 

 

 

 

 

 

2.2 Experiments

 

Multiview

Text Box: C	D	C	C	C	D	C	D	C	C	C	D	D	C	C	C
Because of an early recognition that the mutliview factorization technique failed in many cases, the two view approach discussed above was adopted.  Below list the results from the factorization method for the second dataset (lab example) with one point removed from each view.  The “C’s” correspond to convergence and the “D’s” when it diverged.

 


With such unpredictable and error prone results, diverging 25% of the time with just one point missing, no further testing was done for the multiview case.

 

Two View

In a sequence to test the error of the results and the legitimacy of the implementation, several stages of experiments were performed:

 

1.  Reconstruction of two views was performed on the dataset 2.  This was done because the legitimacy of the dataset had already been verified by using this method.

v                 As was expected the algorithm worked.  It recovered all the points in both shared views without hallucination.  Additionally, it was unable to hallucinate onto the missing view, which also was expected.

2.  Reconstruction of all complete dataset (no points omitted).

v                 Like the two-view case, the method had complete knowledge of all points before hand and thus did not hallucinate or diverge. 

v                 These were compared to results from the generic factorization algorithm to verify result.

 

3.  Reconstruction of partial dataset 2 with hallucination

 

 

Because the complete set of these data points was already known and because the multiview case verified that these sets of points were good candidates for analysis  this set became the focus of evaluation the two-view algorithm.   To test the algorithm, several repetitions were run.  Each repetition attempted to represent likely scenarios that could occur.  Additionally, several repetitions were run to test the limits of the method in general.  The following Table 2 shows which points were omitted for each of the test cases.

 

Text Box: Table 2

Points
1		1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
	1	¨	¨	¨		¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	
	2	¨	¨	¨	¨	¨	¨		¨	¨	¨	¨	¨	¨	¨	¨	¨
	3	¨	¨		¨	¨	¨	¨	¨	¨	¨	¨		¨	¨	¨	¨
2	1	¨	¨	¨	¨	¨	¨	¨	¨	¨		¨	¨	¨	¨	¨	¨
	2	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨
	3	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	
3	1	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨				¨	
	2	¨	¨	¨	¨	¨	¨	¨	¨	¨		¨	¨	¨	¨		
	3	¨	¨	¨	¨	¨		¨	¨	¨	¨	¨	¨	¨			
4	1	¨	¨	¨	¨	¨	¨		¨	¨	¨	¨	¨		¨	¨	¨
	2	¨		¨	¨	¨	¨	¨	¨	¨		¨	¨		¨	¨	¨
	3		¨	¨		¨	¨	¨	¨	¨	¨	¨	¨		¨	¨	
5	1	¨		¨	¨	¨	¨	¨	¨	¨		¨		¨			
	2			¨	¨	¨	¨	¨	¨		¨	¨	¨	¨	¨		
	3		¨			¨	¨	¨	¨	¨	¨	¨	¨		¨	¨	
6	1	¨		¨		¨	¨	¨	¨	¨	¨	¨		¨			¨
	2			¨	¨	¨	¨	¨		¨	¨	¨	¨	¨	¨		¨
	3		¨			¨	¨	¨	¨	¨	¨	¨	¨		¨	¨	
7	1	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨				
	2		¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨		
	3					¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨
8	1	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨				
	2	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨
	3					¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨	¨

 


The total error in pixels values of hallucinated points according the Euclidian distance for each test case is reported in Table 3.