I took a series of photos of the blinds in my home with an iPhone. The set of blinds covers three windows in total, and I marked the vertices of the middle window as key points.
To compute the homography matrix \(H\), we use the fact that a homography is a 3x3 transformation matrix that relates corresponding points between two images. The homography maps points from one image plane to another under a perspective transformation.
Given a point \(p_1=(x_1,y_1,1)^T\) in the first image, its corresponding point in the second image \(p_2=(x_2,y_2,1)^T\) is related by the homography matrix \(H\) as:
\[\mathbf{p}_2 \sim H \mathbf{p}_1\] \[H = \begin{pmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & 1 \end{pmatrix}\]We use homogeneous coordinates to express this transformation, appending a third coordinate of 1 to each point. Since a homography is only defined up to scale, we fix \(h_{22} = 1\), leaving eight unknowns.
The relation between the points in homogeneous coordinates is:
\[\begin{pmatrix} x_2 \\ y_2 \\ 1 \end{pmatrix} \sim H \begin{pmatrix} x_1 \\ y_1 \\ 1 \end{pmatrix}\]which expands to:
\[x_2 = \frac{h_{00} x_1 + h_{01} y_1 + h_{02}}{h_{20} x_1 + h_{21} y_1 + 1}\] \[y_2 = \frac{h_{10} x_1 + h_{11} y_1 + h_{12}}{h_{20} x_1 + h_{21} y_1 + 1}\]These equations are non-linear in the unknowns, so we rearrange them into a linear system to solve for the homography parameters.
Multiplying through by the common denominator and rearranging:
\[h_{00} x_1 + h_{01} y_1 + h_{02} - x_2 h_{20} x_1 - x_2 h_{21} y_1 = x_2\] \[h_{10} x_1 + h_{11} y_1 + h_{12} - y_2 h_{20} x_1 - y_2 h_{21} y_1 = y_2\]We can stack the equations for all corresponding point pairs into a system of equations of the form:
\[A \cdot \mathbf{h} = \mathbf{b}\]where \(A\) stacks the two coefficient rows contributed by each correspondence, \(\mathbf{h} = (h_{00}, h_{01}, h_{02}, h_{10}, h_{11}, h_{12}, h_{20}, h_{21})^T\) collects the eight unknowns, and \(\mathbf{b}\) stacks the target coordinates \(x_2, y_2\). With at least four correspondences (eight equations) we can solve for \(\mathbf{h}\) by least squares.
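The setup above can be sketched as follows: a minimal NumPy implementation that builds the two rows per correspondence and solves the system by least squares (`compute_homography` is an illustrative name; inputs are assumed to be \((n, 2)\) point arrays with \(n \ge 4\)):

```python
import numpy as np

def compute_homography(src, dst):
    """Solve A h = b for the 8 homography parameters (h22 fixed to 1).

    src, dst: (n, 2) arrays of corresponding points, n >= 4.
    """
    A, b = [], []
    for (x1, y1), (x2, y2) in zip(src, dst):
        # Row for the x-equation: h00 x1 + h01 y1 + h02 - x2 h20 x1 - x2 h21 y1 = x2
        A.append([x1, y1, 1, 0, 0, 0, -x2 * x1, -x2 * y1])
        # Row for the y-equation: h10 x1 + h11 y1 + h12 - y2 h20 x1 - y2 h21 y1 = y2
        A.append([0, 0, 0, x1, y1, 1, -y2 * x1, -y2 * y1])
        b.extend([x2, y2])
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```

With exactly four correspondences the system is square and the fit is exact; with more, least squares averages out small marking errors.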
Before proceeding further with image blending, I first verify the correctness of the homography transformation using some simple examples.
We blend two images by linearly combining their pixel values. First, binary masks mark where each image has non-zero pixels. The two masks are summed into a combined mask, which is used to normalize the pixel values in overlapping regions: the pixel values from both images are added and divided by the combined mask. Finally, the result is clipped to the valid range (0 to 255) and returned as an 8-bit image.
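A minimal sketch of this averaging blend, assuming 3-channel `uint8` inputs already warped onto a common canvas:

```python
import numpy as np

def blend_average(img1, img2):
    """Linear blend: average pixel values where both images have content."""
    # Binary masks marking where each (possibly warped) image has non-zero pixels.
    m1 = (img1.sum(axis=-1) > 0).astype(np.float64)
    m2 = (img2.sum(axis=-1) > 0).astype(np.float64)
    combined = m1 + m2
    combined[combined == 0] = 1.0  # avoid division by zero where neither image covers
    blended = (img1.astype(np.float64) + img2.astype(np.float64)) / combined[..., None]
    return np.clip(blended, 0, 255).astype(np.uint8)
```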
Although this method is very fast and convenient, it leaves slight seam artifacts in the overlapping areas of the two images.
We instead use a distance-transform approach for blending the two images, which creates a smoother transition in overlapping regions.
Two binary masks are created to identify the valid (non-zero) pixel areas of each image. These masks represent regions where the images contain information.
The distance transform computes the distance of each pixel in the mask to the nearest zero pixel.
\[D(i, j) = \min_{(x, y)\,:\,M(x, y) = 0} \sqrt{(i - x)^2 + (j - y)^2}\]The distance maps for both masks are normalized to ensure they range between 0 and 1:
\[D_{\text{norm}}(i, j) = \frac{D(i, j)}{D_{\text{max}} + \epsilon}\]The blending of the two images is performed using a weighted average based on the normalized distance transforms:
\[\text{blended}(i, j) = \frac{\text{img1}(i, j) \cdot D_{\text{norm1}}(i, j) + \text{img2}(i, j) \cdot D_{\text{norm2}}(i, j)}{D_{\text{norm1}}(i, j) + D_{\text{norm2}}(i, j) + \epsilon}\]Finally, the blended image is clipped to ensure that pixel values remain within the valid range for an 8-bit image (0 to 255):
\[\text{blended}(i, j) = \max(0, \min(\text{blended}(i, j), 255))\]This ensures the output is suitable for display or further processing.
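The steps above can be sketched as follows. The brute-force distance transform mirrors the formula directly and is only practical for small masks (in practice `cv2.distanceTransform` computes this far faster); each mask is assumed to contain at least one zero pixel, which holds for images warped onto a larger canvas:

```python
import numpy as np

def distance_transform(mask):
    """Brute-force Euclidean distance from each non-zero pixel to the nearest
    zero pixel. Assumes the mask contains at least one zero pixel."""
    zeros = np.argwhere(mask == 0)
    dist = np.zeros(mask.shape, float)
    for i, j in np.argwhere(mask != 0):
        dist[i, j] = np.sqrt(((zeros - (i, j)) ** 2).sum(axis=1)).min()
    return dist

def blend_distance(img1, img2, eps=1e-8):
    """Weighted-average blend using normalized distance transforms as weights."""
    m1 = (img1.sum(axis=-1) > 0).astype(np.uint8)
    m2 = (img2.sum(axis=-1) > 0).astype(np.uint8)
    d1 = distance_transform(m1)
    d2 = distance_transform(m2)
    d1 = d1 / (d1.max() + eps)  # normalize to [0, 1]
    d2 = d2 / (d2.max() + eps)
    blended = (img1 * d1[..., None] + img2 * d2[..., None]) / (d1 + d2 + eps)[..., None]
    return np.clip(blended, 0, 255).round().astype(np.uint8)
```

Pixels deep inside an image get a large weight, pixels near its border get a small one, so the seam fades out instead of cutting abruptly.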
Interesting perspective from the game Forza Horizon 5:
Streets in the game Cyberpunk 2077:
The Harris response is calculated to detect corners in the image. This involves:
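A minimal NumPy sketch of the Harris response, \(R = \det(M) - k\,\mathrm{trace}(M)^2\) over the smoothed structure tensor \(M\). A box filter stands in for the more common Gaussian smoothing to keep the code self-contained, and `k = 0.04` is a conventional choice, not necessarily the value used here:

```python
import numpy as np

def harris_response(gray, k=0.04, window=3):
    """Harris corner response per pixel for a float grayscale image."""
    Iy, Ix = np.gradient(gray)                  # image gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy   # structure-tensor entries

    def box(img):
        # Separable box filter: 1-D convolution along each axis.
        kern = np.ones(window) / window
        img = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 0, img)
        return np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, img)

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace ** 2
```

Corners (both eigenvalues of \(M\) large) give a positive response, edges (one large eigenvalue) a negative one, flat regions a response near zero.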
ANMS is used to select the most prominent corners from the Harris response:
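A sketch of adaptive non-maximal suppression, assuming corner coordinates and their Harris strengths are given. Each corner's suppression radius is its distance to the nearest clearly stronger corner; the robustness factor `c_robust = 0.9` is the value commonly used with MOPS-style features, an assumption here:

```python
import numpy as np

def anms(coords, strengths, n_keep=500, c_robust=0.9):
    """Return indices of the n_keep corners with the largest suppression radii."""
    coords = np.asarray(coords, float)
    strengths = np.asarray(strengths, float)
    radii = np.full(len(coords), np.inf)
    for i in range(len(coords)):
        # Corners whose discounted strength still beats corner i.
        stronger = c_robust * strengths > strengths[i]
        if stronger.any():
            d = np.sqrt(((coords[stronger] - coords[i]) ** 2).sum(axis=1))
            radii[i] = d.min()  # distance to nearest clearly stronger corner
    return np.argsort(-radii, kind="stable")[:n_keep]
```

Because weak corners sitting next to strong ones get tiny radii, the surviving corners are both strong and spatially well spread.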
Feature descriptors describe the local appearance around each detected corner:
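A sketch of an axis-aligned MOPS-style descriptor: a 40x40 window around each corner, subsampled to 8x8 and bias/gain normalized. The window and output sizes follow the usual MOPS convention and are assumptions, and the naive stride subsampling skips the blur a real implementation would apply first:

```python
import numpy as np

def describe(gray, corners, patch=40, out=8):
    """Build normalized patch descriptors; corners too close to the border are skipped."""
    half = patch // 2
    step = patch // out
    descs, kept = [], []
    for r, c in corners:
        if r < half or c < half or r + half > gray.shape[0] or c + half > gray.shape[1]:
            continue  # window would fall outside the image
        window = gray[r - half:r + half, c - half:c + half]
        small = window[::step, ::step][:out, :out]  # naive subsampling (blur first in practice)
        small = (small - small.mean()) / (small.std() + 1e-8)  # bias/gain normalization
        descs.append(small.ravel())
        kept.append((r, c))
    return np.array(descs), kept
```

The normalization makes descriptors invariant to overall brightness and contrast changes between the two photos.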
Feature matching finds correspondences between feature descriptors from two images:
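A sketch of nearest-neighbor matching with Lowe's ratio test: a pair is accepted only when the best match is clearly better than the second best, which discards ambiguous matches (the ratio threshold 0.8 is a common choice, assumed here):

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.8):
    """Return (i, j) index pairs of mutually plausible descriptor matches."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)  # distance to every candidate
        order = np.argsort(dists)
        # Lowe ratio test: best match must beat the runner-up by a clear margin.
        if dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, order[0]))
    return matches
```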
RANSAC estimates the homography matrix between two sets of matched points:
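A sketch of RANSAC for the homography: repeatedly fit to four random correspondences and keep the candidate with the most inliers (reprojection error below a pixel threshold). `fit_homography` repeats the least-squares fit from Part A; the iteration count and 2-pixel threshold are illustrative defaults:

```python
import numpy as np

def fit_homography(src, dst):
    # Least-squares fit of the 8 homography parameters (h22 = 1), as in Part A.
    A, b = [], []
    for (x1, y1), (x2, y2) in zip(src, dst):
        A.append([x1, y1, 1, 0, 0, 0, -x2 * x1, -x2 * y1])
        A.append([0, 0, 0, x1, y1, 1, -y2 * x1, -y2 * y1])
        b.extend([x2, y2])
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def ransac_homography(src, dst, n_iters=1000, thresh=2.0, seed=0):
    """Return the homography with the largest inlier set and its inlier mask."""
    rng = np.random.default_rng(seed)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    best_H, best_inliers = None, np.zeros(len(src), bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), 4, replace=False)  # minimal 4-point sample
        H = fit_homography(src[idx], dst[idx])
        proj = src_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]              # perspective divide
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```

A common refinement, omitted here for brevity, is to refit the homography on the full inlier set at the end.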
Image warping aligns one image with another using the estimated homography matrix:
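A sketch of inverse warping: every pixel of the output canvas is mapped back through \(H^{-1}\) and sampled from the source image. Nearest-neighbor sampling keeps the sketch short; bilinear interpolation (or `cv2.warpPerspective`) gives smoother results in practice:

```python
import numpy as np

def warp_image(img, H, out_shape):
    """Warp img into an (h, w) canvas; H maps image coords to canvas coords."""
    h_out, w_out = out_shape
    ys, xs = np.indices((h_out, w_out))
    # Map every canvas pixel back into the source image through H^{-1}.
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = np.linalg.inv(H) @ pts
    sx = np.round(src[0] / src[2]).astype(int)  # nearest-neighbor sample coords
    sy = np.round(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < img.shape[1]) & (sy >= 0) & (sy < img.shape[0])
    out = np.zeros((h_out, w_out) + img.shape[2:], img.dtype)
    out[ys.ravel()[valid], xs.ravel()[valid]] = img[sy[valid], sx[valid]]
    return out
```

Inverse warping avoids the holes that forward mapping leaves, since every output pixel receives exactly one lookup.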
The final result matches the result of manually aligning the key points exactly. (The second row below shows the Forza scenery, consistent with the result in Part A.)
However, a few results are less ideal. In the case of the blinds, when the scenery outside the window also lacks distinct features, the corner points on the blinds all look nearly identical, and the algorithm struggles to compute a homography from so many similar interest points.
For Part A, I think the coolest thing is that you can stitch images into a panorama with just four manually marked points. I also found that the more spread out the four marked corner points are, the better the blending result. Additionally, I was pleasantly surprised to discover that using a distance transform instead of linear interpolation when blending essentially eliminates the artifacts.
For Part B, I finally achieved automatic stitching without manual marking. The coolest part is the RANSAC algorithm, which iterates many times in a short period and finds the optimal homography matrix no matter how noisy the feature matches after ANMS are.