3D Imaging
For 3D imaging, many images are collected to construct a smooth 3D scene. It is used in archaeology (sharing discoveries), industrial inspection (verifying product standards), biometrics (recognizing 3D faces vs. photos), and augmented reality (e.g., the IKEA app).
3D imaging is useful because it can remove the need to process low-level features of 2D images: it removes the effects of directional illumination, segments foreground from background, and distinguishes object textures from outlines.
Depth - given a line on which the object exists, the shortest distance from the point of interest to that line.
Passive Imaging
Passive Imaging - we are concerned with the light that reaches us. 3D information is acquired from a shared field of view (units don’t interfere!)
The environment may contain shadows (hard to see into) or reflections (which appear at different locations in different views, creating fake correspondences), or may not have enough surface features.
Stereophotogrammetry
The surface is observed from multiple viewpoints simultaneously. Unique regions of the scene are found in all pictures (their appearance can differ based on perspective) and the distance to the object is calculated from the disparity between views.
It is difficult to find pixel correspondences.
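A minimal sketch of the disparity-to-depth relationship for a rectified stereo pair, assuming a known focal length and baseline; the function name and values are illustrative, not from the notes.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (meters) for a rectified
    stereo pair: Z = f * B / d. Zero disparities become inf."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    with np.errstate(divide="ignore"):
        depth = focal_px * baseline_m / disparity_px
    return depth

# Example: a feature shifted by 8 px between views, f = 700 px, B = 0.1 m
print(depth_from_disparity(8.0, focal_px=700.0, baseline_m=0.1))  # ~8.75 m
```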
Structure from motion
Surface is observed from multiple viewpoints sequentially. A single camera is moved around the object and the images are combined to construct an object representation.
If the illumination, the time of day, or the cameras change between shots, the reconstruction becomes sparser.
Depth from focus
Surface is observed from different focuses based on camera’s depth-of-field (which depends on the lens of the camera and the aperture). The changes are continuous. A focal stack can be taken and a depth map can be generated (though quite noisy).
It is difficult to get quantitative depth values. Also, it is mainly sharp edges that become blurry when the camera is out of focus, but not everything has sharp edges.
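A hedged sketch of depth from focus, assuming a focal stack of grayscale images with known focus distances; the Laplacian-based focus measure and the variable names are my assumptions.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def depth_from_focus(stack, focus_distances, window=9):
    """stack: (N, H, W) focal stack; focus_distances: (N,) distance at which
    each image was focused. Returns an (H, W) depth map (noisy, as noted)."""
    # Focus measure: locally averaged squared Laplacian (per-pixel sharpness).
    sharpness = np.stack([uniform_filter(laplace(img.astype(float)) ** 2, window)
                          for img in stack])
    best = np.argmax(sharpness, axis=0)       # index of sharpest image per pixel
    return np.asarray(focus_distances)[best]  # map index -> focus distance
```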
Active Imaging
Active Imaging - we use the light that we control. It robustly performs dense measurements because little computation is involved, just direct measurements.
Light can be absorbed by dark surfaces or reflected away by specular surfaces, causing no return signal; also, multiple units may interfere with each other.
Active stereophotogrammetry
Infrared light is used to project surface features and solve the problem of “not enough surface features”. Since there is more detail from different viewpoints and since we don’t care about patterns (only features themselves), it is easier to find correspondences
Lack of detail produces holes and error depends on distance between cameras. Also, projector brightness decreases with distance therefore things far away cannot be captured.
Time of flight
The distance to the object is found from the time taken for light to travel from the device to the scene and back. Directional illumination is used to focus the emitted photons in one direction, which restricts the view area. Two devices are used:
- Lidar - has a photosensor and a laser pulse emitter on the same axis, which is rotated, and each measurement is acquired over time. It is robust, however it is hard to measure short distances due to the high speed of light; it is also large and expensive.
- Time of flight camera - can image the scene all at once. It uses a spatially resolved sensor and fires modulated light waves instead of pulses; each light wave is recognized by its relative phase. It is fast, but the depth measure depends on the wave properties.
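A hedged sketch of how a continuous-wave time-of-flight camera turns the measured phase shift into depth; the modulation frequency and symbol names are assumptions for illustration.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def tof_depth_from_phase(phase_rad, mod_freq_hz):
    """Depth from the phase shift of a modulated wave: d = c * phi / (4*pi*f).
    Depths beyond the ambiguity range c / (2*f) wrap around (aliasing)."""
    return C * phase_rad / (4.0 * np.pi * mod_freq_hz)

f_mod = 20e6                                   # assumed 20 MHz modulation
print(C / (2 * f_mod))                         # ambiguity range: ~7.5 m
print(tof_depth_from_phase(np.pi / 2, f_mod))  # ~1.87 m
```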
Structured light imaging
To save hardware resources, a projector and a camera are used at different perspectives (unlike in time of flight, where the emitter and sensor are on the same axis). Projected patterns give more information because they encode locations onto the surface.
- Point scanner (1D) - the position and direction of the laser are known, and the intersection point is calculated from where the laser spot appears in the camera image. It is slow because an image has to be taken for every point.
- Laser line scanner (2D) - a plane is projected, which gives information about the curve of the surface. The depth is then calculated for a whole line of points from a single image. It is faster, but still not ideal because the image is not measured all at once.
This technique is more accurate (especially for small objects); however, the field of view is reduced and there is more complexity in imaging and processing time. It is useful in industries where the object, instead of the laser, moves.
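A minimal sketch of the triangulation behind a laser scanner: intersecting a camera ray with the known laser plane. The plane parameters, intrinsics, and function name are illustrative assumptions.

```python
import numpy as np

def ray_plane_depth(pixel_uv, K, plane_normal, plane_point):
    """Intersect the camera ray through pixel (u, v) with the laser plane.
    K: 3x3 intrinsics; plane given by a normal and any point lying on it.
    Camera is at the origin looking down +Z. Returns the 3D intersection."""
    u, v = pixel_uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])     # viewing ray direction
    t = plane_normal @ plane_point / (plane_normal @ ray)
    return t * ray

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])  # assumed intrinsics
n = np.array([1.0, 0.0, -0.5])      # assumed laser plane normal
p0 = np.array([-0.2, 0.0, 0.0])     # a point on the plane (laser origin)
print(ray_plane_depth((400, 250), K, n, p0))
```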
It is time-consuming to move a single stripe over time; instead, multiple stripes can be used to capture the whole object at once. However, if the stripes look the same, the camera cannot distinguish their locations (e.g., if one stripe is occluded behind an object). Time multiplexing is used to solve this.
Time multiplexing - different patterns of stripes (arrangements of binary stripes) are projected, and their progression through time (i.e., a set of different projections) is inspected to give an encoding of each pixel location in 3D space.
- For example, in binary encoding, projecting 8 different black/white binary patterns, i.e., [bw, bwbw, $\cdots$, bwbwbwbw], onto an object gives every resolution point an 8-bit encoding, each bit representing 1 or 0 depending on the stripe it belongs to in each pattern. There are $2^8$ ways to arrange these bits, giving a total of 256 distinct pixel codes.
- Hamming distance - a metric for comparing 2 binary codes (the number of bit positions in which they differ); for instance, the distance between $4$ (0100) and $14$ (1110) is $2$, whereas between $4$ (0100) and $6$ (0110) is $1$.
- Given that Gray code has a Hamming distance of 1 between consecutive codes and inverse Gray code has a Hamming distance of $N-1$, plain binary progression encodings belong to neither category. This means a lot of flexibility and a variety of possible algorithms (a short sketch follows below).
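A hedged sketch of the binary/Gray-code stripe encoding and the Hamming distance discussed above; the pattern width and helper names are illustrative assumptions.

```python
import numpy as np

def binary_patterns(n_bits, width):
    """n_bits stripe patterns: pattern k stores bit k of each projector column."""
    cols = np.arange(width)
    return np.stack([(cols >> k) & 1 for k in range(n_bits - 1, -1, -1)])

def gray_patterns(n_bits, width):
    """Same, but columns are first converted to Gray code (consecutive columns
    differ in exactly one bit, so decoding errors at stripe edges stay small)."""
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)
    return np.stack([(gray >> k) & 1 for k in range(n_bits - 1, -1, -1)])

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")

# 8 patterns encode 2**8 = 256 projector columns, each with a unique bit string.
pats = binary_patterns(8, 256)
print(pats.shape)       # (8, 256)
print(hamming(4, 14))   # 2
print(hamming(4, 6))    # 1
```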
In time multiplexing, lens defocus may occur, which blurs the stripe edges, and because the patterns carry high-frequency information, information about some regions could be lost. Sinusoidal patterns are used instead: they are continuous and less affected by defocus.
Sinusoidal patterns - patterns encoded by a trigonometric function $(\sin(2\pi f \phi) + 1)/2$, which has a frequency and a phase that encodes a pixel location. They behave like time-of-flight cameras: the longer the waves (lower $f$), the larger the range, but the more noise.
Reconstructing the surface from the wave patterns can cause ambiguities if the surface is discontinuous, so the waves either have to be additionally encoded (labeled) or the frequency should be varied dynamically (low $\to$ high).
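A minimal sketch of decoding the phase of a projected sinusoidal pattern from N phase-shifted captures; the N-step phase-shifting formula and the array names are my assumptions, not a specific method from the notes.

```python
import numpy as np

def decode_phase(images):
    """images: (N, H, W) captures of the pattern A + B*cos(phi + 2*pi*k/N).
    Returns the wrapped phase phi in (-pi, pi] per pixel, which (together
    with the known frequency) encodes the projector coordinate."""
    n = images.shape[0]
    deltas = 2 * np.pi * np.arange(n) / n
    num = -np.sum(images * np.sin(deltas)[:, None, None], axis=0)
    den = np.sum(images * np.cos(deltas)[:, None, None], axis=0)
    return np.arctan2(num, den)

# Synthetic check: 4 shifted captures of a known phase map are decoded back.
H, W = 4, 6
true_phi = np.random.uniform(-np.pi, np.pi, (H, W))
imgs = np.stack([0.5 + 0.4 * np.cos(true_phi + 2 * np.pi * k / 4) for k in range(4)])
print(np.allclose(decode_phase(imgs), true_phi))   # True
```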
Photometric stereo
The goal is to calculate the orientation of the surface, not the depth - only the lighting changes; the location and perspective stay the same. If the absolute location of one point in the map is known, the other locations can be found by integration. LED illumination is used.
- Lambertian reflectance - depends on the angle $\theta$ between the surface normal and the ray to light
- Specular reflectance - depends on the angle $\alpha$ between reflected ray to light and ray to observer
For simplicity, we assume there is no specular reflectance (so the observed intensity depends only on the surface normal). We also assume the light source is infinitely far away (constant direction). The diffuse illumination for each $i^{th}$ light source can be calculated as:
\[I_d = I_p k_d (\mathbf{l}_i \cdot \mathbf{n})\]
Where:
- $\mathbf{l}_i$ - direction to light from surface
- $\mathbf{n}$ - surface normal (different for each pixel)
- $k_d$ - constant reflectance
- $I_p$ - constant light spread
Assuming $\rho = I_p k_d$ and stacking the light sources into a matrix, after some rearrangement we can estimate the surface normal as below (knowing $\mathbf{n}$ at each pixel and some starting point (integration constant), integration is then possible):
\[\rho\mathbf{n}=(L^{\top}L)^{-1}L^{\top}\mathbf{i}\]
Photometric stereo requires many assumptions about illumination and surface reflectance, as well as a point of reference; also, precision is not guaranteed because depth is estimated from estimated surface normals.
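A hedged sketch of solving the least-squares system above per pixel with NumPy; the image stack layout and variable names are assumptions for illustration.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """images: (m, H, W) intensities under m known lights; light_dirs: (m, 3)
    unit directions to each light. Solves L @ (rho * n) = i in least squares,
    i.e. rho*n = (L^T L)^{-1} L^T i, for every pixel at once."""
    m, H, W = images.shape
    L = np.asarray(light_dirs, dtype=float)       # (m, 3)
    I = images.reshape(m, -1)                     # (m, H*W)
    G, *_ = np.linalg.lstsq(L, I, rcond=None)     # (3, H*W) = rho * n
    rho = np.linalg.norm(G, axis=0)               # albedo per pixel
    n = G / np.maximum(rho, 1e-12)                # unit surface normals
    return n.reshape(3, H, W), rho.reshape(H, W)
```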
Data Representation
Surface reconstruction
After acquiring a depth map, it is converted (based on the pose and the optics of the camera) to a point cloud, which contains a location in space for each 3D point. The point neighborhoods help estimate surface normals, which are used to build surfaces.
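A minimal sketch of the depth-map-to-point-cloud conversion for a pinhole camera; the intrinsic parameters and names are assumed for illustration.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H*W, 3) point cloud using
    pinhole intrinsics: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.astype(float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

cloud = depth_to_point_cloud(np.full((480, 640), 2.0), 525.0, 525.0, 320.0, 240.0)
print(cloud.shape)   # (307200, 3)
```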
Poisson surface reconstruction - an indicator function is fit so that its gradient matches the surface normals (inside the shape it is one, outside it is zero). It is a popular method to reconstruct surfaces from normals and helps to create unstructured meshes.
Surface registration
Knowing the scene representation at different timesteps, they can be combined into one scene via registration. A model of the scene and its point cloud representations are taken, and the best fit between them is found. If point correspondences are missing, they are inferred. Once all correspondences are known, the Kabsch algorithm is applied to find the single best rigid transformation that optimizes them.
Kabsch algorithm - given 2 point sets $P$ and $Q$, the distance between points in one set and the transformed points in the other set is minimized: $\min_{R,\mathbf{t}} \sum_i \|\mathbf{p}_i-(R\mathbf{q}_i+\mathbf{t})\|_2^2$
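A hedged sketch of the Kabsch algorithm via SVD; the centering convention is standard, but the exact variable names are my own.

```python
import numpy as np

def kabsch(P, Q):
    """Find R, t minimizing sum_i ||p_i - (R q_i + t)||^2 for matched point
    sets P, Q of shape (N, 3). Returns the rotation matrix and translation."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean
    H = Qc.T @ Pc                               # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = p_mean - R @ q_mean
    return R, t
```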
Point inference can be done in multiple ways; the simplest one is iterative closest point (ICP), where the closest points between the two sets are found and the best transformation to align them is applied, iterating until the points match.
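A minimal iterative closest point sketch, assuming the `kabsch` helper above and a KD-tree for nearest neighbours; the iteration count is an illustrative assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(P, Q, iterations=30):
    """Align point set Q (N, 3) onto P (M, 3) without known correspondences.
    Each step: match every transformed q to its closest p, then refit with Kabsch."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(P)
    for _ in range(iterations):
        Q_moved = Q @ R.T + t            # apply current transform estimate
        _, idx = tree.query(Q_moved)     # closest point in P for each q
        R, t = kabsch(P[idx], Q)         # refit rigid transform to the matches
    return R, t
```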