# High-Fidelity Depth Map Reconstruction System With RGB-Guided Super Resolution CNN and Cross-Calibrated Chaos LiDAR

## Abstract

High-fidelity 3D models are essential for immersive virtual and augmented reality (VR/AR) applications. However, the performance of current 3D recording devices is limited in several scenarios, such as dim light environments, long-distance measurements, and large-scale objects. Therefore, their applicability to indoor scenes is hindered. In this work, we propose a depth map reconstruction system that integrates an RGB-guided depth map super-resolution convolutional neural network (CNN) into a stand-alone Chaos LiDAR depth sensor. This system provides highly accurate depth estimates in various scenarios, particularly for indoor scenes with dim lighting or long distances ranging from 4 m to 6 m. We address two design challenges to maximize the quality of the reconstructed depth map of the system. First, the misalignment across RGB-depth sensors is addressed using a two-stage calibration pipeline. Second, the lack of large-scale real-world LiDAR datasets is addressed by generating a large-scale synthetic dataset and adopting transfer learning. Experimental results show that our proposed system significantly outperforms the commercial RGB-D recording device RealSense D435i in terms of subjective visual perception, precision, and density of depth estimates, making it a promising solution for general indoor scene recording.

## Authors

Yu-Chun Ding *Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan* [ORCID: 0000-0001-9004-3597](https://orcid.org/0000-0001-9004-3597)

Chia-Yu Chang *Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan*

Pei-Rong Li *Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan*

Chao-Tsung Huang *Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan* [ORCID: 0000-0002-9173-520X](https://orcid.org/0000-0002-9173-520X)

Yung-Chen Lin *Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan*

Tsung Chen *Department of Electrical Engineering, Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan* [ORCID: 0009-0008-9492-3784](https://orcid.org/0009-0008-9492-3784)

Wei-Lun Lin *Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan* [ORCID: 0009-0003-5187-0495](https://orcid.org/0009-0003-5187-0495)

Cheng-Ting Lee *Department of Electrical Engineering, Institute of Photonics Technologies, National Tsing Hua University, Hsinchu, Taiwan*

Fan-Yi Lin *Department of Electrical Engineering, Institute of Photonics Technologies, National Tsing Hua University, Hsinchu, Taiwan* [ORCID: 0000-0003-2160-9715](https://orcid.org/0000-0003-2160-9715)

Yuan-Hao Huang *Department of Electrical Engineering, Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan* [ORCID: 0000-0001-6781-7312](https://orcid.org/0000-0001-6781-7312)

## Publication Information

**Journal:** IEEE Access **Year:** 2025 **Volume:** 13 **Pages:** 19118-19131 **DOI:** [10.1109/ACCESS.2025.3532621](https://doi.org/10.1109/ACCESS.2025.3532621) **Article Number:** 10849544 **ISSN:** Electronic ISSN: 2169-3536

## Metrics

**Paper Citations:** 1 **Total Downloads:** 468

## Funding

- Ministry of Science and Technology, Taiwan (Grant: MOST 110-2218-E-007-046, MOST 110-2218-E-007-047 and MOST 110-2218-E-007-050)

---

## Keywords

**IEEE Keywords:** Laser radar, Chaos, Sensors, Mirrors, Micromechanical devices, Calibration, Three-dimensional displays, Cameras, Optical attenuators, Convolutional neural networks

**Index Terms:** Convolutional Neural Network, Depth Map, Super-resolution Convolutional Neural Network, Virtually, Transfer Learning, Large-scale Datasets, Real-world Datasets, Depth Camera, Depth Estimation, Lack Of Datasets, Immersive Virtual Reality, Augmented Reality Applications, LiDAR Datasets, High-resolution, High Precision, Point Cloud, Convolutional Neural Network Model, RGB Images, Projection Matrix, Raw Depth, Bicubic Interpolation, 3D Mesh, RGB Camera, Simple Harmonic Oscillator, Fast Axis, Linear Mode, Intrinsic Matrix, Resonant Modes, RGB-D Dataset

**Author Keywords:** Depth map super-resolution, Chaos LiDAR, depth sensing

## SECTION I. Introduction

High-fidelity 3D models of real-world scenes are crucial to the ongoing development of numerous emerging applications, such as virtual reality and augmented reality, which require dense depth information with high accuracy over long distances to refine the immersive quality. However, existing 3D imaging methods face challenges in simultaneously achieving high spatial resolution and accurate distance precision.

There are two main categories of 3D imaging methods: passive and active. Passive methods employ multiple cameras to mimic human vision and estimate depth values by computing disparities in the matching of features in stereo images via triangulation (i.e., parallax) [^1], [^2], [^3], [^4]. Passive methods can provide high spatial resolution using low-cost hardware (e.g., off-the-shelf cameras) without the need for additional optical devices. However, disparity estimation algorithms are computationally intensive and the error of depth values grows quadratically with an increase in detection range. Passive methods are also susceptible to illumination and noise, degrading their performance in indoor environments.
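
The quadratic growth of the depth error can be seen from a short derivation. This is a sketch that assumes the standard rectified two-camera model; the symbols $f$ (focal length in pixels), $B$ (baseline), $d$ (disparity), and $\delta d$ (disparity error) are introduced here for illustration and are not taken from the original text:

$$
\begin{align*} Z &= \frac {fB}{d}, \\ \left |{{\frac {\partial Z}{\partial d}}}\right | &= \frac {fB}{d^{2}} = \frac {Z^{2}}{fB}, \\ \delta Z &\approx \frac {Z^{2}}{fB}\,\delta d. \end{align*}
$$

Under this model, a fixed disparity error $\delta d$ produces a depth error proportional to $Z^{2}$, so doubling the detection range roughly quadruples the depth error.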

Active 3D imaging methods involve transmitting light toward the targeted object and then observing the reflected signals. This approach makes it possible to obtain depth estimates of high precision even under poor lighting conditions. Active methods can be further divided into two classes: structured light (using signal modulation in the spatial domain) and time-of-flight (using signal modulation in the temporal domain). Structured light methods project predefined planar patterns on the surface of objects and then calculate depth values based on distortions in the observed patterns. These methods can provide high-precision depth information with high spatial resolution at close range; however, they are limited in terms of measurable object size and distance, posing constraints for general indoor applications. Time-of-flight methods [^5], [^6], [^7], [^8] calculate depth values by measuring the time that it takes for a pulsed light beam to travel from an emitter to a target surface and back to a sensor. This approach is highly robust to light interference while enabling highly accurate depth measurements over a very long detection range. Nevertheless, the resolution of depth maps is limited by the latency imposed by pixel acquisition hardware. For example, a Chaos LiDAR depth sensor [^9] can provide sub-centimeter precision within 20 m; however, the pixel detection frequency is limited to roughly 100 kHz.
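
To make the time-of-flight relation and the pixel-rate limit concrete, the following is a minimal sketch. It assumes the idealized relation (distance equals half the round-trip time multiplied by the speed of light) and a hypothetical 100 × 100 depth map; the function names and the frame size are illustrative and are not taken from the original system.

```python
# Idealized time-of-flight arithmetic (illustrative sketch, not the authors' code).
C = 299_792_458.0  # speed of light in vacuum, m/s


def tof_distance(round_trip_s: float) -> float:
    """Distance to the target for a measured round-trip time."""
    return C * round_trip_s / 2.0


def frame_time(width: int, height: int, pixel_rate_hz: float) -> float:
    """Seconds needed to acquire one depth map at a fixed pixel rate."""
    return width * height / pixel_rate_hz


# Round-trip time for a target at 20 m (the range quoted in the text).
round_trip = 2.0 * 20.0 / C
print(f"round trip for 20 m: {round_trip * 1e9:.1f} ns")

# Hypothetical 100 x 100 depth map at the quoted ~100 kHz pixel rate.
print(f"frame time: {frame_time(100, 100, 100e3):.2f} s")
```

At a fixed pixel rate, the acquisition time scales with the number of depth pixels, which is why the native depth map stays sparse and a super-resolution stage is attractive.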

In recent years, convolutional neural networks (CNNs) have achieved high-quality reconstructions of densely spaced depth information [^10], [^11], [^12]. Yet, most depth maps are too sparse to provide sufficient information, which constrains accurate up-sampling. To address this situation, CNN architectures incorporate two input images: sparse depth maps and an additional guidance image as investigated in [^13], [^14], and [^15]. The high spatial resolution of RGB images makes them an ideal guide for reconstructing super-resolution (SR) depth maps with rich high-frequency details. However, the applicability of these approaches to real-world LiDAR systems remains unclear because most benchmark datasets are not collected with LiDAR.

This study aims to generate high-fidelity depth maps suitable for immersive virtual and augmented reality applications, with a focus on challenging indoor scenes. We integrate an RGB-guided super-resolution CNN framework with a stand-alone Chaos LiDAR depth sensor prototype [^16] to achieve high precision and resolution. To our knowledge, this is the first work to integrate a front-end LiDAR sensor with a post-processing CNN framework. Unlike prior efforts focused solely on depth sensor refinement, we overcome physical limitations through CNN post-processing with RGB guidance. Additionally, unlike works limited to CNN-based post-processing, our approach incorporates sensor integration to address the precision limits that arise from relying on existing datasets. We emphasize the importance of source data characteristics, which have rarely been discussed in previous works on depth SR CNNs. Based on this, we demonstrate the feasibility of integrating deep learning frameworks into sensor prototypes.

We address two design challenges to realize the system: 1) the cross-modality pixel misalignment across the RGB camera and the Chaos LiDAR depth sensor, and 2) the lack of a large-scale real-world Chaos LiDAR-based RGB-D dataset. The primary cause of the first problem is the mismatch between the rotational operation of MEMS scanning in the depth sensor and the planar projection of the RGB camera, leading to significant shape distortions in the depth map. Regarding the second issue, existing RGB-D datasets either employ vastly different depth sensing techniques (e.g., stereo camera in Middlebury [^17] and light coding in NYUv2 [^18]) or are unsuitable for our target indoor scenarios (e.g., KITTI [^19] for autonomous driving). The main contributions of this study are summarized as follows:

- A cross-calibrated Chaos LiDAR-based RGB-D recording subsystem is proposed to acquire high-precision raw depth data and the corresponding well-aligned RGB images. The proposed two-stage calibration resolves the issue of cross-modality pixel misalignment.
- An RGB-guided depth super-resolution CNN subsystem is designed to obtain spatially-dense depth maps. To overcome the scarcity of real-world Chaos LiDAR-based RGB-D data in a cost-effective way, we constructed a large-scale synthetic dataset and adopted a transfer learning strategy.
- Two well-aligned Chaos LiDAR-based RGB-D datasets were collected to verify the effectiveness of the proposed framework.
- The overall system significantly outperforms the RealSense D435i commercial RGB-D recording device in terms of subjective visual perception, precision, and density of depth estimates, particularly when implemented at longer distances.

The remainder of this paper is organized as follows. The overall system is introduced in Section II. Section III describes the cross-calibrated Chaos LiDAR sensing system and the proposed calibration scheme. Section IV outlines the proposed RGB-guided CNN-based depth super-resolution framework. The experimental results are discussed in Section V. The proposed system is compared with the RealSense D435i in Section VI. Discussion is presented in Section VII. Section VIII summarizes related works. Finally, the conclusions are given in Section IX.

## SECTION II. System Overview

The overall system pipeline is illustrated in Fig. 1. We adopt a Chaos LiDAR sensing system to enable high-precision depth map acquisition and a Sony DSC-RX100M5A camera for the simultaneous capture of the corresponding RGB guide image. The Chaos LiDAR system undergoes a two-stage calibration: 1) MEMS calibration and 2) cross-sensor coordinate projection, ensuring accurate alignment with the high-resolution RGB camera. The former deals with the issue of uneven step effect in the Chaos LiDAR sensor and the latter addresses the cross-modality pixel misalignment. A well-aligned RGB-D data pair is then sent to the RGB-guided depth super-resolution CNN for accurate depth map up-sampling.

![Figure 1](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding1-3532621-large.gif)

*FIGURE 1. Processing pipeline of the proposed depth map reconstruction system.*

Cross-sensor coordinate projection is essentially an optimization algorithm for pixel alignment, which calibrates the projection matrix between two sensor coordinates based on the known location of a small number of checkerboard samples. The optimized projection matrix is then used during the inference phase for the RGB-D alignment of arbitrary objects. RGB-guided depth super-resolution CNN is tasked with reconstructing a high-quality depth map of high density. We prepare a synthetic dataset (referred to as ROOMv1) and formulate a transfer learning strategy to overcome the lack of large-scale LiDAR datasets in a cost-effective manner. After training the model using the synthetic dataset, a small number of LiDAR data samples are then used for parameter fine-tuning. We capture a given scene under two field of view (FOV) settings: 5° FOV (high-resolution ground truth) and 20° FOV (low-resolution input). The following sections will discuss the cross-calibrated Chaos LiDAR sensing system and the RGB-guided depth super-resolution CNN in detail.

## SECTION III. Cross-Calibrated Chaos LiDAR Signal Sensing System

### A. MEMS-Calibrated Chaos LiDAR Sensing System

Fig. 2 presents a schematic illustration of the 3D pulsed Chaos LiDAR system. A laser beam directed toward the target via a MEMS mirror for horizontal and vertical scanning generates an optical signal captured via a quadrant avalanche photo-diode (APD). Following the method in [^9], we use a Field Programmable Gate Array (FPGA) to calculate the cross-correlation time delay between target and reference signals at each scanning angle, generating a 3D point cloud. Note that 3D point clouds may deviate significantly from the target in the projection space due to variations in the scanning rate of the MEMS mirror. Note also that MEMS mirrors can be operated in linear scanning mode or resonance scanning mode [^20], [^21], [^22]. In linear mode, the scanning angle is directly proportional to the input voltage (i.e., a linear relationship), and the operation is performed at a lower scanning frequency. The fact that scanning is performed at a uniform speed ensures the even distribution of scanning points across the projection space. To increase the scanning frequency of the MEMS mirror (high-speed scanning), the mirror can be operated in resonance mode (i.e., oscillating in a simple harmonic motion). In resonance mode, the deceleration and reacceleration of the mirror due to a reversal of direction at the edges can induce an unequal distribution of projection points in the projection space, leading to image distortion. In this study, we adopt the resonance mode in the horizontal axis in the Chaos LiDAR system and have to deal with variations in scanning speed. Calibration of the MEMS mirror is performed in two steps: 1) measuring the degree of variation in the distribution of scanning points, and 2) LiDAR image calibration.

![Figure 2](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding2-3532621-large.gif)

*FIGURE 2. Schematic setup of the 3D pulsed Chaos LiDAR system [9]. Chaos laser: a 1550-nm single-mode semiconductor laser (Shengshi Optical SBF-D55W2-111PMS) subject to optical feedback; BOA: Booster optical amplifier (Thorlabs BOA1004PXS); FC: Fiber coupler; VA: Variable optical attenuator; APD: Avalanche photodetector (Idealphotonics QAD-1000L); EDFA: Erbium-doped fiber amplifier (GIP CGB1E3128001A); MEMS: Microelectromechanical system (Mirrorcle, S6244); FPGA: Field Programmable Gate Array (Xilinx Virtex-7 VC707).*

Fig. 3 (a) presents an example target photo of straight stripes spaced evenly at intervals of 2 cm. Fig. 3 (b) shows the corresponding image constructed using Chaos LiDAR. We can see that the LiDAR image is thicker near the edge and thinner in the center, due to the variations of the scanning speed mentioned above. In the proposed LiDAR system, we adopted a light output frequency of 100 kHz, such that the pixel sensing duration was $10~\mu$s. The variations in scanning speed can be derived by calculating the measured width of the stripes in the LiDAR image, as shown in Fig. 3 (c).

![Figure 3](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding3abcd-3532621-large.gif)

*FIGURE 3. (a) RGB image and (b) LiDAR image of scanning target. (c) Variation of the scanning speed for the MEMS mirror. (d) LiDAR image of scanning target with distortion calibration.*

LiDAR imaging is used to obtain data related to the acceleration and scanning trajectory of a mirror operating in resonance mode. The movement of the mirror is characteristic of simple harmonic motion. Mathematically, this motion can be described using a cosine function to derive accurate estimates of the scanning trajectory. For the scanning angles $\alpha$ and $\beta$ in the slow and fast axes, respectively, we can derive their values based on the corresponding position (n,m) of the depth map:

$$
\begin{align*} & \alpha (n):\ - \frac {FOV_{y}}{ 2 } + \frac {FOV_{y}}{ N - 1 } { (n - 1) } \ (n=1,2,3,\ldots,N), \tag {1}\\ & \beta (m):\ - \frac {FOV_{x}}{ 2 } * \cos \left ({{\frac { \pi *(m-1) }{ (M-1) }}}\right) \ (m=1,2,3,\ldots,M), \tag {2}\end{align*}
$$

where $FOV_{y}$ and $FOV_{x}$ represent the maximum scanning angle set by the user for the slow and fast axes, respectively, and N and M represent the total number of scanning points set for the slow and fast axes, respectively. For the (n,m)-th point in the vertical and horizontal axes, $\alpha$ and $\beta$ are the corresponding scanning angles.
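
Eqs. (1) and (2) can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, angles are kept in the same unit as the FOV argument (degrees in the example), and the 5-point fast axis is chosen just to keep the output short.

```python
import math


def slow_axis_angle(n: int, N: int, fov_y: float) -> float:
    """Eq. (1): linear-mode scanning angle for row n (1-based, n = 1..N)."""
    return -fov_y / 2 + fov_y / (N - 1) * (n - 1)


def fast_axis_angle(m: int, M: int, fov_x: float) -> float:
    """Eq. (2): resonance-mode scanning angle for column m (1-based, m = 1..M)."""
    return -fov_x / 2 * math.cos(math.pi * (m - 1) / (M - 1))


# Hypothetical example: 5 samples per axis across a 20-degree FOV.
alphas = [slow_axis_angle(n, 5, 20.0) for n in range(1, 6)]
betas = [fast_axis_angle(m, 5, 20.0) for m in range(1, 6)]

# Angular step between consecutive samples on each axis.
alpha_steps = [b - a for a, b in zip(alphas, alphas[1:])]
beta_steps = [b - a for a, b in zip(betas, betas[1:])]
print("slow-axis steps:", [round(s, 3) for s in alpha_steps])
print("fast-axis steps:", [round(s, 3) for s in beta_steps])
```

The slow-axis steps are uniform, whereas the fast-axis steps are smaller near the edges than at the center. Samples taken at a constant pixel rate therefore oversample the edges, which is consistent with the stripes appearing thicker near the edge of the uncalibrated LiDAR image.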

The operation of the mirror in linear mode is described in Eq. (1). Starting from the edge, the scanning points progress forward at a fixed angular velocity equal to the FOV divided by the total number of pixels. The operation of the mirror in resonance mode is described in Eq. (2), in which a cosine function is included to simulate the simple harmonic motion along the fast axis. The $\alpha$ and $\beta$ together with the range *r* in the spherical coordinates are then transformed into X, Y, and Z in the Cartesian coordinates:

$$
\begin{align*} & \text {X(n,m):}\ \cos (\alpha (n))* \sin (\beta (m))* r(n, m), \tag {3}\\ & \text {Y(n,m):}\ \sin (\alpha (n))* r(n, m), \tag {4}\\ & \text {Z(n,m):}\ \cos (\alpha (n))*\cos (\beta (m))* r(n, m), \tag {5}\end{align*}
$$

Eqs. (3)–(5) can be used to calculate the spatial position (X, Y) as well as depth information (Z). Finally, the depth information is rearranged within the simulated trajectory to match the real scanning trajectory to eliminate the image distortion caused by the operation of the MEMS mirror in resonance mode.
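
The spherical-to-Cartesian conversion in Eqs. (3)–(5) can be sketched as follows. The function name is hypothetical, and the sketch assumes that the scanning angles are supplied in degrees and the range in meters.

```python
import math


def spherical_to_cartesian(alpha_deg: float, beta_deg: float, r: float):
    """Eqs. (3)-(5): map scanning angles (degrees) and range r to (X, Y, Z)."""
    a = math.radians(alpha_deg)
    b = math.radians(beta_deg)
    x = math.cos(a) * math.sin(b) * r  # Eq. (3)
    y = math.sin(a) * r                # Eq. (4)
    z = math.cos(a) * math.cos(b) * r  # Eq. (5), the depth
    return x, y, z


# A point on the optical axis keeps its full range as depth.
print(spherical_to_cartesian(0.0, 0.0, 5.0))

# An off-axis point: the depth Z is smaller than the measured range r.
x, y, z = spherical_to_cartesian(2.5, 10.0, 5.0)
print(round(x, 4), round(y, 4), round(z, 4))
```

Because the three components form an orthogonal decomposition of the range, $X^{2} + Y^{2} + Z^{2} = r^{2}$ holds for every scanning angle.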

Fig. 3 (d) presents an image constructed using Chaos LiDAR after calibration. Unlike the uneven distribution of fringes in Fig. 3 (b), the fringes in the calibrated image are evenly spaced at intervals of 2 cm, which matches the real target.

### B. SNR-Assisted Cross-Sensor Coordinate Projection

As shown in Fig. 4, the MEMS mirror in a Chaos LiDAR system performs depth-scanning via isometric rotation, whereas the CMOS sensor obtains pixel data in an equidistant manner. Under these conditions, it is not possible to align captured RGB images to the LiDAR depth maps without further processing. We address the RGB-D misalignment issue via calibration and alignment to obtain a transform from Chaos LiDAR coordinates to RGB camera coordinates. Similar to previous studies, RGB camera calibration was performed using a checkerboard pattern [^23], [^24]. To make the corners of the checkerboard observable in the Chaos LiDAR sensor, we create the black and white squares using materials with different degrees of reflectivity and then use signal-to-noise ratio (SNR) information of the received laser signals to generate checkerboard images.

![Figure 4](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding4-3532621-large.gif)

*FIGURE 4. The different sensor characteristics of the Chaos LiDAR and the camera.*

The signal-to-noise ratio of the correlation coefficients ($SNR_{cc}$) is used to evaluate the signal-to-noise value of the reference signal and the target LiDAR signal for the reconstruction of the checkerboard pattern. $SNR_{cc}$ is defined as

$$
\begin{equation*} SNR_{cc} = 10\log \frac {\mathbf {Corr}[T_{peak}]}{3\sqrt {\frac {\sum _{N=t_{f}-100}^{t_{f}}{(\mathbf {Corr}[N] - \boldsymbol {\mu }_{corr})^{2}}}{100}}} \ (in \ dB), \tag {6}\end{equation*}
$$

where $T_{peak}$ refers to the index at which the correlation trace reaches its peak and $\mathbf {Corr}[T_{peak}]$ refers to the corresponding correlation value. $t_{f}$ is the final correlation trace index in the time domain, and $\mu _{corr}$ indicates the average of the last 100 samples in the correlation trace. Fig. 5 (a) presents a correlation trace with each $SNR_{cc}$ term marked. Because the real-time computation of $SNR_{cc}$ for the entire Chaos LiDAR system would be computationally heavy, we use $\mathbf {Corr}[T_{peak}]$ instead to reduce computational complexity. Fig. 5 (b) shows a sampled $\mathbf {Corr}[T_{peak}]$ image.
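
The following is a minimal sketch of how $SNR_{cc}$ could be computed from a single correlation trace. It rests on several assumptions that are not confirmed by the text: the denominator of Eq. (6) is read as three times the standard deviation of the tail samples (squared deviations under the root), the logarithm is base 10, the noise window is exactly the last 100 samples, and the peak is the global maximum of the trace. The function name and the synthetic trace are hypothetical.

```python
import math


def snr_cc(corr, tail: int = 100) -> float:
    """SNR of a correlation trace in dB, following the structure of Eq. (6)."""
    peak = max(corr)                      # Corr[T_peak], assumed global maximum
    tail_samples = corr[-tail:]           # noise window at the end of the trace
    mu = sum(tail_samples) / len(tail_samples)
    variance = sum((c - mu) ** 2 for c in tail_samples) / len(tail_samples)
    noise = 3.0 * math.sqrt(variance)     # assumed 3-sigma noise level
    if noise == 0.0:
        raise ValueError("noise window is constant; SNR is undefined")
    return 10.0 * math.log10(peak / noise)


# Synthetic trace: a peak of 3.0 followed by a noise tail alternating +/-0.01.
trace = [0.0] * 50 + [3.0] + [0.01, -0.01] * 50
print(round(snr_cc(trace), 3))
```

In this synthetic trace the tail has a standard deviation of 0.01, so the noise level is 0.03 and the peak-to-noise ratio is 100, i.e., 20 dB.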

![Figure 5](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding5ab-3532621-large.gif)

*FIGURE 5. (a) An example of correlation trace. $SNR_{cc}$ related terms are marked. (b) Sampled $\mathbf {Corr}[T_{peak}]$ image.*

The locations of the corners in the two coordinate systems are then used to obtain the optimized projection parameters, which are derived as

$$
\begin{equation*} s\tilde {\mathbf {m}} = \ \mathbf {A}[\mathbf {R} \ \ \mathbf {t}]\tilde {\mathbf {w}}, \tag {7}\end{equation*}
$$

where *s* is an arbitrary scale factor, $\tilde {\mathbf {m}}$ refers to measured points in the image coordinate vector (position of RGB corner), **A** is the intrinsic matrix, [$\mathbf {R} \ \mathbf {t}$] is the extrinsic matrix comprising a rotation matrix and a translation vector, and $\tilde {\mathbf {w}}$ refers to measured points in the world coordinate system (position of depth corner).

After multiplying the intrinsic and extrinsic matrix, the matrix can be further simplified as

$$
\begin{align*} s \left [{{ \begin{matrix} U \\ V \\ 1 \end{matrix} }}\right ] & = \left [{{ \begin{matrix} p_{11} & \quad p_{12} & \quad p_{13} & \quad p_{14} \\ p_{21} & \quad p_{22} & \quad p_{23} & \quad p_{24} \\ p_{31} & \quad p_{32} & \quad p_{33} & \quad p_{34} \\ \end{matrix} }}\right ] \left [{{ \begin{matrix} X \\ Y \\ Z \\ 1 \end{matrix} }}\right ] \\ & = \ \mathbf {P}\tilde {\mathbf {w}}, \tag {8}\end{align*}
$$

where *U* and *V* are the indices in the image projection plane, and *X*, *Y*, and *Z* are the vectors obtained from Eqs. (3)–(5) in the homogeneous world coordinate representation. **P** is the camera projection matrix, i.e., the product of the intrinsic matrix **A** and the extrinsic matrix [$\mathbf {R} \ \mathbf {t}$]. Fig. 6 presents the alignment process used to obtain the projection matrix **P**, where the input is raw depth information and the output is a calibrated and projected depth point cloud. We employ an optimization-based algorithm to estimate the projection matrix **P**. Training data included RGB images, raw Chaos LiDAR depth maps, and Chaos LiDAR $SNR_{cc}$ maps of the checkerboard from various angles and distances. We initialize the projection matrix **P** via direct linear transformation (DLT) and optimize the parameters using the Levenberg-Marquardt algorithm until the projection error converges. Detailed formulations are presented in the Appendix. The locations of the corners captured in RGB images (image coordinate system) are adopted as the ground truth. Reprojection error is assessed by transforming the locations of the corners in the world coordinate system into those of corresponding corners in the image coordinate system based on optimized projection parameters. The root-mean-square error (RMSE) is used to measure the deviations between reprojection results and the ground truth in terms of the number of pixels. The average RMSE values vary with Chaos LiDAR FOV as follows: 0.9362 pixels for 5° FOV, and 4.0403 pixels for 20° FOV.
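
To illustrate how Eq. (8) is applied and how a reprojection RMSE can be evaluated, the sketch below projects 3D points with a given 3 × 4 matrix and compares them with reference corner positions. It does not implement the DLT initialization or the Levenberg-Marquardt refinement. The function names, the example matrix (a hypothetical pinhole camera with a focal length of 1000 pixels and principal point (320, 240)), and the RMSE convention (root of the mean squared Euclidean pixel distance) are assumptions made for illustration.

```python
import math


def project(P, point):
    """Eq. (8): project a 3D point (X, Y, Z) to pixel indices (U, V)."""
    X, Y, Z = point
    w = (X, Y, Z, 1.0)  # homogeneous world coordinates
    su, sv, s = [sum(P[i][j] * w[j] for j in range(4)) for i in range(3)]
    return su / s, sv / s  # divide out the scale factor s


def reprojection_rmse(P, world_points, image_points):
    """RMSE in pixels between projected points and reference corners."""
    squared = 0.0
    for wp, (u_ref, v_ref) in zip(world_points, image_points):
        u, v = project(P, wp)
        squared += (u - u_ref) ** 2 + (v - v_ref) ** 2
    return math.sqrt(squared / len(world_points))


# Hypothetical projection matrix: intrinsics only, identity extrinsics.
P = [
    [1000.0, 0.0, 320.0, 0.0],
    [0.0, 1000.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

corner_3d = (0.1, -0.05, 5.0)
u, v = project(P, corner_3d)
print(round(u, 3), round(v, 3))

# A reference corner that is 3 px and 4 px away gives an RMSE of 5 px.
print(round(reprojection_rmse(P, [corner_3d], [(u + 3.0, v + 4.0)]), 3))
```

In an optimization loop, a function such as `reprojection_rmse` would serve as the residual that the refinement step tries to reduce over all checkerboard corners.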

![Figure 6](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding6-3532621-large.gif)

*FIGURE 6. Overall alignment flow for cross-sensor coordinate calibration [20]. N denotes the total image pairs for parameter optimization.*

## SECTION IV. RGB-Guided Depth Map Super-Resolution CNN for Chaos LiDAR

Chaos LiDAR sensors provide highly accurate depth information; however, low spatial resolution limits their applicability in real-world situations. We therefore integrated our cross-calibrated Chaos LiDAR with a CNN-based super-resolution system to render high-quality high-density depth maps. Note that our focus was on implementation in real-world scenarios, rather than objective metrics. Thus, we departed from the assumptions on existing synthetic datasets or benchmarks, such as KITTI [^19] and NYUv2 [^18], by focusing on the practicality and interpretability of the data collected by practical systems susceptible to non-ideal effects.

### A. RGB-Guided Depth Map Super-Resolution CNN System

Fig. 7 outlines the inference flow of the proposed RGB-guided depth map super-resolution CNN system, the inputs of which are calibrated depth point clouds (scanned using Chaos LiDAR) and RGB images (captured using a digital camera). Inference is performed in two stages: 1) nearest inverse warping and 2) CNN model inference. The calibrated 3D depth point cloud is converted into a 2D (planar) depth map and then sent to the CNN model with its corresponding RGB image to render a super-resolution depth map with 4x the spatial resolution. Our depth map super-resolution CNN model is based on the network described in [^13]. The network leverages guidance images to enhance structural details in target depth maps. It consists of three CNN blocks: CNNT and CNNG (parallel) for feature extraction from depth maps and guidance RGB images, respectively, and CNNF for reconstruction using their concatenated outputs.

![Figure 7](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding7-3532621-large.gif)

*FIGURE 7. The processing flow of our proposed RGB-guided depth map super-resolution CNN system. A nearest inverse warping algorithm is adopted for transferring the calibrated point clouds into the integer camera grid. The coordinates of the aligned point cloud are denoted as X, Y, and Z. The 2D position of the projected LiDAR map in floating-point representation is denoted as $X'_{n}$ and $Y'_{n}$, and the warped LiDAR map in integer grid is denoted as $X_{n}$ and $Y_{n}$.*

### B. Nearest Inverse Warping for Practical LiDAR Application

Our RGB-guided CNN takes the 2D plane depth map as input, which is projected from aligned 3D point cloud information; however, the coordinate projection results in depths located at floating-point locations. In this work, we employ image warping to correct image distortion introduced by domain conversion. We opt to warp the depth point cloud to the RGB image because the depth points are of lower resolution.

An illustration of different warping algorithms is presented in Fig. 7. We chose to inversely warp from the destination RGB map to the source LiDAR depth map to avoid the depth hole issues introduced by forward warping. We take the x-coordinates and y-coordinates for each RGB pixel as a reference and map the depth point in the LiDAR map with the nearest distance. The LiDAR depth point closest to the corresponding reference point is warped to the reference integer point in the RGB system without interpolation. Although bicubic or bilinear interpolation could predict depth values closer to real-world values, they would likely smooth the high-frequency raw data, potentially degrading the performance of subsequent CNN inference operations. We therefore opt to retain the original depth values obtained by the LiDAR system. This transfer from the 3D to the 2D domain results in a well-aligned depth map warped to the reference RGB image. The resulting RGB-D image pair is then used for depth map super-resolution CNN.
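
The nearest inverse warping step described above can be sketched as follows. This is an illustrative brute-force version, not the authors' implementation: the function name, the tuple layout `(x, y, depth)`, and the tiny 2 × 2 target grid are hypothetical, and a spatial index (e.g., a k-d tree) would normally replace the exhaustive search for realistic image sizes.

```python
def nearest_inverse_warp(points, width: int, height: int):
    """Fill an integer pixel grid with the depth of the nearest projected point.

    points: list of (x, y, depth), where x and y are floating-point positions
    of projected LiDAR samples expressed in the target grid's pixel coordinates.
    """
    depth_map = [[0.0] * width for _ in range(height)]
    for yn in range(height):
        for xn in range(width):
            # Inverse warping: start from the destination pixel and look up
            # the closest source sample, so no destination pixel is left empty.
            nearest = min(
                points,
                key=lambda p: (p[0] - xn) ** 2 + (p[1] - yn) ** 2,
            )
            # Copy the raw depth value unchanged (no interpolation).
            depth_map[yn][xn] = nearest[2]
    return depth_map


# Two projected LiDAR samples at floating-point positions.
samples = [(0.2, 0.1, 4.0), (1.8, 0.9, 5.0)]
print(nearest_inverse_warp(samples, width=2, height=2))
# [[4.0, 4.0], [4.0, 5.0]]
```

Every destination pixel receives a measured depth value, which avoids the holes that forward warping can leave and keeps the raw LiDAR readings intact for the CNN.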

### C. Transfer Learning for Practical LiDAR Application

The robustness of practical CNN models relies on abundant training data, which is scarce for 3D Chaos LiDAR systems due to the unique nature of their depth maps compared to other ranging systems. Most real-world benchmarks are incompatible with Chaos LiDAR, even as prior knowledge. The lengthy data collection process compounds this challenge due to mechanical hardware settings, cross-modality calibration, sensor corrections, and scene setup. In our experiments, the collection of just 12 data pairs required 16 hours. Thus, we choose to generate a synthetic LiDAR dataset for use in CNN transfer learning.

We simulate the depth acquisition of a Chaos LiDAR system to generate ROOMv1, an RGB-D dataset containing 508 pairs of ground truth point clouds and corresponding RGB images. The dataset includes 13 3D interior scenes created with Blender [^25] and captured using a simulated RGB camera. LiDAR depth scanning behavior is modeled using Blensor [^26], an open-source simulation tool. We expect that the consistency of the ROOMv1 dataset would make it more effective than existing benchmarks (e.g., NYUv2 or KITTI) for training.

We first pre-train the CNN for depth map super-resolution using the large ROOMv1 dataset to avoid overfitting. We then transfer the pre-trained parameters to the target model and perform fine-tuning using a small real-world LiDAR dataset, as shown in Fig. 9. This model is denoted as ROOMv1 model transfer. For comparison, we also conduct the NYUv2 model transfer, which uses the NYUv2 dataset for model pre-training.

![Figure 8](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding8-3532621-large.gif)

*FIGURE 8. ROOMv1 dataset.*

![Figure 9](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding9-3532621-large.gif)

*FIGURE 9. Adopted transfer learning strategy. We pre-train our target model using the ROOMv1 dataset and perform model fine-tuning with the small-size real LiDAR data. For comparison, we also pre-train a model with the NYUv2 dataset.*

## SECTION V. Experimental Results

### A. Experiment Setup

We collect two real LiDAR datasets for evaluation, including relatively simple scenes with rudimentary lighting (LiDARv1) and more complex scenes with additional lighting sources and higher-quality RGB images to enhance contrast (LiDARv2). Considering the size of the general indoor space, target objects are placed at a distance of roughly 4 to 6 m from the LiDAR sensor. LiDARv1 includes 8 data pairs, while LiDARv2 consists of 12 data pairs.

Transfer learning involves training the model using ROOMv1 or NYUv2 for 2,000 epochs and then selecting the epoch with the smallest validation loss to be the pre-trained model. We adopt the same dataset splitting as in [^13] (1000 pairs with a resolution of $640\times 480$) for NYUv2 data. To ensure a fair comparison, we select 295 pairs of ROOMv1 data with a resolution of $1024\times 1024$ for training the model. This selection ensures that the training information volume is similar. We allocate 60 sets of the remaining ROOMv1 data for validation while reserving 153 sets for testing. The pre-trained models are then transferred to the real LiDAR dataset to undergo fine-tuning for 200 epochs. We also assess the effectiveness of transfer learning by training the model from scratch using only the real-world LiDAR dataset, using 2,200 epochs to ensure a fair comparison.
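
The claim of a similar training information volume can be checked with simple arithmetic, under the assumption (not stated in the text) that the volume is measured as the total number of training pixels. The variable names below are illustrative.

```python
# Total training pixels per dataset (assumed proxy for "information volume").
nyu_pixels = 1000 * 640 * 480      # NYUv2: 1000 pairs at 640 x 480
room_pixels = 295 * 1024 * 1024    # ROOMv1: 295 pairs at 1024 x 1024

print(nyu_pixels, room_pixels)
# 307200000 309329920
print(round(room_pixels / nyu_pixels, 4))
# 1.0069

# The ROOMv1 split also accounts for all 508 pairs in the dataset.
room_split = {"train": 295, "validation": 60, "test": 153}
print(sum(room_split.values()))
# 508
```

Under this reading, the two training sets differ by less than 1% in pixel count, and the 295/60/153 split matches the 508 pairs reported for ROOMv1.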

Due to the limited availability of real-world LiDAR data, we adopt cross-validation [^27] to evaluate model performance. Note that cross-validation is a resampling procedure that uses different subsets of data for the training and testing of a model. This approach helps to prevent overfitting and selection bias while providing insights into the generalizability of the model to unseen data. LiDARv1 is divided into 6 training sets and 2 testing sets for each iteration. This process is repeated four times (resulting in one trial) to ensure that every data point is utilized once as testing data. The results of all tests in a single trial are averaged as a single estimate to evaluate model performance. The above process is also performed with LiDARv2 divided into 9 training sets and 3 testing sets for each test.
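
The fold structure described above can be sketched as follows. The function name is hypothetical, and the sketch assumes contiguous, non-shuffled folds and a sample count divisible by the number of folds; the text does not specify how samples are assigned to folds.

```python
def k_fold_splits(num_samples: int, num_folds: int):
    """Return (train_indices, test_indices) for each fold of one trial."""
    fold_size = num_samples // num_folds
    indices = list(range(num_samples))
    splits = []
    for k in range(num_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = [i for i in indices if i not in test]
        splits.append((train, test))
    return splits


# LiDARv1: 8 data pairs, four folds of 6 training / 2 testing pairs.
for train, test in k_fold_splits(8, 4):
    print(len(train), len(test), test)

# LiDARv2: 12 data pairs, four folds of 9 training / 3 testing pairs.
for train, test in k_fold_splits(12, 4):
    print(len(train), len(test), test)
```

Each data pair appears in exactly one test fold per trial, and the per-fold metrics are then averaged into a single estimate.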

### B. Effectiveness of Cross-Modality Calibration

Fig. 10 presents the visual comparison of depth maps with and without cross-modality calibration. Direct mapping involves passing the sampled point cloud value into the 2D plane in a one-to-one Z-scan manner without considering differences in cross-sensor characteristics. The proposed cross-modality calibration approach provides well-aligned depth maps. Fig. 11 displays the results of RGB-guided CNN inference, revealing sharper boundaries in the calibrated depth maps compared to those without calibration.

![Figure 10](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding10abc-3532621-large.gif)

*FIGURE 10. Visual results of depth maps (b) with and (c) without cross-modality calibration. The corresponding RGB images are shown in (a). Serious shape distortions can be observed in depth maps without calibration.*

![Figure 11](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding11-3532621-large.gif)

*FIGURE 11. CNN inference results for RGB-guided depth SR with and without cross-modality calibration. The well-aligned depth maps have sharper boundaries.*

### C. Objective Quality Comparison

Tables 1 and 2 present the quantitative comparison of the various models using objective metrics. RMSE measures the difference between the predicted depth values and the ground truth. Bad pixels report the percentage of pixels whose relative error is greater than a specified threshold; we use Bad-1% and Bad-2% to denote thresholds of 1% and 2%, respectively. We further divide the image into texture, edge, and foreground regions for more detailed analysis. The corresponding masks used for each region are shown in Fig. 12. The deep-learning-based methods significantly outperform bicubic upsampling, and models pre-trained on a large-scale dataset (transfer learning) outperform the model trained from scratch on the small real-world LiDAR dataset. Under 4x upscaling of LiDARv1, the RMSE of the ROOMv1-trained model is 13% (1.61 cm) lower than that of bicubic interpolation at object edges and 14% (0.67 cm) lower across whole images. In terms of the bad-pixel percentages Bad-1% and Bad-2%, the proposed system also outperforms bicubic interpolation, with reductions of 4.36% and 6.08% in edge estimation, respectively. Under 4x upscaling of LiDARv2, the RMSE of the ROOMv1-trained model is 21% (3.17 cm) lower than that of bicubic interpolation at object edges and 24% (1.79 cm) lower across whole images. Note that the benefits of transfer learning are more evident in this complex environment, due to its superior generalizability. Despite the limited diversity of our synthetic ROOMv1 dataset compared to NYUv2 (295 data pairs from 13 scenes versus 1,000 data pairs from 464 scenes), the accuracy achieved using ROOMv1 transfer learning is comparable to that achieved using NYUv2.
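The two metrics can be sketched as below. This is an illustrative implementation under stated assumptions: the relative error is taken as $|\hat{d} - d| / d$ with respect to the ground-truth depth $d$ (assumed positive), and `mask` is a hypothetical boolean array selecting a region such as texture, edge, or foreground. The function names and the toy values are ours.

```python
import numpy as np


def rmse(pred, gt, mask=None):
    """Root-mean-square error over the pixels selected by `mask` (all pixels if None)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = np.ones(gt.shape, dtype=bool)
    diff = pred[mask] - gt[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def bad_pixel_rate(pred, gt, threshold, mask=None):
    """Percentage of selected pixels whose relative error exceeds `threshold`.

    Assumes relative error = |pred - gt| / gt with gt > 0.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = np.ones(gt.shape, dtype=bool)
    rel_err = np.abs(pred[mask] - gt[mask]) / gt[mask]
    return float(100.0 * np.mean(rel_err > threshold))


# Toy 2x2 depth maps in metres (made-up values, not data from the paper).
gt = np.full((2, 2), 5.0)
pred = np.array([[5.0, 5.04], [5.06, 5.2]])
edge_mask = np.array([[False, False], [True, True]])

overall_rmse = rmse(pred, gt)
bad_1 = bad_pixel_rate(pred, gt, 0.01)             # Bad-1%
bad_2 = bad_pixel_rate(pred, gt, 0.02)             # Bad-2%
edge_bad_1 = bad_pixel_rate(pred, gt, 0.01, edge_mask)
```

Passing a region mask restricts both metrics to that region, which mirrors the per-region breakdown reported in the tables.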

![Table 1](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding.t1-3532621-large.gif)

*TABLE 1*

![Table 2](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding.t2-3532621-large.gif)

*TABLE 2*

![Figure 12](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding12-3532621-large.gif)

*FIGURE 12. Visualization of the results of the two real LiDAR datasets. Images in the top four rows belong to the first real LiDAR dataset (LiDARv1), and those in the bottom four rows belong to the second real LiDAR dataset (LiDARv2). In general, the depth maps produced by the ROOMv1-transferred model have relatively clear boundaries.*

### D. Subjective Quality Comparison

Fig. 12 and Fig. 13 present visual comparisons of results obtained using transfer learning and those obtained when training from scratch. The depth maps generated via training from scratch are blurry, particularly in edge regions. This may explain why the model trained from scratch differed little from those trained via transfer learning in terms of RMSE but was significantly weaker in terms of the bad pixel metric (see Table 1). Training from scratch may also induce the over-texture problem. As shown in Fig. 13 (c), some of the objects in the depth maps (e.g., can, basketball, and book) exhibit textures similar to those in their corresponding RGB images. Although the ROOMv1-transferred and NYUv2-transferred models achieve similar performance in quantitative metrics, the ROOMv1-transferred model generally produces results with better perceptual quality, featuring sharper and clearer boundaries. We attribute this to ROOMv1’s properties being more similar to real LiDAR data.

![Figure 13](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding13abcde-3532621-large.gif)

*FIGURE 13. Subjective comparison of RealSense D435i and our system. (a) RGB images. (b) Depth maps captured by our Chaos LiDAR at 4.5 m with 4x bicubic upsampling. (c) Depth maps captured by our Chaos LiDAR at 4.5 m with SRx4 through training from scratch. (d) Depth maps captured by our Chaos LiDAR at 4.5 m with SRx4 through transfer from ROOMv1 and fine-tuning. (e) Depth maps captured by RealSense D435i at 2 m. Compared to RealSense D435i (e), our Chaos LiDAR (b, c, d) can generate more precise depth values at longer distances. The CNN-based upsampling methods (c, d) render clearer and smoother boundaries than bicubic upsampling (b). The proposed transfer learning framework (d) can suppress the over-texture problems of the training-from-scratch method (c).*

Further assessments were performed by rendering 3D triangular meshes from the depth maps using open-source MeshLab [^28] and MeshMixer [^29]. Fig. 14 presents rendered 3D meshes obtained via bicubic upsampling or ROOMv1 model transfer with fine-tuning. Bicubic upsampling could not render object edges effectively, due to inaccuracies in the depth map values. The shape of the stereoscopic 3D mesh generated using the ROOMv1 transfer model with fine-tuning is close to the corresponding RGB image, due to the high quality of its depth maps.

![Figure 14](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding14-3532621-large.gif)

*FIGURE 14. 3D meshes rendered from depth maps. The corresponding RGB images are shown in Fig. 13 (a).*

## SECTION VI. Comparison With Commercial 3D Recording Devices

Table 3 compares the specifications of various RGB-D cameras. The Chaos LiDAR provides exceptional precision over long distances, outperforming all other cameras. The whole system achieves a spatial resolution comparable to others by incorporating depth map super-resolution in post-processing. We also compare the visual quality of depth maps obtained using the proposed system with those obtained using RealSense D435i [^30]. Unlike LiDAR, which employs time of flight depth measurements, RealSense D435i calculates depth values using stereo imaging based on two cameras. With a depth map resolution of $1280\times 720$, this device provides a usable shooting range of 0.3 to 3.0 m. Based on the field of view, resolution, and shooting range, the spatial resolution (meters/pixel) of RealSense D435i at 2 m is similar to that of the proposed system at 4.5 m. Thus, we crop the RealSense D435i images to a coverage range of 2 to 3 m, while cropping our LiDAR images to a range of 4.5 to 5.5 m. We then normalize the values to a range of 0 to 1 and multiply the results by 255 to form depth maps. Fig. 13 compares the depth maps generated by RealSense D435i and the proposed system, and Fig. 14 presents the corresponding 3D meshes. Results from the RealSense D435i show inconsistencies in certain textures and inaccuracies in depth points around edges, leading to distortions and blurred object contours. The proposed system generates depth maps of far higher quality despite capturing images from a greater distance.
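The range cropping and normalization step can be sketched as follows. This is a minimal illustration under assumptions: we read "cropping to a coverage range" as clamping depth values to the stated near and far limits, and we round to the nearest integer when scaling to 8 bits. Neither detail is specified in the text, and the function name and sample values are hypothetical.

```python
import numpy as np


def depth_to_uint8(depth_m, near_m, far_m):
    """Clamp depth (metres) to [near_m, far_m], normalize to [0, 1], scale to 0..255."""
    depth = np.clip(np.asarray(depth_m, dtype=np.float64), near_m, far_m)
    normalized = (depth - near_m) / (far_m - near_m)
    # Rounding to the nearest integer is an assumption of this sketch.
    return np.round(normalized * 255.0).astype(np.uint8)


# Chaos LiDAR samples, mapped with the 4.5 m to 5.5 m range from the text.
lidar_map = depth_to_uint8([4.5, 4.75, 5.5, 6.0], near_m=4.5, far_m=5.5)
# RealSense D435i samples, mapped with the 2 m to 3 m range from the text.
realsense_map = depth_to_uint8([1.0, 2.25, 3.0], near_m=2.0, far_m=3.0)
```

Because both devices are mapped over a 1 m window, equal grey-level differences correspond to equal depth differences in the two sets of depth maps.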

![Table 3](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6287639/10820123/10849544/ding.t3-3532621-large.gif)

*TABLE 3*

## SECTION VII. Discussion

### A. Characteristics of the System

This system aims to deliver high-precision, high-resolution depth maps. To address the limitations of existing datasets and devices, we leverage our Chaos LiDAR prototype to ensure precise sensor performance. Given the resolution constraints of our sensor, we integrate the depth SR CNN model into real-world systems, tackling challenges in data collection. Unlike standalone depth SR model development, this work approaches the design from a holistic system perspective, emphasizing the critical role of data sources. This work validates the idea of integrating front-end sensor prototypes with CNN post-processing, providing a bridge between related works in these two fields. The proposed system also has the potential to be applied in outdoor environments, which we leave for future experiments.

### B. Generalization of the CNN Model

This work focuses on system development, sensor integration, and feasibility verification. A small model ($\sim 56$K parameters) is trained separately on each dataset for initial evaluation. Cross-dataset generalization and larger-scale model training are left for future experiments.

## SECTION VIII. Related Works

### A. Depth Map Super-Resolution

Numerous approaches [^31], [^32], [^33], [^34] have been proposed for depth map super-resolution. These studies employ advanced techniques to enhance the model structure or training process and improve the quality of reconstructed depth maps. In contrast, our framework primarily focuses on integrating an RGB-guided CNN into a real-world Chaos LiDAR prototype system and addressing non-ideal factors that could degrade CNN performance. Our attention is on data pre-processing of Chaos LiDAR raw depth, handling misalignment between RGB-D cameras, and incorporating transfer learning. We choose a relatively basic model [^13] to avoid overfitting. We anticipate that these prior studies on advanced CNN post-processing will be compatible with or even complementary to our framework.

### B. Depth Map Datasets

Several RGB-D datasets are available for training and evaluating depth map super-resolution tasks. However, the depth-sensing mechanisms in these datasets differ significantly from our Chaos LiDAR. For instance, the NYUv2 dataset [^18] is captured using the Microsoft Kinect sensor [^35], where depth error quadratically increases with detection distance. The Middlebury dataset [^17] is sampled with a stereo camera, producing relative depth maps (disparity) inversely proportional to absolute depth. The depth accuracy within existing datasets is limited, constraining their applicability, particularly in scenarios where high-precision depth maps are demanded.
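Both statements follow from the standard triangulation model, which we sketch here for reference. The symbols $f$ (focal length), $B$ (baseline), $d$ (disparity), and $\Delta d$ (disparity error) are introduced for illustration and are not taken from the datasets' documentation.

$$
\begin{equation*} Z = \frac{fB}{d}, \qquad \left|\frac{\partial Z}{\partial d}\right| = \frac{fB}{d^{2}} = \frac{Z^{2}}{fB} \quad \Rightarrow \quad \Delta Z \approx \frac{Z^{2}}{fB}\,\Delta d. \end{equation*}
$$

Under this model, depth $Z$ is inversely proportional to disparity, and a fixed disparity error maps to a depth error that grows with $Z^{2}$.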

### C. Chaos LiDAR Sensor with CNN

Although Chaos LiDAR and RGB cameras have also been used for RGB-D data acquisition in CNNs in previous work [^9], their application was focused on face classification, which required lower accuracy for cross-sensor calibration. In contrast, our application demands point-by-point predictions, requiring higher alignment precision. Based on this difference, this paper focuses on developing the corresponding technical enhancements.

## SECTION IX. Conclusion

This paper proposes a high-fidelity depth sensing system that integrates a prototype Chaos LiDAR depth sensor with an RGB-guided depth map super-resolution CNN to provide dense, highly accurate depth estimates for immersive virtual and augmented reality. The alignment of RGB-D sensors is addressed using a two-stage calibration technique. The lack of large-scale real-world LiDAR datasets is addressed by generating a large-scale synthetic dataset and adopting transfer learning. The proposed system outperforms existing systems regarding subjective visual perception, even when implemented at longer distances. This demonstrates the potential of leveraging the proposed solution for immersive 3D applications in broader scenarios, specifically for general indoor scenes.

## Appendix Derivation of Initialization and Optimization of Projection Matrix Calibration

We use Direct Linear Transformation (DLT) to provide an initial value for the non-linear optimization. Setting $p_{34} = 1$, we can rewrite Eq. (8) as

$$
\begin{align*} U(p_{31}X+p_{32}Y+p_{33}Z+1) & \!=\! p_{11}X+p_{12}Y+p_{13}Z+p_{14}, \\ V(p_{31}X+p_{32}Y+p_{33}Z+1) & \!=\! p_{21}X+p_{22}Y+p_{23}Z+p_{24}, \tag {9}\end{align*}
$$

Assuming this expression of *U* and *V* holds, the relationship can be written in matrix form using **W**, as shown in Eq. (10),

$$
\begin{align*} \mathbf{W}=\left[\begin{array}{ccccccccccc} X_{1} & Y_{1} & Z_{1} & 1 & 0 & 0 & 0 & 0 & -U_{1} X_{1} & -U_{1} Y_{1} & -U_{1} Z_{1} \\ 0 & 0 & 0 & 0 & X_{1} & Y_{1} & Z_{1} & 1 & -V_{1} X_{1} & -V_{1} Y_{1} & -V_{1} Z_{1} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ X_i & Y_i & Z_i & 1 & 0 & 0 & 0 & 0 & -U_i X_i & -U_i Y_{i} & -U_{i} Z_{i} \\ 0 & 0 & 0 & 0 & X_{i} & Y_{i} & Z_{i} & 1 & -V_{i} X_{i} & -V_{i} Y_{i} & -V_{i} Z_{i} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ X_{N} & Y_{N} & Z_{N} & 1 & 0 & 0 & 0 & 0 & -U_{N} X_{N} & -U_{N} Y_{N} & -U_{N} Z_{N} \\ 0 & 0 & 0 & 0 & X_{N} & Y_{N} & Z_{N} & 1 & -V_{N} X_{N} & -V_{N} Y_{N} & -V_{N} Z_{N} \end{array}\right]\tag {10}\end{align*}
$$

built from several points in different images, where *N* is the total number of points, $(U_{i},V_{i})$ is the $i^{th}$ corner in image coordinates, and $(X_{i},Y_{i},Z_{i})$ is the $i^{th}$ corner in world coordinates.

Then, we can construct an equation as

$$
\begin{equation*} \mathbf {Wp} = \mathbf {c+n}, \tag {11}\end{equation*}
$$

where **c** is the vector of corners in image coordinates, **n** is the noise term, and **p** is the vectorized form of the projection matrix **P**. The solution of **p** can be initialized as

$$
\begin{equation*} \mathbf {p} = \mathbf {(W^{T}W)^{-1}W^{T}}\mathbf {(c+n)}. \tag {12}\end{equation*}
$$
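The DLT initialization can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code: the function names and the synthetic test values are hypothetical, and `numpy.linalg.lstsq` is used in place of the explicit pseudo-inverse of Eq. (12) because it solves the same least-squares problem with better numerical behavior.

```python
import numpy as np


def build_dlt_system(world_pts, image_pts):
    """Stack two rows per correspondence, following the layout of W in Eq. (10)."""
    world_pts = np.asarray(world_pts, dtype=np.float64)
    image_pts = np.asarray(image_pts, dtype=np.float64)
    n = world_pts.shape[0]
    W = np.zeros((2 * n, 11))
    c = np.zeros(2 * n)
    for i, ((X, Y, Z), (U, V)) in enumerate(zip(world_pts, image_pts)):
        W[2 * i] = [X, Y, Z, 1, 0, 0, 0, 0, -U * X, -U * Y, -U * Z]
        W[2 * i + 1] = [0, 0, 0, 0, X, Y, Z, 1, -V * X, -V * Y, -V * Z]
        c[2 * i], c[2 * i + 1] = U, V
    return W, c


def dlt_projection_matrix(world_pts, image_pts):
    """Least-squares estimate of the 3x4 projection matrix with p34 fixed to 1."""
    W, c = build_dlt_system(world_pts, image_pts)
    # Same least-squares solution as the pseudo-inverse form, computed via SVD.
    p, *_ = np.linalg.lstsq(W, c, rcond=None)
    return np.append(p, 1.0).reshape(3, 4)


# Synthetic check with made-up values: project points with a known matrix,
# then recover that matrix from the noise-free correspondences.
P_true = np.array([[800.0, 0.0, 320.0, 10.0],
                   [0.0, 800.0, 240.0, 20.0],
                   [0.0, 0.0, 1.0, 1.0]])
rng = np.random.default_rng(0)
world = rng.uniform(-1.0, 1.0, size=(12, 3)) + np.array([0.0, 0.0, 5.0])
homog = np.hstack([world, np.ones((12, 1))])
proj = homog @ P_true.T
image = proj[:, :2] / proj[:, 2:3]

P_est = dlt_projection_matrix(world, image)
```

With 11 unknowns and two equations per point, at least six non-coplanar correspondences are needed for the system to be well determined.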

Finally, we apply the Levenberg-Marquardt algorithm, as presented in Algorithm 1, for the non-linear optimization of the projection matrix **P**.

**Algorithm 1** Levenberg-Marquardt Algorithm

**Input:** DLT-initialized $\mathbf{p_{0}}$, a function $f: R^{m} \to R^{n}$ with $n \geq m$, a measured point in the world coordinate $\mathbf{c} \in R^{n}$, maximum iteration $k_{max}$

**Output:** **p** minimizing $\|\mathbf{c} - \mathbf{Wp}\|$

1. **Initialize**: $k = 0$, $v = 2$, $\mathbf{p} = \mathbf{p_{0}}$, $\mathbf{A} = \mathbf{J}^{T}\mathbf{J}$, $\epsilon_{p} = \mathbf{c} - f(\mathbf{p})$, $\mathbf{g} = \mathbf{J}^{T}\epsilon_{p}$, $\mu = \tau \cdot \max\{a_{ii}\}$, found $= (\|\mathbf{g}\|_{\infty} \leq \epsilon_{1})$
2. **while** (found == false) **and** ($k \lt k_{max}$) **do**
3. $k = k + 1$, solve $(\mathbf{A}+\mu\mathbf{I})\boldsymbol{\delta}_{p} = \mathbf{g}$
4. **if** $\|\boldsymbol{\delta}_{p}\| \leq \epsilon_{2}(\|\mathbf{p}\|+\epsilon_{2})$ **then**
5. found = true
6. **else**
7. $\mathbf{p}_{new} = \mathbf{p} + \boldsymbol{\delta}_{p}$
8. $\rho = (\|\epsilon_{p}\|^{2} - \|\mathbf{c} - f(\mathbf{p}_{new})\|^{2})/(\boldsymbol{\delta}_{p}^{T}(\mu\boldsymbol{\delta}_{p} + \mathbf{g}))$
9. **if** $\rho \gt 0$ **then**
10. found $= (\|\epsilon_{p}\| - \|\mathbf{c} - f(\mathbf{p}_{new})\| \lt \epsilon_{4}\|\epsilon_{p}\|)$
11. $\mathbf{p} = \mathbf{p}_{new}$
12. $\mathbf{A} = \mathbf{J}^{T}\mathbf{J}$, $\epsilon_{p} = \mathbf{c} - f(\mathbf{p})$, $\mathbf{g} = \mathbf{J}^{T}\epsilon_{p}$
13. found = found **or** $(\|\mathbf{g}\|_{\infty} \leq \epsilon_{1})$
14. $\mu = \mu \cdot \max\left(\frac{1}{3}, 1-(2\rho-1)^{3}\right)$, $v = 2$
15. **else**
16. $\mu = \mu \cdot v$, $v = 2v$
17. **end if**
18. **end if**
19. **end while**

Here $\mathbf{J}$ denotes the Jacobian of $f$ with respect to $\mathbf{p}$. With $\epsilon_{p} = \mathbf{c} - f(\mathbf{p})$ and $\mathbf{g} = \mathbf{J}^{T}\epsilon_{p}$, the step in line 3 is a descent direction and the denominator of $\rho$ in line 8 is positive.
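A compact NumPy sketch of this damped iteration is given below. It is illustrative rather than the authors' implementation, and it rests on several assumptions. $\mathbf{J}$ is taken as the Jacobian of $f$ and approximated by forward differences. With $\epsilon_{p} = \mathbf{c} - f(\mathbf{p})$ and $\mathbf{g} = \mathbf{J}^{T}\epsilon_{p}$, the step is obtained from $(\mathbf{A}+\mu\mathbf{I})\boldsymbol{\delta}_{p} = \mathbf{g}$, which is the descent direction under these definitions. The relative-improvement test and the gradient test are combined with a logical or. Both residual norms in the gain ratio are squared. The default values of `tau`, `eps1`, `eps2`, and `eps4` are placeholders, since the text does not state them.

```python
import numpy as np


def numerical_jacobian(f, p, step=1e-6):
    """Forward-difference Jacobian of f at p (an analytic Jacobian also works)."""
    f0 = f(p)
    J = np.zeros((f0.size, p.size))
    for j in range(p.size):
        dp = np.zeros_like(p)
        dp[j] = step * max(1.0, abs(p[j]))
        J[:, j] = (f(p + dp) - f0) / dp[j]
    return J


def levenberg_marquardt(f, p0, c, k_max=100, tau=1e-3,
                        eps1=1e-10, eps2=1e-10, eps4=1e-12):
    """Minimize ||c - f(p)|| with the damping schedule of Algorithm 1."""
    p = np.asarray(p0, dtype=np.float64).copy()
    c = np.asarray(c, dtype=np.float64)
    k, v = 0, 2.0
    J = numerical_jacobian(f, p)
    A, eps_p = J.T @ J, c - f(p)
    g = J.T @ eps_p
    mu = tau * np.max(np.diag(A))
    found = np.max(np.abs(g)) <= eps1
    while not found and k < k_max:
        k += 1
        # Descent step for eps_p = c - f(p): solve (A + mu I) delta = g.
        delta = np.linalg.solve(A + mu * np.eye(p.size), g)
        if np.linalg.norm(delta) <= eps2 * (np.linalg.norm(p) + eps2):
            found = True
        else:
            p_new = p + delta
            eps_new = c - f(p_new)
            # Gain ratio: actual over predicted reduction of the squared error.
            rho = (eps_p @ eps_p - eps_new @ eps_new) / (delta @ (mu * delta + g))
            if rho > 0:
                small_gain = (np.linalg.norm(eps_p) - np.linalg.norm(eps_new)
                              < eps4 * np.linalg.norm(eps_p))
                p = p_new
                J = numerical_jacobian(f, p)
                A, eps_p = J.T @ J, eps_new
                g = J.T @ eps_p
                found = small_gain or np.max(np.abs(g)) <= eps1
                mu *= max(1.0 / 3.0, 1.0 - (2.0 * rho - 1.0) ** 3)
                v = 2.0
            else:
                # Rejected step: increase damping and retry.
                mu *= v
                v *= 2.0
    return p


# Toy usage (not the calibration problem): recover a, b in y = a * exp(b * t).
t = np.linspace(0.0, 1.0, 10)
model = lambda p: p[0] * np.exp(p[1] * t)
p_hat = levenberg_marquardt(model, p0=[1.0, 0.0], c=model(np.array([2.0, -1.0])))
```

For the calibration itself, `f` would map the 11 free entries of **P** to the stacked reprojected pixel coordinates, `c` would hold the detected corner coordinates, and `p0` would be the DLT estimate.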

## References

[^1]: S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, “A comparison and evaluation of multi-view stereo reconstruction algorithms,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2006, pp. 519–528, doi: 10.1109/CVPR.2006.19. [IEEE](https://ieeexplore.ieee.org/document/1640800) [Google Scholar](https://scholar.google.com/scholar?as_q=A+comparison+and+evaluation+of+multi-view+stereo+reconstruction+algorithms&as_occt=title&hl=en&as_sdt=0%2C31)

[^2]: D. Scharstein, R. Szeliski, and R. Zabih, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” in Proc. IEEE Workshop Stereo Multi-Baseline Vis. (SMBV), Dec. 2001, pp. 131–140, doi: 10.1109/SMBV.2001.988771. [IEEE](https://ieeexplore.ieee.org/document/988771) [Google Scholar](https://scholar.google.com/scholar?as_q=A+taxonomy+and+evaluation+of+dense+two-frame+stereo+correspondence+algorithms&as_occt=title&hl=en&as_sdt=0%2C31)

[^3]: X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang, “On building an accurate stereo matching system on graphics hardware,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCV Workshops), Nov. 2011, pp. 467–474, doi: 10.1109/ICCVW.2011.6130280. [IEEE](https://ieeexplore.ieee.org/document/6130280) [Google Scholar](https://scholar.google.com/scholar?as_q=On+building+an+accurate+stereo+matching+system+on+graphics+hardware&as_occt=title&hl=en&as_sdt=0%2C31)

[^4]: H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. S. Kweon, “Accurate depth map estimation from a lenslet light field camera,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1547–1555, doi: 10.1109/CVPR.2015.7298762. [IEEE](https://ieeexplore.ieee.org/document/7298762) [Google Scholar](https://scholar.google.com/scholar?as_q=Accurate+depth+map+estimation+from+a+lenslet+light+field+camera&as_occt=title&hl=en&as_sdt=0%2C31)

[^5]: J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, “High quality depth map upsampling for 3D-TOF cameras,” in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 1623–1630, doi: 10.1109/ICCV.2011.6126423. [IEEE](https://ieeexplore.ieee.org/document/6126423) [Google Scholar](https://scholar.google.com/scholar?as_q=High+quality+depth+map+upsampling+for+3D-TOF+cameras&as_occt=title&hl=en&as_sdt=0%2C31)

[^6]: S. B. Gokturk, H. Yalcin, and C. Bamji, “A time-of-flight depth sensor–system description, issues and solutions,” in Proc. Conf. Comput. Vis. Pattern Recognit. Workshop, 2004, p. 35, doi: 10.1109/CVPR.2004.291. [IEEE](https://ieeexplore.ieee.org/document/1384826) [Google Scholar](https://scholar.google.com/scholar?as_q=A+time-of-flight+depth+sensor%E2%80%93system+description%2C+issues+and+solutions&as_occt=title&hl=en&as_sdt=0%2C31)

[^7]: B. Schwarz, “Mapping the world in 3D,” Nature Photon., vol. 4, no. 7, pp. 429–430, Jul. 2010, doi: 10.1038/nphoton.2010.148. [DOI](https://doi.org/10.1038/nphoton.2010.148) [Google Scholar](https://scholar.google.com/scholar?as_q=Mapping+the+world+in+3D&as_occt=title&hl=en&as_sdt=0%2C31)

[^8]: C. Niclass, M. Soga, H. Matsubara, S. Kato, and M. Kagami, “A 100-m range 10-frame/s 340 × 96-pixel time-of-flight depth sensor in 0.18-μm CMOS,” IEEE J. Solid-State Circuits, vol. 48, no. 2, pp. 559–572, Feb. 2013, doi: 10.1109/JSSC.2012.2227607. [IEEE](https://ieeexplore.ieee.org/document/6387335) [Google Scholar](https://scholar.google.com/scholar?as_q=A+100-m+range+10-frame%2Fs+340+%C3%97+96-pixel+time-of-flight+depth+sensor+in+0.18-%CE%BC+m+CMOS&as_occt=title&hl=en&as_sdt=0%2C31)

[^9]: C.-T. Chiu, Y.-C. Ding, W.-C. Lin, W.-J. Chen, S.-Y. Wu, C.-T. Huang, C.-Y. Lin, C.-Y. Chang, M.-J. Lee, S. Tatsunori, T. Chen, F.-Y. Lin, and Y.-H. Huang, “Chaos LiDAR based RGB-D face classification system with embedded CNN accelerator on FPGAs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 12, pp. 4847–4859, Dec. 2022, doi: 10.1109/TCSI.2022.3190430. [IEEE](https://ieeexplore.ieee.org/document/9837463) [Google Scholar](https://scholar.google.com/scholar?as_q=Chaos+LiDAR+based+RGB-D+face+classification+system+with+embedded+CNN+accelerator+on+FPGAs&as_occt=title&hl=en&as_sdt=0%2C31)

[^10]: T.-W. Hui, C. C. Loy, and X. Tang, “Depth map super-resolution by deep multi-scale guidance,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Jan. 2016, pp. 353–369. [DOI](https://doi.org/10.1007/978-3-319-46487-9_22) [Google Scholar](https://scholar.google.com/scholar?as_q=Depth+map+super-resolution+by+deep+multi-scale+guidance&as_occt=title&hl=en&as_sdt=0%2C31)

[^11]: X. Song, Y. Dai, D. Zhou, L. Liu, W. Li, H. Li, and R. Yang, “Channel attention based iterative residual learning for depth map super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5630–5639, doi: 10.1109/CVPR42600.2020.00567. [IEEE](https://ieeexplore.ieee.org/document/9156284) [Google Scholar](https://scholar.google.com/scholar?as_q=Channel+attention+based+iterative+residual+learning+for+depth+map+super-resolution&as_occt=title&hl=en&as_sdt=0%2C31)

[^12]: B. Chen and C. Jung, “Single depth image super-resolution using convolutional neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 1473–1477, doi: 10.1109/ICASSP.2018.8462043. [IEEE](https://ieeexplore.ieee.org/document/8462043) [Google Scholar](https://scholar.google.com/scholar?as_q=Single+depth+image+super-resolution+using+convolutional+neural+networks&as_occt=title&hl=en&as_sdt=0%2C31)

[^13]: Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Joint image filtering with deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1909–1923, Aug. 2019, doi: 10.1109/TPAMI.2018.2890623. [IEEE](https://ieeexplore.ieee.org/document/8598855) [Google Scholar](https://scholar.google.com/scholar?as_q=Joint+image+filtering+with+deep+convolutional+networks&as_occt=title&hl=en&as_sdt=0%2C31)

[^14]: X. Ye, X. Duan, and H. Li, “Depth super-resolution with deep edge-inference network and edge-guided depth filling,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 1398–1402, doi: 10.1109/ICASSP.2018.8461357. [IEEE](https://ieeexplore.ieee.org/document/8461357) [Google Scholar](https://scholar.google.com/scholar?as_q=Depth+super-resolution+with+deep+edge-inference+network+and+edge-guided+depth+filling&as_occt=title&hl=en&as_sdt=0%2C31)

[^15]: W. Zhou, X. Li, and D. Reynolds, “Guided deep network for depth map super-resolution: How much can color help?,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 1457–1461, doi: 10.1109/ICASSP.2017.7952398. [IEEE](https://ieeexplore.ieee.org/document/7952398) [Google Scholar](https://scholar.google.com/scholar?as_q=Guided+deep+network+for+depth+map+super-resolution%3A+How+much+can+color+help%3F&as_occt=title&hl=en&as_sdt=0%2C31)

[^16]: J.-D. Chen, H.-L. Ho, H.-L. Tsay, Y.-L. Lee, C.-A. Yang, K.-W. Wu, J.-L. Sun, D.-J. Tsai, and F.-Y. Lin, “3D chaos lidar system with a pulsed master oscillator power amplifier scheme,” Opt. Exp., vol. 29, no. 17, pp. 27871–27881, 2021. [DOI](https://doi.org/10.1364/OE.433036) [Google Scholar](https://scholar.google.com/scholar?as_q=3D+chaos+lid%7Ba%7Dr+system+with+a+pulsed+master+oscillator+power+amplifier+scheme&as_occt=title&hl=en&as_sdt=0%2C31)

[^17]: D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using structured light,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2003, pp. I-195–I-202. [IEEE](https://ieeexplore.ieee.org/document/1211354) [Google Scholar](https://scholar.google.com/scholar?as_q=High-accuracy+stereo+depth+maps+using+structured+light&as_occt=title&hl=en&as_sdt=0%2C31)

[^18]: N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Jan. 2012, pp. 746–760. [DOI](https://doi.org/10.1007/978-3-642-33715-4_54) [Google Scholar](https://scholar.google.com/scholar?as_q=Indoor+segmentation+and+support+inference+from+RGBD+images&as_occt=title&hl=en&as_sdt=0%2C31)

[^19]: A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, Sep. 2013. [DOI](https://doi.org/10.1177/0278364913491297) [Google Scholar](https://scholar.google.com/scholar?as_q=Vision+meets+robotics%3A+The+KITTI+dataset&as_occt=title&hl=en&as_sdt=0%2C31)

[^20]: Mirrorcle Technologies. Accessed: Dec. 15, 2024. [Online]. Available: https://mirrorcletech.com [Google Scholar](https://scholar.google.com/scholar?as_q=Mirrorcle+Technologies&as_occt=title&hl=en&as_sdt=0%2C31)

[^21]: Hamamatsu Photonics. Accessed: Dec. 15, 2024. [Online]. Available: https://www.hamamatsu.com [Google Scholar](https://scholar.google.com/scholar?as_q=Hamamatsu+Photonics&as_occt=title&hl=en&as_sdt=0%2C31)

[^22]: Ultimems. Accessed: Dec. 15, 2024. [Online]. Available: http://www.ultimems.com/ [Google Scholar](https://scholar.google.com/scholar?as_q=Ultimems&as_occt=title&hl=en&as_sdt=0%2C31)

[^23]: Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, Nov. 2000, doi: 10.1109/34.888718. [IEEE](https://ieeexplore.ieee.org/document/888718) [Google Scholar](https://scholar.google.com/scholar?as_q=A+flexible+new+technique+for+camera+calibration&as_occt=title&hl=en&as_sdt=0%2C31)

[^24]: A. Geiger, F. Moosmann, Ö. Car, and B. Schuster, “Automatic camera and range sensor calibration using a single shot,” in Proc. IEEE Int. Conf. Robot. Autom., Saint Paul, MN, USA, May 2012, pp. 3936–3943, doi: 10.1109/ICRA.2012.6224570. [IEEE](https://ieeexplore.ieee.org/document/6224570) [Google Scholar](https://scholar.google.com/scholar?as_q=Automatic+camera+and+range+sensor+calibration+using+a+single+shot&as_occt=title&hl=en&as_sdt=0%2C31)

[^25]: Blender Online Community. Blender—A 3D Modelling and Rendering Package. Accessed: Dec. 15, 2024. [Online]. Available: http://www.blender.org [Google Scholar](https://scholar.google.com/scholar?as_q=Blender%E2%80%94A+3D+Modelling+and+Rendering++Package&as_occt=title&hl=en&as_sdt=0%2C31)

[^26]: M. Gschwandtner, R. Kwitt, A. Uhl, and W. Pree, BlenSor: Blender Sensor Simulation Toolbox. Berlin, Germany: Springer, 2011, pp. 199–208. [DOI](https://doi.org/10.1007/978-3-642-24031-7_20) [Google Scholar](https://scholar.google.com/scholar?as_q=BlenSor%3A+Blender+Sensor+Simulation+Toolbox&as_occt=title&hl=en&as_sdt=0%2C31)

[^27]: R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. Int. Joint Conferences Artif. Intell. (IJCAI), San Francisco, CA, USA, Aug. 1995, pp. 1137–1143. [DOI](https://doi.org/10.1067/mod.2000.109032) [Google Scholar](https://scholar.google.com/scholar?as_q=A+study+of+cross-validation+and+bootstrap+for+accuracy+estimation+and+model+selection&as_occt=title&hl=en&as_sdt=0%2C31)

[^28]: P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, and G. Ranzuglia, “MeshLab: An open-source mesh processing tool,” in Proc. 6th Eurographics Italian Chapter Conf., Jan. 2008, pp. 129–136. [DOI](https://doi.org/10.2312/LocalChapterEvents/ItalChap/ItalianChapConf2008/129-136) [Google Scholar](https://scholar.google.com/scholar?as_q=MeshLab%3A+An+open-source+mesh+processing+tool&as_occt=title&hl=en&as_sdt=0%2C31)

[^29]: R. Schmidt and K. Singh, “Meshmixer: An interface for rapid mesh composition,” in Proc. SIGGRAPH, New York, NY, USA, 2010, p. 1, doi: 10.1145/1837026.1837034. [DOI](https://doi.org/10.1145/1837026.1837034) [Google Scholar](https://scholar.google.com/scholar?as_q=Meshmixer%3A+An+interface+for+rapid+mesh+composition&as_occt=title&hl=en&as_sdt=0%2C31)

[^30]: Intel RealSense Product Family D400 Series Datasheet. Accessed: Dec. 15, 2024. [Online]. Available: https://dev.intelrealsense.com/docs/intel-realsensed400-series-product-family-datasheet [Google Scholar](https://scholar.google.com/scholar?as_q=Intel+RealSense+Product+Family+D400+Series+Datasheet&as_occt=title&hl=en&as_sdt=0%2C31)

### Additional References