# Employing FPGA to Implement NN Search in 3D-LiDAR: A Focus on Cache Architecture

## Abstract

3-D Light Detection and Ranging (3-D LiDAR) sensors are essential for autonomous vehicle functions such as localization, sensing, and mapping. However, their significant computational processing requirement remains a major drawback. Although high-end GPUs and CPUs could meet this requirement, their prohibitive costs and substantial power requirements hinder the commercial adoption of 3-D LiDAR in vehicles. This paper presents a novel Nearest Neighbour (NN) searching method based on Field-Programmable Gate Arrays (FPGA) for 3-D LiDAR, designed for exceptional efficiency and accuracy. The method aims to offer a real-time LiDAR data processing solution, which delivers superior results on both GPU and CPU platforms. The proposed method consists of three parts: LiDAR data pre-processing, hardware-accelerated NN search, and an efficient caching architecture for point cloud data. The LiDAR processing methods were successfully implemented on the proposed FPGA platform. Experimental results show that our custom test board can accelerate NN searching beyond the capabilities of CPUs. Finally, the proposed system offers real-time functionality with a power consumption of just 2.8 W. Its precision compares favorably with software equivalents and even with the latest LiDAR data processing techniques.

## Authors

Yueze Liu *School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China*

Yihong Tian *Advanced Technology Research Institute, Beijing Institute of Technology, Beijing, China*

Hongwei Yang *Advanced Technology Research Institute, Beijing Institute of Technology, Beijing, China*

Yaohan Jia *School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China*

Zhanhao Bu *School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China*

Xuemei Chen *Advanced Technology Research Institute, Beijing Institute of Technology, Beijing, China*

## Publication Information

**Journal:** 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP) **Year:** 2024 **Pages:** 1-6 **DOI:** [10.1109/ICSIDP62679.2024.10868383](https://doi.org/10.1109/ICSIDP62679.2024.10868383) **Article Number:** 10868383

## Metrics

**Total Downloads:** 49

## Funding

- National Natural Science Foundation of China

---

## Keywords

**IEEE Keywords:** Laser radar, Three-dimensional displays, Accuracy, Computer architecture, Data processing, Search problems, Real-time systems, Software, Sensors, Field programmable gate arrays

**Index Terms:** Cache Architecture, Power Consumption, Point Cloud, Real-time Performance, Autonomous Vehicles, Ranging, Caching, Point Cloud Data, Lidar Data, Computation Time, Computational Efficiency, Parallelization, Resource Consumption, Computational Capabilities, Data Frame, Digital Signal Processing, Random Access Memory, Buffer Size, Clock Cycles, ARM Processor, Data Cache, Hardware Architecture, Vertical Angle, Computation Latency

**Author Keywords:** LiDAR, FPGA, Nearest Neighbour, Cache Architecture

## SECTION I. Introduction

Mobile robotic platforms, including robots and autonomous vehicles, rely on LiDAR point cloud data to operate efficiently in previously uncharted environments [^1]–[^2]. The localization algorithm builds a spatial representation of the surroundings from landmark position estimates derived from the point cloud data and, at the same time, estimates the robot’s pose and position [^3]. Importantly, all spatial quantities are expressed in a chosen world reference frame, often originating from the robot’s initial coordinates. A significant aspect in the design of a robotic system is the precise determination of the robot’s pose and location, which is especially critical for high-velocity applications such as autonomous vehicles [^4]. The computational speed of the Simultaneous Localization and Mapping (SLAM) [^5], [^6] algorithm is of paramount importance, largely because of the substantial computation required for fast point-to-point minimum-distance searches between two frames of point cloud data to determine the robot’s position [^7], [^8].

A technical bottleneck persists in balancing computational resources and processing speed while ensuring energy-efficient, precise algorithmic results in mobile robotics. The degree of autonomy is often bounded by the power consumption of the predominantly microprocessor-based hardware architecture in use [^9]. To address this, proficient embedded systems design is crucial, offering robust computational capabilities coupled with low power consumption. Modern reconfigurable devices, including FPGAs, comprise configurable slices, reconfigurable architectures, and embedded Digital Signal Processors (DSP) suitable for floating-point applications [^10]. Computationally intensive algorithms can be processed in parallel on FPGAs with floating-point precision, offering an advantage over traditional Advanced RISC Machine (ARM) and GPU platforms [^11]. The specific memory management and reconfigurability of FPGAs provide higher computational efficiency [^12]. At the same level of computational cost, FPGAs also consume less power than GPUs [^13], [^14].

The K-dimensional (K-d) tree is a frequently used method for NN computations, known for its efficiency and low resource consumption. However, due to the large amount and sparse distribution of point cloud data, high memory occupancy is unavoidable. Additionally, because the point cloud changes continuously, the K-d tree must be reconstructed continuously [^15], [^16]. These factors contribute to high resource consumption and low efficiency when deploying NN searching algorithms on FPGAs. The Brute Force Nearest Neighbor (BFNN) method can compute the optimum solution while only needing to store basic data and consume minimal space. However, this method requires a point-by-point comparison between the target and all candidate points to achieve 100% accuracy. Though accurate, this process is computationally intensive and often leads to computational latency. This study aims to alleviate these constraints. Drawing on the strengths and weaknesses of the aforementioned algorithms, we propose an FPGA-based NN search algorithm that focuses on cache architecture. This design, implemented on FPGA, capitalizes on the high parallelism of the search process and supports backend applications.

We propose an NN searching algorithm for efficient FPGA implementation that mitigates computational imbalance and latency with an energy-efficient data cache processing architecture. The approach includes an evolved BFNN algorithm that uses a Nearest Neighbor Threshold (NNT) to increase NN searching accuracy. A mechanism leveraging the NNT reduces the time lost to repeated iterations after unsuccessful matches [^17], [^18].

## SECTION II. Design optimization

In this section, we detail the process of searching NN values. This includes developing the computation framework within FPGAs, processing raw LiDAR data [^19], accelerating the NN search, and configuring the matching algorithm to solve NN issues. Our explanation is primarily centered on enhancing understanding within an academic context.

### A. LiDAR data pre-processing

The software driver of the RoboSense LiDAR sensor has been revised and embedded into the on-chip processing system, where programmable logic functions as a bespoke hardware accelerator. The Main Data Stream Output Protocol (MSOP) and Device Information Output Protocol (DIFOP) provide the structure for the LiDAR data. DIFOP includes vertical angle calibration (*λch*), horizontal angle calibration (*µch*), and installation error parameters (*γx*, *γy*). The precision of the data after error mitigation hinges on these parameters, which are fixed for each LiDAR sensor. As such, they only need to be computed once at system startup and then stored within the on-chip Block Random Access Memory (BRAM). This approach boosts computational efficiency significantly. The LiDAR data conversion formulas are as follows:

$$
\begin{equation*}\begin{array}{l} X = r \cdot \cos {\omega _{ch}} \cdot \cos \left( {{\alpha _\theta } + {\lambda _{ch}} - {\mu _{ch}}} \right) + {\gamma _x} \cdot \cos {\omega _{ch}} \\ Y = - r \cdot \cos {\omega _{ch}} \cdot \sin \left( {{\alpha _\theta } + {\lambda _{ch}} - {\mu _{ch}}} \right) + {\gamma _x} \cdot \sin {\omega _{ch}} \\ Z = r \cdot \sin {\omega _{ch}} + {\gamma _z} \cdot \sin {\omega _{ch}} \end{array} \end{equation*}
$$

In this context, *r* represents the quantified distance while *ω* is indicative of the LiDAR’s vertical angle, and *α* signifies the horizontal rotational angle of the LiDAR unit. The Cartesian projections of the coordinates are denoted as *X, Y, Z.*

To leverage the inherent parallel computation capabilities of FPGA hardware and to optimize storage usage, the original equation is reformulated as follows:

$$
\begin{equation*}\begin{array}{l} X = r \cdot \cos {\omega _{ch}} \cdot \left( {\left( {\cos {\alpha _\theta }\cos {\lambda _{ch}} - \sin {\alpha _\theta }\sin {\lambda _{ch}}} \right)\cos {\mu _{ch}} + } \right. \\ \left. {\left( {\sin {\alpha _\theta }\cos {\lambda _{ch}} + \cos {\alpha _\theta }\sin {\lambda _{ch}}} \right)\sin {\mu _{ch}}} \right) + {\gamma _x} \cdot \cos {\omega _{ch}} \\ Y = - r \cdot \cos {\omega _{ch}} \cdot \left( {\left( {\sin {\alpha _\theta }\cos {\lambda _{ch}} + \cos {\alpha _\theta }\sin {\lambda _{ch}}} \right)\cos {\mu _{ch}}} \right. \\ \left. { - \left( {\cos {\alpha _\theta }\cos {\lambda _{ch}} - \sin {\alpha _\theta }\sin {\lambda _{ch}}} \right)\sin {\mu _{ch}}} \right) + {\gamma _x} \cdot \sin {\omega _{ch}} \\ Z = r \cdot \sin {\omega _{ch}} + {\gamma _z} \cdot \sin {\omega _{ch}} \end{array} \end{equation*}
$$

Because the LiDAR rotates at a pre-established frequency and period, the horizontal angle *α* takes the same set of values in every revolution. Similarly, the vertical emission angle *ω*, though dependent on the specific LiDAR model, is fixed for each channel. As the previous equation implies, the cosine and sine coefficients can therefore be computed during system initialization and pre-stored for later use. Real-time processing then needs only the input *r*, fully capitalizing on the inherent parallel computing abilities of the FPGA. With this optimization, the equation executes within just three clock cycles, which matters for the NN search that follows.
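The precomputation described above can be illustrated with a short software sketch. It is not the authors' Verilog design: the function names `build_coefficients` and `to_cartesian`, the table layout, and the use of floating point are assumptions made for illustration. The sketch evaluates the compact form of the conversion formulas once per (azimuth step, channel) pair, so that the per-point work reduces to three multiplications and three additions on the measured distance *r*.

```python
import math

def build_coefficients(azimuths_deg, channels, gamma_x, gamma_z):
    """Precompute conversion coefficients once at start-up.

    channels: list of (omega_deg, lambda_deg, mu_deg), one tuple per laser
    channel (hypothetical layout). Returns a dict keyed by
    (azimuth_index, channel_index).
    """
    table = {}
    for a_idx, alpha in enumerate(azimuths_deg):
        for ch, (omega, lam, mu) in enumerate(channels):
            w = math.radians(omega)
            ang = math.radians(alpha + lam - mu)
            table[(a_idx, ch)] = (
                math.cos(w) * math.cos(ang),   # kx: multiplies r in X
                -math.cos(w) * math.sin(ang),  # ky: multiplies r in Y
                math.sin(w),                   # kz: multiplies r in Z
                gamma_x * math.cos(w),         # constant offset in X
                gamma_x * math.sin(w),         # constant offset in Y
                gamma_z * math.sin(w),         # constant offset in Z
            )
    return table

def to_cartesian(r, coeff):
    """Run-time step: only the distance r changes from point to point."""
    kx, ky, kz, ox, oy, oz = coeff
    return (r * kx + ox, r * ky + oy, r * kz + oz)
```

For a sensor stepping 0.4 degrees horizontally, `azimuths_deg` would hold 900 entries per revolution; the table size then scales with the number of channels.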

The coordinate conversion is implemented on the FPGA in fixed-point arithmetic [^20]: each multiplication step is computed in combinational logic, while sequential (clocked) logic is reserved for the additions. This exploits the inherent parallelism of FPGAs and transforms the coordinates of a single point in just three clock cycles, or 15 ns at a clock rate of 200 MHz. This is a full order of magnitude faster than on CPUs.
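As a rough illustration of the fixed-point approach and of the latency figure, the sketch below emulates a multiply-add in a Q16.16 format and converts a cycle count into nanoseconds. The Q16.16 word length is an assumption (the paper does not state the format used), and the helper names are hypothetical.

```python
FRAC_BITS = 16  # assumed Q16.16 format; the actual word length is not given in the paper

def to_fixed(x):
    """Quantize a real value to a fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(q):
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << FRAC_BITS)

def fx_mul(a, b):
    """Fixed-point multiply: full-width integer product, then rescale."""
    return (a * b) >> FRAC_BITS

def fx_mac(r, k, offset):
    """One coordinate: r * k + offset, all operands in fixed point."""
    return fx_mul(r, k) + offset

def latency_ns(cycles, clock_hz):
    """Latency of a pipeline of `cycles` clock cycles at `clock_hz`."""
    return cycles / clock_hz * 1e9
```

With these helpers, `latency_ns(3, 200e6)` reproduces the 15 ns quoted for three cycles at 200 MHz.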

Once the LiDAR data has been converted into Cartesian point cloud coordinates, a pose transformation must be applied to the single-frame point cloud so that the nearest-neighbour points can be identified effectively. To achieve data correspondence efficiently, we use a rotation matrix *R* and a translation *t* as parameters. The equation is as follows:

$$
\begin{equation*}R\left[ {\begin{array}{l} {{X_i}} \\ {{Y_i}} \\ {{Z_i}} \end{array}} \right] + \left[ {\begin{array}{l} {{t_x}} \\ {{t_y}} \\ {{t_z}} \end{array}} \right] = \left[ {\begin{array}{l} {{X_o}} \\ {{Y_o}} \\ {{Z_o}} \end{array}} \right]\end{equation*}
$$

In the equation, *Xi, Yi, Zi* represent the data after compensation for the LiDAR installation error, whereas *Xo, Yo, Zo* denote the data after the pose transformation.
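A minimal software sketch of this rigid transformation follows. The helper `rot_z`, which builds a rotation about the vertical axis only, is a hypothetical example of how *R* might be obtained; in the described system *R* and *t* come from the downstream algorithm.

```python
import math

def rot_z(yaw_rad):
    """Example rotation matrix: rotation about the Z axis by yaw_rad."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def transform_point(R, t, p):
    """Apply p_out = R * p + t to one 3-D point given as (X, Y, Z)."""
    return tuple(
        sum(R[row][col] * p[col] for col in range(3)) + t[row]
        for row in range(3)
    )
```

Each output coordinate needs three multiplications and three additions that are independent of the other two coordinates, which is why the operation maps naturally onto parallel hardware.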

### B. Data cache architecture

The principle of LiDAR laser emission is shown in Fig. 1. The sensor rotates horizontally in steps of 0.4 degrees and, at each step, sends the echo distances of the vertically arranged lasers to the receiving end via the Ethernet interface. Because our system deals with the raw LiDAR data directly, it can process and store the original data according to the laser data output format. It takes the data at 0 degrees as the starting data for each frame, ensuring that every frame starts with the laser echoes from the same direction. Furthermore, it stores data sequentially, so when searching against the Nearest Neighbor Threshold (NNT), there is no need for sorting, and the NN value can be located rapidly. The overall storage architecture is shown in Fig. 2.

![Figure 1](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu1-p6-liu-large.gif)

*Fig. 1: 3D-LiDAR schematic.*

After preprocessing, the original LiDAR data is stored in Double Data Rate Synchronous Dynamic Random-Access Memory (DDR SDRAM). Using the write channel switch and storage management architecture, data is allocated to pre-planned address areas and tagged with a "Data Mark". Post-algorithm processing determines whether a frame serves as base data, in which case it receives a "Base Mark", and an overall "All Mark" is applied to the address area where data is stored. Because the system is in motion, new data arrives with a changed pose; subsequent algorithms generate *R, t*, which are used to apply a pose transformation to the new data. This ensures that the NN search can be performed on the point cloud data. The base data and new data are transposed before being input into the parallel matrix for computation and NN value retrieval. When the algorithmic computation for one frame has finished, the frame is assessed for eligibility as "Base" data; if it does not qualify, the data and its mark are cleared to free up storage space.

To keep the parallel computation matrix running efficiently, the system reads data directly from the cache, performs the rotational computations, and writes the resulting data into a First-In-First-Out (FIFO) pre-storage buffer. This allows low-latency, direct input to the parallel computational array, improving computational efficiency. When deploying within an FPGA, we must define the NNT value, which represents the cut-off at which the computation for the current point stops and new data is input for further calculation. Defining an NNT value, however, does not guarantee that the result will be the absolute NN value. Therefore, based on the LiDAR input mode and laser rotational pattern, the system stores data as an array. By continuing to search and compute over the following 50 stored points, it can ensure that a closer NN point is located, as illustrated in Fig. 3. First, the system finds the red point, which meets the NNT but is not the closest. The subsequent green points all meet the NNT, whereas only the yellow point corresponds to the absolute NN. In addition, our method uses a polling approach to read the 'base' data and the 'data', finding suitable data close to the existing NN value. Hence, this method can provide a higher-precision NN value for subsequent algorithms, reducing system computation and iteration time.
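One possible reading of this thresholded search is sketched below in sequential Python. It is an interpretation rather than the authors' hardware logic: the function name `nnt_search`, the use of squared Euclidean distance, and the fallback to a full scan when no point meets the threshold are all assumptions. The scan follows storage order; once a stored point falls within the NNT, the next `lookahead` points (50 in the text) are still examined and the closest point seen so far is returned.

```python
def nnt_search(query, base, nnt, lookahead=50):
    """Scan `base` in storage order and return (index, squared_distance).

    After the first point within the threshold `nnt`, only the next
    `lookahead` stored points are examined before the search stops.
    """
    best_idx, best_d2 = -1, float("inf")
    stop_at = None          # index at which the scan ends, set on first hit
    nnt2 = nnt * nnt        # compare squared distances to avoid a square root
    for idx, pt in enumerate(base):
        d2 = sum((q - b) ** 2 for q, b in zip(query, pt))
        if d2 < best_d2:
            best_idx, best_d2 = idx, d2
        if stop_at is None and d2 <= nnt2:
            stop_at = idx + lookahead
        if stop_at is not None and idx >= stop_at:
            break
    return best_idx, best_d2
```

With `lookahead=0` the function returns the first point that meets the threshold, which corresponds to the red point in the description of Fig. 3; a positive look-ahead lets the scan reach a closer point stored shortly after it.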

### C. Hardware-accelerated NN search

To enhance computational efficiency, the system preloads the required base data and the real-time input point cloud data and processes these data sets in parallel, as shown in Fig. 4. In any given clock cycle, the number of computations executed equals the size of the array, *i*∗*k*, where *i* denotes the size of the query base buffer and *k* the size of the data buffer. Base data is updated in real time. After the NN computation for a single point is complete, the system continues with the point cloud data previously stored in the FIFO. Because the system only needs to find the NN value, the worst-case computation time for a frame depends on the quantity of 'base' data: given the 'base' quantity *n*, the number of points in a single frame *p*, and the computational frequency *Fre*, the maximum computation time is $t = \frac{{(p*n)}}{{(i*k)}}*(1/Fre).$ Taking the single-frame point count of a 16-line LiDAR as *p* = 14,400, with 14,400 points per 'base' frame and at most *nmax* = 5 'base' frames, a frequency *Fre* = 200 MHz, a query base buffer size *i* = 20, and a data buffer size *k* = 10, the longest computation time is about *t* = 25 ms. Given that the output frequency of the LiDAR is 10 Hz, so that a single-frame input update takes 100 ms, this method fully satisfies the real-time requirements of subsequent algorithms. In practice, the computation time can be reduced further owing to the search method based on storage sequence.
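The worst-case figure can be checked with a few lines of arithmetic. The sketch assumes that *n* counts all base points, that is, five base frames of 14,400 points each; this reading is an assumption, since the text does not spell it out. Under it, the formula evaluates to roughly 25.9 ms, in line with the approximately 25 ms quoted and well inside the 100 ms frame period.

```python
def worst_case_time_s(p, n, i, k, fre_hz):
    """t = (p * n) / (i * k) * (1 / Fre), in seconds."""
    return (p * n) / (i * k) / fre_hz

p = 14_400            # points in one incoming frame (16-line LiDAR)
n = 5 * 14_400        # assumption: 5 base frames of 14,400 points each
t = worst_case_time_s(p, n, i=20, k=10, fre_hz=200e6)
frame_period_s = 1 / 10   # 10 Hz LiDAR output rate
```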

## SECTION III. Implementation and experimental results

In this work, we implemented the NN search algorithm in Verilog and built it with Xilinx Vitis 2022.1, including synthesis and place-and-route. We selected a customized development platform (illustrated in Fig. 5) as the target device, in line with our aim of assessing the viability of low-cost FPGAs in resource-constrained scenarios. The custom development board is built around the Xilinx XC7K325T-FFG900, a Kintex-7 family device. It provides 326,080 logic cells and 840 DSP48 slices, and is equipped with 2 GB of DDR3 DRAM.

![Figure 2](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu2-p6-liu-large.gif)

*Fig. 2: DDR Cache-based Parallel Data Processing Architecture Diagram.*

![Figure 3](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu3-p6-liu-large.gif)

*Fig. 3: NN Search Cache Architecture Diagram.*

![Figure 4](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu4-p6-liu-large.gif)

*Fig. 4: Parallel Search Array Computation Flowchart.*

### A. Analysis of NN strategy

Table I shows the resource utilization of the proposed HA-BFNN algorithm on the hardware platform at a frequency of 200 MHz and a parallel array size of 20∗10.

![Table I](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu.t1-p6-liu-large.gif)

*TABLE I:*

The worst-case computation time was derived in Section II-C, that is, $t = \frac{{(p*n)}}{{(i*k)}}*(1/Fre)$. In practical applications, however, the NNT setting can substantially speed up the acquisition of the NN values. The following subsections report the time consumed by each iterative computation and use these data to balance resource utilization against speed.

![Figure 5](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu5-p6-liu-large.gif)

*Fig. 5: Customized platform.*

![Figure 6](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu6-p6-liu-large.gif)

*Fig. 6: Statistics on the number of calculations*

### B. Execution time breakdown and comparative

Table II shows the execution time breakdowns of the K-d tree, BFNN, and our method on an i9-14900K processor, and of our method on the FPGA.

![Table II](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/10867835/10867841/10868383/liu.t2-p6-liu-large.gif)

*TABLE II:*

The performance gain of our approach stems from the memory management model and parallelization strategy of our algorithm, which build on the hardware architecture outlined in Section II. Table II corroborates the impact: on the CPU, our NN search algorithm achieves an up to 900-fold speed-up over the brute-force NN search. Nonetheless, the K-d tree gains the upper hand on the CPU once the number of required algorithmic iterations exceeds five. In contrast, the FPGA-based NN search algorithm outperforms the K-d tree algorithm, which must first be constructed on the CPU. The K-d tree becomes competitive only when the number of required iterations reaches about ten; however, the data can converge within five iterations, and further iterations do not notably improve the precision of the overall metrics. Therefore, the FPGA-based NN search algorithm boosts system efficiency.

### C. Power consumption

We measured the power draw of the complete customized board with a wattmeter. The reading varied between 2.5 and 2.8 W while running the NN system. These figures include the power consumption of the other peripherals on the board, so the power demand of the NN core alone is below the reported 2.5 to 2.8 W range.

## SECTION IV. Conclusion

This study presents a novel, high-performance NN search method for 3-D LiDAR. The method runs in real time on an FPGA and achieves promising performance. Our proposed NN search algorithm significantly accelerates NN searching, with an improvement of up to 900 times over BFNN methods. Additionally, thanks to the point-by-point search strategy of BFNN, our method demonstrates superior accuracy and faster speeds within five iterations when compared with K-d tree methods. Importantly, our FPGA-based NN search showed performance comparable to software implementations and even surpassed some of the most advanced LiDAR NN processing methods when LiDAR is the only sensor.

## References

[^1]: Y. Jia, X. Yan, and Y. Xu, “A survey of simultaneous localization and mapping for robot,” in 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), vol. 1. IEEE, 2019, pp. 857–861. [IEEE](https://ieeexplore.ieee.org/document/8997820) [Google Scholar](https://scholar.google.com/scholar?as_q=A+survey+of+simultaneous+localization+and+mapping+for+robot&as_occt=title&hl=en&as_sdt=0%2C31)

[^2]: Z. Ren, L. Wang, and L. Bi, “Robust gicp-based 3d lidar slam for underground mining environment,” Sensors, vol. 19, no. 13, p. 2915, 2019. [DOI](https://doi.org/10.3390/s19132915) [Google Scholar](https://scholar.google.com/scholar?as_q=Robust+gicp-based+3d+lidar+slam+for+underground+mining+environment&as_occt=title&hl=en&as_sdt=0%2C31)

[^3]: X. Li, Y. Zhou, and B. Hua, “Study of a multi-beam lidar perception assessment model for real-time autonomous driving,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–15, 2021. [IEEE](https://ieeexplore.ieee.org/document/9475591) [Google Scholar](https://scholar.google.com/scholar?as_q=Study+of+a+multi-beam+lidar+perception+assessment+model+for+real-time+autonomous+driving&as_occt=title&hl=en&as_sdt=0%2C31)

[^4]: O. Jafari, P. Maurya, P. Nagarkar, K. M. Islam, and C. Crushev, “A survey on locality sensitive hashing algorithms and their applications,” arXiv preprint arXiv:2102.08942, 2021. [Google Scholar](https://scholar.google.com/scholar?as_q=A+survey+on+locality+sensitive+hashing+algorithms+and+their+applications&as_occt=title&hl=en&as_sdt=0%2C31)

[^5]: I. A. Kazerouni, L. Fitzgerald, G. Dooly, and D. Toal, “A survey of state-of-the-art on visual slam,” Expert Systems with Applications, vol. 205, p. 117734, 2022. [DOI](https://doi.org/10.1016/j.eswa.2022.117734) [Google Scholar](https://scholar.google.com/scholar?as_q=A+survey+of+state-of-the-art+on+visual+slam&as_occt=title&hl=en&as_sdt=0%2C31)

[^6]: H. Taheri and Z. C. Xia, “Slam; definition and evolution,” Engineering Applications of Artificial Intelligence, vol. 97, p. 104032, 2021. [DOI](https://doi.org/10.1016/j.engappai.2020.104032) [Google Scholar](https://scholar.google.com/scholar?as_q=Slam%3B+definition+and+evolution&as_occt=title&hl=en&as_sdt=0%2C31)

[^7]: T. Chong, X. Tang, C. Leng, M. Yogeswaran, O. Ng, and Y. Chong, “Sensor technologies and simultaneous localization and mapping (slam),” Procedia Computer Science, vol. 76, pp. 174–179, 2015. [DOI](https://doi.org/10.1016/j.procs.2015.12.336) [Google Scholar](https://scholar.google.com/scholar?as_q=Sensor+technologies+and+simultaneous+localization+and+mapping+%28slam%29&as_occt=title&hl=en&as_sdt=0%2C31)

[^8]: M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, “A solution to the simultaneous localization and map building (slam) problem,” IEEE Transactions on robotics and automation, vol. 17, no. 3, pp. 229–241, 2001. [IEEE](https://ieeexplore.ieee.org/document/938381) [Google Scholar](https://scholar.google.com/scholar?as_q=A+solution+to+the+simultaneous+localization+and+map+building+%28slam%29+problem&as_occt=title&hl=en&as_sdt=0%2C31)

[^9]: J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, “Understanding performance differences of fpgas and gpus,” in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 93–96. [IEEE](https://ieeexplore.ieee.org/document/8457638) [Google Scholar](https://scholar.google.com/scholar?as_q=Understanding+performance+differences+of+fpgas+and+gpus&as_occt=title&hl=en&as_sdt=0%2C31)

[^10]: R. Tessier and W. Burleson, “Reconfigurable computing for digital signal processing: A survey,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 28, pp. 7–27, 2001. [DOI](https://doi.org/10.1023/A:1008155020711) [Google Scholar](https://scholar.google.com/scholar?as_q=Reconfigurable+computing+for+digital+signal+processing%3A+A+survey&as_occt=title&hl=en&as_sdt=0%2C31)

[^11]: S. Kaiser, M. S. Haq, A. Ş. Tosun, and T. Korkmaz, “Container technologies for arm architecture: A comprehensive survey of the state-of-the-art,” IEEE Access, vol. 10, pp. 84853–84881, 2022. [IEEE](https://ieeexplore.ieee.org/document/9852232) [Google Scholar](https://scholar.google.com/scholar?as_q=Container+technologies+for+arm+architecture%3A+A+comprehensive+survey+of+the+state-of-the-art&as_occt=title&hl=en&as_sdt=0%2C31)

[^12]: G. Singh, M. Alser, D. S. Cali, D. Diamantopoulos, J. Gómez-Luna, H. Corporaal, and O. Mutlu, “Fpga-based near-memory acceleration of modern data-intensive applications,” IEEE Micro, vol. 41, no. 4, pp. 39–48, 2021. [IEEE](https://ieeexplore.ieee.org/document/9451578) [Google Scholar](https://scholar.google.com/scholar?as_q=Fpga-based+near-memory+acceleration+of+modern+data-intensive+applications&as_occt=title&hl=en&as_sdt=0%2C31)

[^13]: M. Vestias and H. Neto, “Trends of cpu, gpu and fpga for high-performance computing,” in 2014 24th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2014, pp. 1–6. [IEEE](https://ieeexplore.ieee.org/document/6927483) [Google Scholar](https://scholar.google.com/scholar?as_q=Trends+of+cpu%2C+gpu+and+fpga+for+high-performance+computing&as_occt=title&hl=en&as_sdt=0%2C31)

[^14]: R. Tessier, K. Pocek, and A. DeHon, “Reconfigurable computing architectures,” Proceedings of the IEEE, vol. 103, no. 3, pp. 332–354, 2015. [IEEE](https://ieeexplore.ieee.org/document/7086414) [Google Scholar](https://scholar.google.com/scholar?as_q=Reconfigurable+computing+architectures&as_occt=title&hl=en&as_sdt=0%2C31)

[^15]: W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, “Robust monocular slam in dynamic environments,” in 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2013, pp. 209–218. [IEEE](https://ieeexplore.ieee.org/document/6671781) [Google Scholar](https://scholar.google.com/scholar?as_q=Robust+monocular+slam+in+dynamic+environments&as_occt=title&hl=en&as_sdt=0%2C31)

[^16]: T. Kuhara, T. Miyajima, M. Yoshimi, and H. Amano, “An fpga acceleration for the kd-tree search in photon mapping,” in Reconfigurable Computing: Architectures, Tools and Applications: 9th International Symposium, ARC 2013, Los Angeles, CA, USA, March 25-27, 2013. Proceedings 9. Springer, 2013, pp. 25–36. [DOI](https://doi.org/10.1007/978-3-642-36812-7_3) [Google Scholar](https://scholar.google.com/scholar?as_q=An+fpga+acceleration+for+the+kd-tree+search+in+photon+mapping&as_occt=title&hl=en&as_sdt=0%2C31)

[^17]: A. Kosuge, K. Yamamoto, Y. Akamine, T. Yamawaki, and T. Oshima, “A 4.8x faster fpga-based iterative closest point accelerator for object pose estimation of picking robot applications,” IEEE, 2019. [IEEE](https://ieeexplore.ieee.org/document/8735570) [Google Scholar](https://scholar.google.com/scholar?as_q=A+4.8x+faster+fpga-based+iterative+closest+point+accelerator+for+object+pose+estimation+of+picking+robot+applications&as_occt=title&hl=en&as_sdt=0%2C31)

[^18]: M. Magnusson, A. Nuchter, C. Lorken, A. J. Lilienthal, and J. Hertzberg, “Evaluation of 3d registration reliability and speed-a comparison of icp and ndt,” in 2009 IEEE International Conference on Robotics and Automation. IEEE, 2009, pp. 3907–3912. [IEEE](https://ieeexplore.ieee.org/document/5152538) [Google Scholar](https://scholar.google.com/scholar?as_q=Evaluation+of+3d+registration+reliability+and+speed-a+comparison+of+icp+and+ndt&as_occt=title&hl=en&as_sdt=0%2C31)

[^19]: J. An and E. Kim, “Novel vehicle bounding box tracking using a low-end 3d laser scanner,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 6, pp. 3403–3419, 2020. [IEEE](https://ieeexplore.ieee.org/document/9098054) [Google Scholar](https://scholar.google.com/scholar?as_q=Novel+vehicle+bounding+box+tracking+using+a+low-end+3d+laser+scanner&as_occt=title&hl=en&as_sdt=0%2C31)

[^20]: Najjar, A. Walid, Xiaoyin, Roy-Chowdhury, and K. Amit, “Evaluation and acceleration of high-throughput fixed-point object detection on fpgas,” IEEE Transactions on Circuits and Systems for Video Technology, 2015. [IEEE](https://ieeexplore.ieee.org/document/6908986) [Google Scholar](https://scholar.google.com/scholar?as_q=Evaluation+and+acceleration+of+high-throughput+fixed-point+object+detection+on+fpgas&as_occt=title&hl=en&as_sdt=0%2C31)

### Additional References

2. Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, “Deep learning for 3d point clouds: A survey,” *IEEE transactions on pattern analysis and machine intelligence*, vol. 43, no. 12, pp. 4338–4364, 2020.