# TinyFusionDet: Hardware-Efficient LiDAR-Camera Fusion Framework for 3D Object Detection at Edge

## Abstract

Current LiDAR-Camera fusion methods for 3D object detection achieve considerable accuracy at an immense cost in computation and storage, which makes deployment at the edge challenging. To address this issue, we propose a lightweight 3D object detection framework, namely TinyFusionDet. Specifically, we put forward a Hybrid Scale Pillar Strategy in LiDAR point cloud feature extraction to efficiently improve the detection accuracy of small objects. Meanwhile, a low-cost Cross-Modal Heatmap Attention module is presented to suppress background interference in image features and thereby reduce false positives. Moreover, a Cross-Modal Feature Interaction module is designed to enhance cross-modal information fusion among channels, further improving detection precision. Extensive experiments demonstrate that TinyFusionDet achieves competitive accuracy with the lowest memory consumption and inference latency, making it suitable for hardware-constrained edge devices. Furthermore, TinyFusionDet is implemented on a customized FPGA-based prototype system, yielding a record-high energy efficiency of up to 114.97 GOPS/W. To the best of our knowledge, this marks the first real-time LiDAR-Camera fusion detection framework for edge applications.

## Authors

Yishi Li *Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education), School of Microelectronics, Xidian University, Xi’an, China; Chongqing Innovation Research Institute of Integrated Circuits, Xidian University, Chongqing, China* [ORCID: 0000-0002-6600-355X](https://orcid.org/0000-0002-6600-355X)

Fanhong Zeng *Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education), School of Microelectronics, Xidian University, Xi’an, China; Chongqing Innovation Research Institute of Integrated Circuits, Xidian University, Chongqing, China* [ORCID: 0009-0009-2433-2679](https://orcid.org/0009-0009-2433-2679)

Rui Lai *Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education), School of Microelectronics, Xidian University, Xi’an, China; Chongqing Innovation Research Institute of Integrated Circuits, Xidian University, Chongqing, China* [ORCID: 0000-0002-8458-6429](https://orcid.org/0000-0002-8458-6429)

Tong Wu *Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education), School of Microelectronics, Xidian University, Xi’an, China; Chongqing Innovation Research Institute of Integrated Circuits, Xidian University, Chongqing, China* [ORCID: 0009-0000-2766-1966](https://orcid.org/0009-0000-2766-1966)

Juntao Guan *Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education), Xidian University, Hangzhou, China* [ORCID: 0000-0002-1640-6799](https://orcid.org/0000-0002-1640-6799)

Anfu Zhu *School of Electronic Engineering, North China University of Water Resources and Electric Power, Zhengzhou, China*

Zhangming Zhu *Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education), Xidian University, Hangzhou, China* [ORCID: 0000-0002-7764-1928](https://orcid.org/0000-0002-7764-1928)

## Publication Information

**Journal:** IEEE Transactions on Circuits and Systems for Video Technology **Year:** 2025 **Volume:** 35 **Issue:** 9 **Pages:** 8819-8834 **DOI:** [10.1109/TCSVT.2025.3556711](https://doi.org/10.1109/TCSVT.2025.3556711) **Article Number:** 10947105 **ISSN:** Print ISSN: 1051-8215, Electronic ISSN: 1558-2205

## Metrics

**Total Downloads:** 186

## Funding

- National Science and Technology Innovation 2030-Major Projects (Grant: 2021ZD0114400)
- Natural Science Foundation for Young Scientists of Shanxi Province (Grant: 62304162)
- China Postdoctoral Science Foundation (Grant: 2024M762532)
- Postdoctoral Fellowship Program of CPSF (Grant: GZC20241313)
- Shaanxi Provincial Natural Science Foundation for Basic Research Program (Grant: 2024JC-YBMS-794)
- Fundamental Research Funds for the Central Universities (Grant: XJSJ24090)

---

## Keywords

**IEEE Keywords:** Three-dimensional displays, Feature extraction, Accuracy, Point cloud compression, Object detection, Image edge detection, Laser radar, Semantics, Proposals, Hardware

**Index Terms:** Object Detection, Detection Framework, 3D Object Detection, 3D Object Detection Framework, Computational Cost, Image Features, Detection Accuracy, Extensive Experiments, Point Cloud, Attention Module, Small Objects, Fusion Method, Memory Consumption, Prototype System, Edge Devices, Considerable Accuracy, LiDAR Point Clouds, Competitive Accuracy, Point Cloud Features, Objective Accuracy, 3D Detection, Feature Fusion, Voxel-based Methods, Feature Map Size, Feature Maps, Image Feature Extraction, Large Objects, Size Weight, 3D Bounding Box

**Author Keywords:** 3D object detection, LiDAR-camera fusion, tiny machine learning, FPGA

## SECTION I. Introduction

Object detection in the 3D world is a fundamental task in computer vision that plays a crucial role in various real-world applications, including autonomous driving, robotics, and augmented reality [^1], [^2]. Its purpose is to estimate the 3D coordinates and dimensions (length, width, height) of the objects of interest and to provide a 3D bounding box for each of them. With the advancement of LiDAR and cameras, their perception capabilities and reliability have significantly improved. As a result, numerous classical detection methods based on 2D images [^3], [^4], [^5], [^6], [^7], [^8] and 3D point clouds [^9], [^10], [^11], [^12], [^13] have been proposed.

In recent years, unmanned platforms have developed rapidly and are widely used in fields such as surveying and mapping, military applications, and consumer electronics. At the same time, owing to the miniaturization and technological maturity of LiDAR, the integration of LiDAR and cameras on unmanned platforms has been increasingly adopted. Numerous point-cloud-based and fusion-based 3D object detection methods [^14], [^15], [^16], [^17], [^18] for unmanned aerial vehicles (UAVs) have been proposed. Because of the power and size constraints of UAVs, edge computing nodes with strict limitations on computing power and storage capacity are commonly employed. Specifically, data must be processed near the sensors with only hundreds of milliwatts of power consumption and several megabytes of memory. To boost performance under these constraints at the edge, a series of works [^19], [^20], [^21], [^22], [^23], [^24], [^25] on tiny machine learning have been proposed.

A LiDAR point cloud is composed of points with coordinate and reflectivity data, which provides abundant 3D shape and spatial information. However, compared to 2D images captured by a camera, 3D point clouds are sparse and lack sufficient semantic and texture information, resulting in insufficient detection accuracy for objects at medium to long range. In view of this, a large number of works [^26], [^27], [^28], [^29], [^30], [^31], [^32], [^33], [^34], [^35], [^36], [^37], [^38], [^39], [^40], [^41], [^42], [^43], [^44], [^45], [^46] have been presented to fuse the object information in 2D images with point cloud features, further improving the accuracy of 3D object detection.

The existing LiDAR-Camera fusion detection methods can be broadly classified into three categories [^47], [^48]: proposal-level fusion, point-level fusion and feature-level fusion. The proposal-level fusion methods [^26], [^27], [^28], [^30], [^36] extract features separately from the image and the point cloud, generating proposals [^26], [^27] or bounding box results [^28], [^30], [^36] for each modality. Then, a cross-scale fusion stage generates the final 3D bounding boxes. However, proposal-level fusion relies heavily on the detection results of the previous stage and overlooks cross-modal relations, which limits its accuracy. To address this limitation, point-level fusion methods [^31], [^32], [^33], [^34] paint semantic features of the image onto LiDAR foreground points, which significantly improves the detection precision. However, the effectiveness of point-level fusion is constrained by the rigid association between points and pixels established by calibration matrices. In contrast, feature-level fusion methods [^35], [^36], [^37], [^38], [^39], [^40], [^41], [^42], [^43], [^44] introduce a dedicated feature fusion process after the feature extraction backbone, which leverages cross-modal semantic features in detection, resulting in enhanced robustness and higher accuracy.

However, the current methods have not adequately considered the hardware implementation efficiency. This presents three main challenges: (1) The limited storage space of edge devices restricts the size of weights and feature maps; (2) The limited computational resource restricts the algorithmic complexity; (3) Optimal balance between the detection performance and hardware consumption is indeed an elusive target.

To address these challenges, this paper proposes a hardware-efficient lightweight framework termed TinyFusionDet for LiDAR-Camera fusion object detection at the edge. Specifically, we propose a Dense Symmetric Linear Residual Block (DSLRB) to expand the receptive field in the deep layers of the Image and LiDAR Feature Extractors, which helps to improve the detection precision for larger objects. Furthermore, a Hybrid Scale Pillar Strategy (HSPS) is put forward and used in the LiDAR Feature Extractor to enhance small object detection. Following that, we improve the cross-modal feature augmentation and fusion process in a matrix-multiplication-free manner. Specifically, we propose a Cross-Modal Heatmap Attention module (CMHA), which predicts a learnable heatmap to strengthen the cross-modal representation of image features at low computation and memory cost. On this basis, we propose a Cross-Modal Feature Interaction module (CMFI) to further aggregate and enhance the cross-modal features among channels with Metaformer [^49], a powerful information fusion structure, which yields a remarkable improvement in detection accuracy.

Extensive experiments on KITTI [^50] and nuScenes [^51] indicate that TinyFusionDet outperforms competing methods in inference speed and hardware consumption while maintaining considerable detection accuracy. Furthermore, we design a customized FPGA-based hardware-accelerated prototype system, which runs TinyFusionDet in real time and consumes less than 3 MB of memory overall. To the best of our knowledge, this is the first work to implement LiDAR-Camera fusion based 3D object detection at the edge.

In summary, the contributions of this paper are as follows:

- We propose a lightweight 3D fusion detection framework termed TinyFusionDet, which achieves the lowest memory consumption and inference latency with competitive accuracy.
- We put forward an ingenious Hybrid Scale Pillar Strategy (HSPS), which encodes multi-scale pillars to especially improve the detection accuracy of small objects without increasing memory costs for features.
- We present a Cross-Modal Heatmap Attention module (CMHA), which predicts a learnable attention map to enhance the image feature representation at a much lower cost of computation and memory.
- We design a Cross-Modal Feature Interaction module (CMFI), which employs a Metaformer-like framework to fuse the cross-modal features for more precise detection.
- To the best of our knowledge, we are the first to deploy a LiDAR-Camera fusion detection framework on FPGA, achieving a record high energy efficiency up to 114.97GOPS/W.

## SECTION II. Related Works

### A. LiDAR-Based 3D Detection

Commonly, LiDAR-based 3D object detection methods can be classified into three categories: point-based, bird’s-eye-view (BEV) based and voxel-based methods.

Point-based methods [^9], [^10], [^28], [^36], [^52], [^53] take raw points as input. F-PointNet [^28] and PointFusion [^36] predict 3D bounding boxes from 2D detection results, which leads to unstable accuracy. In view of this, PointRCNN [^9] and STD [^10] employ PointNet++ [^54] as the backbone and predict 3D boxes from point-wise features. While maintaining high detection accuracy, the state-of-the-art 3DSSD [^53] proposes a feature-distance-based sampling strategy that reduces the inference latency to 38 ms on a powerful GPU. Even so, numerous experiments have shown that point-based methods are still computationally intensive, which hinders their application at the edge.

The bird’s-eye-view (BEV) based methods [^11], [^55], [^56], [^57] project the point cloud into 2D space, which significantly simplifies the computation. BirdNet [^55] and PIXOR [^11] exploit efficient BEV-map-based data representations with specific encoding methods. HDNET [^57] proposes to fuse the BEV representation from an independent high-definition LiDAR map predictor. However, a considerable number of 3D features are lost because the height information along the Z-axis is omitted, which seriously degrades the detection accuracy.

The voxel-based methods [^12], [^13], [^58], [^59], [^60], [^61], [^62], [^63], [^64] divide the 3D space into regular voxels as input. VoxelNet [^59] first presented voxel feature encoding (VFE) to extract point-wise features and then aggregated them with subsequent 3D convolutions. To achieve higher precision, Part-A2 [^62] introduced an RoI-aware pooling operation that preserves the information of all points within the proposals to eliminate ambiguity. H$^{2}$3D R-CNN [^63] extracted 3D voxel features from both the perspective view and the bird’s-eye view. PDV [^64] addressed the challenges posed by non-uniform point cloud sampling and leveraged the relationship between point density and distance to improve accuracy. However, the computational cost of 3D convolution or the complex structure hinders the use of these methods in real-time applications. Given this, SECOND [^12] introduced sparse convolution, which avoids unnecessary computations in empty spaces and reduces the inference latency to 50 ms. Based on SECOND, PointPillars [^13] further compresses the memory consumption of voxels, achieving an inference time of 23 ms. In addition, hybrids of the above methods [^65] are also available.

### B. LiDAR-Camera Fusion 3D Detection

LiDAR-Camera Fusion detection methods leverage the depth information from point clouds and textural details from images, offering a comprehensive understanding of the environment with complex fusion process. The existing methods can be classified into three main categories: proposal-level, point-level and feature-level fusion.

In detail, proposal-level fusion methods [^26], [^27], [^28], [^29], [^30] extract information from the two modalities separately and perform fusion at either the proposal stage or the result stage. MV3D [^27] and AVOD [^26] generate a set of proposals from feature maps and apply RoI pooling, which converts feature blocks of disparate sizes to a uniform size. However, RoI feature fusion only occurs on high-level feature maps and selectively fuses features of specific target regions, causing a partial loss of detail. Meanwhile, F-PointNet [^28], RoarNet [^29], and CLOCs [^30] fuse the predicted bounding boxes from each sensor at the result stage. F-PointNet [^28] combines the two predicted results through a cascaded fusion of 2D and 3D detectors, so its accuracy is limited by the 2D detection. CLOCs [^30] proposes a sub-network that learns from 3D and 2D candidates and then predicts the 3D bounding boxes, which effectively improves the detection accuracy.

Point-level fusion methods [^31], [^32], [^33], [^34] assign semantic information from the image to foreground raw points and then perform fine-grained fusion at the point level. PI-RCNN [^31] proposes point-based continuous attention convolution fusion, which directly fuses multi-sensor features on 3D points. However, its performance suffers from the sparsity of the point cloud. PointAugmenting [^32] and PointPainting [^33] process both the aligned points and the image semantics with a 3D object detection framework. However, using image features to enhance 3D points may introduce 2D semantic constraints. Conversely, projecting point clouds onto images may lead to severe performance degradation because it disrupts the consistency of multi-modal representations. Given this, VFF [^34] proposes a point-to-ray projection approach. In general, point-level fusion methods achieve high accuracy but are susceptible to multi-sensor misalignment.

Feature-level fusion methods fuse the features from different sensors before proposal prediction, and are typically divided into hard and soft association methods. Hard association fusion [^35], [^36], [^37], [^38], [^39], [^40] directly combines point clouds and images at the element level. ContFuse [^35] proposes a fusion layer to integrate BEV and image feature maps. Similarly, PointFusion [^36] puts forward a dense feature fusion for point clouds and images. DeepFusion [^37] finds that data augmentation may affect the relevance between features and accordingly presents InverseAug, a robust data augmentation technique. Although the abovementioned methods are simple and efficient, they cannot fully exploit the complementary nature of the modalities. To alleviate the inflexibility of hard fusion, EPNet [^39] introduces an LI-Fusion module that dynamically assesses the relevance of semantic characteristics in the image.

Soft association fusion methods [^41], [^42], [^43], [^44], [^66], [^67] effectively fuse the contextual relationships between features. Owing to the excellent performance of transformers in global search, many methods adopt transformers as fusion components. For example, TransFuser [^41] directly uses the cross-attention module of the transformer to fuse features at multiple scales, which is a robust solution under inferior image conditions. CAT-Det [^43] leverages a Pointformer branch and an Imageformer branch to obtain a wide receptive field and capture comprehensive global information from point clouds and images, respectively. Similarly, LoGoNet [^44] introduces transformer-based global and local fusion and uses a self-attention module to achieve information interaction between the globally and locally fused features. GraphAlign++ [^66] proposes a graph-matching-based framework that constructs graphs of point cloud features and matches neighboring fused features across modalities to find a more suitable alignment than single point-pixel matching. Since the BEV space provides a unified coordinate system and allows efficient 2D convolutions, RobBEV [^67] extracts LiDAR and camera features independently in the BEV space, and designs a mutual deformable attention module and a temporal aggregation module for adaptive cross-modal feature selection and continuous image fusion, respectively.

To summarize, the soft association fusion mechanism uses cross-attention to establish flexible associations between point clouds and images, which greatly contributes to improving the detection accuracy. However, this methodology relies primarily on the transformer module, whose high computational load and memory consumption make it hard to deploy on hardware at the edge.

### C. LiDAR Processing on Edge

In recent years, to expedite LiDAR processing on edge devices, numerous studies have introduced solutions for networks based on point clouds. Zhang developed an accelerator for channel clustering in LiDAR point clouds, achieving a speedup of over 471.5 times compared to CPU execution [^68]. Feng introduced an ASIC-based accelerator called Mesorasi for PointNet++, along with optimization strategies for neighbor point searches [^69]. Zheng created a low-power FPGA-based accelerator, which enhanced the nonlinear operations in PointNet [^70]. PointAcc proposed an ASIC-based accelerator that consolidated various mapping operations into a multiply-accumulate operation via coordinate transformation, making it compatible with different point cloud networks [^71]. In contrast, FPGA-based CNN accelerators are more energy-efficient than GPUs and can perform more extensive parallel processing than CPUs [^72]. Consequently, there is a strong need for a dedicated FPGA-based accelerator for pillar-based methods to facilitate their deployment.

## SECTION III. Method

### A. Overview

To enable edge devices to run 3D object detection in real time, we propose a compact LiDAR-Camera fusion framework with minimal memory cost and computational load. As shown in Fig. 1, the workflow of the framework can be summarized as follows: (1) A lightweight Image Feature Extractor extracts texture features from the 2D image; (2) In the LiDAR Feature Extractor, a Hybrid Scale Pillar Strategy is put forward to strengthen the intrinsic feature representation of the tiny backbone network from our previous TinyPillarNet [^73]; (3) The Cross-Modal Heatmap Attention module uses the extracted Image and LiDAR features to jointly predict an attention map that enhances the image features; (4) A Cross-Modal Feature Interaction module with a smaller and more powerful Metaformer architecture fuses the Image and LiDAR features; (5) The fused features are finally fed into the detection head to predict the classes and bounding boxes.

![Figure 1](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai1-3556711-large.gif)

*Fig. 1. The structure of our proposed TinyFusionDet framework. LiDAR and Image Feature Extractors obtain features from point cloud and 2D image, respectively. The Hybrid Scale Pillar Strategy is proposed to enhance the LiDAR feature of the small object. Following the approach in [73], PPME encodes the 3D points to 2D intrinsic and distributional pseudo-maps, which are then processed by the Tiny Backbone Network and Saliency Enhancement Network. Thereafter, LiDAR and image features in 2D space are fed into Cross-Modal Heatmap Attention module, which enhances the image feature by cross-modal heatmap. Additionally, a Cross-Modal Feature Interaction module further fuses 2D image and LiDAR features to effectively improve the detection accuracy.*

To present a LiDAR-Camera 3D detector suitable for edge devices, the key design considerations in this paper are: (1) Shrinking the size of feature maps and weights to save storage space; (2) Simplifying the operations and structure of the detector to facilitate deployment.

### B. LiDAR and Image Feature Extractor

#### 1) Pillar Encoder:

Each point in the point cloud is represented as $p_{n}(x_{n},y_{n}, z_{n},r_{n})$, where $x_{n},y_{n}, z_{n}$ are the 3D coordinates and $r_{n}$ is the reflectance. The 3D space with the range $[x_{min}, y_{min}, z_{min}, x_{max}, y_{max}, z_{max}]$ is evenly divided into grid cells of size $(g_{x}, g_{y})$ along the X and Y axes. The points are then partitioned into subspaces termed pillars, which can be defined as

$$
\begin{align*} P_{i, j} = \{\, p_{n} ~|~ & \lfloor x_{n}/g_{x} \rfloor = i,~ \lfloor y_{n}/g_{y} \rfloor = j, \\ & x_{min} \le x_{n} \le x_{max},~ y_{min} \le y_{n} \le y_{max} \,\} \tag {1}\end{align*}
$$

where $\lfloor \cdot \rfloor$ is the floor function and $(i, j)$ denotes the coordinate of a pillar.
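The partition in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `partition_into_pillars`, the dictionary-based storage, and the sample grid size and range are assumptions made for the example.

```python
import numpy as np

def partition_into_pillars(points, grid, xy_range):
    """Group points of shape (N, 4) = (x, y, z, r) into pillars keyed by (i, j).

    Hypothetical helper: follows Eq. (1) literally, storing pillars in a dict.
    """
    g_x, g_y = grid
    x_min, y_min, x_max, y_max = xy_range
    # Keep only the points inside the detection range, as required by Eq. (1).
    mask = ((points[:, 0] >= x_min) & (points[:, 0] <= x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] <= y_max))
    kept = points[mask]
    # Pillar coordinates: i = floor(x / g_x), j = floor(y / g_y).
    i = np.floor(kept[:, 0] / g_x).astype(int)
    j = np.floor(kept[:, 1] / g_y).astype(int)
    pillars = {}
    for idx, key in enumerate(zip(i.tolist(), j.tolist())):
        pillars.setdefault(key, []).append(kept[idx])
    return {key: np.stack(pts) for key, pts in pillars.items()}

# Illustrative values only (not taken from the paper).
points = np.array([
    [0.10, 0.10, -1.0, 0.3],
    [0.15, 0.12, -0.5, 0.5],
    [0.50, 0.10,  0.2, 0.9],
    [9.00, 9.00,  0.0, 0.1],   # outside the range, so it is dropped
])
pillars = partition_into_pillars(points, grid=(0.16, 0.16),
                                 xy_range=(0.0, 0.0, 1.0, 1.0))
for key, pts in sorted(pillars.items()):
    print(key, len(pts))
```

A dictionary keeps only non-empty pillars, which mirrors the sparsity of LiDAR data; a hardware implementation would more likely stream points into a fixed-size buffer.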

To dramatically reduce the memory consumption of pillars, we employ the PPME from our previous work [^73] to pre-encode pillars into two types of extremely compact pseudo-maps in 2D space, which can be defined as

$$
\begin{align*} I, D & = PPME(P) \\ I & = \{I_{zmin}, I_{zmax}, I_{r}\} \\ D & = \{D_{n}, D_{dd}\} \tag {2}\end{align*}
$$

where *P* represents the pillar set, and *I* and *D* stand for the intrinsic pseudo-map and the distributional pseudo-map, respectively. *I* is composed of the descriptors $I_{zmin}$, $I_{zmax}$ and $I_{r}$, representing the minimum height, maximum height and average reflectance of the pillar, respectively. *D* consists of the descriptors $D_{n}$ and $D_{dd}$, representing the number of points and a customized disorder degree, respectively.
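As a rough illustration of the descriptors in Eq. (2), the sketch below computes $I_{zmin}$, $I_{zmax}$, $I_{r}$ and $D_{n}$ per pillar. It is not the PPME of [^73]: the function name `encode_pseudo_maps`, the zero fill for empty pillars, and the non-negative pillar indices are assumptions, and the disorder degree $D_{dd}$ is omitted because its definition is not given in this section.

```python
import numpy as np

def encode_pseudo_maps(pillars, map_shape):
    """Build per-pillar descriptors from {(i, j): array (n, 4) of (x, y, z, r)}.

    Hypothetical helper. Assumes 0 <= i < H and 0 <= j < W, and fills empty
    pillars with zeros (an assumption, not stated in the source).
    """
    h, w = map_shape
    intrinsic = np.zeros((3, h, w), dtype=np.float32)  # I_zmin, I_zmax, I_r
    count = np.zeros((h, w), dtype=np.float32)         # D_n
    for (i, j), pts in pillars.items():
        intrinsic[0, i, j] = pts[:, 2].min()   # minimum height
        intrinsic[1, i, j] = pts[:, 2].max()   # maximum height
        intrinsic[2, i, j] = pts[:, 3].mean()  # average reflectance
        count[i, j] = len(pts)                 # number of points
    return intrinsic, count

# Illustrative pillars only.
example_pillars = {
    (0, 0): np.array([[0.10, 0.10, -1.0, 0.3],
                      [0.15, 0.12, -0.5, 0.5]]),
    (3, 0): np.array([[0.50, 0.10, 0.2, 0.9]]),
}
intrinsic, count = encode_pseudo_maps(example_pillars, map_shape=(4, 4))
print(intrinsic.shape, count.sum())
```

Each pillar is reduced to a handful of scalars regardless of how many points it contains, which is what makes the pseudo-maps compact.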

#### 2) Hybrid Scale Pillar Strategy:

In voxel-based methods, the grid size directly affects how precisely the shape of 3D objects is represented. A larger grid size is suited to representing large objects such as cars, trucks, and trains, while a smaller grid size preserves more detailed 3D information and is advantageous for characterizing small objects such as pedestrians and cyclists.

Previous works [^60], [^74] indicated that using a mixture of multi-scale voxels can significantly improve the detection accuracy. However, small grid sizes increase the resolution of the pseudo-map and thus cause a sharp growth in memory consumption.

To balance the pseudo-map resolution and memory consumption, a novel Hybrid Scale Pillar Strategy (HSPS) is proposed and applied in the Tiny Backbone Network. As shown in Fig. 2, we adopt multiple grid sizes to encode points into multi-resolution intrinsic pseudo-maps for various target sizes. In practice, we set two grid sizes, $(g_{xs},g_{ys})$ and $(g_{xl}, g_{yl})$, with corresponding detection ranges $[x_{mins}, y_{mins}, x_{maxs}, y_{maxs}]$ and $[x_{minl}, y_{minl}, x_{maxl}, y_{maxl}]$, which are related by

$$
\begin{equation*} \frac {g_{xs}}{g_{xl}} = \frac {g_{ys}}{g_{yl}} = \frac {x_{maxs} - x_{mins}}{x_{maxl} - x_{minl}} = \frac {y_{maxs} - y_{mins}}{y_{maxl} - y_{minl}} \tag {3}\end{equation*}
$$

![Figure 2](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai2-3556711-large.gif)

*Fig. 2. The structure of the proposed Hybrid Scale Pillar Strategy (HSPS) in Tiny Backbone Network, which generates a hybrid feature from multi-scale intrinsic pseudo-maps.*

Following that, the intrinsic features from the dual-branch encoding process are fused by an embedding operation to generate a hybrid feature, which simply replaces the corresponding region in the low-resolution feature map with the high-resolution BEV map. The embedding operation only enhances the features in the central field, and provides a significant performance improvement over using the large-grid feature alone at minimal additional computational cost.
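Two consequences of this design can be checked with a short sketch. First, Eq. (3) ties the grid-size ratio to the range ratio, so both branches produce pseudo-maps with the same number of cells. Second, the embedding step is a plain region replacement. The grid sizes and ranges below are hypothetical, and the sketch assumes the fine-branch feature has already been brought to the coarse map's cell size by the network (the exact alignment is shown in Fig. 2 and not reproduced here).

```python
import numpy as np

def cells(range_min, range_max, g):
    """Number of grid cells along one axis for grid size g."""
    return int(round((range_max - range_min) / g))

def embed_center(coarse, fine_region):
    """Replace the central window of `coarse` (C, H, W) with `fine_region` (C, h, w).

    Hypothetical helper: assumes the fine-branch region is centered in the
    coarse map and already matches the coarse map's cell size.
    """
    _, H, W = coarse.shape
    _, h, w = fine_region.shape
    top, left = (H - h) // 2, (W - w) // 2
    hybrid = coarse.copy()
    hybrid[:, top:top + h, left:left + w] = fine_region
    return hybrid

# Hypothetical settings that satisfy Eq. (3): 0.16 / 0.32 == 32 / 64.
cells_large = cells(0.0, 64.0, 0.32)    # large grid over the full range
cells_small = cells(16.0, 48.0, 0.16)   # small grid over the central range
print(cells_large, cells_small)

coarse = np.zeros((1, 8, 8))
fine_region = np.ones((1, 4, 4))
hybrid = embed_center(coarse, fine_region)
print(hybrid[0])
```

Because the replacement is a copy into an existing buffer, the hybrid feature occupies no more memory than the coarse feature map itself.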

#### 3) LiDAR Feature Extractor:

The architecture of the LiDAR Feature Extractor with HSPS is shown in Fig. 3. It mainly consists of the Tiny Backbone Network (TBN) and the Saliency Enhancement Network (SEN). In TBN, the 2D intrinsic pseudo-maps are first extracted by HSPS. Thereafter, a top-down sub-network consisting of LRB and DSLRB extracts multi-scale 2D features, which are then aligned by up-sampling and summed. In SEN, a global-sized 2D distributional pseudo-map is used to adaptively generate a saliency map, which highlights the locations of targets. Finally, the summed feature is multiplied by the saliency map, generating the 2D LiDAR feature.

![Figure 3](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai3-3556711-large.gif)

*Fig. 3. The structure of Image and LiDAR Feature Extractors, as well as their basic modules, including the previously proposed Linear Residual Block (LRB) [73] and the newly proposed Dense Symmetric Linear Residual Block (DSLRB).*

In the proposed framework, we leverage the lightweight Linear Residual Block (LRB) presented in our previous work [^73] as a building block. As shown in Fig. 3, LRB consists of one DWConv and two PWConvs. The shortcut connection on features with more channels preserves more information from the bottom layers and allows more gradients to propagate across multiple layers during training [^20]. Considering that stacking $3 \times 3$ DWConvs causes the sparsity of features to disappear rapidly and blurs the shape of small objects [^75], only one DWConv is used in LRB.

Specifically, we propose a Dense Symmetric Linear Residual Block (DSLRB), which adds a $3\times 3$ DWConv to the LRB to enlarge the receptive field and improve the detection precision of large targets. Moreover, dense connections are employed to propagate gradients and stabilize training. It is worth noting that DSLRB is only used in deep layers, because deploying it in shallow layers may blur the features of small objects and reduce the detection accuracy. Briefly, the backbone network of the 3D Feature Extractor mainly consists of 5 groups, as shown in Fig. 3. We refer to G1 as the shallow layer, and G2 and G3 as the deep layers.
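The trade-off between the two blocks can be made concrete with a back-of-the-envelope count. The sketch below assumes a hypothetical pointwise, depthwise, pointwise ordering with illustrative channel widths, ignores biases and normalization layers, and does not model the dense connections; the exact layouts are those of Fig. 3 and [^73], which are not reproduced here.

```python
def depthwise_params(channels, k=3):
    """Weights of a k x k depthwise convolution (one filter per channel)."""
    return channels * k * k

def pointwise_params(c_in, c_out):
    """Weights of a 1 x 1 (pointwise) convolution."""
    return c_in * c_out

def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution, for comparison."""
    return c_in * c_out * k * k

def block_params(c_in, c_mid, c_out, num_dw):
    """Two pointwise convs around `num_dw` depthwise convs (assumed layout)."""
    return (pointwise_params(c_in, c_mid)
            + num_dw * depthwise_params(c_mid)
            + pointwise_params(c_mid, c_out))

def stacked_receptive_field(num_convs, k=3):
    """Receptive field of `num_convs` stacked stride-1 k x k convolutions."""
    return 1 + num_convs * (k - 1)

# Illustrative channel widths only.
lrb_like = block_params(32, 64, 32, num_dw=1)
dslrb_like = block_params(32, 64, 32, num_dw=2)
print(lrb_like, dslrb_like, standard_conv_params(32, 32))
print(stacked_receptive_field(1), stacked_receptive_field(2))
```

Under these assumptions the second depthwise convolution widens the receptive field from 3 to 5 pixels per block while adding only a small fraction of the weights, which is consistent with reserving it for deep layers where large objects matter most.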

#### 4) Image Feature Extractor:

As shown in Fig. 3, the Image Feature Extractor extracts the semantic features of the image captured by the camera. DSLRBs are adopted as building blocks to construct its backbone. Unlike the LiDAR Feature Extractor, the shallow layer also adopts DSLRB instead of LRB, because small objects occupy a much larger proportion of the 2D scene than of the 3D scene. Following the existing methods [^37], [^44], the network includes 3 pooling operations with stride 2 and a $2\times$ up-sampling operation, so the spatial size of the generated 2D features is one-fourth of that of the image. This ensures that the 2D features have dense semantics while keeping the computational complexity relatively low.
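As a quick check of the stated output size, assume an input image of height $H$ and width $W$ (symbols introduced here for illustration), take "one-fourth" to refer to each spatial side, and ignore rounding at the borders. Three stride-2 poolings followed by one $2\times$ up-sampling then give

$$
\begin{equation*} H_{feat} = \frac {H}{2^{3}} \times 2 = \frac {H}{4}, \qquad W_{feat} = \frac {W}{2^{3}} \times 2 = \frac {W}{4} \end{equation*}
$$

so, under this reading, the feature map holds one-sixteenth as many spatial positions as the input image.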

### C. Cross-Modal Heatmap Attention Module

Previous feature-level fusion methods, especially the transformer-based works [^41], [^42], [^43], [^44] shown in Fig. 4(a), typically predict cross-attention information by calculating the correlation between LiDAR and Image features. The Image features are then enhanced by the cross-attention scores to remove background information that is unrelated to the 3D information, which plays a crucial role in the effectiveness of the 2D image branch. However, several limitations remain when implementing these methods on edge devices: (1) Complex connections in the network lead to significant memory consumption. (2) In the cross-attention module, the matrix multiplication used to compute cross-modal feature similarity requires a large amount of computation and storage space. (3) Complex exponentiations in Softmax are also a major source of inference latency.
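To make limitation (2) concrete, the sketch below counts the entries of a cross-attention score matrix and compares them with the number of spatial positions in the image feature map itself. The feature-map sizes are hypothetical and chosen only for illustration; a single attention head with one token per spatial position is assumed.

```python
def attention_score_elements(n_query_tokens, n_key_tokens):
    """Entries in a single-head cross-attention score matrix (queries x keys)."""
    return n_query_tokens * n_key_tokens

def spatial_positions(height, width):
    """Number of spatial positions (tokens) in a 2D feature map."""
    return height * width

# Hypothetical sizes: a 96 x 320 image feature map and a 100 x 100 BEV map.
n_image = spatial_positions(96, 320)
n_lidar = spatial_positions(100, 100)
n_scores = attention_score_elements(n_image, n_lidar)

print("image tokens:", n_image)
print("LiDAR tokens:", n_lidar)
print("score entries:", n_scores)
print("scores per image position:", n_scores // n_image)
```

The score matrix grows with the product of the two token counts, whereas a per-pixel mask grows only with the size of one feature map, which is the gap an edge-oriented design needs to close.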

![Figure 4](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai4-3556711-large.gif)

*Fig. 4. The structure of the commonly used cross-attention module and our proposed Cross-Modal Heatmap Attention module (CMHA). As can be seen, our CMHA maintains consistency with the input features in terms of feature size, making it relatively lightweight.*

To address the above issues, we present a Cross-Modal Heatmap Attention module (CMHA) that produces a fused attention mask containing cross-modal correlations. The detailed structure of CMHA is illustrated in Fig. 4(b). CMHA first predicts the LiDAR BEV heatmap and the Image heatmap, each with a pointwise convolution followed by a Sigmoid activation function. Since the point cloud encoding is applied along the Z-axis, the LiDAR heatmap is generated from a BEV perspective. Following BEVFusion [^76], we project the LiDAR BEV heatmap to the image view with the LiDAR-to-Image View Transform module, using the calibration matrix. The LiDAR BEV heatmap is thereby transformed to the same view as the image heatmap and fused with it. The fused attention mask, which contains characteristics of the LiDAR point cloud, is finally employed to generate the enhanced image feature.
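The data flow described above can be sketched with NumPy. This is a minimal sketch under several assumptions that the text does not specify: the LiDAR heatmap is taken as already projected to the image view (the view transform is not implemented), the two heatmaps are fused by an element-wise product, and the mask is applied to the image feature by broadcasting multiplication. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_heatmap(feature, weight, bias):
    """Pointwise (1 x 1) convolution to one channel, followed by Sigmoid.

    feature: (C, H, W); weight: (C,); bias: scalar. Returns (H, W) in (0, 1).
    """
    logits = np.tensordot(weight, feature, axes=([0], [0])) + bias
    return sigmoid(logits)

def cmha_sketch(image_feat, lidar_heatmap_img_view, w_img, b_img):
    """Enhance image features with a fused heatmap mask.

    Assumptions: `lidar_heatmap_img_view` is already in the image view, and
    fusion is an element-wise product (the source does not state the operator).
    """
    image_heatmap = predict_heatmap(image_feat, w_img, b_img)
    fused_mask = image_heatmap * lidar_heatmap_img_view
    enhanced = image_feat * fused_mask[None, :, :]
    return enhanced, fused_mask

# Toy inputs: zero weights give an image heatmap of 0.5 everywhere.
image_feat = np.ones((4, 2, 3))
lidar_heatmap = np.ones((2, 3))
enhanced, fused_mask = cmha_sketch(image_feat, lidar_heatmap,
                                   w_img=np.zeros(4), b_img=0.0)
print(fused_mask.shape, enhanced.shape)
```

Note that the mask has the same spatial size as the image feature and that no matrix product between the two modalities or Softmax is involved, only a channel reduction and element-wise operations.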

As for the ground truth LiDAR heatmaps $HM_{L}$ and image heatmaps $HM_{I}$, they are consistent with the keypoint heatmap used in CornerNet [^7] and CenterNet [^8]. The centers of ground truth $HM_{L}$ and $HM_{I}$ are both splatted onto the heatmap *HM* using a Gaussian kernel, which is represented as

$$
\begin{equation*} HM = \exp \left \{{-\frac { (x-gt_{x})^{2} + (y-gt_{y})^{2} }{2 \sigma _{gt}^{2} }}\right \} \tag {4}\end{equation*}
$$

where $(gt_{x}, gt_{y})$ is the coordinate of a ground-truth center, and $\sigma _{gt}$ is an object-size-adaptive standard deviation [^7]. If two Gaussian kernels of the same class overlap, we take their element-wise maximum. Compared with the typical cross-attention module in Fig. 4(a), our CMHA generates a similar attention mask. However, the fused attention mask produced by multiplying the LiDAR and image feature matrices in the cross-attention module is much larger than the mask in CMHA. Furthermore, CMHA avoids the costly matrix multiplication and Softmax operations. Its minimal memory consumption and simple operations are beneficial for edge applications.
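As a minimal sketch of Eq. (4), the ground-truth heatmap of one class can be rendered as follows. The function name and the per-object `sigma` argument are illustrative; how $\sigma_{gt}$ is derived from the object size follows [^7] and is not reproduced here.

```python
import math

def render_heatmap(height, width, objects):
    """Splat Gaussian kernels for the objects of ONE class.

    objects is an iterable of (center_x, center_y, sigma) tuples.
    Overlapping kernels are merged with the element-wise maximum.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for center_x, center_y, sigma in objects:
        for y in range(height):
            for x in range(width):
                squared_distance = (x - center_x) ** 2 + (y - center_y) ** 2
                value = math.exp(-squared_distance / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap

# Two nearby centers on the same row; the pixel between them keeps the
# larger of the two kernel values instead of their sum.
heatmap = render_heatmap(5, 5, [(1, 2, 1.0), (3, 2, 1.0)])
```

Taking the maximum keeps every value in $[0, 1]$ and leaves exactly one peak of value 1 per center, which is what the focal loss in Eq. (6) relies on.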

In addition, our CMHA explicitly learns to capture foreground positions by predicting heatmaps, which helps training converge rapidly. In contrast, the cross-attention module relies on similarity calculation between modalities, which requires the modalities to have similar representations for the same object. This leads to high learning costs, higher demands on the feature extraction networks, and difficult convergence in training.

### D. Cross-Modal Feature Interaction Module

Before obtaining the detection results, the enhanced image features (mainly containing foreground object information) will be fused with the LiDAR features by the cross-modal fusion module. The design of the module directly impacts the expression capability of the object related features.

To facilitate the interaction of cross-modal information, commonly used methods include simple addition or concatenation and modules such as self-attention. However, we find that simple addition and concatenation may lead to feature confusion. Moreover, the self-attention module and its variants remain costly to implement in hardware.

Inspired by the high performance demonstrated by Metaformer [^49], we propose the Cross-Modal Feature Interaction module (CMFI), which adopts a Metaformer-like framework to fuse heterogeneous features and further extract task-related information.

As shown in Fig. 1, before the image features and LiDAR features are concatenated in 2D space, the enhanced image feature is first projected onto the LiDAR plane through the Image-to-LiDAR view transform module. As shown in Fig. 5, the LiDAR feature is first precoded by a $1\times 1$ convolution and then passed to a concatenation mixer that combines the LiDAR and image features in the unified view. Another $1 \times 1$ convolution is then introduced to reduce the channel size. It is worth noting that we employ residual connections in CMFI to strengthen the propagation and fusion of the LiDAR feature. The final MLP expands and then reduces the channels without any intermediate BN layers, allowing full interaction of information between the channels. This is also an effective feature fusion module in the Metaformer architecture.
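The per-pixel computation described above can be sketched as follows. This is only an illustration under stated assumptions: a $1\times 1$ convolution at one pixel is written as a matrix-vector product, the positions of the two residual connections and the ReLU inside the MLP are guesses (Fig. 5 defines the actual structure), and all names and weights are hypothetical.

```python
def pointwise(vector, weight):
    """1x1 convolution at a single pixel: weight is indexed [out][in]."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]

def relu(vector):
    return [max(0.0, v) for v in vector]

def cmfi_pixel(lidar, image, w_pre, w_reduce, w_expand, w_shrink):
    """Fuse one LiDAR feature vector with the view-aligned image vector."""
    token = pointwise(lidar, w_pre)                 # precoding 1x1 convolution
    mixed = pointwise(token + image, w_reduce)      # concatenation mixer + channel reduction
    mixed = [m + l for m, l in zip(mixed, lidar)]   # residual from the LiDAR feature (assumed position)
    hidden = relu(pointwise(mixed, w_expand))       # MLP expands channels, no BN (ReLU assumed)
    out = pointwise(hidden, w_shrink)               # MLP reduces channels
    return [o + m for o, m in zip(out, mixed)]      # Metaformer-style residual (assumed)

# Hand-picked weights so the result is easy to follow.
identity = [[1.0, 0.0], [0.0, 1.0]]
w_reduce = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]             # 4 -> 2 channels
w_expand = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]       # 2 -> 4 channels
w_shrink = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]             # 4 -> 2 channels
fused = cmfi_pixel([1.0, 2.0], [3.0, 4.0], identity, w_reduce, w_expand, w_shrink)
```

Because every step acts on the channel vector of a single pixel, the working set stays proportional to the feature map itself, in contrast to a global similarity matrix.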

![Figure 5](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai5-3556711-large.gif)

*Fig. 5. The structure of the proposed Cross-Modal Feature Interaction module (CMFI).*

### E. Training Losses

The proposed TinyFusionDet is trained in an end-to-end manner. Following SECOND [^12] and PointPillars [^13], an SSD-like detection head is introduced to predict the class, box, and direction of objects, supervised by the confidence loss $L_{cls}$, the regression box loss $L_{box}$ and the direction loss $L_{dir}$, respectively. Together with the LiDAR heatmap loss $L_{HML}$ and the image heatmap loss $L_{HMI}$, the total loss is defined as

$$
\begin{align*} L & =\alpha _{cls}L_{cls} + \alpha _{box}L_{box} + \alpha _{dir}L_{dir} \\ & \quad +L_{HMI} + L_{HML} \tag {5}\end{align*}
$$

where the hyper-parameters $\alpha _{cls}$, $\alpha _{box}$ and $\alpha _{dir}$ are set to 1, 2 and 0.2, respectively, to balance the losses.

Following CornerNet [^7] and CenterNet [^8], we formulate the heatmap loss as a penalty-reduced pixel-wise logistic regression with focal loss [^77]:

$$
\begin{align*} L_{HM} = -\frac {1}{N_{pos}} \sum _{xy} \begin{cases} (1 - \hat {HM})^{\alpha } \log (\hat {HM}) & \text {if } HM=1 \\ (1 - HM)^{\beta }(\hat {HM})^{\alpha } \log (1-\hat {HM}) & \text {otherwise} \end{cases} \tag {6}\end{align*}
$$

where $\hat {HM}$ represents the predicted heatmap. Note that both the LiDAR heatmap loss $L_{HML}$ and the image heatmap loss $L_{HMI}$ are computed with $L_{HM}$. $\alpha =2$ and $\beta =4$ are the hyper-parameters of the focal loss, and $N_{pos}$ stands for the number of object centers.
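A minimal sketch of Eq. (6) on nested lists is given below. The clamping constant `eps` and the guard for heatmaps without any center are numerical-safety assumptions that are not stated in the text.

```python
import math

def heatmap_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced pixel-wise focal loss between two 2D heatmaps."""
    total, num_pos = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite (assumption)
            if t == 1.0:
                total += (1.0 - p) ** alpha * math.log(p)
                num_pos += 1
            else:
                # (1 - t)^beta reduces the penalty close to a ground-truth center.
                total += (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return -total / max(num_pos, 1)  # avoid division by zero (assumption)

target = [[1.0, 0.0], [0.0, 0.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
loss_uncertain = heatmap_focal_loss(uncertain, target)
loss_perfect = heatmap_focal_loss(target, target)
```

With $\alpha = 2$ and one positive pixel, the all-0.5 prediction yields $4 \cdot 0.25 \cdot \ln 2 = \ln 2$, while a perfect prediction yields a loss that is numerically zero.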

## SECTION IV. FPGA-Based Prototype System

### A. Architecture of the Prototype System

To verify the feasibility of deployment on edge devices, we design a prototype system based on the Xilinx ZYNQ XCZU9EG FPGA to deploy the proposed TinyFusionDet. As shown in the overall architecture in Fig. 6, our prototype accelerator mainly consists of a system controller, a memory unit and a compute unit.

![Figure 6](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai6-3556711-large.gif)

*Fig. 6. The hardware architecture of the FPGA-based prototype system.*

The system controller is responsible for parsing customized 128-bit instructions, controlling the state of the accelerator, managing data read/write operations, and handling external memory read/write operations. The memory unit is divided into three functional parts: an instruction buffer, a weight buffer and two tensor buffers, which store instructions, weight parameters and feature maps, respectively. The compute unit implements various neural network operators, such as convolution, batch normalization (BN), pooling and activation functions. The compute unit is directly connected to the buffers in the memory unit through a 256-bit internal data bus.

### B. Co-Design of Algorithm and Hardware

For energy-efficient inference, we co-design the algorithm model and the accelerator hardware as follows.

#### 1) Computing Unit:

Considering the considerable hardware cost of regular $3 \times 3$ convolutions, TinyFusionDet employs depthwise separable convolution. The DWConv and PWConv modules consist of 288 and 1024 processing elements (PEs), respectively, each of which supports int8 multiplication and addition. The output of a convolution is typically followed by a BN operation.
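To make the cost argument concrete, the following sketch counts multiply-accumulate (MAC) operations for a regular convolution and for its depthwise separable counterpart. The layer shape used in the demo (a 320 x 256 map with 64 input and output channels) is hypothetical and not taken from the network.

```python
def regular_conv_macs(height, width, c_in, c_out, kernel=3):
    """MACs of a dense kernel x kernel convolution with stride 1."""
    return height * width * c_in * c_out * kernel * kernel

def separable_conv_macs(height, width, c_in, c_out, kernel=3):
    """MACs of a depthwise (DWConv) plus pointwise (PWConv) convolution."""
    depthwise = height * width * c_in * kernel * kernel
    pointwise = height * width * c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 320 x 256 feature map, 64 -> 64 channels.
dense = regular_conv_macs(320, 256, 64, 64)
separable = separable_conv_macs(320, 256, 64, 64)
reduction = dense / separable
```

For this shape the separable form needs roughly 7.9 times fewer MACs, and the ratio approaches $k^{2} = 9$ as the number of output channels grows.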

By simultaneously accessing multiple read/write interfaces, we read input tensors and weights and write output tensors within a single cycle to boost efficiency. To further minimize memory access, both BN and ReLU are integrated into the convolution pipeline.

The workflow of a computing unit is as follows: (1) Fetch and decode an instruction; (2) Select the operation module through the system controller; (3) Read the input tensor and weights from the corresponding buffers; (4) Perform the computation and write the output tensor.

#### 2) Tensor Layout:

By observing the computing process, we find that different neural operations access features along different dimensions. For example, concatenation accesses the tensor along $W_{f}$, while the embedding operation proceeds along $C_{f}$. Other operations, such as up-sampling, element-wise addition and element-wise multiplication, are independent of the access order. Clearly, a single unified data layout cannot satisfy all access modes. We therefore implement two types of tensor layouts, as shown in Fig. 7(a) and (b). The Spatial-Prioritized Tensor Layout first stores elements in $H_{f}W_{f}$ order and then appends the channels sequentially. In contrast, the Channel-Prioritized Tensor Layout stores, for each pixel, its values along the channel dimension contiguously. The layouts in Fig. 7 are designed so that tensors are accessed as sequentially as possible.
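The two layouts correspond to two ways of mapping an index $(c, h, w)$ to a linear buffer address. The sketch below uses hypothetical function names and element offsets rather than byte addresses; the actual on-chip addressing scheme is not described in the text.

```python
def spatial_prioritized_offset(c, h, w, channels, height, width):
    """H_f W_f order first, channels appended one after another (C-H-W)."""
    return (c * height + h) * width + w

def channel_prioritized_offset(c, h, w, channels, height, width):
    """All channels of one pixel stored contiguously (H-W-C)."""
    return (h * width + w) * channels + c

shape = dict(channels=3, height=2, width=2)

# Reading every channel of pixel (0, 0):
channels_spatial = [spatial_prioritized_offset(c, 0, 0, **shape) for c in range(3)]
channels_channel = [channel_prioritized_offset(c, 0, 0, **shape) for c in range(3)]

# Walking along W_f inside channel 0:
row_spatial = [spatial_prioritized_offset(0, 0, w, **shape) for w in range(2)]
row_channel = [channel_prioritized_offset(0, 0, w, **shape) for w in range(2)]
```

An access pattern is sequential in one layout and strided in the other, which is why no single layout serves all operators equally well.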

![Figure 7](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai7-3556711-large.gif)

*Fig. 7. The typical configurations of the tensor layouts for the proposed TinyFusionDet, which can optimize sequential memory access and reduce the inference time.*

Naturally, each operation takes as input a tensor in the layout that matches its memory access order. For instance, PWConv and concatenation use the Spatial-Prioritized Tensor Layout, while DWConv, addition, up-sampling, and embedding employ the Channel-Prioritized Tensor Layout. Specifically, we implement PWConv using the classic *im2col* operation and DWConv using the convolution pipeline method. The *im2col* operation rearranges the tensor into $C_{f}H_{f}W_{f}$ format, which aligns with the Spatial-Prioritized layout. In contrast, the pipeline method requires simultaneous access to multiple channels in a row, which corresponds to the Channel-Prioritized layout.

Furthermore, we implement a *Move* operation to convert tensors between layouts. To prevent frequent layout conversions from degrading efficiency, the main computing modules, DWConv and PWConv, support selecting the output tensor layout at compile time according to the next operation.
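Functionally, a layout conversion is a transposition of a flat buffer. The following sketch shows what such a *Move* could compute in software; the function names are hypothetical and the hardware realization is not described in the text.

```python
def move_spatial_to_channel(buffer, channels, height, width):
    """Convert a flat C-H-W buffer into a flat H-W-C buffer."""
    out = [None] * len(buffer)
    for c in range(channels):
        for h in range(height):
            for w in range(width):
                out[(h * width + w) * channels + c] = buffer[(c * height + h) * width + w]
    return out

def move_channel_to_spatial(buffer, channels, height, width):
    """Convert a flat H-W-C buffer back into a flat C-H-W buffer."""
    out = [None] * len(buffer)
    for c in range(channels):
        for h in range(height):
            for w in range(width):
                out[(c * height + h) * width + w] = buffer[(h * width + w) * channels + c]
    return out

spatial = list(range(8))  # 2 channels of 2 x 2 pixels: channel 0 = 0..3, channel 1 = 4..7
channel = move_spatial_to_channel(spatial, channels=2, height=2, width=2)
restored = move_channel_to_spatial(channel, channels=2, height=2, width=2)
```

Every element is read and written once, so each explicit conversion costs a full pass over the tensor; letting the producing operator write the layout that its consumer expects avoids that pass.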

#### 3) Memory Unit:

The main concern in the design is to balance memory efficiency against accuracy. Based on extensive experiments, we first set the grid sizes in the 3D Feature Extractor to (0.08m, 0.08m) and (0.16m, 0.16m) to ensure accuracy. Then, considering the on-chip memory of the FPGA, we control the size of the input pseudo-maps by setting the corresponding detection ranges to [3m, -10.24m, 28.6m, 10.24m] and [3m, -20.48m, 54.2m, 20.48m]. The size of both input pseudo-maps is therefore (320, 256). Accordingly, the resolution of the input camera images is set to (512, 160) to match the size of the pseudo-maps.
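The pseudo-map size follows directly from the detection range and the grid size, as the short check below shows. The function name is illustrative, and the range is assumed to be ordered as $[x_{min}, y_{min}, x_{max}, y_{max}]$.

```python
def pseudo_map_size(detection_range, grid):
    """Number of pillars along x and y for a given range and grid size (metres)."""
    x_min, y_min, x_max, y_max = detection_range
    grid_x, grid_y = grid
    # round() absorbs floating-point error such as 25.6 / 0.08 = 320.00000000000006
    return (round((x_max - x_min) / grid_x), round((y_max - y_min) / grid_y))

fine_size = pseudo_map_size([3.0, -10.24, 28.6, 10.24], (0.08, 0.08))
coarse_size = pseudo_map_size([3.0, -20.48, 54.2, 20.48], (0.16, 0.16))
```

Doubling the grid size while doubling the extent of the range keeps both pseudo-maps at (320, 256), so the two scales occupy buffers of identical size.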

Next, we regulate the width of the network and set both the input and output tensor buffers to 1280KB. To fully leverage the tensor buffers, we change the original inverted residual structure of LRB/DSLRB to a residual structure in the shallow layers of the Image/LiDAR Feature Extractor. In detail, we set $2 C1 = C2$ in the LRB/DSLRB blocks so that both the addition and the pointwise convolution fit the tensor buffer size, yielding a higher storage utilization ratio.

Finally, a large 1280KB weight buffer is provided to keep all weights on chip, which reduces both power consumption and inference latency.

## SECTION V. Experiments

### A. Training Strategy

#### 1) Dataset:

Following the practice of popular 3D detection models, we conduct the experiments on KITTI [^50] and nuScenes [^51] datasets. The KITTI training set consists of 7481 samples, while the test set contains 7518 samples. For experimental studies, we split the official training set into 3712 training samples and 3769 validation samples following [^78]. According to KITTI benchmark, we focus on the APs of *Car*, *Cyclist* and *Pedestrian* to compare the accuracy. The dataset categorizes the 3D bounding boxes into three levels of difficulty (easy, moderate and hard) by the height, the occlusion level and the truncation ratio of bounding boxes [^50]. In the experiments, the accuracy of these three difficulty levels will be calculated separately.

The nuScenes dataset is a large-scale autonomous-driving dataset for 3D detection, consisting of 700, 150, and 150 scenes for training, validation, and testing, respectively. Each sample includes a 360-degree LiDAR point cloud and six high-definition camera images. The dataset contains annotations for 23 distinct object classes, which are grouped into 10 classes: car, truck, bus, trailer, construction vehicle (C.V.), pedestrian (Ped.), motorcycle, bicycle, traffic cone (T.C.), and barrier. To compare accuracy, nuScenes employs the standard average precision (AP) metric.

#### 2) Training Settings:

When implementing TinyFusionDet on KITTI and nuScenes, we customize the preset anchor sizes for each category using statistics of the ground truth boxes and use the same data augmentation methods on both datasets. Following SECOND and PointPillars, we employ random flipping, random global rotation about the Z axis within $[-\pi /6, \pi /6]$ and global scaling with a scaling factor in [0.95, 1.05] to enhance spatial variability and improve generalization. Moreover, we apply ground truth sampling to both the point cloud and image branches. When adding the ground truth samples, we also avoid occlusion between samples and preserve their mapping relationship with the points.
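The global augmentations on the point cloud can be sketched as follows. The rotation and scaling ranges come from the text; the flip axis (mirroring $y$), the flip probability of 0.5 and the function name are assumptions. In a real pipeline the same transform must also be applied to the ground-truth boxes and kept consistent with the image branch, which is omitted here.

```python
import math
import random

def augment_points(points, rng):
    """Apply a random flip, a global Z rotation and a global scaling to (x, y, z) points."""
    flip = rng.random() < 0.5                       # assumed flip probability
    angle = rng.uniform(-math.pi / 6, math.pi / 6)  # rotation about the Z axis
    scale = rng.uniform(0.95, 1.05)                 # global scaling factor
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    augmented = []
    for x, y, z in points:
        if flip:
            y = -y                                  # mirror across the X axis (assumed)
        x, y = cos_a * x - sin_a * y, sin_a * x + cos_a * y
        augmented.append((scale * x, scale * y, scale * z))
    return augmented

demo = augment_points([(3.0, 4.0, 1.0)], random.Random(0))
```

Flipping and rotation preserve the distance of each point to the Z axis, so only the scaling factor changes the range of a point.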

On KITTI, TinyFusionDet is trained for 100 epochs with batch size 16 on 4 NVIDIA RTX2080Ti GPUs in an end-to-end manner without any pre-trained parameters. On nuScenes, we train TinyFusionDet for 30 epochs with batch size 16 on 4 NVIDIA RTX3090 GPUs. All other training parameters remain consistent. The Onecycle [^79] learning rate scheduler and the Adam [^80] optimizer are employed in training. The maximum learning rate in Onecycle is 0.03, pct start (the fraction of the schedule during which the learning rate increases) is 0.4, and the weight decay is 0.01.
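The shape of such a schedule can be sketched as follows. Only the maximum learning rate (0.03) and pct start (0.4) come from the text; the cosine interpolation, the starting rate (`max_lr / div_factor`) and the final rate are assumptions borrowed from common one-cycle implementations and may differ from the authors' setup.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.03, pct_start=0.4,
                 div_factor=10.0, final_div=1e4):
    """Learning rate at a given step of a one-cycle schedule with cosine phases."""
    def cosine(start, end, progress):
        return end + (start - end) * (1.0 + math.cos(math.pi * progress)) / 2.0

    warmup_steps = pct_start * total_steps
    initial_lr = max_lr / div_factor          # assumed starting point
    if step <= warmup_steps:
        return cosine(initial_lr, max_lr, step / warmup_steps)
    decay_progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return cosine(max_lr, initial_lr / final_div, decay_progress)  # assumed end point

schedule = [one_cycle_lr(step, 100) for step in range(101)]
```

With 100 steps the rate rises during the first 40 steps, peaks at 0.03, and then decays for the remaining 60 steps.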

### B. Experimental Study on KITTI

Since TinyFusionDet is an ultra-compact LiDAR-Camera fusion 3D object detection framework, we not only focus on detection accuracy but also pay special attention to the memory consumption and latency in the inference phase.

#### 1) Accuracy Results:

As shown by the accuracy on the KITTI validation set in Tab. I, the proposed TinyFusionDet achieves an mAP of 65.83% at the moderate level. Compared to single-modal LiDAR methods, our method outperforms the representative lightweight PointPillars by 1.75%. Compared to PointRCNN and 3DSSD, our voxel-based method has a remarkable advantage in inference latency (Tab. V) and storage space (Tab. IV) at the cost of a slight accuracy decline. Compared to voxel-based methods such as SECOND and $H^{2}3$D-RCNN, TinyFusionDet consumes much less memory for feature maps. Even among LiDAR-Camera fusion methods, our accuracy significantly surpasses that of MV3D, PointFusion, F-PointNet, and AVOD. However, owing to the complex point-based network employed by EPNet and the Transformer framework adopted by CAT-Det, these methods are understandably superior to our voxel-based TinyFusionDet, especially in detecting small objects such as pedestrians. The reason is that few LiDAR points fall on pedestrians, typically only a dozen or so, and pedestrian targets are small and easily confused with background clutter, resulting in remarkably lower detection accuracy. The accuracy assessment on the KITTI test set in Tab. II also demonstrates that our proposed framework yields competitive accuracy with much lower hardware consumption.

![Figure 8](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t1-3556711-large.gif)

*TABLE I*

![Figure 9](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t2-3556711-large.gif)

*TABLE II*

![Figure 10](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t3-3556711-large.gif)

*TABLE III*

![Figure 11](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t4-3556711-large.gif)

*TABLE IV*

![Figure 12](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t5-3556711-large.gif)

*TABLE V*

It is noteworthy that the accuracy of the baseline model, a sub-network of TinyFusionDet without the HSPS, CMHA and CMFI modules, is also shown in Tab. I and Tab. II. It can be observed that the detection accuracy of small objects (such as pedestrians and cyclists) is significantly improved, by about 10%, while that of large objects (cars) improves by less than 1%. Meanwhile, we present the performance improvement of the fusion methods over the single-modal methods on the KITTI test BEV benchmark in Tab. III. As can be seen, although the absolute precision of TinyFusionDet is not very high, its performance gain from cross-modal feature fusion exceeds that of the competitors, especially for small targets such as pedestrians and cyclists. The reasons are as follows: (1) HSPS introduces more fine-grained pseudo-maps as inputs, providing more detailed 3D information on small objects; (2) CMHA fuses additional attention from the LiDAR BEV perspective, which effectively enhances the 2D features of small objects; (3) Since 3D boxes are composed of central coordinates and sizes, the heatmap response of small objects lies closer to their spatial center than that of bigger ones. Given this, the specially designed feature extraction and fusion modules in TinyFusionDet contribute more to the accuracy improvement of small objects.

#### 2) Memory Consumption:

Feature maps and weight parameters are the main sources of memory consumption. In general, feature maps need to be frequently accessed and thus reside in high-speed memory, such as RAM located close to the computational units. In contrast, weight parameters are generally massive in size and are typically stored in high-capacity memory, such as ROM or Flash. Reducing memory consumption benefits both inference speed and power consumption.

We compare the memory consumption of the proposed TinyFusionDet with existing state-of-the-art methods in Tab. IV, including the maximum size of the feature maps and the weights. As can be seen, the weight size of TinyFusionDet is far smaller than that of the competitors, which validates the outstanding expression efficiency of our proposed model. Moreover, TinyFusionDet produces only slightly larger feature maps than 3DSSD, which takes raw points as input, and outperforms other voxel-based and fusion-based methods by a large margin. In summary, only our proposed framework satisfies the strict memory constraints of edge devices.

#### 3) Inference Latency:

Since the inference speed is another key metric, we specially implement TinyFusionDet on GPUs with different computation power and illustrate the latency in Tab. V. In the test phase, we assign the batch size as large as possible to fully utilize the performance of each GPU. The computational power of the GPU is measured in terms of Floating-point Operations Per Second (FLOPS). According to the computational power, we categorize the inference latency of the competitors into three classes.

First, we implement TinyFusionDet on the GPU with the lowest computing power (GTX 1050Ti) and achieve a latency of 72.92ms, which is approximately $5\times$ and $17.8\times$ lower than that of MV3D and PointFusion, respectively. On the GTX 1080Ti, our latency is the second lowest, 6.38ms higher than that of PointPillars. The reason is that our multi-modal fusion paradigm requires more branches, computation and memory access than the single-modal PointPillars. Even so, our method is still remarkably faster than other LiDAR-based methods such as SECOND, and than fusion-based methods such as F-PointNet and 3D-CVF. Furthermore, a minimum latency of 16.93ms is recorded on the RTX 2080Ti GPU, advancing the application of high-performance fusion-based object detection in highly real-time scenarios. Notably, TinyFusionDet generally yields comparable mAP at much lower latency, which confirms that its high inference speed does not come at the cost of detection accuracy. In other words, TinyFusionDet achieves an excellent balance between latency and accuracy, which expands its applicability at the edge.

#### 4) Robustness Analysis:

To comprehensively evaluate the detection accuracy in dense and sparse scenes, we compare with competitive methods under different beam settings, following EPNet++ [^40] and FBMNet [^81]. As can be seen in Tab. VI, although yielding lower accuracy on dense 64-beam LiDAR, our TinyFusionDet demonstrates significant superiority on sparse 8-beam LiDAR, outperforming single-modal Voxel RCNN and PV-RCNN on the moderate difficulty level respectively by 12.57%, 3.84% and 28.38% of mAP. Similarly, TinyFusionDet improves mAP by 5.49% and 19.46% over the fusion-based EPNet++ on the Car and Cyclist classes at the moderate difficulty level, respectively. For the more challenging Pedestrian class at the hard difficulty level with sparse 8-beam LiDAR points as input, TinyFusionDet still achieves mAP gains of 3.74%, 9.37%, 4.33% and 0.12% over Voxel RCNN, PV-RCNN, EPNet (with LiDAR input only) and EPNet, respectively, while using a much more lightweight backbone.

![Figure 13](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t6-3556711-large.gif)

*TABLE VI*

More significantly, the complete comparison with existing single-modal and fusion-based methods clearly shows that TinyFusionDet suffers significantly less mAP degradation as the point cloud density drops from 64-beam to the sparser 16-beam and 8-beam settings, which demonstrates the robustness of our proposed method for challenging distant or small object detection tasks.

### C. Extended Experiments on nuScenes

To assess the generalization performance, we further validate the proposed TinyFusionDet on the nuScenes test set. Note that the configurations, such as the pillar encoding strategy and network architecture, remain consistent with those on KITTI. To be clear, the accuracy results in Tab. VII are based on the official test benchmark, while all inference speeds are measured on an NVIDIA RTX 3090 GPU.

![Figure 14](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t7-3556711-large.gif)

*TABLE VII*

As shown in Tab. VII, the accuracy of our proposed method on the nuScenes test set is superior to that of some state-of-the-art works, with a smaller parameter or feature map size. In terms of inference speed, our method significantly outperforms all LiDAR-Camera fusion-based algorithms. The reason is that the proposed TinyFusionDet consumes much less computation and storage space than the competing methods listed in Tab. VIII, allowing real-time implementation on edge devices. By the same token, the limited representation capability of the tiny backbone network leads to lower detection accuracy than large-scale models such as PointAugmenting [^32], TransFusion [^42] and BEVFusion [^76].

![Figure 15](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t8-3556711-large.gif)

*TABLE VIII*

### D. Ablation Study

In this section, we validate the effectiveness of each proposed component through ablation experiments with the same data processing and training settings. The accuracy metric of AP for three categories of moderate difficulty is employed in the assessment.

#### 1) Effects of HSPS:

To validate the role of the proposed HSPS, we compare different multi-scale voxel fusion designs with the baseline without HSPS, shown in Line 1 of Tab. IX. In Lines 2-4, we apply add, concat and embedding, respectively, to fuse the multi-scale 3D features.

![Figure 16](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t9-3556711-large.gif)

*TABLE IX*

In general, the designs with HSPS yield much higher accuracy, indicating that multi-scale features are beneficial for the detection of both large and small objects. Because the intrinsic pseudo-maps differ in size, both “add” and “concat” require zero padding for alignment, whereas our proposed embedding operation does not. The padding leads to imbalanced fusion between features from regions with and without padded zeros, resulting in a remarkable decline in accuracy.

In terms of memory consumption, “add” requires simultaneously storing the global field feature map (640KB) and the local field feature map (160KB). “concat” requires padding the local field feature map to the size of the global field map and then concatenating the two feature maps, which consumes 1280KB in total. In contrast, the embedding operation, being a direct replacement, incurs no additional consumption for the local field feature map and costs only 640KB, giving it advantages in both accuracy and memory consumption.
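The buffer requirements of the three fusion operations can be tallied with the sizes quoted above. The sketch only restates that arithmetic; the 800KB total for “add” is the sum of the two quoted map sizes, and the function name is illustrative.

```python
GLOBAL_FIELD_KB = 640  # global field feature map
LOCAL_FIELD_KB = 160   # local field feature map

def fusion_buffer_kb(operation):
    """Feature-map storage needed to fuse the two scales, in KB."""
    if operation == "add":
        return GLOBAL_FIELD_KB + LOCAL_FIELD_KB   # both maps held at the same time
    if operation == "concat":
        return 2 * GLOBAL_FIELD_KB                # local map padded to global size, then stacked
    if operation == "embedding":
        return GLOBAL_FIELD_KB                    # local values replace entries in place
    raise ValueError(f"unknown fusion operation: {operation}")

requirements = {name: fusion_buffer_kb(name) for name in ("add", "concat", "embedding")}
```

Under these figures, “concat” alone fills a complete 1280KB tensor buffer, whereas embedding leaves half of it free.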

#### 2) Effects of DSLRB:

To demonstrate the effectiveness of the proposed DSLRB in extracting features of large objects, we use LRB or DSLRB as the building blocks G1, G2 and G3 in the backbone of the 3D Feature Extractor under different configurations.

As shown in Tab. X, as LRB is gradually replaced with DSLRB from Line 1 to Line 3, the mAP continues to increase. The reason is that the additional DWConv in DSLRB enhances the feature representation. However, when G1 is further replaced by DSLRB in Line 4, the mAP instead falls, because an excessive number of DWConv layers may cause feature loss due to an overly large receptive field. Evidently, the AP of smaller objects suffers more from the superfluous DSLRB in the shallow layer. Overall, the optimal configuration in Line 3 is highly advantageous for both large and small objects.

![Figure 17](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t10-3556711-large.gif)

*TABLE X*

#### 3) Effects of CMHA:

To illustrate the effects of the proposed CMHA, we compare the detection accuracy and memory consumption of TinyFusionDet with different heatmap attention strategies in Tab. XI. Line 1 shows the baseline without CMHA, in which the image and LiDAR features generated by the feature extractors are directly fused by the following CMFI. As a result, the baseline yields the worst accuracy, because 2D background semantics severely corrupt the LiDAR feature.

![Figure 18](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t11-3556711-large.gif)

*TABLE XI*

Compared to 2D feature enhancement with 4-head cross-attention, our CMHA achieves over 3% mAP improvement with about $10\times$ lower memory consumption. The reason is that CMHA explicitly learns the object distribution, while cross-attention uses similarity to roughly represent it. Therefore, even with a more powerful feature extraction network with larger weights [^82], cross-attention still yields inferior accuracy.

#### 4) Effects of CMFI:

To demonstrate the effectiveness of the presented CMFI, we assess various operations for cross-modal feature fusion in Tab. XII. The addition and concatenation commonly used in traditional methods [^26], [^27], [^35], [^39] show an mAP decrease of over 2.5%, owing to the diverse representations that result from direct fusion in a unified feature space.

![Figure 19](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t12-3556711-large.gif)

*TABLE XII*

Furthermore, the transformer-based self-attention framework generally used in image and LiDAR feature fusion works [^41], [^42], [^43], [^44] is included in the comparison. From the results in Tab. XII, we find that our CMFI outperforms self-attention by 0.7% in mAP with about $5\times$ lower feature storage. The reason is that self-attention uses global similarity calculation to fuse cross-modal features, which requires feature maps of much larger spatial size and causes high memory consumption. In contrast, our CMFI adopts the Metaformer framework [^49], which incorporates locally aligned image and point cloud features in a channel fusion manner, resulting in highly efficient fusion in a unified space.

### E. Implementation on FPGA-Based Prototype System

Based on the pre-trained TinyFusionDet, we first apply the quantization-aware training (QAT) [^83] algorithm to quantize the feature maps and weight parameters to INT8. After that, the algorithm-hardware co-designed accelerator for TinyFusionDet is deployed on the FPGA prototype system, which performs 3D object detection on one frame in **46.65ms** with the limited hardware resources reported in Tab. XIII.
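As a rough illustration of what INT8 quantization does to a tensor, the sketch below applies symmetric per-tensor quantization and maps the codes back to real values, which is the kind of "fake quantization" that QAT simulates in the forward pass. The symmetric per-tensor scheme and all names are assumptions; the exact scheme of [^83] as used by the authors may differ.

```python
def quantize_int8(values):
    """Map floats to INT8 codes with one shared scale (symmetric, per tensor)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover the approximate real values represented by the INT8 codes."""
    return [code * scale for code in codes]

weights = [-1.0, 0.0, 0.5, 2.54]
codes, scale = quantize_int8(weights)
recovered = dequantize(codes, scale)
```

Each value is stored in one byte, and the round-trip error is bounded by half a quantization step (`scale / 2`), which is the error that QAT lets the network adapt to during training.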

![Figure 20](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t13-3556711-large.gif)

*TABLE XIII*

Compared with FPGA solutions for typical point cloud processing tasks (such as point classification [^70], point segmentation [^84], [^85], and point object detection [^73]) in Tab. XIV, our system consumes 3.674W of power while achieving the highest peak performance of 422.4GOPS, resulting in a new record energy efficiency of 114.97GOPS/W. Compared to our previous TinyPillarNet, this accelerator supports a wider range of convolutional kernel sizes and special operations such as concatenation and embedding, enabling it to perform cross-modal fusion-based object detection. Additionally, we design the tensor layouts and optimize the batch normalization calculations, resulting in a higher working frequency and peak performance without a substantial increase in power consumption.
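The headline figures can be cross-checked with a few lines of arithmetic. The frame rate below is simply the reciprocal of the reported single-frame latency and is not a number stated in the text.

```python
def energy_efficiency_gops_per_watt(peak_gops, power_watt):
    """Peak throughput divided by power consumption."""
    return peak_gops / power_watt

def frames_per_second(latency_ms):
    """Frame rate implied by a single-frame latency."""
    return 1000.0 / latency_ms

efficiency = energy_efficiency_gops_per_watt(422.4, 3.674)
frame_rate = frames_per_second(46.65)
```

Here `efficiency` rounds to 114.97 GOPS/W, matching the reported value, and `frame_rate` evaluates to about 21.4 frames per second.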

![Figure 21](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/76/11154820/10947105/lai.t14-3556711-large.gif)

*TABLE XIV*

## SECTION VI. Conclusion

In summary, this work proposes an extremely lightweight framework termed TinyFusionDet to deliver high-precision 3D detection in edge computing scenarios, addressing the main challenges of high computation and storage costs in current LiDAR-Camera fusion methods. By putting forward a Hybrid Scale Pillar Strategy (HSPS) in the point cloud feature extraction backbone, we remarkably enhance the detection accuracy of small-scale objects. Furthermore, the Cross-Modal Heatmap Attention (CMHA) module is designed to enrich the image feature representation under the guidance of the LiDAR BEV heatmap with lower memory and computational costs. Finally, the Metaformer-based Cross-Modal Feature Interaction (CMFI) module is presented to thoroughly fuse the features from the image and the LiDAR point cloud for the subsequent detection. Extensive experiments indicate that TinyFusionDet achieves considerable detection accuracy with a record-low storage cost while running in real time. Benefiting from the above innovations in the framework and the corresponding hardware design, we mark a first in the deployment and implementation of TinyFusionDet on a resource-limited FPGA platform.

## References

[^1]: J. Mao, S. Shi, X. Wang, and H. Li, “3D object detection for autonomous driving: A comprehensive survey,” Int. J. Comput. Vis., vol. 131, no. 8, pp. 1909–1963, Aug. 2023. [DOI](https://doi.org/10.1007/s11263-023-01790-1)

[^2]: R. Qian, X. Lai, and X. Li, “3D object detection for autonomous driving: A survey,” Pattern Recognit., vol. 130, Oct. 2022, Art. no. 108796. [DOI](https://doi.org/10.1016/j.patcog.2022.108796)

[^3]: R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448. [IEEE](https://ieeexplore.ieee.org/document/7410526)

[^4]: W. Liu, “SSD: Single shot MultiBox detector,” in Proc. 14th Eur. Conf. Comput. Vis. (ECCV). Cham, Switzerland: Springer, Oct. 2016, pp. 21–37. [DOI](https://doi.org/10.1007/978-3-319-46448-0_2)

[^5]: J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767.

[^6]: A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, “YOLOv4: Optimal speed and accuracy of object detection,” 2020, arXiv:2004.10934. [DOI](https://doi.org/10.48550/arXiv.2004.10934)

[^7]: H. Law and J. Deng, “CornerNet: Detecting objects as paired keypoints,” in Computer Vision—ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., Cham, Switzerland: Springer, 2018, pp. 765–781. [DOI](https://doi.org/10.1007/978-3-030-01264-9_45)

[^8]: X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019, arXiv:1904.07850.

[^9]: S. Shi, X. Wang, and H. Li, “PointRCNN: 3D object proposal generation and detection from point cloud,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 770–779. [IEEE](https://ieeexplore.ieee.org/document/8954080)

[^10]: Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “STD: Sparse-to-dense 3D object detector for point cloud,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1951–1960. [IEEE](https://ieeexplore.ieee.org/document/9008777)

[^11]: B. Yang, W. Luo, and R. Urtasun, “PIXOR: Real-time 3D object detection from point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 7652–7660. [IEEE](https://ieeexplore.ieee.org/document/8578896)

[^12]: Y. Yan, Y. Mao, and B. Li, “SECOND: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, Oct. 2018. [DOI](https://doi.org/10.3390/s18103337)

[^13]: A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “PointPillars: Fast encoders for object detection from point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 12689–12697. [IEEE](https://ieeexplore.ieee.org/document/8954311)

[^14]: J. N. C. Hayton, T. Barros, C. Premebida, M. J. Coombes, and U. J. Nunes, “CNN-based human detection using a 3D LiDAR onboard a UAV,” in Proc. IEEE Int. Conf. Auto. Robot Syst. Competitions (ICARSC), Trondheim, Norway, Apr. 2020, pp. 312–318. [IEEE](https://ieeexplore.ieee.org/document/9096075) [Google Scholar](https://scholar.google.com/scholar?as_q=CNN-based+human+detection+using+a+3D+LiDAR+onboard+a+UAV&as_occt=title&hl=en&as_sdt=0%2C31)

[^15]: C. Chen et al., “DCPLD-Net: A diffusion coupled convolution neural network for real-time power transmission lines detection from UAV-borne LiDAR data,” Int. J. Appl. Earth Observ. Geoinf., vol. 112, Aug. 2022, Art. no. 102960. [DOI](https://doi.org/10.1016/j.jag.2022.102960) [Google Scholar](https://scholar.google.com/scholar?as_q=DCPLD-Net%3A+A+diffusion+coupled+convolution+neural+network+for+real-time+power+transmission+lines+detection+from+UAV-borne+LiDAR+data&as_occt=title&hl=en&as_sdt=0%2C31)

[^16]: Z. Ma, W. Yao, Y. Niu, B. Lin, and T. Liu, “UAV low-altitude obstacle detection based on the fusion of LiDAR and camera,” Auton. Intell. Syst., vol. 1, no. 1, pp. 182–191, 2021. [DOI](https://doi.org/10.1007/s43684-021-00014-y) [Google Scholar](https://scholar.google.com/scholar?as_q=UAV+low-altitude+obstacle+detection+based+on+the+fusion+of+LiDAR+and+camera&as_occt=title&hl=en&as_sdt=0%2C31)

[^17]: D. Amigo, J. García, J. M. Molina, and J. Lizcano, “UAV simulation for object detection and 3D reconstruction fusing 2D LiDAR and camera,” in Proc. 17th Int. Conf. Soft Comput. Models Ind. Environ. Appl. (SOCO). Cham, Switzerland: Springer, Oct. 2022, pp. 31–40. [DOI](https://doi.org/10.1007/978-3-031-18050-7_4) [Google Scholar](https://scholar.google.com/scholar?as_q=UAV+simulation+for+object+detection+and+3D+reconstruction+fusing+2D+LiDAR+and+camera&as_occt=title&hl=en&as_sdt=0%2C31)

[^18]: U. Olawoye and J. N. Gross, “UAV position estimation using a LiDAR-based 3D object detection method,” in Proc. IEEE/ION Position, Location Navigat. Symp. (PLANS), Apr. 2023, pp. 46–51. [IEEE](https://ieeexplore.ieee.org/document/10139979) [Google Scholar](https://scholar.google.com/scholar?as_q=UAV+position+estimation+using+a+LiDAR-based+3D+object+detection+method&as_occt=title&hl=en&as_sdt=0%2C31)

[^19]: M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 4510–4520. [IEEE](https://ieeexplore.ieee.org/document/8578572) [Google Scholar](https://scholar.google.com/scholar?as_q=MobileNetV2%3A+Inverted+residuals+and+linear+bottlenecks&as_occt=title&hl=en&as_sdt=0%2C31)

[^20]: D. Zhou, Q. Hou, Y. Chen, J. Feng, and S. Yan, “Rethinking bottleneck structure for efficient mobile network design,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 680–697. [DOI](https://doi.org/10.1007/978-3-030-58580-8_40) [Google Scholar](https://scholar.google.com/scholar?as_q=Rethinking+bottleneck+structure+for+efficient+mobile+network+design&as_occt=title&hl=en&as_sdt=0%2C31)

[^21]: X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6848–6856. [IEEE](https://ieeexplore.ieee.org/document/8578814) [Google Scholar](https://scholar.google.com/scholar?as_q=ShuffleNet%3A+An+extremely+efficient+convolutional+neural+network+for+mobile+devices&as_occt=title&hl=en&as_sdt=0%2C31)

[^22]: N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2018, pp. 122–138. [DOI](https://doi.org/10.1007/978-3-030-01264-9_8) [Google Scholar](https://scholar.google.com/scholar?as_q=ShuffleNet+V2%3A+Practical+guidelines+for+efficient+CNN+architecture+design&as_occt=title&hl=en&as_sdt=0%2C31)

[^23]: J. Lin, W.-M. Chen, H. Cai, C. Gan, and S. Han, “MCUNetV2: Memory-efficient patch-based inference for tiny deep learning,” 2021, arXiv:2110.15352. [Google Scholar](https://scholar.google.com/scholar?as_q=MCUNetV2%3A+Memory-efficient+patch-based+inference+for+tiny+deep+learning&as_occt=title&hl=en&as_sdt=0%2C31)

[^24]: K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “GhostNet: More features from cheap operations,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 1577–1586. [IEEE](https://ieeexplore.ieee.org/document/9157333) [Google Scholar](https://scholar.google.com/scholar?as_q=GhostNet%3A+More+features+from+cheap+operations&as_occt=title&hl=en&as_sdt=0%2C31)

[^25]: M. Tan et al., “MnasNet: Platform-aware neural architecture search for mobile,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Long Beach, CA, USA, Jun. 2019, pp. 2820–2828. [IEEE](https://ieeexplore.ieee.org/document/8954198) [Google Scholar](https://scholar.google.com/scholar?as_q=MnasNet%3A+Platform-aware+neural+architecture+search+for+mobile&as_occt=title&hl=en&as_sdt=0%2C31)

[^26]: J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3D proposal generation and object detection from view aggregation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1–8. [IEEE](https://ieeexplore.ieee.org/document/8594049) [Google Scholar](https://scholar.google.com/scholar?as_q=Joint+3D+proposal+generation+and+object+detection+from+view+aggregation&as_occt=title&hl=en&as_sdt=0%2C31)

[^27]: X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 6526–6534. [IEEE](https://ieeexplore.ieee.org/document/8100174) [Google Scholar](https://scholar.google.com/scholar?as_q=Multi-view+3D+object+detection+network+for+autonomous+driving&as_occt=title&hl=en&as_sdt=0%2C31)

[^28]: C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets for 3D object detection from RGB-D data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 918–927. [IEEE](https://ieeexplore.ieee.org/document/8578200) [Google Scholar](https://scholar.google.com/scholar?as_q=Frustum+PointNets+for+3D+object+detection+from+RGB-D+data&as_occt=title&hl=en&as_sdt=0%2C31)

[^29]: K. Shin, Y. P. Kwon, and M. Tomizuka, “RoarNet: A robust 3D object detection based on RegiOn approximation refinement,” in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2019, pp. 2510–2515. [IEEE](https://ieeexplore.ieee.org/document/8813895) [Google Scholar](https://scholar.google.com/scholar?as_q=RoarNet%3A+A+robust+3D+object+detection+based+on+RegiOn+approximation+refinement&as_occt=title&hl=en&as_sdt=0%2C31)

[^30]: S. Pang, D. Morris, and H. Radha, “CLOCs: Camera-LiDAR object candidates fusion for 3D object detection,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Las Vegas, NV, USA, Oct. 2020, pp. 10386–10393. [IEEE](https://ieeexplore.ieee.org/document/9341791) [Google Scholar](https://scholar.google.com/scholar?as_q=CLOCs%3A+Camera-LiDAR+object+candidates+fusion+for+3D+object+detection&as_occt=title&hl=en&as_sdt=0%2C31)

