# Near-Sensor LiDAR and Visual Feature Extraction and Communication for Low-Latency Roadside Cooperative Perception

## Abstract

Autonomous driving technologies are swiftly evolving, characterized by two main strategies: 1) single-vehicle autonomous driving (SVAD) and 2) vehicle-infrastructure cooperative autonomous driving (VICAD). SVAD depends entirely on the vehicle’s internal sensors and processing capabilities, whereas VICAD benefits from a synergistic network combining roadside infrastructure, connected vehicles, and cloud services to boost safety and efficiency. Nevertheless, VICAD encounters challenges with high-bandwidth data transmission and perception latency. To mitigate these concerns, we introduce an innovative intelligent roadside unit (I-RSU) platform integrating perception, computing, and communication into one cohesive system. The platform features dual neural processing units (NPUs) for the effective extraction of image and LiDAR features, alongside a Cellular-V2X (C-V2X) communication module, all realized on a field-programmable gate array (FPGA). This setup minimizes latency and expenses by enabling computation near the sensors and facilitating selective data transmission. Our system also supports multimodal fusion, enhancing overall perception and safety. Through extensive real-world trials and simulations, our system demonstrates a substantial reduction in end-to-end latency, providing a scalable solution for VICAD scenarios.

## Authors

Wei Zhang *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0009-0006-3274-8867](https://orcid.org/0009-0006-3274-8867)

Yuhang Gu *School of Information and Communication Engineering, Shanghai University, Shanghai, China*

Beining Zhao *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0009-0006-4550-7974](https://orcid.org/0009-0006-4550-7974)

Qingyu Deng *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0009-0004-5805-3866](https://orcid.org/0009-0004-5805-3866)

Xinyu Chen *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0009-0008-4717-1962](https://orcid.org/0009-0008-4717-1962)

Yi Shi *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0000-0002-3240-7900](https://orcid.org/0000-0002-3240-7900)

Limin Jiang *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0009-0008-7034-5780](https://orcid.org/0009-0008-7034-5780)

Shan Cao *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0000-0003-3713-8671](https://orcid.org/0000-0003-3713-8671)

Zhiyuan Jiang *School of Information and Communication Engineering, Shanghai University, Shanghai, China* [ORCID: 0000-0002-8522-5721](https://orcid.org/0000-0002-8522-5721)

Ruiqing Mao *Department of Electronic Engineering, Tsinghua University, Beijing, China* [ORCID: 0000-0001-7169-3922](https://orcid.org/0000-0001-7169-3922)

Sheng Zhou *Department of Electronic Engineering, Tsinghua University, Beijing, China* [ORCID: 0000-0003-0651-0071](https://orcid.org/0000-0003-0651-0071)

## Publication Information

**Journal:** IEEE Internet of Things Journal **Year:** 2025 **Volume:** 12 **Issue:** 17 **Pages:** 36713-36729 **DOI:** [10.1109/JIOT.2025.3583443](https://doi.org/10.1109/JIOT.2025.3583443) **Article Number:** 11052245 **ISSN:** Electronic ISSN: 2327-4662, CD: 2372-2541

## Metrics

**Total Downloads:** 110

## Funding

- National Natural Science Foundation of China (NSFC) (Grant: 62271300 and 12141107)
- Shanghai Municipal Science and Technology Commission (Grant: 24DP1501100 and 24DP1500600)

---

## Keywords

**IEEE Keywords:** Sensors, Laser radar, Real-time systems, Feature extraction, Cameras, Hardware, Data communication, Autonomous vehicles, Accuracy, Safety

**Index Terms:** Cooperative Perception, LiDAR Features, Data Transmission, Communication Module, Roadside Units, Neural Network, Detection Accuracy, Feature Maps, Sensor Data, Object Detection, Pedestrian, Point Cloud, Real-time Performance, Small Objects, Transmission Delay, High Latency, Point Cloud Data, Lidar Data, Mobile Edge Computing, Large-scale Deployment, LiDAR Sensor, KITTI Dataset, LiDAR Point Clouds, Calculation Module, Camera Data, Real-time Data, Flow Data, External Memory, Convolutional Neural Network

**Author Keywords:** Cellular-V2X (C-V2X), cooperative perception, neural processing unit (NPU), vehicle-infrastructure cooperative autonomous driving (VICAD)

## SECTION I. Introduction

Autonomous driving is rapidly emerging as a crucial technology influencing the future landscape of the automotive sector. Recent developments have branched into two primary technical strategies: 1) single-vehicle autonomous driving (SVAD) and 2) vehicle-infrastructure cooperative autonomous driving (VICAD). The SVAD strategy entails each vehicle utilizing its own array of sensors and computing capabilities to understand its surroundings and make decisions in real-time. Nevertheless, this method encounters challenges due to its limited sensor range, restricted field of view, and constrained computational power, potentially impacting its capability in intricate or changing driving scenarios. In contrast, VICAD systems make use of cooperative interactions between vehicles, roadside infrastructure, and cloud-based servers, facilitating shared environmental perception and distributed processing of data, which helps to mitigate the limitations of standalone sensors.

VICAD systems have many potential applications. In urban areas with large vehicles blocking views, roadside units using cooperative sensing can assist vehicles in detecting hidden pedestrians or cyclists. VICAD offers real-time map updates, providing vehicles with current information on road conditions like construction zones and obstacles, enhancing route planning and safety. In unusual traffic situations like stalled vehicles or incidents, roadside infrastructure can quickly detect these issues and inform nearby vehicles, enabling them to adjust their driving strategies. Efficient computation and communication are essential for VICAD to process multimodal sensor data and provide real-time perception updates. Computation modules must quickly integrate image and LiDAR data, while communication modules need to relay essential information with minimal delay for prompt decision-making.

Despite these advantages, VICAD systems face several major challenges. VICAD’s real-world application [^1], [^2], [^3] involves installing sensors such as cameras and LiDAR, along with mobile edge computing (MEC) nodes and roadside communication units (RSUs) [^4], to enable real-time data processing, as shown in Fig. 1. Transmitting high-bandwidth raw sensor data between sensors and processing units demands significant network resources and raises end-to-end latency. Large-scale LiDAR point clouds and high-resolution images produce massive data streams, demanding significant computational power for real-time object detection and data fusion. Inefficient communication frameworks can worsen processing delays by overloading network bandwidth with redundant or unoptimized sensor data. The separation of perception, computation, and communication across devices leads to inefficient data flow and higher hardware costs. Current VICAD implementations often spread computation over different processors, leading to extra latencies from memory access delays and fragmented data transfer. Heterogeneous system components from various manufacturers often lack a unified processing pipeline, resulting in inconsistent cooperative perception outcomes.

![Figure 1](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang1-3583443-large.gif)

*Fig. 1. Components of roadside equipment in VICAD systems, including sensors such as cameras and LiDAR, MEC nodes, and communication devices (RSUs).*

To mitigate these issues, we introduce an innovative intelligent roadside unit (I-RSU) platform that integrates perception, computing, and communication into a single device. The perception module provides dedicated interfaces for camera and LiDAR sensors and processes raw sensor data in real time, including image cropping, resizing, and LiDAR voxelization, to reduce computational complexity before feature extraction. The computing module includes dual neural processing units (NPUs) designed for image and LiDAR feature extraction. The communication module, based on the Cellular-V2X (C-V2X) protocol stack, ensures low-latency data exchange between roadside units and vehicles. We validated our system through comprehensive experiments featuring actual roadside deployments, the KITTI dataset [^5], and the DAIR-V2X-I dataset [^6]. For object detection with cameras, our system achieved an end-to-end latency of under 100 ms, with negligible accuracy degradation when using quantized models. When integrating image and LiDAR data, our platform significantly enhanced detection accuracy, particularly in difficult conditions with pedestrians and cyclists. These findings affirm that our system effectively lowers latency while sustaining strong performance in intricate, real-world driving situations.

Our contributions can be summarized as follows.

1. *Unified Perception, Computation, and Communication:* Our platform combines perception, processing, and communication into one device, simplifying hardware and minimizing communication delays compared to traditional VICAD systems.
2. *Near-Sensor Processing:* Image and LiDAR data preprocessing modules are designed to perform computations close to the sensors, greatly reducing processing latency.
3. *Adaptive Data Transmission:* Our system offers flexible data transmission options, allowing users to choose between sending final detection results or intermediate feature maps from neural network processing. This optimizes data flow for the specific application.
4. *Multimodal Fusion Capability:* Our platform, featuring cameras and LiDAR, enables the use of multimodal fusion algorithms, improving safety by offering more dependable detection and perception capabilities.
5. *Real-World Validation:* We conduct thorough experiments using real deployment data and established datasets. Our system offers lower end-to-end latency than distributed VICAD architectures, making it more effective in dynamic traffic scenarios like intersections. On-device processing and optimized communication enable quicker responses to unexpected road conditions, improving safety and system efficiency.

The remainder of this article is organized as follows. In Section II, we review recent works on LiDAR-based detection and multimodal fusion algorithms and related roadside systems from various countries. Section III presents the implementation details of our proposed system. In Section IV, we elaborate on the experimental setup. Section V shows the simulation results. Finally, Section VI concludes this article.

## SECTION II. Related Works

### A. VICAD Systems

VICAD systems leverage roadside infrastructure to enhance autonomous driving by integrating sensing, computing, and communication. Traditional implementations, such as Huawei’s cloud-edge system and Baidu’s Apollo Air [^1], deploy multisensor roadside units to support vehicle perception. Similarly, projects like Japan’s Smartway [^2] and the U.S. ITS program [^3] utilize DSRC and C-V2X to improve vehicle coordination.

Existing roadside perception systems use a variety of hardware setups. Zhang et al. [^7] proposed a cloud-edge cooperative system using NVIDIA Jetson AGX Xavier edge devices for local processing before sending data to cloud servers. This lessens the computational load on vehicles but raises network reliance, posing challenges for real-time safety-critical applications. Xiang et al. [^8] created a multisensor fusion system that integrates LiDAR and cameras, linked to an Intel i9-10900X and an RTX 3090 for processing. Data transmission between devices adds latency, and high hardware costs restrict large-scale deployment. Vignarca et al. [^9] developed a localization system that utilizes cameras and an NVIDIA Jetson Nano for visual tracking. In this setup, the distributed devices cause high latency in image processing and data transmission. Other studies examine cooperative roadside perception with networked units. Networked RSPUs [^10] utilize multiple RSUs equipped with laptop PCs for data processing and an Intel Wi-Fi 6 AX200 module for communication. However, large-scale deployment of such a system incurs significant cost, and its data transmission suffers from high latency.

While these distributed architectures improve roadside perception, they suffer from high hardware costs, increased interdevice latency, and complex system synchronization. In contrast, our proposed system integrates perception, computation, and communication within a single field-programmable gate array (FPGA)-based platform. By processing sensor data directly on the FPGA, our approach minimizes transmission delays, reduces reliance on expensive computing devices and cloud computing, and offers a cost-effective and scalable solution for real-time cooperative autonomous driving.

### B. Perception and Computation Hardware Platforms

Considering both power consumption and performance, the convolutional neural network (CNN) remains the mainstream solution in current perception and computation hardware platforms. References [^11] and [^12] propose reconfigurable CNN accelerators. Reference [^11] optimizes convolution operations and pooling circuits, as well as the arrangement of weights and feature data in memory, thereby reducing bandwidth access and enhancing computational performance under resource-constrained conditions. Reference [^12] enhances the flexibility and resource utilization of the platform by reconstructing and optimizing the multiply-accumulate (MAC) unit, and improves model inference speed through the use of INT8 quantization. Target recognition plays an important role in perception and computation hardware platforms, and YOLO is widely used for it because of its high precision and low latency. Reference [^13] proposes a software-hardware co-design method based on OpenCL for accelerating YOLOv2. It uses PCIe for communication between the PC and the FPGA, achieving a throughput of 2.13 TOPS. However, because it depends on a PC, this method cannot meet the low-power and real-time requirements of edge computing. Reference [^14] deploys the official CNN acceleration IP of Xilinx on the ZYNQ system, which significantly reduces the development complexity, but faces the problem of low computational density. Reference [^15] achieved a throughput of 95.08 GOPS through software-hardware co-optimization of YOLOv3-tiny, but faces challenges in adaptability and configuration flexibility. References [^16] and [^17] implement the hardware deployment of YOLOv4-tiny and YOLOv5, respectively. However, the use of Xilinx’s HLS tool in the hardware design leads to higher latency and power consumption of the platform, as well as poor compatibility.

### C. LiDAR-Based Perception

Recent progress in LiDAR-based object detection features several crucial algorithms, each with unique advantages and drawbacks. Grid-based strategies, like VoxelNet [^18] and SECOND [^19], transform 3-D point clouds into organized grids to streamline processing using CNNs. However, they may encounter information loss due to discretization. In contrast, point-based methods, such as PointNet [^20] and its successors [^21], [^22], handle raw point clouds directly, maintaining detailed spatial data to improve accuracy for complex forms, albeit at the expense of higher computational demands. Hybrid techniques, exemplified by PointPillars [^23], merge grid-based efficiency with the precise detail of point-based approaches, finding a balance between accuracy and computational load. Furthermore, advanced methods, such as PV-RCNN [^24], enhance point cloud processing by integrating 3-D voxel grids, improving feature extraction and robustness, though they may demand substantial computational resources. In our system, maintaining a balance between computational demands and algorithmic adaptability is crucial. We have refined and altered the PointPillars pillar feature net (PFN) module [^23] for processing LiDAR data. This method efficiently segments point cloud data into pillars, offering lower computational complexity compared to other techniques. The pseudo-images generated through this process act as effective inputs for further algorithms, enhancing the system’s adaptability.

### D. Joint Camera-LiDAR Perception

In the field of LiDAR-camera fusion aimed at 3-D object detection, several algorithms have been designed to efficiently merge the depth information from LiDAR with the detailed visual data from cameras. Early methods like PointFusion [^25] and MV3D [^26] were created to integrate these two data forms but often face challenges when it comes to aligning features from different sensor modalities. AVOD [^27] proposes an anchor-based technique to enhance detection accuracy, yet it demands a considerable amount of computational power. F-PointNet [^28] increases efficiency by narrowing the focus to particular regions of interest identified from 2-D images, although its success depends on the caliber of the initial 2-D proposals. PointPainting [^29] enriches LiDAR data with semantic information from images, achieving improved detection in complex settings; however, this approach introduces additional latency. Algorithms like PI-RCNN [^30] and DeepFusion [^31] use sophisticated deep learning methods to integrate data, yielding reliable detection even in cluttered areas, but pose a high computational burden. The 3D-CVF [^32] method enhances the fusion by dynamically adjusting the feature combination from both sensors, which aids in adaptability across various conditions, yet it increases computational demand. BEVFusion [^33] centers around constructing bird’s-eye-view models that facilitate accurate detection through a refined fusion method, though it necessitates efficient management of computational resources. In this study, we implement a result-level fusion approach by combining 2-D detection results from YOLOv3 [^34] and PointPillars [^23] using a voting-based strategy. Here, confidence scores from the neural networks function as votes for each detected class, with the class receiving the most votes being selected as the fused classification result.
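The voting step described above can be illustrated with a minimal sketch. It assumes that a camera detection and a LiDAR detection have already been associated with the same object, and that each detector reports one confidence per class; the function name `fuse_by_vote` and the example scores are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def fuse_by_vote(camera_scores, lidar_scores):
    """Sum per-class confidences from both detectors and return the winning class.

    Each argument maps a class name to the confidence reported by one detector
    for the same (already matched) object. This matching is an assumption.
    """
    votes = defaultdict(float)
    for scores in (camera_scores, lidar_scores):
        for cls, conf in scores.items():
            votes[cls] += conf  # the confidence acts as the weight of the vote
    best = max(votes, key=votes.get)
    return best, dict(votes)


# Hypothetical scores: the camera slightly prefers "pedestrian",
# while the LiDAR branch clearly prefers "cyclist".
camera = {"pedestrian": 0.55, "cyclist": 0.40}
lidar = {"cyclist": 0.70, "pedestrian": 0.20}
label, votes = fuse_by_vote(camera, lidar)
print(label, votes)
```

Because the votes are confidence-weighted, a confident detector can overrule a hesitant one, which is one plausible reading of how result-level fusion helps with small objects such as pedestrians and cyclists.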

## SECTION III. System Design

### A. General Architecture of the System

The system utilizes a modular approach, separating the core board from the baseboard, connected via high-speed backplane connectors. This design significantly reduces the overall product size while maintaining functionality and performance.

The core board integrates two Xilinx chips as outlined in Fig. 2: XCZ15EG [^35] and XC7K410T [^36] connected through a high-speed Serdes interface [^37], facilitated by Xilinx’s Serdes IP for seamless communication. In addition to the dual Xilinx chips, the core board features multiple DDR memory modules, which can be accessed via the PS or Xilinx DDR Control IP for efficient data operations. The system incorporates a dual-access Flash memory shared between the XCZ15EG and XC7K410T to enable flexible bitstream updates. Since the XCZ15EG operates on a Linux system, its bitstream is loaded from an SD card. The bitstream for the XC7K410T can also be stored on the SD card and transferred to the shared Flash via the XCZ15EG. After the transfer, the system switches the Flash interface, allowing the XC7K410T to load the bitstream and complete its boot sequence.

![Figure 2](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang2-3583443-large.gif)

*Fig. 2. Hardware development board with labeled components.*

The baseboard is designed to meet diverse application requirements by providing various peripheral interfaces, as shown in Fig. 2. These include a UART interface for system logging, a gigabit Ethernet port, a JTAG interface for FPGA debugging, a MIPI interface for connecting cameras, an SD card slot, and USB ports for 4G/5G module connectivity. Additionally, the baseboard is equipped with an AD9361 radio frequency (RF) chip [^38] and a security chip that adheres to communication security standards, enhancing the system’s adaptability for various use cases.

In terms of workflow, as shown in Fig. 3, the XCZ15EG handles camera-related data processing. Camera data is transmitted via the MIPI interface, undergoing preprocessing steps, such as cropping, resizing, fixed-point quantization, and reordering. The processed data is then stored in DDR memory, from which the NPU retrieves it to perform object detection. The NPU is a custom-designed module optimized for neural network inference, leveraging parallelized data processing to accelerate computation as illustrated in Section III-C. The selected offline-trained neural network model is processed to extract weight parameters, which are stored in DDR memory, while the network structure is used to generate execution instructions for the NPU, guiding operations, such as convolution. After inference, the detection results or feature maps are stored back in DDR, ready for transmission via the communication module. Simultaneously, the XC7K410T manages LiDAR data processing. The raw LiDAR data is preprocessed through the PFN and convolution modules, transforming point clouds into pseudo-images, which are then stored in DDR memory. The NPU accesses these pseudo-images for further algorithm acceleration, performing feature extraction and object detection. The results are subsequently written back to DDR for transmission through the communication module.

![Figure 3](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang3-3583443-large.gif)

*Fig. 3. System data processing flow. The XCZ15EG handles camera data preprocessing and object detection, while the XC7K410T manages LiDAR data preprocessing and feature extraction. Both NPUs perform their tasks before transmitting the results via the communication module.*

### B. Camera and LiDAR Data Preprocessing

In our system, both the image and LiDAR point cloud preprocessing modules are implemented on the FPGA, ensuring efficient real-time processing for the downstream perception tasks in autonomous driving.

#### 1) Camera Data Preprocessing:

The image preprocessing module handles two key operations: 1) cropping and 2) resizing, both crucial for preparing the camera input for the object detection algorithm. The cropping step is necessary because the camera captures a broad field of view, which may include unnecessary areas. By focusing on key regions, such as the road or objects within the driving path, cropping helps reduce the amount of irrelevant data processed by subsequent stages.

After cropping, the image undergoes resizing using the Inter-Area method, which is particularly effective when downscaling images. The Inter-Area method works by averaging pixel values from the larger source image, preserving important details while reducing the image size to match the input dimensions required by the object detection model. This resizing not only ensures that the image fits the algorithm’s input specifications but also minimizes computation. On the FPGA, the original image data arrives over MIPI in line order. Our hardware performs image cropping and resizing during the transmission gap between lines, using a pipelined architecture. This approach overlaps the preprocessing delay with data acquisition, achieving near-sensor preprocessing.
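The line-ordered averaging can be sketched in software as follows. This is a minimal illustration that assumes an integer downscale factor, for which area interpolation reduces to a block mean; the generator name `area_downscale_rows` is hypothetical, and the actual hardware pipeline is not described at this level of detail in the text.

```python
def area_downscale_rows(rows, factor):
    """Yield downscaled rows as soon as `factor` source rows have arrived.

    `rows` is any iterable of equal-length lists of integer pixel values, which
    mimics pixels arriving line by line. Only one row of partial sums is kept,
    so no full frame buffer is needed. Trailing rows or columns that do not
    fill a complete block are dropped in this sketch.
    """
    acc = None
    count = 0
    area = factor * factor
    for row in rows:
        out_w = len(row) // factor
        # Horizontal pass: sum each group of `factor` neighbouring pixels.
        sums = [sum(row[x * factor:(x + 1) * factor]) for x in range(out_w)]
        # Vertical pass: accumulate the partial sums of consecutive lines.
        acc = sums if acc is None else [a + s for a, s in zip(acc, sums)]
        count += 1
        if count == factor:
            yield [(a + area // 2) // area for a in acc]  # rounded block mean
            acc, count = None, 0


image = [
    [0, 2, 4, 6],
    [2, 4, 6, 8],
    [10, 10, 20, 20],
    [10, 10, 20, 20],
]
small = list(area_downscale_rows(image, 2))
print(small)
```

Cropping fits the same streaming pattern: passing `(row[x0:x1] for row in image[y0:y1])` as `rows` restricts the computation to a region of interest before any averaging takes place.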

#### 2) LiDAR Data Preprocessing:

The LiDAR point cloud preprocessing module is designed to process data using the PFN, a key layer in the PointPillars network architecture [^23]. The first step is voxelization as shown in Fig. 3, where the incoming point cloud data is divided into fixed-size grids. All points within a grid cell are grouped into what is known as a pillar. The voxelization process is implemented as a pipelined design, which enables the system to output completed voxel coordinates and pillars in real time, reducing both memory usage and latency. The next step is to extract the pillar feature to generate the pseudo-image, which is called the feature extractor in Fig. 3.

In most conventional algorithms, all LiDAR data is stored in memory, and the pillars are computed only after voxelization has fully completed. However, we observed that after the first 128 points and the last 128 points are processed, there is no overlap in voxel coordinates according to the KITTI dataset [^5]. Therefore, once the LiDAR has rotated past a certain angle, the previously generated pillars can be output to the next stage without storing a full rotation of data, enabling a pipelined approach for continuous batch processing. The detailed process is presented in Fig. 3 and described as follows.

a) Point Preprocess:

In this project, we receive LiDAR point cloud data in packets, and processing begins upon receiving a valid packet signal. For points within the defined region of interest, the x and y coordinates are used to calculate the corresponding pillar coordinates through multiplication and bit-shifting techniques. After the pillar coordinates are calculated, the point cloud data is quantized for efficient storage and processing.
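A division by the pillar size can be replaced by a multiplication with a precomputed reciprocal followed by a right shift, which is the usual way to avoid a hardware divider. The sketch below assumes coordinates in integer millimetres, a pillar edge of 0.16 m, and the detection range commonly used with PointPillars on KITTI; none of these values is stated in the text, and all constant names are hypothetical.

```python
PILLAR_MM = 160                        # assumed pillar edge: 0.16 m
SHIFT = 24                             # fixed-point fraction bits (assumption)
RECIP = (1 << SHIFT) // PILLAR_MM + 1  # ceil(2**24 / 160) = 104858

# Assumed region of interest in millimetres.
X_MIN_MM, X_MAX_MM = 0, 69_120
Y_MIN_MM, Y_MAX_MM = -39_680, 39_680


def pillar_coords(x_mm, y_mm):
    """Return (px, py) for a point inside the region of interest, else None."""
    if not (X_MIN_MM <= x_mm < X_MAX_MM and Y_MIN_MM <= y_mm < Y_MAX_MM):
        return None  # points outside the region of interest are skipped
    # Multiply by the reciprocal and shift: equivalent to floor(offset / 160)
    # for every offset that can occur in the assumed range.
    px = ((x_mm - X_MIN_MM) * RECIP) >> SHIFT
    py = ((y_mm - Y_MIN_MM) * RECIP) >> SHIFT
    return px, py


print(pillar_coords(1_000, 0))
```

With these constants the rounding error of the reciprocal stays below one pillar for offsets under 262144 mm, so the shifted product matches exact integer division over the whole assumed range.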

b) Hash Mapping:

Our implementation is inspired by [^39], which introduces a voxel encoding accelerator (VEA) architecture for 3-D object detection, based on a voxel-centric approach. This includes a generalized voxel generator and a functional expander. The voxel generator organizes the voxel information into a hierarchical table, storing high-locality voxel data on-chip while placing memory-intensive point data off-chip, optimizing memory usage and computation. We also apply a randomly generated hash function matrix to the pillar coordinates, which serve as the hash key. This operation produces a unique hash mapping for each point, generating the voxel and point indexes.
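One hardware-friendly way to realize a "hash function matrix" is a matrix-vector product over GF(2): each output bit is the parity of the key bits selected by one random row. The sketch below shows that construction under stated assumptions. The text does not specify the matrix dimensions or how the key is packed, so the bit widths, the seed, and the function names are hypothetical.

```python
import random

COORD_BITS = 9              # assumed: each pillar coordinate fits in 9 bits
KEY_BITS = 2 * COORD_BITS
HASH_BITS = 10              # assumed: a 1024-entry index table


def make_hash_matrix(seed=0):
    """Return HASH_BITS random rows, each a KEY_BITS-wide bit mask."""
    rng = random.Random(seed)
    return [rng.getrandbits(KEY_BITS) for _ in range(HASH_BITS)]


def hash_key(px, py, matrix):
    """Hash the pillar coordinates with a binary matrix-vector product."""
    key = (px << COORD_BITS) | py
    h = 0
    for bit, row in enumerate(matrix):
        parity = bin(key & row).count("1") & 1  # XOR of the selected key bits
        h |= parity << bit
    return h


matrix = make_hash_matrix(seed=1)
print(hash_key(10, 20, matrix))
```

In logic, each output bit is a small XOR tree, so the hash costs no multipliers. Distinct keys can still collide, so a real design needs a collision policy, which this sketch does not model.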

c) Voxel Index Table Lookup and Memory Status Table Update:

Next, the system checks the voxel index table to determine whether the voxel index already exists in RAM.

1. If the voxel index exists, the system retrieves the corresponding RAM address, and the point data is placed in the appropriate column according to the point index. The point will be discarded when the point index exceeds the maximum allowable value.
2. If the voxel index does not exist, the system refers to the memory status table to find an address marked as “Empty” using the listless zerotree coding (LZC) algorithm [^40]. The new voxel index is stored in this location, and the memory status for that address is updated to “Available,” as shown by the gray-highlighted font in the table of Fig. 3.

d) Package Index Table and Point Index Table Update:

The package index table is updated with the current package number for the corresponding RAM address. The point index table is also updated, recording the number of points stored at each address. This information ensures that the correct number of points can be read when the pillar is output.

After all points in a package are processed, the system looks up the address corresponding to the current package number minus one in the package index table. The memory status table is then updated to mark that address as “Wait.” Points stored in memory marked as Wait are then output sequentially, as indicated by the red and purple fonts in the Voxelization module of Fig. 3. Once the data is output, the memory status changes from Wait to Empty, indicating that the address is now free for future use. A mask is generated based on the point index table to apply to the data being read.
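The interplay of the four tables in steps c) and d) can be modeled in software as a small state machine (Empty, Available, Wait, Empty). The sketch below is an interpretation of the description rather than the RTL: the class name `PillarBuffer`, the slot counts, and the linear search for a free slot (standing in for the hardware search for an address marked Empty) are assumptions.

```python
EMPTY, AVAILABLE, WAIT = "Empty", "Available", "Wait"


class PillarBuffer:
    """Software model of the voxel, status, package, and point index tables."""

    def __init__(self, n_slots=8, max_points=4):
        self.max_points = max_points
        self.status = [EMPTY] * n_slots      # memory status table
        self.voxel_at = [None] * n_slots     # voxel index table
        self.package_at = [None] * n_slots   # package index table
        self.count_at = [0] * n_slots        # point index table
        self.ram = [[] for _ in range(n_slots)]

    def add_point(self, voxel, point, package_no):
        """Store one point; return False if it had to be discarded."""
        addr = next((a for a, v in enumerate(self.voxel_at)
                     if v == voxel and self.status[a] == AVAILABLE), None)
        if addr is None:
            if EMPTY not in self.status:
                return False                  # no free address left
            addr = self.status.index(EMPTY)   # first address marked Empty
            self.status[addr] = AVAILABLE
            self.voxel_at[addr] = voxel
            self.count_at[addr] = 0
            self.ram[addr] = []
        self.package_at[addr] = package_no
        if self.count_at[addr] >= self.max_points:
            return False                      # point index exceeds the maximum
        self.ram[addr].append(point)
        self.count_at[addr] += 1
        return True

    def end_package(self, package_no):
        """Emit every pillar last touched by the previous package."""
        for addr, state in enumerate(self.status):
            if state == AVAILABLE and self.package_at[addr] == package_no - 1:
                self.status[addr] = WAIT
        out = []
        for addr, state in enumerate(self.status):
            if state == WAIT:
                n = self.count_at[addr]
                mask = [1] * n + [0] * (self.max_points - n)
                out.append((self.voxel_at[addr], list(self.ram[addr]), mask))
                self.status[addr] = EMPTY     # the address is free again
                self.voxel_at[addr] = None
        return out
```

A short usage example: after `buf = PillarBuffer(n_slots=2, max_points=2)`, points added with `package_no=0` remain buffered when `buf.end_package(0)` is called, and are emitted by `buf.end_package(1)` provided package 1 did not touch the same pillar.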

The generated pillars will be passed into the 1-D convolution module for feature extraction. Once processed, the coordinate information is combined with the extracted features, and the resulting data is mapped into the DDR as a pseudo-image for further stages of processing.
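The step from pillars to a pseudo-image can be sketched as follows. The sketch assumes the PointPillars-style recipe: a pointwise (kernel size 1) 1-D convolution with ReLU, a maximum over the valid points of each pillar, and a scatter of the result to the pillar's grid position. Batch normalization and quantization are omitted, and the function names and toy weights are hypothetical.

```python
def pillar_features(points, mask, weights):
    """Pointwise 1-D convolution + ReLU, then a max over the valid points.

    `points` is a list of per-point feature vectors, `mask` marks valid entries
    (padding has mask 0), and `weights[c]` holds the weights of output channel c.
    """
    n_out = len(weights)
    best = [0.0] * n_out
    for p, valid in zip(points, mask):
        if not valid:
            continue  # padded entries must not influence the result
        for c in range(n_out):
            act = max(0.0, sum(w * v for w, v in zip(weights[c], p)))
            best[c] = max(best[c], act)
    return best


def scatter(pillars, weights, height, width):
    """Write each pillar's feature vector to its (px, py) cell of the canvas."""
    n_out = len(weights)
    canvas = [[[0.0] * width for _ in range(height)] for _ in range(n_out)]
    for (px, py), points, mask in pillars:
        feat = pillar_features(points, mask, weights)
        for c in range(n_out):
            canvas[c][py][px] = feat[c]
    return canvas


weights = [[1.0, 0.0], [0.0, -1.0]]  # two output channels, two input features
pillars = [((1, 0), [[2.0, 3.0], [5.0, 1.0], [9.0, 9.0]], [1, 1, 0])]
pseudo_image = scatter(pillars, weights, height=2, width=3)
print(pseudo_image[0])
```

Empty grid cells stay at zero, so the pseudo-image has a fixed shape regardless of how many pillars a scan produces, which is what lets a standard 2-D CNN consume it.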

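**Algorithm 1:** 2-D Convolution Loops Algorithm

$$
\begin{array}{l}
\textbf{for } n\_iy = 0 \textbf{ to } N\_iy \textbf{ do} \\
\quad \textbf{for } n\_if = 0 \textbf{ to } N\_if \textbf{ do} \\
\quad\quad \textbf{for } n\_of = 0 \textbf{ to } N\_of \textbf{ do} \\
\quad\quad\quad \textbf{for } n\_ky = 0 \textbf{ to } N\_ky \textbf{ do} \\
\quad\quad\quad\quad \textbf{for } n\_kx = 0 \textbf{ to } N\_kx \textbf{ do} \\
\quad\quad\quad\quad\quad \textbf{for } n\_ix = 0 \textbf{ to } N\_ix \textbf{ do} \\
\quad\quad\quad\quad\quad\quad Pixel\_O[n\_of][n\_oy] \mathrel{+}= Pixel\_I[n\_if][n\_ix][n\_iy] * Weight[n\_if][n\_of][n\_ky][n\_kx] \\
\quad\quad\quad\quad\quad \textbf{end} \\
\quad\quad\quad\quad \textbf{end} \\
\quad\quad\quad \textbf{end} \\
\quad\quad \textbf{end} \\
\quad \textbf{end} \\
\textbf{end}
\end{array}
$$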

### C. NPU Design

The NPU design consists of two key components as shown in Fig. 4: a software-level compiler and a hardware-level inference accelerator.

![Figure 4](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang4-3583443-large.gif)

*Fig. 4. NPU compiler and hardware workflow. The NPU compiler converts the weight file of the neural network into ONNX format for the purpose of information extraction and quantization. The DDR stores the preprocessed image data and instructions generated after the network structure is compiled for the NPU Hardware. The hardware control is the responsibility of the controller, while the instructions are analyzed by the decoder. The feeder is used to prepare the weight and feature map for the succeeding convolution process module. The convolution process module primarily consists of a multiplier, adder, and activation. The results will be stored back in the DDR after the pooling module.*

The compiler processes the trained neural network model (such as YOLOv3 and ResNet listed in Fig. 3) by quantizing and reordering the weights to align with hardware computation rules. Additionally, the compiler generates execution instructions based on the network structure, including essential neural network parameters, such as DDR memory addresses for weights and feature maps, input/output dimensions, convolution kernel sizes, parallel input/output channels, dataflow configurations, and so on. These preparatory steps transform the neural network into a format that the NPU hardware can efficiently process. Once the NPU receives the processed feature maps as input, it executes network inference operations following the generated instructions.
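The two compiler outputs, quantized weights and per-layer instructions, can be illustrated with a minimal sketch. It assumes symmetric per-tensor quantization to signed 8-bit integers, which the text does not confirm for this NPU, and the `ConvInstruction` fields merely mirror the parameters listed above. The real instruction encoding is not given, so every name and value here is hypothetical.

```python
from dataclasses import dataclass


def quantize_symmetric(weights, bits=8):
    """Map float weights to signed integers that share one scale factor."""
    qmax = (1 << (bits - 1)) - 1                   # 127 for 8 bits
    peak = max(abs(w) for w in weights) or 1.0     # avoid a zero scale
    scale = peak / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


@dataclass
class ConvInstruction:
    """Hypothetical record of what one convolution instruction might carry."""
    weight_addr: int   # DDR address of the reordered weights
    ifmap_addr: int    # DDR address of the input feature map
    ofmap_addr: int    # DDR address of the output feature map
    in_shape: tuple    # (channels, height, width)
    out_shape: tuple   # (channels, height, width)
    kernel: tuple      # (k_y, k_x)
    p_if: int          # parallel input channels
    p_of: int          # parallel output channels


q, scale = quantize_symmetric([0.3, -1.0, 0.0])
instr = ConvInstruction(0x1000_0000, 0x2000_0000, 0x3000_0000,
                        (3, 416, 416), (16, 416, 416), (3, 3), 3, 16)
print(q, instr.kernel)
```

Dequantizing with `q[i] * scale` recovers each weight to within half a quantization step, which is consistent with the paper's report of negligible accuracy loss for quantized models, although the actual scheme may differ.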

The hardware design of the NPU is inspired by Hui’s work [^41], employing a configurable architecture that introduces a vector convolution mechanism supporting 2-D convolutions, as shown in Fig. 4. This approach enables efficient handling of various dataflow configurations and parallel processing schemes. The accelerator accommodates dynamic input/output channel dimensions and adapts to shrinking feature map sizes in deeper network layers. During execution, the NPU retrieves instructions from external memory, decodes them to generate control signals in the controller module, and loads essential data, such as feature maps, weights, and biases into on-chip memory to initiate convolution processes. Post-processing operations, including ReLU [^42], [^43] and Pooling [^44], [^45], refine the output before storing the final feature maps in DDR memory for further processing or transmission.

The specific convolution process executed by the NPU is detailed in Algorithm 1 and depicted in Fig. 5. The accelerator is designed to handle a variety of convolution kernel sizes, requiring a flexible data structure. It employs a weight-stationary dataflow where feature maps are streamed from DDR memory while keeping network weights stored on-chip. To optimize memory efficiency, the system prioritizes completing all computations along a feature map row before fetching new data, thus reducing redundant memory accesses. The NPU is configured with parallelism factors P_if and P_of at design time, reducing loop iterations to N_if/P_if and N_of/P_of, respectively. Loops 1 and 2 traverse the convolution kernel in a temporally optimized order, reshuffling the 2-D convolution process to minimize duplicated reads from overlapping receptive fields. The optimized operation sequence follows the order: Loop4, Loop1, Loop2, Loop6, Loop3, Loop5, capitalizing on row-wise feature map parallelism instead of iterating through the full feature map dimensions.
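The effect of the parallelism factors can be checked with a small software model. The sketch below processes one output row at a time and walks the channels in tiles of P_if by P_of, so the tile loops run N_if/P_if and N_of/P_of times (rounded up), while each weight is held fixed as a feature-map row streams past it. The loop order is chosen for readability and is not claimed to match the hardware schedule; the function names are hypothetical, and stride 1 without padding is assumed.

```python
def conv2d_reference(ifmap, weight):
    """Direct 2-D convolution; ifmap[c][y][x], weight[c][of][ky][kx]."""
    n_if, h, w = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    n_of, k_y, k_x = len(weight[0]), len(weight[0][0]), len(weight[0][0][0])
    out_h, out_w = h - k_y + 1, w - k_x + 1
    return [[[sum(ifmap[c][oy + ky][ox + kx] * weight[c][of][ky][kx]
                  for c in range(n_if)
                  for ky in range(k_y)
                  for kx in range(k_x))
              for ox in range(out_w)]
             for oy in range(out_h)]
            for of in range(n_of)]


def conv2d_tiled(ifmap, weight, p_if, p_of):
    """Same result, computed row by row in P_if x P_of channel tiles."""
    n_if, h, w = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    n_of, k_y, k_x = len(weight[0]), len(weight[0][0]), len(weight[0][0][0])
    out_h, out_w = h - k_y + 1, w - k_x + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(n_of)]
    tile_steps = 0
    for oy in range(out_h):                       # finish a row before moving on
        for of0 in range(0, n_of, p_of):          # ceil(N_of / P_of) iterations
            for if0 in range(0, n_if, p_if):      # ceil(N_if / P_if) iterations
                tile_steps += 1
                for ky in range(k_y):
                    for kx in range(k_x):
                        # In hardware this P_if x P_of block would run in parallel.
                        for of in range(of0, min(of0 + p_of, n_of)):
                            for c in range(if0, min(if0 + p_if, n_if)):
                                wgt = weight[c][of][ky][kx]   # stationary weight
                                row = ifmap[c][oy + ky]       # streamed row
                                for ox in range(out_w):
                                    out[of][oy][ox] += row[ox + kx] * wgt
    return out, tile_steps
```

For a 4-channel 5x5 input, six output channels, and 3x3 kernels, `conv2d_tiled(..., p_if=2, p_of=2)` reproduces the reference output in 3 x 3 x 2 = 18 tile steps, compared with 3 x 6 x 4 = 72 channel pairs without tiling.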

![Figure 5](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang5-3583443-large.gif)

*Fig. 5. Fine-grained convolution loop unrolling. Loop 1 corresponds to the horizontal dimension of the kernel, Loop 2 to the vertical dimension, Loop 3 to input channels, Loop 4 to the horizontal dimension of the feature map, and Loop 5 to the vertical dimension.*

![Figure 6](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang15-3583443-large.gif)

*Algorithm 1: 2-D Convolution Loops Algorithm*
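As a rough software analogue of the loop structure described above, the following Python sketch tiles the input and output channels by `P_if` and `P_of` and processes one full feature-map row per weight. It is not the authors' Algorithm 1: the function names, the toy tensor sizes, the exact nesting order, and the reading of Loop 6 as the output-channel loop are assumptions made for illustration (the Fig. 5 caption names only Loops 1 to 5).

```python
def conv2d_tiled(ifmap, weights, p_if, p_of):
    """Valid (no padding), stride-1 2-D convolution with channel tiling.

    ifmap:   [N_if][H][W] nested lists
    weights: [N_of][N_if][K][K] nested lists
    Returns (ofmap, tile_iterations).
    """
    n_if, h, w = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    n_of, k = len(weights), len(weights[0][0])
    h_out, w_out = h - k + 1, w - k + 1
    ofmap = [[[0] * w_out for _ in range(h_out)] for _ in range(n_of)]
    tile_iterations = 0
    for of0 in range(0, n_of, p_of):          # N_of / P_of output-channel tiles
        for if0 in range(0, n_if, p_if):      # N_if / P_if input-channel tiles
            tile_iterations += 1
            for y in range(h_out):            # Loop 5: feature-map rows
                for ky in range(k):           # Loop 2: kernel rows
                    for kx in range(k):       # Loop 1: kernel columns
                        # In hardware, the loops below would be unrolled in space.
                        for oc in range(of0, min(of0 + p_of, n_of)):      # assumed Loop 6
                            for ic in range(if0, min(if0 + p_if, n_if)):  # Loop 3
                                wgt = weights[oc][ic][ky][kx]  # weight held fixed for the row
                                src = ifmap[ic][y + ky]
                                dst = ofmap[oc][y]
                                for x in range(w_out):         # Loop 4: one full row
                                    dst[x] += wgt * src[x + kx]
    return ofmap, tile_iterations


def conv2d_reference(ifmap, weights):
    """Plain loop nest, used only to check the tiled version."""
    n_if, h, w = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    n_of, k = len(weights), len(weights[0][0])
    out = [[[0] * (w - k + 1) for _ in range(h - k + 1)] for _ in range(n_of)]
    for oc in range(n_of):
        for y in range(h - k + 1):
            for x in range(w - k + 1):
                out[oc][y][x] = sum(
                    weights[oc][ic][ky][kx] * ifmap[ic][y + ky][x + kx]
                    for ic in range(n_if) for ky in range(k) for kx in range(k)
                )
    return out


# Toy sizes (hypothetical): 4 input channels, 2 output channels, 5x5 maps, 3x3 kernels.
ifmap = [[[(c + y * x) % 5 for x in range(5)] for y in range(5)] for c in range(4)]
weights = [[[[(o + i + ky - kx) % 3 for kx in range(3)] for ky in range(3)]
            for i in range(4)] for o in range(2)]
out, iters = conv2d_tiled(ifmap, weights, p_if=2, p_of=1)
print(iters, out == conv2d_reference(ifmap, weights))
```

With `p_if=2` and `p_of=1`, the two outer loops run (2/1) × (4/2) = 4 times, which mirrors the reduction of the channel loops to N_of/P_of and N_if/P_if iterations.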

Additionally, the accelerator implements a flexible dataflow system using vectorized processing, ensuring efficient weight and feature map management across multiple parallel output channels. A computing mask array is introduced to accommodate varying feature map widths by padding rows with zeros and utilizing mask signals in the vector processing engine (VPE) to regulate valid computations. This technique allows the system to adapt to different convolution configurations while maintaining high efficiency.

Furthermore, multisized convolution kernel support is realized through a hierarchical storage structure, which manages partial sums in both kernel row and column directions. The vector convolution mechanism, leveraging Hadamard product [^46] and vector addition operations, enables simultaneous processing of multiple feature map rows and convolution kernels. This approach reduces intermediate partial sum storage overhead. The hierarchical storage system facilitates efficient accumulation of these sums, allowing the accelerator to flexibly adjust kernel sizes by modifying the number of accumulations in processing units, eliminating the need for complex dataflow reconfigurations. The NPU currently supports a wide range of neural network operators, as outlined in Table I, enabling the execution of diverse deep learning models for real-time applications.
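To make the masked vector convolution idea concrete, the sketch below computes one output row on a fixed-width vector engine using only element-wise (Hadamard) products and vector additions. The function name `vector_row_conv`, the `lane_width` parameter, and the way the mask is applied are illustrative assumptions rather than a description of the actual VPE.

```python
def vector_row_conv(rows, kernel, lane_width):
    """Compute one output row from K input rows on a fixed-width vector engine.

    rows:       K input feature-map rows, each of length W
    kernel:     K x K weights
    lane_width: number of vector lanes (hypothetical), must be >= W - K + 1
    """
    k = len(kernel)
    w_out = len(rows[0]) - k + 1
    if lane_width < w_out:
        raise ValueError("lane_width is too small for this row")
    # Mask marks which lanes carry valid outputs for this feature-map width.
    mask = [1 if lane < w_out else 0 for lane in range(lane_width)]
    acc = [0] * lane_width
    for kr in range(k):                      # partial sums along kernel rows
        # Zero-pad the row so every shifted window fills all lanes.
        padded = list(rows[kr]) + [0] * (lane_width + k - 1 - len(rows[kr]))
        for kc in range(k):                  # partial sums along kernel columns
            window = padded[kc:kc + lane_width]
            weight_vec = [kernel[kr][kc]] * lane_width
            # Hadamard product, gated by the mask.
            product = [a * b * m for a, b, m in zip(window, weight_vec, mask)]
            # Vector addition into the accumulator.
            acc = [s + p for s, p in zip(acc, product)]
    return acc[:w_out]


rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
ones_3x3 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(vector_row_conv(rows, ones_3x3, lane_width=8))      # 3x3 kernel: 9 accumulations
print(vector_row_conv([[1, 2, 3]], [[2]], lane_width=4))  # 1x1 kernel: 1 accumulation
```

Changing the kernel size only changes how many times the accumulate step runs (K × K), which is the property the paragraph above attributes to the hierarchical partial-sum storage.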

![Figure 7](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t1-3583443-large.gif)

*TABLE I*

### D. C-V2X Module Design

C-V2X technology, introduced by the 3rd Generation Partnership Project (3GPP) in LTE Release 14 (2017) [^47], was developed to support low-latency, high-reliability communication for intelligent transportation and autonomous driving. Operating in the 5.9 GHz ITS band, C-V2X facilitates direct (V2V, V2I, V2P) and network-based (V2N) communication, enabling seamless data exchange between vehicles and roadside infrastructure. Its applications include cooperative adaptive cruise control, emergency vehicle warnings, and vehicle platooning, enhancing traffic safety and efficiency. As a key enabler for VICAD systems, C-V2X allows real-time data sharing, improving situational awareness and decision-making in complex driving environments. Building on this foundation, our system employs an ARM-FPGA co-design architecture, as illustrated in Fig. 6, to meet the stringent real-time requirements of autonomous driving. The ARM processor oversees control tasks, while the FPGA accelerates bit-level and symbol-level processing. Table II compares the performance of our C-V2X communication module against 3GPP standard benchmarks.

![Figure 8](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t2-3583443-large.gif)

*TABLE II*

![Figure 9](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang6-3583443-large.gif)

*Fig. 6. C-V2X data processing flow on FPGA and ARM.*

On the ARM side, data such as feature maps, weights, and biases are loaded into on-chip memory via the data loading module, and instructions are decoded to generate control signals for the transmission. This step involves error detection using a cyclic redundancy check (CRC), followed by encoding with tail-biting convolutional codes and rate matching, before the data are sent for further processing. On the reception side, the ARM processing includes channel estimation, frequency and timing offset estimation, and compensation. Channel estimation is carried out by calculating channel coefficients from reference signals, which are then interpolated over the entire frame. The ARM also handles the noise power estimation used in subsequent soft-bit demodulation.

On the FPGA side, transmission processing includes scrambling, modulation, and resource mapping, followed by fast Fourier transform (FFT)-based OFDM signal generation. Additionally, vector convolution and accumulation operations optimize symbol processing, with a cyclic prefix (CP) added before transmission. On the reception side, the FPGA applies an FFT to convert the received time-domain signals into the frequency domain. It then performs operations such as DC removal, resource demapping, and soft-bit calculation through log-likelihood ratio (LLR) estimation. Additionally, the FPGA handles frequency and timing offset compensation, which is essential for synchronizing the received data with the transmitted signals. The received symbols are demodulated and decoded using a turbo decoder, followed by CRC verification.
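The soft-bit step can be illustrated with a small Python sketch of QPSK modulation and LLR computation. The text does not state which modulation order the module uses, so QPSK, the Gray bit-to-symbol mapping, and the function names below are assumptions; `noise_var` stands in for the noise power estimate mentioned for the ARM side.

```python
import math


def qpsk_modulate(bits):
    """Map bit pairs to unit-energy QPSK symbols (bit 0 -> +, bit 1 -> - on each axis)."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    a = 1 / math.sqrt(2)
    return [complex(a * (1 - 2 * bits[i]), a * (1 - 2 * bits[i + 1]))
            for i in range(0, len(bits), 2)]


def qpsk_llrs(symbols, noise_var):
    """Per-bit LLRs, ln P(b=0|y) / P(b=1|y), for equalized QPSK symbols.

    noise_var is the total complex noise variance (sigma^2), so each axis
    sees sigma^2 / 2 and the LLR reduces to 2*sqrt(2)*y_axis / sigma^2.
    """
    scale = 2 * math.sqrt(2) / noise_var
    llrs = []
    for y in symbols:
        llrs.extend((scale * y.real, scale * y.imag))
    return llrs


def hard_decision(llrs):
    """Positive LLR favours bit 0, negative LLR favours bit 1."""
    return [0 if llr >= 0 else 1 for llr in llrs]


tx_bits = [0, 0, 0, 1, 1, 0, 1, 1]
tx_symbols = qpsk_modulate(tx_bits)
# Small deterministic perturbation standing in for channel noise.
rx_symbols = [s + complex(0.1, -0.05) for s in tx_symbols]
llrs = qpsk_llrs(rx_symbols, noise_var=0.1)
print(hard_decision(llrs) == tx_bits)
```

In a full receiver these LLRs would be passed to the channel decoder instead of being sliced directly; the hard decision here only checks the sign convention.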

For RF processing, the system manages signal amplification, filtering, and modulation at the RF front end. The timing synchronization and frequency offset estimation ensure that the received signal is aligned with the transmitted signal, minimizing phase noise and distortion.

### E. Algorithm Design

For camera-based object detection, we employed YOLOv3-tiny, a lightweight version of the YOLOv3 architecture [^34] on our platform. This model is designed to be computationally efficient, making it suitable for real-time applications. It utilizes a skip connection structure, which helps to preserve features from earlier layers while allowing the network to focus on detecting objects at different scales. This architecture strikes a balance between speed and accuracy, making it ideal for embedded systems.

For LiDAR-based object detection, we adopted PointPillars [^23], with modifications to optimize the transmission of intermediate features or final results. If intermediate features are transmitted, we address the large size of the concatenated three-layer features generated by the PointPillars backbone by introducing $1\times 1$ convolutions to reduce the number of channels and $3\times 3$ convolutions to downscale the feature map size, as shown in Fig. 7. This compression reduces the size of the transmitted feature map, and at the receiving end, the features are reconstructed for subsequent object detection.

![Figure 10](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang7-3583443-large.gif)

*Fig. 7. Partially modified network structure of the PointPillars architecture.*
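The effect of the channel-reducing and downscaling convolutions on payload size can be estimated with simple shape arithmetic. In the sketch below, the input shape (384 × 248 × 216), the reduced channel count (64), the stride of 2, and the 8-bit activations are hypothetical values chosen for illustration; the paper does not list them in this section.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard convolution output-size formula for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def compressed_shape(shape, out_channels, stride=2):
    """Shape after a 1x1 conv (channels only) followed by a 3x3 conv (spatial only)."""
    _, height, width = shape
    # The 1x1 convolution changes the channel count and leaves H, W untouched.
    # The 3x3 convolution with padding 1 and the given stride shrinks H and W.
    return (out_channels,
            conv_out_size(height, 3, stride, 1),
            conv_out_size(width, 3, stride, 1))


def payload_bits(shape, bits_per_value=8):
    channels, height, width = shape
    return channels * height * width * bits_per_value


original = (384, 248, 216)              # hypothetical concatenated backbone output
reduced = compressed_shape(original, out_channels=64)
ratio = payload_bits(original) / payload_bits(reduced)
print(reduced, ratio)
```

Under these assumed numbers, the channel reduction contributes a factor of 6 and the stride-2 downscaling a factor of 4, for a combined 24-fold smaller feature map.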

For joint camera-LiDAR object detection, we implement a result-level fusion strategy to integrate the detection results from YOLOv3 and PointPillars. Since YOLOv3 provides 2-D bounding boxes, while PointPillars generates 3-D detections, we first project the 3-D bounding boxes onto the image plane using the LiDAR-to-camera transformation matrix to ensure spatial alignment. After projection, we apply a voting-based fusion method, where each network assigns a confidence score as a vote for the detected class. The class with the highest cumulative confidence score is selected as the final classification result. This fusion approach effectively leverages LiDAR’s geometric accuracy and the rich semantic features of images, improving detection robustness, especially in challenging scenarios, such as occlusions or poor lighting conditions.
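The projection and voting steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 3 × 4 projection matrix `P`, the use of an IoU threshold to associate camera and LiDAR boxes, and all function names are assumptions, since the text specifies only that 3-D boxes are projected and that confidence scores are summed per class.

```python
def project_point(P, xyz):
    """Project a 3-D LiDAR point with a 3x4 LiDAR-to-image matrix P."""
    x, y, z = xyz
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0:                      # point is behind the camera
        return None
    return u / w, v / w


def project_box(P, corners):
    """Axis-aligned 2-D box enclosing the eight projected 3-D box corners."""
    points = [project_point(P, c) for c in corners]
    if any(p is None for p in points):
        return None
    us, vs = zip(*points)
    return min(us), min(vs), max(us), max(vs)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(cam_dets, lidar_dets, P, iou_thr=0.5):
    """cam_dets: (box2d, cls, score); lidar_dets: (corners3d, cls, score)."""
    fused = []
    for corners, l_cls, l_score in lidar_dets:
        box = project_box(P, corners)
        if box is None:
            continue
        votes = {l_cls: l_score}                      # LiDAR vote
        for c_box, c_cls, c_score in cam_dets:
            if iou(box, c_box) >= iou_thr:            # assumed association rule
                votes[c_cls] = votes.get(c_cls, 0.0) + c_score
        best = max(votes, key=votes.get)              # highest cumulative confidence
        fused.append((box, best, votes[best]))
    return fused


# Toy calibration and detections (hypothetical values).
P = [[100, 0, 50, 0], [0, 100, 50, 0], [0, 0, 1, 0]]
corners = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (9, 11)]
lidar_dets = [(corners, "cyclist", 0.4)]
cam_dets = [((40, 40, 60, 60), "pedestrian", 0.7)]
print(fuse(cam_dets, lidar_dets, P))
```

In this toy case the camera's 0.7 vote for "pedestrian" outweighs the LiDAR's 0.4 vote for "cyclist", so the fused label follows the camera while the box comes from the projected LiDAR detection.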

All algorithms deployed on hardware have undergone quantization-aware training (QAT) [^48]. QAT involves simulating the effects of quantization (e.g., reducing the bit-width of weights and activations) during the forward pass, allowing the network to adapt to quantization-induced errors during training. This method is particularly effective in minimizing the accuracy loss that typically occurs with post-training quantization. The quantization was implemented using the MQBench framework [^49], a powerful tool for model quantization that supports multiple deep learning platforms. MQBench provides a unified interface for quantization and allows users to simulate various bit-width configurations to optimize the model’s deployment on hardware. In our implementation, YOLOv3-tiny was quantized using W8A8 (8-bit weights and activations), while PointPillars [^23] was quantized using W4A8 (4-bit weights and 8-bit activations) to achieve a better balance between efficiency and performance.
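The quantization effect that QAT simulates in the forward pass can be illustrated with a generic quantize-dequantize ("fake quantization") routine. The sketch below is not MQBench code and does not use its API; the symmetric per-tensor scheme, the function name, and the sample weights are assumptions chosen to contrast 8-bit and 4-bit weights.

```python
def fake_quantize(values, num_bits):
    """Symmetric per-tensor quantize-dequantize, as simulated during a QAT forward pass.

    Returns (dequantized_values, scale). In training, gradients are commonly
    passed through the rounding step unchanged (straight-through estimator);
    that backward pass is not modelled here.
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values), 1.0
    scale = max_abs / qmax
    dequantized = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))   # integer code
        dequantized.append(q * scale)                 # value the network actually sees
    return dequantized, scale


def max_error(original, approximated):
    return max(abs(a - b) for a, b in zip(original, approximated))


weights = [-0.9, -0.31, 0.0, 0.12, 0.47, 0.9]     # hypothetical weight values
w8, scale8 = fake_quantize(weights, 8)
w4, scale4 = fake_quantize(weights, 4)
print(max_error(weights, w8), max_error(weights, w4))
```

The 4-bit version has a much coarser step size than the 8-bit version, which is the kind of error the network learns to tolerate when it is trained with the quantizer in the loop.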

### F. Scalability Considerations for Large-Scale VICAD Deployments

Scalability is a crucial factor in VICAD systems as the number of sensors and data throughput requirements increase. Our system is designed with three potential scalability strategies to accommodate larger sensor deployments and higher data rates.

1. *Multidevice Deployment:* Due to our low-cost and low-latency hardware design, multiple units can be deployed at intersections or complex road environments to handle an increased number of cameras and LiDAR sensors in real-time. This modular approach allows flexible scaling without significantly increasing infrastructure costs.
2. *Serial Processing:* We currently employ a sequential processing strategy, where multiple camera inputs are processed one after another through our NPU, ensuring efficient use of computing resources without requiring hardware modifications.
3. *Hardware Upgrade:* For future NPU upgrades, we plan to introduce a batch-processing mode, which will enhance parallelism by increasing the number of MAC units in the pipeline as shown in Fig. 8. This upgrade will enable simultaneous processing of multiple sensor inputs while maintaining efficient on-chip memory usage, as weight parameters are shared across different sensor streams.

![Figure 11](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang8-3583443-large.gif)

*Fig. 8. Future NPU upgrade modification diagram. The preprocessing module can be expanded with minimal additional resource consumption. The NPU’s MAC pipeline will be augmented with additional multipliers and accumulators to support concurrent computations. Given that our FPGA board has sufficient multiplier resources and that the network weights can be shared across sensors, on-chip memory requirements remain manageable.*

Regarding data throughput, our V2X module supports a maximum transmission of 137 792 bits per frame. If throughput demands exceed this limit, we will adopt data compression techniques such as the following:

1. *Feature Map Compression:* Reducing intermediate feature map size using autoencoders or low-bit quantization before transmission, followed by reconstruction at the receiving end.
2. *Lossy Compression:* Employing techniques such as principal component analysis (PCA) to retain critical data features while reducing overall bandwidth consumption.
3. *Sparse Representation:* Transmitting only nonzero activations from feature maps to minimize redundant information.
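As an illustration of the sparse-representation option and of checking a payload against the per-frame limit, the following sketch encodes only nonzero activations as (index, value) pairs. The encoding format, the 8-bit value width, and the function names are assumptions; only the 137 792-bit frame size comes from the text.

```python
FRAME_BITS = 137_792   # maximum bits per frame reported for the V2X module


def sparse_encode(feature_map):
    """Keep only (index, value) pairs for nonzero entries of a flattened feature map."""
    return [(i, v) for i, v in enumerate(feature_map) if v != 0]


def sparse_decode(pairs, length):
    """Rebuild the dense feature map at the receiver."""
    dense = [0] * length
    for i, v in pairs:
        dense[i] = v
    return dense


def encoded_bits(pairs, length, value_bits=8):
    """Payload size when each pair stores a fixed-width index plus a value."""
    index_bits = max(1, (length - 1).bit_length())
    return len(pairs) * (index_bits + value_bits)


def frames_needed(total_bits, frame_bits=FRAME_BITS):
    """Number of frames required, rounding up."""
    return -(-total_bits // frame_bits)


feature_map = [0, 0, 5, 0, 0, 0, 3, 0]            # toy activations after ReLU
pairs = sparse_encode(feature_map)
dense_bits = len(feature_map) * 8
sparse_bits = encoded_bits(pairs, len(feature_map))
print(pairs, dense_bits, sparse_bits, frames_needed(sparse_bits))
```

The index overhead means this format only pays off when the feature map is sufficiently sparse; for dense maps, the feature-map or lossy compression options listed above would be the better fit.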

By integrating these strategies, our system remains scalable and adaptable to future expansions, ensuring efficient perception, computation, and communication even in large-scale VICAD deployments with high sensor density and data traffic.

## SECTION IV. Experimental Setup

### A. Data Collection

The camera-based intersection dataset was collected in Chengdu. Our equipment was installed adjacent to the traffic lights at the intersection, at a height of approximately 5 m. The device enclosure is made of CNC-machined aluminum alloy, offering excellent heat dissipation and waterproof capabilities. It is rated IP68 [^50] for water and dust resistance, ensuring protection in harsh environments. All external interfaces utilize aviation-grade connectors [^51] for enhanced durability. The overall appearance of the device and its actual deployment location are shown in Fig. 9. The distance from the device to the far end of the visible road, around the corner, was approximately 200 m. In this real-world setting, we collected a dataset specifically for this experiment. Details about the dataset can be found in Section IV-B. Additionally, we used a SinoGNSS GPS RTK receiver to map the road, obtaining correspondences between the GPS points and the points in the images captured by our system. Using the Perspective-n-Point (PnP) algorithm [^52], we calculated the camera’s extrinsic parameters. With MATLAB’s toolbox [^53] and a chessboard pattern, we derived the camera’s intrinsic parameters. This calibration of both the intrinsic and extrinsic parameters allows us to convert pixel coordinates into GPS coordinates after object detection, enabling the detected objects to be displayed on the map for the vehicle.

![Figure 12](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang9-3583443-large.gif)

*Fig. 9. System deployment and field testing.*
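Once intrinsics and extrinsics are known, converting a detected pixel to a position on the road amounts to intersecting the camera ray with the ground plane. The sketch below assumes a pinhole model without lens distortion or skew, a flat ground plane at Z = 0 in a local metric world frame, and extrinsics `R`, `t` that map world to camera coordinates; the final conversion from local metres to GPS latitude and longitude is omitted. All names and numbers are illustrative.

```python
def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def world_to_pixel(K, R, t, point_world):
    """Pinhole projection of a world point: p_cam = R * p_world + t."""
    pc = [a + b for a, b in zip(mat_vec(R, point_world), t)]
    return K[0][0] * pc[0] / pc[2] + K[0][2], K[1][1] * pc[1] / pc[2] + K[1][2]


def pixel_to_ground(K, R, t, u, v):
    """Intersect the viewing ray of pixel (u, v) with the ground plane Z = 0."""
    ray_cam = [(u - K[0][2]) / K[0][0], (v - K[1][2]) / K[1][1], 1.0]
    r_t = transpose(R)
    ray_world = mat_vec(r_t, ray_cam)             # ray direction in world frame
    center = [-c for c in mat_vec(r_t, t)]        # camera position in world frame
    if abs(ray_world[2]) < 1e-12:
        raise ValueError("ray is parallel to the ground plane")
    s = -center[2] / ray_world[2]
    if s <= 0:
        raise ValueError("ground intersection is behind the camera")
    return center[0] + s * ray_world[0], center[1] + s * ray_world[1]


# Hypothetical setup: camera 5 m above the origin, looking straight down.
K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
t = [0.0, 0.0, 5.0]
u, v = world_to_pixel(K, R, t, [1.0, -1.0, 0.0])
print((u, v), pixel_to_ground(K, R, t, u, v))
```

Projecting a known ground point and mapping the resulting pixel back recovers the original coordinates, which is a convenient sanity check for a calibration.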

### B. Datasets

This section describes the three datasets used in our experiments.

The first dataset was collected and annotated by us, as described in Section IV-A. It contains a total of 6774 images, with 5497 images in the training set and 1277 in the validation set. The dataset is labeled with five categories: 1) car; 2) bus; 3) truck; 4) pedestrian; and 5) bicycle. The distribution of data across these categories is illustrated in Fig. 10.

![Figure 13](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang10-3583443-large.gif)

*Fig. 10. Category distribution in the intersection dataset.*

The second dataset used is the KITTI dataset [^5], which is widely recognized in the field of autonomous driving research and provides comprehensive data for object detection and 3-D point cloud analysis. It provides diverse real-world driving scenarios captured by a vehicle outfitted with cameras, LiDAR, GPS, and IMU sensors. The dataset includes over 14 000 labeled images and over 80 000 LiDAR point clouds, covering urban, rural, and highway environments. Its rich annotations, including 3-D bounding boxes for vehicles, pedestrians, and cyclists, make it invaluable for developing and evaluating perception algorithms, particularly in sensor fusion and 3-D scene understanding.

The third dataset is the DAIR-V2X-I dataset [^6] for roadside perception. It includes 10 084 image frames from roadside cameras and 10 084 point cloud frames from LiDAR, split into training, evaluation, and test sets in a 50%, 20%, and 30% ratio. The dataset supports the development and evaluation of cooperative perception algorithms by providing synchronized sensor recordings from various viewpoints, addressing issues like occlusions and sensor blind spots.

### C. Experimental Configuration

We conducted an evaluation of the point cloud packet size setting on the KITTI dataset. This evaluation assumes that *N* consecutive points within a single scan cycle form one packet. We assess whether points from *k* adjacent packets share the same pillar. If points from packets $j$ and $j+i$ are mapped to the same pillar, but no points from the intermediate packets belong to this pillar, the points from packet $j+i$ will be discarded during processing. As *k* approaches infinity, we reach the conclusion illustrated in Fig. 11: for the 122 637 points collected during a single LiDAR scan cycle, the number of lost points gradually decreases as *N*, the number of points per data packet, increases. After considering both hardware resource limitations and algorithmic accuracy, we set *N* to 128 for our experiments.

![Figure 14](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang11-3583443-large.gif)

*Fig. 11. Missed Points and Proportion of Total Points in KITTI.*
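The packet-size evaluation can be mimicked with a short counting routine. The sketch below reflects one reading of the discard rule, stated here as an assumption: a pillar stays open while consecutive packets keep contributing points to it, and once a packet passes without touching that pillar, any later point mapped to it is counted as lost. The function name and the toy pillar sequence are hypothetical.

```python
def count_lost_points(pillar_ids, packet_size):
    """Count points discarded because their pillar was last touched more than one packet ago.

    pillar_ids:  pillar index of each point, in arrival order within one scan
    packet_size: N, the number of consecutive points per packet
    """
    last_packet = {}          # pillar id -> index of the last packet that updated it
    lost = 0
    for idx, pillar in enumerate(pillar_ids):
        packet = idx // packet_size
        previous = last_packet.get(pillar)
        if previous is not None and packet - previous > 1:
            lost += 1         # an intermediate packet skipped this pillar: discard
            continue
        last_packet[pillar] = packet
    return lost


arrival_order = [0, 1, 0, 1, 2, 0]     # toy sequence of pillar ids
for n in (1, 2, 6):
    print(n, count_lost_points(arrival_order, n))
```

Larger packets group more points from the same pillar together, so fewer gaps appear and fewer points are dropped, which matches the trend described for Fig. 11.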

In our experiments, the camera operates at a capture frequency of 30 Hz, with an image resolution of $1920\times 1080$ pixels. After cropping, the image size is reduced to $600\times 600$ and then resized to $512\times 512$ to fit the input requirements for subsequent processing. For the communication module, the configuration of the C-V2X protocol on the FPGA is detailed in Table III.

![Figure 15](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t3-3583443-large.gif)

*TABLE III*

We trained all of the algorithms on two NVIDIA RTX 4090 GPUs. For YOLOv3 and YOLOv3-tiny, we used the Adam optimizer with an initial learning rate of 0.01, which decayed to $1\times 10^{-4}$ over 300 epochs, with a batch size of 32. PointPillars was trained using Adam with an initial learning rate of 0.001, 300 epochs, and a batch size of 8. The QAT experiments were applied with MQBench, where YOLOv3 was fine-tuned with a learning rate of $1\times 10^{-4}$, and PointPillars used $3\times 10^{-4}$, both trained for an additional 50 epochs. For comparison, other models trained on the DAIR-V2X-I dataset followed the parameter settings of the MMDetection3D framework to ensure consistency.

## SECTION V. Evaluation Results

### A. Camera-Based Algorithm Hardware Deployment

In this experiment, we used YOLOv3-tiny as the detection network. We compared the detection performance of the model using W8A8 quantization with full precision (F32), as shown in Table V. It can be observed that there is a performance drop with the quantized model compared to the F32 version. Specifically, since the number of car instances is higher and the size of the vehicle targets is relatively larger than pedestrians, the performance drop for vehicles is smaller, with a reduction of 3.4%. In contrast, for pedestrians, the performance drop is 5.1%. The qualitative performance can be seen in Fig. 12.

![Figure 16](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t4-3583443-large.gif)

*TABLE IV*

![Figure 17](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t5-3583443-large.gif)

*TABLE V*

![Figure 18](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang12-3583443-large.gif)

*Fig. 12. Object detection results in the intersection dataset.*

The hardware performance comparison presented in Table IV highlights the efficiency and resource utilization of the accelerators. For the YOLOv3-tiny accelerators, the implementation in this work on the XCZU15EG platform significantly improves throughput, achieving 244.92 GOPS, compared to 10.45 GOPS and 31.50 GOPS from previous designs [^12], [^54]. This represents a $23.4{\times }$ improvement over [^12] and a $7.8{\times }$ improvement over [^54], demonstrating the effectiveness of our architecture in accelerating neural network inference.

This gain in performance comes with increased resource utilization. The proposed accelerator consumes 1024 DSPs, which is $6.4{\times }$ more than [^12] and $4.2{\times }$ more than [^54]. Similarly, the utilization of LUTs (169.8 k) and BRAMs (428) is significantly higher, reflecting the increased architectural complexity required to meet real-time processing needs. However, despite this increased usage, the DSP efficiency of our design reaches 0.24 GOPS/DSP, surpassing 0.07 GOPS/DSP and 0.13 GOPS/DSP in prior works, indicating better utilization of computational resources.

Moreover, our accelerator achieves an end-to-end latency of only 34.28 ms, which is $15.5{\times }$ lower than [^12] and $3.5{\times }$ lower than [^54]. This substantial reduction in latency enables real-time execution, a critical requirement for autonomous driving applications. Additionally, the power efficiency of our system is significantly improved, reaching 21.48 GOPS/W, compared to 2.03 GOPS/W and 7.4 GOPS/W in prior designs, making it $10.6{\times }$ and $2.9{\times }$ more power-efficient, respectively.
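The ratios quoted above follow directly from the reported figures, as the short check below shows. Only numbers stated in the text are used; the implied power value is a derived quantity, not one reported for Table IV.

```python
# Figures quoted in the text for the YOLOv3-tiny accelerator comparison.
ours_gops = 244.92
ours_dsps = 1024
ours_gops_per_watt = 21.48
baseline_gops = {"[12]": 10.45, "[54]": 31.50}
baseline_gops_per_watt = {"[12]": 2.03, "[54]": 7.4}

# Throughput improvement over each prior design.
throughput_gain = {ref: ours_gops / gops for ref, gops in baseline_gops.items()}

# DSP efficiency: throughput per DSP slice.
dsp_efficiency = ours_gops / ours_dsps

# Power-efficiency improvement over each prior design.
power_eff_gain = {ref: ours_gops_per_watt / eff
                  for ref, eff in baseline_gops_per_watt.items()}

# Power implied by throughput and power efficiency (derived, in watts).
implied_power_w = ours_gops / ours_gops_per_watt

for ref in baseline_gops:
    print(ref, round(throughput_gain[ref], 1), round(power_eff_gain[ref], 1))
print(round(dsp_efficiency, 2), round(implied_power_w, 2))
```

The implied power of roughly 11.4 W is close to the 11.44 W reported for the NPU in the power breakdown later in this section.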

Overall, the proposed accelerator achieves superior performance, with significantly reduced inference time and better hardware efficiency, making it well-suited for high-performance, real-time perception tasks in I-RSUs.

### B. LiDAR-Based Algorithm Hardware Deployment

The second scenario involves LiDAR-based object detection, where we employed the PointPillars [^23] architecture. A comparison between the accuracy of the original PointPillars network and our optimized deployment is presented in Table VI, which shows a minor reduction in accuracy across different difficulty levels: Easy, Moderate, and Hard, as measured by 2-D detection, average orientation similarity (AOS), Bird’s Eye View (BEV) detection, and 3-D detection metrics. This accuracy reduction is primarily attributed to the optimizations made in our system to enhance real-time processing efficiency and reduce resource consumption introduced in Section III-E. Specifically, our design utilizes a lower-bit quantization scheme (W4A8) and an optimized feature extraction pipeline to minimize computation overhead. These modifications lead to a slight decrease in accuracy, with the largest observed reduction occurring in 3-D detection under Hard difficulty (−3.08%) and BEV detection under Moderate difficulty (−4.35%). However, despite these reductions, our system maintains competitive detection performance while significantly improving computational efficiency and reducing latency. In real-world applications, the tradeoff between accuracy and efficiency is particularly important for scenarios requiring rapid response, such as urban intersections with high pedestrian activity and dynamic traffic conditions. The slight accuracy degradation is outweighed by the system’s ability to process LiDAR data in real-time, ensuring timely perception updates for cooperative autonomous driving. This balance between detection performance and system efficiency highlights the suitability of our approach for latency-sensitive environments. Fig. 13 visually compares the 2-D detection results of the modified structure and quantized PointPillars to the original PointPillars.

![Figure 19](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t6-3583443-large.gif)

*TABLE VI*

![Figure 20](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang13-3583443-large.gif)

*Fig. 13. 2-D Object detection visualized comparison results in KITTI. The two left pictures are the results of the original PointPillars and the right pictures are the results of our work.*

As for the hardware implementation, Table IV highlights the performance comparison between our PFN accelerator and other implementations. Our design significantly reduces FPGA resource usage while maintaining computational efficiency. Notably, unlike the PFN hardware modules in [^39] and [^55], which rely on external memory for storing point cloud data, our approach performs all computations on-chip, eliminating the need for off-chip memory. This ensures that all resources accounted for are contained within the synthesis results.

Compared to [^55], our design reduces DSP usage by approximately 15% while achieving a throughput of 204.8 GOPS, which is ten times higher than [^39] and comparable to other state-of-the-art implementations. Moreover, our system achieves a DSP efficiency of 0.8 GOPS/DSP, significantly surpassing [^39] and [^55], demonstrating our ability to maximize processing power while minimizing computational overhead. Our Pillar Generator module further reduces logic usage by 9.7 k units by avoiding unnecessary interaction with external memory, leading to better resource allocation. While our BRAM usage is higher than in some previous designs (114 versus 10.5 in [^55]), this tradeoff is necessary to enable on-chip buffering of point cloud data, which streamlines memory access and minimizes latency during real-time processing. Additionally, the power efficiency of the PFN processing module reaches 211.13 GOPS/W, which is $45{\times }$ higher than [^55].

Moreover, as shown in Table VII, our system processes encoding and voxelization together in 0.48 ms, which is higher than [^55] (0.05 ms) but lower than [^39] (4.09 ms) and [^56] (71.93 ms); it remains within real-time constraints and balances memory efficiency with processing speed. Our system achieves a backbone processing latency of 7.5 ms, which is slightly higher than the 6.18 ms in [^55] but significantly lower than the 33.87 ms in [^39]. The increased latency in the backbone stage is offset by the overall system efficiency: our integrated design ensures that data flows smoothly through the pipeline without frequent external memory accesses, reducing potential bottlenecks. This positions our system as a robust solution for low-latency, LiDAR-based object detection in autonomous driving applications.

![Figure 21](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t7-3583443-large.gif)

*TABLE VII*

### C. Comparison With Existing VICAD Systems

The latency comparison presented in Table VIII evaluates the data transmission delays in various VICAD systems, specifically considering two major latency components: 1) the transmission delay from roadside sensors (e.g., cameras and LiDAR) to the computing device and 2) the latency from the RSU to vehicles. The detection algorithm’s processing time is not included in this comparison since each referenced work employs different detection models with varying computational complexities.

![Figure 22](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t8-3583443-large.gif)

*TABLE VIII*

For the sensor-to-computing device latency, our system achieves zero transmission delay, which is a major improvement over other implementations. This is because our system eliminates the need for external data transmission by utilizing a shared DDR memory for seamless data exchange between perception, computation, and communication modules. In contrast, EdgeCooper [^57] incurs a significantly higher delay of 75 ms, as it transmits raw point cloud data to edge devices for further processing. Similarly, [^7] and [^9] report latencies of 20 ms and 14 ms, respectively, due to data transmission overheads.

For the RSU-to-vehicle latency, our system maintains a low delay of 6 ms, which is $6.7{\times }$ lower than [^7] and $33{\times }$ lower than [^9]. The significantly higher latency in [^9] (198 ms) is primarily attributed to its reliance on a public online broker and 4G modems for internet communication, which introduces substantial network delays. In contrast, [^7] employs the Cohda MK5 Wireless RSU, a commercial communication module that reduces transmission latency to 40 ms. Reference [^57] achieves a notably low latency of 5 ms by optimizing communication pipelines, though its sensor-to-computing delay remains high.
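Summing the two latency components quoted above gives a rough view of the combined transmission delay per system. The sketch uses only the figures stated in the text; detection time is excluded, as in Table VIII.

```python
# (sensor-to-computing delay, RSU-to-vehicle delay) in milliseconds, as quoted in the text.
latency_ms = {
    "Ours": (0, 6),
    "EdgeCooper [57]": (75, 5),
    "[7]": (20, 40),
    "[9]": (14, 198),
}

# Combined transmission delay per system.
total_ms = {name: sensor + v2x for name, (sensor, v2x) in latency_ms.items()}

# RSU-to-vehicle ratio relative to our 6 ms link.
ours_v2x = latency_ms["Ours"][1]
v2x_ratio = {name: v2x / ours_v2x
             for name, (_, v2x) in latency_ms.items() if name != "Ours"}

for name, total in sorted(total_ms.items(), key=lambda item: item[1]):
    print(f"{name}: {total} ms")
print({name: round(r, 1) for name, r in v2x_ratio.items()})
```

Although EdgeCooper reports a slightly shorter RSU-to-vehicle link (5 ms versus 6 ms), its 75 ms sensor-to-computing delay dominates its combined figure.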

Overall, our system effectively minimizes latency at both stages, demonstrating superior efficiency in real-time roadside perception and cooperative vehicle-infrastructure communication. The integration of a shared memory mechanism significantly enhances system responsiveness, making it well-suited for safety-critical applications in autonomous driving.

The power consumption of our system is detailed in Table X, with a total operating power of 28.40 W. The NPU consumes 11.44 W (40.3%), primarily for deep learning inference on FPGA. The C-V2X module consumes 6.33 W (22.3%), handling real-time data transmission. The peripheral circuits, including power amplifiers, 5G modules, GNSS receivers, camera modules, and RF circuits, consume 10.63 W (37.4%). While the system maintains efficient power usage, future optimizations, such as power gating techniques and transmission efficiency improvements, will further reduce energy consumption for roadside deployment.

![Figure 23](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t9-3583443-large.gif)

*TABLE IX*

![Figure 24](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t10-3583443-large.gif)

*TABLE X*

### D. Algorithm Quantization on KITTI Dataset

In this experiment, we evaluated the fusion of YOLOv3 [^34] and our modified PointPillars [^23] architecture on the KITTI dataset, as measured by the 2-D detection metric. As shown in Table IX, the fusion-based algorithm significantly enhances overall detection performance compared to using either model individually. The impact of quantization varies across object categories: larger objects like cars experience minimal performance degradation, while smaller objects, such as pedestrians and cyclists, are more affected. Specifically, for PointPillars, the quantized model exhibits a 4.2% drop in cyclist detection accuracy and a 6.2% drop in pedestrian detection accuracy, compared to a smaller 1.1% reduction for cars. This degradation can be attributed to the sensitivity of small-object features to lower numerical precision in quantized networks. Similarly, YOLOv3 experiences a 1.4% accuracy drop for pedestrians and a 0.7% drop for cyclists, while maintaining stable car detection accuracy. Despite these reductions, the fusion model demonstrates robustness by mitigating the accuracy loss. For example, the fusion approach limits the quantized PointPillars model’s 6.2% drop for pedestrians to 1.7%, demonstrating its effectiveness in preserving detection accuracy. It also achieves a mAP of 86.7%, only 0.9% lower than the full-precision version, while significantly outperforming the individual quantized models. These results indicate that fusion compensates for quantization-induced losses, particularly for small objects. The qualitative detection results of the three algorithms are shown in Fig. 14.

![Figure 25](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang14-3583443-large.gif)

*Fig. 14. Object detection results in KITTI.*

In summary, the fusion of YOLOv3 [^34] and PointPillars [^23] offers a balanced tradeoff between performance and efficiency, with the combined model demonstrating superior detection accuracy and robustness, especially in challenging real-world conditions. Furthermore, the successful application of quantization techniques to both models ensures that the system remains efficient, making it suitable for deployment in resource-constrained environments, such as roadside infrastructure.

### E. Performance Evaluation on DAIR-V2X-I Dataset

Table XI presents the performance evaluation on the DAIR-V2X-I dataset, comparing our fusion approach with state-of-the-art methods utilizing different modalities.

![Figure 26](https://ieeexplore.ieee.org/mediastore/IEEE/content/media/6488907/11134518/11052245/jiang.t11-3583443-large.gif)

*TABLE XI*

For 2-D object detection, the fusion method of YOLOv3 and PointPillars achieves the highest accuracy across all categories, particularly excelling in the Car category with a detection accuracy of 95.10%, significantly surpassing BEVHeight (81.95%) and MVX-Net (21.39%). This demonstrates that our fusion approach effectively integrates image and LiDAR information, enhancing 2-D detection accuracy. Additionally, for Cyclist and Pedestrian categories, our method achieves 88.30% and 83.60%, respectively, outperforming BEVHeight (84.97% and 62.70%) and showing strong capability in detecting smaller objects.

For 3-D and BEV object detection, our method achieves performance comparable to SECOND in the Car category but shows limitations in the Cyclist and Pedestrian categories. Specifically, for Cyclist detection, SECOND achieves 39.06% 3-D accuracy, while our method attains 30.35%, and for Pedestrian, SECOND reaches 53.98%, outperforming our 38.71%. This performance gap is mainly due to the difficulty in detecting small and distant objects using LiDAR, as well as the limited resolution of YOLOv3 in handling such targets. In comparison, BEVHeight, which relies solely on image features, achieves relatively high performance in the Car category (77.05% in 3-D and 84.08% in BEV) but struggles with Cyclist (15.15% in 3-D, 18.97% in BEV) and Pedestrian (3.90% in 3-D, 5.30% in BEV). This highlights the limitation of image-only methods in accurately estimating depth, especially for smaller objects.

Despite these limitations in 3-D and BEV detection, our strong 2-D detection performance ensures reliable perception results, which are critical for V2X-based object localization and communication. Future improvements will focus on refining LiDAR feature extraction and incorporating multiscale fusion strategies to enhance 3-D detection accuracy, particularly for small and distant objects.

## SECTION VI. Conclusion

In this research, we introduced a comprehensive roadside unit platform that integrates perception, computing, and communication capabilities, specifically designed for cooperative autonomous driving applications. The platform utilizes an FPGA-based architecture equipped with dual NPUs to efficiently process image and LiDAR data, while a C-V2X communication module facilitates low-latency data exchange. Our experiments demonstrated the effectiveness of the platform in real-world roadside object detection and fusion-based perception scenarios.

The camera-based object detection system achieved an end-to-end latency of under 100 ms, with NPU-based processing taking 34 ms. LiDAR-based detection maintained performance comparable to PointPillars while optimizing FPGA resource usage. Our design improved DSP utilization by 20% over state-of-the-art methods, and the voxelization module demonstrated a fourfold speedup over PS-side implementations. The fusion of YOLOv3 and PointPillars achieved 87.6% mAP on KITTI, demonstrating robustness in complex environments.

To further validate our approach in realistic roadside scenarios, we evaluated it on the DAIR-V2X-I dataset. Our fusion-based method excelled in 2-D detection, achieving 95.10% mAP for cars, 88.30% for cyclists, and 83.60% for pedestrians, surpassing image-only and point cloud-based methods. However, its 3-D and BEV detection performance lagged behind LiDAR-exclusive approaches like SECOND, particularly for small and distant objects, highlighting the need for enhanced multiscale fusion and LiDAR feature extraction.

Additionally, our system demonstrated superior scalability and efficiency compared to existing VICAD roadside units. By integrating perception, computation, and communication within a single FPGA-based platform, we significantly reduced interdevice latency, achieving an RSU-to-vehicle transmission delay of only 6 ms, substantially lower than prior VICAD implementations. Our hardware-optimized approach also minimizes redundant data transmission, further enhancing real-time performance.
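As a rough illustration of how the latency figures above relate to one another, the following sketch bounds the time left for the non-NPU stages of the camera pipeline. It uses only the three figures quoted in this section. Whether the 6 ms RSU-to-vehicle delay is counted inside the sub-100 ms end-to-end figure is not stated here, so that step is marked as an assumption, and all variable names are illustrative.

```python
# Figures quoted in the conclusion (milliseconds).
END_TO_END_BOUND_MS = 100.0  # camera pipeline: "under 100 ms"
NPU_MS = 34.0                # NPU-based processing
RSU_TO_VEHICLE_MS = 6.0      # RSU-to-vehicle transmission delay

# Upper bound on everything in the camera pipeline that is not NPU inference.
non_npu_bound_ms = END_TO_END_BOUND_MS - NPU_MS

# ASSUMPTION: the 6 ms link delay is counted inside the end-to-end figure.
# The text does not say so; drop this term if it is measured separately.
other_stages_bound_ms = non_npu_bound_ms - RSU_TO_VEHICLE_MS

# Because the total is below 100 ms, the NPU share is at least this fraction.
npu_share_lower_bound = NPU_MS / END_TO_END_BOUND_MS

print(f"non-NPU stages:         < {non_npu_bound_ms:.0f} ms")
print(f"excluding link (assumed): < {other_stages_bound_ms:.0f} ms")
print(f"NPU share of latency:   > {npu_share_lower_bound:.0%}")
```

Under these stated assumptions, capture, pre-processing, post-processing, and message handling together would have to fit in less than roughly 60 ms.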

Future work will involve broadening the system’s capability to interface with additional sensors and further optimizing the neural network models to improve performance across varied driving environments. We also plan to enhance our NPU architecture to support batch processing and higher throughput, enabling real-time multisensor fusion at larger intersections. Additionally, we will explore adaptive feature compression and V2X message encoding optimizations to address data transmission constraints in large-scale deployments.

## References

[^1]: H. Fan, “Baidu Apollo EM motion planner,” 2018, arXiv:1807.08048. [Google Scholar](https://scholar.google.com/scholar?as_q=Baidu+Apollo+EM+motion+planner&as_occt=title&hl=en&as_sdt=0%2C31)

[^2]: H. Kanoshima and H. Hatakenaka, “Development of next-generation road services by public and private joint research,” in Proc. 8th Int. Conf. ITS Telecommun., 2008, pp. 404–407. [IEEE](https://ieeexplore.ieee.org/document/4740295) [Google Scholar](https://scholar.google.com/scholar?as_q=Development+of+next-generation+road+services+by+public+and+private+joint+research&as_occt=title&hl=en&as_sdt=0%2C31)

[^3]: A. Lupinska-Dubicka, “In-car eCall device for automatic accident detection, passengers counting and alarming,” in Transactions on Computational Science XXXV, M. L. Gavrilova, C. J. K. Tan, K. Saeed, and N. Chaki, Eds. Heidelberg, Germany: Springer, 2020, pp. 36–57. [DOI](https://doi.org/10.1007/978-3-662-61092-3_3) [Google Scholar](https://scholar.google.com/scholar?as_q=In-car+eCall+device+for+automatic+accident+detection%2C+passengers+counting+and+alarming&as_occt=title&hl=en&as_sdt=0%2C31)

[^4]: S. Mokhtarimousavi, “A time of day analysis of pedestrian-involved crashes in California: Investigation of injury severity, a logistic regression and machine learning approach using HSIS data,” Inst. Transp. Engineers. J., vol. 89, no. 10, pp. 25–33, 2019. [Google Scholar](https://scholar.google.com/scholar?as_q=A+time+of+day+analysis+of+pedestrian-involved+crashes+in+California%3A+Investigation+of+injury+severity%2C+a+logistic+regression+and+machine+learning+approach+using+HSIS+data&as_occt=title&hl=en&as_sdt=0%2C31)

[^5]: A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3354–3361. [IEEE](https://ieeexplore.ieee.org/document/6248074) [Google Scholar](https://scholar.google.com/scholar?as_q=Are+we+ready+for+autonomous+driving%3F+The+KITTI+vision+benchmark+suite&as_occt=title&hl=en&as_sdt=0%2C31)

[^6]: H. Yu, “DAIR-V2X: A large-scale dataset for vehicle-infrastructure cooperative 3D object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 21361–21370. [IEEE](https://ieeexplore.ieee.org/document/9879243) [Google Scholar](https://scholar.google.com/scholar?as_q=DAIR-V2X%3A+A+large-scale+dataset+for+vehicle-infrastructure+cooperative+3D+object+detection&as_occt=title&hl=en&as_sdt=0%2C31)

[^7]: R. Zhang, Z. Zou, S. Shen, and H. X. Liu, “Design, implementation, and evaluation of a roadside cooperative perception system,” Transp. Res. Rec., vol. 18, no. 5, pp. 273–284, 2022. [DOI](https://doi.org/10.1177/03611981221092402) [Google Scholar](https://scholar.google.com/scholar?as_q=Design%2C+implementation%2C+and+evaluation+of+a+roadside+cooperative+perception+system&as_occt=title&hl=en&as_sdt=0%2C31)

[^8]: C. Xiang, “Multi-sensor fusion algorithm in cooperative vehicle-infrastructure system for blind spot warning,” Int. J. Distrib. Sens. Netw., vol. 18, no. 5, 2022, Art. no. 15501329221100412. [DOI](https://doi.org/10.1177/15501329221100412) [Google Scholar](https://scholar.google.com/scholar?as_q=Multi-sensor+fusion+algorithm+in+cooperative+vehicle-infrastructure+system+for+blind+spot+warning&as_occt=title&hl=en&as_sdt=0%2C31)

[^9]: D. Vignarca, M. Vignati, S. Arrigoni, and E. Sabbioni, “Infrastructure-based vehicle Localization through camera calibration for I2V communication warning,” Sensors, vol. 23, no. 16, p. 7136, 2023. [DOI](https://doi.org/10.3390/s23167136) [Google Scholar](https://scholar.google.com/scholar?as_q=Infrastructure-based+vehicle+Localization+through+camera+calibration+for+I2V+communication+warning&as_occt=title&hl=en&as_sdt=0%2C31)

[^10]: M. Tsukada, T. Oi, M. Kitazawa, and H. Esaki, “Networked roadside perception units for autonomous driving,” Sensors, vol. 20, no. 18, p. 5320, 2020. [DOI](https://doi.org/10.3390/s20185320) [Google Scholar](https://scholar.google.com/scholar?as_q=Networked+roadside+perception+units+for+autonomous+driving&as_occt=title&hl=en&as_sdt=0%2C31)

[^11]: L. Jie, G. Yifan, T. Ming, and M. Liqiang, “Reconfigurable convolutional neural network accelerator based on ZYNQ,” Chin. J. Electron., vol. 49, no. 4, pp. 729–735, 2021. [Google Scholar](https://scholar.google.com/scholar?as_q=Reconfigurable+convolutional+neural+network+accelerator+based+on+ZYNQ&as_occt=title&hl=en&as_sdt=0%2C31)

[^12]: Z. Yu and C.-S. Bouganis, “A parameterisable FPGA-tailored architecture for YOLOv3-tiny,” in Proc. 16th Int. Symp. Appl. Reconfigurable Comput. Archit., Tools, Appl., Toledo, Spain, 2020, pp. 330–344. [DOI](https://doi.org/10.1007/978-3-030-44534-8_25) [Google Scholar](https://scholar.google.com/scholar?as_q=A+parameterisable+FPGA-tailored+architecture+for+YOLOv3-tiny&as_occt=title&hl=en&as_sdt=0%2C31)

[^13]: Z. Wang, K. Xu, S. Wu, L. Liu, L. Liu, and D. Wang, “Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2,” IEEE Access, vol. 8, pp. 116569–116585, 2020. [IEEE](https://ieeexplore.ieee.org/document/9122495) [Google Scholar](https://scholar.google.com/scholar?as_q=Sparse-YOLO%3A+Hardware%2Fsoftware+co-design+of+an+FPGA+accelerator+for+YOLOv2&as_occt=title&hl=en&as_sdt=0%2C31)

[^14]: Z. Lili, C. Zhen, L. Yuxuan, and Q. Lele, “Yolo v3-SPP real-time target detection system based on ZYNQ,” Opt. Precis. Eng., vol. 31, no. 4, pp. 543–551, 2023. [Google Scholar](https://scholar.google.com/scholar?as_q=Yolo+v3-SPP+real-time+target+detection+system+based+on+ZYNQ&as_occt=title&hl=en&as_sdt=0%2C31)

[^15]: M. Kim, K. Oh, Y. Cho, H. Seo, X. T. Nguyen, and H.-J. Lee, “A low-latency FPGA accelerator for YOLOv3-tiny with flexible layerwise mapping and dataflow,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 71, no. 3, pp. 1158–1171, Mar. 2024. [IEEE](https://ieeexplore.ieee.org/document/10360333) [Google Scholar](https://scholar.google.com/scholar?as_q=A+low-latency+FPGA+accelerator+for+YOLOv3-tiny+with+flexible+layerwise+mapping+and+dataflow&as_occt=title&hl=en&as_sdt=0%2C31)

[^16]: O. Eid and M. A. Abd El Ghany, “Hardware implementation of YOLOv4-tiny for object detection,” in Proc. Int. Conf. Microelectron. (ICM), 2021, pp. 270–275. [IEEE](https://ieeexplore.ieee.org/document/9664943) [Google Scholar](https://scholar.google.com/scholar?as_q=Hardware+implementation+of+YOLOv4-tiny+for+object+detection&as_occt=title&hl=en&as_sdt=0%2C31)

[^17]: J. Cao, Z. Yang, J. Lu, and J. Lai, “A high-performance YOLOV5 accelerator for object detection with near sensor intelligence,” in Proc. IEEE 15th Int. Conf. ASIC (ASICON), 2023, pp. 1–4. [IEEE](https://ieeexplore.ieee.org/document/10396271) [Google Scholar](https://scholar.google.com/scholar?as_q=A+high-performance+YOLOV5+accelerator+for+object+detection+with+near+sensor+intelligence&as_occt=title&hl=en&as_sdt=0%2C31)

[^18]: Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for point cloud based 3D object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4490–4499. [IEEE](https://ieeexplore.ieee.org/document/8578570) [Google Scholar](https://scholar.google.com/scholar?as_q=VoxelNet%3A+End-to-end+learning+for+point+cloud+based+3D+object+detection&as_occt=title&hl=en&as_sdt=0%2C31)

[^19]: Y. Yan, Y. Mao, and B. Li, “SECOND: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018. [DOI](https://doi.org/10.3390/s18103337) [Google Scholar](https://scholar.google.com/scholar?as_q=Second%3A+Sparsely+embedded+convolutional+detection&as_occt=title&hl=en&as_sdt=0%2C31)

[^20]: C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660. [Google Scholar](https://scholar.google.com/scholar?as_q=PointNet%3A+Deep+learning+on+point+sets+for+3D+classification+and+segmentation&as_occt=title&hl=en&as_sdt=0%2C31)

[^21]: C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–10. [Google Scholar](https://scholar.google.com/scholar?as_q=PointNet%2B%2B%3A+Deep+hierarchical+feature+learning+on+point+sets+in+a+metric+space&as_occt=title&hl=en&as_sdt=0%2C31)

[^22]: G. Qian, “PointNeXt: Revisiting PointNet++ with improved training and scaling strategies,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 23192–23204. [Google Scholar](https://scholar.google.com/scholar?as_q=PointNext%3A+Revisiting+pointnet%2B%2B+with+improved+training+and+scaling+strategies&as_occt=title&hl=en&as_sdt=0%2C31)

[^23]: A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “PointPillars: Fast encoders for object detection from point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12697–12705. [IEEE](https://ieeexplore.ieee.org/document/8954311) [Google Scholar](https://scholar.google.com/scholar?as_q=PointPillars%3A+Fast+encoders+for+object+detection+from+point+clouds&as_occt=title&hl=en&as_sdt=0%2C31)

[^24]: S. Shi, “PV-RCNN: Point-voxel feature set abstraction for 3D object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10529–10538. [IEEE](https://ieeexplore.ieee.org/document/9157234) [Google Scholar](https://scholar.google.com/scholar?as_q=PV-RCNN%3A+Point-voxel+feature+set+abstraction+for+3D+object+detection&as_occt=title&hl=en&as_sdt=0%2C31)

[^25]: D. Xu, D. Anguelov, and A. Jain, “PointFusion: Deep sensor fusion for 3D bounding box estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 244–253. [IEEE](https://ieeexplore.ieee.org/document/8578131) [Google Scholar](https://scholar.google.com/scholar?as_q=PointFusion%3A+Deep+sensor+fusion+for+3D+bounding+box+estimation&as_occt=title&hl=en&as_sdt=0%2C31)

[^26]: X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1907–1915. [IEEE](https://ieeexplore.ieee.org/document/8100174) [Google Scholar](https://scholar.google.com/scholar?as_q=Multi-view+3D+object+detection+network+for+autonomous+driving&as_occt=title&hl=en&as_sdt=0%2C31)

[^27]: J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3D proposal generation and object detection from view aggregation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2018, pp. 1–8. [IEEE](https://ieeexplore.ieee.org/document/8594049) [Google Scholar](https://scholar.google.com/scholar?as_q=Joint+3D+proposal+generation+and+object+detection+from+view+aggregation&as_occt=title&hl=en&as_sdt=0%2C31)

[^28]: C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets for 3D object detection from RGB-D data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 918–927. [IEEE](https://ieeexplore.ieee.org/document/8578200) [Google Scholar](https://scholar.google.com/scholar?as_q=Frustum+pointnets+for+3D+object+detection+from+RGB-D+data&as_occt=title&hl=en&as_sdt=0%2C31)

[^29]: S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “PointPainting: Sequential fusion for 3D object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4604–4612. [IEEE](https://ieeexplore.ieee.org/document/9156790) [Google Scholar](https://scholar.google.com/scholar?as_q=Pointpainting%3A+Sequential+fusion+for+3D+object+detection&as_occt=title&hl=en&as_sdt=0%2C31)

[^30]: L. Xie, “PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module,” in Proc. AAAI Conf. Artif. Intell., vol. 34, 2020, pp. 12460–12467. [DOI](https://doi.org/10.1609/aaai.v34i07.6933) [Google Scholar](https://scholar.google.com/scholar?as_q=PI-RCNN%3A+An+efficient+multi-sensor+3D+object+detector+with+point-based+attentive+cont-conv+fusion+module&as_occt=title&hl=en&as_sdt=0%2C31)

### Additional References

