Benchmarking Performance Gains of Odin v0 D-IMC accelerator over Jetson Orin Nano
This section details the performance benchmarks conducted to evaluate and compare the AI inference capabilities of the NVIDIA Jetson Orin Nano against a configuration combining the Jetson Orin Nano with an Odin v0 D-IMC accelerator.
The primary objectives of these benchmarks are:
- To quantify the performance gains achieved by offloading AI inference tasks to a dedicated AI accelerator.
- To analyze the performance of the Jetson Orin Nano across its different power modes.
- To provide a clear comparison of performance metrics, such as frames per second (FPS) and latency, for various neural network models and workloads.
The findings from these benchmarks are intended to guide future development and to help developers and engineers select the optimal hardware configuration for their edge AI applications based on performance requirements and power constraints.
Hardware Used
The following hardware components were used:
- Compute Device 1: NVIDIA Jetson Orin Nano Developer Kit.
- Compute Device 2: NVIDIA Jetson Orin Nano Developer Kit with Odin v0 D-IMC accelerator.
- Power Monitoring Tools: INA260
Software Used
The software environment was configured as follows:
- Operating System: NVIDIA JetPack SDK 6.2.1
- AI Frameworks: PyTorch 2.11.0, TensorRT 10.3.0
- Benchmark Scripts: The custom scripts located in the `code/benchmark` directory of our repository were used to run inference.
- Voyager SDK: Release v1.5
Methodology
The general workflow for the benchmark is as follows:
Model Preparation
Pre-trained models were used to run benchmark samples on the ImageNet-1K and Sentinel-2 multispectral datasets. The general workflow is as follows (a minimal build sketch follows the list):
- Convert the model from PyTorch / TensorFlow to ONNX, opset 18 (opset 17 for the D-IMC accelerator).
- Run `trtexec` (or use the `tensorrt` library) to build a TensorRT engine from the ONNX model (INT8 or FP16).
- Run `axcompile` (or use the `deploy.py` script) to build a D-IMC engine from the ONNX model (INT8).
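The sketch below illustrates the first two steps for the ResNet50 workload: exporting a stock torchvision model to ONNX and building a TensorRT engine through the Python API. It is a minimal sketch, assuming the TensorRT 10 Python API; the exact flags used in our runs are not reproduced here, and the `trtexec` equivalent is noted in a comment.

```python
import torch
import torchvision
import tensorrt as trt

# Export a stock pre-trained model to ONNX (opset 18 for TensorRT; the D-IMC toolchain uses opset 17).
model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=18)

# Build a TensorRT engine from the ONNX file
# (roughly equivalent to `trtexec --onnx=resnet50.onnx --fp16`).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TensorRT 10
parser = trt.OnnxParser(network, logger)
with open("resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed: " + str(parser.get_error(0)))
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # INT8 additionally requires a calibrator
engine_bytes = builder.build_serialized_network(network, config)
with open("resnet50.engine", "wb") as f:
    f.write(engine_bytes)
```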
Execution
Inference performance is evaluated across three workloads: ImageNet-1K classification, Sentinel-2 MS cloud detection (4 bands), and spacecraft pose estimation. The benchmark code isolates the Jetson GPU and the D-IMC accelerator to ensure a high-fidelity comparison between the Jetson Orin Nano and the Odin v0 D-IMC accelerator. Power modes were controlled to simulate various constrained situations (using `sudo nvpmodel -m <mode>`).
The benchmark scripts call the relevant model via its respective Python bindings, measuring inference cycles across a 100-iteration warmup and a 1000-iteration test loop; a minimal sketch of this pattern is shown below. The reported results are an average of 50 runs of the benchmarking script.
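A minimal sketch of the warmup/measurement pattern, where `run_inference` is a hypothetical stand-in for the TensorRT or D-IMC Python binding call:

```python
import statistics
import time

WARMUP_ITERS = 100
TEST_ITERS = 1000

def benchmark(run_inference, input_batch):
    """Time a single-model inference loop.

    `run_inference` is a placeholder for the TensorRT or D-IMC binding call.
    """
    for _ in range(WARMUP_ITERS):  # warm caches, clocks, and allocators
        run_inference(input_batch)
    latencies_ms = []
    for _ in range(TEST_ITERS):
        t0 = time.perf_counter()
        run_inference(input_batch)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    mean_ms = statistics.mean(latencies_ms)
    return {"mean_ms": mean_ms,
            "fps": 1000.0 / mean_ms,
            "stdev_ms": statistics.stdev(latencies_ms)}
```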
- Standalone Jetson Orin Nano (Baseline): For the baseline, all models are optimized using NVIDIA TensorRT and executed on the integrated Ampere GPU. To analyze performance under varied orbital power budgets, benchmarks are run across the Orin Nano’s native power profiles (7W, 15W, 25W, 25W MAXN Super).
- Jetson Orin Nano + Odin v0 D-IMC accelerator
In this configuration, the Orin Nano acts as the host controller, while the heavy INT8 inference tasks are offloaded to the Odin v0 D-IMC accelerator via the M.2 PCIe interface. Offloading is done in three stages (a host-side sketch follows the list):
- Quantization: Models are converted to INT8 using the Voyager Optimizer to leverage the D-IMC accelerator's high-efficiency compute cores.
- Compilation: The Voyager SDK compiles the quantized models into hardware-specific kernels (`kernel_x.c`).
- Execution: The benchmark script uses the `axelera` runtime API to initialize the accelerator and stream multispectral or RGB buffers directly to the AIPU.
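The host-side flow for this configuration is sketched below. The Voyager runtime API is not reproduced here: `axelera_rt`, `open_device`, `load_model`, and `run` are hypothetical stand-ins used only to illustrate the initialize-then-stream pattern described above.

```python
# Illustrative only: `axelera_rt` and its methods are hypothetical stand-ins
# for the Voyager SDK runtime bindings, not the actual API.
import numpy as np
import axelera_rt  # hypothetical module name

def offload_tiles(kernel_path: str, tiles: list) -> list:
    device = axelera_rt.open_device(0)                  # hypothetical: claim the M.2 AIPU
    model = axelera_rt.load_model(device, kernel_path)  # hypothetical: load compiled kernel
    # Stream INT8 tiles to the AIPU: one H2D transfer + inference per call.
    return [model.run(np.ascontiguousarray(t)) for t in tiles]
```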
Data Collection
Mission-Critical Metrics for satellite operations are evaluated to analyze the trade-off between throughput and energy consumption. The following metrics are collected:
| Metric | Description |
|---|---|
| Inference Latency (ms) | The time required for a single forward pass through the model. |
| End-to-End Latency (ms) | Total time including data preprocessing, H2D (Host-to-Device) transfer, inference, and D2H transfer. |
| Throughput (FPS) | Frames processed per second, calculated as 1000 / Inference Latency (ms). |
| Power Consumption (W) | Real-time power draw of the Jetson module and the D-IMC accelerator combined. |
| Efficiency (FPS/Watt) | The primary KPI for satellite deployment, indicating compute density per unit of energy. |
| Utilization (%) | GPU/AIPU load and CPU overhead during the inference pipeline. |
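The two derived metrics in the table follow directly from the measured quantities; for reference:

```python
def throughput_fps(inference_latency_ms: float) -> float:
    # Throughput (FPS) = 1000 / Inference Latency (ms), as defined in the table.
    return 1000.0 / inference_latency_ms

def efficiency_fps_per_watt(fps: float, power_w: float) -> float:
    # Efficiency (FPS/W): compute density per unit of energy.
    return fps / power_w
```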
A dual-logging strategy is used to ensure data integrity for post-processing:
- System Telemetry: A background process captures `tegrastats`, D-IMC accelerator sensor data, and INA260 (external power monitor) readings, logging instantaneous power and thermal metrics into a .csv file (see the sketch after this list).
- Performance Metrics: The Python execution script aggregates timing data using high-resolution timers (`time.perf_counter()`).
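A minimal sketch of the telemetry side of this strategy, assuming the stock `tegrastats` utility shipped with JetPack (the D-IMC sensor and INA260 capture paths are omitted here):

```python
import subprocess

def start_tegrastats(logfile: str, interval_ms: int = 100) -> subprocess.Popen:
    """Run tegrastats in the background, writing one sample per interval
    to `logfile` for later conversion into the telemetry .csv."""
    return subprocess.Popen(
        ["tegrastats", "--interval", str(interval_ms), "--logfile", logfile])

logger = start_tegrastats("telemetry_raw.log")
# ... run the benchmark loop ...
logger.terminate()
```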
Execution Flow
ImageNet-1K
This test establishes a baseline for all subsequent tests. It is used to evaluate the raw throughput and efficiency of the compute engines and to compare results with published figures.
- Architecture: ResNet50 (224×224 input resolution)
- Input Data: images from ImageNet-1K dataset.
- Precision: INT8 (Quantized)
- For the Orin Nano, the model is optimized using TensorRT, utilizing symmetric quantization and calibration.
- For the Odin v0, the model is quantized using the Voyager Optimizer for the accelerator's Digital In-Memory Computing (D-IMC) architecture.
- Metrics Measured:
- Throughput (FPS): We compare the maximum throughput achieved by the 1024-core Ampere GPU versus the dedicated D-IMC accelerator. While the Orin Nano provides a versatile compute environment, Odin v0 is expected to demonstrate superior throughput by offloading the heavy matrix multiplications from the GPU.
- Energy Efficiency (TOPS/Watt): In orbital environments, the thermal budget is often more restrictive than the power budget. We calculate efficiency by dividing the operations per second by the total system power draw: Efficiency (TOPS/W) = (Ops/s) / (Total Power in W). A helper for this calculation is sketched after this list.
- Batch Size Analysis: Benchmarks are performed at batch_size=1 (for real-time responsiveness) and batch_size=8+ (to maximize hardware utilization and analyze memory bandwidth saturation).
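The efficiency figure above reduces to a one-line calculation. In the sketch below, `ops_per_frame` is the model's per-inference operation count as reported by a profiler; no specific value is assumed here.

```python
def tops_per_watt(fps: float, ops_per_frame: float, power_w: float) -> float:
    """Energy efficiency in TOPS/W: (operations per second) / (system power).

    `ops_per_frame` comes from a model profiler; 1e12 converts ops/s to TOPS.
    """
    return (fps * ops_per_frame) / (power_w * 1e12)
```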
While classification is a general task, ResNet50 serves as the backbone for many space-specific derivative models. Understanding its performance-per-watt on this hardware stack allows us to predict the feasibility of complex multi-model pipelines in which the Jetson GPU is reserved for pre-processing while Odin v0 handles the primary inference load, enabling low-power processing.
Sentinel-2 MS Cloud Detection (4 Bands)
This test evaluates the system's ability to perform real-time semantic segmentation on multi-spectral satellite imagery. Efficient cloud detection is one of many steps involved in MS image processing; performing it onboard allows satellites to autonomously filter "cloud-contaminated" data and prioritize the downlink of clear scenes, significantly reducing required bandwidth.
- Architecture: DTACSNet, a deep learning architecture optimized for binary cloud masking.
- Input Data: 4-band Sentinel-2 Level-2A (L2A) multispectral imagery. 10m resolution bands: B02 (Blue), B03 (Green), B04 (Red), and B08 (Near-Infrared) are utilized for the test.
- Data Handling: To manage the high dimensionality of Sentinel-2 scenes, images are processed in fixed-size pixel tiles (a tile-loading sketch follows this list). The pipeline incorporates I/O optimizations via GDAL to ensure that data loading and pre-processing do not become a bottleneck for the high-speed inference engine.
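A minimal sketch of tile-wise loading with the GDAL Python bindings, assuming the four 10 m bands (B02, B03, B04, B08) have been stacked into a single raster; the tile size and file layout are placeholders:

```python
import numpy as np
from osgeo import gdal  # GDAL Python bindings

def read_tile(path: str, xoff: int, yoff: int, tile: int) -> np.ndarray:
    """Read one (4, tile, tile) window from a stacked 4-band Sentinel-2 L2A raster."""
    ds = gdal.Open(path)
    # Dataset-level ReadAsArray returns every band for the requested window.
    window = ds.ReadAsArray(xoff, yoff, tile, tile)
    return window.astype(np.float32)
```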
The primary focus of this benchmark is to measure the efficiency of our strategies in bridging data between the Jetson host and the Odin v0 D-IMC accelerator for dense semantic segmentation tasks.
- Metrics Measured:
- Pixel-Level Throughput: Unlike classification, segmentation requires high-resolution output masks. The time taken to generate a full-scene cloud mask is measured by aggregating individual tile inference results (a stitching sketch follows this list).
- Memory Bandwidth Efficiency: We analyze how the Odin v0 handles the increased memory pressure of 4-band input tensors compared to standard 3-band RGB inputs.
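The full-scene aggregation referenced above can be as simple as the sketch below; tile offsets and size are placeholders:

```python
import numpy as np

def stitch_masks(tile_masks: dict, scene_h: int, scene_w: int, tile: int) -> np.ndarray:
    """Reassemble per-tile binary cloud masks into a full-scene mask.

    Keys of `tile_masks` are (yoff, xoff) pixel offsets of each tile.
    """
    scene = np.zeros((scene_h, scene_w), dtype=np.uint8)
    for (yoff, xoff), mask in tile_masks.items():
        scene[yoff:yoff + tile, xoff:xoff + tile] = mask
    return scene
```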
For a low-power satellite DPU, the "Cloud Masking" latency directly dictates the maximum frame rate at which the satellite can process ground swaths. By offloading this task to the D-IMC accelerator, the Jetson Orin Nano’s GPU is freed to perform concurrent tasks, such as geometric correction or image compression, enabling a truly parallelized onboard data processing pipeline.
Spacecraft Pose Estimation
This test evaluates the system’s capability to perform high-precision Vision-Based Navigation (VBN). Spacecraft pose estimation is fundamental for GNC (Guidance, Navigation, and Control) during proximity operations, such as autonomous docking, satellite servicing, or active debris removal.
- Architecture: Pose-ResNet50, utilizing a ResNet50 backbone for feature extraction followed by specialized regression heads.
- Parameters Estimated: The model outputs a 7-element vector representing the relative pose:
- 3-DoF Translation: (x, y, z) coordinates in the camera frame.
- Rotation: a 4-element unit quaternion representing the 3-DoF orientation.
- Hybrid Execution: The pipeline is split between the hardware components (a post-processing sketch follows this list):
- D-IMC accelerator: Handles the compute-intensive ResNet50 feature extraction in INT8.
- Jetson CPU/GPU: Executes the post-processing "algorithmic functions," including quaternion normalization and coordinate frame transformations that require floating-point precision for mathematical integrity.
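A minimal sketch of this host-side post-processing step, assuming the accelerator returns the raw 7-element regression output as a flat array:

```python
import numpy as np

def decode_pose(raw: np.ndarray):
    """Split the 7-element regression output into translation and a unit
    quaternion; normalization runs in FP32 on the Jetson host."""
    t = raw[:3].astype(np.float32)   # (x, y, z) in the camera frame
    q = raw[3:7].astype(np.float32)
    q /= np.linalg.norm(q)           # enforce unit norm for a valid rotation
    return t, q
```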
The challenge of pose estimation lies in maintaining millisecond-level latency to support real-time control loops while operating within a rigid power envelope. The following metrics are measured during this test:
- Regression Latency: We measure the "Glass-to-Result" time, i.e., the duration from capturing a frame of a target spacecraft to the availability of the 7-element pose vector.
- Host-Accelerator Coordination: This test specifically monitors the overhead of the hand-off between the AIPU’s feature maps and the Jetson’s post-processing logic.
- Jitter Analysis: High-frequency control loops require stable inference times. We analyze the standard deviation of latency across 5,000 frames to ensure no significant "spikes" occur due to PCIe bus contention or thermal throttling (a summary-statistics sketch follows this list).
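The jitter analysis reduces to summary statistics over the recorded latencies; the p99 field below is an added illustration for flagging transient spikes:

```python
import numpy as np

def jitter_stats(latencies_ms: np.ndarray) -> dict:
    """Characterize latency stability across the 5,000-frame run."""
    return {
        "mean_ms": float(np.mean(latencies_ms)),
        "stdev_ms": float(np.std(latencies_ms)),   # the jitter figure reported
        "p99_ms": float(np.percentile(latencies_ms, 99)),  # spike detection
    }
```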
In a docking scenario, the DPU must provide pose updates at a high frequency (e.g., >20 Hz). By offloading the ResNet50 backbone to the Odin v0, the system aims to achieve these high refresh rates at a fraction of the power consumption required by a standalone GPU, allowing the Jetson to remain in a lower power state while still meeting the mission's real-time constraints.