
GEMM Raw Throughput Testing

This guide provides a technical deep-dive into General Matrix Multiplication (GEMM) performance on the NVIDIA Jetson Orin Nano (Ampere) and the Odin v0 (D-IMC). We analyze the transition from memory-bound to math-bound regimes and the effects of hardware-specific tiling strategies. Based on NVIDIA's Matrix Multiplication Guide.

2. Background: Matrix-Matrix Multiplication

We consider the following (simple) GEMM operation:

$$\mathbf{C} = \alpha \mathbf{A}\mathbf{B} + \beta \mathbf{C}$$

where $\mathbf{A}\in\mathbb{R}^{M\times K}$, $\mathbf{B}\in\mathbb{R}^{K\times N}$, and $\mathbf{C} \in\mathbb{R}^{M\times N}$; $\mathbf{A}$ and $\mathbf{B}$ are matrix inputs, $\alpha$ and $\beta$ are scalar inputs, and $\mathbf{C}$ is a pre-existing matrix that is overwritten by the output. For the compiled-model phase, as will be seen later, we consider the case of a fully-connected layer, where $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ act as the input, weights, and bias respectively.

2.1 Arithmetic Intensity

To determine if a benchmark is limited by the processor's speed (Math Bound) or memory bandwidth (Memory Bound), we calculate the Arithmetic Intensity:

$$\text{Arithmetic Intensity} = \frac{2\cdot M \cdot N \cdot K}{\text{bytes\_per\_element}\left( M\cdot K + K\cdot N + M\cdot N\right)}$$

where bytes_per_element is 2 for both FP16 and INT8 (counting load and store cycles per element). On the Jetson Orin Nano, the Roofline inflection point is reached significantly later than on the Odin v0 due to the Odin v0's Digital In-Memory Computing (D-IMC) architecture, which drastically reduces the denominator (bytes moved from global memory) by keeping weights stationary within the compute fabric.

For simplicity, we use $M = K = N$. We also set $\alpha = 1, \beta = 1$, which represents the addition of a skip-connection with a linear operation (Multiply-and-Accumulate, MAC). For this scenario, a MAC operation contains $N^3$ multiplications and $N^3$ additions, totalling $2N^3$ floating-point operations. For the square case, Arithmetic Intensity $= N/3$, where the denominator accounts for three $N\times N$ matrices at 2 effective bytes each (load + store), giving $2N^3 / (6N^2) = N/3$.
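
These expressions are easy to sanity-check numerically. The sketch below (plain Python) evaluates the arithmetic intensity under the 2-effective-bytes-per-element convention used in this guide, together with the roofline ridge point quoted later in Section 5.2; the 1.5 TFLOPs FP32 and 68 GB/s figures are the Orin Nano numbers used throughout this document.

def arithmetic_intensity(M, N, K, bytes_per_element=2):
    """Ops per byte moved, using this guide's load-plus-store convention."""
    ops = 2 * M * N * K
    bytes_moved = bytes_per_element * (M * K + K * N + M * N)
    return ops / bytes_moved

# Square case: AI = N / 3
for n in (512, 1024, 2048, 4096):
    print(n, round(arithmetic_intensity(n, n, n), 2))  # 170.67, 341.33, 682.67, 1365.33

# Roofline ridge point for scalar FP32 on the Orin Nano (see Section 5.2)
peak_fp32 = 1.5e12   # ~1.5 TFLOPS on the FP32 CUDA cores
bandwidth = 68e9     # 68 GB/s LPDDR5
print(peak_fp32 / bandwidth)  # ~22 FLOPs/B, crossed near N ≈ 66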

3. Hardware Execution Models

3.1 Jetson Orin's GPU

The Orin Nano 8GB shares a unified LPDDR5 memory (68 GB/s bandwidth) between the CPU and GPU. The Orin Nano's Ampere architecture utilizes HMMA (Half-Precision Matrix Multiply-Accumulate) instructions, and a GEMM executes in four stages:

  1. Addition Accumulation: The $\beta\mathbf{C}$ product is read from global memory before accumulation and written back whenever $\beta \neq 0$.
  2. Register Accumulation: The $\alpha\mathbf{AB}$ product is calculated in the register file.
  3. Tile Quantization: The GPU partitions the matrix into tiles of dimension $(N_\text{tile} \times N_\text{tile})$. If $N$ is not a multiple of $N_\text{tile}$, partial tiles are launched. They execute for the same number of clock cycles as a full tile but yield fewer useful Operations per Second (OPs), reducing effective throughput.
  4. Wave Quantization: The Orin Nano 8GB has 8 Streaming Multiprocessors (SMs). If the total number of tiles $\lceil N / N_\text{tile} \rceil^2$ is not a multiple of 8, the final wave of tiles under-utilizes the hardware (see the sketch at the end of this subsection).
tip

Optimal Configuration: To ensure maximal SM occupancy, $N$ should be a multiple of $8\times N_\text{tile}$, i.e. a multiple of 256 for FP16 Tensor Cores (tile size 32), or a multiple of $16\times N_\text{tile}$, i.e. a multiple of 512 for INT8 (tile size 32).

In the benchmarks below, $N$ is accordingly kept a multiple of 8 for FP16 Tensor Cores and a multiple of 16 for INT8.
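
The cost of tile and wave quantization can be estimated with a few lines of arithmetic. The sketch below is an illustrative model only, assuming the 32×32 tile size and 8 SMs described above and one tile per SM per wave; it is not derived from the benchmark code.

import math

TILE = 32     # tile edge (Section 3.1)
NUM_SMS = 8   # Orin Nano 8GB

def quantization_efficiency(N, tile=TILE, sms=NUM_SMS):
    """Upper-bound utilization after tile and wave quantization."""
    tiles_per_dim = math.ceil(N / tile)                 # partial tiles still launch
    total_tiles = tiles_per_dim ** 2
    tile_eff = (N * N) / (tiles_per_dim * tile) ** 2    # useful vs. launched work
    waves = math.ceil(total_tiles / sms)
    wave_eff = total_tiles / (waves * sms)              # last wave may be under-filled
    return tile_eff, wave_eff

for n in (512, 520, 1024):
    print(n, quantization_efficiency(n))
# 512 and 1024 divide evenly into tiles and waves (both efficiencies 1.0);
# 520 launches partial tiles and an under-filled final wave.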

3.2 Odin v0 (D-IMC)

Unlike the Jetson Orin GPU's temporal execution (different operations performed on different data at different points in time using fixed-location ALUs), the Odin v0 utilizes Digital-In-Memory Computing, where weights are stationary within the SRAM-backed crossbar array. Operations proceed in three stages:

  1. SRAM Staging: Matrix weights $\mathbf{B}$ are staged in local SRAM where the D-IMC crossbar resides. This is a one-time cost per model deployment.
  2. Multiplication: Input $\mathbf{A}$ is streamed through the crossbar grid where $\mathbf{B}$ is stored. Each cell of the crossbar computes a partial dot product, yielding a fully spatial MAC with no data movement between compute and memory.
  3. Accumulation: For $\beta \neq 0$, the $\beta\mathbf{C}$ addition is handled at the output of the D-IMC crossbar before writing the result back over the PCIe interface.

The primary bottleneck for the Odin v0 is the PCIe 3.0 x4 interface, which provides a theoretical peak bandwidth of 4 GB/s and a practical bandwidth of approximately 3.2–3.5 GB/s once protocol overhead is accounted for. When the weight matrix fits entirely within the on-chip SRAM, compute is effectively free relative to the data ingestion rate, making the system entirely PCIe-bound.

4. Benchmarking Methodology

4.1 Measurement Protocol

To eliminate transient noise and cold-start penalties:

  1. Warm-up: 100 iterations of the kernel to trigger frequency scaling (boost clocks) and populate instruction caches.
  2. Steady-State: 1000 iterations recorded via high-precision hardware timers (cudaEvent on Jetson; Voyager SDK's internal profiler on Odin v0).
  3. Cache Purge: Cache flushing between runs to ensure "cold" memory-access conditions for memory-bound analysis.

On Jetson Orin Nano, jetson_clocks is used to lock the GPU at maximum frequency (clocks pinned, fan at full speed). Odin v0 is configured to 100% utilization mode via the Voyager runtime flags.
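
A minimal harness implementing this protocol might look like the following sketch; run_once and flush_caches are placeholders for the backend-specific kernel launch and cache purge (on Jetson, cudaEvent timers around the kernel would replace time.perf_counter for finer resolution).

import time
import numpy as np

def measure(run_once, warmup=100, iters=1000, flush_caches=None):
    """Warm-up plus steady-state timing, as described in Section 4.1.
    run_once: callable that launches one kernel and blocks until completion.
    flush_caches: optional callable used for the cold-cache (memory-bound) analysis."""
    for _ in range(warmup):          # trigger boost clocks, warm instruction caches
        run_once()
    latencies_ms = []
    for _ in range(iters):
        if flush_caches is not None:
            flush_caches()
        t0 = time.perf_counter()
        run_once()
        t1 = time.perf_counter()
        latencies_ms.append((t1 - t0) * 1e3)
    return float(np.mean(latencies_ms)), float(np.std(latencies_ms))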

4.2 Precision Targets

| Accelerator | Format | Peak Theoretical |
| --- | --- | --- |
| Orin Nano Super | FP16 / INT8 | 17 TFLOPs / 67 TOPs (Sparse) / 33 TOPs (Dense) |
| Odin v0 | INT8 | 214 TOPS |

4.3 Environment Setup

Host Hardware:

  • NVIDIA Jetson Orin Nano 8GB Developer Kit
  • Odin v0 D-IMC accelerator connected via M.2 Key M Slot (PCIe 3.0 x4)
| Component | Jetson Orin Nano | Odin v0 |
| --- | --- | --- |
| OS | Ubuntu 22.04 LTS (L4T 36.4) | Ubuntu 22.04 LTS (host) |
| SDK / Runtime | JetPack 6.2.1 | Voyager SDK 1.5 |
| Compiler / Backend | TensorRT 10.3, CUDA 12.6 | Voyager Compiler (ONNX) |
| ONNX Export | PyTorch 2.11 + torch.onnx | PyTorch 2.11 + torch.onnx |
| Python | 3.10 | 3.10 |
| Power Profile | nvpmodel -m 2 (25W MAXN Super) | High-Performance Mode (100% Utilization) |
# Jetson: Pin clocks for deterministic benchmarking
sudo nvpmodel -m 2 # MAXN Super
sudo jetson_clocks --store   # Save the current clock configuration
sudo jetson_clocks           # Pin GPU/CPU/EMC clocks to maximum
sudo jetson_clocks --restore # Restore the saved configuration when done

# Verify GPU frequency
sudo jetson_clocks --show

4.4 Measured Metrics

Each benchmark run records the following metrics:

| Metric | Description | Unit |
| --- | --- | --- |
| Compute Latency | Wall-clock time for the GEMM kernel execution only, averaged over 1000 iterations | ms |
| Throughput | Effective operations per second: $2N^3 / \text{latency}$ | GOPs or TOPs |
| H2D Latency | Host-to-Device transfer time for input matrix $\mathbf{A}$ over PCIe (Odin v0 only; absent for Jetson unified memory) | ms |
| D2H Latency | Device-to-Host transfer time for output matrix $\mathbf{C}$ over PCIe (Odin v0 only) | ms |
| Byte Access | Total bytes read/written across the memory bus: $(M \cdot K + K \cdot N + M \cdot N) \times \text{bytes\_per\_element}$ | MB |
| Arithmetic Intensity | Ratio of compute operations to bytes accessed: $2N^3 / \text{Byte Access}$ | OPs/B |
note

Throughput values reported for Jetson TensorRT are total system GOPs using INT8 precision. Throughput values for Odin v0 are per-core GOPs; multiply by 4 for total system throughput.
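
The derived columns in the tables that follow can be reproduced from the raw latencies with a helper along these lines (a sketch; bytes_per_element depends on the precision and on whether the load-plus-store convention of Section 2.1 is applied):

def gemm_metrics(latency_ms, M, N, K, bytes_per_element):
    """Derived GEMM metrics using the formulas from Sections 2.1 and 4.4."""
    ops = 2 * M * N * K                                # one multiply + one add per MAC
    byte_access = bytes_per_element * (M * K + K * N + M * N)
    return {
        "throughput_gops": ops / (latency_ms * 1e-3) / 1e9,
        "byte_access_mb": byte_access / 2**20,
        "arithmetic_intensity": ops / byte_access,
    }

# Worked example from Section 6.2: N = 1024 at 0.531 ms gives roughly 4.0 TOPs
print(gemm_metrics(0.531, 1024, 1024, 1024, bytes_per_element=1)["throughput_gops"])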

5. Raw CUDA Kernels

Based on NVIDIA's Matrix Multiplication Performance Guide, a naive tiled CUDA kernel was implemented to establish a performance baseline across a range of matrix sizes. Note that for this phase, all matrices A\mathbf{A}, B\mathbf{B}, and C\mathbf{C} are treated as runtime inputs — no weight-stationary optimization is applied.

#define TILE 32

__global__ void full_gemm(float *A, float *B, float *C, int M, int N, int K, float alpha, float beta) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float acc = 0.0f;

    // Loop over tiles of the K dimension to compute A*B
    for (int m = 0; m < (K + TILE - 1) / TILE; ++m) {
        // Load A tile (handle boundary)
        if (row < M && (m * TILE + tx) < K)
            sA[ty][tx] = A[row * K + (m * TILE + tx)];
        else
            sA[ty][tx] = 0.0f;

        // Load B tile (handle boundary)
        if (col < N && (m * TILE + ty) < K)
            sB[ty][tx] = B[(m * TILE + ty) * N + col];
        else
            sB[ty][tx] = 0.0f;

        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k) {
            acc += sA[ty][k] * sB[k][tx];
        }
        __syncthreads();
    }

    // Apply GEMM logic: C = alpha * (A*B) + beta * C
    if (row < M && col < N) {
        int idx = row * N + col;
        C[idx] = (alpha * acc) + (beta * C[idx]);
    }
}

TILE is set to 32 in accordance with the warp size and the shared memory bank layout described in Section 3.1. Each 32×32 shared memory tile maps cleanly to one warp per row, minimizing bank conflicts on Ampere. Running this kernel produces the following results:

| MATRIX_SIZE | Compute Latency (ms) | Throughput (GFLOPs) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (FLOPs/B) |
| --- | --- | --- | --- | --- | --- | --- |
| 512x512 | 1.319 | 203.450 | 0.445 | 0.185 | 1.5 | 170.67 |
| 1024x1024 | 11.080 | 193.820 | 1.527 | 0.635 | 6.0 | 341.33 |
| 2048x2048 | 86.110 | 199.510 | 6.295 | 2.268 | 24.0 | 682.67 |
| 4096x4096 | 702.326 | 195.690 | 24.975 | 6.635 | 96.0 | 1365.33 |
note

All runs were performed with the GPU locked at maximum frequency via jetson_clocks.

5.1 Why CUDA Cannot Utilize Jetson's Unified Memory Advantage Here

Even though the Orin Nano's CPU and GPU share a physical LPDDR5 pool, this naive kernel still incurs explicit cudaMemcpy calls (H2D/D2H) because the data originates on the CPU heap. The unified memory system only eliminates copy overhead when using cudaMallocManaged or Pinned Memory (cudaMallocHost) — neither of which this kernel exploits. The H2D/D2H latency values in the table above represent the actual copy cost over the internal memory fabric.

More critically, the throughput ceiling of ~200 GFLOPs is far below the Orin Nano Super 8GB's rated 33 TOPS (INT8 Dense) or 17 TFLOPs (FP16) because this kernel:

  • Uses FP32 CUDA Cores, not Tensor Cores. Tensor Cores are only invoked via wmma API calls or mma.sync PTX instructions. Without them, the hardware operates at its scalar FP32 ceiling (~1.5 TFLOPS on the Orin Nano 8GB) rather than the Tensor Core peak.
  • Does not saturate memory bandwidth. While shared memory tiling reduces global loads, naïve bank-conflict patterns and uncoalesced edge accesses prevent the kernel from reaching the 68 GB/s LPDDR5 bandwidth ceiling.
  • Has no software pipelining. cuBLAS and TensorRT use double-buffered shared memory and asynchronous cp.async loads to overlap compute with data movement. This kernel synchronizes at every tile boundary via __syncthreads(), stalling the pipeline. The ~200 GFLOPs ceiling is therefore the FP32 scalar-core throughput at this occupancy level, not a reflection of the hardware's Tensor Core capability. See NVIDIA's cuBLAS documentation for a comparison of cublasSgemm vs cublasGemmEx (Tensor Core path).

5.2 Math-Bound vs. Memory-Bound Analysis

At $N = 4096$, the arithmetic intensity is 1365.33 FLOPs/B. The Roofline inflection point for FP32 on the Orin Nano 8GB, calculated as peak compute / peak bandwidth $= 1500\ \text{GFLOPs} / 68\ \text{GB/s} \approx 22\ \text{FLOPs/B}$, is crossed very early ($N > 66$). This means all four matrix sizes in the table above are strictly math-bound under this kernel. The ALUs are the bottleneck, not the memory bus.

However, because the kernel does not use Tensor Cores, the "Math Ceiling" is capped at the scalar FP32 core throughput (~1.5 TFLOPS), not the Tensor Core peak (17 TFLOPs FP16). This explains the flat ~200 GFLOPs throughput across all matrix sizes: the kernel is math-bound, pinned against a low ceiling, with no benefit from further increases in $N$.

Implication for AI model sizing on Jetson Orin Nano: For real inference workloads, the takeaway is that small-to-medium matrix sizes ($N < 512$) will be memory-bound even with TensorRT, and optimal throughput requires batching inputs to raise $N$ (and thus arithmetic intensity) into the math-bound regime. This is the primary argument for batch size > 1 in production deployments on Jetson.
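
To see why batching raises intensity, apply the formula from Section 2.1 to a batched fully-connected layer: with batch size B, the input is B×K and the weights are K×N, so the weight matrix is reused B times. A small illustrative sketch (INT8 accounting at 1 byte per element; the layer width is arbitrary):

def fc_arithmetic_intensity(batch, K, N, bytes_per_element=1):
    """AI of a fully-connected layer: A is (batch x K), weights are (K x N)."""
    ops = 2 * batch * K * N
    bytes_moved = bytes_per_element * (batch * K + K * N + batch * N)
    return ops / bytes_moved

for b in (1, 8, 64, 512):
    print(b, round(fc_arithmetic_intensity(b, 1024, 1024), 1))
# AI grows roughly linearly with batch size (about 2, 16, 114, 512 ops/B here),
# saturating near 2*K*N / (K + N) once the weights no longer dominate the traffic.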

5.3 SM Occupancy and Throughput Plateaus

Orin Nano SM Occupancy: SM occupancy can be measured using NVIDIA Nsight Compute. Run:

ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active \
--target-processes all ./gemm_benchmark

For this naive kernel, 100% occupancy is reached at approximately $N = 512$, where the number of tiles ($16 \times 16 = 256$) matches the $8\ \text{SMs} \times 32\ \text{warps/SM} = 256$ concurrent warps. Above this point, occupancy does not increase; throughput improvements come only from reducing per-instruction latency (i.e., using Tensor Cores).

Odin v0 PCIe Plateau: For the Odin v0, the compute throughput per core begins to plateau when the PCIe transfer of input matrix $\mathbf{A}$ dominates total latency. Based on theoretical PCIe 3.0 x4 bandwidth (4 GB/s peak, ~3.2 GB/s practical), the crossover point where compute time ≈ transfer time is approximately $N \approx 768$ for INT8, corresponding to a 576 KB input. Beyond this size, gains in raw compute throughput are masked by PCIe ingestion delays.

6. Experimental Results

For this section, GEMM is treated as a Deep Learning operation: only matrix $\mathbf{A}$ is an input at inference time; $\mathbf{B}$ (weights), $\mathbf{C}$ (bias), $\alpha$, and $\beta$ are baked in as model parameters. The resultant model is exported to ONNX and used to build a TensorRT engine (Jetson) and a compiled model file (AXM, Odin v0).
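
A minimal sketch of that export is shown below, assuming a PyTorch module in which B, C, α, and β are frozen parameters; the module, file names, and shapes are illustrative rather than taken from the benchmark repository. The same ONNX file feeds both the TensorRT builder in Section 6.2 and the Voyager compiler.

import torch
import torch.nn as nn

class GemmLayer(nn.Module):
    """C = alpha * (A @ B) + beta * C_bias, with B, C_bias, alpha, beta baked in."""
    def __init__(self, N, alpha=1.0, beta=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(N, N), requires_grad=False)
        self.bias = nn.Parameter(torch.randn(N, N), requires_grad=False)
        self.alpha, self.beta = alpha, beta

    def forward(self, A):
        return self.alpha * (A @ self.weight) + self.beta * self.bias

N = 1024
model = GemmLayer(N).eval()
dummy_A = torch.randn(N, N)   # 2D here for brevity; the Odin v0 path wraps inputs
                              # as (1, 1, N, N), see Section 7.3
torch.onnx.export(
    model, dummy_A, "gemm_n1024.onnx",
    input_names=["input_A"], output_names=["output_C"], opset_version=17,
)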

6.1 INT8 Quantization and Calibration

Both TensorRT and Voyager use Post-Training Quantization (PTQ) to convert the FP32 ONNX model to INT8. The calibration procedure:

  1. Calibration Dataset: 512 synthetic input matrices drawn from $\mathcal{U}(-1, 1)$, representative of the expected input distribution.
  2. TensorRT Calibrator: IInt8EntropyCalibrator2 — minimizes KL-divergence between the FP32 and INT8 activation distributions to determine per-tensor scale factors.
  3. Voyager Quantizer: Uses the Voyager SDK's built-in PTQ pipeline, which clips activations based on a percentile of the calibration distribution (default: 99.99th percentile).

Accuracy vs. FP32 Baseline (Mean Absolute Error on output $\mathbf{C}$):

| Backend | Precision | MAE vs. FP32 Baseline |
| --- | --- | --- |
| CUDA (this work) | FP32 | 0.000 (reference) |
| TensorRT | INT8 | < 0.002 |
| Voyager | INT8 | < 0.004 |

The slightly higher MAE for the D-IMC accelerator reflects the additional quantization of intermediate accumulations within the D-IMC crossbar. For GEMM-only operations, both INT8 backends remain within acceptable bounds for downstream neural network layers.
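
For the TensorRT path, the calibrator handed to build_engine in Section 6.2 follows the standard IInt8EntropyCalibrator2 pattern. The sketch below feeds the 512 synthetic U(-1, 1) matrices described above; the class name, cache file, and batch handling are illustrative, not the exact implementation used for these results.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

class GemmEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Streams synthetic U(-1, 1) calibration matrices to TensorRT."""
    def __init__(self, N, num_batches=512, cache_file="gemm_calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.N = N
        self.remaining = num_batches
        self.device_input = cuda.mem_alloc(N * N * 4)   # one FP32 matrix
        self.cache_file = cache_file

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.remaining == 0:
            return None                                  # calibration finished
        self.remaining -= 1
        batch = np.random.uniform(-1, 1, (self.N, self.N)).astype(np.float32)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None                                      # always recalibrate

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)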

6.2 Jetson Orin Nano (TensorRT — INT8)

| MATRIX_SIZE | Compute Latency (ms) | Throughput (GOPs) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (OPs/B) |
| --- | --- | --- | --- | --- | --- | --- |
| 512×512 | 0.167 | 1,602.72 | N/A | N/A | 0.75 | 170.67 |
| 1024×1024 | 0.531 | 4,048.02 | N/A | N/A | 3.0 | 341.33 |
| 2048×2048 | 1.606 | 10,697.88 | N/A | N/A | 12.0 | 682.67 |

Throughput Calculation: Throughput (GOPs) $= 2N^3 / (\text{latency} \times 10^6)$, with latency in ms. For example, at $N=1024$: $2 \times 1024^3 / (0.531 \times 10^{-3}\ \text{s}) \approx 4.0 \times 10^{12}\ \text{OPs/s} \approx 4048\ \text{GOPs}$.

Interpretation: The throughput scales super-linearly from $N=512$ to $N=2048$ (1,602 → 10,698 GOPs, a 6.7× increase). This is the expected behavior as TensorRT's Tensor Core utilization improves with matrix size: larger $N$ increases tile occupancy, reduces wave-quantization waste, and enables TensorRT's tactic selector to choose deeper-pipelined kernels. The hardware is fully Tensor Core-driven (INT8_IMMA instructions), explaining the order-of-magnitude improvement over the naive CUDA FP32 baseline.

H2D / D2H Latency: Not reported. On the Orin Nano, the GPU and CPU share physical LPDDR5 memory. When using Pinned Memory (cudaMallocHost) — as TensorRT does internally — data is accessible to the GPU via DMA without a copy over a discrete bus. The latency column reflects purely kernel execution time plus internal cache management overhead.

Why 4096×4096 was omitted: TensorRT's builder performs an exhaustive tactic search over candidate GEMM kernels for each layer shape. At $N=4096$, this search space grows combinatorially. On the 8GB Orin Nano, the builder allocates a workspace (commonly 4–8 GB for large INT8 shapes) that, combined with the tensor buffers themselves ($4096^2 \times 2 \approx 32\ \text{MB}$ for INT8 I/O), regularly triggers the OS Out-of-Memory (OOM) killer before the engine is serialized. The 4096×4096 configuration is therefore not feasible on the 8GB Orin Nano with TensorRT INT8 without reducing the builder workspace or applying layer-level memory constraints.

How to run:

import tensorrt as trt
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, calibrator=None):
    builder = trt.Builder(TRT_LOGGER)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB
    config.set_flag(trt.BuilderFlag.INT8)
    if calibrator:
        config.int8_calibrator = calibrator
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        parser.parse(f.read())
    return builder.build_serialized_network(network, config)

# Benchmark: average over 1000 iterations after 100 warm-up
import pycuda.driver as cuda
import pycuda.autoinit
import time

def benchmark(engine_bytes, N, iterations=1000, warmup=100):
    runtime = trt.Runtime(TRT_LOGGER)
    engine = runtime.deserialize_cuda_engine(engine_bytes)
    context = engine.create_execution_context()
    # ... allocate buffers, run warmup, time iterations

6.3 Odin v0 (Voyager SDK — INT8)

| MATRIX_SIZE | Compute Latency (ms) | Throughput (GOPs/Core) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (OPs/B) |
| --- | --- | --- | --- | --- | --- | --- |
| 512×512 | 0.695 | 386.14 | 0.120 | 0.120 | 0.75 | 170.67 |
| 1024×1024 | 2.020 | 1,062.80 | 1.03 | 1.03 | 3.0 | 341.33 |

Reported values are per-core throughput. The Odin v0 has 4 AI cores; total system throughput is approximately 4× the per-core figure. For $N=1024$: total ≈ 4,252 GOPs.

Interpretation of Latency: Odin v0 latency is the sum of compute time and the PCIe transaction overhead required to stream matrix $\mathbf{A}$ into the device and return $\mathbf{C}$. Unlike the Jetson's unified memory, Odin v0 must physically move data across the PCIe bus on every inference call. For $N=512$, the time spent moving data is a large fraction of the compute time, suppressing effective throughput relative to the theoretical 214 TOPS peak.

PCIe Latency Calculation (Expected vs. Observed):

For a single INT8 matrix of size $N \times N$, the transfer size is $N^2$ bytes. With a practical PCIe 3.0 x4 bandwidth of ~3.2 GB/s:

| MATRIX_SIZE | Transfer Size (MB) | Expected H2D (ms) | Observed H2D (ms) | Overhead (ms) |
| --- | --- | --- | --- | --- |
| 512×512 | 0.25 | 0.078 | 0.120 | 0.042 |
| 1024×1024 | 1.00 | 0.313 | 1.030 | 0.717 |

The growing gap between expected and observed latency at $N=1024$ reflects PCIe protocol framing overhead and software stack latency in the Voyager SDK runtime (descriptor setup, DMA chaining). Small transfers are disproportionately penalized by fixed per-transaction overhead.
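
The Expected H2D column is simply the transfer size divided by the practical link bandwidth; the few lines below reproduce it (small differences come from rounding the transfer size to whole megabytes in the table).

PCIE_PRACTICAL_BPS = 3.2e9   # practical PCIe 3.0 x4 bandwidth (Section 3.2)

def expected_h2d_ms(N):
    """Ideal host-to-device time for one INT8 N x N matrix (N^2 bytes)."""
    return N * N / PCIE_PRACTICAL_BPS * 1e3

print(expected_h2d_ms(512), expected_h2d_ms(1024))   # ~0.08 ms and ~0.33 ms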

Why 2048×2048 was omitted: Each D-IMC AI core has 4 MB of local L1 SRAM. A single $2048 \times 2048$ INT8 weight matrix occupies exactly 4 MB. When input buffers and output staging are accounted for, the total on-chip footprint exceeds the L1 capacity, forcing the Voyager compiler to fragment the operation across multiple PCIe transactions. This produces severe PCIe thrashing (repeated stalls where the crossbar sits idle while the next data fragment is fetched), collapsing effective throughput well below the $N=1024$ result. The 32 MB L2 SRAM is shared across all cores and does not fully mitigate this for a single-core GEMM operation.

How to run:

from axelera.runtime import InferenceSession
import numpy as np

session = InferenceSession("gemm_n1024.axm")  # Compiled model

A = np.random.randint(-128, 127, (1024, 1024), dtype=np.int8)

# Warm-up
for _ in range(100):
    _ = session.run({"input_A": A})

# Benchmark
import time
times = []
for _ in range(1000):
    t0 = time.perf_counter()
    C = session.run({"input_A": A})
    t1 = time.perf_counter()
    times.append((t1 - t0) * 1e3)

print(f"Mean latency: {np.mean(times):.3f} ms")
print(f"Throughput: {2 * 1024**3 / (np.mean(times) * 1e-3) / 1e9:.2f} GOPs")

7. Conclusion

7.1 Comparative Summary

| Backend | N=512 Latency (ms) | N=512 Throughput (GOPs) | N=1024 Latency (ms) | N=1024 Throughput (GOPs) |
| --- | --- | --- | --- | --- |
| CUDA FP32 (naive) | 1.319 | 203.5 | 11.080 | 193.8 |
| TensorRT INT8 | 0.167 | 1,602.7 | 0.531 | 4,048.0 |
| Odin v0 (total) | 0.695 + 0.24 xfer | ~1,544.6 | 2.020 + 2.06 xfer | ~4,251.2 |

At equivalent matrix sizes, TensorRT and Odin v0 deliver comparable total-system throughput. The distinction lies in the latency breakdown: TensorRT's latency is almost entirely compute time, while Odin v0's latency is dominated by PCIe data movement. For small inputs, TensorRT has a clear latency advantage. As $N$ grows and the compute-to-transfer ratio improves for Odin v0, the two systems converge.

7.2 Benefits of Odin v0 for Space Applications

  • Deterministic Latency: D-IMC avoids the jitter caused by GPU warp scheduling and memory bank contention. For applications such as satellite pose estimation or real-time attitude control, consistent timing is mission-critical. Odin v0's spatial architecture delivers cycle-accurate, deterministic execution per inference.
  • Power Efficiency: MAC operations are performed in-memory, eliminating the power-hungry data movement between SRAM and ALUs that dominates GPU energy consumption. This is critical for power-budgeted satellite platforms.
  • Radiation Hardening Potential: Traditional GPU register files are highly susceptible to Single Event Upsets (SEUs) from ionizing radiation in LEO/MEO orbits. Localized SRAM architectures with small, well-bounded crossbar cells are inherently more amenable to cell-level ECC and Triple Modular Redundancy (TMR) implementation compared to the distributed register files of a 1024-core GPU.
  • TVAC Thermal Profile: The lower peak power draw of Odin v0 (< 5 W in typical inference) simplifies the thermal mass and radiator sizing requirements for vacuum-environment chassis, where convective cooling is unavailable.

7.3 Limitations and Bottlenecks

Accuracy:

  • Both TensorRT and Voyager introduce quantization error via INT8 PTQ. For GEMM-only operations, the MAE is negligible (see Section 6.1). However, in stacked networks, quantization error can accumulate across layers. Layer-wise sensitivity analysis is recommended before deploying deep networks on Odin v0.

D-IMC-Specific Constraints:
  • 4D Tensor Requirement: D-IMC accelerator is optimized for 4D NCHW tensors (vision pipelines). 1D and 2D GEMMs are internally padded to 4D, introducing "dummy" operations that reduce effective utilization. This benchmark uses a (1, 1, N, N) wrapper to satisfy this constraint.
  • Fixed ONNX Operator Set: Custom CUDA-like kernels cannot be written for Odin v0. You are restricted to ONNX opset 17 operators supported by the Voyager compiler. Operations such as sparse attention, custom activation functions, or non-standard pooling require operator decomposition into supported primitives before compilation.
  • SRAM Capacity Cliff: As demonstrated by the 2048×2048 exclusion, any layer whose weight matrix exceeds ~3.5 MB (leaving headroom for I/O buffers) will trigger fragmentation and severe throughput degradation. Model architectures must be designed with this constraint in mind — for example, by using depthwise-separable layers or limiting fully-connected layer widths.
  • PCIe Bandwidth as Hard Ceiling: For inference pipelines that chain multiple GEMM operations, each layer's output must transit the PCIe bus before the next layer's input arrives. This serializes the pipeline in a way that a Jetson's unified memory model does not, making Odin v0 less competitive for recurrent or iterative architectures with small per-layer tensor sizes.

These benchmarks are directly relevant to the selection of edge accelerators for space-grade DPUs:

  1. Sizing guidance: The SRAM capacity cliff at $N \approx 2048$ defines the maximum fully-connected layer width deployable on a single D-IMC core without performance degradation. Network architectures for satellite onboard inference should keep FC layer widths below this threshold.
  2. Throughput vs. latency trade-off: For high-cadence sensor fusion tasks (e.g., 100 Hz IMU integration), TensorRT on Jetson offers lower per-inference latency. For batch image classification (e.g., downlink prioritization), Odin v0's higher sustained throughput at larger NN is advantageous.
  3. Power envelope: At the 25W MAXN Super profile used here, the Orin Nano fits within typical small satellite power budgets. The Odin v0 adds minimal incremental power while substantially increasing INT8 throughput, making the combined system attractive for platforms with tight SWaP-C constraints.

Appendix: Hardware Specifications

| Feature | NVIDIA Jetson Orin Nano 8GB | Odin v0 |
| --- | --- | --- |
| Compute Core | Ampere GPU (1024-core, 8 SM) | 4× AI Cores (D-IMC) |
| Memory Architecture | Unified LPDDR5 (68 GB/s) | Local SRAM / PCIe Gen3 |
| On-chip Storage | N/A | 4 MB L1 per core, 32 MB L2 (shared) |
| External Interface | N/A | PCIe 3.0 x4 (~4 GB/s) |
| Configured Power | 25W (MAXN Super, nvpmodel -m 2) | High-Performance Mode |
| Target Precision | FP16 / INT8 (Tensor Cores) | INT8 (In-Memory) |
| Peak Throughput | 67 TOPS (INT8 Sparse) / 33.5 TOPS (INT8 Dense) / 17 TFLOPs (FP16) | 214 TOPS (INT8) |