
GEMM Raw Throughput Testing

This guide provides a technical deep-dive into General Matrix Multiplication (GEMM) performance on the NVIDIA Jetson Orin Nano (Ampere) and the Odin v0 (D-IMC). We analyze the transition from memory-bound to math-bound regimes and the effects of hardware-specific tiling strategies. Based on NVIDIA's Matrix Multiplication Guide.

2. Background: Matrix-Matrix Multiplication

We consider the following (simple) GEMM operation:

$$\mathbf{C} = \alpha \mathbf{A}\mathbf{B} + \beta \mathbf{C}$$

where $\mathbf{A}\in\mathbb{R}^{M\times K}$, $\mathbf{B}\in\mathbb{R}^{K\times N}$, and $\mathbf{C} \in\mathbb{R}^{M\times N}$; $\mathbf{A}$ and $\mathbf{B}$ are matrix inputs, $\alpha$ and $\beta$ are scalar inputs, and $\mathbf{C}$ is a pre-existing matrix that is overwritten by the output. For the compiled-model phase, as will be seen later, we consider the case of a fully-connected layer, where $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ act as the input, weights, and bias respectively.

2.1 Arithmetic Intensity

To determine if a benchmark is limited by the processor's speed (Math Bound) or memory bandwidth (Memory Bound), we calculate the Arithmetic Intensity:

$$\text{Arithmetic Intensity} = \frac{2\cdot M \cdot N \cdot K}{\text{bytes\_per\_element}\left( M\cdot K + K\cdot N + M\cdot N\right)}$$

where bytes_per_element is 2 for both FP16 and INT8 (counting load and store cycles per element). On the Jetson Orin Nano, the Roofline inflection point is reached significantly later than on the Odin v0 due to the Odin v0's Digital In-Memory Computing (D-IMC) architecture, which drastically reduces the denominator (bytes moved from global memory) by keeping weights stationary within the compute fabric.

For simplicity, we use $M = K = N$. We also set $\alpha = 1, \beta = 1$, which represents the addition of a skip-connection with a linear operation (Multiply-and-Accumulate, MAC). For this scenario, a MAC operation contains $N^3$ multiplications and $N^3$ additions, totalling $2N^3$ floating-point operations. For the square case, Arithmetic Intensity $= N/3$, where the denominator accounts for three $N\times N$ matrices at 2 effective bytes each (load + store), giving $2N^3 / (6N^2) = N/3$.
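
These expressions are easy to sanity-check numerically. The sketch below (plain Python) evaluates the arithmetic intensity under the 2-effective-bytes-per-element convention used in this guide, together with the roofline ridge point quoted later in Section 5.2; the 1.5 TFLOPs FP32 and 68 GB/s figures are the Orin Nano numbers used throughout this document.

def arithmetic_intensity(M, N, K, bytes_per_element=2):
    """Ops per byte moved, using this guide's load-plus-store convention."""
    ops = 2 * M * N * K
    bytes_moved = bytes_per_element * (M * K + K * N + M * N)
    return ops / bytes_moved

# Square case: AI = N / 3
for n in (512, 1024, 2048, 4096):
    print(n, round(arithmetic_intensity(n, n, n), 2))  # 170.67, 341.33, 682.67, 1365.33

# Roofline ridge point for scalar FP32 on the Orin Nano (see Section 5.2)
peak_fp32 = 1.5e12   # ~1.5 TFLOPS on the FP32 CUDA cores
bandwidth = 68e9     # 68 GB/s LPDDR5
print(peak_fp32 / bandwidth)  # ~22 FLOPs/B, crossed near N ≈ 66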

3. Hardware Execution Models

3.1 Jetson Orin's GPU

The Orin Nano 8GB shares a unified LPDDR5 memory (68 GB/s bandwidth) between the CPU and GPU. The Orin Nano's Ampere architecture utilizes HMMA (Half-Precision Matrix Multiply-Accumulate) instructions, and a GEMM executes in four stages:

  1. Addition Accumulation: The $\beta\mathbf{C}$ product is read from global memory before accumulation and written back whenever $\beta \neq 0$.
  2. Register Accumulation: The $\alpha\mathbf{AB}$ product is calculated in the register file.
  3. Tile Quantization: The GPU partitions the matrix into tiles of dimension $(N_\text{tile} \times N_\text{tile})$. If $N$ is not a multiple of $N_\text{tile}$, partial tiles are launched. They execute for the same number of clock cycles as a full tile but yield fewer useful Operations per Second (OPs), reducing effective throughput.
  4. Wave Quantization: The Orin Nano 8GB has 8 Streaming Multiprocessors (SMs). If the total number of tiles $\lceil N / N_\text{tile} \rceil^2$ is not a multiple of 8, the final wave of tiles under-utilizes the hardware (see the sketch at the end of this subsection).
tip

Optimal Configuration: To ensure maximal SM occupancy, $N$ should be a multiple of $8\times N_\text{tile}$, i.e. a multiple of 256 for FP16 Tensor Cores (tile size 32), or a multiple of $16\times N_\text{tile}$, i.e. a multiple of 512 for INT8 (tile size 32).

In the benchmarks below, $N$ is accordingly kept a multiple of 8 for FP16 Tensor Cores and a multiple of 16 for INT8.
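
The cost of tile and wave quantization can be estimated with a few lines of arithmetic. The sketch below is an illustrative model only, assuming the 32×32 tile size and 8 SMs described above and one tile per SM per wave; it is not derived from the benchmark code.

import math

TILE = 32     # tile edge (Section 3.1)
NUM_SMS = 8   # Orin Nano 8GB

def quantization_efficiency(N, tile=TILE, sms=NUM_SMS):
    """Upper-bound utilization after tile and wave quantization."""
    tiles_per_dim = math.ceil(N / tile)                 # partial tiles still launch
    total_tiles = tiles_per_dim ** 2
    tile_eff = (N * N) / (tiles_per_dim * tile) ** 2    # useful vs. launched work
    waves = math.ceil(total_tiles / sms)
    wave_eff = total_tiles / (waves * sms)              # last wave may be under-filled
    return tile_eff, wave_eff

for n in (512, 520, 1024):
    print(n, quantization_efficiency(n))
# 512 and 1024 divide evenly into tiles and waves (both efficiencies 1.0);
# 520 launches partial tiles and an under-filled final wave.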

3.2 Odin v0 (D-IMC)

Unlike the Jetson Orin GPU's temporal execution (different operations performed on different data at different points in time using fixed-location ALUs), the Odin v0 utilizes Digital-In-Memory Computing, where weights are stationary within the SRAM-backed crossbar array. Operations proceed in three stages:

  1. SRAM Staging: Matrix weights $\mathbf{B}$ are staged in local SRAM where the D-IMC crossbar resides. This is a one-time cost per model deployment.
  2. Multiplication: Input $\mathbf{A}$ is streamed through the crossbar grid where $\mathbf{B}$ is stored. Each cell of the crossbar computes a partial dot product, yielding a fully spatial MAC with no data movement between compute and memory.
  3. Accumulation: For $\beta \neq 0$, the $\beta\mathbf{C}$ addition is handled at the output of the D-IMC crossbar before writing the result back over the PCIe interface.

The primary bottleneck for the Odin v0 is the PCIe 3.0 x4 interface, which provides a theoretical peak bandwidth of 4 GB/s and a practical bandwidth of approximately 3.2–3.5 GB/s once protocol overhead is accounted for. When the weight matrix fits entirely within the on-chip SRAM, compute is effectively free relative to the data ingestion rate, making the system entirely PCIe-bound.

4. Benchmarking Methodology

4.1 Measurement Protocol

To eliminate transient noise and cold-start penalties:

  1. Warm-up: 100 iterations of the kernel to trigger frequency scaling (boost clocks) and populate instruction caches.
  2. Steady-State: 1000 iterations recorded via high-precision hardware timers (cudaEvent on Jetson; Voyager SDK's internal profiler on Odin v0).
  3. Cache Purge: Cache flushing between runs to ensure "cold" memory-access conditions for memory-bound analysis.

On Jetson Orin Nano, jetson_clocks is used to lock the GPU at maximum frequency (clocks pinned, fan at full speed). Odin v0 is configured to 100% utilization mode via the Voyager runtime flags.
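
A minimal harness implementing this protocol might look like the following sketch; run_once and flush_caches are placeholders for the backend-specific kernel launch and cache purge (on Jetson, cudaEvent timers around the kernel would replace time.perf_counter for finer resolution).

import time
import numpy as np

def measure(run_once, warmup=100, iters=1000, flush_caches=None):
    """Warm-up plus steady-state timing, as described in Section 4.1.
    run_once: callable that launches one kernel and blocks until completion.
    flush_caches: optional callable used for the cold-cache (memory-bound) analysis."""
    for _ in range(warmup):          # trigger boost clocks, warm instruction caches
        run_once()
    latencies_ms = []
    for _ in range(iters):
        if flush_caches is not None:
            flush_caches()
        t0 = time.perf_counter()
        run_once()
        t1 = time.perf_counter()
        latencies_ms.append((t1 - t0) * 1e3)
    return float(np.mean(latencies_ms)), float(np.std(latencies_ms))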

4.2 Precision Targets

| Accelerator | Format | Peak Theoretical |
| --- | --- | --- |
| Orin Nano Super | FP16 / INT8 | 17 TFLOPs / 67 TOPs (Sparse) / 33 TOPs (Dense) |
| Odin v0 | INT8 | 214 TOPS |

4.3 Environment Setup

Host Hardware:

  • NVIDIA Jetson Orin Nano 8GB Developer Kit
  • Odin v0 D-IMC accelerator connected via M.2 Key M Slot (PCIe 3.0 x4)
| Component | Jetson Orin Nano | Odin v0 |
| --- | --- | --- |
| OS | Ubuntu 22.04 LTS (L4T 36.4) | Ubuntu 22.04 LTS (host) |
| SDK / Runtime | JetPack 6.2.1 | Voyager SDK 1.5 |
| Compiler / Backend | TensorRT 10.3, CUDA 12.6 | Voyager Compiler (ONNX) |
| ONNX Export | PyTorch 2.11 + torch.onnx | PyTorch 2.11 + torch.onnx |
| Python | 3.10 | 3.10 |
| Power Profile | nvpmodel -m 2 (25W MAXN Super) | High-Performance Mode (100% Utilization) |
# Jetson: Pin clocks for deterministic benchmarking
sudo nvpmodel -m 2 # MAXN Super
sudo jetson_clocks --store   # Save the current clock configuration
sudo jetson_clocks           # Pin GPU/CPU/EMC clocks to maximum
sudo jetson_clocks --restore # Restore the saved configuration when done

# Verify GPU frequency
sudo jetson_clocks --show

4.4 Measured Metrics

Each benchmark run records the following metrics:

| Metric | Description | Unit |
| --- | --- | --- |
| Compute Latency | Wall-clock time for the GEMM kernel execution only, averaged over 1000 iterations | ms |
| Throughput | Effective operations per second: $2N^3 / \text{latency}$ | GOPs or TOPs |
| H2D Latency | Host-to-Device transfer time for input matrix $\mathbf{A}$ over PCIe (Odin v0 only; absent for Jetson unified memory) | ms |
| D2H Latency | Device-to-Host transfer time for output matrix $\mathbf{C}$ over PCIe (Odin v0 only) | ms |
| Byte Access | Total bytes read/written across the memory bus: $(M \cdot K + K \cdot N + M \cdot N) \times \text{bytes\_per\_element}$ | MB |
| Arithmetic Intensity | Ratio of compute operations to bytes accessed: $2N^3 / \text{Byte Access}$ | OPs/B |
note

Throughput values reported for Jetson TensorRT are total system GOPs using INT8 precision. Throughput values for Odin v0 are per-core GOPs; multiply by 4 for total system throughput.
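
The derived columns in the tables that follow can be reproduced from the raw latencies with a helper along these lines (a sketch; bytes_per_element depends on the precision and on whether the load-plus-store convention of Section 2.1 is applied):

def gemm_metrics(latency_ms, M, N, K, bytes_per_element):
    """Derived GEMM metrics using the formulas from Sections 2.1 and 4.4."""
    ops = 2 * M * N * K                                # one multiply + one add per MAC
    byte_access = bytes_per_element * (M * K + K * N + M * N)
    return {
        "throughput_gops": ops / (latency_ms * 1e-3) / 1e9,
        "byte_access_mb": byte_access / 2**20,
        "arithmetic_intensity": ops / byte_access,
    }

# Worked example from Section 6.2: N = 1024 at 0.531 ms gives roughly 4.0 TOPs
print(gemm_metrics(0.531, 1024, 1024, 1024, bytes_per_element=1)["throughput_gops"])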

5. Raw CUDA Kernels

Based on NVIDIA's Matrix Multiplication Performance Guide, a naive tiled CUDA kernel was implemented to establish a performance baseline across a range of matrix sizes. Note that for this phase, all matrices A\mathbf{A}, B\mathbf{B}, and C\mathbf{C} are treated as runtime inputs — no weight-stationary optimization is applied.

#define TILE 32

__global__ void full_gemm(float *A, float *B, float *C, int M, int N, int K, float alpha, float beta) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float acc = 0.0f;

    // Loop over tiles of the K dimension to compute A*B
    for (int m = 0; m < (K + TILE - 1) / TILE; ++m) {
        // Load A tile (handle boundary)
        if (row < M && (m * TILE + tx) < K)
            sA[ty][tx] = A[row * K + (m * TILE + tx)];
        else
            sA[ty][tx] = 0.0f;

        // Load B tile (handle boundary)
        if (col < N && (m * TILE + ty) < K)
            sB[ty][tx] = B[(m * TILE + ty) * N + col];
        else
            sB[ty][tx] = 0.0f;

        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k) {
            acc += sA[ty][k] * sB[k][tx];
        }
        __syncthreads();
    }

    // Apply GEMM logic: C = alpha * (A*B) + beta * C
    if (row < M && col < N) {
        int idx = row * N + col;
        C[idx] = (alpha * acc) + (beta * C[idx]);
    }
}

TILE is set to 32 in accordance with the warp size and the shared memory bank layout described in Section 3.1. Each 32×32 shared memory tile maps cleanly to one warp per row, minimizing bank conflicts on Ampere. Running this kernel produces the following results:

| MATRIX_SIZE | Compute Latency (ms) | Throughput (GFLOPs) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (FLOPs/B) |
| --- | --- | --- | --- | --- | --- | --- |
| 512x512 | 1.319 | 203.450 | 0.445 | 0.185 | 1.5 | 170.67 |
| 1024x1024 | 11.080 | 193.820 | 1.527 | 0.635 | 6.0 | 341.33 |
| 2048x2048 | 86.110 | 199.510 | 6.295 | 2.268 | 24.0 | 682.67 |
| 4096x4096 | 702.326 | 195.690 | 24.975 | 6.635 | 96.0 | 1365.33 |
note

All runs were performed with the GPU locked at maximum frequency via jetson_clocks.

5.1 Why CUDA Cannot Utilize Jetson's Unified Memory Advantage Here

Even though the Orin Nano's CPU and GPU share a physical LPDDR5 pool, this naive kernel still incurs explicit cudaMemcpy calls (H2D/D2H) because the data originates on the CPU heap. The unified memory system only eliminates copy overhead when using cudaMallocManaged or Pinned Memory (cudaMallocHost) — neither of which this kernel exploits. The H2D/D2H latency values in the table above represent the actual copy cost over the internal memory fabric.

More critically, the throughput ceiling of ~200 GFLOPs is far below the Orin Nano Super 8GB's rated 33 TOPS (INT8 Dense) or 17 TFLOPs (FP16) because this kernel:

  • Uses FP32 CUDA Cores, not Tensor Cores. Tensor Cores are only invoked via wmma API calls or mma.sync PTX instructions. Without them, the hardware operates at its scalar FP32 ceiling (~1.5 TFLOPS on the Orin Nano 8GB) rather than the Tensor Core peak.
  • Does not saturate memory bandwidth. While shared memory tiling reduces global loads, naïve bank-conflict patterns and uncoalesced edge accesses prevent the kernel from reaching the 68 GB/s LPDDR5 bandwidth ceiling.
  • Has no software pipelining. cuBLAS and TensorRT use double-buffered shared memory and asynchronous cp.async loads to overlap compute with data movement. This kernel synchronizes at every tile boundary via __syncthreads(), stalling the pipeline. The ~200 GFLOPs ceiling is therefore the FP32 scalar-core throughput at this occupancy level, not a reflection of the hardware's Tensor Core capability. See NVIDIA's cuBLAS documentation for a comparison of cublasSgemm vs cublasGemmEx (Tensor Core path).

5.2 Math-Bound vs. Memory-Bound Analysis

At $N = 4096$, the arithmetic intensity is 1365.33 FLOPs/B. The Roofline inflection point for FP32 on the Orin Nano 8GB, calculated as peak compute / peak bandwidth $= 1500\ \text{GFLOPs} / 68\ \text{GB/s} \approx 22\ \text{FLOPs/B}$, is crossed very early ($N > 66$). This means all four matrix sizes in the table above are strictly math-bound under this kernel. The ALUs are the bottleneck, not the memory bus.

However, because the kernel does not use Tensor Cores, the "Math Ceiling" is capped at the scalar FP32 core throughput (~1.5 TFLOPS), not the Tensor Core peak (17 TFLOPs FP16). This explains the flat ~200 GFLOPs throughput across all matrix sizes: the kernel is math-bound, pinned against a low ceiling, with no benefit from further increases in $N$.

Implication for AI model sizing on Jetson Orin Nano: For real inference workloads, the takeaway is that small-to-medium matrix sizes ($N < 512$) will be memory-bound even with TensorRT, and optimal throughput requires batching inputs to raise $N$ (and thus arithmetic intensity) into the math-bound regime. This is the primary argument for batch size > 1 in production deployments on Jetson.
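
To see why batching raises intensity, apply the formula from Section 2.1 to a batched fully-connected layer: with batch size B, the input is B×K and the weights are K×N, so the weight matrix is reused B times. A small illustrative sketch (INT8 accounting at 1 byte per element; the layer width is arbitrary):

def fc_arithmetic_intensity(batch, K, N, bytes_per_element=1):
    """AI of a fully-connected layer: A is (batch x K), weights are (K x N)."""
    ops = 2 * batch * K * N
    bytes_moved = bytes_per_element * (batch * K + K * N + batch * N)
    return ops / bytes_moved

for b in (1, 8, 64, 512):
    print(b, round(fc_arithmetic_intensity(b, 1024, 1024), 1))
# AI grows roughly linearly with batch size (about 2, 16, 114, 512 ops/B here),
# saturating near 2*K*N / (K + N) once the weights no longer dominate the traffic.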

5.3 SM Occupancy and Throughput Plateaus

Orin Nano SM Occupancy: SM occupancy can be measured using NVIDIA Nsight Compute. Run:

ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active \
--target-processes all ./gemm_benchmark

For this naive kernel, 100% occupancy is reached at approximately $N = 512$, where the number of tiles ($16 \times 16 = 256$) matches the $8\ \text{SMs} \times 32\ \text{warps/SM} = 256$ concurrent warps. Above this point, occupancy does not increase; throughput improvements come only from reducing per-instruction latency (i.e., using Tensor Cores).

Odin v0 PCIe Plateau: For the Odin v0, the compute throughput per core begins to plateau when the PCIe transfer of input matrix $\mathbf{A}$ dominates total latency. Based on theoretical PCIe 3.0 x4 bandwidth (4 GB/s peak, ~3.2 GB/s practical), the crossover point where compute time ≈ transfer time is approximately $N \approx 768$ for INT8, corresponding to a 576 KB input. Beyond this size, gains in raw compute throughput are masked by PCIe ingestion delays.

6. Experimental Results

For this section, GEMM is treated as a Deep Learning operation: only matrix $\mathbf{A}$ is an input at inference time; $\mathbf{B}$ (weights), $\mathbf{C}$ (bias), $\alpha$, and $\beta$ are baked in as model parameters. The resultant model is exported to ONNX and used to build a TensorRT engine (Jetson) and a compiled model file (AXM, Odin v0).
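
A minimal sketch of that export is shown below, assuming a PyTorch module in which B, C, α, and β are frozen parameters; the module, file names, and shapes are illustrative rather than taken from the benchmark repository. The same ONNX file feeds both the TensorRT builder in Section 6.2 and the Voyager compiler.

import torch
import torch.nn as nn

class GemmLayer(nn.Module):
    """C = alpha * (A @ B) + beta * C_bias, with B, C_bias, alpha, beta baked in."""
    def __init__(self, N, alpha=1.0, beta=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(N, N), requires_grad=False)
        self.bias = nn.Parameter(torch.randn(N, N), requires_grad=False)
        self.alpha, self.beta = alpha, beta

    def forward(self, A):
        return self.alpha * (A @ self.weight) + self.beta * self.bias

N = 1024
model = GemmLayer(N).eval()
dummy_A = torch.randn(N, N)   # 2D here for brevity; the Odin v0 path wraps inputs
                              # as (1, 1, N, N), see Section 7.3
torch.onnx.export(
    model, dummy_A, "gemm_n1024.onnx",
    input_names=["input_A"], output_names=["output_C"], opset_version=17,
)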

6.1 INT8 Quantization and Calibration

Both TensorRT and Voyager use Post-Training Quantization (PTQ) to convert the FP32 ONNX model to INT8. The calibration procedure:

  1. Calibration Dataset: 512 synthetic input matrices drawn from $\mathcal{U}(-1, 1)$, representative of the expected input distribution.
  2. TensorRT Calibrator: IInt8EntropyCalibrator2 — minimizes KL-divergence between the FP32 and INT8 activation distributions to determine per-tensor scale factors.
  3. Voyager Quantizer: Uses the Voyager SDK's built-in PTQ pipeline, which clips activations based on a percentile of the calibration distribution (default: 99.99th percentile).

Accuracy vs. FP32 Baseline (Mean Absolute Error on output $\mathbf{C}$):

| Backend | Precision | MAE vs. FP32 Baseline |
| --- | --- | --- |
| CUDA (this work) | FP32 | 0.000 (reference) |
| TensorRT | INT8 | < 0.002 |
| Voyager | INT8 | < 0.004 |

The slightly higher MAE for the D-IMC accelerator reflects the additional quantization of intermediate accumulations within the D-IMC crossbar. For GEMM-only operations, both INT8 backends remain within acceptable bounds for downstream neural network layers.
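
For the TensorRT path, the calibrator handed to build_engine in Section 6.2 follows the standard IInt8EntropyCalibrator2 pattern. The sketch below feeds the 512 synthetic U(-1, 1) matrices described above; the class name, cache file, and batch handling are illustrative, not the exact implementation used for these results.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

class GemmEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Streams synthetic U(-1, 1) calibration matrices to TensorRT."""
    def __init__(self, N, num_batches=512, cache_file="gemm_calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.N = N
        self.remaining = num_batches
        self.device_input = cuda.mem_alloc(N * N * 4)   # one FP32 matrix
        self.cache_file = cache_file

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.remaining == 0:
            return None                                  # calibration finished
        self.remaining -= 1
        batch = np.random.uniform(-1, 1, (self.N, self.N)).astype(np.float32)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None                                      # always recalibrate

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)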

6.2 Jetson Orin Nano (TensorRT — INT8)

| MATRIX_SIZE | Compute Latency (ms) | Throughput (GOPs) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (OPs/B) |
| --- | --- | --- | --- | --- | --- | --- |
| 512×512 | 0.167 | 1,602.72 | N/A | N/A | 0.75 | 170.67 |
| 1024×1024 | 0.531 | 4,048.02 | N/A | N/A | 3.0 | 341.33 |
| 2048×2048 | 1.606 | 10,697.88 | N/A | N/A | 12.0 | 682.67 |

Throughput Calculation: Throughput (GOPs) $= 2N^3 / (\text{latency} \times 10^6)$, with latency in ms. For example, at $N=1024$: $2 \times 1024^3 / (0.531 \times 10^{-3}\ \text{s}) \approx 4.0 \times 10^{12}\ \text{OPs/s} \approx 4048\ \text{GOPs}$.

Interpretation: The throughput scales super-linearly from $N=512$ to $N=2048$ (1,602 → 10,698 GOPs, a 6.7× increase). This is the expected behavior as TensorRT's Tensor Core utilization improves with matrix size: larger $N$ increases tile occupancy, reduces wave-quantization waste, and enables TensorRT's tactic selector to choose deeper-pipelined kernels. The hardware is fully Tensor Core-driven (INT8_IMMA instructions), explaining the order-of-magnitude improvement over the naive CUDA FP32 baseline.

H2D / D2H Latency: Not reported. On the Orin Nano, the GPU and CPU share physical LPDDR5 memory. When using Pinned Memory (cudaMallocHost) — as TensorRT does internally — data is accessible to the GPU via DMA without a copy over a discrete bus. The latency column reflects purely kernel execution time plus internal cache management overhead.

Why 4096×4096 was omitted: TensorRT's builder performs an exhaustive tactic search over candidate GEMM kernels for each layer shape. At $N=4096$, this search space grows combinatorially. On the 8GB Orin Nano, the builder allocates a workspace (commonly 4–8 GB for large INT8 shapes) that, combined with the tensor buffers themselves ($4096^2 \times 2 \approx 32\ \text{MB}$ for INT8 I/O), regularly triggers the OS Out-of-Memory (OOM) killer before the engine is serialized. The 4096×4096 configuration is therefore not feasible on the 8GB Orin Nano with TensorRT INT8 without reducing the builder workspace or applying layer-level memory constraints.

How to run:

import tensorrt as trt
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, calibrator=None):
    builder = trt.Builder(TRT_LOGGER)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB
    config.set_flag(trt.BuilderFlag.INT8)
    if calibrator:
        config.int8_calibrator = calibrator
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        parser.parse(f.read())
    return builder.build_serialized_network(network, config)

# Benchmark: average over 1000 iterations after 100 warm-up
import pycuda.driver as cuda
import pycuda.autoinit
import time

def benchmark(engine_bytes, N, iterations=1000, warmup=100):
    runtime = trt.Runtime(TRT_LOGGER)
    engine = runtime.deserialize_cuda_engine(engine_bytes)
    context = engine.create_execution_context()
    # ... allocate buffers, run warmup, time iterations

6.3 Odin v0 (Voyager SDK — INT8)

| MATRIX_SIZE | Compute Latency (ms) | Throughput (GOPs/Core) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (OPs/B) |
| --- | --- | --- | --- | --- | --- | --- |
| 512×512 | 0.695 | 386.14 | 0.120 | 0.120 | 0.75 | 170.67 |
| 1024×1024 | 2.020 | 1,062.80 | 1.03 | 1.03 | 3.0 | 341.33 |

Reported values are per-core throughput. The Odin v0 has 4 AI cores; total system throughput is approximately 4× the per-core figure. For $N=1024$: total ≈ 4,252 GOPs.

Interpretation of Latency: Odin v0 latency is the sum of compute time and the PCIe transaction overhead required to stream matrix $\mathbf{A}$ into the device and return $\mathbf{C}$. Unlike the Jetson's unified memory, Odin v0 must physically move data across the PCIe bus on every inference call. For $N=512$, the time spent moving data is a large fraction of the compute time, suppressing effective throughput relative to the theoretical 214 TOPS peak.

PCIe Latency Calculation (Expected vs. Observed):

For a single INT8 matrix of size $N \times N$, the transfer size is $N^2$ bytes. With a practical PCIe 3.0 x4 bandwidth of ~3.2 GB/s:

| MATRIX_SIZE | Transfer Size (MB) | Expected H2D (ms) | Observed H2D (ms) | Overhead (ms) |
| --- | --- | --- | --- | --- |
| 512×512 | 0.25 | 0.078 | 0.120 | 0.042 |
| 1024×1024 | 1.00 | 0.313 | 1.030 | 0.717 |

The growing gap between expected and observed latency at $N=1024$ reflects PCIe protocol framing overhead and software stack latency in the Voyager SDK runtime (descriptor setup, DMA chaining). Small transfers are disproportionately penalized by fixed per-transaction overhead.
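
The Expected H2D column is simply the transfer size divided by the practical link bandwidth; the few lines below reproduce it (small differences come from rounding the transfer size to whole megabytes in the table).

PCIE_PRACTICAL_BPS = 3.2e9   # practical PCIe 3.0 x4 bandwidth (Section 3.2)

def expected_h2d_ms(N):
    """Ideal host-to-device time for one INT8 N x N matrix (N^2 bytes)."""
    return N * N / PCIE_PRACTICAL_BPS * 1e3

print(expected_h2d_ms(512), expected_h2d_ms(1024))   # ~0.08 ms and ~0.33 ms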

Why 2048×2048 was omitted: Each D-IMC AI core has 4 MB of local L1 SRAM. A single $2048 \times 2048$ INT8 weight matrix occupies exactly 4 MB. When input buffers and output staging are accounted for, the total on-chip footprint exceeds the L1 capacity, forcing the Voyager compiler to fragment the operation across multiple PCIe transactions. This produces severe PCIe thrashing (repeated stalls where the crossbar sits idle while the next data fragment is fetched), collapsing effective throughput well below the $N=1024$ result. The 32 MB L2 SRAM is shared across all cores and does not fully mitigate this for a single-core GEMM operation.

How to run:

from axelera.runtime import InferenceSession
import numpy as np

session = InferenceSession("gemm_n1024.axm")  # Compiled model

A = np.random.randint(-128, 127, (1024, 1024), dtype=np.int8)

# Warm-up
for _ in range(100):
    _ = session.run({"input_A": A})

# Benchmark
import time
times = []
for _ in range(1000):
    t0 = time.perf_counter()
    C = session.run({"input_A": A})
    t1 = time.perf_counter()
    times.append((t1 - t0) * 1e3)

print(f"Mean latency: {np.mean(times):.3f} ms")
print(f"Throughput: {2 * 1024**3 / (np.mean(times) * 1e-3) / 1e9:.2f} GOPs")

7. Conclusion

7.1 Comparative Summary

| Backend | N=512 Latency (ms) | N=512 Throughput (GOPs) | N=1024 Latency (ms) | N=1024 Throughput (GOPs) |
| --- | --- | --- | --- | --- |
| CUDA FP32 (naive) | 1.319 | 203.5 | 11.080 | 193.8 |
| TensorRT INT8 | 0.167 | 1,602.7 | 0.531 | 4,048.0 |
| Odin v0 (total) | 0.695 + 0.24 xfer | ~1,544.6 | 2.020 + 2.06 xfer | ~4,251.2 |

At equivalent matrix sizes, TensorRT and Odin v0 deliver comparable total-system throughput. The distinction lies in the latency breakdown: TensorRT's latency is almost entirely compute time, while Odin v0's latency is dominated by PCIe data movement. For small inputs, TensorRT has a clear latency advantage. As $N$ grows and the compute-to-transfer ratio improves for Odin v0, the two systems converge.

7.2 Benefits of Odin v0 for Space Applications

  • Deterministic Latency: D-IMC avoids the jitter caused by GPU warp scheduling and memory bank contention. For applications such as satellite pose estimation or real-time attitude control, consistent timing is mission-critical. Odin v0's spatial architecture delivers cycle-accurate, deterministic execution per inference.
  • Power Efficiency: MAC operations are performed in-memory, eliminating the power-hungry data movement between SRAM and ALUs that dominates GPU energy consumption. This is critical for power-budgeted satellite platforms.
  • Radiation Hardening Potential: Traditional GPU register files are highly susceptible to Single Event Upsets (SEUs) from ionizing radiation in LEO/MEO orbits. Localized SRAM architectures with small, well-bounded crossbar cells are inherently more amenable to cell-level ECC and Triple Modular Redundancy (TMR) implementation compared to the distributed register files of a 1024-core GPU.
  • TVAC Thermal Profile: The lower peak power draw of Odin v0 (< 5 W in typical inference) simplifies the thermal mass and radiator sizing requirements for vacuum-environment chassis, where convective cooling is unavailable.

7.3 Limitations and Bottlenecks

Accuracy:

  • Both TensorRT and Voyager introduce quantization error via INT8 PTQ. For GEMM-only operations, the MAE is negligible (see Section 6.1). However, in stacked networks, quantization error can accumulate across layers. Layer-wise sensitivity analysis is recommended before deploying deep networks on Odin v0.

D-IMC-Specific Constraints:
  • 4D Tensor Requirement: D-IMC accelerator is optimized for 4D NCHW tensors (vision pipelines). 1D and 2D GEMMs are internally padded to 4D, introducing "dummy" operations that reduce effective utilization. This benchmark uses a (1, 1, N, N) wrapper to satisfy this constraint.
  • Fixed ONNX Operator Set: Custom CUDA-like kernels cannot be written for Odin v0. You are restricted to ONNX opset 17 operators supported by the Voyager compiler. Operations such as sparse attention, custom activation functions, or non-standard pooling require operator decomposition into supported primitives before compilation.
  • SRAM Capacity Cliff: As demonstrated by the 2048×2048 exclusion, any layer whose weight matrix exceeds ~3.5 MB (leaving headroom for I/O buffers) will trigger fragmentation and severe throughput degradation. Model architectures must be designed with this constraint in mind — for example, by using depthwise-separable layers or limiting fully-connected layer widths.
  • PCIe Bandwidth as Hard Ceiling: For inference pipelines that chain multiple GEMM operations, each layer's output must transit the PCIe bus before the next layer's input arrives. This serializes the pipeline in a way that a Jetson's unified memory model does not, making Odin v0 less competitive for recurrent or iterative architectures with small per-layer tensor sizes.

These benchmarks are directly relevant to the selection of edge accelerators for space-grade DPUs:

  1. Sizing guidance: The SRAM capacity cliff at $N \approx 2048$ defines the maximum fully-connected layer width deployable on a single D-IMC core without performance degradation. Network architectures for satellite onboard inference should keep FC layer widths below this threshold.
  2. Throughput vs. latency trade-off: For high-cadence sensor fusion tasks (e.g., 100 Hz IMU integration), TensorRT on Jetson offers lower per-inference latency. For batch image classification (e.g., downlink prioritization), Odin v0's higher sustained throughput at larger NN is advantageous.
  3. Power envelope: At the 25W MAXN Super profile used here, the Orin Nano fits within typical small satellite power budgets. The Odin v0 adds minimal incremental power while substantially increasing INT8 throughput, making the combined system attractive for platforms with tight SWaP-C constraints.

Appendix: Hardware Specifications

| Feature | NVIDIA Jetson Orin Nano 8GB | Odin v0 |
| --- | --- | --- |
| Compute Core | Ampere GPU (1024-core, 8 SM) | 4× AI Cores (D-IMC) |
| Memory Architecture | Unified LPDDR5 (68 GB/s) | Local SRAM / PCIe Gen3 |
| On-chip Storage | N/A | 4 MB L1 per core, 32 MB L2 (shared) |
| External Interface | N/A | PCIe 3.0 x4 (~4 GB/s) |
| Configured Power | 25W (MAXN Super, nvpmodel -m 2) | High-Performance Mode |
| Target Precision | FP16 / INT8 (Tensor Cores) | INT8 (In-Memory) |
| Peak Throughput | 67 TOPS (INT8 Sparse) / 33.5 TOPS (INT8 Dense) / 17 TFLOPs (FP16) | 214 TOPS (INT8) |