GEMM Raw Throughput Testing
This guide provides a technical deep-dive into General Matrix-Matrix Multiplication (GEMM) performance on the NVIDIA Jetson Orin Nano (Ampere) and the Odin v0 (D-IMC). We analyze the transition from memory-bound to math-bound regimes and the effects of hardware-specific tiling strategies. Based on NVIDIA's Matrix Multiplication Guide.
2. Background: Matrix-Matrix Multiplication
We consider the following (simple) GEMM operation:

$$C \leftarrow \alpha A B + \beta C$$

where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$, and $C \in \mathbb{R}^{M \times N}$, and where $A$ and $B$ are matrix inputs, $\alpha$ and $\beta$ are scalar inputs, and $C$ is a pre-existing matrix to be overwritten by the output. For the compiled-model phase, as will be seen later, we consider the case of a fully-connected layer, where $A$, $B$, and $C$ act as input, weights, and bias respectively.
2.1 Arithmetic Intensity
To determine if a benchmark is limited by the processor's speed (Math Bound) or memory bandwidth (Memory Bound), we calculate the Arithmetic Intensity:

$$\text{AI} = \frac{\text{FLOPs}}{\text{bytes accessed}} = \frac{2MNK}{\text{bytes\_per\_element} \times (MN + NK + KM)}$$

where bytes_per_element is 2 for both FP16 and INT8 (counting load and store cycles per element). On the Jetson Orin Nano, the Roofline inflection point is reached significantly later than on the Odin v0 due to the Odin v0's Digital In-Memory Computing (D-IMC) architecture, which drastically reduces the denominator (bytes moved from global memory) by keeping weights stationary within the compute fabric.

For simplicity, we use $M = N = K$. We also set $\alpha = 1$, $\beta = 1$, which represents the addition of a skip-connection with a linear operation (Multiply-and-Accumulate, MAC). For this scenario, a MAC operation contains $MNK$ multiplications and $MNK$ additions, totalling $2MNK$ floating-point operations. For the square case, Arithmetic Intensity $= \frac{2N^3}{3 \times 2 \times N^2}$, where the denominator accounts for three matrices at 2 effective bytes each (load + store), giving $\text{AI} = N/3$.
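The intensity model above can be checked numerically. A minimal sketch, using the document's 2-effective-bytes-per-element convention:

```python
# Numerical check of the square-GEMM arithmetic-intensity model:
# 2*N^3 ops over three N x N matrices at 2 effective bytes per element.
def arithmetic_intensity(n):
    flops = 2 * n**3            # N^3 MACs = 2*N^3 floating-point ops
    bytes_moved = 3 * n**2 * 2  # three matrices, 2 effective bytes each
    return flops / bytes_moved  # simplifies to N / 3

for n in (512, 1024, 2048, 4096):
    print(n, round(arithmetic_intensity(n), 2))
```

The printed values (170.67, 341.33, 682.67, 1365.33) match the Arithmetic Intensity columns in the benchmark tables later in this document.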
3. Hardware Execution Models
3.1 Jetson Orin's GPU
The Orin Nano 8GB shares a unified LPDDR5 memory (68 GB/s bandwidth) between the CPU and GPU. The Orin Nano's Ampere architecture utilizes HMMA (Half-Precision Matrix Multiply-Accumulate) instructions, and four effects govern GEMM execution:
- Addition Accumulation: $C$ is read from global memory before accumulation, and the result is written back, whenever $\beta \neq 0$.
- Register Accumulation: The product $AB$ is accumulated in the register file.
- Tile Quantization: The GPU partitions the output matrix into tiles of fixed dimension. If $N$ is not a multiple of the tile dimension, partial tiles are launched. They execute for the same number of clock cycles as a full tile but yield fewer useful Operations per Second (OPs), reducing effective throughput.
- Wave Quantization: The Orin Nano 8GB has 8 Streaming Multiprocessors (SMs). If the total number of tiles is not a multiple of 8, the final wave of tiles under-utilizes the hardware.
Optimal Configuration: To ensure maximal SM occupancy and full tiles, $N$ should be a multiple of the tile dimension; at minimum, a multiple of 8 for FP16 Tensor Cores or a multiple of 16 for INT8, matching the Tensor Core MMA shapes.
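Tile and wave quantization can be made concrete with a short sketch. The 128×128 output tile below is an assumption (a common Ampere GEMM tile shape; the actual tile depends on the kernel cuBLAS/TensorRT selects), while the 8-SM count comes from the Orin Nano description above:

```python
import math

# Sketch of tile- and wave-quantization efficiency on an 8-SM GPU.
# TILE_M x TILE_N = 128x128 is an assumed, typical Ampere GEMM tile.
TILE_M, TILE_N, NUM_SMS = 128, 128, 8

def quantization_efficiency(m, n):
    tiles_m = math.ceil(m / TILE_M)
    tiles_n = math.ceil(n / TILE_N)
    # Tile quantization: fraction of launched tile area doing useful work
    tile_eff = (m * n) / (tiles_m * TILE_M * tiles_n * TILE_N)
    # Wave quantization: utilization of the final wave of tiles across SMs
    tiles = tiles_m * tiles_n
    waves = math.ceil(tiles / NUM_SMS)
    wave_eff = tiles / (waves * NUM_SMS)
    return tile_eff, wave_eff

print(quantization_efficiency(1024, 1024))  # aligned: (1.0, 1.0)
print(quantization_efficiency(1152, 1152))  # 81 tiles: last wave only 1/8 full
```

The 1152 case illustrates wave quantization in isolation: the matrix is tile-aligned (9×9 = 81 full tiles), but 81 is not a multiple of 8, so the eleventh wave runs a single tile on one SM while seven SMs idle.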
3.2 Odin v0 (D-IMC)
Unlike the Jetson Orin GPU's temporal execution (different operations performed on different data at different points in time using fixed-location ALUs), the Odin v0 utilizes Digital-In-Memory Computing, where weights are stationary within the SRAM-backed crossbar array. Operations proceed in three stages:
- SRAM Staging: The weight matrix $B$ is staged in local SRAM where the D-IMC crossbar resides. This is a one-time cost per model deployment.
- Multiplication: Input $A$ is streamed through the crossbar grid where $B$ is stored. Each cell of the crossbar computes a partial dot product, yielding a fully spatial MAC with no data movement between compute and memory.
- Accumulation: For $\beta \neq 0$, the addition of $\beta C$ is handled at the output of the D-IMC crossbar before writing the result back over the PCIe interface.
The primary bottleneck for the Odin v0 is the PCIe 3.0 x4 interface, which provides a theoretical peak bandwidth of 4 GB/s and a practical bandwidth of approximately 3.2–3.5 GB/s once protocol overhead is accounted for. When the weight matrix fits entirely within the on-chip SRAM, compute is effectively free relative to the data ingestion rate, making the system entirely PCIe-bound.
4. Benchmarking Methodology
4.1 Measurement Protocol
To eliminate transient noise and cold-start penalties:
- Warm-up: 100 iterations of the kernel to trigger frequency scaling (boost clocks) and populate instruction caches.
- Steady-State: 1000 iterations recorded via high-precision hardware timers (cudaEvent on Jetson; Voyager SDK's internal profiler on Odin v0).
- Cache Purge: Cache flushing between runs to ensure "cold" memory-access conditions for memory-bound analysis.
On Jetson Orin Nano, jetson_clocks is used to lock the GPU at maximum frequency (clocks pinned, fan at full speed). Odin v0 is configured to 100% utilization mode via the Voyager runtime flags.
4.2 Precision Targets
| Accelerator | Format | Peak Theoretical |
|---|---|---|
| Orin Nano Super | FP16 / INT8 | 17 TFLOPS (FP16) / 67 TOPS (INT8 Sparse) / 33 TOPS (INT8 Dense) |
| Odin v0 | INT8 | 214 TOPS |
4.3 Environment Setup
Host Hardware:
- NVIDIA Jetson Orin Nano 8GB Developer Kit
- Odin v0 D-IMC accelerator connected via M.2 Key M Slot (PCIe 3.0 x4)
| Component | Jetson Orin Nano | Odin v0 |
|---|---|---|
| OS | Ubuntu 22.04 LTS (L4T 36.4) | Ubuntu 22.04 LTS (host) |
| SDK / Runtime | JetPack 6.2.1 | Voyager SDK 1.5 |
| Compiler / Backend | TensorRT 10.3, CUDA 12.6 | Voyager Compiler (ONNX) |
| ONNX Export | PyTorch 2.11 + torch.onnx | PyTorch 2.11 + torch.onnx |
| Python | 3.10 | 3.10 |
| Power Profile | nvpmodel -m 2 (25W MAXN Super) | High-Performance Mode (100% Utilization) |
```shell
# Jetson: Pin clocks for deterministic benchmarking
sudo nvpmodel -m 2           # MAXN Super power profile
sudo jetson_clocks --store   # Save current clock settings
sudo jetson_clocks           # Pin clocks to maximum frequencies
# After benchmarking, revert with: sudo jetson_clocks --restore

# Verify GPU frequency
sudo jetson_clocks --show
```
4.4 Measured Metrics
Each benchmark run records the following metrics:
| Metric | Description | Unit |
|---|---|---|
| Compute Latency | Wall-clock time for the GEMM kernel execution only, averaged over 1000 iterations | ms |
| Throughput | Effective operations per second: $2N^3 / \text{latency}$ | GOPs or TOPs |
| H2D Latency | Host-to-Device transfer time for input matrix over PCIe (Odin v0 only; absent for Jetson unified memory) | ms |
| D2H Latency | Device-to-Host transfer time for output matrix over PCIe (Odin v0 only) | ms |
| Byte Access | Total bytes read/written across the memory bus: $3N^2 \times \text{bytes\_per\_element}$ | MB |
| Arithmetic Intensity | Ratio of compute operations to bytes accessed: $2N^3 / \text{bytes accessed}$ | OPs/B |
Throughput values reported for Jetson TensorRT are total system GOPs using INT8 precision. Throughput values for Odin v0 are per-core GOPs; multiply by 4 for total system throughput.
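Given a measured latency, the derived columns follow mechanically from the formulas above. A sketch for the square case, using the 512×512 FP32 measurement from Section 5 (1.319 ms) as the example:

```python
# Reproduce the derived metric columns from a measured kernel latency,
# for a square N x N GEMM with the document's byte conventions.
def derived_metrics(n, latency_ms, bytes_per_element=2):
    ops = 2 * n**3                                     # 2*N^3 total ops
    throughput_gops = ops / (latency_ms * 1e-3) / 1e9  # ops/s scaled to GOPs
    byte_access_mb = 3 * n**2 * bytes_per_element / 2**20
    arithmetic_intensity = ops / (3 * n**2 * bytes_per_element)
    return throughput_gops, byte_access_mb, arithmetic_intensity

# Example: the 512x512 FP32 row of Section 5 (1.319 ms measured)
print(derived_metrics(512, 1.319))
```

This reproduces the 512×512 row of the Section 5 table: ~203.5 GFLOPs, 1.5 MB of byte access, and an arithmetic intensity of 170.67.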
5. Raw CUDA Kernels
Based on NVIDIA's Matrix Multiplication Performance Guide, a naive tiled CUDA kernel was implemented to establish a performance baseline across a range of matrix sizes. Note that for this phase, all matrices $A$, $B$, and $C$ are treated as runtime inputs; no weight-stationary optimization is applied.
```cuda
#define TILE 32

__global__ void full_gemm(float *A, float *B, float *C,
                          int M, int N, int K,
                          float alpha, float beta) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float acc = 0.0f;

    // Loop over tiles to compute A*B
    for (int m = 0; m < (K + TILE - 1) / TILE; ++m) {
        // Load A tile (handle boundary)
        if (row < M && (m * TILE + tx) < K)
            sA[ty][tx] = A[row * K + (m * TILE + tx)];
        else
            sA[ty][tx] = 0.0f;

        // Load B tile (handle boundary)
        if (col < N && (m * TILE + ty) < K)
            sB[ty][tx] = B[(m * TILE + ty) * N + col];
        else
            sB[ty][tx] = 0.0f;

        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k) {
            acc += sA[ty][k] * sB[k][tx];
        }

        __syncthreads();
    }

    // Apply GEMM logic: C = alpha * (A*B) + beta * C
    if (row < M && col < N) {
        int idx = row * N + col;
        C[idx] = (alpha * acc) + (beta * C[idx]);
    }
}
```
TILE is set to 32 in accordance with the warp size and the shared memory bank layout described in Section 3.1. Each 32×32 shared-memory tile maps cleanly to one warp per row, minimizing bank conflicts on Ampere. Running this kernel produces the following results:
| MATRIX_SIZE | Compute_Latency (ms) | Throughput (GFLOPs) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (FLOPs/B) |
|---|---|---|---|---|---|---|
| 512x512 | 1.319 | 203.450 | 0.445 | 0.185 | 1.5 | 170.67 |
| 1024x1024 | 11.080 | 193.820 | 1.527 | 0.635 | 6.0 | 341.33 |
| 2048x2048 | 86.110 | 199.510 | 6.295 | 2.268 | 24.0 | 682.67 |
| 4096x4096 | 702.326 | 195.690 | 24.975 | 6.635 | 96.0 | 1365.33 |
All runs were performed with the GPU locked at maximum frequency via jetson_clocks.
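As a correctness check, the kernel's output can be validated against a host-side NumPy reference. The shapes, seed, and tolerance below are illustrative choices, and `C_device` is a hypothetical name for the buffer copied back from the GPU:

```python
import numpy as np

# Host-side reference for validating the CUDA kernel's output:
# C_ref = alpha * (A @ B) + beta * C, computed in FP32 like the kernel.
def gemm_reference(A, B, C, alpha=1.0, beta=1.0):
    return alpha * (A @ B) + beta * C

rng = np.random.default_rng(0)
M, N, K = 64, 48, 32
A = rng.standard_normal((M, K), dtype=np.float32)
B = rng.standard_normal((K, N), dtype=np.float32)
C = rng.standard_normal((M, N), dtype=np.float32)

C_ref = gemm_reference(A, B, C)
# Compare the (hypothetical) device result against the reference, e.g.:
# np.testing.assert_allclose(C_device, C_ref, rtol=1e-4)
print(C_ref.shape)  # (64, 48)
```

A loose relative tolerance is appropriate because the kernel accumulates in a different order than NumPy's BLAS backend.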
5.1 Why CUDA Cannot Utilize Jetson's Unified Memory Advantage Here
Even though the Orin Nano's CPU and GPU share a physical LPDDR5 pool, this naive kernel still incurs explicit cudaMemcpy calls (H2D/D2H) because the data originates on the CPU heap. The unified memory system only eliminates copy overhead when using cudaMallocManaged or Pinned Memory (cudaMallocHost) — neither of which this kernel exploits. The H2D/D2H latency values in the table above represent the actual copy cost over the internal memory fabric.
More critically, the throughput ceiling of ~200 GFLOPs is far below the Orin Nano Super 8GB's rated 33 TOPS (INT8 Dense) or 17 TFLOPs (FP16) because this kernel:
- Uses FP32 CUDA Cores, not Tensor Cores. Tensor Cores are only invoked via `wmma` API calls or `mma.sync` PTX instructions. Without them, the hardware operates at its scalar FP32 ceiling (~1.5 TFLOPS on the Orin Nano 8GB) rather than the Tensor Core peak.
- Does not saturate memory bandwidth. While shared memory tiling reduces global loads, naïve bank-conflict patterns and uncoalesced edge accesses prevent the kernel from reaching the 68 GB/s LPDDR5 bandwidth ceiling.
- Has no software pipelining. `cuBLAS` and TensorRT use double-buffered shared memory and asynchronous `cp.async` loads to overlap compute with data movement. This kernel synchronizes at every tile boundary via `__syncthreads()`, stalling the pipeline.

The ~200 GFLOPs ceiling is therefore the FP32 scalar-core throughput at this occupancy level, not a reflection of the hardware's Tensor Core capability. See NVIDIA's cuBLAS documentation for a comparison of `cublasSgemm` vs `cublasGemmEx` (Tensor Core path).
5.2 Math-Bound vs. Memory-Bound Analysis
At $N = 4096$, the arithmetic intensity is 1365.33 FLOPs/B. The Roofline inflection point for FP32 on the Orin Nano 8GB, calculated as peak compute / peak bandwidth $\approx 1.5\ \text{TFLOPS} / 68\ \text{GB/s} \approx 22$ FLOPs/B, is crossed very early (at $N \approx 66$, since $\text{AI} = N/3$). This means all four matrix sizes in the table above are strictly math-bound under this kernel. The ALUs are the bottleneck, not the memory bus.
However, because the kernel does not use Tensor Cores, the "Math Ceiling" is capped at the scalar FP32 core throughput (~1.5 TFLOPS), not the Tensor Core peak (17 TFLOPS FP16). This explains the flat ~200 GFLOPs throughput across all matrix sizes: the kernel is math-bound, pinned against a low ceiling, with no benefit from further increases in $N$.
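The roofline argument can be reproduced with the two hedged hardware figures quoted in this section (~1.5 TFLOPS scalar FP32, 68 GB/s LPDDR5):

```python
# Roofline sketch for the naive FP32 kernel: attainable throughput is
# min(peak compute, AI x bandwidth). Both constants are the approximate
# figures quoted in the text, not measured values.
PEAK_FLOPS = 1.5e12   # scalar FP32 peak, approximate
BANDWIDTH = 68e9      # LPDDR5 bandwidth, bytes/s

def attainable_gflops(n):
    ai = n / 3  # FLOPs per byte for square GEMM (Section 2.1)
    return min(PEAK_FLOPS, ai * BANDWIDTH) / 1e9

ridge_n = 3 * PEAK_FLOPS / BANDWIDTH  # N at the roofline ridge point
print(round(ridge_n))                  # ~66: math-bound for larger N
print([attainable_gflops(n) for n in (32, 512, 4096)])
```

Every benchmarked size sits on the flat compute roof (1500 GFLOPS under these assumptions); only tiny matrices like N = 32 fall on the bandwidth slope.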
Implication for AI model sizing on Jetson Orin Nano: For real inference workloads, the takeaway is that small-to-medium matrix sizes will be memory-bound even with TensorRT, and optimal throughput requires batching inputs to raise the effective $M$ dimension (and thus arithmetic intensity) into the math-bound regime. This is the primary argument for batch size > 1 in production deployments on Jetson.
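The batching argument can be sketched for a fully-connected layer with an $N \times N$ weight matrix, keeping the 2-effective-bytes convention of Section 2.1. At batch 1 the weight traffic dominates and intensity is ~1 OP/B; batching amortizes the weights:

```python
# Arithmetic intensity of a fully-connected layer X @ W as a function
# of batch size B, for an N x N weight matrix.
def fc_arithmetic_intensity(batch, n, bytes_per_element=2):
    ops = 2 * batch * n * n                          # 2 ops per MAC
    elements_moved = batch * n + n * n + batch * n   # input + weights + output
    return ops / (elements_moved * bytes_per_element)

for b in (1, 8, 64, 512):
    print(b, round(fc_arithmetic_intensity(b, 1024), 1))
```

For N = 1024 the intensity rises from about 1 OP/B at B = 1 to 256 OPs/B at B = 512, carrying the layer from deep in the memory-bound regime across the roofline ridge.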
5.3 SM Occupancy and Throughput Plateaus
Orin Nano SM Occupancy: SM occupancy can be measured using NVIDIA Nsight Compute. Run:
```shell
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active \
    --target-processes all ./gemm_benchmark
```
For this naive kernel, full occupancy is reached already at the smallest benchmarked size: at $N = 512$ the grid launches $(512/32)^2 = 256$ tiles, far more than the number of blocks concurrently resident across the 8 SMs. Above this point, occupancy does not increase; throughput improvements come only from reducing per-instruction latency (i.e., using Tensor Cores).
Odin v0 PCIe Plateau: For the Odin v0, the compute throughput per core begins to plateau when the PCIe transfer of input matrix $A$ dominates total latency. Based on theoretical PCIe 3.0 x4 bandwidth (4 GB/s peak, ~3.2 GB/s practical), the crossover point where compute time ≈ transfer time corresponds to an INT8 input of roughly 768 KB. Beyond this size, gains in raw compute throughput are masked by PCIe ingestion delays.
6. Experimental Results
For this section, GEMM is treated as a Deep Learning operation: only matrix $A$ is an input at inference time; $B$ (weights), $C$ (bias), $\alpha$, and $\beta$ are baked in as model parameters. The resultant model is exported to ONNX and used to build a TensorRT engine (Jetson) and a compiled model file (AXM) (Odin v0).
6.1 INT8 Quantization and Calibration
Both TensorRT and Voyager use Post-Training Quantization (PTQ) to convert the FP32 ONNX model to INT8. The calibration procedure:
- Calibration Dataset: 512 synthetic input matrices, representative of the expected input distribution.
- TensorRT Calibrator: `IInt8EntropyCalibrator2`, which minimizes KL-divergence between the FP32 and INT8 activation distributions to determine per-tensor scale factors.
- Voyager Quantizer: Uses the Voyager SDK's built-in PTQ pipeline, which clips activations based on a percentile of the calibration distribution (default: 99.99th percentile).

Accuracy vs. FP32 Baseline (Mean Absolute Error on output $C$):
| Backend | Precision | MAE vs. FP32 Baseline |
|---|---|---|
| CUDA (this work) | FP32 | 0.000 (reference) |
| TensorRT | INT8 | < 0.002 |
| Voyager | INT8 | < 0.004 |
The slightly higher MAE for the D-IMC accelerator reflects the additional quantization of intermediate accumulations within the D-IMC crossbar. For GEMM-only operations, both INT8 backends remain within acceptable bounds for downstream neural network layers.
6.2 Jetson Orin Nano (TensorRT — INT8)
| MATRIX_SIZE | Compute Latency (ms) | Throughput (GOPs) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (OPs/B) |
|---|---|---|---|---|---|---|
| 512×512 | 0.167 | 1,602.72 | — | — | 0.75 | 170.67 |
| 1024×1024 | 0.531 | 4,048.02 | — | — | 3.0 | 341.33 |
| 2048×2048 | 1.606 | 10,697.88 | — | — | 12.0 | 682.67 |
Throughput Calculation: Throughput (GOPs) $= \frac{2N^3}{\text{latency (s)} \times 10^9}$. For example, at $N = 2048$: $\frac{2 \times 2048^3}{1.606 \times 10^{-3}} \approx 1.07 \times 10^{13}$ OPs/s $\approx 10{,}698$ GOPs.
Interpretation: The throughput scales super-linearly from $N = 512$ to $N = 2048$ (1,602 → 10,698 GOPs, a 6.7× increase). This is the expected behavior as TensorRT's Tensor Core utilization improves with matrix size: larger $N$ increases tile occupancy, reduces wave-quantization waste, and enables TensorRT's tactic selector to choose deeper-pipelined kernels. The hardware is fully Tensor Core-driven (INT8 IMMA instructions), explaining the order-of-magnitude improvement over the naive CUDA FP32 baseline.
H2D / D2H Latency: Not reported. On the Orin Nano, the GPU and CPU share physical LPDDR5 memory. When using Pinned Memory (cudaMallocHost) — as TensorRT does internally — data is accessible to the GPU via DMA without a copy over a discrete bus. The latency column reflects purely kernel execution time plus internal cache management overhead.
Why 4096×4096 was omitted: TensorRT's builder performs an exhaustive tactic search over candidate GEMM kernels for each layer shape. At $N = 4096$, this search space grows combinatorially. On the 8GB Orin Nano, the builder allocates a workspace (commonly 4–8 GB for large INT8 shapes) that, combined with the tensor buffers themselves (16 MB per 4096×4096 INT8 tensor), regularly triggers the OS Out-of-Memory (OOM) killer before the engine is serialized. The 4096×4096 configuration is therefore not feasible on the 8GB Orin Nano with TensorRT INT8 without reducing the builder workspace or applying layer-level memory constraints.
How to run:
```python
import tensorrt as trt
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, calibrator=None):
    builder = trt.Builder(TRT_LOGGER)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB
    config.set_flag(trt.BuilderFlag.INT8)
    if calibrator:
        config.int8_calibrator = calibrator
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        parser.parse(f.read())
    return builder.build_serialized_network(network, config)

# Benchmark: average over 1000 iterations after 100 warm-up
import pycuda.driver as cuda
import pycuda.autoinit
import time

def benchmark(engine_bytes, N, iterations=1000, warmup=100):
    runtime = trt.Runtime(TRT_LOGGER)
    engine = runtime.deserialize_cuda_engine(engine_bytes)
    context = engine.create_execution_context()
    # ... allocate buffers, run warmup, time iterations
```
6.3 Odin v0 (Voyager SDK — INT8)
| MATRIX_SIZE | Compute Latency (ms) | Throughput (GOPs/Core) | H2D Latency (ms) | D2H Latency (ms) | Byte Access (MB) | Arithmetic Intensity (OPs/B) |
|---|---|---|---|---|---|---|
| 512×512 | 0.695 | 386.14 | 0.120 | 0.120 | 0.75 | 170.67 |
| 1024×1024 | 2.020 | 1,062.80 | 1.03 | 1.03 | 3.0 | 341.33 |
Reported values are per-core throughput. The Odin v0 has 4 AI cores; total system throughput is approximately $4\times$ the per-core figure. For $N = 1024$: total $\approx 4 \times 1{,}062.8 \approx 4{,}251$ GOPs.
Interpretation of Latency: Odin v0 latency is the sum of compute time and the PCIe transaction overhead required to stream matrix $A$ into the device and return $C$. Unlike the Jetson's unified memory, Odin v0 must physically move data across the PCIe bus on every inference call. For $N = 1024$, the time spent moving data is nearly equal to the compute time, suppressing effective throughput relative to the theoretical 214 TOPS peak.
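The share of total latency consumed by PCIe movement follows directly from the observed figures in the table above:

```python
# Share of total Odin v0 latency spent moving data over PCIe, computed
# from the observed per-size figures in the Section 6.3 table.
observed = {
    512:  {"compute_ms": 0.695, "h2d_ms": 0.120, "d2h_ms": 0.120},
    1024: {"compute_ms": 2.020, "h2d_ms": 1.030, "d2h_ms": 1.030},
}

for n, t in observed.items():
    xfer = t["h2d_ms"] + t["d2h_ms"]
    total = t["compute_ms"] + xfer
    print(f"N={n}: {100 * xfer / total:.0f}% of total latency is PCIe transfer")
```

Roughly a quarter of total latency at N = 512 and half at N = 1024 is pure data movement.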
PCIe Latency Calculation (Expected vs. Observed):
For a single INT8 matrix of size $N \times N$, the transfer size is $N^2$ bytes. With a practical PCIe 3.0 x4 bandwidth of ~3.2 GB/s:
| MATRIX_SIZE | Transfer Size (MB) | Expected H2D (ms) | Observed H2D (ms) | Overhead (ms) |
|---|---|---|---|---|
| 512×512 | 0.25 | 0.078 | 0.120 | 0.042 |
| 1024×1024 | 1.00 | 0.313 | 1.030 | 0.717 |
The growing gap between expected and observed latency at $N = 1024$ reflects PCIe protocol framing overhead and software stack latency in the Voyager SDK runtime (descriptor setup, DMA chaining). Small transfers are disproportionately penalized by fixed per-transaction overhead.
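The expected column can be reproduced from the raw byte count; note the table above rounds transfer sizes to whole MB before dividing, so this sketch differs from it by a few microseconds:

```python
# Reproduce the "Expected H2D" column from raw byte counts at the
# ~3.2 GB/s practical PCIe 3.0 x4 bandwidth quoted above, and derive
# the per-transfer overhead implied by the observed values.
PCIE_BW = 3.2e9  # bytes/s, practical

def expected_h2d_ms(n):
    return n * n / PCIE_BW * 1e3  # N^2 bytes for an INT8 N x N matrix

for n, observed_ms in ((512, 0.120), (1024, 1.030)):
    exp = expected_h2d_ms(n)
    print(f"N={n}: expected {exp:.3f} ms, implied overhead {observed_ms - exp:.3f} ms")
```

The implied overhead grows from ~0.04 ms to ~0.7 ms between the two sizes, consistent with the stack-latency explanation above rather than a purely fixed per-transaction cost.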
Why 2048×2048 was omitted: Each D-IMC AI core has 4 MB of local L1 SRAM. A single 2048×2048 INT8 weight matrix occupies exactly 4 MB. When input buffers and output staging are accounted for, the total on-chip footprint exceeds the L1 capacity, forcing the Voyager compiler to fragment the operation across multiple PCIe transactions. This produces severe PCIe thrashing: repeated stalls where the crossbar sits idle while the next data fragment is fetched, collapsing effective throughput well below the 1024×1024 result. The 32 MB L2 SRAM is shared across all cores and does not fully mitigate this for a single-core GEMM operation.
How to run:
```python
from axelera.runtime import InferenceSession
import numpy as np

session = InferenceSession("gemm_n1024.axm")  # Compiled model
A = np.random.randint(-128, 127, (1024, 1024), dtype=np.int8)

# Warm-up
for _ in range(100):
    _ = session.run({"input_A": A})

# Benchmark
import time
times = []
for _ in range(1000):
    t0 = time.perf_counter()
    C = session.run({"input_A": A})
    t1 = time.perf_counter()
    times.append((t1 - t0) * 1e3)

print(f"Mean latency: {np.mean(times):.3f} ms")
print(f"Throughput: {2 * 1024**3 / (np.mean(times) * 1e-3) / 1e9:.2f} GOPs")
```
7. Conclusion
7.1 Comparative Summary
| Backend | N=512 Latency (ms) | N=512 Throughput (GOPs) | N=1024 Latency (ms) | N=1024 Throughput (GOPs) |
|---|---|---|---|---|
| CUDA FP32 (naive) | 1.319 | 203.5 | 11.080 | 193.8 |
| TensorRT INT8 | 0.167 | 1,602.7 | 0.531 | 4,048.0 |
| Odin v0 (total) | 0.695 + 0.24 xfer | ~1,544.6 | 2.020 + 2.06 xfer | ~4,251.2 |
At equivalent matrix sizes, TensorRT and Odin v0 deliver comparable total-system throughput. The distinction lies in the latency breakdown: TensorRT's latency is almost entirely compute time, while Odin v0's latency is dominated by PCIe data movement. For small inputs, TensorRT has a clear latency advantage. As grows and the compute-to-transfer ratio improves for Odin v0, the two systems converge.
7.2 Benefits of Odin v0 for Space Applications
- Deterministic Latency: D-IMC avoids the jitter caused by GPU warp scheduling and memory bank contention. For applications such as satellite pose estimation or real-time attitude control, consistent timing is mission-critical. Odin v0's spatial architecture delivers cycle-accurate, deterministic execution per inference.
- Power Efficiency: MAC operations are performed in-memory, eliminating the power-hungry data movement between SRAM and ALUs that dominates GPU energy consumption. This is critical for power-budgeted satellite platforms.
- Radiation Hardening Potential: Traditional GPU register files are highly susceptible to Single Event Upsets (SEUs) from ionizing radiation in LEO/MEO orbits. Localized SRAM architectures with small, well-bounded crossbar cells are inherently more amenable to cell-level ECC and Triple Modular Redundancy (TMR) implementation compared to the distributed register files of a 1024-core GPU.
- TVAC Thermal Profile: The lower peak power draw of Odin v0 (~5 W in typical inference) simplifies the thermal mass and radiator sizing requirements for vacuum-environment chassis, where convective cooling is unavailable.
7.3 Limitations and Bottlenecks
Accuracy:
- Both TensorRT and Voyager introduce quantization error via INT8 PTQ. For GEMM-only operations, the MAE is negligible (see Section 6.1). However, in stacked networks, quantization error can accumulate across layers. Layer-wise sensitivity analysis is recommended before deploying deep networks on Odin v0.

D-IMC-Specific Constraints:
- 4D Tensor Requirement: The D-IMC accelerator is optimized for 4D NCHW tensors (vision pipelines). 1D and 2D GEMMs are internally padded to 4D, introducing "dummy" operations that reduce effective utilization. This benchmark uses a (1, 1, N, N) wrapper to satisfy this constraint.
- Fixed ONNX Operator Set: Custom CUDA-like kernels cannot be written for Odin v0. You are restricted to ONNX opset 17 operators supported by the Voyager compiler. Operations such as sparse attention, custom activation functions, or non-standard pooling require operator decomposition into supported primitives before compilation.
- SRAM Capacity Cliff: As demonstrated by the 2048×2048 exclusion, any layer whose weight matrix exceeds ~3.5 MB (leaving headroom for I/O buffers) will trigger fragmentation and severe throughput degradation. Model architectures must be designed with this constraint in mind — for example, by using depthwise-separable layers or limiting fully-connected layer widths.
- PCIe Bandwidth as Hard Ceiling: For inference pipelines that chain multiple GEMM operations, each layer's output must transit the PCIe bus before the next layer's input arrives. This serializes the pipeline in a way that a Jetson's unified memory model does not, making Odin v0 less competitive for recurrent or iterative architectures with small per-layer tensor sizes.
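The SRAM-cliff guidance above can be encoded as a simple sizing check; the ~3.5 MB usable figure (4 MB L1 minus I/O headroom) is the approximation given in this section:

```python
# Sizing check: does an INT8 fully-connected weight matrix fit in one
# D-IMC core's L1 SRAM? USABLE_L1_BYTES approximates 4 MB minus the
# I/O-buffer headroom discussed in Section 7.3.
USABLE_L1_BYTES = int(3.5 * 2**20)

def fc_fits_on_core(in_features, out_features):
    return in_features * out_features <= USABLE_L1_BYTES  # 1 byte/INT8 weight

print(fc_fits_on_core(1024, 1024))  # True: 1 MB of weights
print(fc_fits_on_core(2048, 2048))  # False: 4 MB exceeds usable L1
```

Under this assumption the widest square fully-connected layer that fits on a single core is about 1915×1915.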
These benchmarks are directly relevant to the selection of edge accelerators for space-grade DPUs:
- Sizing guidance: The SRAM capacity cliff at $N = 2048$ (a 4 MB INT8 weight matrix) defines the maximum fully-connected layer width deployable on a single D-IMC core without performance degradation. Network architectures for satellite onboard inference should keep FC layer widths below this threshold.
- Throughput vs. latency trade-off: For high-cadence sensor fusion tasks (e.g., 100 Hz IMU integration), TensorRT on Jetson offers lower per-inference latency. For batch image classification (e.g., downlink prioritization), Odin v0's higher sustained throughput at larger is advantageous.
- Power envelope: At the 25W MAXN Super profile, the Orin Nano fits within typical small satellite power budgets. The Odin v0 adds minimal incremental power while substantially increasing INT8 throughput, making the combined system attractive for platforms with tight SWaP-C constraints.
Appendix: Hardware Specifications
| Feature | NVIDIA Jetson Orin Nano 8GB | Odin v0 |
|---|---|---|
| Compute Core | Ampere GPU (1024-core, 8 SM) | 4× AI Cores (D-IMC) |
| Memory Architecture | Unified LPDDR5 (68 GB/s) | Local SRAM / PCIe Gen3 |
| On-chip Storage | — | 4 MB L1 per core, 32 MB shared L2 |
| External Interface | — | PCIe 3.0 x4 (~4 GB/s) |
| Configured Power | 25W (MAXN Super, nvpmodel -m 2) | High-Performance Mode |
| Target Precision | FP16 / INT8 (Tensor Cores) | INT8 (In-Memory) |
| Peak Throughput | 67 TOPS (INT8 Sparse) / 33.5 TOPS (INT8 Dense) / 17 TFLOPs (FP16) | 214 TOPS (INT8) |