GEMM Raw Throughput Analysis
Matrix multiplication is the heartbeat of AI. Odin v0's Digital In-Memory Computing (D-IMC) architecture keeps model weights stationary in SRAM — eliminating memory bus toggles and delivering 3.98× greater compute efficiency at N=1024 versus TensorRT INT8 on Jetson.
Executive Summary
| Metric | Odin v0 (D-IMC) | NVIDIA Jetson (TRT INT8) | Delta |
|---|---|---|---|
| Peak Throughput (rated) | 214 TOPS | 67 TOPS | 3.2× rated INT8 capacity |
| Efficiency at N=1024 | 3,390 GOPS/W | 852 GOPS/W | 3.98× more efficient |
| Efficiency at N=512 | 1,501 GOPS/W | 504 GOPS/W | 2.98× more efficient |
| Execution Model | Spatial (In-Memory) | Temporal (DRAM-Bound) | Reduced data movement |
Efficiency is throughput divided by net active power (device inference power − device idle power). Measured in the PM0 15 W power mode.
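The net-power definition can be sketched in a few lines. The function names below are hypothetical, and the idle power is not a reported measurement: it is back-calculated from the TRT INT8 PM0 figures quoted in this report (5.91 TOPS, 12.2 W, 0.851 TOPS/W), assuming the 12.2 W figure is the inference power used in the efficiency denominator.

```python
def net_efficiency_tops_per_w(tops, inference_w, idle_w):
    """Throughput divided by net active power (inference minus idle)."""
    net_w = inference_w - idle_w
    if net_w <= 0:
        raise ValueError("inference power must exceed idle power")
    return tops / net_w


def implied_idle_w(tops, inference_w, tops_per_w):
    """Idle power implied by a reported throughput, power, and efficiency."""
    return inference_w - tops / tops_per_w


# TRT INT8, PM0 15 W, N=1024 (figures quoted in this report).
# The idle value is derived here for illustration, not measured.
idle_w = implied_idle_w(tops=5.91, inference_w=12.2, tops_per_w=0.851)
efficiency = net_efficiency_tops_per_w(5.91, 12.2, idle_w)
print(f"implied idle: {idle_w:.2f} W, efficiency: {efficiency:.3f} TOPS/W")
```

Because idle power is subtracted, the same throughput yields a much higher TOPS/W figure than dividing by total inference power would, which is worth keeping in mind when comparing these numbers with datasheet values.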
Efficiency vs Matrix Size
Odin v0's efficiency advantage grows with N. It trails TRT INT8 at N=64, is near parity at N=128–256, and pulls clearly ahead from N=512:
| Matrix Size | Odin GOPS/W | Jetson GOPS/W | Delta |
|---|---|---|---|
| 64×64 | 9.4 | 12.38 | 1.31× less efficient |
| 128×128 | 302.8 | 290.1 | 1.04× more efficient |
| 256×256 | 442.7 | 372.9 | 1.18× more efficient |
| 512×512 | 1,501.2 | 503.7 | 2.98× more efficient |
| 1024×1024 | 3,390.3 | 851.6 | 3.98× more efficient |
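The Delta column can be re-derived from the two GOPS/W columns. The following is a minimal sketch with hypothetical helper names; the computed ratios may differ from the table in the last digit, depending on how the original values were rounded.

```python
# (N, Odin GOPS/W, Jetson GOPS/W) copied from the table above.
ROWS = [
    (64, 9.4, 12.38),
    (128, 302.8, 290.1),
    (256, 442.7, 372.9),
    (512, 1501.2, 503.7),
    (1024, 3390.3, 851.6),
]


def efficiency_ratios(rows):
    """Return {N: Odin/Jetson efficiency ratio}; >1 means Odin leads."""
    return {n: odin / jetson for n, odin, jetson in rows}


def first_lead(ratios):
    """Smallest N at which the ratio exceeds 1, or None if it never does."""
    leading = [n for n, r in sorted(ratios.items()) if r > 1.0]
    return leading[0] if leading else None


ratios = efficiency_ratios(ROWS)
for n, r in sorted(ratios.items()):
    print(f"N={n:5d}  Odin/Jetson = {r:.2f}x")
print("Odin first leads at N =", first_lead(ratios))
```

On these figures the ratio first exceeds 1 at N=128, while the large gains (roughly 3× and above) appear from N=512.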
TOPS vs Matrix Size
All Backends at PM0 15 W
All Backends — TOPS vs N at PM0 15 W
Single power-mode comparison across all compute paths
At PM0 15 W: TRT INT8, TRT FP16, and Axelera are comparable for N≤256 — all overhead-dominated. Divergence starts at N=512. At N=1024: TRT INT8 (5.91) > TRT FP16 (3.07) > Axelera (2.85) > cuBLAS FP32 (0.76). TRT FP16 underperforming INT8 at 15 W is a power-headroom issue, not a precision trade-off.
TRT INT8 — All Power Modes
TensorRT INT8 — TOPS vs N
All Jetson power modes · INT8 precision
Best GPU throughput path. Power modes stratify clearly at N≥512; at N=1024: OC (9.54) > 25 W (8.57) > MAXN (8.35) > 15 W (5.91) > 7 W (2.22). 25 W and MAXN are nearly indistinguishable: there is no throughput gain past 25 W for this workload.
Axelera D-IMC Dynamic — All Power Modes
Axelera Metis Dynamic — TOPS vs N
Valid Jetson power modes · INT8, 4 cores parallel · PM3 7 W excluded (ORT crash)
PM2_OC anomaly: Axelera drops ~22% under overclocking (2.22 vs 2.86 TOPS at N=1024) — likely PCIe contention from OC'd CPU/GPU. PM0 / PM1 / PM2 are flat within 2%; Metis throughput is agnostic to Jetson host clock within normal operating range.
cuBLAS FP32 and TRT FP16 TOPS curves
cuBLAS FP32 — TOPS vs N
All Jetson power modes · FP32 precision
PM0 / PM1 / PM2 are virtually identical — GPU clock gating is equivalent for this workload. OC unlocks +65% at N=1024 (1.25 vs 0.76 TOPS); cuBLAS FP32 is memory-bandwidth-bound and responds directly to clock uplift. PM3 7 W throttles hard: 0.25 TOPS at N=1024.
TensorRT FP16 — TOPS vs N
All Jetson power modes · FP16 precision (TFLOPS)
PM0 15 W severely underperforms at N=1024 (3.07 TOPS, 5.6 ms) vs PM1 25 W (4.49 TOPS, 3.83 ms): tensor-core throughput is starved at 15 W TDP. PM1 / PM2 / OC plateau at 4.1–4.5 TOPS. Note: the TRT engines were compiled on a different device, so these numbers may be conservative.
Efficiency (TOPS/W) vs Matrix Size
All Backends at PM0 15 W
All Backends — TOPS/W vs N at PM0 15 W
Single power-mode efficiency comparison
Axelera is 3.98× more efficient than TRT INT8 at N=1024 (3.39 vs 0.851 TOPS/W). At N=256 the gap nearly closes (about 0.44 vs 0.37 TOPS/W in the efficiency table), because PCIe overhead erodes the Axelera advantage at mid-range N. The large efficiency advantage is only realised at N≥512.
TRT INT8 Efficiency
TensorRT INT8 — TOPS/W vs N
Efficiency across all Jetson power modes
PM0 15 W is the most efficient Jetson mode at N=1024 (0.851 TOPS/W) — not MAXN (0.565). The mid-range dip at N=256–512 reflects the transition from overhead-dominated to compute-dominated regime. At N=1024, 7 W (0.688) is competitive with 25 W (0.652) if latency budget allows.
Axelera D-IMC Efficiency
Axelera Metis Dynamic — TOPS/W vs N
Valid Jetson power modes · jtop net-power denominator
Axelera efficiency scales sharply with N, reaching 3.39 TOPS/W at N=1024 (PM0). At small sizes PCIe overhead dominates the matrix transfers: the efficiency table shows Axelera behind TRT INT8 at N=64 and only marginally ahead at N=128–256. The decisive advantage over TRT INT8 begins around N=512. OC mode degrades to 2.15 TOPS/W at N=1024, a PCIe contention penalty.
cuBLAS FP32 and TRT FP16 efficiency curves
cuBLAS FP32 — TOPS/W vs N
Efficiency across all Jetson power modes
Efficiency improves monotonically with N — larger matrices amortise H2D/D2H overhead. OC shows a non-monotonic profile: strong at small N (lower latency), but comparable to other modes at N=1024 because power draw scales proportionally.
TensorRT FP16 — TOPS/W vs N
Efficiency across all Jetson power modes
FP16 efficiency is consistently below INT8 at large N despite comparable TFLOPS numbers. Best FP16 efficiency: PM0 at N=1024 (0.420 TOPS/W) — the restricted 15 W budget forces better utilisation. OC has strong efficiency at small N due to lower latency but does not extend to N=1024.
Operating Point Analysis — N=1024
Each point = one backend × power-mode configuration. All values measured at N=1024.
Latency vs Power
Latency vs Inference Power — N=1024
On-chip compute latency (Axelera) or GPU kernel latency · jtop inference power · ideal = bottom-left
Ideal region: bottom-left. Axelera cluster sits at 5–7.4 W / 3.0–3.9 ms. TRT INT8 achieves lower latency (1.8–2.9 ms) at 12–21.5 W. Axelera OC is the outlier (7.4 W / 3.9 ms) — PCIe contention inflating both axes. No single backend dominates both axes simultaneously.
TOPS vs Power (Pareto View)
TOPS vs Inference Power — N=1024 (Pareto view)
Ideal = top-left · Pareto-optimal configs = upper-left boundary
Pareto-optimal configs: Axelera PM0/PM1/PM2 (~2.85 TOPS, ~5.1–5.4 W) · TRT INT8 PM0 (5.91 TOPS, 12.2 W) · TRT INT8 PM1 25 W (8.57 TOPS, 18.5 W) · TRT INT8 OC (9.54 TOPS, 21.5 W). TRT INT8 MAXN (8.35 TOPS, 20.8 W) is dominated by the 25 W mode (8.57 TOPS, 18.5 W), which delivers more throughput at lower power.
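The dominance test behind this Pareto view can be sketched as follows. The `Config` type and function names are hypothetical; the operating points are the N=1024 values quoted in this section, with the Axelera PM0 power taken as 5.3 W, a representative value inside the quoted ~5.1–5.4 W range.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    power_w: float
    tops: float


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on both axes and better on one."""
    no_worse = a.power_w <= b.power_w and a.tops >= b.tops
    better = a.power_w < b.power_w or a.tops > b.tops
    return no_worse and better


def pareto_front(configs):
    """Configs not dominated by any other, sorted by power."""
    front = [c for c in configs
             if not any(dominates(o, c) for o in configs)]
    return sorted(front, key=lambda c: c.power_w)


# N=1024 operating points quoted in the text (power in W, throughput in TOPS).
CONFIGS = [
    Config("Axelera PM0", 5.3, 2.85),       # 5.3 W is an assumed midpoint
    Config("Axelera OC", 7.4, 2.22),
    Config("TRT INT8 PM0 15 W", 12.2, 5.91),
    Config("TRT INT8 PM1 25 W", 18.5, 8.57),
    Config("TRT INT8 MAXN", 20.8, 8.35),
    Config("TRT INT8 OC", 21.5, 9.54),
]

for c in pareto_front(CONFIGS):
    print(f"{c.name:20s} {c.power_w:5.1f} W  {c.tops:5.2f} TOPS")
```

With these inputs, MAXN drops out because the 25 W mode beats it on both axes, and Axelera OC drops out because Axelera PM0 does the same.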
Efficiency vs Throughput Tradeoff
TOPS/W vs TOPS — N=1024 (Efficiency–Throughput Tradeoff)
Ideal = top-right · no backend achieves both simultaneously
No backend dominates both axes. Axelera cluster: 2.2–2.9 TOPS / 2.1–3.4 TOPS/W. TRT INT8 cluster: 2.2–9.5 TOPS / 0.56–0.85 TOPS/W. TRT FP16 is dominated by TRT INT8 at every operating point. Axelera OC is the weakest Axelera result: throughput drops to 2.2 TOPS, level with TRT INT8 at 7 W, and efficiency drops to 2.15 TOPS/W, which is still above every TRT INT8 mode.
Temperature vs Power
Temperature vs Inference Power — N=1024
GPU die temp (Jetson backends) · Metis die temp (Axelera backends) · temperatures are from different sensors
GPU temperature tracks power tightly. TRT FP16/INT8 at MAXN/25 W hit 63–68 °C GPU die. Axelera keeps the Jetson cool at 49–55 °C GPU die (not plotted) because compute offloads to Metis. Metis die stays at 44–46 °C across all host modes — well within thermal limits. Sensor types differ: Jetson GPU die vs Axelera Metis die.
Experimental Data
NVIDIA Jetson — TRT INT8, PM0 15 W
| Size | Latency (ms) | Throughput (TOPS) | TOPS/W |
|---|---|---|---|
| 512×512 | 0.478 | 2.219 | — |
| 1024×1024 | 2.907 | 5.910 | 0.851 |
Odin v0 — INT8, PM0 15 W (4 cores parallel)
| Size | On-chip (ms) | PCIe (ms) | Total (ms) | TOPS/W |
|---|---|---|---|---|
| 512×512 | — | — | — | 1.501 |
| 1024×1024 | 3.016 | 0.418 | 3.434 | 3.394 |
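The latency columns in the Odin table combine additively, which makes the PCIe share of end-to-end latency easy to check. This is a small sketch with a hypothetical helper name, using only the N=1024 row above.

```python
# N=1024, PM0 15 W figures from the Odin v0 table above (milliseconds).
ODIN_ON_CHIP_MS = 3.016
ODIN_PCIE_MS = 0.418


def pcie_share(on_chip_ms, pcie_ms):
    """Fraction of end-to-end latency spent on PCIe transfer."""
    return pcie_ms / (on_chip_ms + pcie_ms)


total_ms = ODIN_ON_CHIP_MS + ODIN_PCIE_MS
share = pcie_share(ODIN_ON_CHIP_MS, ODIN_PCIE_MS)
print(f"Odin total: {total_ms:.3f} ms, PCIe share: {share:.1%}")
```

At N=1024 the transfer accounts for roughly an eighth of the total, consistent with the observation that PCIe overhead matters most at small N, where on-chip time shrinks faster than transfer time.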
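SRAM Capacity Limit: Odin v0 provides 4 MB of L1 SRAM per core. A 1024×1024 INT8 GEMM requires ~3 MB (each 1024×1024 INT8 matrix occupies 1 MB, so this is consistent with three matrices resident), which fits within budget. At 2048×2048 a single INT8 matrix is already 4 MB, so the working set exceeds L1 capacity and must be tiled across multiple passes, breaking the single-shot spatial execution model.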
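The capacity arithmetic can be sketched as below. This assumes the working set is three square INT8 matrices (two inputs and one output at 1 byte per element) and treats 4 MB as 4 MiB; neither assumption is confirmed by the source, and real accumulators are often wider than 8 bits. The helper names are hypothetical.

```python
L1_SRAM_BYTES = 4 * 1024 * 1024  # 4 MB per core, as stated above


def gemm_footprint_bytes(n, bytes_per_element=1, operands=3):
    """Bytes to hold `operands` square N x N matrices (assumed A, B, C)."""
    return operands * n * n * bytes_per_element


def fits_in_l1(n):
    """True if the assumed working set fits in one core's L1 SRAM."""
    return gemm_footprint_bytes(n) <= L1_SRAM_BYTES


for n in (512, 1024, 2048):
    mib = gemm_footprint_bytes(n) / 2**20
    print(f"N={n}: {mib:.2f} MiB, fits in L1: {fits_in_l1(n)}")
```

Under these assumptions the footprint grows with N², so doubling N from 1024 to 2048 quadruples the working set and pushes it well past a single core's L1.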
Conclusion
- Efficiency at scale: At N=1024, Odin delivers 3,390 GOPS/W vs 852 GOPS/W for Jetson TRT INT8, a 3.98× advantage. The advantage grows with N and becomes substantial from N=512.
- Power floor: Odin system power at N=1024 is 5.3 W total vs Jetson 12.2 W — directly reducing thermal radiator requirements for vacuum-environment chassis.
- Radiation tolerance: SRAM-bounded D-IMC fabric presents a smaller SEU-vulnerable surface than a 1024-core GPU, with per-layer weights fully on-chip after warm-up.