
GEMM Raw Throughput Analysis

Performance Core

Matrix multiplication is the heartbeat of AI. Odin v0's Digital In-Memory Computing (D-IMC) architecture keeps model weights stationary in SRAM — eliminating memory bus toggles and delivering 3.98× greater compute efficiency at N=1024 versus TensorRT INT8 on Jetson.

Executive Summary

| Metric | Odin v0 (D-IMC) | NVIDIA Jetson (TRT INT8) | Delta |
| --- | --- | --- | --- |
| Peak Throughput (rated) | 214 TOPS | 67 TOPS | 3.2× rated INT8 capacity |
| Efficiency at N=1024 | 3,390 GOPS/W | 852 GOPS/W | 3.98× more efficient |
| Efficiency at N=512 | 1,501 GOPS/W | 504 GOPS/W | 2.98× more efficient |
| Execution Model | Spatial (In-Memory) | Temporal (DRAM-Bound) | Reduced data movement |

Efficiency = throughput ÷ net active power, where net active power = device inference power − device idle power. All figures use the PM0 (15 W) power mode.
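The efficiency definition can be sketched in a few lines of Python. This is an illustration only: the function names `net_efficiency` and `implied_net_power` are hypothetical, and the idle power is not reported in this document, so the example inverts the definition to show the net active power implied by the reported Jetson TRT INT8 figures at N=1024 (5.910 TOPS, 0.851 TOPS/W).

```python
def net_efficiency(throughput_tops: float, inference_w: float, idle_w: float) -> float:
    """Throughput per watt of net active power (inference minus idle)."""
    net_w = inference_w - idle_w
    if net_w <= 0:
        raise ValueError("inference power must exceed idle power")
    return throughput_tops / net_w


def implied_net_power(throughput_tops: float, tops_per_w: float) -> float:
    """Invert the definition: net watts implied by a reported TOPS and TOPS/W pair."""
    return throughput_tops / tops_per_w


# Jetson TRT INT8, PM0, N=1024 as reported in this document.
jetson_net_w = implied_net_power(5.910, 0.851)
print(f"implied net active power: {jetson_net_w:.2f} W")  # about 6.94 W
```

Because the denominator is net rather than total power, the implied figure is lower than the inference power reported elsewhere in this document for the same configuration.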


Efficiency vs Matrix Size

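Odin v0's efficiency advantage grows with N. Below N=512 the two platforms are within about 1.3× of each other (Jetson leads at N=64, Odin narrowly at N=128 and N=256); the decisive crossover, where Odin pulls ahead by roughly 3× or more, occurs at N ≥ 512: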

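| Matrix Size | Odin GOPS/W | Jetson GOPS/W | Delta (Odin vs Jetson) |
| --- | --- | --- | --- |
| 64×64 | 9.41 | 12.38 | 1.31× slower |
| 128×128 | 302.8 | 290.1 | 1.04× faster |
| 256×256 | 442.7 | 372.9 | 1.18× faster |
| 512×512 | 1,501.2 | 503.7 | 2.98× faster |
| 1024×1024 | 3,390.3 | 851.6 | 3.98× faster |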
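The Delta column is the ratio of the two efficiency figures, expressed from Odin's point of view. A minimal sketch that recomputes it from the table values (the `delta` helper is a hypothetical name, not part of any benchmark tooling):

```python
# (N, Odin GOPS/W, Jetson GOPS/W) copied from the table above.
rows = [
    (64, 9.41, 12.38),
    (128, 302.8, 290.1),
    (256, 442.7, 372.9),
    (512, 1501.2, 503.7),
    (1024, 3390.3, 851.6),
]


def delta(odin: float, jetson: float) -> tuple[float, str]:
    """Return the ratio (always >= 1) and which way it points for Odin."""
    if odin >= jetson:
        return odin / jetson, "faster"
    return jetson / odin, "slower"


deltas = {n: delta(odin, jetson) for n, odin, jetson in rows}
for n, (ratio, direction) in deltas.items():
    print(f"{n}x{n}: {ratio:.2f}x {direction}")
```

The recomputed ratios agree with the table to within 0.01; the small differences at N=64 and N=256 are consistent with the published deltas having been truncated or derived from unrounded measurements.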

TOPS vs Matrix Size

All Backends at PM0 15 W

All Backends — TOPS vs N at PM0 15 W
Single power-mode comparison across all compute paths
At PM0 15 W: TRT INT8, TRT FP16, and Axelera are comparable for N≤256 — all overhead-dominated. Divergence starts at N=512. At N=1024: TRT INT8 (5.91) > TRT FP16 (3.07) > Axelera (2.85) > cuBLAS FP32 (0.76). TRT FP16 underperforming INT8 at 15 W is a power-headroom issue, not a precision trade-off.

TRT INT8 — All Power Modes

TensorRT INT8 — TOPS vs N
All Jetson power modes · INT8 precision
Best GPU throughput path. The power modes stratify clearly at N≥512; at N=1024 the ordering is OC (9.54) > 25 W (8.57) > MAXN (8.35) > 15 W (5.91) > 7 W (2.22) TOPS. 25 W and MAXN are nearly indistinguishable: there is no throughput gain past 25 W for this workload.

Axelera D-IMC Dynamic — All Power Modes

Axelera Metis Dynamic — TOPS vs N
Valid Jetson power modes · INT8, 4 cores parallel · PM3 7 W excluded (ORT crash)
PM2_OC anomaly: Axelera drops ~22% under overclocking (2.22 vs 2.86 TOPS at N=1024) — likely PCIe contention from OC'd CPU/GPU. PM0 / PM1 / PM2 are flat within 2%; Metis throughput is agnostic to Jetson host clock within normal operating range.
cuBLAS FP32 and TRT FP16 TOPS curves
cuBLAS FP32 — TOPS vs N
All Jetson power modes · FP32 precision
PM0 / PM1 / PM2 are virtually identical — GPU clock gating is equivalent for this workload. OC unlocks +65% at N=1024 (1.25 vs 0.76 TOPS); cuBLAS FP32 is memory-bandwidth-bound and responds directly to clock uplift. PM3 7 W throttles hard: 0.25 TOPS at N=1024.
TensorRT FP16 — TOPS vs N
All Jetson power modes · FP16 precision (TFLOPS)
PM0 15 W severely underperforms at N=1024 (3.07 TOPS, 5.6 ms) vs PM1 25 W (4.49 TOPS, 3.83 ms) — tensor-core throughput is starved at 15 W TDP. PM1 / PM2 / OC plateau at 4.1–4.5 TOPS. Note: TRT engines compiled on a different device; numbers may be conservative.

Efficiency (TOPS/W) vs Matrix Size

All Backends at PM0 15 W

All Backends — TOPS/W vs N at PM0 15 W
Single power-mode efficiency comparison
Axelera is about 4× more efficient than TRT INT8 at N=1024 (3.39 vs 0.851 TOPS/W). At N=256 the gap narrows to roughly 1.2× (0.443 vs 0.373 TOPS/W, per the efficiency table above), because PCIe overhead erases most of the Axelera advantage at mid-range N. Axelera's efficiency advantage is only fully realised at N≥512.

TRT INT8 Efficiency

TensorRT INT8 — TOPS/W vs N
Efficiency across all Jetson power modes
PM0 15 W is the most efficient Jetson mode at N=1024 (0.851 TOPS/W) — not MAXN (0.565). The mid-range dip at N=256–512 reflects the transition from overhead-dominated to compute-dominated regime. At N=1024, 7 W (0.688) is competitive with 25 W (0.652) if latency budget allows.

Axelera D-IMC Efficiency

Axelera Metis Dynamic — TOPS/W vs N
Valid Jetson power modes · jtop net-power denominator
Axelera efficiency scales sharply with N: 3.39 TOPS/W at N=1024 (PM0). At N≤128, Axelera is worse than GPU due to PCIe overhead dominating small-matrix transfers. Efficiency crossover with TRT INT8 occurs around N=512. OC mode degrades to 2.15 TOPS/W at N=1024 — PCIe contention penalty.
cuBLAS FP32 and TRT FP16 efficiency curves
cuBLAS FP32 — TOPS/W vs N
Efficiency across all Jetson power modes
Efficiency improves monotonically with N — larger matrices amortise H2D/D2H overhead. OC shows a non-monotonic profile: strong at small N (lower latency), but comparable to other modes at N=1024 because power draw scales proportionally.
TensorRT FP16 — TOPS/W vs N
Efficiency across all Jetson power modes
FP16 efficiency is consistently below INT8 at large N despite comparable TFLOPS numbers. Best FP16 efficiency: PM0 at N=1024 (0.420 TOPS/W) — the restricted 15 W budget forces better utilisation. OC has strong efficiency at small N due to lower latency but does not extend to N=1024.

Operating Point Analysis — N=1024

Each point = one backend × power-mode configuration. All values measured at N=1024.

Latency vs Power

Latency vs Inference Power — N=1024
On-chip compute latency (Axelera) or GPU kernel latency · jtop inference power · ideal = bottom-left
Ideal region: bottom-left. Axelera cluster sits at 5–7.4 W / 3.0–3.9 ms. TRT INT8 achieves lower latency (1.8–2.9 ms) at 12–21.5 W. Axelera OC is the outlier (7.4 W / 3.9 ms) — PCIe contention inflating both axes. No single backend dominates both axes simultaneously.

TOPS vs Power (Pareto View)

TOPS vs Inference Power — N=1024 (Pareto view)
Ideal = top-left · Pareto-optimal configs = upper-left boundary
Pareto-optimal configs: Axelera PM0/PM1/PM2 (~2.85 TOPS, ~5.1–5.4 W) · TRT INT8 PM0 (5.91 TOPS, 12.2 W) · TRT INT8 PM1 (8.57 TOPS, 18.5 W) · TRT INT8 OC (9.54 TOPS, 21.5 W). MAXN TRT INT8 (20.8 W, 8.35 TOPS) is dominated by 25 W (18.5 W, 8.57 TOPS).
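The Pareto set above can be reproduced with a simple dominance check: a configuration is dominated if some other configuration draws no more power and delivers no less throughput, with at least one of the two strictly better. The sketch below uses the N=1024 figures quoted in this section. The `Config` type and `pareto_front` function are illustrative names, and the Axelera cluster is collapsed to a single representative point at 5.3 W, which is an assumption taken from the 5.1–5.4 W range.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    power_w: float  # inference power in watts
    tops: float     # throughput at N=1024


configs = [
    Config("Axelera PM0/PM1/PM2", 5.3, 2.85),  # representative point (assumption)
    Config("TRT INT8 PM0 15 W", 12.2, 5.91),
    Config("TRT INT8 PM1 25 W", 18.5, 8.57),
    Config("TRT INT8 MAXN", 20.8, 8.35),
    Config("TRT INT8 OC", 21.5, 9.54),
]


def pareto_front(points: list[Config]) -> list[Config]:
    """Keep points not dominated on (lower power, higher TOPS)."""
    front = []
    for p in points:
        dominated = any(
            q.power_w <= p.power_w
            and q.tops >= p.tops
            and (q.power_w < p.power_w or q.tops > p.tops)
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda c: c.power_w)


front = pareto_front(configs)
for c in front:
    print(f"{c.name}: {c.tops} TOPS at {c.power_w} W")
```

With these inputs MAXN is the only point dropped, since the 25 W mode delivers more throughput at lower power.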

Efficiency vs Throughput Tradeoff

TOPS/W vs TOPS — N=1024 (Efficiency–Throughput Tradeoff)
Ideal = top-right · no backend achieves both simultaneously
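No backend dominates both axes. Axelera: 2.2–2.9 TOPS / 2.1–3.4 TOPS/W (the low end of both ranges is the OC point). TRT INT8 cluster: 2.2–9.5 TOPS / 0.56–0.85 TOPS/W. TRT FP16 is dominated by TRT INT8 at every operating point. Axelera OC is the weakest Axelera result: it drops to 2.2 TOPS and 2.15 TOPS/W, which only matches TRT INT8 at 7 W on throughput (2.22 TOPS), although it remains well ahead of every TRT INT8 mode on efficiency (best: 0.851 TOPS/W).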

Temperature vs Power

Temperature vs Inference Power — N=1024
GPU die temp (Jetson backends) · Metis die temp (Axelera backends) · temperatures are from different sensors
GPU temperature tracks power tightly. TRT FP16/INT8 at MAXN/25 W hit 63–68 °C GPU die. Axelera keeps the Jetson cool at 49–55 °C GPU die (not plotted) because compute offloads to Metis. Metis die stays at 44–46 °C across all host modes — well within thermal limits. Sensor types differ: Jetson GPU die vs Axelera Metis die.

Experimental Data

NVIDIA Jetson — TRT INT8, PM0 15 W

| Size | Latency (ms) | Throughput (TOPS) | TOPS/W |
| --- | --- | --- | --- |
| 512×512 | 0.478 | 2.219 | not reported |
| 1024×1024 | 2.907 | 5.910 | 0.851 |

Odin v0 — INT8, PM0 15 W (4 cores parallel)

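| Size | On-chip (ms) | PCIe (ms) | Total (ms) | TOPS/W |
| --- | --- | --- | --- | --- |
| 512×512 | not reported | not reported | not reported | 1.501 |
| 1024×1024 | 3.016 | 0.418 | 3.434 | 3.394 |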
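The Total column is the sum of on-chip and PCIe time, so the table also shows how much of the end-to-end latency is transfer overhead at N=1024. A short sketch of that arithmetic, using only the values in the row above:

```python
# Odin v0, PM0 15 W, N=1024 (values from the table above).
on_chip_ms = 3.016
pcie_ms = 0.418

total_ms = on_chip_ms + pcie_ms
pcie_share = pcie_ms / total_ms

print(f"total = {total_ms:.3f} ms, PCIe share = {pcie_share:.1%}")
# total = 3.434 ms, PCIe share = 12.2%
```

At this size PCIe is a minor term. The per-size split is not reported for smaller N, so this calculation cannot be repeated there from the data given.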
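Note (SRAM capacity limit): Odin v0 provides 4 MB of L1 SRAM per core. A single 1024×1024 INT8 matrix is 1 MB, so the ~3 MB working set of a 1024×1024 GEMM (presumably the A, B, and C operands at one byte per element) fits within budget. A 2048×2048 problem exceeds L1 capacity (one INT8 operand alone is 4 MB) and must tile across multiple passes, breaking the single-shot spatial execution model.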
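The capacity argument can be checked with a rough footprint estimate. The sketch below assumes the working set is the three N×N operands (A, B, C) at one byte per element, which is one reading of the ~3 MB figure and not a confirmed description of how Odin v0 lays out data; the function names are hypothetical. If the output were held in wider accumulators (for example 4 bytes per element), the footprint would be larger.

```python
MIB = 1024 * 1024
L1_SRAM_BYTES = 4 * MIB  # per-core L1 SRAM stated in the note


def gemm_footprint_bytes(n: int, bytes_a: int = 1, bytes_b: int = 1, bytes_c: int = 1) -> int:
    """Bytes needed to hold square A, B, and C operands of size n x n."""
    return n * n * (bytes_a + bytes_b + bytes_c)


def fits_in_l1(n: int) -> bool:
    """True if the assumed working set fits in one core's L1 SRAM."""
    return gemm_footprint_bytes(n) <= L1_SRAM_BYTES


for n in (512, 1024, 2048):
    size_mib = gemm_footprint_bytes(n) / MIB
    print(f"N={n}: {size_mib:.2f} MiB, fits in L1: {fits_in_l1(n)}")
```

Under these assumptions N=1024 needs 3 MiB and fits, while N=2048 needs 12 MiB and does not.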


Conclusion

  1. Efficiency at scale: At N=1024, Odin delivers 3,390 GOPS/W vs Jetson TRT INT8 852 GOPS/W — 3.98× advantage. Advantage grows with N; crossover occurs at N≥512.
  2. Power floor: Odin system power at N=1024 is 5.3 W total vs Jetson 12.2 W — directly reducing thermal radiator requirements for vacuum-environment chassis.
  3. Radiation tolerance: SRAM-bounded D-IMC fabric presents a smaller SEU-vulnerable surface than a 1024-core GPU, with per-layer weights fully on-chip after warm-up.