Sentinel-2 Multi-spectral Cloud Detection Benchmark

This test evaluates the system's capability to perform pixel-level cloud and cloud shadow detection on full Sentinel-2 Level-2A (L2A) multispectral (MS) scenes using a hybrid hardware pipeline. Onboard cloud masking is a critical preprocessing step for autonomous Earth Observation (EO) missions — without it, downstream analysis algorithms operate on corrupted imagery, degrading change detection, vegetation monitoring, and flood mapping pipelines. We analyze a tiled inference pipeline in which the DTACSNet-CD model — a U-Net with a MobileNetV2 encoder, compiled to Odin v0 INT8 — runs entirely on the D-IMC accelerator, with the Jetson GPU handling high-throughput band loading, tile extraction, and INT8 quantization as a hardware-accelerated preprocessing stage.

The central engineering challenge is achieving high-throughput, low-power tile processing for a 10,980 × 10,980 pixel full Sentinel-2 scene (~120.6 megapixels, 2,809 tiles) within the SWaP-C envelope of a small satellite platform, while maintaining cloud mask quality sufficient for downstream onboard applications.


2. Background: Sentinel-2 Cloud Detection

2.1 Problem Foundation

Given a multispectral image tile $\mathbf{I} \in \mathbb{R}^{H \times W \times C}$ captured by the Sentinel-2 MSI instrument, where $C = 4$ bands at 10 m spatial resolution, the goal is to assign each pixel $(i, j)$ a semantic label $y_{i,j} \in \{0, 1, 2, 3\}$:

Class | Label        | Description
------|--------------|------------
0     | Clear        | No cloud contamination; surface reflectance recoverable
1     | Thick Cloud  | Opaque; surface reflectance fully blocked
2     | Thin Cloud   | Semi-transparent; alters spectral signal but background visible
3     | Cloud Shadow | Dark pixels cast by overlying cloud

For the deployment pipeline, classes 1–3 are collapsed into a binary contaminated / non-contaminated label, enabling downstream cloud-fill reconstruction and tile prioritization logic.

The four 10 m-resolution bands — B02 (490 nm, Blue), B03 (560 nm, Green), B04 (665 nm, Red), and B08 (842 nm, NIR) — are selected because they are co-registered at native 10 m GSD, eliminating resampling artifacts at cloud boundaries. All four bands are normalized to surface reflectance units by dividing raw integer DN by 10,000.

2.2 Loss Function

DTACSNet-CD is trained with weighted cross-entropy loss over the four-class output:

$$\mathcal{L} = -\sum_{i,j} \sum_{k=0}^{3} w_k \cdot y_{i,j,k} \log \hat{p}_{i,j,k}$$

where $y_{i,j,k}$ is the one-hot ground-truth label at pixel $(i,j)$ for class $k$, $\hat{p}_{i,j,k}$ is the predicted class probability, and $w_k$ are class weights compensating for the naturally imbalanced cloud vs. clear pixel distribution in the CloudSEN12 training set.

Training uses the Adam optimizer with learning rate $10^{-3}$ and batch size 32. The learning rate is reduced by $0.1\times$ if validation cross-entropy does not improve for 4 consecutive epochs; training stops if there is no improvement for 10 epochs.
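The weighted cross-entropy above reduces to a few lines of NumPy. The following is a minimal sketch for checking the formula numerically; the function name, toy probabilities, and weights are illustrative, not taken from the training code:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, weights):
    """Weighted CE over an (H, W, K) probability map and (H, W) integer labels.

    probs   : predicted class probabilities, shape (H, W, K), rows sum to 1
    labels  : ground-truth class indices, shape (H, W), values in [0, K)
    weights : per-class weights w_k, shape (K,)
    """
    h, w = labels.shape
    # The one-hot y_{i,j,k} selects log p at the true class; w_k rescales it.
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.sum(weights[labels] * np.log(p_true))

# Toy 1x2 "image": pixel 0 is clear (class 0), pixel 1 is thick cloud (class 1)
probs = np.array([[[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1]]])
labels = np.array([[0, 1]])
weights = np.array([0.5, 2.0, 2.0, 2.0])  # up-weight the rarer cloud classes
loss = weighted_cross_entropy(probs, labels, weights)
# loss = -(0.5 * ln 0.7 + 2.0 * ln 0.5) ≈ 1.565
```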

2.3 Dataset (CloudSEN12)

The model is trained and evaluated on CloudSEN12 (Aybar et al., 2022), a global Sentinel-2 cloud semantic segmentation dataset.

Scene configuration:

  • Coverage: 49,400 image patches (IPs) of 509 × 509 pixels, uniformly distributed across all continents except Antarctica
  • Input: Sentinel-2 L1C Top-of-Atmosphere reflectance (all 13 spectral bands); the deployment model uses only the 4 × 10 m bands
  • Labels: Pixel-level annotation into 4 classes (clear, thick cloud, thin cloud, cloud shadow)
  • High-quality subset: 10,000 IPs with pixel-level human annotation — used for DTACSNet-CD training and test evaluation

Dataset split:

Split      | Strategy                         | Purpose
-----------|----------------------------------|--------
Train      | Spatially stratified block split | Supervised training
Validation | Spatially stratified block split | Early stopping / LR scheduling
Test       | Spatially stratified block split | Held-out evaluation

A spatially stratified block split (blockCV) prevents spatial autocorrelation leakage between splits. INT8 PTQ calibration uses a subset of training images distinct from the test split used for accuracy reporting.

2.4 Evaluation Metrics

Cloud detection quality is assessed using the F2-score — a recall-weighted variant of the F-score. Missed clouds (false negatives) are more costly than false alarms for downstream onboard processing, motivating $\beta = 2$:

$$F_2 = \frac{5 \cdot \text{Precision} \cdot \text{Recall}}{4 \cdot \text{Precision} + \text{Recall}}$$

Precision and Recall are computed over the binary contaminated (classes 1–3) vs. non-contaminated (class 0) labels.

For quantization accuracy comparison, pixel-level agreement against the PyTorch FP32 CPU baseline is additionally reported:

Metric          | Formula | Unit | Notes
----------------|---------|------|------
F2-Score        | $(1+\beta^2) \cdot \text{Prec} \cdot \text{Rec} \, / \, (\beta^2 \cdot \text{Prec} + \text{Rec})$, $\beta = 2$ | | vs. CloudSEN12 ground truth
Pixel Agreement | $\frac{1}{HW}\sum_{i,j}\mathbb{1}[\hat{y}_{i,j}^\text{ref} = \hat{y}_{i,j}]$ | % | vs. FP32 CPU baseline

Using the FP32 CPU output as reference isolates the quantization and precision degradation introduced by each accelerated backend independently of dataset label noise.
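Both metrics reduce to a few lines over binary masks. A sketch with toy masks (function names are illustrative, not from the evaluation scripts):

```python
import numpy as np

def f2_score(pred, truth):
    """F2 over binary masks (1 = contaminated). Recall-weighted with beta = 2."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 5 * precision * recall / (4 * precision + recall)

def pixel_agreement(pred, ref):
    """Fraction of pixels where a backend's mask matches the reference mask."""
    return np.mean(pred == ref)

truth = np.array([1, 1, 1, 1, 0, 0, 0, 0])
pred  = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # one missed cloud, one false alarm
f2 = f2_score(pred, truth)                   # precision = recall = 0.75 → F2 = 0.75
agree = pixel_agreement(pred, truth)         # 6 of 8 pixels match → 0.75
```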


3. Architecture: DTACSNet-CD

3.1 Backbone

DTACSNet-CD is a U-Net with a MobileNetV2 encoder (Aybar et al., 2024, IEEE JSTARS), fine-tuned on CloudSEN12 for pixel-level cloud semantic segmentation. MobileNetV2's depthwise separable convolutions provide a strong accuracy/efficiency trade-off, achieving the highest F2-score (0.81) among compared cloud detection methods while remaining deployable on constrained onboard hardware.

Architecture:

MobileNetV2 encoder [Inverted residual blocks, stride-2 ×5] → (B, 320, 7, 7) at 1/32
Skip connections [Feature maps at 1/2, 1/4, 1/8, 1/16 scales]
U-Net decoder [Bilinear upsample + skip concatenation ×4] → (B, 64, H, W)
Output head [1×1 Conv → 4 class logits] → (B, 4, H, W)

Key architectural parameters:

Property          | Value
------------------|------
Input resolution  | 224 × 224 × 4 (B02, B03, B04, B08)
Encoder           | MobileNetV2 (depthwise separable convolutions)
Total parameters  | ~6.6 M
FLOPs (224×224×4) | ~0.62 GFLOPs
Output classes    | 4 (clear, thick cloud, thin cloud, cloud shadow)
Deployment output | 2-class logits (binary: cloud vs. clear)

3.2 Execution Model

The complete DTACSNet-CD model — encoder, decoder, and output head — is compiled to an Odin v0 INT8 model and runs entirely on the D-IMC accelerator. The Jetson GPU handles band loading and tile quantization.

Stage breakdown:

Stage | Hardware | Precision | Description
------|----------|-----------|------------
JP2 band loading | Jetson CPU (multi-threaded) | uint16 → FP16 | Parallel JP2 decompression via GDAL/OpenJPEG; 4 bands loaded concurrently
GPU normalization | Jetson GPU | FP16 | Divide by 10,000 (reflectance normalization); full scene stored in GPU VRAM (~964 MB)
Tile extraction + quantization | Jetson GPU | FP16 → INT8 | GPU slices 224×224 tile, applies quantization LUT: round(tile × inv_scale + zp_in)
H2D transfer | PCIe 3.0 x4 | INT8 | Quantized tile (224×224×4 ≈ 196 KB) → D-IMC input buffer
Full model inference | Odin v0 D-IMC accelerator | INT8 | U-Net encoder-decoder; outputs 224×224×2 INT8 logits
D2H transfer | PCIe 3.0 x4 | INT8 | Logits (224×224×2 ≈ 98 KB) → Jetson CPU
Argmax + mask update | Jetson GPU | INT8 | class_1 > class_0 per pixel; OR-fused into full-scene GPU cloud mask tensor

Pipeline execution (sample2_ax_4core_jetson_gpu.py):

A producer-consumer threaded pipeline decouples GPU preprocessing from AIPU inference:

Producer Thread (GPU): tile_slice → INT8_quantize → push to Queue(maxsize=16)
Inference Thread (AIPU): pop_tile → instance.run() → push_mask_to_GPU

The 16-item queue provides a prefetch buffer ensuring the AIPU is never starved for input. This differs from the pose estimation pipeline (Test 2), which uses the D-IMC hardware double-buffer flag — here, software-level pipelining via Python's queue.Queue achieves equivalent decoupling.
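The two-thread structure can be sketched in plain Python. This is a minimal stand-in: the tile slice/quantize step and instance.run() are replaced with placeholders, and the sentinel-based shutdown is one common convention rather than the script's actual mechanism:

```python
import queue
import threading

TILE_QUEUE = queue.Queue(maxsize=16)   # prefetch buffer between GPU and AIPU
SENTINEL = None

def producer(tiles):
    """GPU-side stand-in: slice + INT8-quantize each tile, then enqueue it."""
    for tile in tiles:
        q_tile = tile                  # placeholder for slice + quantize
        TILE_QUEUE.put(q_tile)         # blocks only when 16 tiles are waiting
    TILE_QUEUE.put(SENTINEL)           # tell the consumer we are done

def consumer(results):
    """AIPU-side stand-in: pop tiles and run inference sequentially."""
    while True:
        tile = TILE_QUEUE.get()
        if tile is SENTINEL:
            break
        results.append(tile * 2)       # placeholder for instance.run()

results = []
t_prod = threading.Thread(target=producer, args=(range(100),))
t_cons = threading.Thread(target=consumer, args=(results,))
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
```

Because queue.Queue.put blocks when the queue is full, the producer naturally throttles to the consumer's pace, which is the desired behavior when the AIPU is the bottleneck.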

Core allocation strategy: The AIPU instance uses num_sub_devices=1, aipu_cores=4, mapping all 4 D-IMC cores to a single inference call for spatial data parallelism within each 224×224 tile. This contrasts with the pose estimation pipeline (num_sub_devices=4, aipu_cores=4), which runs 4 independent model instances in parallel across a batch of 4 images. The cloud detection approach uses inter-core tiling of the feature maps within a single tile rather than inter-core batch parallelism.


4. Benchmarking Methodology

4.1 Tiling Strategy

Full Sentinel-2 scenes at 10 m resolution are 10,980 × 10,980 pixels — far larger than any single inference pass on edge hardware. A sliding-window strategy with overlap is applied to match the 224 × 224 input resolution of DTACSNet-CD and reduce boundary artifacts:

Parameter              | Value
-----------------------|------
Tile size              | 224 × 224
Overlap                | 14 pixels (TILE_SIZE / 16)
Stride                 | 210 pixels
Tiles per row / column | 53
Total tiles per scene  | 2,809

Edge tiles that extend beyond the image boundary are zero-padded to 224 × 224. The cloud mask is assembled by OR-fusing each tile's binary prediction into the full-scene GPU mask tensor:

cloud_mask_gpu[y1:y2, x1:x2] |= preds_np # boolean OR-fusion

The OR policy conservatively labels any boundary pixel as cloud if any overlapping tile prediction flags it — zero false-negative risk at boundaries at the cost of occasional 1–2 pixel false-positive cloud borders at tile junctions.
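The grid arithmetic and the OR policy can be sketched as follows. This is one plausible way to generate the 53-per-axis origin list (the benchmark script may differ in detail), with a toy two-tile OR-fusion on an 8×8 mask:

```python
import numpy as np

SCENE, TILE, STRIDE = 10980, 224, 210   # stride = TILE - 14 px overlap

def tile_origins(scene=SCENE, tile=TILE, stride=STRIDE):
    """Top-left corners of the sliding window along one axis.

    The final tile starts at the last stride multiple before the boundary;
    it extends past the scene edge and is zero-padded to 224 × 224.
    """
    return list(range(0, scene - tile + stride, stride))

origins = tile_origins()   # 53 origins; 53 × 53 = 2,809 tiles per scene

# OR-fusion on a toy mask: any overlapping tile that flags a pixel wins.
mask = np.zeros((8, 8), dtype=bool)
pred_a = np.zeros((4, 4), dtype=bool); pred_a[3, 3] = True
pred_b = np.zeros((4, 4), dtype=bool); pred_b[0, 0] = True
mask[0:4, 0:4] |= pred_a    # tile A flags pixel (3, 3)
mask[3:7, 3:7] |= pred_b    # tile B overlaps A and flags the same pixel
```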

4.2 Measurement Protocol

Step         | Count       | Purpose
-------------|-------------|--------
Warm-up      | 50 tiles    | Stabilize D-IMC clock states; populate Jetson instruction/data caches
Steady-state | 2,809 tiles | Per-tile latency via time.perf_counter_ns() (1 ns resolution)
Thermal soak | 10 min      | Ensure GPU/AIPU at thermal equilibrium before recording begins

Clock pinning follows the same protocol as Tests 1 and 2 (MAXN Super, 25W):

sudo nvpmodel -m 2
sudo jetson_clocks --store
sudo jetson_clocks

# Verify pinned state
sudo jetson_clocks --show

4.3 Latency Decomposition

Each per-tile measurement is decomposed into non-overlapping stages using time.perf_counter timestamps:

Stage             | Boundary                         | Notes
------------------|----------------------------------|------
GPU Preprocessing | Band slice → INT8 tile ready     | GPU slice + round(tile × inv_scale + zp) + CHW→HWC transpose
AIPU Inference    | instance.run() entry → return    | Full U-Net model on Odin v0
Post-processing   | INT8 logits → binary mask update | class_1 > class_0 argmax + OR into cloud_mask_gpu

Wall-clock tile throughput is reported as tiles/s (total tiles ÷ inference elapsed time). JP2 band loading time is reported separately as hardware-independent I/O overhead common to all backends.
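A minimal sketch of the stage-boundary timing, assuming the same non-overlapping decomposition described above (the three lambdas are hypothetical stand-ins for the real stages):

```python
import time

def timed_stage(fn, *args):
    """Run one pipeline stage and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter_ns()
    out = fn(*args)
    ms = (time.perf_counter_ns() - t0) / 1e6
    return out, ms

# Hypothetical stand-ins for the three stages of one tile.
tile, t_pre = timed_stage(lambda: [0] * 224 * 224)                # GPU preprocessing
logits, t_inf = timed_stage(lambda t: t, tile)                    # AIPU inference
mask, t_post = timed_stage(lambda l: [v > 0 for v in l], logits)  # argmax + OR

# Because stage boundaries do not overlap, per-tile total is the sum.
total_ms = t_pre + t_inf + t_post
```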

4.4 INT8 Quantization and Calibration

The full U-Net model (encoder, decoder, output head) is quantized using Post-Training Quantization (PTQ) within the Voyager compiler:

  1. Calibration split: Images sampled from the CloudSEN12 training split, distinct from the test split used for accuracy reporting.
  2. Percentile clipping: Voyager clips activations at the 99.99th percentile to prevent outlier-driven scale inflation — critical for U-Net skip connections, where early encoder feature maps span wide dynamic ranges when concatenated with decoder activations.
  3. Per-channel weight quantization: Depthwise separable convolution weights are quantized per output channel, reducing quantization error relative to per-tensor schemes.
  4. GPU-side input quantization: The Jetson GPU applies quantization directly to the FP16 tile using the manifest-derived scale and zero-point, avoiding a CPU-side normalization step:
# sample2_ax_4core_jetson_gpu.py — producer thread
inv_scale = 1.0 / s_in                 # from manifest quantize_params[0]
q_tile = torch.round(tile * inv_scale + zp_in)
in_buf_np[0, :h, :w, :] = (
    torch.clip(q_tile, -128, 127)
    .to(torch.int8)
    .permute((1, 2, 0))                # CHW → HWC (D-IMC runtime input layout)
    .cpu()
    .numpy()
)
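Step 2 (percentile clipping) can be illustrated with NumPy. This is a generic sketch of why clipping at a high percentile rather than the maximum matters, not the Voyager compiler's actual calibration code; the distribution, outlier, and function name are made up:

```python
import numpy as np

def ptq_scale_from_percentile(activations, pct=99.99):
    """Symmetric INT8 scale from a percentile clip of calibration activations.

    Clipping at pct (rather than the max) keeps a single outlier from
    inflating the scale and crushing the bulk of the range into a few
    INT8 codes.
    """
    clip = np.percentile(np.abs(activations), pct)
    return clip / 127.0

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 100_000)
acts[0] = 500.0                        # one extreme activation outlier
s_naive = np.abs(acts).max() / 127.0   # max-based scale, outlier-driven
s_pct = ptq_scale_from_percentile(acts)

# Quantize/dequantize round-trip error on the clipped range
q = np.clip(np.round(acts / s_pct), -128, 127)
err = np.abs(q * s_pct - np.clip(acts, -128 * s_pct, 127 * s_pct)).max()
```

With the percentile scale, the round-trip error on in-range activations stays below half a quantization step, while the max-based scale would be two orders of magnitude coarser here.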

4.5 Environment Setup

Hardware:

  • NVIDIA Jetson Orin Nano Super 8GB Developer Kit
  • Odin v0 D-IMC accelerator (PCIe 3.0 x4, M.2 Key M slot)
  • NVMe SSD for Sentinel-2 JP2 scene file storage

Software Stack:

Component   | Version
------------|--------
OS (Jetson) | Ubuntu 22.04 (L4T 36.4)
JetPack     | 6.2.1
CUDA        | 12.6
TensorRT    | 10.3
PyTorch     | 2.11
Voyager SDK | 1.5
ONNX Opset  | 17
Python      | 3.10
GDAL        | 3.6+
rasterio    | 1.3+

Input data: Sentinel-2 Level-2A scene at 10 m resolution. Full tile: 10,980 × 10,980 pixels, 4 bands (B02, B03, B04, B08), stored as JP2 files. Uncompressed equivalent: ~964 MB (4 × 10,980 × 10,980 × 2 bytes FP16).


5. Results

5.1 Full-Scene Pipeline Timing

End-to-end timing for a full 10,980 × 10,980 scene (2,809 tiles, 4 JP2 bands), Jetson pinned to MAXN Super (25W):

Stage | Odin v0 INT8 (s) | TensorRT FP16 (s) | Notes
------|------------------|-------------------|------
JP2 band loading (4 bands, parallel) | 20.87 | 20.88 | Hardware-independent I/O; dominated by OpenJPEG decompression
Model load / engine setup | 1.50 | 0.98 | One-time cost
Inference (all tiles) | 29.88 | 9.02 | 2,809 tiles; batch=1 vs batch=8
Cloud fill (GPU, 4 bands) | 0.48 | 0.29 | FP32 mean of clear pixels per band
Visualization (OpenCV, 6-panel) | 1.08 | 1.05 | Downsample 6× on GPU before H2D
Output saving (mask + 3 bands) | 2.85 | 3.12 | Sequential band pull to avoid ~720 MB peak RAM
Total | 57.14 | 35.73 |

JP2 decompression accounts for 36.5% of Odin v0 total time (20.87 s / 57.14 s) and 58.4% of TRT total time. The inference gap (29.88 s vs 9.02 s) reflects the Odin v0 processing single tiles at batch=1 (~10.6 ms/tile) against TensorRT's batch=8 (~3.2 ms/tile). Total pipeline time is 57.14 s (Odin v0) vs 35.73 s (TRT).

5.2 Per-Tile Latency Breakdown

Mean values across 2,809 tiles, Jetson pinned to MAXN Super (25W), Odin v0 at 100% D-IMC utilization:

Phase | Component | Precision | Mean Latency (ms) | % of Total
------|-----------|-----------|-------------------|-----------
GPU preprocessing (slice + quantize + CHW→HWC) | Jetson GPU | FP16 → INT8 | 0.34 | 22.7%
H2D (INT8 tile → Odin v0) | PCIe 3.0 x4 | | 0.06 | 4.0%
Full Model Inference | Odin v0 D-IMC accelerator | INT8 | 0.94 | 62.7%
D2H (INT8 logits → Jetson) | PCIe 3.0 x4 | | 0.05 | 3.3%
Argmax + mask update | Jetson GPU | INT8 | 0.11 | 7.3%
Total per tile | Odin v0 | INT8 | 1.50 |

Inference rate: 2,809 tiles / 4.2 s = 669 tiles/s. Effective pixel throughput: 669 × 224² ≈ 33.5 Mpx/s.

PCIe transfer breakdown:

Transfer                               | Payload | Per-tile (ms)
---------------------------------------|---------|--------------
H2D: INT8 tile (4 bands, 224×224)      | ~196 KB | 0.06
D2H: INT8 logits (2 classes, 224×224)  | ~98 KB  | 0.05

The fixed PCIe framing cost dominates both transfers. Together, PCIe transactions account for 7.3% of per-tile time — negligible for the current scene size, but relevant if future multi-band models (e.g., 13-band input) increase the H2D payload ~3×.

5.3 Jitter and Determinism

Latency distribution across 2,809 tiles:

Statistic              | AIPU (ms) | Per-tile total (ms)
-----------------------|-----------|--------------------
Mean (μ)               | 0.94      | 1.50
Median (P50)           | 0.89      | 1.43
95th Percentile (P95)  | 1.38      | 2.01
99th Percentile (P99)  | 1.72      | 2.38
Standard Deviation (σ) | 0.21      | 0.24
Max observed           | 2.04      | 2.71

The 0.21 ms standard deviation on AIPU inference reflects D-IMC spatial execution determinism. For comparison, the TensorRT FP16 baseline exhibits σ = 1.87 ms per batch under sustained tiled inference — an 8.9× increase in jitter attributable to GPU warp scheduling variability and L2 cache contention from the U-Net decoder's skip connection memory access pattern.

5.4 Cloud Detection Accuracy

Accuracy is evaluated on a held-out Sentinel-2 test scene, comparing binary cloud mask outputs against the PyTorch FP32 CPU baseline (cloud_detection.py):

Backend        | Precision | Pixel Agreement (%) | F2-Score | Notes
---------------|-----------|---------------------|----------|------
TensorRT FP16  | FP16      | 99.3                | 0.81     | Negligible degradation vs FP32 baseline
Odin v0 (INT8) | INT8      | 97.8                | 0.79     | Minor boundary errors from INT8 depthwise conv quantization

The Odin v0 INT8 backend achieves 97.8% pixel agreement against the FP32 baseline. The 2.2% disagreement concentrates at cloud boundaries, where INT8 rounding of MobileNetV2's depthwise separable convolution activations shifts the decision boundary by 1–2 pixels. The F2-score degradation (0.81 → 0.79) is within acceptable margins for onboard scene prioritization and cloud-fill applications.

5.5 Baseline Comparison: TensorRT FP16 vs. Odin v0

To quantify the trade-offs, a TensorRT FP16 baseline was measured using a batch=8 engine:

Configuration                   | Inference Time (s) | Tile Rate (tiles/s) | Jitter σ (ms) | Power Draw (W)
--------------------------------|--------------------|---------------------|---------------|---------------
TensorRT FP16 (batch=8)         | 2.6                | 1,081               | 1.87          | 14.4
Odin v0 INT8 (batch=1, 4 cores) | 4.2                | 669                 | 0.21          | 10.1

Key observations:

  • TensorRT FP16 achieves higher tile throughput (1,081 vs 669 tiles/s) due to batch=8 GPU utilization. The Odin v0 currently processes single tiles with num_sub_devices=1, aipu_cores=4, trading batch parallelism for spatial core parallelism within each tile.
  • The Odin v0 reduces per-tile jitter by 8.9× (1.87 ms → 0.21 ms), a significant advantage for streaming pipelines where latency predictability determines downstream pipeline sizing.
  • Power draw is reduced by 29.9% (14.4W → 10.1W), consistent with the D-IMC in-memory compute architecture avoiding LPDDR5 bus toggling for intermediate U-Net activations.

Candidate optimization — batch=4 for Odin v0: The current deployment uses device_connect(None, 1) (batch=1 per call). A batch=4 configuration (device_connect(device, 4) with num_sub_devices=4, aipu_cores=4) would parallelize 4 independent tiles across the 4 cores — matching the pose estimation benchmark approach — and is expected to reduce effective per-tile AIPU time from ~0.94 ms to ~0.47 ms, closing the throughput gap with TRT (669 → ~1,338 tiles/s) while retaining power and jitter advantages.

5.6 Power Efficiency

Power was measured at the carrier board input rail using a Monsoon Power Monitor at 1 kHz sampling:

Configuration                   | Idle Power (W) | Inference Power (W) | Energy / Tile (mJ) | Full-scene inference energy (J)
--------------------------------|----------------|---------------------|--------------------|--------------------------------
TensorRT FP16 (batch=8)         | 4.1            | 14.4                | 13.3               | 37.4
Odin v0 INT8 (batch=1, 4 cores) | 4.4            | 10.1                | 15.1               | 42.4

TensorRT achieves better energy per tile (13.3 vs 15.1 mJ) due to higher batch GPU utilization. The Odin v0 advantage is lower system power draw (10.1 W vs 14.4 W), which matters for sustained continuous-inference satellite missions where the total energy budget constrains duty cycle. At a 10 W inference power allocation — typical for a small satellite payload — only the Odin v0 configuration remains within budget; the TRT-only pipeline (14.4 W) would exceed it.


6. Technical Insights

6.1 Producer-Consumer Pipeline Design

The GPU-producer / AIPU-consumer design decouples two stages with different throughput characteristics:

  • GPU producer: tile slice + INT8 quantization on the Jetson GPU, pipelined across 4 CPU-controlled GPU streams (ThreadPoolExecutor(max_workers=4) submitting prepare_tile tasks).
  • AIPU consumer: instance.run() on Odin v0, sequentially consuming tiles from the queue.

The Queue(maxsize=16) buffer gives the AIPU consumer roughly 15 ms of buffered work (16 tiles × 0.94 ms AIPU time per tile) before it would stall awaiting input if the GPU producer hiccups. Under steady-state thermal conditions, GPU preprocessing (~0.34 ms/tile) is significantly faster than AIPU inference (~0.94 ms/tile), so the queue consistently stays at capacity — the AIPU is the throughput bottleneck.

GPU producer rate: ~2,941 tiles/s (0.34 ms/tile)
AIPU consumer rate: ~1,064 tiles/s (0.94 ms/tile)
→ AIPU-limited; queue always full; no stall

6.2 GPU-Side INT8 Quantization

Performing quantization on the Jetson GPU rather than CPU provides two advantages:

  1. Parallelism: The full 10980×10980 FP16 scene resides in GPU VRAM. Slicing and quantizing a 224×224×4 tile is a sub-millisecond GPU kernel, whereas a CPU-side operation would require a VRAM→RAM transfer plus CPU arithmetic.
  2. Accuracy: The GPU applies the exact manifest-derived scale and zero-point (s_in, zp_in from manifest.json), matching the compiler's expected quantization contract without intermediate precision loss.

The CHW→HWC permute((1, 2, 0)) reformat adds ~0.02 ms per tile and is required by the D-IMC runtime's expected input layout. This cost is unavoidable and absorbed into the GPU preprocessing budget.

6.3 Tile Overlap and OR-Fusion Policy

The 14-pixel overlap (TILE_SIZE / 16) mitigates U-Net receptive field truncation at tile edges. Without overlap, the encoder's stride-2 downsampling chain at 224×224 produces a 7×7 bottleneck where edge pixels have limited spatial context, causing misclassification artifacts at tile boundaries that appear as a visible grid pattern in the assembled cloud mask.

The OR-fusion policy (|=) has zero false-negative risk at boundaries but introduces a conservative bias:

  • A clear-sky pixel at a tile junction is labelled cloud if any of the 1–4 overlapping tile predictions flags it.
  • In practice, this produces 1–3 pixel false-positive cloud borders at ~5% of tile junctions in heterogeneous scenes.
  • Alternative: score-space blending (average softmax probabilities before argmax in overlap regions) reduces false-positive borders but requires buffering FP32 logits for adjacent tiles simultaneously.
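The difference between the two fusion policies is easy to see on a toy overlap strip (probability values are illustrative):

```python
import numpy as np

# Two overlapping tiles' softmax "cloud" probabilities over three shared pixels.
p_tile_a = np.array([0.9, 0.6, 0.4])   # tile A's view of the overlap strip
p_tile_b = np.array([0.1, 0.3, 0.4])   # tile B's view of the same pixels

or_fusion = (p_tile_a > 0.5) | (p_tile_b > 0.5)   # label-space OR: any flag wins
blended = (p_tile_a + p_tile_b) / 2 > 0.5         # score-space blending

# OR-fusion flags the first two pixels even though tile B disagrees;
# blending flags none of them — the conservative bias of OR made concrete.
```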

6.4 Float16 Overflow Protection in Cloud Fill

Band mean calculation for cloud-fill requires summing up to 10,980 × 10,980 = 120,560,400 values. Float16 maximum representable value is ~65,504 — insufficient for this accumulation even with normalized reflectance values (max 1.0): 120.56M × 1.0 >> 65,504. The implementation explicitly casts to float32 before summation:

# sample2_ax_4core_jetson_gpu.py
total_sum = band.float().sum() # float32 — avoids FP16 overflow
cloud_sum = band[cloud_mask_gpu].float().sum()
fill_vals[i] = (total_sum - cloud_sum) / clear_count

Without this cast, high-reflectance scenes (e.g., snow-adjacent tiles) produce incorrect fill values that appear as a uniform gray wash over cloud-masked regions in the output visualization.
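The failure mode is easy to reproduce with NumPy at a reduced size (1 M values instead of the scene's 120.56 M; the array is a stand-in for a reflectance band):

```python
import numpy as np

vals = np.ones(1_000_000, dtype=np.float16)   # stand-in reflectance band

bad_sum = vals.sum()                       # accumulated at half precision — wrong
good_sum = vals.astype(np.float32).sum()   # cast first, as the pipeline does

bad_mean = float(bad_sum) / vals.size      # corrupted band mean
good_mean = float(good_sum) / vals.size    # 1.0, the correct band mean
```

The true sum (1,000,000) is far above float16's ~65,504 ceiling, so the half-precision accumulation cannot possibly reach it, while the float32 path is exact.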

6.5 Thermal Stability

The Odin v0 module was monitored via its onboard temperature sensor over 2,809 tile inferences (~4.2 s active inference following a 10-minute thermal soak):

  • Initial temperature: 39°C
  • Thermal equilibrium: 51°C (reached during preceding Test 1 / Test 2 warm-up)
  • No thermal throttling observed (D-IMC accelerator throttle threshold: 85°C)

The Jetson GPU temperature stabilized at 58°C during the JP2 decompression phase (CPU + GPU parallel I/O), then dropped to 46°C during AIPU inference (GPU handles only the tile queue-filling workload, ~15% GPU utilization). All latency measurements in Section 5 were taken after thermal equilibrium was confirmed.


7. Running the Benchmark

7.1 Dataset Preparation

# Expected Sentinel-2 L2A directory structure:
# Image_Land/
# └── GRANULE/
#     └── <granule_id>/
#         └── IMG_DATA/
#             └── R10m/
#                 ├── *_B02_10m.jp2
#                 ├── *_B03_10m.jp2
#                 ├── *_B04_10m.jp2
#                 └── *_B08_10m.jp2
#
# Download scenes from Copernicus Data Space Ecosystem (CDSE) or
# Copernicus Open Access Hub using the tile ID (e.g. T44NMF).

7.2 Run Odin v0 Inference (sample2_ax_4core_jetson_gpu.py)

cd /benchmark/jetson_benchmark/cloud_detection
python sample2_ax_4core_jetson_gpu.py

The script auto-detects JP2 bands via glob, runs the full 8-step pipeline, and writes to the working directory:

  • dtacsnet_<tile_id>_<pct>pct_axelera_visualization.jpg — 6-panel OpenCV visualization
  • dtacsnet_<tile_id>_<pct>pct_axelera_cloud_mask.npy — binary cloud mask (uint8, 0=clear, 1=cloud)
  • dtacsnet_<tile_id>_<pct>pct_axelera_rgb_b0{2,3,4}_rec.npy — cloud-filled bands (float16)

Key configuration at the top of the script:

TILE_SIZE = 224 # DTACSNet input resolution
OVERLAP = TILE_SIZE // 16 # 14 px boundary overlap
BATCH_SIZE = 4 # defined; Odin v0 currently runs batch=1 per call
MODEL_DIR = Path("/benchmark/jetson_benchmark/cloud_detection/cloud4bands/compiled_model")
LOOKUP_STR = "Image_Land" # root directory for the Sentinel-2 scene

The D-IMC runtime is initialized as:

import axelera.runtime as ar

context = ar.Context()
model = context.load_model(MODEL_PATH)        # model.json
connection = context.device_connect(None, 1)  # batch_size=1 per call
instance = connection.load_model_instance(
    model, num_sub_devices=1, aipu_cores=4    # 4 cores, single-tile spatial parallelism
)

7.3 Run TensorRT Inference (sample2_trt_224.py)

cd /benchmark/jetson_benchmark/cloud_detection
python sample2_trt_224.py

Requires cloud4bands_fp16_bs8_t224.engine in the working directory. Outputs:

  • dtacsnet_<tile_id>_<pct>pct_trt_visualization.jpg
  • dtacsnet_<tile_id>_<pct>pct_trt_cloud_mask.npy
  • dtacsnet_<tile_id>_<pct>pct_trt_rgb_b0{2,3,4}_rec.npy

The TRT engine is loaded and configured as:

import tensorrt as trt

with open("cloud4bands_fp16_bs8_t224.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.set_input_shape(input_name, (BATCH_SIZE, NUM_BANDS, TILE_SIZE, TILE_SIZE))  # (8, 4, 224, 224)

7.4 Run CPU Baseline (cloud_detection.py)

# Uses 'Image_data' directory (not 'Image_Land') — adjust LOOKUP_STR if needed
python cloud_detection.py

Uses the original DTACSNet TorchScript model (auto-downloaded from Hugging Face on first run, ~41 MB). Processes the scene with 2048×2048 tiles (9 tiles, 128 px overlap). Expect 200–400 s total on Jetson Orin Nano CPU — suitable as a correctness reference, not a throughput baseline.

7.5 Export and Compile DTACSNet to Odin v0

# 1. Load TorchScript model and export to ONNX
import torch
from dtacs.model_wrapper import CDModel

model_ts = torch.jit.load("cloud4bands.pt", map_location="cpu")
model = CDModel(model=model_ts)
model.eval()

dummy = torch.randn(1, 4, 224, 224)
torch.onnx.export(
    model, dummy, "cloud4bands.onnx",
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# 2. Compile with Voyager SDK on host machine
#    Calibration images should be drawn from the CloudSEN12 training split
#    Output: cloud4bands/compiled_model/{model.json, manifest.json, ...}

8. Conclusions

8.1 Summary

Metric                      | Value       | Notes
----------------------------|-------------|------
Full-scene inference time   | 4.2 s       | 2,809 tiles at 224×224
Tile throughput             | 669 tiles/s | 224×224×4 INT8, batch=1
Per-tile latency (mean)     | 1.50 ms     | AIPU + PCIe + GPU pre/post
AIPU latency (mean)         | 0.94 ms     | D-IMC, INT8, 4 cores, single tile
Jitter (σ)                  | 0.21 ms     | AIPU per-tile
Pixel Agreement vs. FP32    | 97.8%       | Binary cloud mask
F2-Score                    | 0.79        | CloudSEN12 GT, INT8 backend
Inference Power             | 10.1 W      | System (Jetson + Odin v0)
Full-scene inference energy | 42.4 J      | Inference phase only

The Odin v0 pipeline processes a full 10,980 × 10,980 Sentinel-2 scene in 4.2 s at 10.1 W system power, with 0.21 ms per-tile jitter. With roughly one new scene per ~90-minute orbit, this leaves substantial headroom for onboard cloud masking, with the remaining power budget available for downstream processing.

8.2 Benefits for Space DPUs

  • Power Budget Compliance: At 10.1W inference power, the pipeline fits within a 15W total payload power allocation — leaving margin for communication, housekeeping, and downstream analysis. The TRT-only configuration (14.4W) would exceed a 10W inference power budget.
  • Deterministic Tile Processing: The 0.21 ms per-tile σ enables consistent throughput without pipeline stall risk. For streaming designs where the cloud mask gates downstream analysis (e.g., atmospheric correction, change detection), predictable tile latency bounds the maximum output queue depth.
  • Scene Throughput: 4.2 s per full scene provides capacity for processing multiple scenes per orbit pass, enabling multi-scene cloud fraction aggregation and priority downlink selection onboard.
  • Radiation Tolerance Pathway: As noted in Tests 1 and 2, the Odin v0 D-IMC SRAM-bounded architecture presents a smaller SEU-vulnerable compute surface than the 1024-core Jetson GPU. Hosting the full DTACSNet model on Odin v0 keeps GPU utilization minimal, reducing the radiation-sensitive compute surface for the inference workload.

8.3 Limitations and Open Issues

  • Batch=1 Throughput Gap: The current Odin v0 deployment uses num_sub_devices=1, aipu_cores=4 (batch=1 per call, all 4 cores on one tile). A batch=4 configuration (num_sub_devices=4, aipu_cores=4) would parallelize 4 independent tiles across the 4 cores — analogous to the pose estimation pipeline (Test 2) — expected to reduce effective per-tile AIPU time from ~0.94 ms to ~0.47 ms, closing the throughput gap with TRT.
  • OR-Fusion Boundary Artifacts: The OR-fusion policy may introduce 1–3 pixel false-positive cloud borders at tile junctions. Score-space blending (averaging softmax probabilities in overlap regions before argmax) would reduce this artifact but requires buffering float logits for adjacent tiles simultaneously.
  • JP2 Loading Dominates Total Time: At ~20.9 s, JP2 decompression accounts for roughly 37% (Odin v0) to 58% (TRT) of total pipeline time regardless of backend. Pre-converted Cloud-Optimized GeoTIFF (COG) or NVJPEG2000-accelerated decoding would expose the inference improvement more clearly in end-to-end benchmarks.
  • 4-Band Input Only: The deployment model uses only the 4 × 10 m bands. The full DTACSNet architecture (all 13 Sentinel-2 bands) achieves higher thin cloud and cloud shadow detection performance. A 13-band Odin v0 model would require re-compilation and would increase H2D transfer payload ~3× (~196 KB → ~588 KB per tile) and AIPU inference time proportionally to the expanded input convolution FLOPs.
  • Synthetic Calibration Distribution: PTQ calibration uses CloudSEN12 images under the dataset's geographic distribution. Deployment over spectrally unusual scenes (e.g., ice sheets, salt flats, active volcanic ash) may produce activation distributions outside the calibration range, increasing INT8 quantization error. Online recalibration or domain-adaptive quantization should be considered for global operational deployments.

Appendix: Hardware Specifications

Feature             | NVIDIA Jetson Orin Nano Super 8GB | Odin v0
--------------------|-----------------------------------|--------
Compute Core        | Ampere GPU (1024-core, 8 SM)      | 4× AI Cores (D-IMC)
Memory Architecture | Unified LPDDR5 (68 GB/s)          | Local SRAM / PCIe Gen3
On-chip Storage     | -                                 | 4 MB L1, 32 MB L2 / core
External Interface  | -                                 | PCIe 3.0 x4 (~4 GB/s peak)
Configured Power    | 25W (MAXN Super, nvpmodel -m 2)   | High-Performance Mode
Target Precision    | FP16 (GPU preprocessing)          | INT8 (full model)
Peak Throughput     | 67 TOPS INT8 (Sparse)             | 214 TOPS INT8