Benchmarking the Cost of Adaptation: A System-Level Analysis of Test-Time Adaptation on Edge Devices

The widespread deployment of deep neural networks on edge devices has exposed a significant vulnerability: performance degrades under distribution shift. Static models fail when the statistical properties of the test data differ from those of the training data because of environmental changes, sensor noise, or domain shift. Test-time adaptation (TTA) addresses this by letting a model continuously adjust its parameters during inference using unlabeled data streams. However, the existing TTA literature focuses primarily on functional metrics such as accuracy and often overlooks the severe resource constraints of edge computing. This study addresses that gap with a system-level analysis of TTA that measures its hidden costs in memory footprint, computational latency, and energy consumption.

Purpose

edge-tta evaluates whether online TTA methods are practical outside datacenter settings, where latency, memory, and energy are as important as robustness.

The repository focuses on:

comparing adaptation methods under distribution shift (corruptions)
measuring end-to-end runtime cost on edge devices (Raspberry Pi 5, Jetson Orin)
exposing trade-offs between accuracy gains and system overhead

Methodology

Benchmarked methods

The benchmark includes representative online TTA baselines and adaptive methods implemented in edge_tta/methods/, including:

no_adapt - No Adaptation baseline
adabn - Revisiting Batch Normalization For Practical Domain Adaptation
t3a - Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization
lame - Parameter-Free Online Test-Time Adaptation
pl (pseudo-label) - Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
shot - Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation
tent - Tent: Fully Test-Time Adaptation by Entropy Minimization
eata - Efficient Test-Time Model Adaptation without Forgetting
sar - Towards Stable Test-Time Adaptation in Dynamic Wild World
cotta - Continual Test-Time Domain Adaptation
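
As its title says, tent adapts by minimizing the entropy of the model's own predictions. The snippet below is a minimal, dependency-free sketch of that quantity for a batch of logits. It is not the code in edge_tta/methods/, and the function names are hypothetical. A real implementation would compute this on tensors and backpropagate through it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_batch_entropy(batch_logits):
    """Average entropy over a batch: the value an entropy-minimizing method drives down."""
    return sum(prediction_entropy(row) for row in batch_logits) / len(batch_logits)


confident = [10.0, 0.0, 0.0]  # one class dominates, so entropy is low
uncertain = [1.0, 1.0, 1.0]   # uniform prediction, so entropy is ln(3)
print(prediction_entropy(confident) < prediction_entropy(uncertain))
```

A confident prediction has entropy near zero, while a uniform one has the maximum entropy for the number of classes, so lowering this value pushes the model toward confident outputs on the shifted data.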

Evaluation protocol

The main evaluation loop is implemented in main.py and runs each method over multiple corruption types (noise, blur, weather, and digital corruptions).

For each corruption, the pipeline:

loads the corrupted dataset split (CIFAR-10-C, CIFAR-100-C, or ImageNet-C)
performs online adaptation/inference
records task metrics: Top-1/Top-5 accuracy and ECE
records system metrics with PerformanceMonitor
resets adaptation state before the next corruption
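
The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the code in main.py: `CountingMethod`, `evaluate`, and the `(label, confidence)` output format are hypothetical, system metrics are omitted, and ECE is computed with equal-width confidence bins, which is one common definition.

```python
def top1_accuracy(preds, labels):
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(int(ok) for _, ok in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


class CountingMethod:
    """Hypothetical stand-in for a TTA method: always predicts class 0."""

    def __init__(self):
        self.steps = 0  # stands in for adapted state (weights, statistics, ...)

    def reset(self):
        self.steps = 0

    def predict(self, batch):
        self.steps += 1
        return [(0, 0.9) for _ in batch]  # (predicted label, confidence)


def evaluate(method, corruptions):
    """`corruptions` maps a corruption name to a list of (batch, labels) pairs."""
    results = {}
    for name, batches in corruptions.items():
        preds, confs, labels = [], [], []
        for batch, ys in batches:
            out = method.predict(batch)  # online: handle each batch as it arrives
            preds += [p for p, _ in out]
            confs += [c for _, c in out]
            labels += ys
        correct = [p == y for p, y in zip(preds, labels)]
        results[name] = {
            "top1": top1_accuracy(preds, labels),
            "ece": expected_calibration_error(confs, correct),
        }
        method.reset()  # no adapted state leaks into the next corruption
    return results
```

Resetting after each corruption means every corruption type is measured from the same source model rather than from state accumulated on earlier ones.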

System metrics collected

The monitoring stack tracks:

timing: batch wall time, forward time, adaptation time, throughput
memory: CPU memory, GPU memory (where available), activation memory, peak RAM
energy: total energy and energy per sample with device-aware backends
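
To make the relationship between these quantities concrete, here is a rough, standard-library-only sketch of a per-batch monitor. It is not the repository's PerformanceMonitor: `BatchMonitor` and its attributes are hypothetical, `tracemalloc` only sees Python-level allocations (not framework or GPU memory), and the average power is assumed to be supplied by an external reading rather than a device-aware backend.

```python
import time
import tracemalloc


class BatchMonitor:
    """Hypothetical context manager that times one batch and derives simple metrics."""

    def __init__(self, batch_size, avg_power_watts):
        self.batch_size = batch_size
        self.avg_power_watts = avg_power_watts  # assumed external power measurement

    def __enter__(self):
        tracemalloc.start()
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.wall_s = time.perf_counter() - self._t0
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.peak_python_mb = peak_bytes / 2**20
        # Throughput in samples per second for this batch.
        self.throughput = self.batch_size / self.wall_s if self.wall_s > 0 else float("inf")
        # Energy = average power * elapsed time, then normalized per sample.
        self.energy_j = self.avg_power_watts * self.wall_s
        self.energy_per_sample_j = self.energy_j / self.batch_size
        return False  # do not swallow exceptions


with BatchMonitor(batch_size=4, avg_power_watts=5.0) as mon:
    _ = [i * i for i in range(100_000)]  # placeholder for forward + adaptation work

print(f"{mon.wall_s * 1e3:.2f} ms, {mon.throughput:.1f} samples/s, "
      f"{mon.energy_per_sample_j:.6f} J/sample")
```

Splitting wall time into forward and adaptation time would need a second timer around the adaptation step, which this sketch leaves out.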

Usage

1. Install

pip install -e .[dev]

2. Prepare datasets

Use helper scripts in scripts/ (get_cifar10c.sh, get_imagenetc.sh) and place data under data/.

3. Run a single method

python main.py \
	--architecture resnet18 \
	--checkpoint-path ./checkpoints/resnet18_cifar10_source_model.pth \
	--data-dir ./data/CIFAR-10-C \
	--batch-size 4 \
	--tta-algorithm tent \
	--level 5 \
	--output ./results/outputs_resnet18_cifar10c_bs4 \
	--track-performance true

4. Run multiple methods

bash scripts/run_all_methods.sh \
	--architecture resnet18 \
	--checkpoint-path ./checkpoints/resnet18_cifar10_source_model.pth \
	--data-dir ./data/CIFAR-10-C \
	--batch-size 4
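
If you prefer to drive a sweep from Python, a sketch like the following assembles the single-method command for several batch sizes. It uses only the flags shown above. The helper name `build_command`, the chosen batch sizes, and the output-directory naming pattern are assumptions for illustration.

```python
import shlex


def build_command(algorithm, batch_size):
    """Assemble the main.py invocation using only the flags shown above."""
    return [
        "python", "main.py",
        "--architecture", "resnet18",
        "--checkpoint-path", "./checkpoints/resnet18_cifar10_source_model.pth",
        "--data-dir", "./data/CIFAR-10-C",
        "--batch-size", str(batch_size),
        "--tta-algorithm", algorithm,
        "--level", "5",
        # Assumed naming pattern, mirroring the single-method example.
        "--output", f"./results/outputs_resnet18_cifar10c_bs{batch_size}",
        "--track-performance", "true",
    ]


for bs in (1, 4, 16):
    # Print the commands instead of running them, so the sketch has no side effects.
    print(shlex.join(build_command("tent", bs)))
```

To actually launch a run, pass the list to `subprocess.run(cmd, check=True)` from the repository root.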

Results

Summary of system-level results for TTA using a ResNet-50 model, evaluated on selected MPU-class edge devices with the ImageNet-C benchmark. Delta Acc and Delta ECE are reported relative to the No Adaptation baseline. η_AE denotes adaptation efficiency. Latency and energy are reported per batch.

Device	Raspberry Pi 5 (Batch Size 16)						NVIDIA Jetson Orin Nano (Batch Size 32)
Method / Metrics	Delta Acc (%)	Delta ECE (%)	Lat. (ms)	Energy (J)	Peak RAM (MB)	η_AE	Delta Acc (%)	Delta ECE (%)	Lat. (ms)	Energy (J)	Peak RAM (MB)	η_AE
No Adaptation	0.00	0.00	3660.74	25.59	991.92	0.0	0.00	0.00	303.61	5.41	2775.62	0.0
AdaBN	-6.23	74.10	3859.45	26.96	1030.55	-72.76	2.24	72.94	408.66	6.59	2782.83	60.75
T3A	-2.88	-48.03	3958.79	27.68	1367.27	-22.05	-3.18	-47.91	930.05	14.95	3338.02	-10.67
LAME	-0.97	-9.05	3795.25	26.51	1121.89	-16.87	-1.30	-12.58	404.57	6.55	3011.36	-36.49
Pseudo Label	-0.56	75.11	4217.33	29.47	2507.98	-2.31	6.92	69.21	514.80	8.40	5526.63	74.06
SHOT-IM	8.73	-24.05	8654.84	60.53	2646.09	4.00	18.25	-16.49	991.27	15.05	5493.34	60.58
TENT	9.41	-116.65	8616.83	60.12	2985.81	4.36	23.56	-62.86	843.79	13.04	5470.41	98.81
CoTTA	Out-of-Memory						Out-of-Memory
EATA	15.41	-137.79	4602.03	32.18	2495.84	37.41	33.10	-105.78	1147.74	17.56	5597.54	87.18
SAR	10.50	38.52	6904.60	48.30	2767.84	7.40	22.22	40.83	1780.91	10.28	5528.69	146.00
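
Because latency and energy are reported per batch and the two devices use different batch sizes (16 and 32), cross-device comparisons are easier after normalizing per sample. The sketch below does that for the No Adaptation and TENT rows, with values copied from the table. The dictionary layout, device keys, and function names are illustrative only.

```python
# (device, method): (batch_size, latency_ms_per_batch, energy_j_per_batch)
ROWS = {
    ("pi5", "no_adapt"): (16, 3660.74, 25.59),
    ("pi5", "tent"): (16, 8616.83, 60.12),
    ("orin_nano", "no_adapt"): (32, 303.61, 5.41),
    ("orin_nano", "tent"): (32, 843.79, 13.04),
}


def per_sample(device, method):
    """Return (latency in ms per sample, energy in J per sample)."""
    batch_size, latency_ms, energy_j = ROWS[(device, method)]
    return latency_ms / batch_size, energy_j / batch_size


def latency_overhead(device, method):
    """Batch latency of `method` relative to No Adaptation on the same device."""
    return ROWS[(device, method)][1] / ROWS[(device, "no_adapt")][1]


for device in ("pi5", "orin_nano"):
    lat, en = per_sample(device, "tent")
    print(f"{device}: {lat:.1f} ms/sample, {en:.2f} J/sample, "
          f"{latency_overhead(device, 'tent'):.2f}x baseline latency")
```

On these numbers, TENT costs about 2.35x the baseline batch latency on the Raspberry Pi 5 and about 2.78x on the Jetson Orin Nano.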

[Figure: ResNet50 accuracy batch latency Pareto]

[Figure: ResNet50 batch-16 execution time breakdown stacked per sample]

[Figure: Memory composition analysis for ResNet50 on Orin GPU across all methods]

[Figure: Models memory wall peak VRAM usage]
