TR2026-044
TTQ: ACTIVATION-AWARE TEST-TIME QUANTIZATION TO ACCELERATE LLM INFERENCE ON THE FLY
T. Koike-Akino, J. Liu, Y. Wang, "TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference on the Fly", International Conference on Learning Representations (ICLR) Workshop, April 2026.
@inproceedings{Koike-Akino2026apr,
  author    = {Koike-Akino, Toshiaki and Liu, Jing and Wang, Ye},
  title     = {{TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference on the Fly}},
  booktitle = {International Conference on Learning Representations (ICLR) Workshop},
  year      = 2026,
  month     = apr,
  url       = {https://www.merl.com/publications/TR2026-044}
}
Abstract:
To tackle the huge computational demands of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these methods rely heavily on calibration data, domain-shift issues may arise on unseen downstream tasks. We propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time to resolve this issue. With efficient online calibration, instant activation-aware quantization adapts to every prompt regardless of the downstream task, while still achieving inference speedups. Experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines.


