TR2026-018
LatentLLM: Activation-Aware Transform to Multi-Head Latent Attention
T. Koike-Akino, X. Chen, J. Liu, Y. Wang, P. Wang, M. Brand, "LatentLLM: Activation-Aware Transform to Multi-Head Latent Attention", AAAI Conference on Artificial Intelligence, Sven Koenig, Chad Jenkins, Matthew E. Taylor, Eds., DOI: 10.1609/aaai.v40i27.39425, March 2026, vol. 40, pp. 22644-22652.
@inproceedings{Koike-Akino2026jan,
  author = {Koike-Akino, Toshiaki and Chen, Xiangyu and Liu, Jing and Wang, Ye and Wang, Pu and Brand, Matthew},
  title = {{LatentLLM: Activation-Aware Transform to Multi-Head Latent Attention}},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year = 2026,
  editor = {Sven Koenig and Chad Jenkins and Matthew E. Taylor},
  volume = 40,
  number = 27,
  pages = {22644--22652},
  month = jan,
  publisher = {AAAI Press},
  doi = {10.1609/aaai.v40i27.39425},
  issn = {2374-3468},
  isbn = {978-1-57735-906-7},
  url = {https://www.merl.com/publications/TR2026-018}
}
Abstract:
Modern foundation models such as large language models (LLMs) require massive computational and memory resources. We propose a new framework to convert such LLMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. When reducing the latent dimension to realize computation- and memory-efficient LLMs, our framework significantly improves model accuracy over existing model compression methods. We show the benefit on several benchmarks, including multi-modal reasoning tasks.
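To illustrate the core idea of an activation-aware (as opposed to activation-agnostic) low-rank decomposition, the following is a minimal sketch, not the paper's actual algorithm: it factors a single weight matrix `W` into latent factors `A @ B` while minimizing the error on the outputs `W @ X` for calibration activations `X`, via a whitening (Cholesky) transform of the activation second-moment matrix. The function name, the regularization constant, and the use of plain SVD are all illustrative assumptions.

```python
import numpy as np

def activation_aware_lowrank(W, X, rank):
    """Illustrative sketch (not the paper's method): factor W (d_out x d_in)
    into A (d_out x rank) @ B (rank x d_in), minimizing the activation-aware
    error ||W X - A B X||_F instead of the agnostic error ||W - A B||_F."""
    # Second moment of the calibration activations; small ridge for stability
    # (the 1e-6 regularizer is an assumption, not from the paper).
    S = X @ X.T + 1e-6 * np.eye(X.shape[0])
    L = np.linalg.cholesky(S)          # S = L L^T, so ||M X||_F = ||M L||_F
    # Best rank-k approximation of the whitened weight W L
    U, s, Vt = np.linalg.svd(W @ L, full_matrices=False)
    A = U[:, :rank] * s[:rank]         # d_out x rank latent projection
    B = Vt[:rank] @ np.linalg.inv(L)   # rank x d_in, un-whitened factor
    return A, B
```

Because the truncated SVD is optimal for the whitened matrix `W @ L`, this factorization is (up to the ridge term) optimal in the activation-weighted norm, so it matches or beats plain SVD truncation of `W` whenever the activations are anisotropic; the paper's contribution goes further by making the decomposition attention-aware and joint across matrices rather than local to each weight.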




