TR2026-018

LatentLLM: Activation-Aware Transform to Multi-Head Latent Attention


    •  Koike-Akino, T., Chen, X., Liu, J., Wang, Y., Wang, P., Brand, M., "LatentLLM: Activation-Aware Transform to Multi-Head Latent Attention", AAAI Conference on Artificial Intelligence, January 2026.
      @inproceedings{Koike-Akino2026jan,
        author = {Koike-Akino, Toshiaki and Chen, Xiangyu and Liu, Jing and Wang, Ye and Wang, Pu and Brand, Matthew},
        title = {LatentLLM: Activation-Aware Transform to Multi-Head Latent Attention},
        booktitle = {AAAI Conference on Artificial Intelligence},
        year = 2026,
        month = jan,
        url = {https://www.merl.com/publications/TR2026-018}
      }
  • Research Areas: Artificial Intelligence, Machine Learning

Abstract:

Modern foundation models such as large language models (LLMs) require massive computational and memory resources. We propose a new framework to convert such LLMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. When the latent dimension is reduced to realize computationally and memory-efficient LLMs, our framework significantly improves model accuracy over existing model compression methods. We demonstrate the benefit on several benchmarks, including multi-modal reasoning tasks.
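
As a rough illustration of the activation-aware decomposition idea mentioned in the abstract, the sketch below factorizes a single attention projection weight into a low-rank (latent) form while weighting the reconstruction error by calibration activations. The helper name activation_aware_lowrank, the Cholesky whitening of the activation second-moment matrix, and all dimensions are illustrative assumptions, not the paper's exact algorithm.

    # Minimal sketch (assumed, not the paper's method): activation-aware
    # low-rank factorization of one projection weight W. Instead of
    # minimizing ||W - A @ B||_F, we minimize ||X W - X (A @ B)||_F by
    # whitening W with the activation second-moment matrix before truncation.
    import numpy as np

    def activation_aware_lowrank(W: np.ndarray, X: np.ndarray, rank: int, eps: float = 1e-6):
        """Factor W (d_in x d_out) into A (d_in x rank) @ B (rank x d_out),
        weighting the reconstruction by calibration activations X (n x d_in)."""
        # Second-moment matrix of the calibration activations (regularized).
        cov = X.T @ X / X.shape[0] + eps * np.eye(W.shape[0])
        # Whitening factor S with S.T @ S = cov (transpose of the Cholesky factor).
        S = np.linalg.cholesky(cov).T
        # Truncated SVD of the whitened weight S @ W.
        U, s, Vt = np.linalg.svd(S @ W, full_matrices=False)
        Ur, sr, Vtr = U[:, :rank], s[:rank], Vt[:rank]
        # Un-whiten the left factor so A @ B approximates W in the
        # activation-weighted norm.
        A = np.linalg.solve(S, Ur * sr)   # d_in x rank
        B = Vtr                           # rank x d_out
        return A, B

    # Usage: compress a (hypothetical) query projection to a latent dimension of 64.
    d_in, d_out, n_calib = 512, 512, 2048
    rng = np.random.default_rng(0)
    W_q = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    X = rng.standard_normal((n_calib, d_in))
    A, B = activation_aware_lowrank(W_q, X, rank=64)
    err = np.linalg.norm(X @ W_q - X @ A @ B) / np.linalg.norm(X @ W_q)
    print(f"relative activation-space error: {err:.3f}")

This per-matrix (local) step is only a starting point; the paper's contribution is the global, attention-aware joint decomposition across the projection matrices, which the sketch does not attempt to reproduce.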