TR2025-109
Aligning Multimodal Representations through an Information Bottleneck
"Aligning Multimodal Representations through an Information Bottleneck", International Conference on Machine Learning (ICML), July 2025.
@inproceedings{Almudévar2025jul,
  author = {Almudévar, Antonio and Hernández-Lobato, José M. and Khurana, Sameer and Marxer, Ricard and Ortega, Alfonso},
  title = {{Aligning Multimodal Representations through an Information Bottleneck}},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = 2025,
  month = jul,
  url = {https://www.merl.com/publications/TR2025-109}
}
Abstract:
Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that they are not effective at learning an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function that is derived by means of a variational approximation and aims to increase the representational alignment. We analyze the advantages of including this regularization term in a set of controlled experiments and real-world applications.
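To make the setting concrete, the sketch below shows a symmetric InfoNCE-style contrastive loss (a common mutual-information lower bound for two modalities) combined with a KL-to-standard-normal penalty of the kind used in variational information bottleneck objectives. This is an illustrative assumption, not the paper's exact regularizer: the function names, the choice of prior, and the weight `beta` are hypothetical, and the paper's variational derivation may differ.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(za, zb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    za, zb: (n, d) representations of the same n items in two modalities.
    Maximizing the diagonal of the similarity matrix lower-bounds the
    mutual information between the two representation spaces.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau
    n = za.shape[0]
    # Cross-entropy against the matching pair, in both directions.
    loss_a = -np.trace(log_softmax(logits, axis=1)) / n
    loss_b = -np.trace(log_softmax(logits, axis=0)) / n
    return (loss_a + loss_b) / 2

def kl_regularizer(mu, logvar):
    # KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder, averaged over
    # the batch: a standard variational-bottleneck penalty that discourages
    # the encoder from retaining extra (e.g. modality-specific) information.
    return 0.5 * np.mean(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1))

def total_loss(za, zb, mu, logvar, beta=0.01):
    # Hypothetical combined objective: contrastive term plus bottleneck term.
    return info_nce(za, zb) + beta * kl_regularizer(mu, logvar)
```

With `beta = 0`, this reduces to a plain contrastive objective; the abstract's argument is that such an objective alone does not remove modality-specific information, which is what the added penalty is meant to address.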