Moitreya Chatterjee
- Phone: 617-621-7592
- Email:
Position:
- Research Scientist, Research / Technical Staff
Education:
- Ph.D., University of Illinois at Urbana-Champaign, 2022
Research Areas:
Biography
Moitreya's research interests are in computer vision and multimodal machine learning, with a particular emphasis on learning from audio-visual data. His Ph.D. work received the Joan and Lalit Bahl Fellowship and the Thomas and Margaret Huang Research Award. Earlier, he earned an M.S. degree in Computer Science from the University of Southern California (USC), during which he received an Outstanding Paper Award at the ACM International Conference on Multimodal Interaction (ICMI).
-
Recent News & Events
-
NEWS: MERL Papers and Workshops at CVPR 2024
Date: June 17, 2024 - June 21, 2024
Where: Seattle, WA
MERL Contacts: Petros T. Boufounos; Moitreya Chatterjee; Anoop Cherian; Michael J. Jones; Toshiaki Koike-Akino; Jonathan Le Roux; Suhas Lohit; Tim K. Marks; Pedro Miraldo; Jing Liu; Kuan-Chuan Peng; Pu (Perry) Wang; Ye Wang; Matthew Brand
Research Areas: Artificial Intelligence, Computational Sensing, Computer Vision, Machine Learning, Speech & Audio
Brief: MERL researchers are presenting 5 conference papers, 3 workshop papers, and are co-organizing two workshops at the CVPR 2024 conference, which will be held in Seattle, June 17-21. CVPR is one of the most prestigious and competitive international conferences in computer vision. Details of MERL contributions are provided below.
CVPR Conference Papers:
1. "TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models" by H. Ni, B. Egger, S. Lohit, A. Cherian, Y. Wang, T. Koike-Akino, S. X. Huang, and T. K. Marks
This work enables a pretrained text-to-video (T2V) diffusion model to be additionally conditioned on an input image (first video frame), yielding a text+image to video (TI2V) model. Other than using the pretrained T2V model, our method requires no ("zero") training or fine-tuning. The paper uses a "repeat-and-slide" method and diffusion resampling to synthesize videos from a given starting image and text describing the video content.
Paper: https://www.merl.com/publications/TR2024-059
Project page: https://merl.com/research/highlights/TI2V-Zero
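The "repeat-and-slide" conditioning can be pictured as a sliding window over frames: pad a window with copies of the conditioning image, then repeatedly synthesize with the pretrained T2V model and keep only the newest frame. The sketch below is an illustrative schematic only — the stand-in `generate_window` replaces the actual diffusion sampling and resampling steps, and all names are assumptions, not the paper's code:

```python
def repeat_and_slide(first_frame, num_frames, window, generate_window):
    """Schematic of 'repeat-and-slide': start from a window filled with
    copies of the conditioning image, then repeatedly synthesize a window
    with the (frozen, pretrained) T2V model and keep only its newest frame,
    sliding the window forward one frame at a time."""
    frames = [first_frame] * window
    while len(frames) < window + num_frames - 1:
        # Stand-in for a diffusion sampling call conditioned on the text prompt.
        new_window = generate_window(frames[-window:])
        frames.append(new_window[-1])
    return frames[window - 1:]  # drop the padded copies; keep the synthesized video

# Toy usage with integer "frames": each generated frame increments the last one.
video = repeat_and_slide(0, num_frames=4, window=3,
                         generate_window=lambda w: w[1:] + [w[-1] + 1])
```

The key property is that the conditioning image is always the first output frame, while every later frame is produced by the unmodified pretrained model — hence "zero" training.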
2. "Long-Tailed Anomaly Detection with Learnable Class Names" by C.-H. Ho, K.-C. Peng, and N. Vasconcelos
This work aims to identify defects across various classes without relying on hard-coded class names. We introduce the concept of long-tailed anomaly detection, addressing challenges like class imbalance and dataset variability. Our proposed method combines reconstruction and semantic modules, learning pseudo-class names and utilizing a variational autoencoder for feature synthesis to improve performance in long-tailed datasets, outperforming existing methods in experiments.
Paper: https://www.merl.com/publications/TR2024-040
3. "Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling" by X. Liu, Y-W. Tai, C-T. Tang, P. Miraldo, S. Lohit, and M. Chatterjee
This work presents a new strategy for rendering dynamic scenes from novel viewpoints. Our approach stratifies the scene into regions according to their extent of motion, which is determined automatically. Regions with higher motion are assigned a denser spatio-temporal sampling strategy for more faithful rendering of the scene. Additionally, to the best of our knowledge, ours is the first work to enable tracking of objects in the scene from novel views, based on user preferences provided via a click.
Paper: https://www.merl.com/publications/TR2024-042
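The motion-aware stratification can be illustrated by a simple budget rule: give every region a baseline number of samples, then spend the remaining budget in proportion to a per-region motion score. This is a hypothetical sketch of the idea only, not the paper's actual sampler (function and parameter names are assumptions):

```python
import numpy as np

def allocate_samples(motion_scores, total_samples, min_per_region=8):
    """Split a spatio-temporal sample budget across scene regions so that
    regions with more motion are sampled more densely, while every region
    keeps a minimum baseline for static content."""
    scores = np.asarray(motion_scores, dtype=float)
    base = np.full(scores.shape, min_per_region, dtype=int)
    extra = total_samples - base.sum()
    # Proportional allocation of the remaining budget (floored to integers).
    return base + np.floor(extra * scores / scores.sum()).astype(int)

counts = allocate_samples([0.1, 0.3, 0.6], total_samples=128)
```

The flooring means a few samples of the budget may go unspent; a real sampler would redistribute the remainder, but the monotone relation between motion and density is the point here.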
4. "SIRA: Scalable Inter-frame Relation and Association for Radar Perception" by R. Yataka, P. Wang, P. T. Boufounos, and R. Takahashi
Overcoming the limitations on radar feature extraction such as low spatial resolution, multipath reflection, and motion blurs, this paper proposes SIRA (Scalable Inter-frame Relation and Association) for scalable radar perception with two designs: 1) extended temporal relation, generalizing the existing temporal relation layer from two frames to multiple inter-frames with temporally regrouped window attention for scalability; and 2) motion consistency track with a pseudo-tracklet generated from observational data for better object association.
Paper: https://www.merl.com/publications/TR2024-041
5. "RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation" by Z. Yang, J. Liu, P. Chen, A. Cherian, T. K. Marks, J. Le Roux, and C. Gan
We leverage Large Language Models (LLMs) for zero-shot semantic audio-visual navigation. Specifically, by employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment by adaptively evaluating and dismissing inaccurate perceptual descriptions.
Paper: https://www.merl.com/publications/TR2024-043
CVPR Workshop Papers:
1. "CoLa-SDF: Controllable Latent StyleSDF for Disentangled 3D Face Generation" by R. Dey, B. Egger, V. Boddeti, Y. Wang, and T. K. Marks
This paper proposes a new method for generating 3D faces and rendering them to images by combining the controllability of nonlinear 3DMMs with the high fidelity of implicit 3D GANs. Inspired by StyleSDF, our model uses a similar architecture but enforces the latent space to match the interpretable and physical parameters of the nonlinear 3D morphable model MOST-GAN.
Paper: https://www.merl.com/publications/TR2024-045
2. "Tracklet-based Explainable Video Anomaly Localization" by A. Singh, M. J. Jones, and E. Learned-Miller
This paper describes a new method for localizing anomalous activity in video of a scene given sample videos of normal activity from the same scene. The method is based on detecting and tracking objects in the scene and estimating high-level attributes of the objects such as their location, size, short-term trajectory and object class. These high-level attributes can then be used to detect unusual activity as well as to provide a human-understandable explanation for what is unusual about the activity.
Paper: https://www.merl.com/publications/TR2024-057
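The attribute-based scoring admits a compact sketch: summarize each tracklet by interpretable attributes, then flag tracklets whose attributes are far from anything seen in the normal videos. The code below is a hypothetical illustration of that pipeline, not the authors' implementation (the attribute set and distance rule are simplifications):

```python
import numpy as np

def tracklet_attributes(boxes):
    """Summarize a tracklet (rows of (x, y, w, h) boxes over time) by
    high-level attributes: mean center location, mean size, and net
    displacement (a crude stand-in for short-term trajectory)."""
    b = np.asarray(boxes, dtype=float)
    centers = b[:, :2] + b[:, 2:] / 2.0
    motion = centers[-1] - centers[0] if len(b) > 1 else np.zeros(2)
    return np.concatenate([centers.mean(axis=0), b[:, 2:].mean(axis=0), motion])

def anomaly_score(tracklet, normal_tracklets):
    """Distance to the nearest attribute vector from the normal videos;
    a large distance marks the tracklet as unusual."""
    attr = tracklet_attributes(tracklet)
    normals = np.stack([tracklet_attributes(t) for t in normal_tracklets])
    return float(np.linalg.norm(normals - attr, axis=1).min())
```

Because each attribute is human-readable (location, size, trajectory, class), the dimensions with the largest deviation directly explain why a tracklet was flagged — the "explainable" part of the method.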
MERL co-organized workshops:
1. "Multimodal Algorithmic Reasoning Workshop" by A. Cherian, K-C. Peng, S. Lohit, M. Chatterjee, H. Zhou, K. Smith, T. K. Marks, J. Mathissen, and J. Tenenbaum
Workshop link: https://marworkshop.github.io/cvpr24/index.html
2. "The 5th Workshop on Fair, Data-Efficient, and Trusted Computer Vision" by K-C. Peng, et al.
Workshop link: https://fadetrcv.github.io/2024/
CVPR Workshop Papers (continued):
3. "SuperLoRA: Parameter-Efficient Unified Adaptation for Large Vision Models" by X. Chen, J. Liu, Y. Wang, P. Wang, M. Brand, G. Wang, and T. Koike-Akino
This paper proposes a generalized framework called SuperLoRA that unifies and extends different variants of low-rank adaptation (LoRA). Introducing new options with grouping, folding, shuffling, projection, and tensor decomposition, SuperLoRA offers high flexibility and demonstrates superior performance up to 10-fold gain in parameter efficiency for transfer learning tasks.
Paper: https://www.merl.com/publications/TR2024-062
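SuperLoRA generalizes low-rank adaptation (LoRA), in which a frozen weight matrix W is augmented with a trainable low-rank update BA. A minimal NumPy sketch of the plain-LoRA baseline being generalized (shapes and names are illustrative, not from the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Linear layer with frozen weight W (d_out x d_in) plus a low-rank
    update B @ A, where A (r x d_in) and B (d_out x r) are the only
    trainable parameters; r << min(d_in, d_out) gives parameter efficiency."""
    return x @ (W + alpha * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized
x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B)
```

With B zero-initialized, the adapted layer initially matches the frozen model exactly; SuperLoRA's grouping, folding, shuffling, projection, and tensor-decomposition options change how this low-rank update is parameterized across weights.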
-
TALK: [MERL Seminar Series 2023] Dr. Tanmay Gupta presents a talk titled "Visual Programming - A compositional approach to building General Purpose Vision Systems"
Date & Time: Tuesday, October 31, 2023; 2:00 PM
Speaker: Tanmay Gupta, Allen Institute for Artificial Intelligence
MERL Host: Moitreya Chatterjee
Research Areas: Artificial Intelligence, Computer Vision, Machine Learning
Abstract: Building General Purpose Vision Systems (GPVs) that can perform a huge variety of tasks has been a long-standing goal for the computer vision community. However, end-to-end training of these systems to handle different modalities and tasks has proven to be extremely challenging. In this talk, I will describe a lucrative neuro-symbolic alternative to the common end-to-end learning paradigm called Visual Programming. Visual Programming is a general framework that leverages the code-generation abilities of LLMs, existing neural models, and non-differentiable programs to enable powerful applications. Some of these applications continue to remain elusive for the current generation of end-to-end trained GPVs.
-
Internships with Moitreya
-
CV0078: Internship - Audio-Visual Learning with Limited Labeled Data
MERL is looking for a highly motivated intern to work on an original research project on multimodal learning, such as audio-visual learning, using limited labeled data. A strong background in computer vision and deep learning is required. Experience in audio-visual (multimodal) learning, weakly/self-supervised learning, continual learning, and large (vision-) language models is a plus. The successful candidate is expected to have published at least one paper in a top-tier computer vision or machine learning venue, such as CVPR, ECCV, ICCV, ICML, ICLR, NeurIPS or AAAI, and to possess solid programming skills in Python and popular deep learning frameworks such as PyTorch. The intern will collaborate with MERL researchers to develop and implement novel algorithms and prepare manuscripts for scientific publications. Successful applicants are typically graduate students on a Ph.D. track or recent Ph.D. graduates. Duration and start date are flexible, but the internship is expected to last at least 3 months.
Required Specific Experience
- Prior publications in top-tier computer vision and/or machine learning venues, such as CVPR, ECCV, ICCV, ICML, ICLR, NeurIPS or AAAI.
- Knowledge of the latest self-supervised and weakly-supervised learning techniques.
- Experience with Large (Vision-) Language Models.
- Proficiency in scripting languages, such as Python, and deep learning frameworks, such as PyTorch or TensorFlow.
-
CV0079: Internship - Novel View Synthesis of Dynamic Scenes
MERL is looking for a highly motivated intern to work on an original research project in rendering dynamic scenes from novel views. A strong background in 3D computer vision and/or computer graphics is required. Experience with the latest advances in volumetric rendering, such as neural radiance fields (NeRFs) and Gaussian Splatting (GS), is desired. The successful candidate is expected to have published at least one paper in a top-tier computer vision/graphics or machine learning venue, such as CVPR, ECCV, ICCV, SIGGRAPH, 3DV, ICML, ICLR, NeurIPS or AAAI, and to possess solid programming skills in Python and popular deep learning frameworks like PyTorch. The candidate will collaborate with MERL researchers to develop algorithms and prepare manuscripts for scientific publications. The position is available for graduate students on a Ph.D. track or those who have recently graduated with a Ph.D. Duration and start date are flexible, but the internship is expected to last at least 3 months.
Required Specific Experience
- Prior publications in top computer vision/graphics and/or machine learning venues, such as CVPR, ECCV, ICCV, SIGGRAPH, 3DV, ICML, ICLR, NeurIPS or AAAI.
- Experience with the latest novel-view synthesis approaches, such as Neural Radiance Fields (NeRFs) or Gaussian Splatting (GS).
- Proficiency in coding (particularly scripting languages like Python) and familiarity with deep learning frameworks, such as PyTorch or TensorFlow.
-
-
MERL Publications
- "A Probability-guided Sampler for Neural Implicit Surface Rendering", European Conference on Computer Vision (ECCV), September 2024. TR2024-129
- @inproceedings{Pais2024sep,
- author = {Pais, Goncalo and Piedade, Valter and Chatterjee, Moitreya and Greiff, Marcus and Miraldo, Pedro},
- title = {A Probability-guided Sampler for Neural Implicit Surface Rendering},
- booktitle = {European Conference on Computer Vision (ECCV)},
- year = 2024,
- month = sep,
- url = {https://www.merl.com/publications/TR2024-129}
- }
- "Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), May 2024, pp. 19667-19679. TR2024-042
- @inproceedings{Liu2024may,
- author = {Liu, Xinhang and Tai, Yu-wing and Tang, Chi-Keung and Miraldo, Pedro and Lohit, Suhas and Chatterjee, Moitreya},
- title = {Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling},
- booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
- year = 2024,
- pages = {19667--19679},
- month = may,
- publisher = {IEEE},
- url = {https://www.merl.com/publications/TR2024-042}
- }
- "CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments", AAAI Conference on Artificial Intelligence, DOI: 10.1609/aaai.v38i4.28167, December 2023, pp. 3765-3773. TR2023-154
- @inproceedings{Liu2023dec2,
- author = {Liu, Xiulong and Paul, Sudipta and Chatterjee, Moitreya and Cherian, Anoop},
- title = {CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments},
- booktitle = {Proceedings of the 38th AAAI Conference on Artificial Intelligence},
- year = 2023,
- pages = {3765--3773},
- month = dec,
- doi = {10.1609/aaai.v38i4.28167},
- url = {https://www.merl.com/publications/TR2023-154}
- }
- "Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection", IEEE International Conference on Computer Vision Workshops (ICCV), October 2023, pp. 924-932. TR2023-125
- @inproceedings{Sharma2023oct,
- author = {Sharma, Manish and Chatterjee, Moitreya and Peng, Kuan-Chuan and Lohit, Suhas and Jones, Michael J.},
- title = {Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection},
- booktitle = {IEEE International Conference on Computer Vision Workshops (ICCV)},
- year = 2023,
- pages = {924--932},
- month = oct,
- url = {https://www.merl.com/publications/TR2023-125}
- }
- "Active Sparse Conversations for Improved Audio-Visual Embodied Navigation", arXiv, June 2023.
-
Other Publications
- "A hierarchical variational neural uncertainty model for stochastic video prediction", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9751-9761.
- @Inproceedings{chatterjee2021hierarchical,
- author = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
- title = {A hierarchical variational neural uncertainty model for stochastic video prediction},
- booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
- year = 2021,
- pages = {9751--9761}
- }
- "Visual scene graphs for audio source separation", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1204-1213.
- @Inproceedings{chatterjee2021visual,
- author = {Chatterjee, Moitreya and Le Roux, Jonathan and Ahuja, Narendra and Cherian, Anoop},
- title = {Visual scene graphs for audio source separation},
- booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
- year = 2021,
- pages = {1204--1213}
- }
- "Dynamic graph representation learning for video dialog via multi-modal shuffled transformers", Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 1415-1423.
- @Inproceedings{geng2021dynamic,
- author = {Geng, Shijie and Gao, Peng and Chatterjee, Moitreya and Hori, Chiori and Le Roux, Jonathan and Zhang, Yongfeng and Li, Hongsheng and Cherian, Anoop},
- title = {Dynamic graph representation learning for video dialog via multi-modal shuffled transformers},
- booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
- year = 2021,
- volume = 35,
- number = 2,
- pages = {1415--1423}
- }
- "Sound2sight: Generating visual dynamics from sound and context", European Conference on Computer Vision, 2020, pp. 701-719.
- @Inproceedings{chatterjee2020sound2sight,
- author = {Chatterjee, Moitreya and Cherian, Anoop},
- title = {Sound2sight: Generating visual dynamics from sound and context},
- booktitle = {European Conference on Computer Vision},
- year = 2020,
- pages = {701--719},
- organization = {Springer}
- }
- "Coreset-based neural network compression", Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 454-470.
- @Inproceedings{dubey2018coreset,
- author = {Dubey, Abhimanyu and Chatterjee, Moitreya and Ahuja, Narendra},
- title = {Coreset-based neural network compression},
- booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
- year = 2018,
- pages = {454--470}
- }
- "Deep neural networks with inexact matching for person re-identification", Advances in Neural Information Processing Systems, Vol. 29, 2016.
- @Article{subramaniam2016deep,
- author = {Subramaniam, Arulkumar and Chatterjee, Moitreya and Mittal, Anurag},
- title = {Deep neural networks with inexact matching for person re-identification},
- journal = {Advances in neural information processing systems},
- year = 2016,
- volume = 29
- }
- "Combining two perspectives on classifying multimodal data for recognizing speaker traits", Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 7-14.
- @Inproceedings{chatterjee2015combining,
- author = {Chatterjee, Moitreya and Park, Sunghyun and Morency, Louis-Philippe and Scherer, Stefan},
- title = {Combining two perspectives on classifying multimodal data for recognizing speaker traits},
- booktitle = {Proceedings of the 2015 ACM on International Conference on Multimodal Interaction},
- year = 2015,
- pages = {7--14}
- }
-
MERL Issued Patents
-
Title: "A Method and System for Scene-Aware Audio-Video Representation"
Inventors: Cherian, Anoop; Chatterjee, Moitreya; Le Roux, Jonathan
Patent No.: 12,056,213
Issue Date: Aug 6, 2024