Initiated in 2021, the MERL Seminar Series aims to increase exposure to outstanding emerging researchers and new research topics. We invite carefully chosen external speakers to share their work, focusing on technical topics of broad interest. The talks help promote interdisciplinary collaboration within and outside MERL.
Seminars in the series take place approximately every two weeks during the fall and spring.
Acoustic perception is invaluable to humans and robots in understanding objects and events in their environments. These sounds are dependent on properties of the source, the environment, and the receiver. Many humans possess remarkable intuition both to infer key properties of each of these three aspects from a sound and to form expectations of how these different aspects would affect the sound they hear. In order to equip robots and AI agents with similar if not stronger capabilities, our research has taken a two-fold path. First, we collect high-fidelity datasets in both controlled and uncontrolled environments which capture real sounds of objects and rooms. Second, we introduce differentiable physics-based models that can estimate acoustic properties of objects and rooms from minimal amounts of real audio data, then can predict new sounds from these objects and rooms under novel, “unseen” conditions.
Harvesting labor is the single largest cost in apple production in the U.S. Surging cost and growing shortage of labor has forced the apple industry to seek automated harvesting solutions. Despite considerable progress in recent years, the existing robotic harvesting systems still fall short of performance expectations, lacking robustness and proving inefficient or overly complex for practical commercial deployment. In this talk, I will present the development and evaluation of a new dual-arm robotic apple harvesting system. This work is a result of a continuous collaboration between Michigan State University and U.S. Department of Agriculture.
Large language models have been found to have surprising capabilities, even what have been called “sparks of artificial general intelligence.” However, understanding these models involves some significant challenges: their internal structure is extremely complicated, their training data is often opaque, and getting access to the underlying mechanisms is becoming increasingly difficult. As a consequence, researchers often have to resort to studying these systems based on their behavior. This situation is, of course, one that cognitive scientists are very familiar with — human brains are complicated systems trained on opaque data and typically difficult to study mechanistically. In this talk I will summarize some of the tools of cognitive science that are useful for understanding the behavior of large language models. Specifically, I will talk about how thinking about different levels of analysis (and Bayesian inference) can help us understand some behaviors that don’t seem particularly intelligent, how tasks like similarity judgment can be used to probe internal representations, how axiom violations can reveal interesting mechanisms, and how associations can reveal biases in systems that have been trained to be unbiased.
Learning-enabled control systems have demonstrated impressive empirical performance on challenging control problems in robotics. However, this performance often arrives with the trade-off of diminished transparency and the absence of guarantees regarding the safety and stability of the learned controllers. In recent years, new techniques have emerged to provide these guarantees by learning certificates alongside control policies — these certificates provide concise, data-driven proofs that guarantee the safety and stability of the learned control system. These methods not only allow the user to verify the safety of a learned controller but also provide supervision during training, allowing safety and stability requirements to influence the training process itself. In this talk, we present two exciting updates on neural certificates. In the first work, we explore the use of graph neural networks to learn collision-avoidance certificates that can generalize to unseen and very crowded environments. The second work presents a novel reinforcement learning approach that can produce certificate functions with the policies while addressing the instability issues in the optimization process. Finally, if time permits, I will also talk about my group's recent work using LLM and domain-specific task and motion planners to allow natural language as input for robot planning.
The explosive growth of machine learning and data-driven methodologies have revolutionized numerous fields. Yet, translating these successes to the domain of dynamical, physical systems remains a significant challenge, hindered by the complex and often unpredictable nature of such environments. Closing the loop from data to actions in these systems faces many difficulties, stemming from the need for sample efficiency and computational feasibility amidst intricate dynamics, along with many other requirements such as verifiability, robustness, and safety. In this talk, we bridge this gap by introducing innovative approaches that harness representation-based methods, domain knowledge, and the physical structures of systems. We present a comprehensive framework that integrates these components to develop reinforcement learning and control strategies that are not only tailored for the complexities of physical systems but also achieve efficiency, safety, and robustness with provable performance.
This talk will cover a new generation of technologies that can sense, connect, and perceive the physical world in unprecedented ways. These technologies can uncover hidden worlds around us, promising transformative impact on areas spanning climate change monitoring, ocean mapping, healthcare, food security, supply chain, and even extraterrestrial exploration.
The talk will cover four core technologies invented by Prof. Adib and his team. The first is an ocean internet-of-things (IoT) that uses battery-free sensors for climate change monitoring, marine life discovery, and seafood production (aquaculture). The second is a new perception technology that enables robots to sense and manipulate hidden objects. The third is a new augmented reality headset with ``X-ray vision”, which extends human perception beyond line-of-sight. The fourth is a wireless sensing technology that can “see through walls” and monitor people’s vital signs (including their breathing, heart rate, and emotions), enabling smart environments that sense humans requiring any contact with the human body.
The talk will touch on the journey of these technologies from their inception at MIT to international collaborations and startups that are translating them to real-world impact in areas spanning healthcare, climate change, and supply chain.
Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance. We present our alternative explanation in a simple mathematical model. Via the presented analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.
While robots have been successfully deployed in factory floors and warehouses, there has been limited progress in having them perform physical tasks with people at home and in the workplace. I aim to bridge the gap between their current performance in human environments and what robots are capable of doing, by making human-robot interactions efficient and robust.
In the first part of my talk, I discuss enhancing the efficiency of human-robot interactions by enabling robot manipulators to infer the preference of a human teammate and proactively assist them in a collaborative task. I show how we can leverage similarities between different users and tasks to learn compact representations of user preferences and use these representations as priors for efficient inference.
In the second part, I talk about enhancing the robustness of human-robot interactions by algorithmically generating diverse and realistic scenarios in simulation that reveal system failures. I propose formulating the problem of algorithmic scenario generation as a quality diversity problem and show how standard quality diversity algorithms can discover surprising and unexpected failure cases. I then discuss the development of a new class of quality diversity algorithms that significantly improve the search of the scenario space and the integration of these algorithms with generative models, which enables the generation of complex and realistic scenarios.
Finally, I conclude the talk with applications in mining operations, collaborative manufacturing and assistive care.
I will survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to "understand" language -- and the physical and social situations language encodes -- in any important sense. I will describe arguments that have been made for and against such understanding, and, more generally, will discuss what methods can be used to fairly evaluate understanding and intelligence in AI systems. I will conclude with key questions for the broader sciences of intelligence that have arisen in light of these discussions.
Advances in machine learning have led to powerful models for audio and language, proficient in tasks like speech recognition and fluent language generation. Beyond their immense utility in engineering applications, these models offer valuable tools for cognitive science and neuroscience. In this talk, I will demonstrate how these artificial neural network models can be used to understand how the human brain processes language. The first part of the talk will cover how audio neural networks serve as computational accounts for brain activity in the auditory cortex. The second part will focus on the use of large language models, such as those in the GPT family, to non-invasively control brain activity in the human language system.
Imaging in low light settings is extremely challenging due to low photon counts, both in photography and in microscopy. In photography, imaging under low light, high gain settings often results in highly structured, non-Gaussian sensor noise that’s hard to characterize or denoise. In this talk, we address this by developing a GAN-tuned physics-based noise model to more accurately represent camera noise at the lowest light, and highest gain settings. Using this noise model, we train a video denoiser using synthetic data and demonstrate photorealistic videography at starlight (submillilux levels of illumination) for the first time.
For multiphoton microscopy, which is a form a scanning microscopy, there’s a trade-off between field of view, phototoxicity, acquisition time, and image quality, often resulting in noisy measurements. While deep learning-based methods have shown compelling denoising performance, can we trust these methods enough for critical scientific and medical applications? In the second part of this talk, I’ll introduce a learned, distribution-free uncertainty quantification technique that can both denoise and predict pixel-wise uncertainty to gauge how much we can trust our denoiser’s performance. Furthermore, we propose to leverage this learned, pixel-wise uncertainty to drive an adaptive acquisition technique that rescans only the most uncertain regions of a sample. With our sample and algorithm-informed adaptive acquisition, we demonstrate a 120X improvement in total scanning time and total light dose for multiphoton microscopy, while successfully recovering fine structures within the sample.
When designing complex systems, we need to consider multiple trade-offs at various abstraction levels and scales, and choices of single components need to be studied jointly. For instance, the design of future mobility solutions (e.g., autonomous vehicles, micromobility) and the design of the mobility systems they enable are closely coupled. Indeed, knowledge about the intended service of novel mobility solutions would impact their design and deployment process, whilst insights about their technological development could significantly affect transportation management policies. Optimally co-designing sociotechnical systems is a complex task for at least two reasons. On one hand, the co-design of interconnected systems (e.g., large networks of cyber-physical systems) involves the simultaneous choice of components arising from heterogeneous natures (e.g., hardware vs. software parts) and fields, while satisfying systemic constraints and accounting for multiple objectives. On the other hand, components are connected via collaborative and conflicting interactions between different stakeholders (e.g., within an intermodal mobility system). In this talk, I will present a framework to co-design complex systems, leveraging a monotone theory of co-design and tools from game theory. The framework will be instantiated in the task of designing future mobility systems, all the way from the policies that a city can design, to the autonomy of vehicles part of an autonomous mobility-on-demand service. Through various case studies, I will show how the proposed approaches allow one to efficiently answer heterogeneous questions, unifying different modeling techniques and promoting interdisciplinarity, modularity, and compositionality. I will then discuss open challenges for compositional systems design optimization, and present my agenda to tackle them.
This talk reviews the concept of predictive multiplicity in machine learning. Predictive multiplicity arises when different classifiers achieve similar average performance for a specific learning task yet produce conflicting predictions for individual samples. We discuss a metric called “Rashomon Capacity” for quantifying predictive multiplicity in multi-class classification. We also present recent findings on the multiplicity cost of differentially private training methods and group fairness interventions in machine learning.
This talk is based on work published at ICML'20, NeurIPS'22, ACM FAccT'23, and NeurIPS'23.
Building General Purpose Vision Systems (GPVs) that can perform a huge variety of tasks has been a long-standing goal for the computer vision community. However, end-to-end training of these systems to handle different modalities and tasks has proven to be extremely challenging. In this talk, I will describe a lucrative neuro-symbolic alternative to the common end-to-end learning paradigm called Visual Programming. Visual Programming is a general framework that leverages the code-generation abilities of LLMs, existing neural models, and non-differentiable programs to enable powerful applications. Some of these applications continue to remain elusive for the current generation of end-to-end trained GPVs.
Inverse Optimal Control (IOC) aims to achieve an objective function corresponding to a certain task from an expert robot driven by optimal control, which has become a powerful tool in many applications in robotics. We will present our recent solutions to IOC based on incomplete observations of systems' trajectories, which enables an autonomous system to “sense-and-adapt", i.e., incrementally improving the learning of objective functions as new data arrives. This also leads to a distributed algorithm to solve IOC in multi-agent systems, in which each agent can only access part of the overall trajectory of an optimal control system and cannot solve IOC by itself. This is perhaps the first distributed method to IOC. Applications of IOC into human prediction will also be given.
Recent advances in multimodal models that fuse vision and language are revolutionizing robotics. In this lecture, I will begin by introducing recent multimodal foundational models and their applications in robotics. The second topic of this talk will address our recent work on multimodal language processing in robotics. The shortage of home care workers has become a pressing societal issue, and the use of domestic service robots (DSRs) to assist individuals with disabilities is seen as a possible solution. I will present our work on DSRs that are capable of open-vocabulary mobile manipulation, referring expression comprehension and segmentation models for everyday objects, and future captioning methods for cooking videos and DSRs.
Contact interactions are pervasive in key real-world robotic tasks like manipulation and walking. However, the non-smooth dynamics associated with impacts and friction remain challenging to model, and motion planning and control algorithms that can fluently and efficiently reason about contact remain elusive. In this talk, I will share recent work from my research group that takes an “optimization-first” approach to these challenges: collision detection, physics, motion planning, and control are all posed as constrained optimization problems. We then build a set of algorithmic and numerical tools that allow us to flexibly compose these optimization sub-problems to solve complex robotics problems involving discontinuous, unplanned, and uncertain contact mechanics.
Carbon capture, utilization, and storage (CCUS) is a promising pathway to decarbonize fossil-based power and industrial sectors and is a bridging technology for a sustainable transition to a net-zero emission energy future. This talk aims to provide an overview of design and optimization of CCUS systems. I will also attempt to give a brief perspective on emerging interests in process systems engineering research (e.g., systems integration, multiscale modeling, strategic planning, and optimization under uncertainty). The purpose is not to cover all aspects of PSE research for CCUS but rather to foster discussion by presenting some plausible future directions and ideas.
Quantum technology holds potential for revolutionizing how information is processed, transmitted, and acquired. While quantum computation and quantum communication have been among the well-known examples of quantum technology, it is increasingly recognized that quantum sensing is the application with the most potential for immediate wide-spread practical utilization. In this talk, I will provide an overview of the field of quantum sensing with nitrogen vacancy (NV) centers in diamond as a specific example. I will introduce the physical system of NV and describe some basic quantum sensing protocols. Then, I will present some state-of-the-art and examples where quantum sensors such as NV can accomplish what traditional sensors cannot. Lastly, I will discuss potential future directions in the area of NV quantum sensing.
Machine learning can be used to identify animals from their sound. This could be a valuable tool for biodiversity monitoring, and for understanding animal behaviour and communication. But to get there, we need very high accuracy at fine-grained acoustic distinctions across hundreds of categories in diverse conditions. In our group we are studying how to achieve this at continental scale. I will describe aspects of bioacoustic data that challenge even the latest deep learning workflows, and our work to address this. Methods covered include adaptive feature representations, deep embeddings and few-shot learning.
The talk will be divided into two parts. The first part of the talk introduces a class of first-order methods for constrained optimization that are based on an analogy to non-smooth dynamical systems. The key underlying idea is to express constraints in terms of velocities instead of positions, which has the algorithmic consequence that optimizations over feasible sets at each iteration are replaced with optimizations over local, sparse convex approximations. This results is a simplified suite of algorithms and an expanded range of possible applications in machine learning. In the second part of my talk, I will present a robot learning algorithm for trajectory tracking. The method incorporates prior knowledge about the system dynamics and by optimizing over feedforward actions, the risk of instability during deployment is mitigated. The algorithm will be evaluated on a ping-pong playing robot that is actuated by soft pneumatic muscles.
The decarbonization of buildings presents new challenges for the reliability of the electrical grid because of the intermittency of renewable energy sources and increase in grid load brought about by end-use electrification. To restore reliability, grid-interactive efficient buildings can provide flexibility services to the grid through demand response. Residential demand response programs are hindered by the need for manual intervention by customers. To maximize the energy flexibility potential of residential buildings, an advanced control architecture is needed. Reinforcement learning is well-suited for the control of flexible resources as it can adapt to unique building characteristics compared to expert systems. Yet, factors hindering the adoption of RL in real-world applications include its large data requirements for training, control security and generalizability. This talk will cover some of our recent work addressing these challenges. We proposed the MERLIN framework and developed a digital twin of a real-world 17-building grid-interactive residential community in CityLearn. We show that 1) independent RL-controllers for batteries improve building and district level KPIs compared to a reference RBC by tailoring their policies to individual buildings, 2) despite unique occupant behaviors, transferring the RL policy of any one of the buildings to other buildings provides comparable performance while reducing the cost of training, 3) training RL-controllers on limited temporal data that does not capture full seasonality in occupant behavior has little effect on performance. Although, the zero-net-energy (ZNE) condition of the buildings could be maintained or worsened because of controlled batteries, KPIs that are typically improved by ZNE condition (electricity price and carbon emissions) are further improved when the batteries are managed by an advanced controller.
In this talk, I will discuss our recent research on understanding post-hoc interpretability. I will begin by introducing a characterization of post-hoc interpretability methods as local function approximators, and the implications of this viewpoint, including a no-free-lunch theorem for explanations. Next, we shall challenge the assumption that post-hoc explanations provide information about a model's discriminative capabilities p(y|x) and instead demonstrate that many common methods instead rely on a conditional generative model p(x|y). This observation underscores the importance of being cautious when using such methods in practice. Finally, I will propose to resolve this via regularization of model structure, specifically by training low curvature neural networks, resulting in improved model robustness and stable gradients.
High-dimensional spatio-temporal dynamics can often be encoded in a low-dimensional subspace. Engineering applications for modeling, characterization, design, and control of such large-scale systems often rely on dimensionality reduction to make solutions computationally tractable in real-time. Common existing paradigms for dimensionality reduction include linear methods, such as the singular value decomposition (SVD), and nonlinear methods, such as variants of convolutional autoencoders (CAE). However, these encoding techniques lack the ability to efficiently represent the complexity associated with spatio-temporal data, which often requires variable geometry, non-uniform grid resolution, adaptive meshing, and/or parametric dependencies. To resolve these practical engineering challenges, we propose a general framework called Neural Implicit Flow (NIF) that enables a mesh-agnostic, low-rank representation of large-scale, parametric, spatial-temporal data. NIF consists of two modified multilayer perceptrons (MLPs): (i) ShapeNet, which isolates and represents the spatial complexity, and (ii) ParameterNet, which accounts for any other input complexity, including parametric dependencies, time, and sensor measurements. We demonstrate the utility of NIF for parametric surrogate modeling, enabling the interpretable representation and compression of complex spatio-temporal dynamics, efficient many-spatial-query tasks, and improved generalization performance for sparse reconstruction.
Robots can act as a force multiplier for people, whether a robot assisting an astronaut with a repair on the International Space station, a UAV taking flight over our cities, or an autonomous vehicle driving through our streets. Existing approaches use action-based representations that do not capture the goal-based meaning of a language expression and do not generalize to partially observed environments. The aim of my research program is to create autonomous robots that can understand complex goal-based commands and execute those commands in partially observed, dynamic environments. I will describe demonstrations of object-search in a POMDP setting with information about object locations provided by language, and mapping between English and Linear Temporal Logic, enabling a robot to understand complex natural language commands in city-scale environments. These advances represent steps towards robots that interpret complex natural language commands in partially observed environments using a decision theoretic framework.