Research Highlights

Highlights of significant research at MERL.

Take some time to look deeper into research done at MERL.

  • PS-NeuS: A Probability-guided Sampler for Neural Implicit Surface Rendering

    For more accurate and detailed 3D reconstructions and improved image rendering, especially for the regions of interest in a scene.

    Several variants of Neural Radiance Fields (NeRFs) have significantly improved the accuracy of image synthesis and surface reconstruction of 3D scenes/objects. A key characteristic of these methods is that they cannot afford to train the neural network using every possible input, specifically, every pixel and every 3D point along each pixel’s projection ray. While vanilla NeRFs uniformly sample both the image pixels and the 3D points along the projection rays, some variants guide the sampling of the 3D points along the projection rays. In this paper, we propose a guided sampling of both the image . . .
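
    To make the sampling distinction concrete, here is a minimal numpy sketch (illustrative only, not code from the paper) contrasting uniform depth sampling along a projection ray with inverse-transform sampling from a probability distribution over the ray; the toy weights peaking near t = 2.2 stand in for a coarse estimate of where the surface lies.

      import numpy as np

      rng = np.random.default_rng(0)
      near, far, n_samples = 0.5, 4.0, 8

      # Uniform (vanilla NeRF-style): evenly spaced depths along the ray.
      t_uniform = np.linspace(near, far, n_samples)

      # Guided: draw depths from a probability distribution over the ray,
      # e.g. weights from a coarse pass that peak near the surface.
      bins = np.linspace(near, far, 65)
      centers = 0.5 * (bins[:-1] + bins[1:])
      weights = np.exp(-0.5 * ((centers - 2.2) / 0.1) ** 2)  # toy surface at t = 2.2
      cdf = np.cumsum(weights / weights.sum())

      # Inverse-transform sampling concentrates samples where the pdf is high.
      u = rng.random(n_samples)
      t_guided = centers[np.minimum(np.searchsorted(cdf, u), len(centers) - 1)]

      print(np.sort(t_uniform))   # spread evenly over [near, far]
      print(np.sort(t_guided))    # clustered near the surface estimate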

  • Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-Aware Spatio-Temporal Sampling

    Motion-aware novel view rendering and tracking.

    Free-viewpoint rendering aims to produce a realistic rendering of a 3D scene, consistent with the scene's geometry, from any given viewing direction. To make such systems ubiquitous, it is essential that they be capable of handling dynamic scenes, i.e., those where objects change their position, configuration, or both over time. Existing approaches for this task propose pipelines that are agnostic to the semantic content of the scene and thus treat every region of 3D space as equally important when rendering. As a result, such systems struggle to render the regions of the scene that have high motion. In this paper, we depart from such a . . .
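
    As a rough illustration of motion-aware sampling (a hypothetical sketch, not the paper's pipeline), one can allocate more of a fixed sampling budget to regions with higher estimated motion:

      import numpy as np

      # Hypothetical per-region motion magnitudes estimated for a dynamic scene.
      motion = np.array([0.02, 0.05, 1.8, 0.3, 0.01])
      budget = 2048                  # total samples to spend across regions

      # Spend the budget in proportion to motion instead of treating all
      # regions of 3D space as equally important.
      samples = np.maximum(1, np.round(budget * motion / motion.sum()).astype(int))
      print(samples, samples.sum())  # high-motion regions get most samples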

  • TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

    Text-conditioned image-to-video generation: synthesizes a realistic video starting from a given image and a text description.

    Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or external modules. . . .

  • Generation or Replication: Auscultating Audio Latent Diffusion Models

    Identified partially replicated training examples from the full TANGO model.

    For each generated example, we show the top match found in the training set for both similarity methods explored in the paper: CLAP and mel. While the generated sounds are not identical to the training data, they show striking similarities in features such as event onsets, which appear to be replicated from the training data.
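
    The matching step can be pictured as a nearest-neighbor search over audio features. Below is a toy numpy sketch of cosine-similarity matching; the random arrays are placeholders for real CLAP or mel-spectrogram features.

      import numpy as np

      rng = np.random.default_rng(0)

      # Placeholder features: one feature vector (e.g., CLAP embedding or
      # flattened mel spectrogram) per audio clip.
      train_feats = rng.random((1000, 128))   # hypothetical training set
      gen_feat = rng.random(128)              # hypothetical generated clip

      def cosine_sim(a, b):
          return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

      # Report the closest training example to the generated clip.
      sims = np.array([cosine_sim(gen_feat, f) for f in train_feats])
      best = int(np.argmax(sims))
      print(f"top training match: {best}, similarity = {sims[best]:.3f}")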

  • Steered Diffusion

    A method that can be used for a wide variety of photorealistic conditional image generation tasks, including image colorization, super-resolution, semantic generation, identity replication, and text-guided editing.

    Capitalizing on the power of diffusion models that have been trained with unlabeled data for unconditional image generation, our work enables them to be repurposed for conditional image synthesis without the need for any retraining. Since no additional training is required for conditional generation, we call this zero-shot conditional image generation. Previous approaches utilizing diffusion models for zero-shot conditional generation can perform either label-based generation or fine-grained conditional generation. In . . .
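
    The general mechanism behind such steering (a hedged sketch of gradient-based guidance, not the paper's exact algorithm) is to nudge each reverse diffusion step with the gradient of a guidance loss on the current sample; the toy denoiser and loss below are stand-ins for a real pretrained model and constraint.

      import torch

      torch.manual_seed(0)

      def denoise_step(x, t):
          # Stand-in for one reverse step of a pretrained diffusion model.
          return x - 0.05 * x  # toy dynamics, not a real model

      def guidance_loss(x, target):
          # Hypothetical constraint, e.g. matching a low-resolution observation.
          return ((x.mean(dim=(-1, -2)) - target) ** 2).sum()

      x = torch.randn(1, 3, 8, 8)
      target = torch.tensor([0.2, 0.1, -0.3])
      scale = 0.5

      for t in range(50, 0, -1):
          x = x.detach().requires_grad_(True)
          grad = torch.autograd.grad(guidance_loss(x, target), x)[0]
          # Steer the unconditional step down the gradient of the constraint.
          x = denoise_step(x, t) - scale * grad

      print(float(guidance_loss(x.detach(), target)))  # loss shrinks as x is steered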

  • Robust Machine Learning

    Improving the natural-robust accuracy tradeoff for adversary-resilient deep learning.

    Deep learning is widely applied, yet incredibly vulnerable to adversarial examples, i.e., virtually imperceptible perturbations that fool deep neural networks (DNNs). We aim to develop robust machine learning technology: practical defenses that yield deep learning-based systems resilient to adversarial examples, built on a better theoretical understanding of the fragility of conventional DNNs.
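
    To see how such perturbations are crafted, here is a minimal PyTorch sketch of the standard fast gradient sign method (FGSM), a classic attack from the literature rather than a MERL method; the untrained linear model stands in for a real trained network.

      import torch
      import torch.nn.functional as F

      torch.manual_seed(0)

      model = torch.nn.Linear(10, 3)           # stand-in for a trained DNN
      x = torch.randn(1, 10, requires_grad=True)
      y = torch.tensor([1])
      eps = 0.05                               # small perturbation budget

      # FGSM: take a tiny step in the direction that maximizes the loss.
      loss = F.cross_entropy(model(x), y)
      loss.backward()
      x_adv = (x + eps * x.grad.sign()).detach()

      # Compare predictions on the clean and perturbed inputs.
      print(model(x).argmax().item(), model(x_adv).argmax().item())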

  • Edge-Assisted Internet of Vehicles for Smart Mobility

    A new paradigm for connected and automated driving.

    Intelligent transportation is a key enabler of Smart Cities. However, vehicular systems are highly complex and dynamic: mobile vehicles, pedestrians, road conditions, driver characteristics, and weather all play important roles. Connected and automated vehicles (CAVs), with the assistance of infrastructure units, allow sensing and control information to be communicated and acted upon to achieve intelligent transportation goals.

    Edge-assisted IoV reduces communication latency for real-time operation and utilizes the advanced features and data collection methods of the connected and automated vehicles to realize smart mobility functions. How to best . . .

  • Biosignal Processing for Human-Machine Interaction

    Pre-shot learning techniques to read your mind and biosignals for calibration-free brain-computer interface (BCI) and human-machine interaction (HMI).

    Realizing sci-fi scenes in which an intelligent robot can read your thoughts may no longer be a distant dream, thanks to rapid progress in robotics, sensors, and artificial intelligence (AI). Biosignal processing to analyze a person's physiological states is a key enabling technology for mind sensing in HMI and BCI systems. When machine intelligence can collaboratively support human intelligence without conflict, HMI systems will enable breakthroughs in various scenarios, including teleworking, maintaining remote facilities, disaster response, epidemic care, . . .

  • Video Anomaly Detection

    This research tackles the problem of automatically detecting unusual activity in video sequences.

    This research tackles the problem of automatically detecting unusual activity in video sequences. To solve the problem, an algorithm is first given video sequences from a fixed camera showing normal activity. A model representing normal activity is created and used to evaluate new video sequences from the same fixed camera. Any parts of the testing video that do not match the model formed from normal video are considered anomalous.

    We describe two variations of a novel algorithm for video anomaly detection which we evaluate along with two previously published algorithms on the Street Scene dataset (described later). . . .
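
    A simple way to picture the "match against a model of normal activity" step (a toy sketch, not the Street Scene algorithms themselves) is a nearest-neighbor distance from a test feature to features extracted from normal training video:

      import numpy as np

      rng = np.random.default_rng(0)

      # Features extracted from spatio-temporal patches of normal training
      # video (placeholder random features for illustration).
      normal_feats = rng.normal(size=(5000, 64))

      def anomaly_score(feat, bank):
          # Distance to the closest normal exemplar: large => anomalous.
          return np.linalg.norm(bank - feat, axis=1).min()

      test_feat = rng.normal(size=64)   # feature from a new test patch
      print(anomaly_score(test_feat, normal_feats))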

  • Seamless Speech Recognition

    A new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words.

    We describe a new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words. The system can also understand multiple people speaking either the same or different languages simultaneously.

  • Video Description

    A novel neural network architecture that fuses multimodal information using a modality-dependent attention mechanism.

    Understanding scenes through sensed information is a fundamental challenge for man-machine interfaces. We aim to develop methods for learning semantic representations from multimodal information, including both visual and audio data, as the basis for intelligent communication and interfacing with machines. Toward this goal, we invented a modality-dependent attention mechanism for video captioning based on encoder-decoder sentence generation using recurrent neural networks (RNNs).
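
    The core idea of modality-dependent attention can be sketched in a few lines of PyTorch (an illustrative toy with made-up dimensions, not the published architecture): the decoder state scores each modality's context vector, and the softmax weights decide how much visual versus audio information feeds the next word.

      import torch
      import torch.nn.functional as F

      torch.manual_seed(0)
      d = 32

      # Per-modality context vectors, e.g. from visual and audio encoders.
      visual_ctx = torch.randn(1, d)
      audio_ctx = torch.randn(1, d)
      decoder_state = torch.randn(1, d)

      # Score each modality against the current decoder state, then fuse
      # with weights that depend on what the decoder is about to generate.
      W = torch.nn.Linear(2 * d, 1)
      scores = torch.cat([
          W(torch.cat([decoder_state, visual_ctx], dim=-1)),
          W(torch.cat([decoder_state, audio_ctx], dim=-1)),
      ], dim=-1)
      alpha = F.softmax(scores, dim=-1)              # modality attention weights
      fused = alpha[:, :1] * visual_ctx + alpha[:, 1:] * audio_ctx
      print(alpha)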

  • Speech Enhancement

    NMF meets Kalman filter dynamics for high-quality speech enhancement in non-stationary noise.

    Non-negative data arise in a variety of important signal processing domains, such as power spectra of signals, pixels in images, and count data. We introduce a novel non-negative dynamical system for sequences of such data, and describe its application to modeling speech and audio power spectra.
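
    For intuition, the NMF half of the model factors a non-negative power spectrogram V into non-negative spectral bases W and activations H; the sketch below shows plain multiplicative updates on toy data (the paper additionally places Kalman-filter-style dynamics on the activations, which is omitted here):

      import numpy as np

      rng = np.random.default_rng(0)
      V = rng.random((64, 100)) + 1e-6   # toy power spectrogram (freq x time)
      K = 8                              # number of basis spectra

      W = rng.random((64, K))            # spectral bases (e.g. speech + noise)
      H = rng.random((K, 100))           # non-negative activations over time

      # Multiplicative updates keep W, H non-negative while reducing error.
      for _ in range(200):
          H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
          W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

      print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative error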

  • MERL Shopping Dataset

    We describe a recurrent deep network for detecting actions in video sequences.

    This research attempts to solve the problem of finding particular actions occurring in a video. Much of the past work in this field has looked at the related problem of action recognition. In action recognition, the algorithm is given a short video clip of an action and asked to classify which action is present. In contrast, the problem of action detection requires the algorithm to look through a long video and find the start and stop points of all instances of each known action. We consider action detection to be a more difficult, but much more useful problem to solve in practice.
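
    The detection output can be pictured as turning per-frame scores into (start, stop) intervals; below is a toy numpy sketch of that final step (simple thresholding for illustration, not the recurrent network described here):

      import numpy as np

      rng = np.random.default_rng(0)

      # Per-frame classifier scores for one action class (placeholder values).
      scores = rng.random(300)
      active = scores > 0.8          # frames where the action is likely present

      # Convert frame-level decisions into (start, stop) detections.
      edges = np.diff(active.astype(int))
      starts = list(np.where(edges == 1)[0] + 1)
      stops = list(np.where(edges == -1)[0] + 1)
      if active[0]:
          starts = [0] + starts
      if active[-1]:
          stops = stops + [len(active)]
      print(list(zip(starts, stops))[:5])   # first few detected intervals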

  • Deep Clustering

    Training deep discriminative embeddings to solve the cocktail party problem.

    The human auditory system gives us the extraordinary ability to converse in the midst of a noisy throng of party goers. Solving this so-called cocktail party problem has proven extremely challenging for computers, and separating and recognizing speech in such conditions has been the holy grail of speech processing for more than 50 years. Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering, producing unprecedented speaker-independent single-channel separation performance on two-speaker and three-speaker mixtures.
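
    Conceptually, the network maps each time-frequency bin to an embedding, and clustering those embeddings yields one mask per speaker. Below is a toy numpy sketch with random embeddings and a small k-means (dimensions are made up for illustration, and the embeddings here are not discriminatively trained):

      import numpy as np

      rng = np.random.default_rng(0)
      F_, T, D = 129, 50, 20             # freq bins, frames, embedding dim

      # Hypothetical network output: one D-dim unit embedding per T-F bin.
      emb = rng.normal(size=(F_ * T, D))
      emb /= np.linalg.norm(emb, axis=1, keepdims=True)

      def kmeans(X, k, iters=20):
          C = X[rng.choice(len(X), k, replace=False)]
          for _ in range(iters):
              d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
              labels = d2.argmin(1)
              for j in range(k):
                  if np.any(labels == j):       # skip empty clusters
                      C[j] = X[labels == j].mean(0)
          return labels

      # Cluster the embeddings; each cluster yields a binary mask per speaker.
      labels = kmeans(emb, k=2)
      masks = [(labels == j).reshape(F_, T) for j in range(2)]
      print(masks[0].mean(), masks[1].mean())   # fraction of bins per speaker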

  • Point-Plane SLAM

    Real-time 3D reconstruction using an RGB-D sensor on a tablet.

    We present a real-time 3D reconstruction system using an RGB-D sensor on a hand-held tablet. The main novelty of the system is a simultaneous localization and mapping (SLAM) algorithm that uses both point and plane features as primitives. Planes are the most common structures in man-made indoor and outdoor scenes.

    As the core of the algorithm, we show that it is possible to register 3D data in two different coordinate systems using any combination of three point/plane features (3 planes, 2 planes and 1 point, 1 plane and 2 points, and 3 points). We use the minimal set of features in a RANSAC framework to robustly compute correspondences and estimate the . . .
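
    The sketch below illustrates the RANSAC idea with the 3-point minimal case, using the standard Kabsch alignment (the paper's plane/point combinations are omitted for brevity); toy correspondences with a few injected outliers stand in for real RGB-D data.

      import numpy as np

      rng = np.random.default_rng(0)

      def kabsch(P, Q):
          # Rigid transform aligning point set P onto Q (3-point minimal case).
          cP, cQ = P.mean(0), Q.mean(0)
          U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
          S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
          R = Vt.T @ S @ U.T
          return R, cQ - R @ cP

      # Toy data: a known rotation/translation plus a few outlier matches.
      R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
      t_true = np.array([0.3, -0.1, 0.5])
      src = rng.normal(size=(40, 3))
      dst = src @ R_true.T + t_true
      dst[:5] += rng.normal(scale=2.0, size=(5, 3))   # corrupted matches

      best_inliers = 0
      R_best, t_best = np.eye(3), np.zeros(3)
      for _ in range(100):
          idx = rng.choice(len(src), 3, replace=False)  # minimal 3-feature set
          R, t = kabsch(src[idx], dst[idx])
          err = np.linalg.norm(src @ R.T + t - dst, axis=1)
          inliers = int((err < 0.05).sum())
          if inliers > best_inliers:
              best_inliers, R_best, t_best = inliers, R, t

      print(best_inliers, np.allclose(R_best, R_true, atol=1e-6))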

  • mmWave Beam-SNR Fingerprinting (mmBSF)

    mmWave Beam-SNR Fingerprinting (mmBSF) for Precise Indoor Localization using Commercial-Off-The-Shelf (COTS) Routers.

    We describe our in-house dataset and an approach to fingerprinting-based indoor localization using COTS mmWave WiFi routers compliant with the IEEE 802.11ad standard.
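
    Fingerprinting-based localization has an offline survey phase and an online matching phase; the toy numpy sketch below uses random placeholder beam-SNR vectors and weighted k-nearest-neighbor matching (illustrative only, not the mmBSF pipeline):

      import numpy as np

      rng = np.random.default_rng(0)

      # Offline phase: beam-SNR fingerprints collected at known locations
      # (placeholder random values; real fingerprints come from the routers).
      n_loc, n_beams = 50, 32
      locations = rng.uniform(0, 10, size=(n_loc, 2))   # (x, y) in meters
      fingerprints = rng.normal(size=(n_loc, n_beams))  # SNR per beam

      # Online phase: match a new beam-SNR measurement to the database and
      # average the k closest surveyed locations (weighted k-NN).
      query = fingerprints[7] + rng.normal(scale=0.1, size=n_beams)
      d = np.linalg.norm(fingerprints - query, axis=1)
      k = 3
      nn = np.argsort(d)[:k]
      w = 1.0 / (d[nn] + 1e-6)
      estimate = (w[:, None] * locations[nn]).sum(0) / w.sum()
      print(locations[7], estimate)   # true vs. estimated position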