TR2026-054

GPT Sonography: Hand Gesture Decoding from Forearm Ultrasound Images via a Large Vision-Language Model


    •  Bimbraw, K., Wang, Y., Liu, J., Koike-Akino, T., "GPT Sonography: Hand Gesture Decoding from Forearm Ultrasound Images via a Large Vision-Language Model", IEEE Access, May 2026.
      @article{Bimbraw2026may,
        author = {Bimbraw, Keshav and Wang, Ye and Liu, Jing and Koike-Akino, Toshiaki},
        title = {{GPT Sonography: Hand Gesture Decoding from Forearm Ultrasound Images via a Large Vision-Language Model}},
        journal = {IEEE Access},
        year = 2026,
        month = may,
        url = {https://www.merl.com/publications/TR2026-054}
      }
  • Research Areas: Artificial Intelligence, Machine Learning

Abstract:

Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models with great potential as powerful artificial-intelligence (AI) assistants across healthcare, industrial, and academic sectors. Although such foundation models perform well on a wide range of general tasks, their capability without fine-tuning is often limited on specialized tasks, and full fine-tuning of large foundation models is challenging due to enormous computation, memory, and dataset requirements. Forearm ultrasound data have been shown to be useful for hand gesture estimation, but this typically requires training deep learning models on large quantities of labeled data. We show that GPT-4o can decode hand gestures from forearm ultrasound data even with no fine-tuning, and that its accuracy improves with few-shot, retrieval-augmented in-context learning. In our experiments, the average classification accuracy improved from 19.3% (0-shot) to 74.0% (2-shot) for within-session testing, and from 20.0% (0-shot) to 61.3% (3-shot) for cross-session testing. These results demonstrate the potential of LVLMs for ultrasound-based gesture recognition, offering an alternative to prior pipelines that require dedicated model training and large labeled datasets, and show that few-shot in-context learning with retrieval-augmented example selection can substantially improve performance without any model fine-tuning.
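
To make the approach concrete, below is a minimal Python sketch of retrieval-augmented few-shot prompting with GPT-4o via the OpenAI chat-completions API. The gesture label set, the retrieval features (cosine similarity over per-image feature vectors), the prompt wording, and helper names such as retrieve_examples and classify are illustrative assumptions for this sketch, not the paper's exact pipeline.

    # Sketch: retrieval-augmented few-shot gesture classification with GPT-4o.
    # Assumptions (not from the paper): label set, retrieval features, prompt text.
    import base64
    import numpy as np
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    GESTURES = ["fist", "open hand", "pinch", "point"]  # hypothetical label set

    def to_data_url(png_bytes: bytes) -> str:
        """Encode a PNG image as a base64 data URL for the chat API."""
        return "data:image/png;base64," + base64.b64encode(png_bytes).decode()

    def retrieve_examples(query_feat, bank_feats, k=2):
        """Indices of the k labeled images most similar to the query
        (cosine similarity over feature vectors, e.g. flattened pixels
        or embeddings; the paper's actual features may differ)."""
        q = query_feat / np.linalg.norm(query_feat)
        b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
        return np.argsort(b @ q)[::-1][:k]

    def classify(query_png, query_feat, bank_pngs, bank_feats, bank_labels, k=2):
        """Few-shot in-context classification: retrieved labeled images are
        placed in the prompt before the query image; no fine-tuning."""
        content = [{"type": "text",
                    "text": "Classify the forearm ultrasound image as one of: "
                            + ", ".join(GESTURES) + ". Answer with the label only."}]
        for i in retrieve_examples(query_feat, bank_feats, k):
            content.append({"type": "image_url",
                            "image_url": {"url": to_data_url(bank_pngs[i])}})
            content.append({"type": "text", "text": f"Label: {bank_labels[i]}"})
        content.append({"type": "image_url",
                        "image_url": {"url": to_data_url(query_png)}})
        resp = client.chat.completions.create(
            model="gpt-4o", messages=[{"role": "user", "content": content}])
        return resp.choices[0].message.content.strip()

The retrieval step simply ensures that the in-context examples preceding the query are the labeled images most similar to it; per the abstract, this kind of few-shot, retrieval-augmented prompting is what lifts accuracy from roughly 20% (0-shot) to 61-74% without any model fine-tuning.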