TR2026-017

Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models


    •  Rashid, M.R.U., Dasu, V.A., Wang, Y., Tan, G.G., Mehnaz, S., "Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models", AAAI Conference on Artificial Intelligence, January 2026.
      BibTeX:
      @inproceedings{Rashid2026jan,
        author = {Rashid, Md Rafi Ur and Dasu, Vishnu Asutosh and Wang, Ye and Tan, G. Gary and Mehnaz, Shagufta},
        title = {{Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models}},
        booktitle = {AAAI Conference on Artificial Intelligence},
        year = 2026,
        month = jan,
        url = {https://www.merl.com/publications/TR2026-017}
      }
    Research Areas: Artificial Intelligence, Machine Learning

Abstract:

Large Language Models (LLMs) exhibit impressive capabilities but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. A comprehensive evaluation on four adversarial benchmarks with four of the latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to ≤4%. ASE outperforms six state-of-the-art defenses in robustness–seamlessness trade-offs, with 92–99% accuracy on adversarial Q&A and 4–10× lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.
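
To make the inference-time idea concrete, the following minimal Python sketch shows one way a CoT-style "contemplate adversarial scenarios, then answer" scaffold could be wired around an arbitrary LLM call. The prompt wording, the ase_style_respond helper, and the llm_generate hook are illustrative assumptions for this sketch, not the paper's actual templates or implementation.

    from typing import Callable

    # Hypothetical prompt scaffold (illustrative only): ask the model to first
    # imagine adversarial scenarios for the incoming query and draft defensive
    # strategies, and only then produce the final response.
    ASE_STYLE_PREAMBLE = (
        "Before answering, reason step by step:\n"
        "1. List plausible adversarial scenarios this query could enable "
        "(jailbreaks, toxic content, misinformation, biased framing).\n"
        "2. For each scenario, note a defensive strategy that still allows a "
        "helpful, non-refusing answer when the query is benign.\n"
        "3. Then write the final response, applying those strategies.\n"
    )

    def ase_style_respond(query: str, llm_generate: Callable[[str], str]) -> str:
        """Wrap a user query with the adversarial-scenario reasoning preamble.

        llm_generate is any function mapping a prompt string to the model's
        completion (e.g., a thin wrapper around an LLM client of your choice).
        """
        prompt = f"{ASE_STYLE_PREAMBLE}\nUser query: {query}\nFinal response:"
        return llm_generate(prompt)

    if __name__ == "__main__":
        # Stand-in "model" for demonstration; replace with a real LLM call.
        echo_model = lambda p: f"[model would reason over]\n{p}"
        print(ase_style_respond("How do I secure my home Wi-Fi?", echo_model))

Because the scaffold operates purely at inference time, it requires no fine-tuning and can, in principle, be layered on top of any instruction-following model exposed through a text-in/text-out interface.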