TR2026-036

Evaluating Security Policy Compliance in Infrastructure as Code Generated by Large Language Models


    •  Ryo, H., Wang, Y., Koike-Akino, T., Liu, J., Parsons, K., Hato, J., "Evaluating Security Policy Compliance in Infrastructure as Code Generated by Large Language Models", International Symposium on Digital Forensics and Security, March 2026.
      BibTeX:

      @inproceedings{Ryo2026mar,
        author = {Ryo, Hase and Wang, Ye and Koike-Akino, Toshiaki and Liu, Jing and Parsons, Kieran and Hato, Jumpei},
        title = {{Evaluating Security Policy Compliance in Infrastructure as Code Generated by Large Language Models}},
        booktitle = {International Symposium on Digital Forensics and Security},
        year = 2026,
        month = mar,
        url = {https://www.merl.com/publications/TR2026-036}
      }
Research Areas: Artificial Intelligence, Machine Learning

Abstract:

Infrastructure as Code (IaC) automates cloud resource provisioning, yet developing and maintaining IaC scripts remains challenging due to variations in domain-specific languages across providers. Recent advances in large language models (LLMs) offer promise for automating IaC generation, but the security policy compliance of LLM-generated IaC scripts is as important as their deployability. In this work, we empirically evaluate configuration-level policy violations in LLM-generated IaC scripts using the IaC-Eval benchmark and Checkov for security policy assessment. Our results indicate that modern LLMs available as of 2025 exhibit improved syntactic correctness and better alignment with user intent, particularly when retries and error feedback are used. However, security policy violations persist in the generated scripts, typically ranging from around five to fifteen per script across the six difficulty levels defined in IaC-Eval. These results underscore the necessity of rigorous verification before deploying generated IaC scripts.
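The per-script violation counts described above can be reproduced by running Checkov against each generated script (e.g., `checkov -f main.tf -o json`) and tallying the failed checks in its JSON report. A minimal sketch of that tallying step follows; the field names mirror Checkov's JSON output schema, but the report contents here are invented for illustration and do not come from the paper's experiments:

```python
import json

# Hypothetical excerpt of a Checkov JSON report for one generated IaC script.
# The structure (results.failed_checks, summary) follows Checkov's output format;
# the specific checks and resource names are made up for this example.
sample_report = json.loads("""
{
  "check_type": "terraform",
  "results": {
    "failed_checks": [
      {"check_id": "CKV_AWS_19",
       "check_name": "Ensure the S3 bucket has server-side encryption enabled",
       "resource": "aws_s3_bucket.example"},
      {"check_id": "CKV_AWS_20",
       "check_name": "Ensure the S3 bucket does not allow READ permissions to everyone",
       "resource": "aws_s3_bucket.example"}
    ],
    "passed_checks": []
  },
  "summary": {"passed": 0, "failed": 2, "skipped": 0, "parsing_errors": 0}
}
""")

def count_violations(report: dict) -> int:
    """Return the number of failed security checks in one Checkov report."""
    return len(report["results"]["failed_checks"])

print(count_violations(sample_report))  # 2 policy violations in this script
```

Averaging such counts over the scripts an LLM generates at each IaC-Eval difficulty level yields the per-script violation figures of the kind reported in the abstract.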