Autonomous Role-based Guard for UI Security.
ARGUS is a lightweight, browser-driven agentic AI framework designed to automate the detection of Insecure Direct Object Reference (IDOR) vulnerabilities — the most prevalent form of Broken Access Control in modern web applications.
By combining multimodal screenshot analysis with cross-role API probing through an Eyes-Brain-Hands architecture, ARGUS bridges the gap between static scanners and manual penetration testing — operating entirely as a black-box tool requiring no source code access.
Abstract
Broken Access Control remains the #1 risk in the OWASP Top 10:2025, with 100% of tested applications exhibiting some form of broken access control. This paper introduces ARGUS, a browser-based multi-agent system that combines Large Language Models for reasoning, Playwright for browser automation, and Retrieval-Augmented Generation (RAG) to maintain context across multi-step user flows.
Evaluated against OWASP Juice Shop, ARGUS demonstrates that model selection and prompt specificity both materially affect detection reliability and efficiency. Lighter models require explicit workflow scaffolding; stronger models may perform better without it.
Jubril A. Akanbi
Northeastern University
Linghe Zhou
Northeastern University
Yiyang Wang
Northeastern University
Dr. Maryam Tanha
Northeastern University (Supervisor)
Eyes — Perception Layer
Captures screenshots, DOM content, and network traffic through Playwright to build a comprehensive picture of application state at every step.
Brain — Reasoning Engine
Three specialized LLM agents — Discovery, Strategy, and Analyzer — coordinated by an Orchestrator, with a RAG module grounded in OWASP patterns.
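As a rough illustration of how an orchestrator might hand state between the three agents, here is a minimal sketch. It assumes a simple sequential order (Discovery, then Strategy, then Analyzer), which the description above does not specify. Each agent is a stub function standing in for an LLM call, and the state keys, endpoint paths, role names, and canned responses are all hypothetical.

```python
# Each "agent" is a plain function here; a real system would wrap an LLM call.
# All names, paths, and state keys below are hypothetical.

def discovery_agent(state: dict) -> dict:
    """Stub: pretend the crawl found two object-bearing endpoints."""
    return {**state, "endpoints": ["/api/baskets/1", "/api/orders/7"]}


def strategy_agent(state: dict) -> dict:
    """Stub: plan to replay each endpoint under every role except the first (assumed owner)."""
    plan = [(ep, role) for ep in state["endpoints"] for role in state["roles"][1:]]
    return {**state, "plan": plan}


def analyzer_agent(state: dict) -> dict:
    """Stub: flag planned probes whose canned response status was 200."""
    findings = [probe for probe in state["plan"] if state["responses"].get(probe) == 200]
    return {**state, "findings": findings}


def orchestrate(state: dict, agents) -> dict:
    """Run the agents in order, threading the shared state through each one."""
    for agent in agents:
        state = agent(state)
    return state


initial_state = {
    "roles": ["alice", "bob"],
    # Canned responses stand in for real replayed requests.
    "responses": {("/api/baskets/1", "bob"): 200, ("/api/orders/7", "bob"): 403},
}
result = orchestrate(initial_state, [discovery_agent, strategy_agent, analyzer_agent])
print(result["findings"])
```

Passing an immutable-style state dictionary from agent to agent keeps each step easy to log and replay, which matters when the agents themselves are non-deterministic.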
Hands — Action Layer
Playwright-powered browser automation: navigation, UI interaction, API request replay, and role/session switching for cross-role testing.
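The cross-role replay idea can be illustrated without a browser. The following is a minimal sketch, not ARGUS's implementation: it uses only the Python standard library in place of Playwright, and the toy server, the `/api/baskets/<id>` route, the token values, and the `replay_as` helper are hypothetical names chosen for illustration. The server deliberately omits an ownership check so that replaying one user's request under another user's session succeeds.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical session tokens and basket ownership for a toy target.
SESSIONS = {"token-alice": "alice", "token-bob": "bob"}
BASKETS = {"1": {"owner": "alice", "items": ["apple juice"]},
           "2": {"owner": "bob", "items": ["banana juice"]}}


class VulnerableHandler(BaseHTTPRequestHandler):
    """Toy API: checks that the caller is logged in, but not that they own the basket."""

    def do_GET(self):
        user = SESSIONS.get(self.headers.get("Authorization", ""))
        basket = BASKETS.get(self.path.rsplit("/", 1)[-1])
        if user is None:
            status, body = 401, {"error": "unauthenticated"}
        elif basket is None:
            status, body = 404, {"error": "not found"}
        else:
            # Missing ownership check (basket["owner"] == user): this is the IDOR.
            status, body = 200, basket
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass


# Ignore any proxy settings from the environment; the target is local.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def replay_as(base_url, path, token):
    """Replay one GET request under the given session token; return (status, body)."""
    request = urllib.request.Request(base_url + path, headers={"Authorization": token})
    try:
        with OPENER.open(request, timeout=5) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        return err.code, json.loads(err.read())


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), VulnerableHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = f"http://127.0.0.1:{server.server_port}"
    try:
        owner = replay_as(base_url, "/api/baskets/1", "token-alice")
        other = replay_as(base_url, "/api/baskets/1", "token-bob")
        unknown = replay_as(base_url, "/api/baskets/1", "token-expired")
    finally:
        server.shutdown()
        server.server_close()
    return owner, other, unknown


if __name__ == "__main__":
    for label, (status, body) in zip(("owner", "other user", "unknown session"), run_demo()):
        print(label, status, body)
```

In this sketch, the second request (Bob reading Alice's basket) succeeding is the signal an analyzer would flag. A correctly protected endpoint would be expected to answer that request with 403 or 404 instead.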
Coordinated by the Access Map — a dynamic knowledge structure recording endpoints, roles, and observed authorization outcomes. The Strategy Agent is augmented with a ChromaDB-backed RAG module for OWASP-grounded decision making.
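One way to picture the Access Map is as a growing list of (endpoint, role, outcome) observations that can be queried for suspicious combinations. The sketch below is an assumption about its shape, not the structure ARGUS actually uses: the class names, field names, and the `idor_candidates` rule are hypothetical. For simplicity it treats each user as its own role and ignores roles such as an administrator that may legitimately read other users' data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    """One observed request outcome (all field names are illustrative)."""
    endpoint: str        # e.g. "GET /api/baskets/1"
    role: str            # role or session that sent the request
    resource_owner: str  # who the object is believed to belong to
    status: int          # HTTP status code that came back


@dataclass
class AccessMap:
    """Accumulates observations and surfaces cross-role access that succeeded."""
    observations: list = field(default_factory=list)

    def record(self, endpoint, role, resource_owner, status):
        self.observations.append(Observation(endpoint, role, resource_owner, status))

    def roles_for(self, endpoint):
        """Roles that have been tried against an endpoint so far."""
        return {o.role for o in self.observations if o.endpoint == endpoint}

    def idor_candidates(self):
        """Successful (2xx) responses where the caller is not the resource owner."""
        return [o for o in self.observations
                if 200 <= o.status < 300 and o.role != o.resource_owner]


access_map = AccessMap()
access_map.record("GET /api/baskets/1", "alice", "alice", 200)  # owner reads own basket
access_map.record("GET /api/baskets/1", "bob", "alice", 200)    # non-owner also succeeds
access_map.record("GET /api/baskets/2", "alice", "bob", 403)    # non-owner is refused
candidates = access_map.idor_candidates()
print(candidates)
```

Keeping raw observations rather than a single verdict per endpoint lets later steps ask new questions of the same data, such as which roles have not yet been tried against an endpoint.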
Model choice and prompt design both drive outcomes.
Two controlled experiments against OWASP Juice Shop (3 trials per condition) evaluated how model capability and prompt specificity affect IDOR detection reliability and efficiency.
| Model | Structured Prompt — Avg. Iterations | Loose Prompt — Avg. Iterations | Key Insight |
|---|---|---|---|
| Haiku-4.5 | 5.3 | N/A (0 of 3 runs succeeded) | Completely dependent on explicit workflow structure; success rate collapses to 0% without it. |
| Qwen-3.6-plus | 3.7 | 5.7 | Maintains 100% success but compensates with ~57% more output tokens under loose prompting. |
| GPT-5-mini | 8.0 | 4.0 | Performs better without explicit steps — the structured prompt introduces unnecessary overhead. |
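To make the iteration differences in the table concrete, the relative change between prompt styles can be worked out directly from the reported averages. This short calculation uses only the numbers shown above; the variable and function names are illustrative. It covers iteration counts only, so it is a separate measure from the output-token figure quoted for Qwen-3.6-plus.

```python
# Average iterations to detection, copied from the table above.
# Haiku-4.5 is omitted: it succeeded in 0 of 3 loose-prompt runs,
# so there is no loose-prompt iteration average to compare.
avg_iterations = {
    "Qwen-3.6-plus": {"structured": 3.7, "loose": 5.7},
    "GPT-5-mini": {"structured": 8.0, "loose": 4.0},
}


def relative_change(structured, loose):
    """Percent change in average iterations when moving from structured to loose."""
    return round((loose - structured) / structured * 100, 1)


changes = {model: relative_change(**avgs) for model, avgs in avg_iterations.items()}
for model, pct in changes.items():
    print(f"{model}: {pct:+.1f}% iterations under the loose prompt")
```

Under these figures, Qwen-3.6-plus needs roughly 54% more iterations with the loose prompt, while GPT-5-mini needs 50% fewer.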
Reliability is not determined by model tier
Detection success is not guaranteed by model capability alone — prompt structure materially affects outcomes across all tiers.
Light models need scaffolding
Haiku-4.5 dropped from 100% to 0% without an explicit phased workflow — structured prompts are a functional requirement for lower-tier models.
Stronger models may be over-constrained
GPT-5-mini achieved its best results (4.0 iterations on average) under a loosened prompt, suggesting that structured workflows can introduce unnecessary overhead.
Agentic security testing is a viable direction.
ARGUS demonstrates that a browser-based, multimodal LLM agent can autonomously detect IDOR vulnerabilities with meaningful reliability on the tested target, OWASP Juice Shop. The findings indicate that careful co-design of model selection and prompt strategy is essential for deploying such systems in practice.
Future work should prioritize extending beyond read-only GET-based probing to cover PUT, PATCH, and DELETE methods; expanding the vulnerability taxonomy to include RBAC bypass and privilege escalation; and generalizing ARGUS to work across diverse web applications as a configurable, user-facing tool.
Expand IDOR testing to PUT/PATCH/DELETE HTTP methods for unauthorized modification detection
Extend vulnerability taxonomy: RBAC bypass, privilege escalation, mass assignment
Build a re-seeding pipeline for the RAG knowledge base to keep pace with evolving vulnerability patterns
Generalize ARGUS for unseen web applications — single-target to multi-target configuration
Package as a configurable, user-facing tool for security practitioners
Interested in this research?
Whether you want to collaborate, apply this approach to your application, or discuss findings — we'd like to hear from you.