Recent research tested leading AI coding agents on building full applications, uncovering 143 vulnerabilities across models, including improper token handling.
In a report on coding security released on March 11, 2026, an “AI-native” cybersecurity firm claimed to have discovered significant security shortcomings in leading AI coding tools.
DryRun Security, an Austin, Texas-based firm, tested Anthropic’s Claude, OpenAI’s Codex, and Google’s Gemini by tasking them with developing two full applications — a family allergy tracker web app and a browser racing game — via sequential pull requests mimicking real engineering workflows.
Across 38 scans, 143 vulnerabilities surfaced, with 87% of pull requests introducing at least one flaw, according to a report in Yahoo News:
- Claude generated the most unresolved high-severity issues in the final codebases
- Codex showed the strongest remediation, fixing more problems iteratively and ending with the fewest critical vulnerabilities
- Gemini placed between them, addressing some early flaws in later changes but still leaving multiple severe risks
- None of the coding agents produced a secure product, as all overlooked key protections
- The AI coding agents generated functional software quickly, but security was not built into their processes, and the bots often skipped essential features or botched authentication logic
- Common failures spanned all models, including improper JSON Web Token handling, no defenses against brute-force attacks, susceptibility to token replay exploits, and weak refresh token cookie settings.
- Authentication safeguards, when created for REST APIs, were inconsistently applied to WebSocket endpoints, leaving parts of the applications exposed.
These results amplify ongoing enterprise worries about AI-assisted coding. A February 2026 study found that over 25% of AI-generated code contained OWASP Top 10 vulnerabilities, but DryRun’s recent work uniquely tracks how flaws compound over full development cycles.
As software development teams speed up their projects via agents, continuous scanning during development workflows, not just end-stage reviews, is vital to curb risk buildup and technical debt, according to industry observers.
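One way a team could act on that advice is to run a static-analysis scan on every pull request rather than only before release. The fragment below is a sketch assuming GitHub Actions and the open-source Semgrep scanner; any comparable SAST tool slots into the same step.

```yaml
name: security-scan
on: [pull_request]        # scan every PR, not just a pre-release audit
jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install semgrep
      # --error makes the job exit non-zero on findings, blocking the merge
      - run: semgrep scan --config auto --error
```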