Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation
Because the dataset exceeds 1 TB, this page shows only one case per CWE type, each with the affected code file in both its vulnerable and fixed versions. Select a CWE to view its five variants (Original, Noise, Obfuscated, Noise+Obf, Prompt Injection); the complete projects are available via the GitHub links in the CWE descriptions below.
Prototype pollution in src/index.js::override() lets attacker-controlled keys overwrite Object.prototype, which can break or hijack Node.js apps that use deep-override while merging configs. The vulnerable release blocks only __proto__; the fixed release blocks ['__proto__','constructor','prototype'] and keeps the merge logic. Repo: ASaiAnudeep/deep-override.
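To illustrate why blocking only `__proto__` is not enough, here is a minimal sketch of a recursive merge with a configurable key blocklist. It is not the deep-override source: `makeMerge`, `mergeBlockingProtoOnly`, `mergeBlockingAll`, and the payload are hypothetical names chosen for this example, and the merge logic is an assumed, simplified stand-in for `override()`.

```javascript
// Hypothetical deep merge for illustration only; NOT the deep-override code.
function makeMerge(blockedKeys) {
  return function merge(target, source) {
    for (const key of Object.keys(source)) {
      if (blockedKeys.includes(key)) continue; // skip keys on the blocklist
      const value = source[key];
      if (value !== null && typeof value === "object" && !Array.isArray(value)) {
        // Reuse whatever already sits at target[key], including inherited
        // members such as `constructor`; this is what makes the walk dangerous.
        if (target[key] === undefined || target[key] === null) {
          target[key] = {};
        }
        merge(target[key], value);
      } else {
        target[key] = value;
      }
    }
    return target;
  };
}

// Blocklist of the vulnerable release vs. blocklist of the fixed release.
const mergeBlockingProtoOnly = makeMerge(["__proto__"]);
const mergeBlockingAll = makeMerge(["__proto__", "constructor", "prototype"]);

// Attacker-controlled input that reaches Object.prototype via `constructor`.
const payload = JSON.parse('{"constructor": {"prototype": {"polluted": true}}}');

mergeBlockingProtoOnly({}, payload);
const pollutedAfterPartialBlock = ({}).polluted === true;
delete Object.prototype.polluted; // clean up before the second run

mergeBlockingAll({}, payload);
const pollutedAfterFullBlock = ({}).polluted === true;

console.log(pollutedAfterPartialBlock); // true
console.log(pollutedAfterFullBlock);    // false
```

In this sketch the payload never mentions `__proto__`: it walks `target.constructor.prototype`, which resolves to `Object.prototype`, so every object in the process picks up the `polluted` property unless `constructor` and `prototype` are blocked as well.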
Complete Precision / Recall / F1 / VD-S scores from the SecJS paper (Table RQ1). "Full" denotes the complete split; "DN" denotes the denoised split.
| LLM | Metric | Original (Full) | Original (DN) | Noise (Full) | Noise (DN) | Obfuscated (Full) | Obfuscated (DN) | Noise+Obf (Full) | Noise+Obf (DN) | Prompt Injection (Full) | Prompt Injection (DN) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | Precision | 37.3 | 37.8 | 20.0 | 20.2 | 33.3 | 33.0 | 23.6 | 23.6 | 32.0 | 32.4 |
| GPT-5 | Recall | 28.1 | 27.5 | 15.0 | 14.7 | 17.4 | 16.6 | 14.3 | 13.9 | 21.5 | 20.8 |
| GPT-5 | F1 | 32.1 | 31.8 | 17.2 | 17.0 | 22.9 | 22.1 | 17.8 | 17.5 | 25.7 | 25.3 |
| GPT-5 | VD-S | 57.2 | 58.1 | 61.9 | 62.5 | 77.2 | 77.8 | 76.6 | 77.2 | 78.5 | 79.2 |
| GPT-5-Mini | Precision | 34.1 | 34.1 | 13.8 | 13.3 | 34.7 | 35.7 | 12.0 | 11.9 | 26.9 | 26.6 |
| GPT-5-Mini | Recall | 26.4 | 25.7 | 12.5 | 11.7 | 20.2 | 20.2 | 10.1 | 10.0 | 24.1 | 23.2 |
| GPT-5-Mini | F1 | 29.8 | 29.3 | 13.1 | 12.5 | 25.6 | 25.8 | 10.9 | 10.9 | 25.4 | 24.8 |
| GPT-5-Mini | VD-S | 73.6 | 74.3 | 56.2 | 57.0 | 79.8 | 79.8 | 59.4 | 60.0 | 75.9 | 76.8 |
| GPT-5-Codex | Precision | 43.0 | 43.7 | 18.8 | 18.0 | 43.4 | 44.8 | 12.4 | 11.8 | 38.1 | 39.2 |
| GPT-5-Codex | Recall | 29.1 | 28.9 | 12.8 | 12.0 | 19.3 | 19.3 | 8.1 | 7.7 | 21.2 | 20.9 |
| GPT-5-Codex | F1 | 34.7 | 34.8 | 15.2 | 14.4 | 26.7 | 27.0 | 9.8 | 9.3 | 27.2 | 27.2 |
| GPT-5-Codex | VD-S | 70.9 | 71.1 | 62.7 | 63.5 | 80.7 | 80.7 | 59.0 | 59.8 | 78.8 | 79.1 |
| DeepSeek-v3.1 | Precision | 31.0 | 30.6 | 6.3 | 5.9 | 32.7 | 32.4 | 4.7 | 4.6 | 27.9 | 28.2 |
| DeepSeek-v3.1 | Recall | 22.8 | 21.6 | 5.4 | 5.0 | 17.0 | 16.9 | 3.8 | 3.6 | 27.2 | 27.1 |
| DeepSeek-v3.1 | F1 | 26.3 | 25.3 | 5.8 | 5.4 | 22.4 | 22.2 | 4.2 | 4.0 | 27.5 | 27.7 |
| DeepSeek-v3.1 | VD-S | 77.0 | 78.4 | 64.8 | 65.5 | 83.0 | 83.1 | 50.0 | 50.8 | 72.7 | 72.9 |
| Gemini-2.5-Pro | Precision | 34.9 | 35.2 | 19.7 | 20.0 | 34.1 | 34.1 | 16.2 | 15.6 | 33.7 | 34.0 |
| Gemini-2.5-Pro | Recall | 38.4 | 38.2 | 20.4 | 20.3 | 32.9 | 33.2 | 17.8 | 17.2 | 36.2 | 36.0 |
| Gemini-2.5-Pro | F1 | 36.6 | 36.6 | 20.0 | 20.1 | 33.5 | 33.7 | 17.0 | 16.4 | 34.9 | 35.0 |
| Gemini-2.5-Pro | VD-S | 61.6 | 61.8 | 51.0 | 51.2 | 67.1 | 66.8 | 39.5 | 40.3 | 63.8 | 64.0 |
| Gemini-Flash | Precision | 28.8 | 28.7 | 16.9 | 16.5 | 27.2 | 27.0 | 14.4 | 12.8 | 28.8 | 29.0 |
| Gemini-Flash | Recall | 31.4 | 31.2 | 17.7 | 17.4 | 25.0 | 25.0 | 12.0 | 10.8 | 31.8 | 31.5 |
| Gemini-Flash | F1 | 30.1 | 29.9 | 17.3 | 17.0 | 26.1 | 25.9 | 13.1 | 11.7 | 30.2 | 30.2 |
| Gemini-Flash | VD-S | 68.6 | 68.8 | 54.2 | 54.5 | 75.0 | 75.0 | 63.4 | 64.7 | 68.2 | 68.5 |
| Claude-4.5 | Precision | 37.2 | 37.7 | 4.4 | 4.2 | 17.8 | 16.9 | 4.0 | 3.9 | 28.8 | 29.1 |
| Claude-4.5 | Recall | 34.8 | 34.3 | 4.0 | 3.8 | 16.5 | 15.5 | 3.8 | 3.6 | 25.5 | 24.8 |
| Claude-4.5 | F1 | 35.9 | 35.9 | 4.2 | 4.0 | 17.2 | 16.2 | 3.9 | 3.7 | 27.1 | 26.8 |
| Claude-4.5 | VD-S | 65.2 | 65.7 | 58.6 | 59.0 | 83.5 | 84.5 | 48.7 | 49.2 | 74.5 | 75.2 |
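
The F1 rows in the table can be sanity-checked from the Precision and Recall rows, since F1 is the harmonic mean of the two. The sketch below uses a hypothetical helper, `f1Score`, and assumes the tabulated values are percentages; because the published precision and recall are already rounded, a recomputed F1 may differ from the table by about 0.1 in some cells.

```javascript
// Hypothetical helper: harmonic mean of precision and recall (same units in and out).
function f1Score(precision, recall) {
  if (precision + recall === 0) return 0; // avoid division by zero
  return (2 * precision * recall) / (precision + recall);
}

// GPT-5, Original variant, Full split: Precision 37.3, Recall 28.1.
const gpt5OriginalFull = f1Score(37.3, 28.1);
console.log(gpt5OriginalFull.toFixed(1)); // "32.1"

// Gemini-2.5-Pro, Original variant, Full split: Precision 34.9, Recall 38.4.
const geminiProOriginalFull = f1Score(34.9, 38.4);
console.log(geminiProOriginalFull.toFixed(1)); // "36.6"
```

Both recomputed values match the corresponding F1 cells in the table.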