SecJS Benchmark

Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation

A. Metadata

B. Demo

Due to the large dataset size (over 1 TB), we can only display one case per CWE type here, showing the vulnerable code files with both vulnerable and fixed versions. Select a CWE to view its five variants (Original, Noise, Obfuscated, Noise+Obf, Prompt Injection); the complete projects are available via the GitHub links in the CWE descriptions below.

What the five variants mean

  • Original: Raw GitHub projects with paired vulnerable and fixed commits.
  • Noise: Adds harmless file/DB/DOM statements so we can see whether models overreact to unrelated APIs.
  • Obfuscated: Runs javascript-obfuscator to rename identifiers and encode literals while keeping semantics.
  • Noise + Obf: Applies the noise injections first and then obfuscates the entire file for a tougher stress test.
  • Prompt Injection: Inserts misleading audit comments (for example "Security audit passed") to test textual prompt attacks.

                
            

C. Motivating Example

CVE-2021-25941 Case Study (deep-override)

Prototype pollution in src/index.js::override() lets attacker-controlled keys overwrite Object.prototype, which can break or hijack Node.js apps that use deep-override while merging configs. The vulnerable release blocks only __proto__; the fixed release blocks ['__proto__','constructor','prototype'] and keeps the merge logic. Repo: ASaiAnudeep/deep-override.

How this case instantiates the three SecJS principles (nine checkpoints)

Principle I: Comprehensiveness
  • Principle I-1 · CWE coverage: SecJS-Gen pulls CVEs from NVD and Mend.io so the dataset covers older CWEs (SQLi-89, XSS-79, command injection-78, auth bypass-287, hard-coded creds-798) plus newer CWEs (prototype pollution-1321, ReDoS-1333, XXE-611).
  • Principle I-2 · Result and reasoning: JudgeJS checks both project-level ⟨has_vulnerability, CWE⟩ pairs and function-level ⟨has_vulnerability, CWE, file, function⟩ pairs, so the model must point to `src/index.js::override()` instead of giving a one-line verdict.
  • Principle I-3 · Real GitHub projects: Every sample keeps the full repo (15 files here) plus a frontend/backend/full-stack tag, so the Node.js dependency path (express-session → body-parser → deep-override) stays in context.
Principle II: No Underestimation
  • Principle II-1 · CWE equivalence: MITRE CAPEC groups (for example XSS ⇒ CWE-79/80/83, prototype pollution ⇒ CWE-1321/915) mean predictions inside the same group are counted as correct.
  • Principle II-2 · Denoised controls: Samples carry ONEFUNC / NVDCHECK / SUSPICIOUS flags so experiments can focus on the high-confidence part (ONEFUNC+NVDCHECK) and skip noisy commits.
  • Principle II-3 · Strong prompting: The claude-code-security-review prompt guides the model through three steps: list known safe libraries, compare code with those patterns, then trace the taint path from source to sink.
Principle III: No Overestimation
  • Principle III-1 · Full-project inputs: JudgeJS always feeds the entire repo (package files, helpers, tests), so the model has to find the vulnerable `override` function itself.
  • Principle III-2 · Vulnerable/fixed pairs: Every CVE provides both versions; for this case the fixed build blocks `['__proto__','constructor','prototype']`, exposing detectors that flag strings instead of reasoning.
  • Principle III-3 · Four augmentation tracks: Noise (safe sinks), Obfuscation (renamed identifiers), Noise+Obf, and Prompt Injection (misleading comments) check whether the model still works after simple transformations.

D. LLM Evaluation Results

Complete Precision / Recall / F1 / VD-S scores from the SecJS paper (Table RQ1). "Full" denotes the complete split; "DN" denotes the denoised split. Use the controls below to filter and explore the results.

LLM Metric Original Noise Obfuscated Noise+Obf Prompt Injection
FullDN FullDN FullDN FullDN FullDN
GPT-5Precision37.337.820.020.233.333.023.623.632.032.4
Recall28.127.515.014.717.416.614.313.921.520.8
F132.131.817.217.022.922.117.817.525.725.3
VD-S57.258.161.962.577.277.876.677.278.579.2
GPT-5-MiniPrecision34.134.113.813.334.735.712.011.926.926.6
Recall26.425.712.511.720.220.210.110.024.123.2
F129.829.313.112.525.625.810.910.925.424.8
VD-S73.674.356.257.079.879.859.460.075.976.8
GPT-5-CodexPrecision43.043.718.818.043.444.812.411.838.139.2
Recall29.128.912.812.019.319.38.17.721.220.9
F134.734.815.214.426.727.09.89.327.227.2
VD-S70.971.162.763.580.780.759.059.878.879.1
DeepSeek-v3.1Precision31.030.66.35.932.732.44.74.627.928.2
Recall22.821.65.45.017.016.93.83.627.227.1
F126.325.35.85.422.422.24.24.027.527.7
VD-S77.078.464.865.583.083.150.050.872.772.9
Gemini-2.5-ProPrecision34.935.219.720.034.134.116.215.633.734.0
Recall38.438.220.420.332.933.217.817.236.236.0
F136.636.620.020.133.533.717.016.434.935.0
VD-S61.661.851.051.267.166.839.540.363.864.0
Gemini-FlashPrecision28.828.716.916.527.227.014.412.828.829.0
Recall31.431.217.717.425.025.012.010.831.831.5
F130.129.917.317.026.125.913.111.730.230.2
VD-S68.668.854.254.575.075.063.464.768.268.5
Claude-4.5Precision37.237.74.44.217.816.94.03.928.829.1
Recall34.834.34.03.816.515.53.83.625.524.8
F135.935.94.24.017.216.23.93.727.126.8
VD-S65.265.758.659.083.584.548.749.274.575.2