SecJS – JavaScript Vulnerability Benchmark

A. Metadata

1,852

Original GitHub projects

Each sample bundles a vulnerable and fixed commit taken from CVE-linked GitHub projects (2018–2024).

7,408

Instances after augmentation

Noise, obfuscation, noise+obf, and prompt-injection variants grow the corpus while keeping behavior unchanged.

52,137

Projects evaluated (RQ3)

Across all five variants, JudgeJS runs both versions of every project (104,274 executions total).

317.2M

Total tokens consumed

Average 6,085 tokens per project (5,491 input + 594 output), adding up to 286.3M input and 31.0M output tokens.

B. Demo

Due to the large dataset size (over 1 TB), we can only display one case per CWE type here, showing the vulnerable code files with both vulnerable and fixed versions. Select a CWE to view its five variants (Original, Noise, Obfuscated, Noise+Obf, Prompt Injection); the complete projects are available via the GitHub links in the CWE descriptions below.

CWE Type

What the five variants mean

Original: Raw GitHub projects with paired vulnerable and fixed commits.
Noise: Adds harmless file/DB/DOM statements so we can see whether models overreact to unrelated APIs.
Obfuscated: Runs javascript-obfuscator to rename identifiers and encode literals while keeping semantics.
Noise + Obf: Applies the noise injections first and then obfuscates the entire file for a tougher stress test.
Prompt Injection: Inserts misleading audit comments (for example "Security audit passed") to test textual prompt attacks.

C. Motivating Example

CVE-2021-25941 Case Study (deep-override)

Prototype pollution in src/index.js::override() lets attacker-controlled keys overwrite Object.prototype, which can break or hijack Node.js apps that use deep-override while merging configs. The vulnerable release blocks only __proto__; the fixed release blocks ['__proto__','constructor','prototype'] and keeps the merge logic. Repo: ASaiAnudeep/deep-override.

How this case instantiates the three SecJS principles (nine checkpoints)

Principle I: Comprehensiveness

Principle I-1 · CWE coverage: SecJS-Gen pulls CVEs from NVD and Mend.io so the dataset covers older CWEs (SQLi-89, XSS-79, command injection-78, auth bypass-287, hard-coded creds-798) plus newer CWEs (prototype pollution-1321, ReDoS-1333, XXE-611).
Principle I-2 · Result and reasoning: JudgeJS checks both project-level ⟨has_vulnerability, CWE⟩ pairs and function-level ⟨has_vulnerability, CWE, file, function⟩ pairs, so the model must point to `src/index.js::override()` instead of giving a one-line verdict.
Principle I-3 · Real GitHub projects: Every sample keeps the full repo (15 files here) plus a frontend/backend/full-stack tag, so the Node.js dependency path (express-session → body-parser → deep-override) stays in context.

Principle II: No Underestimation

Principle II-1 · CWE equivalence: MITRE CAPEC groups (for example XSS ⇒ CWE-79/80/83, prototype pollution ⇒ CWE-1321/915) mean predictions inside the same group are counted as correct.
Principle II-2 · Denoised controls: Samples carry ONEFUNC / NVDCHECK / SUSPICIOUS flags so experiments can focus on the high-confidence part (ONEFUNC+NVDCHECK) and skip noisy commits.
Principle II-3 · Strong prompting: The claude-code-security-review prompt guides the model through three steps: list known safe libraries, compare code with those patterns, then trace the taint path from source to sink.

Principle III: No Overestimation

Principle III-1 · Full-project inputs: JudgeJS always feeds the entire repo (package files, helpers, tests), so the model has to find the vulnerable `override` function itself.
Principle III-2 · Vulnerable/fixed pairs: Every CVE provides both versions; for this case the fixed build blocks `['__proto__','constructor','prototype']`, exposing detectors that flag strings instead of reasoning.
Principle III-3 · Four augmentation tracks: Noise (safe sinks), Obfuscation (renamed identifiers), Noise+Obf, and Prompt Injection (misleading comments) check whether the model still works after simple transformations.

D. LLM Evaluation Results

Complete Precision / Recall / F1 / VD-S scores from the SecJS paper (Table RQ1). "Full" denotes the complete split; "DN" denotes the denoised split. Use the controls below to filter and explore the results.

Select Metric

Dataset Split

Filter Models

GPT-5 GPT-5-Mini GPT-5-Codex DeepSeek Gemini-2.5 Gemini-Flash Claude-4.5

LLM	Metric	Original		Noise		Obfuscated		Noise+Obf		Prompt Injection
LLM	Metric	Full	DN	Full	DN	Full	DN	Full	DN	Full	DN
GPT-5	Precision	37.3	37.8	20.0	20.2	33.3	33.0	23.6	23.6	32.0	32.4
	Recall	28.1	27.5	15.0	14.7	17.4	16.6	14.3	13.9	21.5	20.8
	F1	32.1	31.8	17.2	17.0	22.9	22.1	17.8	17.5	25.7	25.3
	VD-S	57.2	58.1	61.9	62.5	77.2	77.8	76.6	77.2	78.5	79.2
GPT-5-Mini	Precision	34.1	34.1	13.8	13.3	34.7	35.7	12.0	11.9	26.9	26.6
	Recall	26.4	25.7	12.5	11.7	20.2	20.2	10.1	10.0	24.1	23.2
	F1	29.8	29.3	13.1	12.5	25.6	25.8	10.9	10.9	25.4	24.8
	VD-S	73.6	74.3	56.2	57.0	79.8	79.8	59.4	60.0	75.9	76.8
GPT-5-Codex	Precision	43.0	43.7	18.8	18.0	43.4	44.8	12.4	11.8	38.1	39.2
	Recall	29.1	28.9	12.8	12.0	19.3	19.3	8.1	7.7	21.2	20.9
	F1	34.7	34.8	15.2	14.4	26.7	27.0	9.8	9.3	27.2	27.2
	VD-S	70.9	71.1	62.7	63.5	80.7	80.7	59.0	59.8	78.8	79.1
DeepSeek-v3.1	Precision	31.0	30.6	6.3	5.9	32.7	32.4	4.7	4.6	27.9	28.2
	Recall	22.8	21.6	5.4	5.0	17.0	16.9	3.8	3.6	27.2	27.1
	F1	26.3	25.3	5.8	5.4	22.4	22.2	4.2	4.0	27.5	27.7
	VD-S	77.0	78.4	64.8	65.5	83.0	83.1	50.0	50.8	72.7	72.9
Gemini-2.5-Pro	Precision	34.9	35.2	19.7	20.0	34.1	34.1	16.2	15.6	33.7	34.0
	Recall	38.4	38.2	20.4	20.3	32.9	33.2	17.8	17.2	36.2	36.0
	F1	36.6	36.6	20.0	20.1	33.5	33.7	17.0	16.4	34.9	35.0
	VD-S	61.6	61.8	51.0	51.2	67.1	66.8	39.5	40.3	63.8	64.0
Gemini-Flash	Precision	28.8	28.7	16.9	16.5	27.2	27.0	14.4	12.8	28.8	29.0
	Recall	31.4	31.2	17.7	17.4	25.0	25.0	12.0	10.8	31.8	31.5
	F1	30.1	29.9	17.3	17.0	26.1	25.9	13.1	11.7	30.2	30.2
	VD-S	68.6	68.8	54.2	54.5	75.0	75.0	63.4	64.7	68.2	68.5
Claude-4.5	Precision	37.2	37.7	4.4	4.2	17.8	16.9	4.0	3.9	28.8	29.1
	Recall	34.8	34.3	4.0	3.8	16.5	15.5	3.8	3.6	25.5	24.8
	F1	35.9	35.9	4.2	4.0	17.2	16.2	3.9	3.7	27.1	26.8
	VD-S	65.2	65.7	58.6	59.0	83.5	84.5	48.7	49.2	74.5	75.2

SecJS Benchmark