Open Source · Claude Code Skill · Domain-Agnostic · v1.0.3
Autonomous Iteration for Any Task
Inspired by Karpathy's autoresearch. Modify → Verify → Keep/Discard → Repeat. Let Claude iterate autonomously with mechanical verification and automatic rollback.
- ✓ One atomic change per iteration — if it breaks, you know exactly why
- ✓ Automatic git rollback on failures — no debates, no manual cleanup
- ✓ Works on any domain — code, ML, content, performance, refactoring
$ /autoresearch
    Goal: Increase test coverage to 95%
    Metric: npm test -- --coverage | grep "All files"
    Scope: src/**/*.ts
    Direction: higher_is_better
# Claude loops autonomously — modify, verify, keep/discard, repeat
→ Commands Reference
All available commands at a glance.
| Command | Description | Since |
|---|---|---|
| /autoresearch | Run the autonomous iteration loop (unlimited) | v1.0.0 |
| /loop N /autoresearch | Run exactly N iterations then stop | v1.0.1 |
| /autoresearch:plan | Interactive wizard: Goal → Scope, Metric, Verify config | v1.0.2 |
| /autoresearch:security | STRIDE + OWASP + red-team security audit | v1.0.3 |
| /autoresearch:security --diff | Delta mode — only audit changed files | v1.0.3 |
| /autoresearch:security --fix | Auto-fix confirmed Critical/High findings | v1.0.3 |
| /autoresearch:security --fail-on | CI/CD severity gate (critical \| high \| medium) | v1.0.3 |
| /loop N /autoresearch:security | Bounded security audit (N iterations) | v1.0.3 |
→ How It Works
Set the goal. Start the loop. Walk away.
Define Goal & Metric
Tell Claude what "better" means. Pick a mechanical metric — test coverage, build time, Lighthouse score, val_bpb — anything measurable.
Autonomous Loop
Claude makes one atomic change, commits, verifies the metric, and keeps or reverts. No human input needed between iterations.
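That cycle can be sketched in a few lines of shell. This is a toy reconstruction, not the skill's actual implementation: apply_change and measure_metric are hypothetical stand-ins for the agent's edit and your Metric command, and the demo runs in a throwaway git repo.

```shell
#!/usr/bin/env bash
# One modify → verify → keep/discard cycle (higher_is_better).
# apply_change and measure_metric are hypothetical stand-ins.
set -euo pipefail

apply_change()   { echo "$1" > metric.txt; }
measure_metric() { cat metric.txt; }

# scratch repo so the demo is self-contained
dir=$(mktemp -d); cd "$dir"
git init -q
git config user.email demo@example.com
git config user.name  demo
echo 80 > metric.txt
git add -A; git commit -qm baseline

best=$(measure_metric)                 # baseline metric

apply_change 85                        # one atomic change
git add -A; git commit -qm "experiment: candidate"
current=$(measure_metric)              # mechanical verification

if [ "$current" -gt "$best" ]; then
  echo "keep: $best -> $current"       # commit stays as memory
else
  git reset --hard -q HEAD~1           # automatic rollback
  echo "discard"
fi
```

Commit-before-verify is what makes the keep/discard decision a single `git reset` rather than a manual cleanup.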
Review Results
Every iteration is logged in a TSV file. Kept changes stay as git commits. You get a clean history of what worked and what didn't.
→ Features
Karpathy's autoresearch principles, generalized for any work — with planning wizard and security audit.
Security Audit (v1.0.3)
STRIDE threat model + OWASP Top 10 + 4 red-team personas. Generates structured reports with code evidence and prioritized mitigations.
Plan Wizard (v1.0.2)
Describe your goal in plain language. The wizard suggests metrics, validates your verify command with a dry-run, and outputs a ready-to-launch config.
Constraint-Driven Loop
One change per iteration. Commit before verify. Auto-revert on failure. No ambiguity in what caused what.
Mechanical Verification
No subjective "looks good." Every iteration runs a real metric — tests, benchmarks, scores, build output.
Automatic Rollback
Failed changes revert instantly via git reset. No manual cleanup, no debugging compound failures.
Git as Memory
Every kept change is a commit. The agent reads its own git history to learn what works and avoid past mistakes.
Results Logging
TSV log tracks every iteration — metric, delta, status, description. Pattern recognition across experiments.
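The exact log format isn't documented here, so the columns below are an assumption; the point is that a flat TSV makes post-hoc pattern mining a one-liner:

```shell
#!/usr/bin/env bash
# Hypothetical results.tsv — the real column layout may differ.
set -euo pipefail
log=$(mktemp)
printf 'iter\tmetric\tdelta\tstatus\tdescription\n'   >  "$log"
printf '1\t81.2\t+1.2\tkeep\tadd tests for parser\n'  >> "$log"
printf '2\t80.9\t-0.3\tdiscard\tinline helper\n'      >> "$log"
printf '3\t83.0\t+1.8\tkeep\tcover error paths\n'     >> "$log"

# Which changes actually moved the metric?
kept=$(awk -F'\t' '$4 == "keep" { n++ } END { print n }' "$log")
echo "kept changes: $kept"
```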
Domain-Agnostic
Works on backend code, ML training, frontend UI, content, performance — any task with a measurable outcome.
→ Security Audit v1.0.3
Autonomous STRIDE + OWASP + red-team security audit. Generates a full threat model, maps attack surfaces, then iteratively tests each vulnerability vector with code evidence.
$ /loop 10 /autoresearch:security
    Scope: src/api/**/*.ts, src/middleware/**/*.ts
    Focus: authentication and authorization flows
# Setup: scan codebase → assets → trust boundaries → STRIDE model → attack surface
# Loop: test vectors → validate with code evidence → log findings → repeat
# Output: security/260315-0945-stride-owasp-full-audit/overview.md
STRIDE Threat Model
Full Spoofing, Tampering, Repudiation, Info Disclosure, DoS, and Elevation of Privilege analysis per asset and trust boundary.
OWASP Top 10 (70+ Checks)
Systematic coverage across all 10 OWASP categories. Coverage matrix tracks tested vs untested. Aims for 100% coverage.
4 Red-Team Personas
Security Adversary, Supply Chain Attacker, Insider Threat, and Infrastructure Attacker. Each drives which vectors get tested.
Structured Report Folder
Each audit creates a timestamped folder with 7 files: overview, threat model, attack surface, findings, OWASP coverage, dependency audit, and recommendations.
Flags
- --diff: Only audit files changed since last audit
- --fix: Auto-fix confirmed Critical/High findings
- --fail-on: CI/CD gate, exit non-zero at severity threshold
# Full combo: delta audit + auto-fix + CI gate
/loop 15 /autoresearch:security --diff --fix --fail-on critical
→ Use Cases
Same loop, different domains. The principles are universal — the metrics are domain-specific.
Backend Code
- Metric: Tests pass + coverage %
- Scope: src/**/*.ts
- Verify: npm test
Frontend UI
- Metric: Lighthouse score
- Scope: src/components/**
- Verify: npx lighthouse
ML Training
- Metric: val_bpb / loss
- Scope: train.py
- Verify: uv run train.py
Performance
- Metric: Benchmark time (ms)
- Scope: Target files
- Verify: npm run bench
Refactoring
- Metric: Tests pass + LOC reduced
- Scope: Target module
- Verify: npm test && wc -l
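A verify step for this card might be a tiny script that emits one comparable number for the loop. Everything below (the src/ layout, the .ts files) is fabricated for the demo:

```shell
#!/usr/bin/env bash
# Sketch of a refactoring metric: total LOC across the target
# module, compared with lower_is_better. Sample files are fabricated.
set -euo pipefail
dir=$(mktemp -d); mkdir -p "$dir/src"
printf 'a\nb\nc\n' > "$dir/src/one.ts"
printf 'x\ny\n'    > "$dir/src/two.ts"
loc=$(find "$dir/src" -name '*.ts' -exec cat {} + | wc -l | tr -d ' ')
echo "total_loc=$loc"
```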
Blog / Content
- Metric: Word count + readability
- Scope: content/*.md
- Verify: Custom script
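One way that custom script could look, using average sentence length as a crude readability proxy; the sample text and the formula are illustrative only:

```shell
#!/usr/bin/env bash
# Hypothetical content metric: average sentence length
# (lower_is_better). The sample text is fabricated.
set -euo pipefail
f=$(mktemp)
printf 'Short sentences read well. Long winding sentences that never end tend to score badly.\n' > "$f"
words=$(wc -w < "$f" | tr -d ' ')
sents=$(grep -o '[.!?]' "$f" | wc -l | tr -d ' ')
echo "avg_sentence_len=$(( words / sents ))"
```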
Security Audit
- Metric: OWASP + STRIDE coverage
- Scope: src/api/**, src/middleware/**
- Verify: /autoresearch:security
→ Quick Start
Two commands to install. One command to run.
1. Install the Skill
git clone https://github.com/uditgoenka/autoresearch.git /tmp/autoresearch
cp -r /tmp/autoresearch/skills/autoresearch ~/.claude/skills/autoresearch
2. Plan Your Run (New in v1.0.2)
# Interactive wizard — builds Scope, Metric & Verify from your Goal:
/autoresearch:plan Goal: Make the API respond faster
# The wizard scans your codebase, suggests metrics,
# dry-runs the verify command, and outputs a ready-to-paste config.
3. Run Unlimited
# Inside any project directory:
/autoresearch
    Goal: Increase test coverage to 95%
    Metric: npm test -- --coverage | grep "All files"
    Scope: src/**/*.ts
    Direction: higher_is_better
4. Run Bounded (Optional)
# Run exactly 25 iterations then stop:
/loop 25 /autoresearch
    Goal: Reduce bundle size below 200KB
    Metric: npm run build | grep "Total size"
    Direction: lower_is_better
5. Security Audit (New in v1.0.3)
# STRIDE + OWASP + red-team security audit:
/loop 10 /autoresearch:security
# With flags: delta mode + auto-fix + CI gate:
/loop 15 /autoresearch:security --diff --fix --fail-on critical
→ Changelog
Release history and what shipped in each version.
Autonomous Security Audit (v1.0.3)
- /autoresearch:security — STRIDE threat model + OWASP Top 10 + red-team (4 adversarial personas)
- --diff flag: delta mode, only audit files changed since last audit
- --fix flag: auto-remediate confirmed Critical/High findings
- --fail-on flag: CI/CD severity gate for pipeline blocking
- Structured report folder with 7 dedicated markdown files per audit
- CI/CD GitHub Action template auto-generation
- Historical comparison across audit runs (new/fixed/recurring)
- Commands Reference table added to README
Plan Your Run Wizard (v1.0.2)
- /autoresearch:plan — interactive wizard converts Goal → Scope, Metric, Verify config
- Mandatory dry-run validation of the verify command before accepting
- Metric suggestion database by domain (code, performance, content, refactoring)
- Launch options: unlimited, bounded (/loop N), or copy-only
Controlled Iterations with /loop (v1.0.1)
- /loop N /autoresearch — run exactly N iterations then stop with summary
- Early completion when the goal is achieved before N iterations
- Smart exploitation when fewer than 3 iterations remain
- Final summary: baseline → current best, keeps/discards/crashes
Initial Release (v1.0.0)
- Core autoresearch loop: modify → verify → keep/discard → repeat
- 7 core principles from Karpathy's autoresearch, generalized
- Mechanical verification, automatic rollback, git as memory
- TSV results logging with pattern recognition
- Domain-agnostic: code, ML, content, performance, refactoring
→ FAQ
Common questions about Autoresearch.