Open Source · Claude Code Skill · Domain-Agnostic · v1.0.3
Autonomous Iteration for Any Task
Inspired by Karpathy's autoresearch. Modify → Verify → Keep/Discard → Repeat. Let Claude iterate autonomously with mechanical verification and automatic rollback.
- ✓ One atomic change per iteration — if it breaks, you know exactly why
- ✓ Automatic git rollback on failures — no debates, no manual cleanup
- ✓ Works on any domain — code, ML, content, performance, refactoring
$ /autoresearch
    Goal: Increase test coverage to 95%
    Metric: npm test -- --coverage | grep "All files"
    Scope: src/**/*.ts
    Direction: higher_is_better
# Claude loops autonomously — modify, verify, keep/discard, repeat
→ Commands Reference
All available commands at a glance.
| Command | Description | Since |
|---|---|---|
| /autoresearch | Run the autonomous iteration loop (unlimited) | v1.0.0 |
| /loop N /autoresearch | Run exactly N iterations then stop | v1.0.1 |
| /autoresearch:plan | Interactive wizard: Goal → Scope, Metric, Verify config | v1.0.2 |
| /autoresearch:security | STRIDE + OWASP + red-team security audit | v1.0.3 |
| /autoresearch:security --diff | Delta mode — only audit changed files | v1.0.3 |
| /autoresearch:security --fix | Auto-fix confirmed Critical/High findings | v1.0.3 |
| /autoresearch:security --fail-on | CI/CD severity gate (critical \| high \| medium) | v1.0.3 |
| /loop N /autoresearch:security | Bounded security audit (N iterations) | v1.0.3 |
→ How It Works
Set the goal. Start the loop. Walk away.
Define Goal & Metric
Tell Claude what "better" means. Pick a mechanical metric — test coverage, build time, Lighthouse score, val_bpb — anything measurable.
Autonomous Loop
Claude makes one atomic change, commits, verifies the metric, and keeps or reverts. No human input needed between iterations.
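That cycle can be sketched in a few lines of shell. This is a toy reconstruction, not the skill's actual implementation: apply_change and measure_metric are hypothetical stand-ins for the agent's edit and your Metric command, and the demo runs in a throwaway git repo.

```shell
#!/usr/bin/env bash
# One modify → verify → keep/discard cycle (higher_is_better).
# apply_change and measure_metric are hypothetical stand-ins.
set -euo pipefail

apply_change()   { echo "$1" > metric.txt; }
measure_metric() { cat metric.txt; }

# scratch repo so the demo is self-contained
dir=$(mktemp -d); cd "$dir"
git init -q
git config user.email demo@example.com
git config user.name  demo
echo 80 > metric.txt
git add -A; git commit -qm baseline

best=$(measure_metric)                 # baseline metric

apply_change 85                        # one atomic change
git add -A; git commit -qm "experiment: candidate"
current=$(measure_metric)              # mechanical verification

if [ "$current" -gt "$best" ]; then
  echo "keep: $best -> $current"       # commit stays as memory
else
  git reset --hard -q HEAD~1           # automatic rollback
  echo "discard"
fi
```

Commit-before-verify is what makes the keep/discard decision a single `git reset` rather than a manual cleanup.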
Review Results
Every iteration is logged in a TSV file. Kept changes stay as git commits. You get a clean history of what worked and what didn't.
→ Features
Karpathy's autoresearch principles, generalized for any work — with planning wizard and security audit.
Security Audit (v1.0.3)
STRIDE threat model + OWASP Top 10 + 4 red-team personas. Generates structured reports with code evidence and prioritized mitigations.
Plan Wizard (v1.0.2)
Describe your goal in plain language. The wizard suggests metrics, validates your verify command with a dry-run, and outputs a ready-to-launch config.
Constraint-Driven Loop
One change per iteration. Commit before verify. Auto-revert on failure. No ambiguity in what caused what.
Mechanical Verification
No subjective "looks good." Every iteration runs a real metric — tests, benchmarks, scores, build output.
Automatic Rollback
Failed changes revert instantly via git reset. No manual cleanup, no debugging compound failures.
Git as Memory
Every kept change is a commit. The agent reads its own git history to learn what works and avoid past mistakes.
Results Logging
TSV log tracks every iteration — metric, delta, status, description. Pattern recognition across experiments.
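The exact log format isn't documented here, so the columns below are an assumption; the point is that a flat TSV makes post-hoc pattern mining a one-liner:

```shell
#!/usr/bin/env bash
# Hypothetical results.tsv — the real column layout may differ.
set -euo pipefail
log=$(mktemp)
printf 'iter\tmetric\tdelta\tstatus\tdescription\n'   >  "$log"
printf '1\t81.2\t+1.2\tkeep\tadd tests for parser\n'  >> "$log"
printf '2\t80.9\t-0.3\tdiscard\tinline helper\n'      >> "$log"
printf '3\t83.0\t+1.8\tkeep\tcover error paths\n'     >> "$log"

# Which changes actually moved the metric?
kept=$(awk -F'\t' '$4 == "keep" { n++ } END { print n }' "$log")
echo "kept changes: $kept"
```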
Domain-Agnostic
Works on backend code, ML training, frontend UI, content, performance — any task with a measurable outcome.
→ Security Audit v1.0.3
Autonomous STRIDE + OWASP + red-team security audit. Generates a full threat model, maps attack surfaces, then iteratively tests each vulnerability vector with code evidence.
$ /loop 10 /autoresearch:security
    Scope: src/api/**/*.ts, src/middleware/**/*.ts
    Focus: authentication and authorization flows
# Setup: scan codebase → assets → trust boundaries → STRIDE model → attack surface
# Loop: test vectors → validate with code evidence → log findings → repeat
# Output: security/260315-0945-stride-owasp-full-audit/overview.md
STRIDE Threat Model
Full Spoofing, Tampering, Repudiation, Info Disclosure, DoS, and Elevation of Privilege analysis per asset and trust boundary.
OWASP Top 10 (70+ Checks)
Systematic coverage across all 10 OWASP categories. Coverage matrix tracks tested vs untested. Aims for 100% coverage.
4 Red-Team Personas
Security Adversary, Supply Chain Attacker, Insider Threat, and Infrastructure Attacker. Each drives which vectors get tested.
Structured Report Folder
Each audit creates a timestamped folder with 7 files: overview, threat model, attack surface, findings, OWASP coverage, dependency audit, and recommendations.
Flags
- --diff: Only audit files changed since last audit
- --fix: Auto-fix confirmed Critical/High findings
- --fail-on: CI/CD gate, exit non-zero at severity threshold
# Full combo: delta audit + auto-fix + CI gate
/loop 15 /autoresearch:security --diff --fix --fail-on critical
→ Use Cases
Same loop, different domains. The principles are universal — the metrics are domain-specific.
Backend Code
- Metric: Tests pass + coverage %
- Scope: src/**/*.ts
- Verify: npm test
Frontend UI
- Metric: Lighthouse score
- Scope: src/components/**
- Verify: npx lighthouse
ML Training
- Metric: val_bpb / loss
- Scope: train.py
- Verify: uv run train.py
Performance
- Metric: Benchmark time (ms)
- Scope: Target files
- Verify: npm run bench
Refactoring
- Metric: Tests pass + LOC reduced
- Scope: Target module
- Verify: npm test && wc -l
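A verify step for this card might be a tiny script that emits one comparable number for the loop. Everything below (the src/ layout, the .ts files) is fabricated for the demo:

```shell
#!/usr/bin/env bash
# Sketch of a refactoring metric: total LOC across the target
# module, compared with lower_is_better. Sample files are fabricated.
set -euo pipefail
dir=$(mktemp -d); mkdir -p "$dir/src"
printf 'a\nb\nc\n' > "$dir/src/one.ts"
printf 'x\ny\n'    > "$dir/src/two.ts"
loc=$(find "$dir/src" -name '*.ts' -exec cat {} + | wc -l | tr -d ' ')
echo "total_loc=$loc"
```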
Blog / Content
- Metric: Word count + readability
- Scope: content/*.md
- Verify: Custom script
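One way that custom script could look, using average sentence length as a crude readability proxy; the sample text and the formula are illustrative only:

```shell
#!/usr/bin/env bash
# Hypothetical content metric: average sentence length
# (lower_is_better). The sample text is fabricated.
set -euo pipefail
f=$(mktemp)
printf 'Short sentences read well. Long winding sentences that never end tend to score badly.\n' > "$f"
words=$(wc -w < "$f" | tr -d ' ')
sents=$(grep -o '[.!?]' "$f" | wc -l | tr -d ' ')
echo "avg_sentence_len=$(( words / sents ))"
```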
Security Audit
- Metric: OWASP + STRIDE coverage
- Scope: src/api/**, src/middleware/**
- Verify: /autoresearch:security
→ Quick Start
Two commands to install. One command to run.
1. Install the Skill
git clone https://github.com/uditgoenka/autoresearch.git /tmp/autoresearch
cp -r /tmp/autoresearch/skills/autoresearch ~/.claude/skills/autoresearch
2. Plan Your Run (New in v1.0.2)
# Interactive wizard — builds Scope, Metric & Verify from your Goal:
/autoresearch:plan Goal: Make the API respond faster
# The wizard scans your codebase, suggests metrics,
# dry-runs the verify command, and outputs a ready-to-paste config.
3. Run Unlimited
# Inside any project directory:
/autoresearch
    Goal: Increase test coverage to 95%
    Metric: npm test -- --coverage | grep "All files"
    Scope: src/**/*.ts
    Direction: higher_is_better
4. Run Bounded (Optional)
# Run exactly 25 iterations then stop:
/loop 25 /autoresearch
    Goal: Reduce bundle size below 200KB
    Metric: npm run build | grep "Total size"
    Direction: lower_is_better
5. Security Audit (New in v1.0.3)
# STRIDE + OWASP + red-team security audit:
/loop 10 /autoresearch:security
# With flags: delta mode + auto-fix + CI gate:
/loop 15 /autoresearch:security --diff --fix --fail-on critical
→ Changelog
Release history and what shipped in each version.
Autonomous Security Audit (v1.0.3)
- /autoresearch:security — STRIDE threat model + OWASP Top 10 + red-team (4 adversarial personas)
- --diff flag: delta mode, only audit files changed since last audit
- --fix flag: auto-remediate confirmed Critical/High findings
- --fail-on flag: CI/CD severity gate for pipeline blocking
- Structured report folder with 7 dedicated markdown files per audit
- CI/CD GitHub Action template auto-generation
- Historical comparison across audit runs (new/fixed/recurring)
- Commands Reference table added to README
Plan Your Run Wizard (v1.0.2)
- /autoresearch:plan — interactive wizard converts Goal → Scope, Metric, Verify config
- Mandatory dry-run validation of the verify command before accepting
- Metric suggestion database by domain (code, performance, content, refactoring)
- Launch options: unlimited, bounded (/loop N), or copy-only
Controlled Iterations with /loop (v1.0.1)
- /loop N /autoresearch — run exactly N iterations then stop with summary
- Early completion when the goal is achieved before N iterations
- Smart exploitation when fewer than 3 iterations remain
- Final summary: baseline → current best, keeps/discards/crashes
Initial Release (v1.0.0)
- Core autoresearch loop: modify → verify → keep/discard → repeat
- 7 core principles from Karpathy's autoresearch, generalized
- Mechanical verification, automatic rollback, git as memory
- TSV results logging with pattern recognition
- Domain-agnostic: code, ML, content, performance, refactoring
→ FAQ
Common questions about Autoresearch.