Skill Conductor: the full skill lifecycle for Claude Code

10.03.2026

Claude Code uses skills, but who creates them? Skill Conductor turns Claude Code into a self-improving system: an agent that writes skills for other agents.

META: an agent that teaches agents

Skill Conductor is a meta-skill. Instead of performing user tasks, it creates, tests, evaluates, and packages skills for other agents.

Full lifecycle: draft → test → review → improve → repeat

6 operating modes:

  1. CREATE — a new skill from scratch
  2. IMPROVE — fix an existing skill
  3. VALIDATE — test quality and triggering
  4. REVIEW — quality gate for third-party skills
  5. OPTIMIZE — automated description optimization
  6. PACKAGE — packaging for distribution

Mode 1: CREATE — TDD for skills

A Test-Driven Development approach: fail first, then fix.

Step 1: Capture Intent

What specific task should this skill handle?
What would a user say to trigger it?
What should NOT trigger it?

Example:

  • Task: "Extract data from design files for developer handoff"
  • Trigger: "analyze this Figma file", "need design specs"
  • NOT trigger: Sketch files, Adobe XD, general design feedback

Step 2: Baseline (TDD RED)

Verify agent fails WITHOUT the skill:

  1. Take one scenario
  2. Run in clean session
  3. Document failure

If the agent already handles it, the skill is unnecessary.

Step 3: Architecture

Pick a pattern from references/patterns.md:

| Pattern | Use when |
| --- | --- |
| Sequential workflow | clear step-by-step process |
| Iterative refinement | output improves with cycles |
| Context-aware selection | same goal, different tools by context |
| Domain intelligence | specialized knowledge beyond tool access |
| Multi-MCP coordination | workflow spans multiple services |

Step 4: Scaffold

uv run scripts/init_skill.py <skill-name> --path <output-dir> [--resources scripts,references,assets]

Structure:

skill-name/
├── SKILL.md          # required — the brain
├── scripts/          # deterministic operations (executed, not loaded)  
├── references/       # detailed docs (loaded on demand)
└── assets/           # templates, images (never loaded)

SKILL.md: the skill's brain

The frontmatter is critically important:

---
name: kebab-case-name
description: >
  [Purpose in one sentence]. Use when [triggers].
  Do NOT use for [negative triggers].
---

GOLDEN RULE: the description determines triggering. Without the right description the skill will never fire.

Good description:

# ✅ Purpose + triggers, no process
description: Analyze Figma design files for developer handoff. Use when user uploads .fig files or asks for "design specs". Do NOT use for Sketch or Adobe XD.

Bad description:

# ❌ Process in description (agent skips body)
description: Exports Figma assets, generates specs, creates Linear tasks, posts to Slack.
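As a sanity check, the triggering rules above can be linted mechanically. A minimal sketch (my own heuristics, not part of Skill Conductor):

```python
# Illustrative lint of a skill description against the rules above.
# These heuristics are a sketch, not Skill Conductor's actual checks.
def lint_description(desc: str) -> list[str]:
    problems = []
    if len(desc) > 1024:  # hard limit from the packaging checklist
        problems.append("exceeds 1024 chars")
    if "use when" not in desc.lower():
        problems.append("no positive trigger ('Use when ...')")
    if "do not use" not in desc.lower():
        problems.append("no negative trigger ('Do NOT use for ...')")
    return problems

good = ("Analyze Figma design files for developer handoff. "
        "Use when user uploads .fig files. Do NOT use for Sketch.")
bad = "Exports Figma assets, generates specs, creates Linear tasks."

print(lint_description(good))  # []
print(lint_description(bad))   # both trigger clauses missing
```

The bad description fails both trigger checks: it describes process, not purpose and triggers.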

Body structure:

# Skill Name

## Overview
What this enables. 1-2 sentences.

## [Main sections]
Step-by-step with numbered sequences.
Concrete templates over prose.
Imperative voice throughout.

## Common Mistakes
What goes wrong + how to fix.

Writing rules:

  • **One term per concept** — "template", not template/boilerplate/scaffold
  • **Progressive disclosure** — SKILL.md is the brain (<500 lines), references hold the details
  • **Token budget** — frequently loaded: <200 words; standard: <500 lines
  • **Imperative voice** — "Extract the data", not "You should extract"

Eval Loop: test-driven improvement

Step 6: Test Cases

Create evals/evals.json:

  1. 3–5 eval prompts covering core use cases
  2. Define expectations (verifiable statements)
  3. Start without assertions — observe first, then write
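A hypothetical evals.json along these lines (the actual schema used by Skill Conductor may differ; the point is that each case pairs a prompt with verifiable expectations):

```python
import json

# Hypothetical evals.json layout — illustrative, not the real schema.
evals = {
    "skill": "security-audit-analyzer",
    "cases": [
        {
            "prompt": "Analyze this security report and list critical findings",
            "expectations": [
                "Each finding carries a severity classification",
                "Output includes an executive summary section",
            ],
        },
        # Start without assertions: observe first, then write them down.
        {"prompt": "Summarize the attached audit PDF", "expectations": []},
    ],
}

with open("evals.json", "w") as f:
    json.dump(evals, f, indent=2)
```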

Eval execution:

  1. Spawn executor subagent with skill active
  2. Spawn baseline run (same prompt, no skill)
  3. Grade outputs using agents/grader.md
  4. Launch eval viewer: uv run eval-viewer/generate_review.py <workspace>

The viewer shows:

  • Side-by-side comparison (with skill vs without)
  • Timing data, token usage
  • Pass/fail assertions
  • Quality scores

Mode 2: IMPROVE — problem diagnosis

Problem classification:

| Problem | Signal | Fix |
| --- | --- | --- |
| Undertriggering | skill doesn't load | add keywords to description |
| Overtriggering | loads for unrelated queries | add negative triggers |
| Skips body | follows description only | remove process from description |
| Inconsistent output | varies across sessions | add explicit templates, reduce freedom |
| Too slow | large context | move detail to references/ |

Improvement mindset:

  • **Generalize from feedback** — don't overfit to test cases
  • **Keep the prompt lean** — remove unproductive steps
  • **Explain the why** — LLMs have good theory of mind
  • **Bundle repeated work** — same helper script each time = move it to scripts/

Blind Comparison (for major changes):

  1. Run both versions on same evals
  2. Spawn agents/comparator.md — gets outputs A/B without knowing which is which
  3. Comparator scores + picks winner
  4. Spawn agents/analyzer.md — analyzes WHY winner won

This prevents bias in the evaluation.

Mode 3: VALIDATE — 3-stage quality check

Stage 1: Structural Validation

uv run scripts/eval_skill.py <skill-folder>

Checks: frontmatter, naming, description, body size. Target: 10/10.

Stage 2: Discovery (trigger testing)

Generate 6 prompts:

  • 3 SHOULD trigger
  • 3 should NOT (similar-sounding but wrong domain)

Run in clean sessions. Target: 6/6 correct.
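The 6/6 tally can be sketched as follows; the prompts are hypothetical, and the booleans passed to `score()` stand for observed "did the skill load" results from clean sessions:

```python
# Illustrative discovery-stage tally: 3 should-trigger and 3 should-not
# prompts. Expected values come from the prompt design; observed values
# would come from running each prompt in a clean session.
cases = [
    ("analyze this Figma file",       True),   # SHOULD trigger
    ("need design specs for handoff", True),
    ("here is design.fig to inspect", True),
    ("review this Sketch file",       False),  # should NOT trigger
    ("give feedback on my logo",      False),
    ("export Adobe XD artboards",     False),
]

def score(observed):
    correct = sum(obs == expected
                  for (_, expected), obs in zip(cases, observed))
    return f"{correct}/{len(cases)} correct"

print(score([True, True, True, False, False, False]))  # 6/6 correct
```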

Stage 3: 5-Axis Scoring (1-10 each):

| Axis | What it measures |
| --- | --- |
| Discovery | triggers correctly, no false triggers |
| Clarity | instructions are unambiguous |
| Efficiency | token budget respected |
| Robustness | handles edge cases |
| Completeness | covers stated use cases |

Scoring: 45–50 production ready · 35–44 solid · 25–34 needs work · <25 rewrite
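The band logic is literal enough to write down; a minimal sketch summing five 1-10 axis scores:

```python
# Sketch of the scoring bands above: five axes, 1-10 each, summed.
def verdict(axes: dict[str, int]) -> str:
    total = sum(axes.values())
    if total >= 45:
        return "production ready"
    if total >= 35:
        return "solid"
    if total >= 25:
        return "needs work"
    return "rewrite"

scores = {"discovery": 9, "clarity": 10, "efficiency": 9,
          "robustness": 9, "completeness": 9}  # total = 46
print(verdict(scores))  # production ready
```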

Mode 5: OPTIMIZE — automated description tuning

The description competes with other skills for Claude's attention. Optimization finds the wording with the best triggering accuracy.

How it works:

  1. Create eval set: 20 queries (10 should-trigger, 10 should-not)
  2. Train/test split (60%/40%) to prevent overfitting
  3. Optimization loop:
  • Evaluate current description
  • Claude proposes improvement (sees only train data)
  • Re-evaluate
  • Select best by test score
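The loop structure can be sketched under stated assumptions: `evaluate` and `propose` below are stubs standing in for real Claude calls; only the train/test split and select-by-test-score logic mirror the description above.

```python
import random

# Skeleton of the optimization loop. `evaluate` and `propose` are stubs
# standing in for real calls to Claude.
def evaluate(description, queries):
    """Stub: fraction of queries whose triggering matched expectation."""
    return random.random()

def propose(description, train):
    """Stub: Claude rewrites the description, seeing only train data."""
    return description + " (revised)"

def optimize(description, queries, iterations=5, holdout=0.4, seed=0):
    rng = random.Random(seed)
    shuffled = queries[:]
    rng.shuffle(shuffled)
    split = int(len(shuffled) * (1 - holdout))
    train, test = shuffled[:split], shuffled[split:]   # 60% / 40%
    best, best_score = description, evaluate(description, test)
    for _ in range(iterations):
        candidate = propose(best, train)   # proposal sees only train data
        score = evaluate(candidate, test)  # but is selected by test score
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Holding out the test queries from the proposal step is what prevents the description from overfitting to the eval set.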

Writing good eval queries:

# ❌ Too simple - Claude handles without skills
"Format this data"

# ✅ Realistic, substantive  
"my boss sent Q4 sales final FINAL v2.xlsx, add profit margin % column, revenue is col C costs col D"

Should-NOT queries: near-misses that share keywords but call for a different skill. "Write fibonacci" as a negative for a PDF skill is useless — too easy.

Run optimization:

uv run scripts/run_loop.py \
  --eval-set evals/eval_set.json \
  --skill-path <skill-dir> \
  --model claude-sonnet-4-20250514 \
  --max-iterations 5 \
  --holdout 0.4

Skill Categories & Patterns

3 skill types:

  1. Document/Asset Creation — consistent output (docs, designs, code)
  2. Workflow Automation — multi-step processes with a methodology
  3. MCP Enhancement — workflow guidance on top of tool access

Progressive disclosure budget:

| Level | When loaded | Budget |
| --- | --- | --- |
| Frontmatter | always (system prompt) | ~100 words |
| SKILL.md body | on trigger | <500 lines |
| Bundled resources | on demand | unlimited |

File purposes:

| Directory | Loaded? | Purpose |
| --- | --- | --- |
| SKILL.md | on trigger | brain — instructions |
| references/ | on demand | detailed docs, schemas |
| scripts/ | executed, not loaded | deterministic operations |
| assets/ | never loaded | templates, images |

Mode 6: PACKAGE — distribution ready

Quality gate:

  1. Run REVIEW checklist (11 points)
  2. Validate: uv run scripts/quick_validate.py <skill-folder>
  3. Package: uv run scripts/package_skill.py <skill-folder>

Creates: skill-name.skill (zip with .skill extension)

Checklist items:

  • SKILL.md exists, exact case
  • Valid YAML frontmatter
  • description ≤1024 chars, no process steps
  • No README inside skill folder
  • SKILL.md <500 lines
  • Scripts tested and executable
  • No hardcoded secrets
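Several of these checks are mechanical. An illustrative sketch (the real gate is scripts/quick_validate.py; this only mirrors a few items):

```python
from pathlib import Path

# Illustrative pre-package checks mirroring part of the checklist above.
def check_skill(folder: str) -> list[str]:
    root = Path(folder)
    issues = []
    skill_md = root / "SKILL.md"               # exact case matters
    if not skill_md.is_file():
        return ["SKILL.md missing"]
    if len(skill_md.read_text().splitlines()) >= 500:
        issues.append("SKILL.md has 500+ lines")
    if (root / "README.md").exists():
        issues.append("README inside skill folder")
    return issues
```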

Real Example: Create Mode

User request: "Create a skill for analyzing security audit reports"

Step 1: Intent capture

  • Task: Extract vulnerabilities, classify severity, generate executive summary
  • Trigger: "analyze this security report", uploads PDF/DOCX
  • NOT trigger: code review, compliance reports

Step 2: Baseline

Run without skill → agent tries basic file reading, misses structured extraction

Step 3: Architecture

Sequential workflow: parse → extract → classify → summarize

Step 4: Scaffold

uv run scripts/init_skill.py security-audit-analyzer --resources scripts,references

Step 5: Write SKILL.md

---
name: security-audit-analyzer
description: >
  Extract and analyze vulnerabilities from security audit reports. 
  Use when user uploads security assessment PDFs/DOCX or asks to 
  "analyze security report". Do NOT use for code reviews or 
  compliance documents.
---

# Security Audit Analyzer

## Overview
Extracts vulnerabilities, classifies by CVSS severity, generates executive summary.

## Workflow
1. **Parse document** - Extract text, identify sections
2. **Extract findings** - Pull vulnerability details, evidence
3. **Classify severity** - Apply CVSS scoring if not present  
4. **Generate summary** - Executive overview with risk metrics

Step 6: Test & iterate

Create evals → run → review in viewer → fix issues → repeat

Meta-Insights

Skill Conductor teaches us the fundamentals of how agent skills work:

Description is everything — without the right triggering description, a skill is dead code

Progressive disclosure — Claude reads the minimum necessary to complete the task

Test-driven development — always verify failure first, then fix

Blind evaluation — bias affects even AI judgment; automation prevents it

Token economics — every loaded character competes for attention

Iteration beats perfection — small improvements with a tight feedback loop beat a big rewrite

Conclusion

Skill Conductor is not just a tool for creating skills. It is a meta-framework that makes Claude Code a self-improving system.

What makes it unique:

  • **Complete lifecycle** — from idea to production package
  • **Test-driven approach** — everything starts with a failing test
  • **Automated optimization** — machine learning for trigger accuracy
  • **Quality gates** — structural + behavioral validation
  • **Blind evaluation** — unbiased A/B comparison

GitHub: https://github.com/smixs/skill-conductor

The result: a Claude Code that writes skills for other Claude Codes. Meta-programming for AI agents.


*Skill development: when AI teaches AI to be better.* 🧪
