Train cheap models
on expensive ones.

Adaptive model distillation framework. Start with frontier API models, progressively train a local model, then withdraw the expensive dependency — while maintaining quality guarantees. Automatically. With receipts.

pip install apprentice-ai

The frontier model is a teacher,
not a dependency.

You're paying $15 per million tokens for tasks a fine-tuned 8B model could handle. Classification, extraction, routing, summarization — repetitive, domain-specific work where the frontier model is overkill once you have enough examples.

Apprentice's answer: treat every API response as both a result and a labeled training example. Over time, the local model absorbs the teacher's behavior. The API bill goes down. The local model gets better. The quality metrics prove it.
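The collection step above can be sketched in a few lines. This is a minimal illustration, assuming a JSONL store; the function and field names are hypothetical, not Apprentice's actual API:

```python
import json

def record_example(path: str, task: str, prompt: str, response: str) -> None:
    # Append each API response as a labeled fine-tuning example (JSONL).
    # Field names here are illustrative assumptions.
    with open(path, "a") as f:
        f.write(json.dumps({
            "task": task,
            "prompt": prompt,
            "completion": response,
        }) + "\n")
```

The caller never sees this step: the response is returned immediately, and the example is written as a side effect.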

This is the opposite of prompt engineering. Prompt engineering optimizes the question. Apprentice optimizes which model you ask, based on evidence.

"The confidence engine doesn't trust the local model by default. It earns trust through a sliding window of comparison scores. Phase transitions happen mechanically: 50 examples → start comparing. 0.85 correlation sustained → start routing locally. If correlation drops, traffic shifts back."

No manual intervention required. The data decides.

Three phases. All data-driven.

No manual cutover. The system decides when the local model is ready based on sustained quality metrics.


Phase 1 — Cold Start

Learn from the teacher

Every request goes to the remote API. Responses are stored as training examples. After enough accumulate, fine-tuning begins. The caller gets results immediately — data collection is invisible.

Phase 2 — Reinforcement

Prove the student

Both models process each request. The evaluator scores local vs. remote output. A rolling window tracks correlation. When it exceeds the threshold for long enough, the system promotes to Phase 3.

Phase 3 — Steady State

Trust but verify

The local model handles most traffic. An adaptive sampler periodically sends requests to both models to verify quality. If correlation drops, the system automatically regresses to Phase 2.

Replace API costs with local inference

Everything you need to go from $15/M tokens to near zero, with proof that quality holds.

📊

Confidence Engine

Sliding window correlation tracking with configurable thresholds. Phase transitions are mechanical, not manual. The data decides when the local model is ready.

🔀

Transparent Routing

The caller sends a request, gets a response. They don't know if it came from Claude, a LoRA on Llama, or both. Routing is automatic and invisible.

🧪

Multi-Evaluator

Exact match, structured field comparison, semantic similarity. Choose the evaluator that fits your task. Each task gets independent quality tracking.
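As a sketch of what the first two evaluator families might score, here is a minimal illustration. The function names and scoring scheme are assumptions for exposition, not Apprentice's API:

```python
def exact_match(local: str, remote: str) -> float:
    # Score 1.0 only when outputs are identical after trimming whitespace.
    return 1.0 if local.strip() == remote.strip() else 0.0

def structured_match(local: dict, remote: dict) -> float:
    # Fraction of the teacher's fields that the student reproduced exactly.
    if not remote:
        return 1.0
    agreed = sum(1 for k, v in remote.items() if local.get(k) == v)
    return agreed / len(remote)
```

Semantic similarity would replace the field comparison with an embedding distance; the key point is that every evaluator reduces a local/remote pair to a score the confidence window can track.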

💰

Budget-Aware

Multi-window spend tracking with alerts. Set monthly limits, let the system optimize. As local routing increases, your API bill drops automatically.

🔄

Adaptive Sampling

In Phase 3, sampling frequency adjusts based on correlation stability. High correlation → fewer checks. Drift detected → more checks. Self-tuning.
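One way to picture that self-tuning rule, as a hedged sketch (the multiplicative step and the bounds are assumptions, not the shipped policy):

```python
def next_sample_rate(rate: float, correlation: float,
                     threshold: float = 0.85,
                     floor: float = 0.01, ceiling: float = 0.5) -> float:
    """Illustrative rule: stable correlation halves the dual-run rate,
    drift below the threshold doubles it, clamped to [floor, ceiling]."""
    if correlation >= threshold:
        rate /= 2   # high correlation -> fewer verification checks
    else:
        rate *= 2   # drift detected -> more verification checks
    return max(floor, min(ceiling, rate))
```

The floor matters: even a fully trusted local model keeps a trickle of dual-run traffic, which is what lets regression to Phase 2 trigger at all.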

🔧

Multi-Provider

Anthropic, OpenAI, or any API as the remote teacher. Ollama, vLLM, or llama.cpp as the local student. Mix and match per task.

🔒

PII Protection

Hybrid, layered PII detection: fast regex, field heuristics, and optional NER model inference. Scrubs sensitive data before it reaches models or logs. Learns over time.
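The fast regex layer alone might look like this sketch. The patterns and placeholder format are illustrative; the real detector layers field heuristics and optional NER on top:

```python
import re

# Illustrative fast-path patterns only; real coverage is broader.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text: str) -> str:
    # Replace each match with a typed placeholder before the text
    # reaches a model or a log line.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Regexes catch well-formed identifiers cheaply; the NER layer exists for the things regexes structurally cannot see, like person names.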

🧠

NER Integration

Optional transformer-based named entity recognition catches person names, addresses, and organizations that regex can't. Lazy-loaded — zero overhead when disabled.

📝

Feedback Loop

Human and AI feedback drives continuous improvement. False positive/negative reports adjust detection confidence. The system gets smarter with every correction.

28 components.
Zero cross-dependencies.

21 leaf implementations wired together by 7 integration compositions. Each component was contract-tested independently before integration. 2,486 tests verify the system end-to-end.

Built with Pact — contracts and tests were generated before any implementation began. Agents implemented each component in parallel, verified by mechanical test gates.

The architecture handles multiple tasks simultaneously. One task might be in Phase 3 (local model proven) while another is still collecting examples in Phase 1. Each gets its own confidence window, evaluator, and phase progression.

config_loader // YAML → validated config
task_registry // task definitions + schemas
router // local / remote / dual
evaluators // exact, structured, semantic
rolling_window // correlation tracking
phase_manager // 1 → 2 → 3 transitions
sampling_scheduler // adaptive frequency
training_data_store // collect examples
fine_tuning_orchestrator // LoRA, OpenAI, HF
budget_manager // spend tracking
pii_detection // regex + NER + heuristics
pii_tokenizer // detect → replace → restore
2,486 tests pass // verified

HTTP API & CLI

Use it as a library, a CLI tool, or a standalone HTTP service with a REST API.

Endpoint             Method  Purpose
/health              GET     Health check
/v1/run              POST    Submit a task request — routing is automatic
/v1/status/{skill}   GET     Phase, confidence, and budget for a specific skill
/v1/recommendations  POST    Request a recommendation for a skill
/v1/feedback         POST    Submit feedback on a recommendation
/v1/events           POST    Ingest external events (fire-and-forget)
/v1/skills           GET     List configured skills with phase info
/v1/report           GET     Generate metrics report

Up and running in 60 seconds

Python 3.12+, three dependencies. pip install apprentice-ai and go. For NER-based PII detection: pip install apprentice-ai[ml].

Define your tasks in YAML — prompt templates, evaluators, confidence thresholds. Point at a remote API and a local model. Apprentice handles the rest: data collection, fine-tuning, phase transitions, quality monitoring.
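A config might look like this sketch. Every key name below is an assumption for illustration (consult the project docs for the real schema):

```yaml
# Hypothetical apprentice.yaml; key names are illustrative, not the real schema.
tasks:
  classify_ticket:
    prompt: "Classify this support ticket: {text}"
    evaluator: structured        # exact | structured | semantic
    confidence:
      min_examples: 50           # Phase 1 -> 2: start comparing
      promote_correlation: 0.85  # Phase 2 -> 3: route locally
remote:
  provider: anthropic
local:
  provider: ollama
  model: llama3.1:8b
```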

View on GitHub

# Install from PyPI
pip install apprentice-ai

# As a library
import asyncio
from apprentice import Apprentice

async def main():
    app = await Apprentice.create("apprentice.yaml")
    response = await app.run("classify_ticket", {
        "text": "Payment didn't go through"
    })
    # response.source → "local" or "remote"

asyncio.run(main())

# As a CLI
apprentice serve config.yaml
apprentice status config.yaml
apprentice report config.yaml

# PII evaluation (optional: pip install apprentice-ai[ml])
apprentice pii-ingest --limit 1000
apprentice pii-evaluate --mode hybrid

Built on the ideas in Beyond Code

Apprentice is one of three systems (alongside Pact and Emergence) built to test the ideas in Beyond Code: Context, Constraints, and the New Craft of Software. The book covers the coordination, verification, and specification problems that motivated these designs.

Read the Book