Adaptive model distillation framework. Start with frontier API models, progressively train a local model, then withdraw the expensive dependency — while maintaining quality guarantees. Automatically. With receipts.
```shell
pip install apprentice-ai
```
You're paying $15 per million tokens for tasks a fine-tuned 8B model could handle. Classification, extraction, routing, summarization — repetitive, domain-specific work where the frontier model is overkill once you have enough examples.
Apprentice's answer: treat every API response as both a result and a labeled training example. Over time, the local model absorbs the teacher's behavior. The API bill goes down. The local model gets better. The quality metrics prove it.
This is the opposite of prompt engineering. Prompt engineering optimizes the question. Apprentice optimizes which model you ask, based on evidence.
The confidence engine doesn't trust the local model by default. It earns trust through a sliding window of comparison scores. Phase transitions happen mechanically: 50 examples → start comparing. 0.85 correlation sustained → start routing locally. If correlation drops, traffic shifts back.
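The mechanics above can be sketched as a small sliding-window tracker. This is an illustrative sketch, not Apprentice's actual API: the class name, `size`, and `promote_at` parameters are assumptions, and "correlation" is simplified to the mean agreement score over the window.

```python
from collections import deque

class ConfidenceWindow:
    """Illustrative sliding-window confidence tracker (not the real Apprentice API)."""

    def __init__(self, size: int = 50, promote_at: float = 0.85):
        self.scores = deque(maxlen=size)  # oldest scores fall off automatically
        self.promote_at = promote_at

    def record(self, score: float) -> None:
        self.scores.append(score)

    @property
    def correlation(self) -> float:
        # Simplified: mean agreement score over the window; 0.0 before any data.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def ready_to_promote(self) -> bool:
        # Promote only once the window is full AND the mean clears the bar.
        return len(self.scores) == self.scores.maxlen and self.correlation >= self.promote_at

window = ConfidenceWindow(size=50, promote_at=0.85)
for _ in range(50):
    window.record(0.9)
print(window.ready_to_promote())  # → True
```

Because the deque is bounded, a run of bad scores pushes the mean back under the threshold and `ready_to_promote()` flips to `False` again, which is what lets traffic shift back.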
No manual cutover, no manual intervention. The system promotes the local model only when sustained quality metrics say it's ready; the data decides.
**Phase 1: Collect.** Every request goes to the remote API. Responses are stored as training examples. After enough accumulate, fine-tuning begins. The caller gets results immediately — data collection is invisible.
**Phase 2: Compare.** Both models process each request. The evaluator scores local vs. remote output. A rolling window tracks correlation. When it exceeds the threshold for long enough, the system promotes to Phase 3.
**Phase 3: Local-first.** The local model handles most traffic. An adaptive sampler periodically sends requests to both models to verify quality. If correlation drops, the system automatically regresses to Phase 2.
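The three phases form a small state machine. A minimal sketch of the transition rule, assuming the thresholds described above (the function and parameter names are hypothetical, not Apprentice's internals):

```python
def next_phase(phase: int, n_examples: int, correlation: float,
               min_examples: int = 50, threshold: float = 0.85) -> int:
    """Illustrative phase-transition rule; names and defaults are assumptions."""
    if phase == 1:
        # Enough collected examples -> start shadow comparison.
        return 2 if n_examples >= min_examples else 1
    if phase == 2:
        # Sustained correlation above threshold -> start routing locally.
        return 3 if correlation >= threshold else 2
    # Phase 3: regress to comparison mode if sampled quality drops.
    return 2 if correlation < threshold else 3
```

Note that the only way back from Phase 3 is through Phase 2: a quality drop never silently keeps local routing in place.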
Everything you need to go from $15/M tokens to $0, with proof that quality holds.
Sliding window correlation tracking with configurable thresholds. Phase transitions are mechanical, not manual. The data decides when the local model is ready.
The caller sends a request, gets a response. They don't know if it came from Claude, a LoRA on Llama, or both. Routing is automatic and invisible.
Exact match, structured field comparison, semantic similarity. Choose the evaluator that fits your task. Each task gets independent quality tracking.
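For intuition, here are minimal sketches of the first two evaluator styles — exact match and structured field comparison. These are simplified stand-ins, not the shipped evaluators:

```python
def exact_match(local: str, remote: str) -> float:
    """1.0 if outputs match after trimming whitespace, else 0.0."""
    return 1.0 if local.strip() == remote.strip() else 0.0

def field_match(local: dict, remote: dict, fields: list[str]) -> float:
    """Fraction of required fields on which both outputs agree."""
    hits = sum(local.get(f) == remote.get(f) for f in fields)
    return hits / len(fields)

score = field_match(
    {"intent": "billing", "urgent": True},
    {"intent": "billing", "urgent": False},
    fields=["intent", "urgent"],
)
print(score)  # → 0.5
```

Semantic similarity would follow the same shape but return an embedding-based score rather than a discrete match.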
Multi-window spend tracking with alerts. Set monthly limits, let the system optimize. As local routing increases, your API bill drops automatically.
In Phase 3, sampling frequency adjusts based on correlation stability. High correlation → fewer checks. Drift detected → more checks. Self-tuning.
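One plausible shape for that self-tuning rule — a hypothetical heuristic, not the shipped sampler — scales the dual-model check rate by how far correlation sits below 1.0:

```python
def sample_rate(correlation: float, base: float = 0.05,
                floor: float = 0.01, ceil: float = 0.5) -> float:
    """Illustrative heuristic: higher correlation -> fewer dual-model checks."""
    rate = base + (1.0 - correlation)  # drift widens the gap, raising the rate
    return max(floor, min(ceil, rate))
```

At perfect correlation the sampler idles at the base rate; as drift appears, the check rate climbs toward the ceiling.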
Anthropic, OpenAI, or any API as the remote teacher. Ollama, vLLM, or llama.cpp as the local student. Mix and match per task.
Hybrid multi-modal PII detection: fast regex, field heuristics, and optional NER model inference. Scrubs sensitive data before it reaches models or logs. Learns over time.
Optional transformer-based named entity recognition catches person names, addresses, and organizations that regex can't. Lazy-loaded — zero overhead when disabled.
Human and AI feedback drives continuous improvement. False positive/negative reports adjust detection confidence. The system gets smarter with every correction.
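The regex layer of the hybrid pipeline can be sketched in a few lines. The patterns below are illustrative and deliberately incomplete — the real detector layers field heuristics and optional NER on top:

```python
import re

# Minimal regex-first PII scrub; patterns are illustrative, not exhaustive.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace each detected span with a typed placeholder before logging."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Reach me at jane@example.com or 555-867-5309"))
# → Reach me at [EMAIL] or [PHONE]
```

Regex catches the structured identifiers cheaply; the NER layer exists precisely for the unstructured names and addresses this approach misses.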
21 leaf implementations wired together by 7 integration compositions. Each component was contract-tested independently before integration. 2,486 tests verify the system end-to-end.
Built with Pact — contracts and tests were generated before any implementation began. Agents implemented each component in parallel, verified by mechanical test gates.
The architecture handles multiple tasks simultaneously. One task might be in Phase 3 (local model proven) while another is still collecting examples in Phase 1. Each gets its own confidence window, evaluator, and phase progression.
Use it as a library, a CLI tool, or a standalone HTTP service with a REST API.
| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Health check |
| /v1/run | POST | Submit a task request — routing is automatic |
| /v1/status/{skill} | GET | Phase, confidence, and budget for a specific skill |
| /v1/recommendations | POST | Request a recommendation for a skill |
| /v1/feedback | POST | Submit feedback on a recommendation |
| /v1/events | POST | Ingest external events (fire-and-forget) |
| /v1/skills | GET | List configured skills with phase info |
| /v1/report | GET | Generate metrics report |
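A task submission over the REST API might look like the following stdlib-only sketch. The host, port, and request-body field names (`skill`, `input`) are assumptions for illustration, not the documented schema:

```python
import json
from urllib import request

# Assumes an Apprentice server on localhost:8000 (host/port are assumptions).
payload = json.dumps({
    "skill": "classify_ticket",
    "input": {"text": "Payment didn't go through"},
}).encode()

req = request.Request(
    "http://localhost:8000/v1/run",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With a server running, sending the request is one call:
# with request.urlopen(req) as resp:
#     print(json.load(resp))  # routing (local vs. remote) is decided server-side
```

The caller never specifies a model; `/v1/run` routes based on the skill's current phase.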
Python 3.12+, three dependencies. `pip install apprentice-ai` and go.
For NER-based PII detection: `pip install apprentice-ai[ml]`.
Define your tasks in YAML — prompt templates, evaluators, confidence thresholds. Point at a remote API and a local model. Apprentice handles the rest: data collection, fine-tuning, phase transitions, quality monitoring.
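A task definition might look roughly like this. Every key name here is hypothetical — consult the shipped schema before copying it:

```yaml
# Hypothetical apprentice.yaml; key names are illustrative, not the real schema.
skills:
  classify_ticket:
    prompt: "Classify this support ticket: {text}"
    evaluator: exact_match
    confidence:
      window: 50
      promote_at: 0.85
remote:
  provider: anthropic
local:
  provider: ollama
  model: llama3.1:8b
budget:
  monthly_usd: 200
```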
```shell
# Install from PyPI
pip install apprentice-ai
```

```python
# As a library (inside an async context)
from apprentice import Apprentice

app = await Apprentice.create("apprentice.yaml")
response = await app.run("classify_ticket", {
    "text": "Payment didn't go through"
})
# response.source → "local" or "remote"
```

```shell
# As a CLI
apprentice serve config.yaml
apprentice status config.yaml
apprentice report config.yaml

# PII evaluation (optional: pip install apprentice-ai[ml])
apprentice pii-ingest --limit 1000
apprentice pii-evaluate --mode hybrid
```
Apprentice is one of three systems (alongside Pact and Emergence) built to test the ideas in Beyond Code: Context, Constraints, and the New Craft of Software. The book covers the coordination, verification, and specification problems that motivated these designs.
Read the Book