Transmogrifier

Register-aware prompt translation. Normalize linguistic register to maximize LLM output quality.

GitHub · Releases · Install · API

The Problem

LLM output quality varies dramatically by the linguistic register of your input. Same question, different phrasing, different accuracy:

| Model | Spread | Best | Worst | Status |
|---|---|---|---|---|
| Claude Opus 4 | 18.8pp | direct (93.8%) | casual (75.0%) | sensitive |
| Gemini 2.5 Flash | 56.2pp | direct (56.2%) | narrative (0%) | critical |
| Claude Haiku 4.5 | 6.2pp | direct (93.8%) | casual (87.5%) | mild |
| GPT-4o Mini | 0pp | (invariant) | (invariant) | invariant |

Casual register costs Claude Opus roughly a fifth of its correct answers; narrative register wipes out Gemini's entirely. Transmogrifier detects the input register and normalizes the prompt before it reaches the model.

Three Levels

Level 1

System prompt injection. Prepends a normalization instruction.

0ms · no API calls · 67-100% recovery
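Mechanically, Level 1 is just a string prepend. A minimal sketch, assuming a hypothetical `inject_level1` helper and illustrative instruction text (not the library's actual wording):

```python
# Hypothetical sketch of Level 1: prepend a normalization instruction
# to the system prompt before the request reaches the model.
NORMALIZE_INSTRUCTION = (
    "Before answering, mentally restate the user's question in plain, "
    "direct technical language, then answer the restated question."
)

def inject_level1(system_prompt: str) -> str:
    """Prepend the normalization instruction to an existing system prompt."""
    if not system_prompt:
        return NORMALIZE_INSTRUCTION
    return NORMALIZE_INSTRUCTION + "\n\n" + system_prompt

print(inject_level1("You are a networking tutor."))
```

Because no API call is involved, this level is free to apply unconditionally.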

Level 2

Rule-based rewriting. Strips filler, hedging, and framing via regex.

<1ms · no API calls · preserves semantics
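The rule-based pass can be pictured as a small cascade of regex substitutions. A toy sketch; the pattern lists here are illustrative, the library's actual rule set is larger:

```python
import re

# Toy sketch of Level 2: strip casual filler, hedging, and framing
# with regex rules. Patterns are illustrative, not exhaustive.
FILLER = re.compile(r"\b(yo|so|like|um|uh|basically|honestly)\b[, ]*", re.IGNORECASE)
FRAMING = re.compile(r"^(what's the deal with|can you tell me about)\s+", re.IGNORECASE)

def normalize_level2(text: str) -> str:
    text = FILLER.sub("", text).strip(" ,")
    text = FRAMING.sub("", text).strip()
    return text

print(normalize_level2("yo so like, what's the deal with TCP"))  # TCP
```

Because only tokens with no semantic load are removed, the question's meaning survives the rewrite.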

Level 3

LLM translation. Separate-context rewrite for heavy-lift cases.

~200ms · 1 API call · only when spread >10pp
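The ">10pp" gate amounts to a cheap lookup against the model's calibrated profile before paying for the extra API call. A hypothetical sketch (the profile shape and function name are assumptions; spreads are taken from the table above):

```python
# Hypothetical sketch of Level 3 routing: only invoke the extra
# LLM rewrite when the model's calibrated register spread exceeds 10pp.
PROFILES = {  # spread in percentage points, from calibration
    "claude-opus-4": 18.8,
    "gemini-2-5-flash": 56.2,
    "claude-haiku-4-5": 6.2,
    "gpt-4o-mini": 0.0,
}

def needs_level3(model: str, threshold_pp: float = 10.0) -> bool:
    """Route to the LLM translation pass only for sufficiently sensitive models."""
    return PROFILES.get(model, 0.0) > threshold_pp

print([m for m in PROFILES if needs_level3(m)])
```

Under these numbers, only Opus and Gemini Flash would ever trigger the ~200ms path.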

Install

pip install git+https://github.com/jmcentire/transmogrifier.git

Optional extras:

pip install "transmogrifier[anthropic]"    # Claude backend (Level 3)
pip install "transmogrifier[openai]"       # OpenAI backend
pip install "transmogrifier[gemini]"       # Gemini backend
pip install "transmogrifier[validation]"   # Semantic equivalence checking
pip install "transmogrifier[all]"          # Everything

API

Python Library

from transmogrifier.core import Transmogrifier

t = Transmogrifier()
result = t.translate("yo what's the deal with TCP", model="claude-opus-4")

result.output_text        # "TCP" (rewritten)
result.system_prompt      # Level 1 normalization instruction
result.detected_register  # Register.casual
result.target_register    # Register.direct
result.elapsed_ms         # ~2ms
result.skipped            # False

Invariant models are skipped automatically:

result = t.translate("yo what's TCP", model="gpt-4o-mini")
result.skipped      # True
result.skip_reason  # "invariant model (0.0pp spread)"

CLI

# Detect register
transmogrify detect "yo what's the deal with TCP"
# {"register": "casual", "confidence": 1.0}

# Classify task type
transmogrify classify "Write a Python function to sort a list"
# {"task_type": "code", "confidence": 1.0}

# Translate (task-aware routing)
transmogrify translate "yo what's the difference between TCP and UDP" --model gemini-2-5-flash
# Detected: casual | Task: analysis | Target: technical (per-task optimal)

# List profiles with per-task breakdown
transmogrify profile list

# Show detailed profile
transmogrify profile show claude-opus-4

# Calibrate a new model (50 tasks x 5 registers = 250 API calls)
transmogrify profile calibrate my-model --provider anthropic
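Conceptually, calibration scores every (register, task) cell and then aggregates per-register accuracy and spread. A toy sketch of the aggregation step only (function name and data are illustrative, not the library's internals):

```python
from collections import defaultdict

# Toy sketch of calibration aggregation: given pass/fail results per
# register, compute per-register accuracy and the spread in pp.
def summarize(results):
    """results: iterable of (register, passed) -> (accuracy dict, spread_pp)."""
    tally = defaultdict(lambda: [0, 0])  # register -> [passed, total]
    for register, passed in results:
        tally[register][0] += int(passed)
        tally[register][1] += 1
    accuracy = {r: 100.0 * p / n for r, (p, n) in tally.items()}
    spread_pp = max(accuracy.values()) - min(accuracy.values())
    return accuracy, spread_pp

# 4 illustrative tasks x 2 registers
results = [("direct", True)] * 3 + [("direct", False)] + \
          [("casual", True)] * 2 + [("casual", False)] * 2
acc, spread = summarize(results)
print(acc, spread)  # direct 75.0, casual 50.0, spread 25.0
```

The resulting spread is what drives the invariant/mild/sensitive/critical labels and the Level 3 gate.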

MCP Server

claude mcp add --scope user --transport stdio transmogrifier -- transmog-mcp

Tools: transmog_translate, transmog_detect, transmog_profiles

Key Findings

Register sensitivity is model-specific

GPT-4o Mini shows zero sensitivity. Claude Opus shows 18.8pp. Gemini shows 56.2pp. A one-size-fits-all approach wastes effort on invariant models and under-serves sensitive ones. Transmogrifier uses per-model profiles to adapt.
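Per-model adaptation can be sketched as choosing translation effort from the calibrated spread. The thresholds and function name below are illustrative assumptions, not the library's actual logic:

```python
# Hypothetical sketch of per-model adaptation: pick which translation
# levels to run from the calibrated spread (thresholds illustrative).
def plan_levels(spread_pp: float) -> list[int]:
    if spread_pp == 0.0:
        return []            # invariant model: skip entirely
    levels = [1, 2]          # cheap levels for any sensitive model
    if spread_pp > 10.0:
        levels.append(3)     # LLM rewrite only when it pays off
    return levels

print(plan_levels(0.0), plan_levels(6.2), plan_levels(56.2))
```

This is why GPT-4o Mini requests fall straight through while Gemini requests get the full pipeline.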

Register × task interaction is non-uniform (v0.2.0)

Academic register halves reasoning accuracy on Claude Opus (40% vs 80% direct). Meanwhile, Gemini scores 0% on analysis with direct register but 33% with technical. The optimal register depends on both model and task type. Transmogrifier v0.2.0 includes a task classifier (factual, reasoning, code, analysis, creative, instruction) and routes to the per-(model, task) optimal register.

| Model | Category | Best Register | Spread |
|---|---|---|---|
| Claude Opus | reasoning | direct (80%) | 40pp (academic=40%) |
| Claude Opus | factual/code/analysis | (all invariant) | 0pp |
| Gemini Flash | factual | direct/technical (100%) | 100pp (narrative=0%) |
| Gemini Flash | analysis | technical/academic (33%) | 33pp (direct=0%) |
| Gemini Flash | overall | technical (62.5%) | 62.5pp |
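Per-(model, task) routing reduces to a table lookup with a model-wide fallback. A hypothetical sketch whose entries follow the findings above (keys and names are illustrative):

```python
# Hypothetical sketch of per-(model, task) routing: look up the optimal
# target register, falling back to the model-wide best register.
OPTIMAL = {
    ("claude-opus-4", "reasoning"): "direct",
    ("gemini-2-5-flash", "factual"): "direct",
    ("gemini-2-5-flash", "analysis"): "technical",
}
DEFAULT = {"claude-opus-4": "direct", "gemini-2-5-flash": "technical"}

def target_register(model: str, task: str) -> str:
    return OPTIMAL.get((model, task), DEFAULT.get(model, "direct"))

print(target_register("gemini-2-5-flash", "analysis"))  # technical
```

An unknown (model, task) pair degrades gracefully to the model's overall best register rather than failing.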

System prompt normalization (Level 1) exceeds direct register on Gemini

Casual input + Level 1 system prompt scored 68.8% on Gemini 2.5 Flash, beating raw direct register (56.2%) by 12.5pp. The normalization instruction provides a genuine quality boost beyond just recovering the register gap.

Same-context translation is catastrophic

Asking the model to "reinterpret this casually-worded question, then answer it" in a single turn scored 0% on Gemini. Translation must happen in a separate context. This is architecturally enforced in Transmogrifier.

Larger models are more sensitive, not less

Claude Opus (18.8pp) is more register-sensitive than Claude Haiku (6.2pp). RLHF doesn't uniformly solve this; the effect is architecture- and training-dependent.

Registers

| Register | Example | Detection Markers |
|---|---|---|
| direct | What is TCP? | Short, unframed, no markers |
| casual | yo so like, what's the deal with TCP | Slang, filler, contractions |
| technical | Provide a precise technical answer: What is TCP? | Imperative verbs, structured |
| academic | In the context of established knowledge, TCP... | Hedging, passive voice, Latinate vocabulary |
| narrative | Explain this as if telling a story: How does TCP work? | Storytelling framing, analogies |
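A marker-based detector along these lines can be sketched with a few regexes and a direct fallback. The patterns below are illustrative only; the real detector is richer:

```python
import re

# Hypothetical marker-based register detector following the table above.
MARKERS = {
    "casual": re.compile(r"\b(yo|like|what's the deal)\b", re.IGNORECASE),
    "technical": re.compile(r"^(provide|give|state)\b.*precise", re.IGNORECASE),
    "academic": re.compile(r"in the context of|established knowledge", re.IGNORECASE),
    "narrative": re.compile(r"as if telling a story|once upon", re.IGNORECASE),
}

def detect_register(text: str) -> str:
    for register, pattern in MARKERS.items():
        if pattern.search(text):
            return register
    return "direct"  # short, unframed, no markers

print(detect_register("yo what's the deal with TCP"))  # casual
```

"direct" is the absence-of-markers case, which is why it serves as the fallback rather than having patterns of its own.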

Architecture

Register Detector → Profile Lookup → Translation Router → Level 1 + Level 2 (+ optional Level 3) → Result

Invariants:

Integration

Designed as middleware for:


Transmogrifier · MIT License · Jeremy McEntire · Source