Compiler
Language lock
A CKF package declares one language. The compiler must guarantee that every prose field inside the package matches that language — without falsely rejecting short identifiers, brand names, or technical labels.
Why it exists
LLMs routinely drift between languages when the source mixes them (e.g. a Portuguese article that quotes English source code). Without a lock, a single Portuguese package can end up with English retrieval chunks, English procedures and English playbooks — breaking agent answers in production.
Regression that motivated the lock
preflight.detectedLanguage, reported in metrics.sanitizer.language_recovery_applied and restored_count.
Two-layer enforcement
- Prompt-level language directive. Every per-chunk LLM call carries a hard directive: "You MUST write all output in [target language]. Do not switch languages even if the source contains other languages."
- Field-aware global sanitizer. After reduce, sanitizeMergedPackage() walks the package and applies the rules below per field class.
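The first layer can be pictured with a minimal sketch. The helper names `buildLanguageDirective` and `buildChunkPrompt`, the language-name table, and the prompt layout are assumptions made for illustration; only the directive wording comes from this page.

```typescript
// Hypothetical table mapping language codes to the names used in the directive.
const LANGUAGE_NAMES: Record<string, string> = {
  en: "English",
  pt: "Portuguese",
  es: "Spanish",
};

// Builds the hard directive quoted above for a given target language.
// Unknown codes fall back to the normalized code itself (an assumption).
function buildLanguageDirective(targetLanguage: string): string {
  const code = targetLanguage.trim().toLowerCase();
  const name = LANGUAGE_NAMES[code] ?? code;
  return `You MUST write all output in ${name}. Do not switch languages even if the source contains other languages.`;
}

// Hypothetical prompt layout: directive first, then the chunk to process.
function buildChunkPrompt(chunkText: string, targetLanguage: string): string {
  return [buildLanguageDirective(targetLanguage), "", "SOURCE CHUNK:", chunkText].join("\n");
}
```

Placing the directive at the top of every per-chunk prompt keeps it independent of the chunk content, so a chunk that quotes English code still carries the Portuguese instruction.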
Field-aware rules
The sanitizer classifies every field as one of four kinds and applies a different policy:
- Title / label / id: short identifiers. Language detection is skipped, because one or two tokens are too few for a reliable result. The completeness check is also skipped.
- Body / description / answer / step: long-form prose. The full language, completeness, and truncation filters apply.
- Source excerpt: quoted text from the original source. Language mismatch is allowed (a PT package can legitimately quote EN source) when allowSourceExcerptLanguageMismatch: true.
- Structured value: numbers, enums, ids. Excluded from text filters entirely.
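The four field classes can be sketched as a small classifier plus a policy table. This is an illustrative sketch, not the sanitizer's actual code: the field-name sets, the `classifyField` and `policyFor` helpers, and the choices marked as assumptions in the comments are all hypothetical.

```typescript
type FieldKind = "identifier" | "prose" | "source_excerpt" | "structured";

interface FieldPolicy {
  languageCheck: boolean;
  completenessCheck: boolean;
  truncationCheck: boolean;
}

// Hypothetical field names for each class.
const IDENTIFIER_FIELDS = new Set(["title", "label", "id"]);
const PROSE_FIELDS = new Set(["body", "description", "answer", "step"]);
const EXCERPT_FIELDS = new Set(["source_excerpt"]);

function classifyField(name: string, value: unknown): FieldKind {
  if (typeof value !== "string") return "structured"; // numbers, booleans, objects
  if (IDENTIFIER_FIELDS.has(name)) return "identifier";
  if (EXCERPT_FIELDS.has(name)) return "source_excerpt";
  if (PROSE_FIELDS.has(name)) return "prose";
  return "structured"; // assumption: unknown string fields (e.g. enums) are not filtered
}

function policyFor(kind: FieldKind, allowSourceExcerptLanguageMismatch: boolean): FieldPolicy {
  switch (kind) {
    case "identifier":
      // Assumption: the truncation filter is skipped too; the rules above only
      // name the language and completeness checks.
      return { languageCheck: false, completenessCheck: false, truncationCheck: false };
    case "prose":
      return { languageCheck: true, completenessCheck: true, truncationCheck: true };
    case "source_excerpt":
      // Assumption: only the language check is relaxed for excerpts.
      return {
        languageCheck: !allowSourceExcerptLanguageMismatch,
        completenessCheck: true,
        truncationCheck: true,
      };
    case "structured":
      return { languageCheck: false, completenessCheck: false, truncationCheck: false };
  }
}
```

Keeping classification separate from policy means a new field only needs a class assignment; the filters themselves stay unchanged.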
Configuration
import { sanitizeMergedPackage } from "@/lib/compiler/packageSanitizer";
sanitizeMergedPackage(pkg, {
action: "remove", // "remove" | "quarantine"
languageFilter: true, // drop items whose prose drifts
completenessFilter: true, // drop truncated / mid-sentence items
truncationFilter: true, // drop items ending with "..." mid-clause
deduplicateRichSections: true, // procedures/playbooks/if_then_rules
allowSourceExcerptLanguageMismatch: true, // a PT pkg can quote EN source
});
When targetLanguage is not provided, the pipeline auto-detects it from preflight.detectedLanguage (any ISO code: EN / PT / ES / others). Language recovery is bounded to a single re-run per compile.
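The fallback and the single bounded re-run can be sketched as follows. The `resolveTargetLanguage` and `compileWithRecovery` helpers, the `driftedCount` signal, and the idea that remaining drift is what triggers recovery are assumptions for illustration; this page only states that the target falls back to preflight.detectedLanguage and that recovery runs at most once.

```typescript
interface Preflight {
  detectedLanguage: string;
}

// Explicit target wins; otherwise fall back to the preflight detection.
function resolveTargetLanguage(targetLanguage: string | undefined, preflight: Preflight): string {
  const explicit = targetLanguage?.trim();
  return (explicit || preflight.detectedLanguage).toLowerCase();
}

const MAX_RECOVERY_RUNS = 1; // "a single re-run per compile"

// runOnce is a hypothetical stand-in for one compile pass; it reports how many
// prose items still drift from the target language (an assumed trigger).
function compileWithRecovery(
  runOnce: (language: string) => { driftedCount: number },
  language: string,
): { runs: number; language_recovery_applied: boolean; driftedCount: number } {
  let result = runOnce(language);
  let runs = 1;
  let recoveries = 0;
  while (result.driftedCount > 0 && recoveries < MAX_RECOVERY_RUNS) {
    result = runOnce(language);
    runs += 1;
    recoveries += 1;
  }
  return { runs, language_recovery_applied: recoveries > 0, driftedCount: result.driftedCount };
}
```

The hard cap guarantees that a source that keeps drifting cannot loop the compile indefinitely; whatever still drifts after the re-run is left to the sanitizer.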
Sanitizer report
Every run returns a structured report consumed by the pipeline's warnings and the MCP metrics surface:
type SanitizerReport = {
removed_count: number;
quarantined_count: number;
deduplicated_count: number;
restored_count?: number; // v1.2 — items restored by language recovery
language_recovery_applied?: boolean; // v1.2 — true when a re-run was triggered
removals: Array<{ section: string; id: string; reason: string }>;
};
Operator guarantees
- A package declared language: "pt" has zero English-only prose fields after compilation.
- retrieval_chunks.length ≥ 15, procedures.length ≥ 1, if_then_rules.length ≥ 1 for any source long enough to warrant them (regression test on "What CKF is not").
- Removals are auditable through the sanitizer report; nothing is silently dropped.
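An operator could check the count guarantees and print the audit trail with a sketch like the one below. The `CompiledPackage` shape and the `checkOperatorGuarantees` and `auditLines` helpers are hypothetical; the thresholds and the report fields are the ones listed on this page, and the thresholds only make sense for sources long enough to warrant them.

```typescript
// Subset of the report type shown above; only removals is used here.
interface SanitizerReportLike {
  removals: Array<{ section: string; id: string; reason: string }>;
}

// Hypothetical minimal package shape, limited to the sections being counted.
interface CompiledPackage {
  language: string;
  retrieval_chunks: unknown[];
  procedures: unknown[];
  if_then_rules: unknown[];
}

// Returns one message per unmet count guarantee; an empty array means all hold.
function checkOperatorGuarantees(pkg: CompiledPackage): string[] {
  const failures: string[] = [];
  if (pkg.retrieval_chunks.length < 15) {
    failures.push(`retrieval_chunks: expected >= 15, got ${pkg.retrieval_chunks.length}`);
  }
  if (pkg.procedures.length < 1) {
    failures.push("procedures: expected >= 1, got 0");
  }
  if (pkg.if_then_rules.length < 1) {
    failures.push("if_then_rules: expected >= 1, got 0");
  }
  return failures;
}

// One human-readable line per removal, for logs or review.
function auditLines(report: SanitizerReportLike): string[] {
  return report.removals.map((r) => `${r.section}/${r.id}: ${r.reason}`);
}
```

Running both helpers after each compile gives a quick signal when a section count regresses, and a readable record of what the sanitizer removed and why.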
See also
- Compiler pipeline v1.3.1 — where the sanitizer fits.
- Extraction pipeline — the per-chunk LLM call that emits partials.