Compiler

Language lock

A CKF package declares one language. The compiler must guarantee that every prose field inside the package matches that language — without falsely rejecting short identifiers, brand names, or technical labels.

Why it exists

LLMs routinely drift between languages when the source mixes them (e.g. a Portuguese article that quotes English source code). Without a lock, a single Portuguese package can end up with English retrieval chunks, English procedures and English playbooks — breaking agent answers in production.

Regression that motivated the lock

Compiler v1.03 introduced a global completeness sanitizer that was applied indiscriminately: it ran prose-completeness checks on title and label fields and, as a result, removed all retrieval chunks, procedures, if-then rules and playbooks from "What CKF is not". v1.03.1 (rolled into v1.1) made the sanitizer field-aware. v1.2 keeps the field-aware sanitizer and adds automatic language recovery: a single re-run when the post-sanitizer output drifts from preflight.detectedLanguage, reported in metrics.sanitizer.language_recovery_applied and restored_count.

Two-layer enforcement

Prompt-level language directive. Every per-chunk LLM call carries a hard directive: "You MUST write all output in [target language]. Do not switch languages even if the source contains other languages."
Field-aware global sanitizer. After reduce, sanitizeMergedPackage() walks the package and applies the rules below per field-class.
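
The first layer can be pictured as a small prompt builder. This is only a sketch: buildLanguageDirective, buildChunkPrompt and the LANGUAGE_NAMES table are hypothetical names, not the compiler's actual API. Only the directive wording comes from the text above.

```typescript
// Hypothetical lookup from a language code to the name used in the prompt.
const LANGUAGE_NAMES: Record<string, string> = {
  en: "English",
  pt: "Portuguese",
  es: "Spanish",
};

// Builds the hard directive quoted above for a given target language.
function buildLanguageDirective(targetLanguage: string): string {
  const code = targetLanguage.trim().toLowerCase();
  // Fall back to the raw code when the language is not in the lookup table.
  const name = LANGUAGE_NAMES[code] ?? code;
  return (
    `You MUST write all output in ${name}. ` +
    "Do not switch languages even if the source contains other languages."
  );
}

// Layer 1 (assumed placement): the directive goes ahead of every per-chunk prompt.
function buildChunkPrompt(targetLanguage: string, chunkText: string): string {
  return `${buildLanguageDirective(targetLanguage)}\n\n${chunkText}`;
}
```

Putting the directive ahead of the chunk text is an assumption of this sketch. The second layer still has to run afterwards, because a prompt directive alone cannot guarantee the output language.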

Field-aware rules

The sanitizer classifies every field as one of four kinds and applies a different policy:

Title / label / id — short identifiers. Language detection is skipped (one or two tokens is statistically meaningless). Completeness check is also skipped.
Body / description / answer / step — long-form prose. Full language + completeness + truncation filters apply.
Source excerpt — quoted text from the original source. Language mismatch is allowed (a PT package can legitimately quote EN source) when allowSourceExcerptLanguageMismatch: true.
Structured value — numbers, enums, ids. Excluded from text filters entirely.
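
A minimal sketch of how the four field kinds might map to filter policies. The field-name sets, classifyField and policyFor are hypothetical. The truncation setting for identifiers and the completeness and truncation settings for source excerpts are assumptions, since the rules above do not specify them.

```typescript
type FieldKind = "identifier" | "prose" | "source_excerpt" | "structured";

interface FieldPolicy {
  language: boolean;     // run language detection on this field
  completeness: boolean; // run the prose-completeness check
  truncation: boolean;   // run the truncation check
}

// Hypothetical field-name sets; a real package schema may use other names.
const IDENTIFIER_FIELDS = new Set(["title", "label", "id"]);
const PROSE_FIELDS = new Set(["body", "description", "answer", "step"]);
const EXCERPT_FIELDS = new Set(["source_excerpt"]);

function classifyField(name: string, value: unknown): FieldKind {
  // Non-string values (numbers, booleans, ...) never go through text filters.
  if (typeof value !== "string") return "structured";
  if (IDENTIFIER_FIELDS.has(name)) return "identifier";
  if (EXCERPT_FIELDS.has(name)) return "source_excerpt";
  if (PROSE_FIELDS.has(name)) return "prose";
  // Assumption: unknown string fields (e.g. enums) are treated as structured.
  return "structured";
}

function policyFor(
  kind: FieldKind,
  allowSourceExcerptLanguageMismatch: boolean,
): FieldPolicy {
  switch (kind) {
    case "prose":
      return { language: true, completeness: true, truncation: true };
    case "source_excerpt":
      // The language check is waived only when the mismatch is explicitly allowed.
      // Assumption: excerpts are quoted verbatim, so the other filters stay off.
      return {
        language: !allowSourceExcerptLanguageMismatch,
        completeness: false,
        truncation: false,
      };
    case "identifier": // language and completeness skipped, as stated above
    case "structured": // excluded from text filters entirely
      return { language: false, completeness: false, truncation: false };
  }
}
```

With this shape, a title such as "CKF" never reaches the language detector, which is the false rejection the field-aware rules are meant to prevent.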

Configuration

import { sanitizeMergedPackage } from "@/lib/compiler/packageSanitizer";

sanitizeMergedPackage(pkg, {
  action: "remove",                            // "remove" | "quarantine"
  languageFilter: true,                        // drop items whose prose drifts
  completenessFilter: true,                    // drop truncated / mid-sentence items
  truncationFilter: true,                      // drop items ending with "..." mid-clause
  deduplicateRichSections: true,               // procedures/playbooks/if_then_rules
  allowSourceExcerptLanguageMismatch: true,    // a PT pkg can quote EN source
});

When targetLanguage is not provided, the pipeline auto-detects it from preflight.detectedLanguage (any ISO code: EN, PT, ES or others). Language recovery is bounded to a single re-run per compile.
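
The fallback and the single-re-run bound can be sketched as follows. Everything here is illustrative: compileWithLanguageRecovery, PassResult and the runPass callback are hypothetical names, the callback is synchronous only to keep the example short, and computing restored_count as the difference in kept items between passes is an assumption rather than the pipeline's documented behavior.

```typescript
interface PassResult {
  outputLanguage: string; // language detected in the post-sanitizer output
  keptItems: number;      // items that survived the sanitizer
}

interface RecoveryOutcome {
  result: PassResult;
  passes: number;
  language_recovery_applied: boolean;
  restored_count: number;
}

function compileWithLanguageRecovery(
  targetLanguage: string | undefined,
  preflightDetectedLanguage: string,
  runPass: (language: string) => PassResult, // hypothetical compile + sanitize pass
): RecoveryOutcome {
  // Fall back to the preflight detection when no target language is given.
  const language = (targetLanguage ?? preflightDetectedLanguage).toLowerCase();

  const first = runPass(language);
  if (first.outputLanguage.toLowerCase() === language) {
    return { result: first, passes: 1, language_recovery_applied: false, restored_count: 0 };
  }

  // Drift detected: re-run exactly once, then stop whatever the outcome.
  const second = runPass(language);
  return {
    result: second,
    passes: 2,
    language_recovery_applied: true,
    // Assumption: "restored" means items kept by the re-run beyond the first pass.
    restored_count: Math.max(0, second.keptItems - first.keptItems),
  };
}
```

Because the function has no loop, a pass that keeps drifting costs at most two runs per compile, which matches the bound stated above.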

Sanitizer report

Every run returns a structured report consumed by the pipeline's warnings and the MCP metrics surface:

type SanitizerReport = {
  removed_count: number;
  quarantined_count: number;
  deduplicated_count: number;
  restored_count?: number;             // v1.2 — items restored by language recovery
  language_recovery_applied?: boolean; // v1.2 — true when a re-run was triggered
  removals: Array<{ section: string; id: string; reason: string }>;
};
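
As an illustration of how a consumer might read this report, the sketch below turns removals into warning strings and counts them by reason. The helper names (toWarnings, countRemovalsByReason), the warning format and the reason values used in the usage note are hypothetical. Only the SanitizerReport shape comes from the definition above.

```typescript
type SanitizerReport = {
  removed_count: number;
  quarantined_count: number;
  deduplicated_count: number;
  restored_count?: number;
  language_recovery_applied?: boolean;
  removals: Array<{ section: string; id: string; reason: string }>;
};

// One warning line per removal, so that nothing is dropped silently.
function toWarnings(report: SanitizerReport): string[] {
  const warnings = report.removals.map(
    (r) => `sanitizer: removed ${r.section}/${r.id} (${r.reason})`,
  );
  if (report.language_recovery_applied) {
    warnings.push(
      `sanitizer: language recovery re-run restored ${report.restored_count ?? 0} item(s)`,
    );
  }
  return warnings;
}

// Aggregates removals per reason, e.g. for a metrics surface.
function countRemovalsByReason(report: SanitizerReport): Map<string, number> {
  const counts = new Map<string, number>();
  for (const removal of report.removals) {
    counts.set(removal.reason, (counts.get(removal.reason) ?? 0) + 1);
  }
  return counts;
}
```

For example, a report with two removals tagged "language_mismatch" and one tagged "truncated" (both illustrative reason values) would yield three warning lines and a count map with two entries.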

Operator guarantees

A package that declares language: "pt" contains no English-only prose fields after compilation.
retrieval_chunks.length ≥ 15, procedures.length ≥ 1, if_then_rules.length ≥ 1 for any source long enough to warrant them (regression test on "What CKF is not").
Removals are auditable through the sanitizer report; nothing is silently dropped.
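
The structural guarantee can be expressed as a small regression check. This is a sketch under assumptions: checkStructuralGuarantees and the CompiledPackageShape type are hypothetical, and only the thresholds (15, 1, 1) come from the guarantees above. They apply only to sources long enough to warrant those sections.

```typescript
// Hypothetical minimal view of a compiled package; real packages carry more fields.
interface CompiledPackageShape {
  retrieval_chunks: unknown[];
  procedures: unknown[];
  if_then_rules: unknown[];
}

// Returns one message per violated guarantee; an empty array means all hold.
function checkStructuralGuarantees(pkg: CompiledPackageShape): string[] {
  const failures: string[] = [];
  if (pkg.retrieval_chunks.length < 15) {
    failures.push(`retrieval_chunks.length is ${pkg.retrieval_chunks.length}, expected >= 15`);
  }
  if (pkg.procedures.length < 1) {
    failures.push("procedures is empty, expected >= 1");
  }
  if (pkg.if_then_rules.length < 1) {
    failures.push("if_then_rules is empty, expected >= 1");
  }
  return failures;
}
```

A regression test on "What CKF is not" could assert that this function returns an empty array for the compiled output, which would have failed under the v1.03 sanitizer.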