Specification
Extraction pipeline
The protocol does not mandate a specific extractor, but every conformant pipeline must produce the same shape and obey the same source-traceability and language-lock rules. The reference implementation is CKF Compiler v1.3.1.
Pipeline stages
- Preflight (v1.2). profileSource() detects language, format and record count, and hard-blocks empty / filename-only / hash-only inputs before any LLM cost is incurred.
- Segmentation (v1.2). segmentSource() emits SourceSpan[] with stable source_record_id values: one per JSONL record, one per FAQ Q/A, one per normative article, or structural ids for prose. A source_manifest is built from these spans and propagates all the way to source_traceability; this is what enables complete coverage on record-oriented inputs.
- Ingestion. Normalize whitespace and encoding for the LLM stage.
- Chunking. Split spans into semantically coherent passages (see src/lib/compiler/chunker.ts), preserving each chunk's sourceSpanId and sourceRecordId.
- Lift (per-chunk LLM call). For each chunk, extract entities, concepts, principles, heuristics and rules with explicit source_basis. The prompt carries a hard language directive matching the caller's targetLanguage.
- Reduce. Merge per-chunk partials into a single package, deduplicating by id and locking language.
- Promote. Walk atomic_units + retrieval_chunks and promote conditionals into if_then_rules, multi-step actions into playbooks and failure modes into anti_patterns.
- Sanitize. Apply the field-aware global sanitizer (see Language lock) to remove language drift, truncated artifacts and rich-section duplicates, without rejecting legitimate titles/labels. The sanitizer triggers an automatic re-run when the output drifts from the detected source language.
- EnsureIds + RebuildSourceTraceability. Re-assign deterministic ids and rebuild the source_traceability section against the surviving items, carrying source_record_id when present.
- Coverage pass (v1.2). Insert missing retrieval chunks / QA pairs / atomic units according to the active mode (summary · balanced · complete).
- Numeric guards (v1.2). Verify currencies, percentages, dates, durations and citation references against the source spans; correct truncations and flag unverifiable values.
- Score (quality). Compute human_readability and ai_utility_score, and calibrate metadata.
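The Preflight stage's hard-block behavior can be illustrated with a minimal sketch. This is not the reference profileSource(): the function name preflightCheck, the PreflightResult shape, and both regular expressions are assumptions made for illustration, and the record count is approximated by counting non-empty lines.

```typescript
// Hypothetical sketch of a preflight hard-block; not the CKF Compiler's profileSource().
type BlockReason = "empty" | "filename_only" | "hash_only";

interface PreflightResult {
  ok: boolean;
  reason?: BlockReason;
  recordCount: number;
}

// Assumption: a "filename-only" input is a single token ending in a short extension.
const FILENAME_ONLY = /^[\w.\-]+\.[A-Za-z0-9]{1,5}$/;
// Assumption: a "hash-only" input is a lone hex digest of MD5, SHA-1 or SHA-256 length.
const HASH_ONLY = /^(?:[a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64})$/i;

function preflightCheck(rawText: string): PreflightResult {
  const text = rawText.trim();
  if (text.length === 0) {
    return { ok: false, reason: "empty", recordCount: 0 };
  }
  if (HASH_ONLY.test(text)) {
    return { ok: false, reason: "hash_only", recordCount: 0 };
  }
  if (FILENAME_ONLY.test(text)) {
    return { ok: false, reason: "filename_only", recordCount: 0 };
  }
  // Assumption: non-empty lines approximate the record count (exact for JSONL).
  const recordCount = text
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0).length;
  return { ok: true, recordCount };
}
```

A caller would run a check like this before any LLM call and stop when ok is false, which is how the stage avoids spending tokens on inputs that cannot yield a package.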
See Compiler pipeline v1.3.1 for the canonical entry point and metrics surface, and Preflight & coverage for the v1.2 additions in detail.
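As an illustration of the Reduce stage in the list above, the following sketch merges per-chunk partials and deduplicates by id. The Item and PartialPackage shapes and the reducePartials name are hypothetical, since the real partial types are not shown in this section; the "first occurrence wins" policy is also an assumption, and language locking is left out.

```typescript
// Hypothetical shapes; the real CKF partial types may differ.
interface Item {
  id: string;
  text: string;
}

// A partial maps a section name (e.g. "concepts", "rules") to its items.
type PartialPackage = Record<string, Item[]>;

function reducePartials(partials: PartialPackage[]): PartialPackage {
  const merged: PartialPackage = {};
  const seenBySection = new Map<string, Set<string>>();

  for (const partial of partials) {
    for (const section of Object.keys(partial)) {
      const items = partial[section] ?? [];

      let seen = seenBySection.get(section);
      if (seen === undefined) {
        seen = new Set<string>();
        seenBySection.set(section, seen);
      }

      let target = merged[section];
      if (target === undefined) {
        target = [];
        merged[section] = target;
      }

      for (const item of items) {
        // Assumption: the first occurrence of an id wins; later duplicates are dropped.
        if (seen.has(item.id)) continue;
        seen.add(item.id);
        target.push(item);
      }
    }
  }
  return merged;
}
```

Deduplicating per section rather than globally keeps an id collision between, say, a concept and a rule from silently dropping one of them; whether the reference implementation scopes ids this way is not stated here.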
Source-basis labels
Every extracted item carries one of five labels. The label is mandatory and drives downstream agent behavior:
- explicit: the source states it directly.
- inferred: the extractor combined two or more explicit statements.
- synthesized: produced by the extractor (e.g. retrieval chunks). Not a claim about the world.
- author_opinion: the source's stated opinion, not a fact.
- uncertain: the extractor was unsure; agents should treat the item with caution.
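Because the label is mandatory, a validator can reject items that lack one or carry an unknown value. The sketch below encodes the five labels as a TypeScript union and reports offending item ids. The LabeledItem shape and the function names are assumptions for illustration; only the five label strings and the source_basis field name come from this specification.

```typescript
// The five labels defined by the specification.
const SOURCE_BASIS = [
  "explicit",
  "inferred",
  "synthesized",
  "author_opinion",
  "uncertain",
] as const;

type SourceBasis = (typeof SOURCE_BASIS)[number];

// Type guard: true only for one of the five known labels.
function isSourceBasis(value: unknown): value is SourceBasis {
  return typeof value === "string" && SOURCE_BASIS.some((label) => label === value);
}

// Hypothetical item shape; real CKF items carry more fields.
interface LabeledItem {
  id: string;
  source_basis?: unknown;
}

// Returns the ids of items whose label is missing or not one of the five values.
function itemsWithInvalidBasis(items: LabeledItem[]): string[] {
  return items
    .filter((item) => !isSourceBasis(item.source_basis))
    .map((item) => item.id);
}
```

An empty result means every item is labeled; a conformance check could fail the package otherwise.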
No silent inference
An item without an explicit source quote requires inferred or uncertain, never explicit. Failing this rule is a conformance error.
Compression levels
Four levels control how aggressively the extractor compresses prose into structure:
- light: preserves most prose; few inferred items; high human_readability.
- standard: balanced default; recommended for most sources.
- dense: maximal structure; minimal prose; high ai_utility_score.
- agentic: optimized for autonomous agents; emphasizes playbooks, decision rules and tool guidance.
Reference contract
The legacy heuristic compiler keeps the same TypeScript signature:
function compileCkf(rawText: string, options: {
  sourceType: string;
  compressionLevel: "light" | "standard" | "dense" | "agentic";
  outputFormat: "markdown" | "json" | "yaml";
  language?: string;
}): { pkg: CkfPackage; warnings: string[] };

The v1.3.1 pipeline accepts pre-extracted partials plus the new preflight + spans context:
import { runCkfPipeline } from "@/lib/compiler/pipeline";
const result = runCkfPipeline(partials, {
  chunks,                   // ChunkRef[]
  spans,                    // SourceSpan[] from segmentSource() (v1.2)
  sourceManifest,           // built via buildSourceManifest(spans) (v1.2)
  profile,                  // SourceProfile from profileSource() (v1.2)
  coverageMode: "balanced", // "summary" | "balanced" | "complete" (v1.2)
  filename: "policy.md",
  targetLanguage: "en",     // any ISO code; hard language lock
});
result.pkg; // CkfPackage
result.quality; // { human_readability, ai_utility_score, ... }
result.promotion; // { promoted, rejected }
result.sanitizer; // { removed_count, ..., restored_count, language_recovery_applied }
result.preflight; // SourceProfile (v1.2)
result.coverage; // { mode, inserted_*, source_record_coverage } (v1.2)
result.numericIntegrity; // { numeric_integrity_score, exact_matches, ... } (v1.2)
result.compilerVersion; // "v1.3.1"