Specification

Extraction pipeline

The protocol does not mandate a specific extractor, but every conformant pipeline must produce the same shape and obey the same source-traceability and language-lock rules. The reference implementation is CKF Compiler v1.3.1.

Pipeline stages

Preflight (v1.2). profileSource() detects language, format and record count, and hard-blocks empty / filename-only / hash-only inputs before any LLM cost is incurred.
Segmentation (v1.2). segmentSource() emits SourceSpan[] with stable source_record_ids — one per JSONL record, one per FAQ Q/A, one per normative article, or structural ids for prose. A source_manifest is built from these spans and propagates all the way to source_traceability; this is what enables complete coverage on record-oriented inputs.
Ingestion. Normalize whitespace and encoding for the LLM stage.
Chunking. Split spans into semantically coherent passages (see src/lib/compiler/chunker.ts), preserving each chunk's sourceSpanId and sourceRecordId.
Lift (per-chunk LLM call). For each chunk, extract entities, concepts, principles, heuristics and rules with explicit source_basis. The prompt carries a hard language directive matching the caller's targetLanguage.
Reduce. Merge per-chunk partials into a single package, deduplicating by id and locking language.
Promote. Walk atomic_units + retrieval_chunks and promote conditionals into if_then_rules, multi-step actions into playbooks and failure modes into anti_patterns.
Sanitize. Apply the field-aware global sanitizer (see Language lock) to remove language drift, truncated artifacts and rich-section duplicates without rejecting legitimate titles or labels. This stage triggers an automatic re-run when the output drifts from the detected source language.
EnsureIds + RebuildSourceTraceability. Re-assign deterministic ids and rebuild the source_traceability section against the surviving items, carrying source_record_id when present.
Coverage pass (v1.2). Insert missing retrieval chunks / QA pairs / atomic units according to the active mode (summary · balanced · complete).
Numeric guards (v1.2). Verify currencies, percentages, dates, durations and citation references against the source spans; correct truncations, flag unverifiables.
Score (quality). Compute human_readability and ai_utility_score, calibrate metadata.
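
As an illustration of the Preflight stage, the hard-block on empty, filename-only and hash-only inputs could be sketched as below. This is a minimal sketch, not the behavior of profileSource(): the preflightCheck name, the PreflightVerdict shape and both regular expressions are assumptions made for this example.

```typescript
// Hypothetical shapes: the real SourceProfile returned by profileSource() may differ.
type BlockReason = "empty" | "filename_only" | "hash_only";

interface PreflightVerdict {
  ok: boolean;
  reason?: BlockReason;
}

// Illustrative heuristics only (assumptions, not the reference patterns).
// A single token ending in a short extension, e.g. "policy.md".
const FILENAME_ONLY = /^[\w.-]+\.[A-Za-z0-9]{1,5}$/;
// A bare hex digest of MD5, SHA-1 or SHA-256 length.
const HASH_ONLY = /^(?:[a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64})$/i;

// Reject inputs that carry no extractable content before any LLM call is made.
function preflightCheck(rawText: string): PreflightVerdict {
  const text = rawText.trim();
  if (text.length === 0) return { ok: false, reason: "empty" };
  if (HASH_ONLY.test(text)) return { ok: false, reason: "hash_only" };
  if (FILENAME_ONLY.test(text)) return { ok: false, reason: "filename_only" };
  return { ok: true };
}
```

Running the check before chunking means a blocked input never reaches the Lift stage, which is where the per-chunk LLM cost is incurred.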

See Compiler pipeline v1.3.1 for the canonical entry point and metrics surface, and Preflight & coverage for the v1.2 additions in detail.
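
The Reduce stage described above merges per-chunk partials and deduplicates by id. A minimal sketch of that merge follows, assuming a simplified item shape and a first-occurrence-wins policy; the PartialItem type and reducePartials name are hypothetical, and the reference implementation may resolve duplicates differently.

```typescript
// Hypothetical item shape: real CKF items carry more fields.
interface PartialItem {
  id: string;
  text: string;
  source_record_id?: string;
}

// Merge per-chunk partials into one list, keeping the first item seen for each id.
// Map preserves insertion order, so output order follows chunk order.
function reducePartials(partials: PartialItem[][]): PartialItem[] {
  const byId = new Map<string, PartialItem>();
  for (const chunkItems of partials) {
    for (const item of chunkItems) {
      if (!byId.has(item.id)) {
        byId.set(item.id, item);
      }
    }
  }
  return Array.from(byId.values());
}
```

Keeping source_record_id on each surviving item is what lets a later stage rebuild source_traceability against the merged set.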

Source-basis labels

Every extracted item carries one of five labels. The label is mandatory and drives downstream agent behavior:

explicit — the source states it directly.
inferred — the extractor combined two or more explicit statements.
synthesized — produced by the extractor (e.g. retrieval chunks). Not a claim about the world.
author_opinion — the source's stated opinion, not a fact.
uncertain — extractor was unsure; agents should treat with caution.

No silent inference

Producing an item without an explicit source quote requires inferred or uncertain, never explicit. Failing this rule is a conformance error.
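
A conformance check for this rule could look like the sketch below. The five label values come from the list above; the ExtractedItem shape, the source_quote field name and the findSilentInferences helper are assumptions introduced for illustration only.

```typescript
// The five source-basis labels defined by the protocol.
type SourceBasis =
  | "explicit"
  | "inferred"
  | "synthesized"
  | "author_opinion"
  | "uncertain";

// Hypothetical item shape: source_quote is an assumed field name.
interface ExtractedItem {
  id: string;
  source_basis: SourceBasis;
  source_quote?: string;
}

// Return the ids of items labeled explicit that carry no non-empty source quote.
function findSilentInferences(items: ExtractedItem[]): string[] {
  return items
    .filter(
      (item) =>
        item.source_basis === "explicit" &&
        (item.source_quote ?? "").trim().length === 0,
    )
    .map((item) => item.id);
}
```

A non-empty result would indicate a conformance error: each listed item should either gain a quote or be relabeled inferred or uncertain.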

Compression levels

Four levels control how aggressively the extractor compresses prose into structure:

light — preserves most prose; few inferred items; high human_readability.
standard — balanced default; recommended for most sources.
dense — maximal structure; minimal prose; high ai_utility_score.
agentic — optimized for autonomous agents; emphasizes playbooks, decision rules and tool guidance.

Reference contract

The legacy heuristic compiler keeps its existing TypeScript signature:

function compileCkf(rawText: string, options: {
  sourceType: string;
  compressionLevel: "light" | "standard" | "dense" | "agentic";
  outputFormat: "markdown" | "json" | "yaml";
  language?: string;
}): { pkg: CkfPackage; warnings: string[] };

The v1.3.1 pipeline accepts pre-extracted partials plus the new preflight + spans context:

import { runCkfPipeline } from "@/lib/compiler/pipeline";

const result = runCkfPipeline(partials, {
  chunks,                  // ChunkRef[]
  spans,                   // SourceSpan[] from segmentSource()  (v1.2)
  sourceManifest,          // built via buildSourceManifest(spans) (v1.2)
  profile,                 // SourceProfile from profileSource()  (v1.2)
  coverageMode: "balanced",// "summary" | "balanced" | "complete" (v1.2)
  filename: "policy.md",
  targetLanguage: "en",    // any ISO code — hard language lock
});

result.pkg;                // CkfPackage
result.quality;            // { human_readability, ai_utility_score, ... }
result.promotion;          // { promoted, rejected }
result.sanitizer;          // { removed_count, ..., restored_count, language_recovery_applied }
result.preflight;          // SourceProfile (v1.2)
result.coverage;           // { mode, inserted_*, source_record_coverage } (v1.2)
result.numericIntegrity;   // { numeric_integrity_score, exact_matches, ... } (v1.2)
result.compilerVersion;    // "v1.3.1"
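
A caller might use the metrics surface above to decide whether a compiled package needs human review. The following is a sketch under stated assumptions: the field names are taken from the result listing, but their types (scores as numbers between 0 and 1), the Thresholds shape and the reviewReasons helper are hypothetical.

```typescript
// Structural subset of the result fields listed above.
// The numeric ranges (0 to 1) are assumptions, not documented behavior.
interface RunMetrics {
  coverage: {
    mode: "summary" | "balanced" | "complete";
    source_record_coverage: number;
  };
  numericIntegrity: { numeric_integrity_score: number };
  sanitizer: { language_recovery_applied: boolean };
}

// Hypothetical thresholds: tune per deployment.
interface Thresholds {
  minRecordCoverage: number;
  minNumericIntegrity: number;
}

// Collect human-readable reasons why a run should be reviewed before use.
function reviewReasons(m: RunMetrics, t: Thresholds): string[] {
  const reasons: string[] = [];
  // Record coverage is only enforced here when the caller asked for complete mode.
  if (
    m.coverage.mode === "complete" &&
    m.coverage.source_record_coverage < t.minRecordCoverage
  ) {
    reasons.push("source_record_coverage below threshold in complete mode");
  }
  if (m.numericIntegrity.numeric_integrity_score < t.minNumericIntegrity) {
    reasons.push("numeric_integrity_score below threshold");
  }
  if (m.sanitizer.language_recovery_applied) {
    reasons.push("language recovery re-run was applied");
  }
  return reasons;
}
```

An empty array means none of the assumed checks fired; it does not by itself establish conformance.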