Compiler pipeline v1.3.1
The canonical post-extraction pipeline. Single source of truth for every surface that turns a raw source into a final CKF package — /compiler, /compiler/demo, MCP, Lab and admin recompile.
Status
Implemented in src/lib/compiler/pipeline.ts. Entry point: runCkfPipeline(partials, options).
What changed in v1.3.1
- Canonical PDF metadata extractor. A new schema-stable module (src/lib/compiler/pdfMetadataExtractor.ts) runs once over the front-matter (pages 1-5) and back-matter (last 3 pages) of every PDF source and derives source_title, source_subtitle, source_authors[], source_edition, source_publisher, source_year, source_isbn directly from the source, never from the LLM. Each override emits an auditable warning.
- Controlled source_type vocabulary for PDFs. pdf_book ⇒ "PDF e-book", pdf_document ⇒ "PDF document". The LLM no longer decides what kind of source it is looking at; that comes from the preflight.
- Subsection contamination sanitizer. Strips prefixes/suffixes like "Capítulo 3 — " and "(seção do e-book …)" from any source_title, regardless of format.
- "Not found" is auditable. When the extractor can't find a field (e.g. ISBN missing from front-matter), the pipeline keeps the LLM value and emits a warning so the auditor knows that specific field is LLM-derived.
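The override rules above can be sketched as follows. This is a minimal illustration, not the code in pdfMetadataExtractor.ts: the names SOURCE_TYPE_LABELS, cleanSourceTitle and applyCanonicalMetadata, the regular expressions, and the warning texts are all assumptions made for the example, and only three of the metadata fields are shown.

```typescript
// Hypothetical sketch: names, regexes and warning texts are illustrative assumptions.
type PdfFormat = "pdf_book" | "pdf_document";
type Metadata = Partial<Record<"source_title" | "source_year" | "source_isbn", string>>;

// Controlled vocabulary: the preflight format decides source_type, not the LLM.
const SOURCE_TYPE_LABELS: Record<PdfFormat, string> = {
  pdf_book: "PDF e-book",
  pdf_document: "PDF document",
};

// Strip subsection contamination such as "Capítulo 3 — " or "(seção do e-book …)".
function cleanSourceTitle(title: string): string {
  return title
    .replace(/^\s*(cap[ií]tulo|chapter)\s+\d+\s*[—–:-]\s*/i, "")
    .replace(/\s*\((se[cç][aã]o|section)\b[^)]*\)\s*$/i, "")
    .trim();
}

function applyCanonicalMetadata(
  llm: Metadata,
  extracted: Metadata,
  format: PdfFormat,
): { metadata: Metadata & { source_type: string }; warnings: string[] } {
  const warnings: string[] = [];
  const metadata: Metadata & { source_type: string } = {
    ...llm,
    source_type: SOURCE_TYPE_LABELS[format],
  };
  const fields = ["source_title", "source_year", "source_isbn"] as const;
  for (const field of fields) {
    const canonical = extracted[field];
    if (canonical === undefined) {
      // "Not found" stays auditable: keep the LLM value and flag it.
      warnings.push(`${field}: not found in source, keeping LLM-derived value`);
    } else if (canonical !== llm[field]) {
      // The extractor wins over the LLM, and the override is recorded.
      metadata[field] = canonical;
      warnings.push(`${field}: LLM value overridden by extractor`);
    }
  }
  if (metadata.source_title !== undefined) {
    metadata.source_title = cleanSourceTitle(metadata.source_title);
  }
  return { metadata, warnings };
}

const merged = applyCanonicalMetadata(
  { source_title: "Capítulo 3 — Guia", source_year: "2023", source_isbn: "123" },
  { source_title: "Guia", source_year: "2024" },
  "pdf_book",
);
console.log(merged.metadata, merged.warnings);
```

In this sketch the title and year are overridden by the extractor, the ISBN keeps its LLM value, and each of the three decisions leaves one warning behind for the auditor.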
What changed in v1.3
- PDF-aware traceability. Page sentinels, chapter spans, per-item provenance via audit_matrix.
- Semantic dedupe + caps. PDF books get section-aware deduplication keyed off page count.
- Source SHA-256 + compiler profile. Every package self-identifies its source and the exact pipeline that produced it.
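As a rough illustration of the self-identification idea, the sketch below hashes a source string with Node's built-in crypto module and pairs the digest with a profile name. The helper identifySource is hypothetical, and hashing the decoded UTF-8 text (rather than the raw file bytes) is an assumption; the actual pipeline may do either.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: the real pipeline may hash raw bytes instead of decoded text.
function identifySource(sourceText: string, compilerProfile: string) {
  const source_sha256 = createHash("sha256").update(sourceText, "utf8").digest("hex");
  return { source_sha256, compiler_profile: compilerProfile };
}

const identity = identifySource("abc", "ckf-v1.3.1-auditable-default");
console.log(identity);
```

Two packages compiled from byte-identical text under the same profile would carry the same pair, which is what makes a recompile comparable to its predecessor.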
What changed in v1.2
- Source preflight. profileSource() detects language, format and record count, and hard-blocks empty / filename-only / hash-only inputs before any LLM cost is incurred.
- Record-level segmentation. segmentSource() emits SourceSpan[] with stable source_record_ids and a source_manifest that propagates all the way to source_traceability.
- Coverage modes. summary · balanced · complete. Auto-upgrades to complete when the preflight detects a record-oriented format (jsonl_records, json_array_records, faq, legal_norm).
- Numeric integrity guards. Domain-agnostic extractor for currencies, percentages, dates, durations, and citation references.
- Language recovery. When the post-sanitizer output drifts from preflight.detectedLanguage, the pipeline triggers a re-run and reports the result in metrics.sanitizer.language_recovery_applied + restored_count.
See Preflight & coverage for the long-form walkthrough.
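The preflight-driven behavior listed above (hard block, then coverage auto-upgrade) might look roughly like this. It is a sketch under assumptions: resolveCoverageMode is a hypothetical helper, the blocked field name is inferred from the "blocked?" label in the stage diagram, and the "markdown" format value in the usage lines is invented for the example.

```typescript
type CoverageMode = "summary" | "balanced" | "complete";

// Field names follow the metrics object; `blocked` is an assumed name.
interface SourceProfile {
  detectedLanguage: string;
  detectedFormat: string;
  recordCount: number;
  blocked: boolean;
}

// Formats the document lists as record-oriented.
const RECORD_FORMATS = new Set(["jsonl_records", "json_array_records", "faq", "legal_norm"]);

function resolveCoverageMode(requested: CoverageMode, profile: SourceProfile): CoverageMode {
  if (profile.blocked) {
    // Empty, filename-only or hash-only input: stop before any LLM cost is incurred.
    throw new Error("Source blocked by preflight");
  }
  // Record-oriented sources are upgraded so that every record is covered.
  return RECORD_FORMATS.has(profile.detectedFormat) ? "complete" : requested;
}

const jsonlProfile: SourceProfile = {
  detectedLanguage: "en",
  detectedFormat: "jsonl_records",
  recordCount: 42,
  blocked: false,
};
console.log(resolveCoverageMode("summary", jsonlProfile));
console.log(resolveCoverageMode("balanced", { ...jsonlProfile, detectedFormat: "markdown" }));
```

The upgrade only ever moves towards complete; a caller who asks for complete on a prose source keeps it.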
Canonical PDF metadata block (v1.3.1)
For paginated sources, the final package carries a fully canonical metadata block:
{
"source_type": "PDF e-book",
"source_scope": "full_book",
"source_title": "Orientações Baseadas no Cérebro para Transformar Ensino em Aprendizagem",
"source_subtitle": "guia prático para professores",
"source_author": "Paulo Tomazinho",
"source_authors": ["Paulo Tomazinho"],
"source_publisher": "Editora Exemplo",
"source_edition": "1ª edição",
"source_year": "2024",
"source_isbn": "9786500000000",
"page_count": 123,
"source_sha256": "…",
"source_traceability_mode": "item_level_with_pdf_page_spans_chapter_spans_excerpts_and_hashes",
"source_manifest_count": 123,
"compiler_profile": "ckf-v1.3.1-auditable-default"
}
Heuristic vs LLM compile
This pipeline runs on top of LLM partials: it is the post-extraction stage of the LLM compile path (/compiler, MCP ckf.compile_llm, Lab, admin recompile). The MCP also exposes a default heuristic compile tool (ckf.compile) that bypasses any LLM call and emits a schema-stable package directly from the source text — zero-config, always works, no key, no auth. Use the heuristic when reliability matters more than fidelity; switch to the LLM path (BYOK, or Advanced AI for admin/allowlist) when you want richer inference and reduced composition hallucination. Side-by-side: /compiler-heuristic-vs-llm.
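A client choosing between the two MCP tools could build its request along these lines. The tools/call method with params.name and params.arguments is the standard MCP shape; language and coverage_mode are the arguments this page mentions for ckf.compile_llm. The argument names source and api_key, the CompileRequest type and buildToolCall itself are assumptions for illustration, and the sketch covers only the BYOK case, not the admin/allowlist route.

```typescript
interface CompileRequest {
  sourceText: string;
  apiKey?: string; // BYOK key; when absent, fall back to the heuristic tool
  language?: string;
  coverageMode?: "summary" | "balanced" | "complete";
}

function buildToolCall(req: CompileRequest, id = 1) {
  // "source" is a hypothetical argument name.
  const args: Record<string, string> = { source: req.sourceText };
  let name = "ckf.compile"; // heuristic default: no key, no LLM call
  if (req.apiKey !== undefined) {
    name = "ckf.compile_llm";
    args.api_key = req.apiKey; // hypothetical argument name
    if (req.language) args.language = req.language;
    if (req.coverageMode) args.coverage_mode = req.coverageMode;
  }
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const heuristicCall = buildToolCall({ sourceText: "Refund policy text" });
const llmCall = buildToolCall({
  sourceText: "Refund policy text",
  apiKey: "sk-example",
  language: "en",
  coverageMode: "complete",
});
console.log(JSON.stringify(heuristicCall), JSON.stringify(llmCall));
```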
Stages
┌────────────────────┐
│ Source text │
└─────────┬──────────┘
▼
┌──────────────┐ profileSource() src/lib/compiler/sourceProfiler.ts
│ preflight │ language · format · records · blocked?
└──────┬───────┘
▼
┌──────────────┐ segmentSource() src/lib/compiler/sourceSegmenter.ts
│ segment │ SourceSpan[] + sourceManifest (source_record_id, hashes)
└──────┬───────┘
▼
┌──────────────┐ chunksFromSpans() src/lib/compiler/chunker.ts
│ chunk │ ChunkRef + NumericFact[] context
└──────┬───────┘
▼
┌──────────────┐ per-chunk LLM call (CkfPartial per chunk)
│ lift │
└──────┬───────┘
▼
┌──────────────┐ reduce() src/lib/compiler/reduce.ts
│ reduce │ merge by id, lock target language
└──────┬───────┘
▼
┌──────────────┐ promoteAtomicsAndChunks() src/lib/compiler/promote.ts
│ promote │ atomic_units ↦ if_then_rules / playbooks / anti_patterns
└──────┬───────┘
▼
┌──────────────┐ sanitizeMergedPackage() src/lib/compiler/packageSanitizer.ts
│ sanitize │ field-aware language / completeness / truncation
└──────┬───────┘ (auto language recovery re-run when needed)
▼
┌──────────────┐ ensureIds() · rebuildSourceTraceability()
│ ids+trace │ stable ids, propagates source_record_id
└──────┬───────┘
▼
┌──────────────┐ coveragePass() src/lib/compiler/coveragePass.ts
│ coverage │ inserts retrieval_chunks / qa_pairs / atomic_units per mode
└──────┬───────┘
▼
┌──────────────┐ extractNumericFacts() + verify
│ numeric │ currencies, dates, durations, percents, citations
└──────┬───────┘
▼
┌──────────────┐ computeQuality() src/lib/compiler/quality.ts
│ quality │ human_readability · ai_utility_score
└──────┬───────┘
▼
┌──────────────────────┐
│ Final CkfPackage │ + warnings[] + preflight + coverage + numeric_integrity
└──────────────────────┘
Each stage is pure and independently testable. Order is load-bearing: sanitize must run before ensureIds + rebuildSourceTraceability so that removed items never appear in the rebuilt traceability section; coverage and numeric guards run after the package is stable.
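To make the ordering constraint concrete, here is a toy model of three of the stages as pure functions over a simplified package. The Item and Pkg shapes, the index-based ids and the function bodies are assumptions for the example; they are not the implementations in packageSanitizer.ts or the real ensureIds() and rebuildSourceTraceability().

```typescript
// Toy shapes, far simpler than the real CkfPackage.
interface Item { id?: string; text: string; source_record_id: string }
interface Pkg { items: Item[]; source_traceability: Record<string, string> }
type Stage = (pkg: Pkg) => Pkg;

// Drops items with no usable text (stand-in for the real sanitizer).
const sanitize: Stage = (pkg) => ({
  ...pkg,
  items: pkg.items.filter((item) => item.text.trim() !== ""),
});

// Assigns an id to every item that lacks one.
const ensureIds: Stage = (pkg) => ({
  ...pkg,
  items: pkg.items.map((item, n) => ({ ...item, id: item.id ?? `item-${n + 1}` })),
});

// Rebuilds the item id to source_record_id map from the current items only.
const rebuildTraceability: Stage = (pkg) => {
  const source_traceability: Record<string, string> = {};
  for (const item of pkg.items) {
    if (item.id !== undefined) source_traceability[item.id] = item.source_record_id;
  }
  return { ...pkg, source_traceability };
};

// Each stage is pure, so the pipeline is just a left fold over the stage list.
const runStages = (pkg: Pkg, stages: Stage[]): Pkg =>
  stages.reduce((acc, stage) => stage(acc), pkg);

const draft: Pkg = {
  items: [
    { text: "Refunds are issued within 5 days.", source_record_id: "rec-1" },
    { text: "   ", source_record_id: "rec-2" },
  ],
  source_traceability: {},
};

const finalPkg = runStages(draft, [sanitize, ensureIds, rebuildTraceability]);
const wrongOrder = runStages(draft, [ensureIds, rebuildTraceability, sanitize]);
console.log(finalPkg.source_traceability, wrongOrder.source_traceability);
```

With sanitize first, the blank item is gone before traceability is rebuilt, so rec-2 never appears. In the reversed order the traceability map still points at rec-2 even though the item itself was removed afterwards, which is exactly the inconsistency the stage order prevents.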
Entry point
import { runCkfPipeline, COMPILER_VERSION } from "@/lib/compiler/pipeline";
const result = runCkfPipeline(partials, {
chunks, // ChunkRef[] used for source-text scoring
spans, // v1.2 — SourceSpan[] from segmentSource()
sourceManifest, // v1.2 — manifest with source_record_id, hashes
profile, // v1.2 — preflight (language, format, recordCount)
coverageMode: "balanced", // "summary" | "balanced" | "complete"
filename: "policy.md",
targetLanguage: "en", // any ISO code — hard language lock
sourceText: rawText,
});
result.pkg; // CkfPackage
result.quality; // QualityReport
result.promotion; // { promoted, rejected }
result.sanitizer; // { removed_count, ..., restored_count, language_recovery_applied }
result.preflight; // SourceProfile (v1.2)
result.coverage; // { mode, inserted_*, source_record_coverage } (v1.2)
result.numericIntegrity; // { numeric_integrity_score, exact_matches, ... } (v1.2)
result.warnings; // string[]
result.compilerVersion; // COMPILER_VERSION === "v1.3.1"
Where it runs
- /compiler: full LLM compile via compileToCkf() in compiler.functions.ts.
- /compiler/demo: single-chunk LLM compile, same pipeline.
- /api/mcp · ckf.compile_llm: server-side, accepts language, coverage_mode and BYOK.
- /admin/recompile-ckf · /admin/recompile-pages: bulk recompilation for posts and pages.
- /lab: used when ingesting an arbitrary source for an A/B/A study.
Metrics surface
The MCP ckf.compile_llm response and the compiler_jobs table both record a normalized metrics object:
{
"compiler_version": "v1.3.1",
"preflight": {
"detectedLanguage": "en",
"detectedFormat": "jsonl_records",
"recordCount": 42,
"sourceCharCount": 18234
},
"coverage": {
"mode": "complete",
"inserted_retrieval_chunks": 42,
"inserted_qa_pairs": 0,
"inserted_atomic_units": 7,
"source_record_coverage": 1.0
},
"numeric_integrity": {
"numeric_integrity_score": 0.98,
"total_values_checked": 51,
"exact_matches": 50,
"corrected": 1,
"unverifiable": 0
},
"sanitizer": {
"removed_count": 3,
"quarantined_count": 0,
"deduplicated_count": 5,
"restored_count": 0,
"language_recovery_applied": false
},
"promotion": { "promoted": 8, "rejected": 2 },
"quality": { "human_readability": 0.81, "ai_utility_score": 0.74 }
}
Why a single pipeline
One pipeline, every surface
Domain-agnostic by design
Lab parity
The Lab runs the same runCkfPipeline as /compiler. PDF extraction is structure-preserving (groups items by Y-coordinate, detects paragraph gaps and headings, strips repeated headers/footers) and the chunker default is 6k chars. The Lab also exposes a chunk-size selector (4k / 6k / 12k) and defaults the Lovable provider to gemini-2.5-pro rather than Flash for higher-fidelity extraction. When the chosen model routes through Gemini (direct or via the Lovable Gateway as google/*), the compiler sends a slim variant of the tool schema: descriptions and enums are dropped to stay within Gemini's constrained-decoding state budget, while the extraction rules remain in the system prompt.
See also
- Preflight & coverage — the v1.2 additions in detail.
- Language lock — the field-aware sanitizer + recovery.
- Extraction pipeline — what produces the partials this stage consumes.
- Structuring rules — ids, dedup, promotion.
- MCP Server — call the pipeline from any agent.