Compiler pipeline v1.3.1

The canonical post-extraction pipeline. Single source of truth for every surface that turns a raw source into a final CKF package — /compiler, /compiler/demo, MCP, Lab and admin recompile.

Status

compiler v1.3.1 · profile ckf-v1.3.1-auditable-default · protocol ckf-1.0 · stable surface

Implemented in src/lib/compiler/pipeline.ts. Entry point: runCkfPipeline(partials, options).

What changed in v1.3.1

Canonical PDF metadata extractor. A new schema-stable module (src/lib/compiler/pdfMetadataExtractor.ts) runs once over the front-matter (pages 1-5) and back-matter (last 3 pages) of every PDF source and derives source_title, source_subtitle, source_authors[], source_edition, source_publisher, source_year, source_isbn directly from the source — never the LLM. Each override emits an auditable warning.
Controlled source_type vocabulary for PDFs. pdf_book ⇒ "PDF e-book", pdf_document ⇒ "PDF document". The LLM no longer decides what kind of source it is looking at — that comes from the preflight.
Subsection contamination sanitizer. Strips prefixes/suffixes like "Capítulo 3 — " and "(seção do e-book …)" from any source_title, regardless of format.
"Not found" is auditable. When the extractor can't find a field (e.g. ISBN missing from front-matter), the pipeline keeps the LLM value and emits a warning so the auditor knows that specific field is LLM-derived.

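The override and "not found" rules above can be sketched as a small merge step. This is a minimal illustration, not the code in pdfMetadataExtractor.ts: the type names, the function applyCanonicalMetadata and the warning wording are all hypothetical, and source_authors[] is left out to keep the example short.

```typescript
// Hypothetical shapes; the real extractor and package types are not shown in this document.
type PdfMetadataField =
  | "source_title"
  | "source_subtitle"
  | "source_edition"
  | "source_publisher"
  | "source_year"
  | "source_isbn";

type PdfMetadata = Partial<Record<PdfMetadataField, string>>;

const FIELDS: PdfMetadataField[] = [
  "source_title",
  "source_subtitle",
  "source_edition",
  "source_publisher",
  "source_year",
  "source_isbn",
];

function applyCanonicalMetadata(
  llm: PdfMetadata,
  extracted: PdfMetadata,
): { metadata: PdfMetadata; warnings: string[] } {
  const metadata: PdfMetadata = { ...llm };
  const warnings: string[] = [];
  for (const field of FIELDS) {
    const canonical = extracted[field];
    if (canonical === undefined) {
      // Not found by the extractor: keep whatever the LLM produced and flag it.
      warnings.push(`${field}: not found by the PDF extractor, value is LLM-derived`);
      continue;
    }
    if (llm[field] !== canonical) {
      // The extractor wins over the LLM, and the override is recorded.
      warnings.push(`${field}: LLM value overridden by the PDF extractor`);
    }
    metadata[field] = canonical;
  }
  return { metadata, warnings };
}
```

With this shape an auditor can read warnings[] alone to tell which fields came from the PDF and which are still LLM-derived.
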
What changed in v1.3

PDF-aware traceability. Page sentinels, chapter spans, per-item provenance via audit_matrix.
Semantic dedupe + caps. PDF books get section-aware deduplication keyed off page count.
Source SHA-256 + compiler profile. Every package self-identifies its source and the exact pipeline that produced it.
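
As an illustration of the self-identification idea, the sketch below stamps a package with a source hash and a compiler profile. It assumes the hash is taken over the UTF-8 source text; the real pipeline may hash the raw file bytes instead. sourceSha256 and stampPackage are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Assumption: the digest covers the UTF-8 text, rendered as lowercase hex.
function sourceSha256(sourceText: string): string {
  return createHash("sha256").update(sourceText, "utf8").digest("hex");
}

// Hypothetical helper: adds the two self-identifying fields to any package object.
function stampPackage<T extends object>(
  pkg: T,
  sourceText: string,
  compilerProfile: string,
): T & { source_sha256: string; compiler_profile: string } {
  return {
    ...pkg,
    source_sha256: sourceSha256(sourceText),
    compiler_profile: compilerProfile,
  };
}

const stamped = stampPackage({ source_type: "PDF e-book" }, "abc", "ckf-v1.3.1-auditable-default");
```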

What changed in v1.2

Source preflight. profileSource() detects language, format and record count, and hard-blocks empty / filename-only / hash-only inputs before any LLM cost is incurred.
Record-level segmentation. segmentSource() emits SourceSpan[] with stable source_record_ids and a source_manifest that propagates all the way to source_traceability.
Coverage modes. summary · balanced · complete. Auto-upgrades to complete when the preflight detects a record-oriented format (jsonl_records, json_array_records, faq, legal_norm).
Numeric integrity guards. Domain-agnostic extractor for currencies, percentages, dates, durations, and citation references.
Language recovery. When the post-sanitizer output drifts from preflight.detectedLanguage, the pipeline triggers a re-run and reports the result in metrics.sanitizer.language_recovery_applied and metrics.sanitizer.restored_count.
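
The coverage auto-upgrade described above reduces to a small decision function. A minimal sketch follows: the four format names come from the list above, while resolveCoverageMode and RECORD_ORIENTED_FORMATS are hypothetical names used only for illustration.

```typescript
type CoverageMode = "summary" | "balanced" | "complete";

// Record-oriented formats named in this document.
const RECORD_ORIENTED_FORMATS = new Set<string>([
  "jsonl_records",
  "json_array_records",
  "faq",
  "legal_norm",
]);

// Returns the requested mode unless the preflight format forces "complete".
function resolveCoverageMode(requested: CoverageMode, detectedFormat: string): CoverageMode {
  return RECORD_ORIENTED_FORMATS.has(detectedFormat) ? "complete" : requested;
}
```

Under these assumptions a caller that asks for "summary" on a FAQ source still gets "complete", so no record is silently dropped.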

See Preflight & coverage for the long-form walkthrough.

Canonical PDF metadata block (v1.3.1)

For paginated sources, the final package carries a fully canonical metadata block:

json

{
  "source_type": "PDF e-book",
  "source_scope": "full_book",
  "source_title": "Orientações Baseadas no Cérebro para Transformar Ensino em Aprendizagem",
  "source_subtitle": "guia prático para professores",
  "source_author": "Paulo Tomazinho",
  "source_authors": ["Paulo Tomazinho"],
  "source_publisher": "Editora Exemplo",
  "source_edition": "1ª edição",
  "source_year": "2024",
  "source_isbn": "9786500000000",
  "page_count": 123,
  "source_sha256": "…",
  "source_traceability_mode": "item_level_with_pdf_page_spans_chapter_spans_excerpts_and_hashes",
  "source_manifest_count": 123,
  "compiler_profile": "ckf-v1.3.1-auditable-default"
}

Heuristic vs LLM compile

This pipeline runs on top of LLM partials: it is the post-extraction stage of the LLM compile path (/compiler, MCP ckf.compile_llm, Lab, admin recompile). The MCP also exposes a default heuristic compile tool (ckf.compile) that bypasses any LLM call and emits a schema-stable package directly from the source text — zero-config, always works, no key, no auth. Use the heuristic when reliability matters more than fidelity; switch to the LLM path (BYOK, or Advanced AI for admin/allowlist) when you want richer inference and reduced composition hallucination. Side-by-side: /compiler-heuristic-vs-llm.

Stages

┌────────────────────┐
│  Source text       │
└─────────┬──────────┘
          ▼
   ┌──────────────┐   profileSource()           src/lib/compiler/sourceProfiler.ts
   │  preflight   │   language · format · records · blocked?
   └──────┬───────┘
          ▼
   ┌──────────────┐   segmentSource()           src/lib/compiler/sourceSegmenter.ts
   │  segment     │   SourceSpan[] + sourceManifest (source_record_id, hashes)
   └──────┬───────┘
          ▼
   ┌──────────────┐   chunksFromSpans()         src/lib/compiler/chunker.ts
   │  chunk       │   ChunkRef + NumericFact[] context
   └──────┬───────┘
          ▼
   ┌──────────────┐   per-chunk LLM call        (CkfPartial per chunk)
   │  lift        │
   └──────┬───────┘
          ▼
   ┌──────────────┐   reduce()                  src/lib/compiler/reduce.ts
   │  reduce      │   merge by id, lock target language
   └──────┬───────┘
          ▼
   ┌──────────────┐   promoteAtomicsAndChunks() src/lib/compiler/promote.ts
   │  promote     │   atomic_units ↦ if_then_rules / playbooks / anti_patterns
   └──────┬───────┘
          ▼
   ┌──────────────┐   sanitizeMergedPackage()   src/lib/compiler/packageSanitizer.ts
   │  sanitize    │   field-aware language / completeness / truncation
   └──────┬───────┘     (auto language recovery re-run when needed)
          ▼
   ┌──────────────┐   ensureIds() · rebuildSourceTraceability()
   │  ids+trace   │   stable ids, propagates source_record_id
   └──────┬───────┘
          ▼
   ┌──────────────┐   coveragePass()            src/lib/compiler/coveragePass.ts
   │  coverage    │   inserts retrieval_chunks / qa_pairs / atomic_units per mode
   └──────┬───────┘
          ▼
   ┌──────────────┐   extractNumericFacts() + verify
   │  numeric     │   currencies, dates, durations, percents, citations
   └──────┬───────┘
          ▼
   ┌──────────────┐   computeQuality()          src/lib/compiler/quality.ts
   │  quality     │   human_readability · ai_utility_score
   └──────┬───────┘
          ▼
   ┌──────────────────────┐
   │  Final CkfPackage    │  + warnings[] + preflight + coverage + numeric_integrity
   └──────────────────────┘

Each stage is pure and independently testable. Order is load-bearing: sanitize must run before ensureIds + rebuildSourceTraceability so that removed items never appear in the rebuilt traceability section; coverage and numeric guards run after the package is stable.
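
The ordering constraint can be shown with two toy stages. This is a sketch only: Item, Pkg and both stage bodies are simplified stand-ins, not the real sanitizeMergedPackage() or rebuildSourceTraceability().

```typescript
interface Item { id: string; text: string; source_record_id: string; }
interface Pkg { items: Item[]; source_traceability: Record<string, string>; }
type Stage = (pkg: Pkg) => Pkg;

// Toy sanitizer: drops items whose text is empty after trimming.
const sanitize: Stage = (pkg) => ({
  ...pkg,
  items: pkg.items.filter((item) => item.text.trim().length > 0),
});

// Toy traceability rebuild: maps each surviving item id to its source record.
const rebuildTraceability: Stage = (pkg) => {
  const trace: Record<string, string> = {};
  for (const item of pkg.items) trace[item.id] = item.source_record_id;
  return { ...pkg, source_traceability: trace };
};

// Stages are pure functions, so a pipeline is just a left-to-right fold.
const runStages = (pkg: Pkg, stages: Stage[]): Pkg =>
  stages.reduce((acc, stage) => stage(acc), pkg);

const draft: Pkg = {
  items: [
    { id: "a", text: "Refunds take 5 days.", source_record_id: "rec-1" },
    { id: "b", text: "   ", source_record_id: "rec-2" },
  ],
  source_traceability: {},
};

const good = runStages(draft, [sanitize, rebuildTraceability]);
const bad = runStages(draft, [rebuildTraceability, sanitize]);
```

In the sanitize-first order, item "b" never reaches the traceability map. In the reversed order the map still points at "b" even though the item itself was removed, which is the inconsistency the fixed order prevents.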

Entry point

import { runCkfPipeline, COMPILER_VERSION } from "@/lib/compiler/pipeline";

const result = runCkfPipeline(partials, {
  chunks,                  // ChunkRef[] used for source-text scoring
  spans,                   // v1.2 — SourceSpan[] from segmentSource()
  sourceManifest,          // v1.2 — manifest with source_record_id, hashes
  profile,                 // v1.2 — preflight (language, format, recordCount)
  coverageMode: "balanced",// "summary" | "balanced" | "complete"
  filename: "policy.md",
  targetLanguage: "en",    // any ISO code — hard language lock
  sourceText: rawText,
});

result.pkg;                 // CkfPackage
result.quality;             // QualityReport
result.promotion;           // { promoted, rejected }
result.sanitizer;           // { removed_count, ..., restored_count, language_recovery_applied }
result.preflight;           // SourceProfile (v1.2)
result.coverage;            // { mode, inserted_*, source_record_coverage } (v1.2)
result.numericIntegrity;    // { numeric_integrity_score, exact_matches, ... } (v1.2)
result.warnings;            // string[]
result.compilerVersion;     // COMPILER_VERSION === "v1.3.1"

Where it runs

/compiler — full LLM compile via compileToCkf() in compiler.functions.ts.
/compiler/demo — single-chunk LLM compile, same pipeline.
/api/mcp · ckf.compile_llm — server-side, accepts language, coverage_mode and BYOK.
/admin/recompile-ckf · /admin/recompile-pages — bulk recompilation for posts and pages.
/lab — used when ingesting an arbitrary source for an A/B/A study.

Metrics surface

The MCP ckf.compile_llm response and the compiler_jobs table both record a normalized metrics object:

json

{
  "compiler_version": "v1.3.1",
  "preflight": {
    "detectedLanguage": "en",
    "detectedFormat": "jsonl_records",
    "recordCount": 42,
    "sourceCharCount": 18234
  },
  "coverage": {
    "mode": "complete",
    "inserted_retrieval_chunks": 42,
    "inserted_qa_pairs": 0,
    "inserted_atomic_units": 7,
    "source_record_coverage": 1.0
  },
  "numeric_integrity": {
    "numeric_integrity_score": 0.98,
    "total_values_checked": 51,
    "exact_matches": 50,
    "corrected": 1,
    "unverifiable": 0
  },
  "sanitizer": {
    "removed_count": 3,
    "quarantined_count": 0,
    "deduplicated_count": 5,
    "restored_count": 0,
    "language_recovery_applied": false
  },
  "promotion": { "promoted": 8, "rejected": 2 },
  "quality": { "human_readability": 0.81, "ai_utility_score": 0.74 }
}
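
A consumer of this object may want a quick sanity check before storing it. The sketch below assumes that exact_matches, corrected and unverifiable partition total_values_checked, which holds for the sample above (50 + 1 + 0 = 51) but is not stated as a contract here. checkNumericIntegrity is a hypothetical helper.

```typescript
interface NumericIntegrity {
  numeric_integrity_score: number;
  total_values_checked: number;
  exact_matches: number;
  corrected: number;
  unverifiable: number;
}

// Returns a list of human-readable problems; an empty list means the block looks consistent.
function checkNumericIntegrity(n: NumericIntegrity): string[] {
  const problems: string[] = [];
  // Assumption: the three counters add up to the total.
  if (n.exact_matches + n.corrected + n.unverifiable !== n.total_values_checked) {
    problems.push("counters do not add up to total_values_checked");
  }
  if (n.numeric_integrity_score < 0 || n.numeric_integrity_score > 1) {
    problems.push("numeric_integrity_score is outside [0, 1]");
  }
  return problems;
}

const sampleProblems = checkNumericIntegrity({
  numeric_integrity_score: 0.98,
  total_values_checked: 51,
  exact_matches: 50,
  corrected: 1,
  unverifiable: 0,
});
```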

Why a single pipeline

One pipeline, every surface

Before v1.1 each surface ran its own ad-hoc sequence of reducers and sanitizers, which let bugs surface in one place and not another. v1.1 centralized the order so every compile — UI, MCP, Lab, recompile — produces structurally consistent output. v1.2 extended that contract with preflight, coverage and numeric guards, keeping the same single-entry-point guarantee.

Domain-agnostic by design

Every v1.2 module operates on structure, not on subject matter. Numeric guards cover international currencies, ISO/US/EU dates and generic citation forms; language detection spans EN / PT / ES / others; coverage modes are driven by source format, never by topic. The pipeline ships zero domain-specific heuristics (no IRPF, finance, legal or medical hardcoding) — it should produce the same shape for an API doc, a textbook chapter, a FAQ or a normative text.
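
To make "structure, not subject matter" concrete, here is a deliberately small extractor that recognizes values by shape alone. It is a sketch under stated assumptions: the three patterns cover only a few currency symbols, percentages and ISO dates, and extractNumericFactsSketch is a hypothetical name, not the pipeline's extractNumericFacts().

```typescript
type NumericKind = "currency" | "percent" | "iso_date";
interface NumericFactSketch { kind: NumericKind; raw: string; }

// Pattern sources only; a fresh global RegExp is built per call to avoid shared lastIndex state.
const PATTERNS: Array<[NumericKind, string]> = [
  ["currency", "(?:US\\$|R\\$|\\$|€|£)\\s?\\d+(?:[.,]\\d+)*"],
  ["percent", "\\d+(?:[.,]\\d+)?\\s?%"],
  ["iso_date", "\\b\\d{4}-\\d{2}-\\d{2}\\b"],
];

function extractNumericFactsSketch(text: string): NumericFactSketch[] {
  const facts: NumericFactSketch[] = [];
  for (const [kind, source] of PATTERNS) {
    const re = new RegExp(source, "g");
    let match: RegExpExecArray | null;
    while ((match = re.exec(text)) !== null) {
      facts.push({ kind, raw: match[0] });
    }
  }
  return facts;
}

const facts = extractNumericFactsSketch("Fee is R$ 1.250,00 (15% off) until 2024-03-31.");
```

Nothing in the patterns knows whether the text is a tax rule, a price list or a course syllabus, which is the property the paragraph above describes.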

Lab parity

The Lab compiles through the same runCkfPipeline as /compiler. PDF extraction is structure-preserving (it groups items by Y-coordinate, detects paragraph gaps and headings, and strips repeated headers/footers) and the chunker default is 6k chars. The Lab also exposes a chunk-size selector (4k / 6k / 12k) and defaults the Lovable provider to gemini-2.5-pro rather than Flash for higher-fidelity extraction. When the chosen model routes through Gemini (direct or via the Lovable Gateway as google/*), the compiler sends a slim variant of the tool schema: descriptions and enums are dropped to stay within Gemini's constrained-decoding state budget, while the extraction rules remain in the system prompt.
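
One way to derive such a slim variant is a recursive walk over the JSON Schema that removes description and enum keywords. This is a sketch, not the compiler's actual transform: slimSchema is a hypothetical name, and the special handling of properties is an assumption added so that a field that happens to be called "description" or "enum" survives.

```typescript
// Recursively drops "description" and "enum" keywords from a JSON Schema-like value.
// Keys directly inside a "properties" map are field names, not keywords, so they are kept.
function slimSchema(node: unknown, insidePropertiesMap = false): unknown {
  if (Array.isArray(node)) {
    // Wrap in an arrow function so the array index is not passed as the second argument.
    return node.map((entry) => slimSchema(entry));
  }
  if (node === null || typeof node !== "object") return node;
  const input = node as Record<string, unknown>;
  const output: Record<string, unknown> = {};
  for (const key of Object.keys(input)) {
    if (!insidePropertiesMap && (key === "description" || key === "enum")) continue;
    output[key] = slimSchema(input[key], !insidePropertiesMap && key === "properties");
  }
  return output;
}

const fullSchema = {
  type: "object",
  description: "One extracted unit",
  properties: {
    kind: { type: "string", enum: ["rule", "playbook"], description: "Unit kind" },
    description: { type: "string" },
  },
  required: ["kind"],
};

const slim = slimSchema(fullSchema);
```

The input object is left untouched, so the full schema can still be sent to providers that accept it.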