Preflight & coverage

Four schema-stable, domain-agnostic guarantees the v1.3.1 pipeline adds on top of v1.1: source preflight, record-level segmentation, coverage modes, and numeric integrity.

Status

compiler v1.3.1
protocol ckf-1.0
stable

Preflight

Before any chunking or LLM call, profileSource() inspects the raw input and reports what it found. The preflight is what enables auto-tuning of coverage and what prevents the compiler from spending tokens on inputs that have no extractable content.

ts
import { profileSource } from "@/lib/compiler/sourceProfiler";

const profile = profileSource(text, { filename });
// profile.detectedLanguage     — "en" | "pt-BR" | "es" | "unknown" | ...
// profile.detectedFormat       — "jsonl_records" | "json_array_records" | "faq"
//                              | "markdown" | "transcript" | "legal_norm" | "plain_text" | ...
// profile.recordCount          — number of records when format is record-oriented
// profile.sourceCharCount      — total characters
// profile.sourceWordCount      — total words
// profile.estimatedChunks      — projected number of chunks for the LLM stage
// profile.hasStructuredRecords — true when records can be addressed individually
// profile.warnings             — non-fatal hints surfaced in the UI
// profile.blocked              — true ⇒ compilation MUST abort (e.g. hash-only or empty)
// profile.blockedReason        — short explanation when blocked

Hard-blocking conditions today: empty input, filename-only input, hex-hash-only input. More can be added without changing the contract.
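The three hard-blocking conditions can be illustrated with a small standalone check. This is a sketch under assumptions, not the logic inside profileSource(): the helper name checkHardBlocks, the reason strings, and the choice of 32 / 40 / 64 character hex lengths (MD5, SHA-1, SHA-256) are all hypothetical.

```typescript
// Hypothetical helper: the real checks live inside profileSource() and may differ.
type BlockCheck = { blocked: boolean; blockedReason?: string };

function checkHardBlocks(text: string, filename?: string): BlockCheck {
  const trimmed = text.trim();

  // Empty input: nothing to extract.
  if (trimmed.length === 0) {
    return { blocked: true, blockedReason: "empty input" };
  }

  // Filename-only input: the body is nothing but the file's own name.
  if (filename !== undefined && trimmed === filename.trim()) {
    return { blocked: true, blockedReason: "filename-only input" };
  }

  // Hex-hash-only input: one hex string of a common digest length (assumed lengths).
  if (/^[0-9a-f]+$/i.test(trimmed) && [32, 40, 64].includes(trimmed.length)) {
    return { blocked: true, blockedReason: "hex-hash-only input" };
  }

  return { blocked: false };
}
```

A caller would run a check like this before chunking and abort as soon as blocked is true, so no tokens are spent on the input.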

Source manifest & record IDs

segmentSource() takes the source text and the preflight and emits a SourceSpan[]. Each span carries a stable sourceRecordId when the input is record-oriented (one span per JSONL record, one per FAQ Q/A, one per normative article), or a structural id derived from headings / paragraph offsets when the input is prose.

ts
import { segmentSource, buildSourceManifest } from "@/lib/compiler/sourceSegmenter";

const spans = await segmentSource(text, profile, { filename });
const sourceManifest = buildSourceManifest(spans);
// SourceSpan { id, sourceRecordId?, sourceType, path, lineStart/End, charStart/End,
//              textSha256, text }
// sourceManifest[i] = { source_record_id, source_type, text_preview, ... }

Spans are passed to the chunker, which subdivides large spans without losing the sourceRecordId association. Every downstream item that originates from a chunk inherits the manifest entry and reaches the final package via source_traceability.
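How a chunker can subdivide a span while keeping the record association is sketched below. The SpanLike and ChunkLike shapes and the subdivideSpan function are hypothetical, and the fixed-width split is a simplification; the actual chunker may well split on sentence or token boundaries instead.

```typescript
// Hypothetical shapes: only the fields this sketch needs.
interface SpanLike {
  id: string;
  sourceRecordId?: string;
  charStart: number; // offset of the span in the original source
  text: string;
}

interface ChunkLike {
  spanId: string;
  sourceRecordId: string | undefined; // inherited from the parent span
  charStart: number; // offset of the chunk in the original source
  text: string;
}

// Naive fixed-width split (an assumption, not the pipeline's real strategy).
function subdivideSpan(span: SpanLike, maxChars: number): ChunkLike[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const chunks: ChunkLike[] = [];
  for (let offset = 0; offset < span.text.length; offset += maxChars) {
    chunks.push({
      spanId: span.id,
      sourceRecordId: span.sourceRecordId,
      charStart: span.charStart + offset,
      text: span.text.slice(offset, offset + maxChars),
    });
  }
  return chunks;
}
```

Because every chunk copies sourceRecordId from its parent, any item extracted from a chunk can be traced back to the same manifest entry.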

Coverage modes

The coverage pass ensures that record-oriented inputs do not lose records to LLM summarization. It runs after sanitation and inserts schema-stable items wherever the requested mode demands them.

| Mode | Behavior | Use when |
| --- | --- | --- |
| summary | No inserts. Trusts the LLM's compression. | Prose, long-form, gist-level use cases. |
| balanced | Default. Inserts retrieval chunks for under-covered spans only. | Mixed inputs, most real-world sources. |
| complete | One retrieval chunk per source record; QA pairs and atomic units when present. | FAQs, JSONL datasets, normative corpora. |

Auto-upgrade

When the caller does not specify coverageMode and the preflight reports a record-oriented format (jsonl_records, json_array_records, faq, legal_norm), the pipeline auto-upgrades to complete. This is a structural rule — it never inspects the content of the records, only their shape.
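The auto-upgrade rule reduces to a lookup on the detected format. The sketch below assumes a hypothetical resolveCoverageMode function and takes the fallback from the table above, where balanced is marked as the default; the real pipeline may wire this differently.

```typescript
type CoverageMode = "summary" | "balanced" | "complete";

// Formats described above as record-oriented.
const RECORD_ORIENTED_FORMATS = new Set<string>([
  "jsonl_records",
  "json_array_records",
  "faq",
  "legal_norm",
]);

// Hypothetical resolver: looks only at the format label, never at record content.
function resolveCoverageMode(
  requested: CoverageMode | undefined,
  detectedFormat: string,
): CoverageMode {
  // An explicit caller choice is never overridden.
  if (requested !== undefined) return requested;
  return RECORD_ORIENTED_FORMATS.has(detectedFormat) ? "complete" : "balanced";
}
```

For example, resolveCoverageMode(undefined, "faq") yields "complete", while an explicit "summary" request is left untouched even for a JSONL input.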

Numeric integrity

A common LLM failure mode is silent truncation of literal tokens — US$ 1,234.56 becomes US$ 1,234, 2024-05-31 becomes May 2024. The v1.3.1 pipeline runs extractNumericFacts() against every source span and verifies that every numeric / citation token referenced in the final package appears verbatim. Tokens are typed but not domain-specific:

  • money — US$, €, £, ¥, R$, ISO codes (USD, EUR, GBP, BRL, …), both US (1,234.56) and EU (1.234,56) formats.
  • percent — 20%, 0.5%, 5 per cent.
  • date — ISO (2024-05-31), US (05/31/2024), EU (31/05/2024), long form (March 5, 2024, 5 de março de 2024).
  • duration — EN (seconds / minutes / hours / days / weeks / months / years) and PT (segundos / minutos / horas / dias / semanas / meses / anos).
  • citation_reference — DOI, ISBN, RFC, ISO, §, Section, Chapter, Art., Capítulo, Lei / Decreto / Portaria / Instrução Normativa.
  • number — fallback for plain numeric literals.

json
{
  "numeric_integrity": {
    "numeric_integrity_score": 0.98,
    "total_values_checked": 51,
    "exact_matches": 50,
    "corrected": 1,
    "unverifiable": 0
  }
}

corrected tokens are restored against the source span before the package is serialized. unverifiable tokens are flagged in warnings and counted toward the quality score, never silently rewritten.
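The verbatim check is subtler than a substring test, because a truncated value such as US$ 1,234 is itself a substring of US$ 1,234.56. The sketch below shows one way to reject such fragments. It is not extractNumericFacts(): the function names, the boundary heuristic, and the report shape (which lists unverifiable tokens rather than counting them, and skips the correction step) are all assumptions.

```typescript
// Hypothetical check: true only if `token` occurs in `source` and is not
// a fragment of a longer numeric literal.
function appearsVerbatim(token: string, source: string): boolean {
  if (token.length === 0) return false;
  let from = 0;
  while (true) {
    const i = source.indexOf(token, from);
    if (i === -1) return false;
    const before = source.slice(Math.max(0, i - 2), i);
    const after = source.slice(i + token.length, i + token.length + 2);
    // A digit (optionally followed or preceded by . or ,) right next to the
    // match means the literal continues beyond the token.
    const continuesLeft = /\d[.,]?$/.test(before);
    const continuesRight = /^[.,]?\d/.test(after);
    if (!continuesLeft && !continuesRight) return true;
    from = i + 1; // try the next occurrence
  }
}

interface VerbatimReport {
  total_values_checked: number;
  exact_matches: number;
  unverifiable_tokens: string[]; // assumption: a list instead of a count
}

function checkVerbatim(packageTokens: string[], source: string): VerbatimReport {
  const unverifiable = packageTokens.filter((t) => !appearsVerbatim(t, source));
  return {
    total_values_checked: packageTokens.length,
    exact_matches: packageTokens.length - unverifiable.length,
    unverifiable_tokens: unverifiable,
  };
}
```

With a source line containing US$ 1,234.56 and 2024-05-31, both full literals pass, while US$ 1,234 and May 2024 end up in the unverifiable list.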

Language recovery

When the prose of the final package drifts from profile.detectedLanguage (or from the caller's explicit targetLanguage) past a structural threshold, the pipeline triggers a single re-run with a reinforced language directive, and the field-aware sanitizer rebuilds the affected sections. The outcome is reported in the metrics:

json
{
  "sanitizer": {
    "removed_count": 3,
    "quarantined_count": 0,
    "deduplicated_count": 5,
    "restored_count": 12,
    "language_recovery_applied": true
  }
}

Recovery is bounded (at most one re-run) and applies to any declared language, not only EN / PT. The mechanism is the same for an English source that leaked into Spanish as for a Portuguese source that leaked into English.
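The "at most one re-run" bound can be pictured as a straight-line function rather than a loop. In this sketch the compile callback stands in for the LLM stage, and drift is reduced to a simple language-code mismatch; both are hypothetical simplifications of the structural threshold described above.

```typescript
// Hypothetical stand-in for the pipeline's compile stage.
interface CompiledPackage {
  language: string; // language detected in the package prose
  body: string;
}
type CompileFn = (reinforceLanguage: boolean) => CompiledPackage;

interface RecoveryOutcome {
  result: CompiledPackage;
  language_recovery_applied: boolean;
}

function compileWithLanguageRecovery(
  compile: CompileFn,
  targetLanguage: string,
): RecoveryOutcome {
  const first = compile(false);
  if (first.language === targetLanguage) {
    return { result: first, language_recovery_applied: false };
  }
  // Drift detected: exactly one re-run with the reinforced directive.
  // The second result is returned as-is, even if it still drifts.
  const second = compile(true);
  return { result: second, language_recovery_applied: true };
}
```

Writing the bound without a loop makes it impossible for a persistently drifting model to trigger more than two compile calls.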
