Compiler v1.3.1
Preflight & coverage
Four schema-stable, domain-agnostic guarantees the v1.3.1 pipeline adds on top of v1.1: source preflight, record-level segmentation, coverage modes, and numeric integrity.
Preflight
Before any chunking or LLM call, profileSource() inspects the raw input and reports what it found. The preflight is what enables auto-tuning of coverage and what prevents the compiler from spending tokens on inputs that have no extractable content.
import { profileSource } from "@/lib/compiler/sourceProfiler";
const profile = profileSource(text, { filename });
// profile.detectedLanguage — "en" | "pt-BR" | "es" | "unknown" | ...
// profile.detectedFormat — "jsonl_records" | "json_array_records" | "faq"
// | "markdown" | "transcript" | "legal_norm" | "plain_text" | ...
// profile.recordCount — number of records when format is record-oriented
// profile.sourceCharCount — total characters
// profile.sourceWordCount — total words
// profile.estimatedChunks — projected number of chunks for the LLM stage
// profile.hasStructuredRecords — true when records can be addressed individually
// profile.warnings — non-fatal hints surfaced in the UI
// profile.blocked — true ⇒ compilation MUST abort (e.g. hash-only or empty)
// profile.blockedReason — short explanation when blocked
Hard-blocking conditions today: empty input, filename-only input, hex-hash-only input. More can be added without changing the contract.
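The three hard-blocking conditions can be illustrated with a small check. This is only a sketch under stated assumptions: `checkHardBlocks`, `assertCompilable`, and the regular expressions are hypothetical and are not taken from `profileSource()`; the real heuristics may differ.

```typescript
// Hypothetical sketch: these heuristics are assumptions, not profileSource() internals.
type BlockCheck = { blocked: boolean; blockedReason?: string };

function checkHardBlocks(text: string): BlockCheck {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    return { blocked: true, blockedReason: "empty input" };
  }
  // Hex-hash-only: a single token of 32 to 128 hex digits (MD5 through SHA-512 lengths).
  if (/^[0-9a-fA-F]{32,128}$/.test(trimmed)) {
    return { blocked: true, blockedReason: "hex-hash-only input" };
  }
  // Filename-only: a single token without whitespace that ends in a short extension.
  if (/^[^\s]+\.[A-Za-z0-9]{1,5}$/.test(trimmed)) {
    return { blocked: true, blockedReason: "filename-only input" };
  }
  return { blocked: false };
}

// A caller would run this guard before chunking or any LLM call.
function assertCompilable(check: BlockCheck): void {
  if (check.blocked) {
    throw new Error(`Compilation aborted: ${check.blockedReason ?? "blocked"}`);
  }
}
```

Keeping the check separate from the guard mirrors the contract above: the profile only reports `blocked` and `blockedReason`, and the caller decides how to abort.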
Source manifest & record IDs
segmentSource() takes the source text and the preflight and emits a SourceSpan[]. Each span carries a stable sourceRecordId when the input is record-oriented (one span per JSONL record, one per FAQ Q/A, one per normative article), or a structural id derived from headings / paragraph offsets when the input is prose.
import { segmentSource, buildSourceManifest } from "@/lib/compiler/sourceSegmenter";
const spans = await segmentSource(text, profile, { filename });
const sourceManifest = buildSourceManifest(spans);
// SourceSpan { id, sourceRecordId?, sourceType, path, lineStart/End, charStart/End,
// textSha256, text }
// sourceManifest[i] = { source_record_id, source_type, text_preview, ... }
Spans are passed to the chunker, which subdivides large spans without losing the sourceRecordId association. Every downstream item that originates from a chunk inherits the manifest entry and reaches the final package via source_traceability.
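To make the "subdivide without losing the association" idea concrete, here is a minimal sketch. The reduced `SourceSpan` shape, the `Chunk` type, and `chunkSpan` are hypothetical names invented for illustration, and fixed-width slicing is a simplification; the actual chunker presumably splits on more meaningful boundaries.

```typescript
// Hypothetical, reduced shapes for this sketch only.
interface SourceSpan {
  id: string;
  sourceRecordId?: string;
  charStart: number; // absolute offset of the span in the source
  charEnd: number;
  text: string;
}

interface Chunk {
  spanId: string;
  sourceRecordId: string | null; // copied from the parent span
  charStart: number;             // absolute offset in the source
  charEnd: number;
  text: string;
}

// Splits a span into pieces of at most maxChars characters.
// Every piece keeps the parent's id and sourceRecordId.
function chunkSpan(span: SourceSpan, maxChars: number): Chunk[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  const chunks: Chunk[] = [];
  for (let offset = 0; offset < span.text.length; offset += maxChars) {
    const text = span.text.slice(offset, offset + maxChars);
    chunks.push({
      spanId: span.id,
      sourceRecordId: span.sourceRecordId ?? null,
      charStart: span.charStart + offset,
      charEnd: span.charStart + offset + text.length,
      text,
    });
  }
  return chunks;
}
```

Because each chunk carries absolute offsets and the record id, an item produced from any chunk can be traced back to its manifest entry without re-scanning the source.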
Coverage modes
The coverage pass ensures that record-oriented inputs do not lose records to LLM summarization. It runs after sanitation and inserts schema-stable items wherever the requested mode demands them.
| Mode | Behavior | Use when |
|---|---|---|
| summary | No inserts. Trusts the LLM's compression. | Prose, long-form, gist-level use cases. |
| balanced | Default. Inserts retrieval chunks for under-covered spans only. | Mixed inputs, most real-world sources. |
| complete | One retrieval chunk per source record; QA pairs and atomic units when present. | FAQs, JSONL datasets, normative corpora. |
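The three modes can be read as a rule for choosing which records need an inserted retrieval chunk. The sketch below is an illustration, not the pipeline's code: `PackageItem`, the `"retrieval_chunk"` kind label, and `planRetrievalInserts` are hypothetical, and "under-covered" is assumed here to mean "no item references the record at all".

```typescript
type CoverageMode = "summary" | "balanced" | "complete";

// Hypothetical item shape for this sketch only.
interface PackageItem {
  kind: string;              // e.g. "retrieval_chunk" or "qa_pair"
  sourceRecordIds: string[]; // records this item traces back to
}

// Returns the record ids that would need an inserted retrieval chunk.
function planRetrievalInserts(
  mode: CoverageMode,
  recordIds: string[],
  items: PackageItem[],
): string[] {
  if (mode === "summary") return []; // no inserts, trust the LLM's compression

  const referenced = new Set<string>(); // records mentioned by any item
  const chunked = new Set<string>();    // records that already have a retrieval chunk
  for (const item of items) {
    for (const id of item.sourceRecordIds) {
      referenced.add(id);
      if (item.kind === "retrieval_chunk") chunked.add(id);
    }
  }

  // balanced (assumption): only records that nothing references are under-covered.
  // complete: every record needs its own retrieval chunk.
  const covered = mode === "balanced" ? referenced : chunked;
  return recordIds.filter((id) => !covered.has(id));
}
```

With records `r1`, `r2`, `r3`, a QA pair on `r1`, and a retrieval chunk on `r2`, this plan inserts nothing in summary mode, a chunk for `r3` in balanced mode, and chunks for `r1` and `r3` in complete mode.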
Auto-upgrade
When coverageMode is not set explicitly and the preflight reports a record-oriented format (jsonl_records, json_array_records, faq, legal_norm), the pipeline auto-upgrades to complete. This is a structural rule: it never inspects the content of the records, only their shape.
Numeric integrity
A common LLM failure mode is silent truncation of literal tokens — US$ 1,234.56 becomes US$ 1,234, 2024-05-31 becomes May 2024. The v1.3.1 pipeline runs extractNumericFacts() against every source span and verifies that every numeric / citation token referenced in the final package appears verbatim. Tokens are typed but not domain-specific:
- money: US$, €, £, ¥, R$, ISO codes (USD, EUR, GBP, BRL, …), both US (1,234.56) and EU (1.234,56) formats.
- percent: 20%, 0.5%, 5 per cent.
- date: ISO (2024-05-31), US (05/31/2024), EU (31/05/2024), long form (March 5, 2024, 5 de março de 2024).
- duration: EN (seconds / minutes / hours / days / weeks / months / years) and PT (segundos / minutos / horas / dias / semanas / meses / anos).
- citation_reference: DOI, ISBN, RFC, ISO, §, Section, Chapter, Art., Capítulo, Lei / Decreto / Portaria / Instrução Normativa.
- number: fallback for plain numeric literals.
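A reduced version of the verbatim check can be sketched as follows. It is an assumption-laden illustration, not `extractNumericFacts()` itself: `extractTokens`, `findUnmatched`, and the single regular expression are hypothetical, and only four of the token types (ISO dates, symbol-prefixed money, percentages, plain numbers) are covered. Tokens are compared as whole values rather than substrings, since a truncated `US$ 1,234` is still a substring of `US$ 1,234.56`.

```typescript
type TokenType = "date" | "money" | "percent" | "number";
interface NumericToken { type: TokenType; value: string }

// Alternatives are ordered from most to least specific; group 4 is the fallback.
// 1: ISO date, 2: symbol-prefixed money, 3: percent, 4: plain number.
const TOKEN_SOURCE =
  "(\\d{4}-\\d{2}-\\d{2})" +
  "|((?:US\\$|R\\$|[€£¥])\\s?\\d+(?:[.,]\\d+)*)" +
  "|(\\d+(?:[.,]\\d+)*\\s?%)" +
  "|(\\d+(?:[.,]\\d+)*)";

function extractTokens(text: string): NumericToken[] {
  const re = new RegExp(TOKEN_SOURCE, "g"); // fresh regex, so lastIndex starts at 0
  const tokens: NumericToken[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    let type: TokenType = "number";
    if (m[1] !== undefined) type = "date";
    else if (m[2] !== undefined) type = "money";
    else if (m[3] !== undefined) type = "percent";
    tokens.push({ type, value: m[0] });
  }
  return tokens;
}

// Returns package tokens that do not appear as a whole token in the source.
function findUnmatched(sourceText: string, packageText: string): NumericToken[] {
  const sourceValues = new Set(extractTokens(sourceText).map((t) => t.value));
  return extractTokens(packageText).filter((t) => !sourceValues.has(t.value));
}
```

For the source `US$ 1,234.56 due 2024-05-31` and the package text `US$ 1,234 due May 2024`, `findUnmatched` reports the money token `US$ 1,234` and the number token `2024`, which are the two truncations described above.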
{
"numeric_integrity": {
"numeric_integrity_score": 0.98,
"total_values_checked": 51,
"exact_matches": 50,
"corrected": 1,
"unverifiable": 0
}
}
corrected tokens are restored against the source span before the package is serialized. unverifiable tokens are flagged in warnings and counted toward the quality score, never silently rewritten.
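The score formula is not stated here. One reading consistent with the sample report is the share of exact matches among checked values, rounded to two decimals (50 / 51 ≈ 0.98). The sketch below encodes that reading; `numericIntegrityScore`, `countsAreConsistent`, and the rounding rule are assumptions, not the documented computation.

```typescript
interface NumericIntegrityCounts {
  total_values_checked: number;
  exact_matches: number;
  corrected: number;
  unverifiable: number;
}

// Assumption: every checked value lands in exactly one of the three buckets.
function countsAreConsistent(c: NumericIntegrityCounts): boolean {
  return c.exact_matches + c.corrected + c.unverifiable === c.total_values_checked;
}

// Assumption: score = exact_matches / total_values_checked, rounded to 2 decimals.
function numericIntegrityScore(c: NumericIntegrityCounts): number {
  if (c.total_values_checked === 0) return 1; // nothing to verify
  return Math.round((c.exact_matches / c.total_values_checked) * 100) / 100;
}
```

Under this reading, a corrected token lowers the score even though the serialized package ends up right, which keeps the metric sensitive to how often the LLM stage altered a literal.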
Language recovery
When the final package's prose drifts from preflight.detectedLanguage (or from the caller's explicit targetLanguage) past a structural threshold, the pipeline triggers a single re-run with a reinforced language directive and the field-aware sanitizer rebuilds the affected sections. The outcome is reported back in metrics:
{
"sanitizer": {
"removed_count": 3,
"quarantined_count": 0,
"deduplicated_count": 5,
"restored_count": 12,
"language_recovery_applied": true
}
}
Recovery is bounded (at most one re-run) and applies to any declared language, not only EN / PT. The mechanism is the same for an English source that leaked into Spanish as for a Portuguese source that leaked into English.
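The "at most one re-run" bound can be sketched as a small wrapper. All names here are hypothetical (`compileWithLanguageRecovery`, the `compile` and `driftRatio` callbacks, the numeric `threshold`), and the sketch leaves out the sanitizer's section rebuild; it only shows the control flow that keeps recovery bounded.

```typescript
interface RecoveryOutcome<T> {
  result: T;
  language_recovery_applied: boolean;
}

// compile(reinforce): runs the pipeline, optionally with a reinforced language directive.
// driftRatio(result): share of prose not in the expected language, between 0 and 1.
// Both callbacks and the threshold are assumptions made for this sketch.
function compileWithLanguageRecovery<T>(
  compile: (reinforceLanguage: boolean) => T,
  driftRatio: (result: T) => number,
  threshold: number,
): RecoveryOutcome<T> {
  const first = compile(false);
  if (driftRatio(first) <= threshold) {
    return { result: first, language_recovery_applied: false };
  }
  // Exactly one re-run. Its output is returned even if it still drifts,
  // so the loop can never repeat.
  const second = compile(true);
  return { result: second, language_recovery_applied: true };
}
```

Since the wrapper only sees a drift ratio, it stays language-agnostic: the same flow covers an English source that leaked into Spanish and a Portuguese source that leaked into English.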
See also
- Compiler pipeline v1.3.1 — where these stages fit in the full order.
- Language lock — the field-aware sanitizer that recovery relies on.
- Structuring rules — how source_record_id propagates into source_traceability.
- Extraction pipeline — preflight + segmentation feed the chunker described here.