Heuristic vs LLM compilation
The CKF schema is the same in both modes — 22 sections, schema-stable output. What differs is how knowledge gets into the schema. A heuristic engine recognizes surface patterns. An LLM engine infers meaning. The gap between the two is exactly what this page is about.
See the difference at a glance
The example below shows one source text, followed by what a regex-and-term-frequency (TF) heuristic recovers from it, and then what an LLM produces from the exact same source.
Source:

# Database Migrations Guide

A migration is a versioned, idempotent SQL script that brings a database schema from one state to another. Every migration must be reviewed before merging. Avoid editing migrations after they have been applied to any shared environment — instead, create a new migration that reverses or amends the change. If a migration fails halfway, the transaction is rolled back and the schema returns to the previous state.
Heuristic output:

concepts:
  - id: CON_001
    label: Migration
    definition: "A migration is a versioned, idempotent SQL script that brings a database schema from one state to another."
    confidence: 0.78
heuristics:
  - id: HEU_001
    trigger: "relevant decision context detected"
    interpretation: "Every migration must be reviewed before merging."
    recommended_action: "Follow the recommended practice."
    confidence: 0.76
  - id: HEU_002
    trigger: "context contains a known failure mode"
    interpretation: "Avoid editing migrations after they have been applied to any shared environment."
    recommended_action: "Avoid the described action."
    confidence: 0.76
causal_chains:
  - id: CAU_001
    cause: "If a migration fails halfway"
    effect: "the transaction is rolled back and the schema returns to the previous state."
    confidence: 0.70

LLM output:

concepts:
  - id: CON_001
    label: Migration
    definition: "Versioned, idempotent SQL script that transitions a database schema between two known states."
    depends_on: []
    enables: [CON_002]
    confidence: 0.92
  - id: CON_002
    label: Schema state
    definition: "The committed structural definition of a database at a given migration version."
    confidence: 0.85
decision_rules:
  - id: RULE_001
    condition: "A migration has not yet been applied to a shared environment."
    decision: "It may be edited in place before merge review."
    failure_mode: "Editing a migration already applied to staging or production."
    confidence: 0.88
  - id: RULE_002
    condition: "A migration has already been applied to a shared environment."
    decision: "Do NOT edit it; author a follow-up migration that amends or reverses the change."
    confidence: 0.90
causal_chains:
  - id: CAU_001
    cause: "A migration fails partway through execution."
    mechanism: "The enclosing transaction aborts."
    effect: "All statements in the migration are rolled back; the schema reverts to the pre-migration state."
    intervention_points: ["wrap DDL in a transaction", "fail fast on first error"]
    confidence: 0.92

What the LLM caught that the heuristic missed
- The LLM extracted the implicit 'Schema state' concept, which the heuristic never noticed.
- The LLM produced two conditional decision rules ('not yet applied' vs 'already applied'); the heuristic kept the avoidance as a single flat heuristic.
- The LLM added the mechanism ('enclosing transaction aborts') and intervention points to the causal chain; the heuristic kept only cause and effect.
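Differences like these can also be surfaced programmatically by diffing two packages section by section. The following is a minimal sketch, assuming each CKF package has been loaded into a plain dict keyed by section name; the `diff_sections` helper and the abbreviated sample data are illustrative assumptions, not part of any CKF tooling.

```python
def diff_sections(heuristic: dict, llm: dict) -> dict:
    """Per section: item counts for each engine and fields only the LLM filled."""
    report = {}
    for section in sorted(set(heuristic) | set(llm)):
        h_items = heuristic.get(section, [])
        l_items = llm.get(section, [])
        h_fields = {key for item in h_items for key in item}
        l_fields = {key for item in l_items for key in item}
        report[section] = {
            "heuristic_count": len(h_items),
            "llm_count": len(l_items),
            "llm_only_fields": sorted(l_fields - h_fields),
        }
    return report


# Abbreviated versions of the two outputs above (hypothetical in-memory form).
heuristic_pkg = {
    "concepts": [{"id": "CON_001", "label": "Migration"}],
    "heuristics": [{"id": "HEU_001"}, {"id": "HEU_002"}],
    "causal_chains": [{"id": "CAU_001", "cause": "c", "effect": "e"}],
}
llm_pkg = {
    "concepts": [
        {"id": "CON_001", "label": "Migration", "enables": ["CON_002"]},
        {"id": "CON_002", "label": "Schema state"},
    ],
    "decision_rules": [{"id": "RULE_001"}, {"id": "RULE_002"}],
    "causal_chains": [
        {"id": "CAU_001", "cause": "c", "mechanism": "m", "effect": "e",
         "intervention_points": ["i"]},
    ],
}

report = diff_sections(heuristic_pkg, llm_pkg)
for section, row in report.items():
    print(section, row)
```

A report like this only compares structure (counts and field names); it says nothing about whether the extracted content is semantically correct.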
Heuristic compilation
Pure algorithm. No model, no API, no network. The compiler walks your text once and fills the CKF schema from surface patterns it can match with regex and frequency counts.
- Tokenize & sentence-split — Unicode-aware regex over the full text.
- Top-N nouns — TF counts (stopwords removed) become candidate entities and concept labels.
- Pattern extraction — regex matches like should/must/deve create heuristics, if … then creates IF-THEN rules, because/causes creates causal chains.
- Co-occurrence relations — entities that appear together get a weak co_occurs_with link.
- Sliding-window chunks — sentences grouped by count to populate retrieval chunks and atomic units.
- Serialize — the partially populated package is written out in the canonical CKF format.
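The steps above can be sketched in a few dozen lines. This is an illustrative approximation, not the actual compiler: the stopword list, the exact regexes, the function name `compile_heuristic`, and the output keys are all assumptions, and "nouns" are approximated here by frequent non-stopword tokens.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"a", "an", "the", "is", "be", "if", "every", "must", "before", "because"}

SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")   # split after sentence punctuation
WORD_RE = re.compile(r"\w+")                 # \w is Unicode-aware for str patterns
MODAL_RE = re.compile(r"\b(?:should|must|avoid)\b", re.IGNORECASE)
IF_RE = re.compile(r"^if\s+(.+?),\s*(.+)$", re.IGNORECASE)
CAUSAL_RE = re.compile(r"^(.+?)\s+(because|causes)\s+(.+)$", re.IGNORECASE)


def words(text):
    return [w.lower() for w in WORD_RE.findall(text)]


def compile_heuristic(text, top_n=3, window=2):
    sentences = [s.strip() for s in SENTENCE_RE.split(text) if s.strip()]

    # Term-frequency counts with stopwords removed become candidate entities.
    tf = Counter(w for w in words(text) if w not in STOPWORDS)
    entities = [w for w, _ in tf.most_common(top_n)]

    heuristics, rules, causal_chains = [], [], []
    for sentence in sentences:
        body = sentence.rstrip(".!?")
        if MODAL_RE.search(body):
            heuristics.append({"interpretation": body})
        if (m := IF_RE.match(body)):
            rules.append({"condition": m.group(1), "decision": m.group(2)})
        if (m := CAUSAL_RE.match(body)):
            left, keyword, right = m.group(1), m.group(2).lower(), m.group(3)
            # "X because Y" means Y is the cause; "X causes Y" means X is.
            cause, effect = (right, left) if keyword == "because" else (left, right)
            causal_chains.append({"cause": cause, "effect": effect})

    # Entities sharing a sentence get a weak co-occurrence link.
    relations = set()
    for sentence in sentences:
        present = sorted(set(words(sentence)) & set(entities))
        relations.update(combinations(present, 2))

    # Group sentences by count into retrieval chunks.
    chunks = [" ".join(sentences[i:i + window]) for i in range(0, len(sentences), window)]

    return {
        "entities": entities,
        "heuristics": heuristics,
        "rules": rules,
        "causal_chains": causal_chains,
        "relations": sorted(relations),
        "chunks": chunks,
    }


sample = (
    "Every migration must be reviewed before merging. "
    "If a migration fails, the transaction is rolled back. "
    "Deploys stall because a migration locks the table."
)
pkg = compile_heuristic(sample)
```

Note how the link between `migration` and `reviewed` exists only because both words share a sentence; nothing in this pass checks that the relation is meaningful.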
text ──▶ tokenize ──▶ sentence split
            │              │
            ▼              ▼
       top-N nouns  regex patterns
       (TF count)   (must / avoid / if / because)
            │              │
            └──────┬───────┘
                   ▼
           co-occurrence map
                   │
                   ▼
      canonical CKF (22 sections)

LLM compilation
Same target schema. Different filling strategy: the LLM is forced — via function calling with a strict CKF tool schema — to return structured fields the post-processor can trust.
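To make "strict tool schema" concrete, here is a hedged sketch of what a small fragment of such a schema might look like as JSON Schema, together with a minimal checker for the returned arguments. The real CKF tool schema is not shown on this page, so the field list below is an assumption based on the concept fields in the example output, and the wrapper a provider expects around the `parameters` object varies by API.

```python
# Hypothetical fragment: only the "concepts" section, with four of its fields.
CONCEPT_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "label": {"type": "string"},
        "definition": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["id", "label", "definition", "confidence"],
    "additionalProperties": False,
}

CKF_PARAMETERS = {
    "type": "object",
    "properties": {"concepts": {"type": "array", "items": CONCEPT_SCHEMA}},
    "required": ["concepts"],
    "additionalProperties": False,
}

PY_TYPES = {"object": dict, "array": list, "string": str, "number": (int, float)}


def violations(value, schema, path="$"):
    """Return a list of problems; covers only the JSON Schema subset used above."""
    if not isinstance(value, PY_TYPES[schema["type"]]) or isinstance(value, bool):
        return [f"{path}: expected {schema['type']}"]
    problems = []
    if schema["type"] == "object":
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing")
        for key, item in value.items():
            if key in props:
                problems += violations(item, props[key], f"{path}.{key}")
            elif schema.get("additionalProperties") is False:
                problems.append(f"{path}.{key}: unexpected field")
    elif schema["type"] == "array":
        for i, item in enumerate(value):
            problems += violations(item, schema["items"], f"{path}[{i}]")
    elif schema["type"] == "number":
        if not schema.get("minimum", value) <= value <= schema.get("maximum", value):
            problems.append(f"{path}: out of range")
    return problems


good = {"concepts": [{"id": "CON_001", "label": "Migration",
                      "definition": "Versioned SQL script.", "confidence": 0.92}]}
bad = {"concepts": [{"id": "CON_001", "label": "Migration", "confidence": 1.4}]}

print(violations(good, CKF_PARAMETERS))
print(violations(bad, CKF_PARAMETERS))
```

Checking the arguments again after the call is a defensive choice: forced function calling makes well-formed output likely, but a post-processor that validates before merging does not have to rely on that.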
- Semantic chunking — split on markdown structure first, fall back to size; preserve sentence boundaries.
- Per-chunk extraction — LLM call with temperature=0 and a forced function-call schema; the model fills CKF fields directly. Synonyms, coreference, implicit concepts and causal inferences happen here.
- Reduce & merge — multiple chunks merged: entities deduped by alias, concepts merged by canonical label, relations consolidated.
- Promotion pass — strong atomic units get promoted to typed sections (rules, anti-patterns, causal chains).
- Sanitize — drop unfounded items, quarantine low-confidence ones, dedupe.
- Quality gates — coverage, density, traceability, schema-completeness scored; metadata calibrated.
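The post-processing steps in this list (merge, sanitize, ID assignment) could look roughly like the sketch below. It is a simplified illustration under stated assumptions: the `evidence` field used to decide whether an item is founded in the source, the `0.6` quarantine threshold, and all function names are hypothetical and not taken from the actual pipeline.

```python
def merge_concepts(partials):
    """Reduce step: merge per-chunk concepts by canonical (case-folded) label."""
    merged = {}
    for partial in partials:
        for concept in partial.get("concepts", []):
            key = concept["label"].strip().casefold()
            best = merged.get(key)
            # On a label collision, keep the higher-confidence version.
            if best is None or concept["confidence"] > best["confidence"]:
                merged[key] = dict(concept)
    return list(merged.values())


def sanitize(concepts, source_text, quarantine_below=0.6):
    """Drop items whose evidence quote is absent from the source; quarantine weak ones."""
    kept, quarantined = [], []
    for concept in concepts:
        evidence = concept.get("evidence")
        if not evidence or evidence not in source_text:
            continue  # unfounded: nothing in the source backs it
        if concept["confidence"] < quarantine_below:
            quarantined.append(concept)
        else:
            kept.append(concept)
    return kept, quarantined


def ensure_ids(concepts, prefix="CON"):
    """Assign sequential IDs in the CON_001 style used in the example output."""
    for index, concept in enumerate(concepts, start=1):
        concept["id"] = f"{prefix}_{index:03d}"
    return concepts


SOURCE = ("A migration is a versioned, idempotent SQL script that brings "
          "a database schema from one state to another.")

partials = [
    {"concepts": [
        {"label": "Migration", "confidence": 0.80, "evidence": "A migration is a versioned"},
    ]},
    {"concepts": [
        {"label": "migration", "confidence": 0.92, "evidence": "idempotent SQL script"},
        {"label": "Schema state", "confidence": 0.40, "evidence": "database schema from one state"},
        {"label": "Rollback daemon", "confidence": 0.90, "evidence": "a daemon restores backups"},
    ]},
]

merged = merge_concepts(partials)
kept, quarantined = sanitize(merged, SOURCE)
ensure_ids(kept)
```

In this run the two "Migration" entries collapse into one, "Schema state" is quarantined for low confidence, and "Rollback daemon" is dropped because its evidence quote does not occur in the source, regardless of how confident the model claimed to be.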
text ──▶ semantic chunker (markdown-aware)
                 │
                 ▼  per chunk
    ┌──────────────────────────┐
    │  LLM + function calling  │
    │  forced CKF tool schema  │
    └──────────────────────────┘
                 │  N partials
                 ▼
      reduce · dedupe · merge
                 │
                 ▼
  promote · sanitize · ensureIds
                 │
                 ▼
     quality gates · calibrate
                 │
                 ▼
    canonical CKF (22 sections)

Where each engine wins
The schema is identical. The question is what you can get into it for your text, your budget, and your privacy constraints.
| Capability | Heuristic | LLM |
|---|---|---|
| Synonyms & paraphrase | Not detected | Merged automatically |
| Implicit concepts | Only visible text | Inferred from context |
| Coreference (pronouns, aliases) | Treated as new entity | Resolved to canonical subject |
| Causal relations | Only when "because/causes" appears | Inferred even without keywords |
| Reformulated definitions | Needs the "X is Y" pattern | Recognized in any form |
| Narrative text | Fails — keywords rarely appear | Works well |
| Well-structured manuals | ~70% of LLM quality | Reference quality |
| Cost | Zero | Paid per token |
| Latency | Milliseconds | 10 s – 2 min |
| Schema-stability | Total — pure algorithm | High with temperature=0 + function calling, not 100% |
| Privacy | Local, never leaves the browser | Source text sent to provider |
| File size ceiling | ~10 MB (browser memory) | ~960k chars per run (chunked) |
Hallucination by Composition
A heuristic CKF can be structurally valid and still be semantically wrong — fragments that look related by surface co-occurrence may not actually entail each other. When a RAG pipeline or agent then composes those fragments to answer a question, the model treats them as a coherent argument and hallucinates with extra confidence. This is the failure mode we call Composition Hallucination, and it is the strongest reason to prefer LLM compilation for anything that will feed a downstream agent.
When to use which
Use heuristic when…
- You need a CKF in milliseconds for prototyping.
- The source is well-structured (manual, README, spec).
- You cannot send the text to any third-party provider.
- You want a baseline to compare against an LLM run.
- You are processing thousands of documents and cost matters more than recall.
- You only need retrieval chunks and atomic units, not deep reasoning structure.
Use LLM when…
- The CKF will feed a production RAG or agent (avoid Composition Hallucination).
- The text is narrative, persuasive, or domain-rich.
- You need real concepts, causal chains, anti-patterns and decision rules, not just keywords.
- Synonyms and coreference matter (legal, medical, scientific).
- The source's content maps onto the 22 CKF sections in non-obvious ways.
- Quality matters more than latency or cost.