Compiler internals

Heuristic vs LLM compilation

The CKF schema is the same in both modes — 22 sections, schema-stable output. What differs is how knowledge gets into the schema. A heuristic engine recognizes surface patterns. An LLM engine infers meaning. The gap between the two is exactly what this page is about.

Same text, two outputs

See the difference at a glance

The example below shows one source text, followed by what a regex-and-TF (term frequency) heuristic recovers from it and what an LLM produces from the exact same source.

Scenario · Best case for the heuristic. Explicit normative words (must, avoid) and clear 'X is Y' definitions let pattern matching get close to the LLM output.

SOURCE

raw input

# Database Migrations Guide

A migration is a versioned, idempotent SQL script that brings a database
schema from one state to another. Every migration must be reviewed before
merging. Avoid editing migrations after they have been applied to any
shared environment — instead, create a new migration that reverses or
amends the change. If a migration fails halfway, the transaction is
rolled back and the schema returns to the previous state.

HEURISTIC CKF

~5 ms · free

yaml

concepts:
  - id: CON_001
    label: Migration
    definition: "A migration is a versioned, idempotent SQL script that brings a database schema from one state to another."
    confidence: 0.78

heuristics:
  - id: HEU_001
    trigger: "relevant decision context detected"
    interpretation: "Every migration must be reviewed before merging."
    recommended_action: "Follow the recommended practice."
    confidence: 0.76
  - id: HEU_002
    trigger: "context contains a known failure mode"
    interpretation: "Avoid editing migrations after they have been applied to any shared environment."
    recommended_action: "Avoid the described action."
    confidence: 0.76

causal_chains:
  - id: CAU_001
    cause: "If a migration fails halfway"
    effect: "the transaction is rolled back and the schema returns to the previous state."
    confidence: 0.70

LLM CKF

~10–60 s · tokens

yaml

concepts:
  - id: CON_001
    label: Migration
    definition: "Versioned, idempotent SQL script that transitions a database schema between two known states."
    depends_on: []
    enables: [CON_002]
    confidence: 0.92
  - id: CON_002
    label: Schema state
    definition: "The committed structural definition of a database at a given migration version."
    confidence: 0.85

decision_rules:
  - id: RULE_001
    condition: "A migration has not yet been applied to a shared environment."
    decision: "It may be edited in place before merge review."
    failure_mode: "Editing a migration already applied to staging or production."
    confidence: 0.88
  - id: RULE_002
    condition: "A migration has already been applied to a shared environment."
    decision: "Do NOT edit it; author a follow-up migration that amends or reverses the change."
    confidence: 0.90

causal_chains:
  - id: CAU_001
    cause: "A migration fails partway through execution."
    mechanism: "The enclosing transaction aborts."
    effect: "All statements in the migration are rolled back; the schema reverts to the pre-migration state."
    intervention_points: ["wrap DDL in a transaction", "fail fast on first error"]
    confidence: 0.92

What the LLM caught that the heuristic missed

The LLM extracted the implicit 'Schema state' concept, which the heuristic never noticed.
The LLM produced two conditional decision rules ('not yet applied' vs 'already applied'); the heuristic kept the avoidance as a single flat heuristic.
The LLM added the mechanism ('enclosing transaction aborts') and intervention points to the causal chain; the heuristic kept only cause and effect.
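
The same comparison can be made mechanically. The sketch below is only an illustration: the two dictionaries transcribe the section and field names from the YAML outputs above with the values replaced by placeholders, and the helper names (`item`, `field_coverage`, `llm_only`) are hypothetical.

```python
def item(*fields: str) -> dict:
    """Build a placeholder CKF item that has the given field names."""
    return dict.fromkeys(fields, "...")


# Field names copied from the two YAML outputs above; values are placeholders.
HEURISTIC = {
    "concepts": [item("id", "label", "definition", "confidence")],
    "heuristics": [item("id", "trigger", "interpretation",
                        "recommended_action", "confidence")],
    "causal_chains": [item("id", "cause", "effect", "confidence")],
}
LLM = {
    "concepts": [item("id", "label", "definition", "depends_on", "enables", "confidence"),
                 item("id", "label", "definition", "confidence")],
    "decision_rules": [item("id", "condition", "decision", "failure_mode", "confidence"),
                       item("id", "condition", "decision", "confidence")],
    "causal_chains": [item("id", "cause", "mechanism", "effect",
                           "intervention_points", "confidence")],
}


def field_coverage(ckf: dict) -> dict[str, set[str]]:
    """Map each section to the union of field names used by its items."""
    return {section: set().union(*(i.keys() for i in items))
            for section, items in ckf.items()}


def llm_only(heuristic: dict, llm: dict) -> dict[str, list[str]]:
    """List, per section, the fields the LLM filled that the heuristic did not."""
    h, l = field_coverage(heuristic), field_coverage(llm)
    return {s: sorted(fields - h.get(s, set()))
            for s, fields in l.items() if fields - h.get(s, set())}


print(llm_only(HEURISTIC, LLM))
```

For this pair the result lists `depends_on` and `enables` under concepts, the whole `decision_rules` section, and `intervention_points` and `mechanism` under causal chains, which matches the three observations above.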

How it works

Heuristic compilation

Pure algorithm. No model, no API, no network. The compiler walks your text once and fills the CKF schema from surface patterns it can match with regex and frequency counts.

Tokenize & sentence-split — Unicode-aware regex over the full text.
Top-N nouns — TF counts (stopwords removed) become candidate entities and concept labels.
Pattern extraction — regex matches like should/must/deve create heuristics, if … then creates IF-THEN rules, because/causes creates causal chains.
Co-occurrence relations — entities that appear together get a weak co_occurs_with link.
Sliding-window chunks — sentences grouped by count to populate retrieval chunks and atomic units.
Serialize — the partially populated package is written out in the canonical CKF format.
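
The first three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the compiler's actual code: the regexes, the short stopword list, and the function names (`split_sentences`, `top_terms`, `extract_patterns`) are all hypothetical, and plain term frequency stands in for real noun detection. The sample is the body of the migrations paragraph used earlier on this page, without its heading.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "that", "and", "or", "if",
             "be", "any", "they", "have", "been", "after", "before", "from",
             "one", "new", "every", "must", "avoid", "instead"}

# Assumed surface patterns, one per CKF section they feed.
NORMATIVE = re.compile(r"\b(must|should|avoid|never)\b", re.IGNORECASE)
DEFINITION = re.compile(r"^(?:An?|The)\s+(\w+)\s+is\s+(.+)$", re.IGNORECASE)
CONDITIONAL = re.compile(r"^If\s+(.+?),\s+(.+)$", re.IGNORECASE)

SAMPLE = """A migration is a versioned, idempotent SQL script that brings a database
schema from one state to another. Every migration must be reviewed before
merging. Avoid editing migrations after they have been applied to any
shared environment — instead, create a new migration that reverses or
amends the change. If a migration fails halfway, the transaction is
rolled back and the schema returns to the previous state."""


def split_sentences(text: str) -> list[str]:
    """Collapse whitespace, then split after '.', '!' or '?' followed by a space."""
    flat = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def top_terms(text: str, n: int = 5) -> list[str]:
    """Rank lowercase letter-only tokens by raw frequency, skipping stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # Unicode letters only
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]


def extract_patterns(sentences: list[str]) -> dict[str, list[dict]]:
    """Route each sentence to at most one section based on its surface form."""
    out = {"concepts": [], "heuristics": [], "causal_chains": []}
    for s in sentences:
        if m := DEFINITION.match(s):
            out["concepts"].append({"label": m.group(1).capitalize(), "definition": s})
        elif m := CONDITIONAL.match(s):
            out["causal_chains"].append({"cause": f"If {m.group(1)}", "effect": m.group(2)})
        elif NORMATIVE.search(s):
            out["heuristics"].append({"interpretation": s})
    return out


result = extract_patterns(split_sentences(SAMPLE))
```

On the sample, this yields one concept, two heuristics, and one causal chain, the same section counts as the heuristic output shown earlier. A sentence that states a rule without any of the listed keywords would fall through every branch and be lost, which is the weakness the rest of this page describes.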

text  ──▶  tokenize  ──▶  sentence split
                  │              │
                  ▼              ▼
            top-N nouns    regex patterns
            (TF count)     (must / avoid / if / because)
                  │              │
                  └──────┬───────┘
                         ▼
                co-occurrence map
                         │
                         ▼
               canonical CKF (22 sections)
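
The co-occurrence and chunking stages are equally mechanical. The sketch below assumes sentence-level co-occurrence and a fixed window with a one-sentence overlap. The page only says sentences are "grouped by count", so the overlap, the substring matching, and both function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(sentences: list[str], entities: list[str]) -> Counter:
    """Count how often two entities appear in the same sentence.

    Matching is a naive lowercase substring test, so "state" would also
    match inside "statement"; a real implementation would match tokens.
    """
    pairs: Counter = Counter()
    for sentence in sentences:
        low = sentence.lower()
        present = sorted(e for e in entities if e in low)
        pairs.update(combinations(present, 2))
    return pairs


def sliding_chunks(sentences: list[str], size: int = 3, overlap: int = 1) -> list[list[str]]:
    """Group sentences into windows of `size` that share `overlap` sentences."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


sentences = [
    "The migration changes the schema.",
    "The schema has a state.",
    "A migration fails.",
    "Rollback restores the state.",
]
links = cooccurrence(sentences, ["migration", "schema", "state"])
windows = sliding_chunks(sentences, size=3, overlap=1)
```

Note what a `co_occurs_with` link built this way records: two terms shared a sentence, nothing more. It says nothing about whether one depends on, causes, or contradicts the other.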

How it works

LLM compilation

Same target schema. Different filling strategy: the LLM is forced — via function calling with a strict CKF tool schema — to return structured fields the post-processor can trust.
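
To make "strict tool schema" concrete, here is a sketch of what such a definition might look like, plus a small check to run on whatever comes back. Function-calling APIs generally accept a JSON-Schema-style description of the arguments, but the envelope differs by provider, so treat the tool name, the top-level keys, and the `validate_partial` helper as assumptions. Only the section and field names are taken from the CKF examples on this page.

```python
# Hypothetical tool definition. Section and field names mirror the CKF
# examples on this page; the tool name and envelope are assumptions.
CKF_TOOL = {
    "name": "emit_ckf_partial",
    "description": "Return the CKF items found in one chunk of source text.",
    "parameters": {
        "type": "object",
        "required": ["concepts", "decision_rules", "causal_chains"],
        "properties": {
            "concepts": {
                "type": "array",
                "items": {"type": "object",
                          "required": ["label", "definition", "confidence"]},
            },
            "decision_rules": {
                "type": "array",
                "items": {"type": "object",
                          "required": ["condition", "decision", "confidence"]},
            },
            "causal_chains": {
                "type": "array",
                "items": {"type": "object",
                          "required": ["cause", "effect", "confidence"]},
            },
        },
    },
}


def validate_partial(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    schema = CKF_TOOL["parameters"]
    errors = [f"missing section: {s}" for s in schema["required"] if s not in payload]
    for section, items in payload.items():
        spec = schema["properties"].get(section)
        if spec is None:
            errors.append(f"unknown section: {section}")
            continue
        if not isinstance(items, list):
            errors.append(f"{section} is not a list")
            continue
        for i, entry in enumerate(items):
            if not isinstance(entry, dict):
                errors.append(f"{section}[{i}] is not an object")
                continue
            for key in spec["items"]["required"]:
                if key not in entry:
                    errors.append(f"{section}[{i}] missing {key}")
            conf = entry.get("confidence")
            if isinstance(conf, (int, float)) and not 0.0 <= conf <= 1.0:
                errors.append(f"{section}[{i}] confidence out of range")
    return errors
```

Checking the returned arguments is worth the few extra lines: forcing a tool call makes well-formed output very likely, not guaranteed, so the post-processor should still reject or retry a partial that fails validation.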

Semantic chunking — split on markdown structure first, fall back to size; preserve sentence boundaries.
Per-chunk extraction — LLM call with temperature=0 and a forced function-call schema; the model fills CKF fields directly. Synonyms, coreference, implicit concepts and causal inferences happen here.
Reduce & merge — multiple chunks merged: entities deduped by alias, concepts merged by canonical label, relations consolidated.
Promotion pass — strong atomic units get promoted to typed sections (rules, anti-patterns, causal chains).
Sanitize — drop unfounded items, quarantine low-confidence ones, dedupe.
Quality gates — coverage, density, traceability, schema-completeness scored; metadata calibrated.
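
The chunking step is the easiest one to picture in code. The sketch below splits on markdown headings first and falls back to a size limit only for oversized sections, never cutting inside a sentence. The 2000-character default and the function names are assumptions, and the heading regex is deliberately simple: it would also match a `#` comment line inside a fenced code block.

```python
import re

# A markdown ATX heading: 1 to 6 '#' characters, a space or tab, then text.
HEADING = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_on_headings(markdown: str) -> list[str]:
    """Cut the document at every heading; text before the first heading is kept."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(markdown)]
    sections = (markdown[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [s for s in sections if s]


def split_by_size(section: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept intact rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", section)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def semantic_chunks(markdown: str, max_chars: int = 2000) -> list[str]:
    """Structure first, size second."""
    out: list[str] = []
    for section in split_on_headings(markdown):
        if len(section) <= max_chars:
            out.append(section)
        else:
            out.extend(split_by_size(section, max_chars))
    return out
```

Splitting on structure first keeps a definition and the rule that depends on it in the same chunk more often than a fixed-size window would, which gives the per-chunk extraction call more context to work with.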

text  ──▶  semantic chunker (markdown-aware)
                         │
                         ▼  per chunk
              ┌──────────────────────────┐
              │  LLM + function calling  │
              │  forced CKF tool schema  │
              └──────────────────────────┘
                         │ N partials
                         ▼
              reduce · dedupe · merge
                         │
                         ▼
         promote · sanitize · ensureIds
                         │
                         ▼
              quality gates · calibrate
                         │
                         ▼
               canonical CKF (22 sections)
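
The post-processing stages are ordinary code, not model calls. The sketch below shows one plausible shape for merge, sanitize, and ID assignment on concepts only. Several things here are assumptions: the `evidence` field (a quoted span expected to appear verbatim in the source), the 0.6 quarantine threshold, the rule of keeping the higher-confidence duplicate, and the choice to renumber IDs after merging.

```python
def merge_concepts(partials: list[dict]) -> list[dict]:
    """Merge concepts from chunk-level partials by case-folded label."""
    merged: dict[str, dict] = {}
    for partial in partials:
        for concept in partial.get("concepts", []):
            key = concept["label"].strip().casefold()
            best = merged.get(key)
            if best is None or concept["confidence"] > best["confidence"]:
                merged[key] = dict(concept)  # keep the higher-confidence version
    return list(merged.values())


def sanitize(items: list[dict], source: str,
             quarantine_below: float = 0.6) -> tuple[list[dict], list[dict]]:
    """Drop items with no evidence in the source; set aside low-confidence ones."""
    kept, quarantined = [], []
    for entry in items:
        evidence = entry.get("evidence", "")
        if not evidence or evidence not in source:
            continue  # unfounded: nothing in the source supports it
        if entry["confidence"] < quarantine_below:
            quarantined.append(entry)
        else:
            kept.append(entry)
    return kept, quarantined


def ensure_ids(items: list[dict], prefix: str) -> list[dict]:
    """Assign sequential IDs such as CON_001, replacing any existing ones."""
    return [{**entry, "id": f"{prefix}_{n:03d}"} for n, entry in enumerate(items, start=1)]


source = "A migration is a script that changes the schema."
partials = [
    {"concepts": [{"label": "Migration", "definition": "d1",
                   "confidence": 0.8, "evidence": "A migration is"}]},
    {"concepts": [{"label": "migration ", "definition": "d2",
                   "confidence": 0.9, "evidence": "A migration is"},
                  {"label": "Schema state", "definition": "d3",
                   "confidence": 0.5, "evidence": "schema"},
                  {"label": "Ghost", "definition": "d4",
                   "confidence": 0.9, "evidence": "not in the text"}]},
]
kept, quarantined = sanitize(merge_concepts(partials), source)
final = ensure_ids(kept, "CON")
```

In this run the two "Migration" variants collapse into one, "Schema state" is quarantined for low confidence, and "Ghost" is dropped because its evidence never appears in the source.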

Direct comparison

Where each engine wins

The schema is identical. The question is what you can get into it for your text, your budget, and your privacy constraints.

Capability	Heuristic	LLM
Synonyms & paraphrase	Not detected	Merged automatically
Implicit concepts	Only visible text	Inferred from context
Coreference (pronouns, aliases)	Treated as new entity	Resolved to canonical subject
Causal relations	Only when "because/causes" appears	Inferred even without keywords
Reformulated definitions	Needs the "X is Y" pattern	Recognized in any form
Narrative text	Fails — keywords rarely appear	Works well
Well-structured manuals	~70% of LLM quality	Reference quality
Cost	Zero	Paid per token
Latency	Milliseconds	10 s – 2 min
Schema-stability	Total — pure algorithm	High with temperature=0 + function calling, not 100%
Privacy	Local, never leaves the browser	Source text sent to provider
File size ceiling	~10 MB (browser memory)	~960k chars per run (chunked)

Important risk

Hallucination by Composition

A heuristic CKF can be structurally valid and still be semantically wrong — fragments that look related by surface co-occurrence may not actually entail each other. When a RAG pipeline or agent then composes those fragments to answer a question, the model treats them as a coherent argument and hallucinates with extra confidence. This is the failure mode we call Composition Hallucination, and it is the strongest reason to prefer LLM compilation for anything that will feed a downstream agent.

Decision guide

When to use which

Use heuristic when…

• You need a CKF in milliseconds for prototyping.
• The source is well-structured (manual, README, spec).
• You cannot send the text to any third-party provider.
• You want a baseline to compare against an LLM run.
• You are processing thousands of documents and cost matters more than recall.
• You only need retrieval chunks and atomic units, not deep reasoning structure.

Use LLM when…

• The CKF will feed a production RAG or agent (avoid Composition Hallucination).
• The text is narrative, persuasive, or domain-rich.
• You need real concepts, causal chains, anti-patterns and decision rules — not just keywords.
• Synonyms and coreference matter (legal, medical, scientific).
• The source's content cuts across many of the 22 CKF sections in non-obvious ways.
• Quality matters more than latency or cost.
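
If the choice has to be made in code, for example in a batch pipeline, the two lists can be folded into a small function. This is one possible reading of the guide, not a rule the compiler enforces: the `Job` fields, the function name, and above all the precedence (privacy first, then downstream use, then source structure) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    may_leave_device: bool           # can the text be sent to a third-party provider?
    feeds_production_agent: bool     # will a RAG pipeline or agent consume the CKF?
    needs_reasoning_structure: bool  # decision rules, causal chains, anti-patterns
    well_structured: bool            # manual, README, spec


def choose_engine(job: Job) -> str:
    """Return "heuristic" or "llm" for a compilation job."""
    if not job.may_leave_device:
        return "heuristic"  # privacy is treated as a hard constraint
    if job.feeds_production_agent or job.needs_reasoning_structure:
        return "llm"
    if not job.well_structured:
        return "llm"  # narrative text rarely contains the keywords
    return "heuristic"


prototype = Job(may_leave_device=True, feeds_production_agent=False,
                needs_reasoning_structure=False, well_structured=True)
print(choose_engine(prototype))
```

A private source that must also feed a production agent has no good answer in this sketch: it returns "heuristic" because the text cannot leave the device, and the composition risk described above still applies to the result.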
