Concept | June 12, 2026 | 22 min read

CKF Explained at Five Levels: From a 10-Year-Old to an IR Specialist

The same idea — Compiled Knowledge Format — explained five times, each level zooming in: a 10-year-old, a teenager, a non-technical adult, a technical professional, and an Information Retrieval specialist.

CKF: Compiled Knowledge Format

This post exists as a CKF package. Load it into your favorite LLM and discuss, summarize or apply its ideas.

The Compiled Knowledge Format (CKF) is a single idea, but it can be explained at very different zoom levels. This article walks the same concept through five audiences — a 10-year-old, a 15-year-old, a non-technical adult, a technical professional who isn't an AI/RAG specialist, and an Information Retrieval superspecialist. Read only the level you need, or read all five and watch the same shape gain detail.

Level 1 — For a 10-year-old

Imagine you have a huge book about dinosaurs.

If you ask a robot:

"Which dinosaur was a herbivore, had horns, and lived in the Cretaceous period?"

The robot might have to search the whole book, page by page, trying to understand everything on the spot. It can get confused, forget a rule, or mix information up.

CKF is like turning that book into a super-organized card for robots.

Instead of leaving the knowledge scattered like this:

page 3 talks about dinosaurs
page 20 talks about food
page 45 talks about time periods
page 80 talks about horns

CKF organizes everything into little drawers:

Who are the characters? Triceratops, Tyrannosaurus, Stegosaurus.
What are they? Dinosaurs — herbivores, carnivores, big, small.
What are the rules? Herbivores eat plants. Carnivores eat meat.
What are the exceptions? Some animals can behave differently.
What are the steps? First check the dinosaur's type, then its food, then the period.
Where did the information come from? Page 12, paragraph 3 of the book.

So CKF is like a translator from books into something robots can use.

It takes a text made for people and turns it into a format that artificial intelligence can use better.

A simple comparison:

A normal book is like a messy backpack. CKF is that same backpack neatly organized with labels: pencils here, notebook there, snack there, toy there.

That way, when an AI needs to answer, it finds the information faster, makes fewer mistakes, and can say:

"I got this answer from here."

In one sentence: CKF is a way to organize knowledge so that AIs understand it better, answer more carefully, and show where the answer came from.

Level 2 — For a 15-year-old

CKF — Compiled Knowledge Format — is a way to turn regular documents into a more organized format that AIs can safely use.

Today, many AIs work with documents like this: they take a PDF, split it into chunks, and try to answer using the chunks they found.

That works for simple questions, but it breaks down when the document contains rules, exceptions, deadlines, procedures, definitions, and details scattered across many pages.

For example, imagine a contract that says:

"The customer can cancel the service with 30 days' notice."

But several pages later, it says:

"This rule does not apply to annual discounted contracts."

A normal AI might find only the first part and answer wrong. CKF tries to prevent that by organizing the content into categories: rules, exceptions, concepts, procedures, important questions, limits, and sources.

The idea is similar to turning a book or contract into a structured manual for AI.

Instead of handing the AI a pile of text, CKF hands it something like:

what the rule is
what the exception is
what the steps are
which terms were defined
which information is important
where each claim came from
what the document doesn't allow you to answer

CKF is not a new AI. It's a knowledge-organization layer that sits between documents and AI.

A simple comparison:

A PDF is like a drawer full of papers. A common search engine finds a few papers that look like the question. CKF tidies the drawer beforehand, adds labels, and separates rules, exceptions, deadlines, and sources.

That way the AI can answer better, with less risk of inventing or mixing up information — and it can still show where the answer came from.

In short: CKF is a way to prepare documents so that AIs can understand, query, and explain complex information more reliably.

Level 3 — For a non-technical adult

CKF is an intermediate format for turning human documents into a knowledge structure that's easier for AI systems to consume.

It starts from one simple idea: document formats like PDF, DOCX, HTML, and Markdown were made for people to read, not for machines to reason about.

When an AI application has to answer questions using those documents, it usually has to find relevant excerpts and try to assemble an answer from them. That works for simple queries, but starts failing when the content involves:

rules
exceptions
procedures
relations between concepts
preconditions
deadlines
numbers
definitions
validity limits
source traceability

CKF tries to fix this by producing a more structured representation of the document.

The basic idea

Instead of using the document like this:

PDF/DOCX/HTML → extracted text → text chunks → AI answers

CKF proposes:

Original document → compilation → CKF package → AI/RAG/agent consumes it

That CKF package can live in formats like .ckf.json, .ckf.yaml, or .ckf.md.

So the document is turned into a structured artifact, similar to a combination of:

a technical summary
a semantic index
a knowledge base
an entity map
a set of rules
a list of exceptions
procedures
traceable excerpts from the source

What goes into a CKF package

Type of information	Example
Entities	people, companies, products, systems, documents
Concepts	important definitions
Rules	"if X happens, then Y applies"
Exceptions	"this rule does not apply when…"
Procedures	operational steps
Causal chains	"A causes B", "B depends on C"
Atomic units	small, verifiable statements
Retrieval chunks	excerpts prepared for search
Agent instructions	how the AI should use the knowledge
Knowledge limits	what the document does not authorize claiming
Traceability	where each item came from in the original document

Each item carries metadata: source, excerpt, confidence, and source basis (the type of origin of the information, for example whether it is stated explicitly in the source).

A practical example

Imagine a technical manual that says:

"The equipment must be turned off before maintenance."

In another section:

"Except for energized diagnostic procedures, performed only by authorized technicians."

A common system might retrieve only the first excerpt and answer: "Always turn off the equipment before maintenance." That sounds right, but it's incomplete.

In a CKF package this could look like:

decision_rules:
  - id: rule_001
    rule: "The equipment must be turned off before maintenance."
    applies_to: "standard maintenance"
    source: "Manual, section 4.2"

exceptions:
  - id: exception_001
    exception_to: "rule_001"
    condition: "energized diagnostic procedure"
    requirement: "performed only by authorized technician"
    source: "Manual, section 7.1"
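
The exception_to link is what lets software follow a rule to its exceptions instead of hoping both excerpts are retrieved. A minimal sketch, assuming the package has already been parsed into plain Python dictionaries; the function name evidence_for_rule is hypothetical and not part of any CKF tooling, while the field names mirror the example above.

```python
# Hypothetical in-memory form of the package shown above.
package = {
    "decision_rules": [
        {
            "id": "rule_001",
            "rule": "The equipment must be turned off before maintenance.",
            "applies_to": "standard maintenance",
            "source": "Manual, section 4.2",
        }
    ],
    "exceptions": [
        {
            "id": "exception_001",
            "exception_to": "rule_001",
            "condition": "energized diagnostic procedure",
            "requirement": "performed only by authorized technician",
            "source": "Manual, section 7.1",
        }
    ],
}


def evidence_for_rule(pkg, rule_id):
    """Return a rule together with every exception that points at it."""
    rule = next(r for r in pkg["decision_rules"] if r["id"] == rule_id)
    linked = [e for e in pkg["exceptions"] if e["exception_to"] == rule_id]
    return {"rule": rule, "exceptions": linked}


evidence = evidence_for_rule(package, "rule_001")
sources = [evidence["rule"]["source"]] + [e["source"] for e in evidence["exceptions"]]
print(sources)  # ['Manual, section 4.2', 'Manual, section 7.1']
```

Both sources reach the model together, so the rule is never presented without its exception.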

Now an AI is much more likely to answer:

"In standard maintenance, the equipment must be turned off. The exception is energized diagnostic procedures, which may only be performed by authorized technicians."

How it differs from regular search

Approach	How it works
Text search	matches words or phrases
Vector search	finds semantically similar excerpts
Traditional RAG	retrieves excerpts and passes them to the model
CKF	pre-structures the knowledge into entities, rules, exceptions, procedures, and sources

CKF can be used alongside RAG. It doesn't have to replace it.

Why it matters

In enterprise applications, many AI errors don't happen because the information is missing. They happen because the AI:

found only part of the rule
missed an exception
mixed two contexts
applied a rule outside its scope
ignored a limit
cited the wrong source
invented a link between real facts

This kind of error is sometimes called a composition failure. CKF tries to reduce it by making the structure of the knowledge explicit before the answer is produced.

What CKF is not

not an AI model
not a vector database
not a chatbot
not a fine-tuning method
not a complete knowledge graph
not a complete RAG system
not a database substitute
not a tool protocol

It's a knowledge representation format for compiled knowledge, a layer between documents and intelligent systems:

Human documents → CKF → RAG / agents / chatbots / MCP / internal systems

A simple technical analogy

Think of a document as source code. The PDF is the code written for humans to read. CKF is a compiled/intermediate version, with structure made explicit for execution by systems.

Original document = source
CKF compiler      = extraction + structuring
CKF package       = intermediate representation
Agent / RAG       = runtime that consumes the knowledge

In one sentence: CKF is a format for compiling documents into structured, traceable, reusable knowledge so AI systems can answer with less error, more context, and better source attribution.

Level 5 — For an Information Retrieval superspecialist

CKF — Compiled Knowledge Format — is a document-to-knowledge intermediate representation for IR, RAG, and agent pipelines. Its goal is to shift part of the semantic, structural, and operational interpretation from query time to index/compile time.

Instead of treating a document only as a sequence of tokens, pages, passages, or vector chunks, CKF produces a versioned, auditable intermediate artifact containing multiple views of the same document:

retrievable passages
atomic units
entities
concepts
rules
exceptions
procedures
causal chains
agent instructions
knowledge limits
provenance / source traceability
confidence / source basis

In IR terms, CKF does not replace the index, the retriever, or the ranker. It redefines the indexable object.

CKF is an intermediate representation of compiled knowledge that turns human documents into semi-structured, multi-granular, provenance-aware objects, optimized for retrieval, composition, grounding, and agentic use.

1. Where CKF fits in the IR/RAG stack

A conventional RAG pipeline:

Document → parsing → chunking → embedding/sparse indexing → retrieval → reranking → LLM synthesis

With CKF:

Document
  → parsing
  → semantic compilation
  → CKF package
  → indexing of typed knowledge objects
  → retrieval / reranking / composition / generation

Raw document chunks are no longer the only retrieval unit. The system can also retrieve atomic_unit, decision_rule, exception, procedure_step, entity, concept, qa_pair, retrieval_chunk, knowledge_limit, agent_instruction, source_trace. Each item keeps an explicit link to its documentary origin.

2. CKF as a change of documentary unit

Classic IR units include document / field / paragraph / passage / sentence / chunk / entity / graph node / triple / table / cell / visual block / generated expansion. CKF proposes a hybrid unit: a typed, provenance-linked knowledge object.

A CKF item is not just a textual passage — it can be an abstraction derived from the document while still anchored in the source:

decision_rules:
  - id: dr_014
    statement: "Customers on annual discounted contracts cannot cancel without penalty before the term ends."
    applies_to: ["annual contract", "discounted contract"]
    conditions: ["term still active"]
    exceptions: ["termination due to provider breach"]
    source_basis: explicit
    confidence: 0.92
    source_trace:
      document_id: "contract_policy_v3"
      section: "4.2"
      span: "..."

For IR, this is an enriched passage with typing, normalization, provenance, conditions, and pragmatic function.
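
One way to picture such an object in code is a small typed record. The sketch below is illustrative only: the class names and the confidence range check are assumptions, and the field names and values are copied from the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SourceTrace:
    document_id: str
    section: str
    span: str  # left elided ("...") in the example above


@dataclass(frozen=True)
class DecisionRule:
    id: str
    statement: str
    source_basis: str
    confidence: float
    source_trace: SourceTrace
    applies_to: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)
    exceptions: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Assumed sanity check: confidence is treated as a score in [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


rule = DecisionRule(
    id="dr_014",
    statement="Customers on annual discounted contracts cannot cancel "
              "without penalty before the term ends.",
    source_basis="explicit",
    confidence=0.92,
    source_trace=SourceTrace("contract_policy_v3", "4.2", "..."),
    applies_to=["annual contract", "discounted contract"],
    conditions=["term still active"],
    exceptions=["termination due to provider breach"],
)
print(rule.source_trace.section)  # 4.2
```

Keeping the source trace as a required field means an object without provenance cannot be constructed at all, which matches the idea that provenance is part of the retrievable unit.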

3. Relation to passage retrieval

CKF tries to improve passage retrieval along three axes.

3.1 Granularity. Traditional chunks are often arbitrary (fixed size, sliding window, paragraph, recursive splitting). CKF adds semantic granularity: one rule, one exception, one condition, one definition, one step, one constraint, one atomic unit, one operational instruction. This reduces the chance the retriever brings "nearby text" without the structure needed to answer.

3.2 Discursive function. Two semantically close excerpts may play different roles (definition, rule, exception, example, warning, procedure, motivation, recommendation, limit). CKF makes the role explicit and enables type-conditioned retrieval:

query: "can I cancel an annual discounted contract?"
retrieve:
  - decision_rules
  - exceptions
  - knowledge_limits
  - relevant source chunks

Not just the most similar chunks.
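
A toy sketch of type-conditioned retrieval, under loud assumptions: the intent labels, the keyword classifier, and the word-overlap score are stand-ins invented for illustration (a real system would use a trained classifier and a proper retriever), and the sample objects are made up. It also leaves out the "relevant source chunks" part of the list above.

```python
# Hypothetical mapping from query intent to the CKF sections worth searching.
INTENT_TO_TYPES = {
    "permission": ["decision_rules", "exceptions", "knowledge_limits"],
    "how_to": ["procedures"],
    "definition": ["concepts"],
}


def classify_intent(query):
    """Toy keyword classifier; a real system would use a trained model."""
    q = query.lower()
    if q.startswith(("can i", "may i", "am i allowed")):
        return "permission"
    if q.startswith("how "):
        return "how_to"
    return "definition"


def overlap(query, text):
    """Word-overlap score, used here as a stand-in for semantic similarity."""
    q = set(query.lower().replace("?", "").split())
    t = set(text.lower().replace(".", "").split())
    return len(q & t)


def retrieve(query, objects, k=3):
    """Filter by object type first, then rank the survivors by similarity."""
    wanted = INTENT_TO_TYPES[classify_intent(query)]
    candidates = [o for o in objects if o["type"] in wanted]
    ranked = sorted(candidates, key=lambda o: overlap(query, o["text"]), reverse=True)
    return ranked[:k]


# Made-up sample objects for illustration only.
objects = [
    {"type": "decision_rules",
     "text": "Annual discounted contracts cannot be cancelled without penalty."},
    {"type": "exceptions",
     "text": "Cancellation is allowed after provider breach."},
    {"type": "concepts",
     "text": "An annual discounted contract is billed yearly at a reduced price."},
]
hits = retrieve("Can I cancel an annual discounted contract?", objects)
print([h["type"] for h in hits])  # ['decision_rules', 'exceptions']
```

The exception is returned even though it shares no words with the query, because the object type, not just similarity, decides what is eligible.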

3.3 Provenance and auditability. In ordinary RAG, provenance is reconstructed from the retrieved chunk. In CKF, provenance is part of the retrievable object, enabling answer grounding, citation-aware generation, source-constrained decoding, confidence-weighted retrieval, human audit, contradiction analysis, and trace-based evaluation.

4. CKF and composition hallucination

A central problem CKF targets isn't hallucination in the unsupported-generation sense — it's composition error: the system retrieves true facts but combines them into a false, incomplete, or out-of-scope answer.

Classic case: rule A is true, exception B is true, condition C scopes A — but the model applies A without B or outside C. That's a failure in the chain retrieval → evidence selection → evidence composition → answer synthesis.

CKF tries to reduce that failure by making explicit the objects normally left implicit in text: rule, exception_to(rule), condition, scope, applicability, source_basis. Part of the composition becomes retrievable structure, not just generation-time inference.
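
The rule A / exception B / condition C case can be written as a small check. This is a sketch under stated assumptions: the function name resolve, the three verdict strings, and the representation of a situation as a set of fact strings are all hypothetical, chosen only to show composition happening over structure rather than inside the model.

```python
def resolve(rule, exceptions, facts):
    """Decide how a rule bears on a situation described by a set of facts.

    Returns "out_of_scope" when a scoping condition is not met,
    "exception" when a linked exception is triggered, otherwise "applies".
    """
    if not all(cond in facts for cond in rule["conditions"]):
        return "out_of_scope"
    for exc in exceptions:
        if exc["exception_to"] == rule["id"] and exc["condition"] in facts:
            return "exception"
    return "applies"


# Rule A with scoping condition C, and exception B linked to A.
rule_a = {"id": "dr_014", "conditions": ["term still active"]}
exception_b = {"exception_to": "dr_014",
               "condition": "termination due to provider breach"}

print(resolve(rule_a, [exception_b], {"term still active"}))  # applies
print(resolve(rule_a, [exception_b],
              {"term still active", "termination due to provider breach"}))  # exception
print(resolve(rule_a, [exception_b], set()))  # out_of_scope
```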

5. CKF vs. query/document expansion

Query and document expansion techniques (doc2query, SPLADE-style expansion, pseudo-relevance feedback) improve query–document matchability. CKF can improve matchability too, but its main goal is broader:

Technique	Primary goal
Query expansion	improve query recall
Document expansion	improve query–document match
Entity extraction	identify entities
KG extraction	build nodes/relations
Summarization	compress content
Passage chunking	create retrieval units
CKF	create a typed, multi-granular, provenance-aware, operationally useful intermediate representation

CKF can include expansions, but also normative, procedural, causal, and restrictive objects.

6. CKF vs. knowledge graph

CKF is not just a KG. A KG emphasizes (entity) -[relation]-> (entity). CKF may contain relations but also includes structures not naturally expressed as simple triples: procedures, exceptions, conditional rules, playbooks, anti-patterns, Q&A, retrieval chunks, agent instructions, knowledge limits, source traces, confidence, source basis.

CKF is closer to a document-grounded knowledge package than to a canonical graph. It can feed a KG without requiring all knowledge to be normalized as triples.

7. CKF vs. GraphRAG

GraphRAG improves RAG using graph structure, entities, communities, and hierarchical summarization. CKF is complementary. GraphRAG models global structure, inter-document relations, and semantic communities. CKF models the functional decomposition of a document (or corpus) into actionable knowledge types.

A possible architecture:

Documents
  → CKF compilation
  → entity/rule/procedure extraction
  → graph construction
  → graph/community retrieval
  → CKF object retrieval
  → grounded answer synthesis

CKF can supply cleaner raw material for GraphRAG.

8. CKF as late-binding structure

The same CKF package can be indexed in multiple indexes simultaneously:

BM25 / sparse index over text fields
dense embeddings over statements
hybrid index over retrieval_chunks
symbolic index over entities/rules/exceptions
graph index over references
metadata filters over source_basis/confidence/domain
temporal index over validity/effective dates

CKF separates knowledge representation from retrieval implementation, enabling late binding: different applications choose how to index and retrieve the same objects.
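
Late binding can be sketched as several independent index builders reading the same list of objects. The builder names are hypothetical, the whitespace tokenizer is a deliberate simplification, and the confidence value for au_102 is invented for the example.

```python
from collections import defaultdict

# The same CKF objects feed every index below.
objects = [
    {"id": "dr_014", "type": "decision_rule", "confidence": 0.92,
     "statement": "Annual discounted contracts cannot cancel without penalty"},
    {"id": "au_102", "type": "atomic_unit", "confidence": 0.80,  # invented value
     "statement": "The standard cancellation notice is 30 days"},
]


def build_sparse_index(objs):
    """Inverted index: term -> set of object ids (stand-in for BM25 postings)."""
    index = defaultdict(set)
    for o in objs:
        for term in o["statement"].lower().split():
            index[term].add(o["id"])
    return index


def build_type_index(objs):
    """Symbolic index: object type -> list of object ids."""
    index = defaultdict(list)
    for o in objs:
        index[o["type"]].append(o["id"])
    return index


sparse = build_sparse_index(objects)
by_type = build_type_index(objects)
high_confidence = [o["id"] for o in objects if o["confidence"] >= 0.9]

print(sparse["cancellation"])      # {'au_102'}
print(by_type["decision_rule"])    # ['dr_014']
print(high_confidence)             # ['dr_014']
```

Each application can pick whichever of these views it needs without the package itself changing.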

9. Evaluation

Intrinsic — compilation quality: schema validity, source-trace accuracy, entity/concept coverage, rule extraction precision/recall, exception linkage accuracy, numeric/date preservation, atomicity, dedup quality, contradiction detection, source_basis calibration, confidence calibration.

Extrinsic — downstream impact: answer correctness, citation correctness, groundedness, faithfulness, evidence recall/precision, multi-hop success, exception handling, procedural accuracy, refusal when unsupported, token efficiency, latency/cost, robustness under context budget constraints.

The CKF thesis is only interesting if extrinsic gains exist. Intrinsic evaluation is necessary but not sufficient.
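
As one concrete intrinsic metric, rule extraction precision and recall can be computed by comparing compiler output with a hand-labelled gold set. A minimal sketch, assuming extracted and gold rules have already been aligned to shared identifiers (that alignment step is the hard part and is not shown); the ids are made up.

```python
def precision_recall(extracted_ids, gold_ids):
    """Set-based precision and recall over aligned rule identifiers."""
    extracted, gold = set(extracted_ids), set(gold_ids)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up ids: the compiler produced four rules, annotators marked three.
extracted = ["r1", "r2", "r3", "r4"]
gold = ["r1", "r2", "r5"]

precision, recall = precision_recall(extracted, gold)
print(precision)          # 0.5
print(round(recall, 3))   # 0.667
```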

10. CKF and relevance

CKF lets you enrich relevance from semantic_similarity(query, chunk) to:

relevance(query, object) =
  semantic similarity
  + object_type compatibility
  + source reliability
  + source_basis
  + confidence
  + recency/version
  + applicability conditions
  + exception linkage
  + citation availability

For "Can I do X?" a decision_rule and its exceptions may outrank a high-similarity narrative paragraph. For "How do I do X?" procedure_steps come first. For "What does X mean?" concepts / definitions. For "When should I not do X?" exceptions / anti_patterns / knowledge_limits.
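
The enriched relevance above can be sketched as a weighted sum. Everything numeric here is an assumption: the weights, the intent labels, and the similarity scores are invented to illustrate the ranking behavior, and only a subset of the listed signals is included.

```python
# Hypothetical weights; a real system would tune or learn these.
WEIGHTS = {"similarity": 1.0, "type_match": 0.5, "confidence": 0.3, "citation": 0.2}

# Hypothetical intent labels mapped to the object types they favor.
PREFERRED_TYPES = {
    "can_i": {"decision_rule", "exception"},
    "how_do_i": {"procedure_step"},
    "what_is": {"concept"},
    "when_not": {"exception", "anti_pattern", "knowledge_limit"},
}


def relevance(obj, similarity, intent):
    """Combine similarity with type compatibility, confidence, and citability."""
    type_match = 1.0 if obj["type"] in PREFERRED_TYPES.get(intent, set()) else 0.0
    citation = 1.0 if obj.get("source_trace") else 0.0
    return (WEIGHTS["similarity"] * similarity
            + WEIGHTS["type_match"] * type_match
            + WEIGHTS["confidence"] * obj.get("confidence", 0.0)
            + WEIGHTS["citation"] * citation)


rule = {"type": "decision_rule", "confidence": 0.92,
        "source_trace": {"section": "4.2"}}
narrative = {"type": "retrieval_chunk", "confidence": 0.92,
             "source_trace": {"section": "1.1"}}

rule_score = relevance(rule, similarity=0.70, intent="can_i")
narrative_score = relevance(narrative, similarity=0.85, intent="can_i")
print(rule_score > narrative_score)  # True
```

With these made-up numbers the rule outranks the narrative paragraph despite its lower similarity, which is the behavior described above for "Can I do X?" queries.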

11. Retrieval over CKF

1. Query understanding
   - classify intent: definition, rule, exception, procedure, troubleshooting, factual, unsupported check
2. Candidate generation
   - hybrid retrieval over retrieval_chunks and atomic_units
   - filtered retrieval over object types
   - entity linking
   - source metadata filtering
3. Evidence assembly
   - include base rule
   - include linked exceptions
   - include procedure dependencies
   - include source spans
   - include knowledge limits
4. Reranking
   - semantic relevance + object-type relevance + provenance quality + confidence + contradiction risk + constraint coverage
5. Generation
   - answer with assembled evidence; cite source_trace; abstain if unsupported

The main departure from traditional RAG is step 3: evidence assembly no longer depends only on top-k chunks.

12. CKF as context control

CKF allows leaner, more precise context budgets:

base rule:         80 tokens
exception:         60 tokens
procedure step:    50 tokens
source span:      120 tokens
knowledge limit:   40 tokens

That totals 350 tokens, instead of four 800-token chunks (3,200 tokens). This can improve context precision, faithfulness, cost, latency, and controllability.
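
A small sketch of budgeted context assembly using the token counts listed above. The greedy strategy and the function name pack are assumptions; the evidence is assumed to arrive already sorted by priority.

```python
# Token counts taken from the list above, in assumed priority order.
evidence = [
    {"kind": "base rule", "tokens": 80},
    {"kind": "exception", "tokens": 60},
    {"kind": "procedure step", "tokens": 50},
    {"kind": "source span", "tokens": 120},
    {"kind": "knowledge limit", "tokens": 40},
]


def pack(items, budget):
    """Greedy packing: walk items in priority order, keep each one that fits."""
    chosen, used = [], 0
    for item in items:
        if used + item["tokens"] <= budget:
            chosen.append(item["kind"])
            used += item["tokens"]
    return chosen, used


print(pack(evidence, 400))
# (['base rule', 'exception', 'procedure step', 'source span', 'knowledge limit'], 350)
print(pack(evidence, 250))
# (['base rule', 'exception', 'procedure step', 'knowledge limit'], 230)
```

Under the tighter budget the long source span is dropped while the rule, its exception, and the knowledge limit survive, a choice that is hard to make when the only unit available is an 800-token chunk.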

13. Atomic units

Atomic units act as small, indexable, verifiable propositions:

atomic_units:
  - id: au_102
    statement: "The standard cancellation notice is 30 days."
    source_basis: explicit
    source_trace: ...

This brings CKF close to proposition-level retrieval, claim extraction, semantic units, evidence atoms, decontextualized sentence retrieval, fact indexing — with the difference that atomic units coexist with higher-level objects like rules and procedures.

14. Passage decontextualization

Many chunks suffer because they depend on prior context:

"In those cases, the deadline is reduced to 10 days."

In isolation that's almost useless. During compilation CKF may decontextualize:

"For cancellations due to provider failure, the notice period is reduced to 10 days."

while keeping source_trace to the original. Decontextualization improves retrieval but raises distortion risk — hence the need for source_basis, confidence, trace, validation, numeric guards, and a human audit path.
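
A numeric guard is one of the simpler validations to picture. The sketch below is an assumed implementation, not the CKF compiler's: it only checks that every number in the rewritten statement also appears in the original span.

```python
import re


def numbers(text):
    """Extract integer and decimal literals as strings."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def numeric_guard(rewritten, source_span):
    """Pass only if every number in the rewrite also occurs in the source span."""
    return numbers(rewritten) <= numbers(source_span)


source_span = "In those cases, the deadline is reduced to 10 days."
good = "For cancellations due to provider failure, the notice period is reduced to 10 days."
bad = "For cancellations due to provider failure, the notice period is reduced to 15 days."

print(numeric_guard(good, source_span))  # True
print(numeric_guard(bad, source_span))   # False
```

The guard catches altered figures but not a wrongly resolved "those cases", which is why it sits alongside source_trace, confidence, and human audit rather than replacing them.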

15. Multi-vector / fielded indexing

CKF is naturally compatible with fielded retrieval. One object can produce separate embeddings for statement, normalized statement, source excerpt, entity list, conditions, exceptions, domain tags, generated questions, and procedural title. That enables multi-vector retrieval per object, in the spirit of ColBERT-style multi-vector or field-aware dense retrieval, but at the typed-knowledge level.

16. Sparse retrieval

Because content is decontextualized and normalized, BM25 effectiveness can improve: objects contain explicit terms that were previously implicit in headings, prior sections, or visual context. Fields like aliases, entities, concepts, conditions, and generated_questions act as controlled document expansion.

17. Hybrid

A good CKF retriever is likely hybrid: BM25/SPLADE over textual fields, dense retrieval over statements/source spans, metadata filtering over type/source/version, graph traversal over linked rules/exceptions/procedures, reranking with a cross-encoder or LLM judge.

18. Versioning

Documents change. A CKF package therefore records versioning fields: document_id, document_hash, package_version, compiler_version, schema_version, item_id, source_span, validity, and effective date. A CKF index can then filter by effective_date <= now, superseded = false, domain = "compliance", document_version = latest. This matters for IR because "which policy is in force today?" is not only a semantic question; it is also temporal and normative.
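
The "in force today" filter can be sketched with plain date comparisons. The item ids, dates, and the function name in_force are invented for the example, and the document_version = latest condition is left out for brevity.

```python
from datetime import date

# Made-up items: one superseded, one current, one not yet effective.
items = [
    {"id": "policy_v2_rule", "domain": "compliance",
     "effective_date": date(2024, 1, 1), "superseded": True},
    {"id": "policy_v3_rule", "domain": "compliance",
     "effective_date": date(2025, 1, 1), "superseded": False},
    {"id": "policy_v4_rule", "domain": "compliance",
     "effective_date": date(2030, 1, 1), "superseded": False},
]


def in_force(candidates, today, domain):
    """Keep items that are effective, not superseded, and in the given domain."""
    return [
        item["id"]
        for item in candidates
        if item["domain"] == domain
        and item["effective_date"] <= today
        and not item["superseded"]
    ]


current = in_force(items, date(2026, 6, 12), "compliance")
print(current)  # ['policy_v3_rule']
```

Applying this filter before similarity ranking keeps an outdated but well-matching rule from reaching the model at all.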

19. Abstention

CKF can improve correct non-answers. knowledge_limits can explicitly represent: the document does not cover X, the rule does not apply to Y, the source is ambiguous, the information depends on external policy, the version is insufficient, documents conflict. Ordinary RAG tends to fill gaps; CKF provides retrievable objects supporting abstention or conditioned answers.
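
A sketch of how a retrievable knowledge limit could drive abstention. The topics field and the function name answer_or_abstain are hypothetical, and matching a query to a topic by exact string is a simplification; the sample limit is made up.

```python
def answer_or_abstain(topic, evidence, knowledge_limits):
    """Abstain when a knowledge limit covers the topic or no evidence exists."""
    for limit in knowledge_limits:
        if topic in limit["topics"]:
            return "abstain", limit["statement"]
    if not evidence:
        return "abstain", "No supporting evidence was retrieved."
    return "answer", evidence[0]["statement"]


# Made-up limit; the evidence statement is the atomic unit from section 13.
knowledge_limits = [
    {"topics": ["refund amounts"],
     "statement": "The document does not cover refund amounts."}
]
evidence = [{"statement": "The standard cancellation notice is 30 days."}]

print(answer_or_abstain("refund amounts", evidence, knowledge_limits))
# ('abstain', 'The document does not cover refund amounts.')
print(answer_or_abstain("cancellation notice", evidence, knowledge_limits))
# ('answer', 'The standard cancellation notice is 30 days.')
```

The limit wins even though some evidence was retrieved, which is the difference from a pipeline that fills the gap with whatever chunk scored highest.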

20. Agentic IR

For agents, CKF works as a cognitive resource pack: facts, rules, procedures, constraints, allowed actions, escalation paths, source references, tool instructions, decision boundaries. That is different from handing the agent raw chunks. CKF is especially relevant when IR terminates not in answer generation but in actions.

21. Honest technical critique

Objection 1: LLM extraction can introduce error. True. CKF trades one problem for another: part of the generation error migrates to compile time. Mitigation: mandatory source trace, confidence, source_basis, validators, numeric guards, human review, intrinsic evaluation, compiler versioning.

Objection 2: recall loss. Yes. If compilation omits a rule or exception, downstream is limited. Mitigation: keep retrieval_chunks alongside structured objects, also index original text, measure coverage, fall back to source retrieval, detect low coverage.

Objection 3: schema too rigid. Possible, especially across heterogeneous domains. Mitigation: extensibility, optional fields, domain-specific sections, schema versioning, custom object types.

Objection 4: compilation cost. Yes. CKF only pays off when the document is reused, audited, or queried many times — or when errors are costly. Not ideal for disposable, simple, or low-criticality content.

22. Where CKF tends to be strong

Long documents, standards, policies, contracts, manuals, procedures, compliance, exceptions, multi-version corpora, high error cost, citation/audit requirements, decision- or workflow-executing agents.

23. Where CKF tends to be weak

Simple factual questions, short documents, ephemeral content, broad exploratory search, raw-recall-dominant tasks, very noisy corpora without good extraction, free-generation-acceptable tasks, scenarios with no reuse of the compiled package.

24. Research framing

The working hypothesis: explicit typed knowledge objects with source-grounded provenance, generated at indexing time from human documents, improve downstream answer faithfulness, exception handling, and compositional correctness under constrained context budgets compared to passage-only RAG.

The relevant experiment isn't "CKF looks more organized" but: does CKF improve evidence recall/precision and answer faithfulness for compositional, procedural, and exception-sensitive queries — controlling for same documents, model, context budget, query set, evaluation rubric, and comparable retrieval infrastructure?

25. Specialist summary

CKF is best understood as an intermediate IR layer — document-grounded, typed, provenance-aware, multi-granular — for knowledge-intensive NLP. It turns documents into retrievable objects richer than chunks and less rigid than a full KG.

It promises better evidence selection, evidence assembly, grounding, citation, exception handling, procedural correctness, abstention, context efficiency, and agentic use. Its biggest weakness is compilation quality — so CKF's validity depends on rigorous intrinsic evaluation and, above all, on extrinsic gains in downstream tasks.

In one technical sentence: CKF is a semantic document compilation layer that produces typed, source-grounded, retrieval-ready knowledge objects to improve retrieval-augmented reasoning and reduce composition errors in document-based AI systems.

Closing

The document is not the knowledge. The knowledge has to be compiled. CKF is the same idea, whether you tell it to a child with a dinosaur book or to a researcher with a retrieval stack — it just gains resolution as the audience changes.
