CKF Explained at Five Levels: From a 10-Year-Old to an IR Specialist
The same idea — Compiled Knowledge Format — explained five times, each level zooming in: a 10-year-old, a teenager, a non-technical adult, a technical professional, and an Information Retrieval specialist.
Talk to this article
This post exists as a CKF package. Load it into your favorite LLM and discuss, summarize or apply its ideas.
The Compiled Knowledge Format (CKF) is a single idea, but it can be explained at very different zoom levels. This article walks the same concept through five audiences — a 10-year-old, a 15-year-old, a non-technical adult, a technical professional who isn't an AI/RAG specialist, and an Information Retrieval superspecialist. Read only the level you need, or read all five and watch the same shape gain detail.
Level 1 — For a 10-year-old
Imagine you have a huge book about dinosaurs.
If you ask a robot:
"Which dinosaur was a herbivore, had horns, and lived in the Cretaceous period?"
The robot might have to search the whole book, page by page, trying to understand everything on the spot. It can get confused, forget a rule, or mix information up.
CKF is like turning that book into a super-organized card for robots.
Instead of leaving the knowledge scattered like this:
- page 3 talks about dinosaurs
- page 20 talks about food
- page 45 talks about time periods
- page 80 talks about horns
CKF organizes everything into little drawers:
- Who are the characters? Triceratops, Tyrannosaurus, Stegosaurus.
- What are they? Dinosaurs — herbivores, carnivores, big, small.
- What are the rules? Herbivores eat plants. Carnivores eat meat.
- What are the exceptions? Some animals can behave differently.
- What are the steps? First check the dinosaur's type, then its food, then the period.
- Where did the information come from? Page 12, paragraph 3 of the book.
So CKF is like a translator from books into something robots can use.
It takes a text made for people and turns it into a format that artificial intelligence can use better.
A simple comparison:
A normal book is like a messy backpack. CKF is that same backpack neatly organized with labels: pencils here, notebook there, snack there, toy there.
That way, when an AI needs to answer, it finds the information faster, makes fewer mistakes, and can say:
"I got this answer from here."
In one sentence: CKF is a way to organize knowledge so that AIs understand it better, answer more carefully, and show where the answer came from.
Level 2 — For a 15-year-old
CKF — Compiled Knowledge Format — is a way to turn regular documents into a more organized format that AIs can safely use.
Today, many AIs work with documents like this: they take a PDF, split it into chunks, and try to answer using the chunks they found.
That works for simple questions, but it breaks down when the document contains rules, exceptions, deadlines, procedures, definitions, and details scattered across many pages.
For example, imagine a contract that says:
"The customer can cancel the service with 30 days' notice."
But several pages later, it says:
"This rule does not apply to annual discounted contracts."
A normal AI might find only the first part and give the wrong answer. CKF tries to prevent that by organizing the content into categories: rules, exceptions, concepts, procedures, important questions, limits, and sources.
The idea is similar to turning a book or contract into a structured manual for AI.
Instead of handing the AI a pile of text, CKF hands it something like:
- what the rule is
- what the exception is
- what the steps are
- which terms were defined
- which information is important
- where each claim came from
- what the document doesn't allow you to answer
CKF is not a new AI. It's a knowledge-organization layer that sits between documents and AI.
A simple comparison:
A PDF is like a drawer full of papers. An ordinary search engine finds a few papers that look like the question. CKF tidies the drawer beforehand, adds labels, and separates rules, exceptions, deadlines, and sources.
That way the AI can answer better, with less risk of inventing or mixing up information — and it can still show where the answer came from.
In short: CKF is a way to prepare documents so that AIs can understand, query, and explain complex information more reliably.
Level 3 — For a non-technical adult
CKF is an intermediate format for turning human documents into a knowledge structure that's easier for AI systems to consume.
It starts from one simple idea: document formats like PDF, DOCX, HTML, and Markdown were made for people to read, not for machines to reason over.
When an AI application has to answer questions using those documents, it usually has to find relevant excerpts and try to assemble an answer from them. That works for simple queries, but starts failing when the content involves:
- rules
- exceptions
- procedures
- relations between concepts
- preconditions
- deadlines
- numbers
- definitions
- validity limits
- source traceability
CKF tries to fix this by producing a more structured representation of the document.
The basic idea
Instead of using the document like this:
PDF/DOCX/HTML → extracted text → text chunks → AI answers
CKF proposes:
Original document → compilation → CKF package → AI/RAG/agent consumes it
That CKF package can live in formats like .ckf.json, .ckf.yaml, or .ckf.md.
So the document is turned into a structured artifact, similar to a combination of:
- a technical summary
- a semantic index
- a knowledge base
- an entity map
- a set of rules
- a list of exceptions
- procedures
- traceable excerpts from the source
What goes into a CKF package
| Type of information | Example |
|---|---|
| Entities | people, companies, products, systems, documents |
| Concepts | important definitions |
| Rules | "if X happens, then Y applies" |
| Exceptions | "this rule does not apply when…" |
| Procedures | operational steps |
| Causal chains | "A causes B", "B depends on C" |
| Atomic units | small, verifiable statements |
| Retrieval chunks | excerpts prepared for search |
| Agent instructions | how the AI should use the knowledge |
| Knowledge limits | what the document does not authorize claiming |
| Traceability | where each item came from in the original document |
Each item carries metadata: its source, the supporting excerpt, a confidence score, and the basis for the information (for example, whether it was stated explicitly in the source).
A practical example
Imagine a technical manual that says:
"The equipment must be turned off before maintenance."
In another section:
"Except for energized diagnostic procedures, performed only by authorized technicians."
A typical system might retrieve only the first excerpt and answer: "Always turn off the equipment before maintenance." That sounds right, but it's incomplete.
In a CKF package this could look like:
decision_rules:
  - id: rule_001
    rule: "The equipment must be turned off before maintenance."
    applies_to: "standard maintenance"
    source: "Manual, section 4.2"
exceptions:
  - id: exception_001
    exception_to: "rule_001"
    condition: "energized diagnostic procedure"
    requirement: "performed only by authorized technician"
    source: "Manual, section 7.1"
Now an AI is much more likely to answer:
"In standard maintenance, the equipment must be turned off. The exception is energized diagnostic procedures, which may only be performed by authorized technicians."
How it differs from regular search
| Approach | How it works |
|---|---|
| Text search | matches words or phrases |
| Vector search | finds semantically similar excerpts |
| Traditional RAG | retrieves excerpts and passes them to the model |
| CKF | pre-structures the knowledge into entities, rules, exceptions, procedures, and sources |
CKF can be used alongside RAG. It doesn't have to replace it.
Why it matters
In enterprise applications, many AI errors don't happen because the information is missing. They happen because the AI:
- found only part of the rule
- missed an exception
- mixed two contexts
- applied a rule outside its scope
- ignored a limit
- cited the wrong source
- invented a link between real facts
This kind of error is sometimes called a composition failure. CKF tries to reduce it by making the structure of the knowledge explicit before the answer is produced.
What CKF is not
- not an AI model
- not a vector database
- not a chatbot
- not a fine-tuning method
- not a complete knowledge graph
- not a complete RAG system
- not a database substitute
- not a tool protocol
It's a knowledge representation format for compiled knowledge, a layer between documents and intelligent systems:
Human documents → CKF → RAG / agents / chatbots / MCP / internal systems
A simple technical analogy
Think of a document as source code. The PDF is the code written for humans to read. CKF is a compiled/intermediate version, with structure made explicit for execution by systems.
Original document = source
CKF compiler = extraction + structuring
CKF package = intermediate representation
Agent / RAG = runtime that consumes the knowledge
In one sentence: CKF is a format for compiling documents into structured, traceable, reusable knowledge so AI systems can answer with less error, more context, and better source attribution.
Level 5 — For an Information Retrieval superspecialist
CKF — Compiled Knowledge Format — is a document-to-knowledge intermediate representation for IR, RAG, and agent pipelines. Its goal is to shift part of the semantic, structural, and operational interpretation from query time to index/compile time.
Instead of treating a document only as a sequence of tokens, pages, passages, or vector chunks, CKF produces a versioned, auditable intermediate artifact containing multiple views of the same document:
- retrievable passages
- atomic units
- entities
- concepts
- rules
- exceptions
- procedures
- causal chains
- agent instructions
- knowledge limits
- provenance / source traceability
- confidence / source basis
In IR terms, CKF does not replace the index, the retriever, or the ranker. It redefines the indexable object.
CKF is an intermediate representation of compiled knowledge that turns human documents into semi-structured, multi-granular, provenance-aware objects, optimized for retrieval, composition, grounding, and agentic use.
1. Where CKF fits in the IR/RAG stack
A conventional RAG pipeline:
Document → parsing → chunking → embedding/sparse indexing → retrieval → reranking → LLM synthesis
With CKF:
Document
→ parsing
→ semantic compilation
→ CKF package
→ indexing of typed knowledge objects
→ retrieval / reranking / composition / generation
Raw document chunks are no longer the only retrieval unit. The system can also retrieve atomic_unit, decision_rule, exception, procedure_step, entity, concept, qa_pair, retrieval_chunk, knowledge_limit, agent_instruction, source_trace. Each item keeps an explicit link to its documentary origin.
2. CKF as a change of documentary unit
Classic IR units include document / field / paragraph / passage / sentence / chunk / entity / graph node / triple / table / cell / visual block / generated expansion. CKF proposes a hybrid unit: a typed, provenance-linked knowledge object.
A CKF item is not just a textual passage — it can be an abstraction derived from the document while still anchored in the source:
decision_rules:
  - id: dr_014
    statement: "Customers on annual discounted contracts cannot cancel without penalty before the term ends."
    applies_to: ["annual contract", "discounted contract"]
    conditions: ["term still active"]
    exceptions: ["termination due to provider breach"]
    source_basis: explicit
    confidence: 0.92
    source_trace:
      document_id: "contract_policy_v3"
      section: "4.2"
      span: "..."
For IR, this is an enriched passage with typing, normalization, provenance, conditions, and pragmatic function.
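To make the shape of such an object concrete, here is a minimal Python sketch of the dr_014 item above as an in-memory record. The class names `SourceTrace` and `KnowledgeObject` and the `citation` helper are illustrative assumptions, not part of any published CKF schema; only the field names come from the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceTrace:
    """Pointer back to the place in the source document that supports an item."""
    document_id: str
    section: str
    span: str  # verbatim excerpt; left elided here, as in the example above


@dataclass
class KnowledgeObject:
    """Hypothetical in-memory shape for one typed CKF item."""
    id: str
    object_type: str  # e.g. "decision_rule", "exception", "atomic_unit"
    statement: str
    source_trace: SourceTrace
    source_basis: str = "explicit"
    confidence: float = 1.0
    applies_to: list[str] = field(default_factory=list)
    conditions: list[str] = field(default_factory=list)
    exceptions: list[str] = field(default_factory=list)

    def citation(self) -> str:
        """Render a human-readable citation from the provenance fields."""
        trace = self.source_trace
        return f"{trace.document_id}, section {trace.section}"


dr_014 = KnowledgeObject(
    id="dr_014",
    object_type="decision_rule",
    statement=("Customers on annual discounted contracts cannot cancel "
               "without penalty before the term ends."),
    source_trace=SourceTrace("contract_policy_v3", "4.2", "..."),
    confidence=0.92,
    applies_to=["annual contract", "discounted contract"],
    conditions=["term still active"],
    exceptions=["termination due to provider breach"],
)
print(dr_014.citation())
```

Keeping the trace as a required field, rather than an optional one, reflects the idea that provenance travels with the object instead of being reconstructed later.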
3. Relation to passage retrieval
CKF tries to improve passage retrieval along three axes.
3.1 Granularity. Traditional chunks are often arbitrary (fixed size, sliding window, paragraph, recursive splitting). CKF adds semantic granularity: one rule, one exception, one condition, one definition, one step, one constraint, one atomic unit, one operational instruction. This reduces the chance the retriever brings "nearby text" without the structure needed to answer.
3.2 Discursive function. Two semantically close excerpts may play different roles (definition, rule, exception, example, warning, procedure, motivation, recommendation, limit). CKF makes the role explicit and enables type-conditioned retrieval:
query: "can I cancel an annual discounted contract?"
retrieve:
  - decision_rules
  - exceptions
  - knowledge_limits
  - relevant source chunks
Not just the most similar chunks.
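A toy Python sketch of type-conditioned retrieval follows. It is only an illustration under stated assumptions: the intent labels, the `INTENT_TYPES` mapping, the sample objects `ex_003` and `chunk_7`, and the token-overlap score are all hypothetical stand-ins for a real intent classifier and retriever.

```python
import re

# Assumed mapping from query intent to the object types worth retrieving.
INTENT_TYPES = {
    "permission": {"decision_rule", "exception", "knowledge_limit"},
    "how_to": {"procedure_step"},
    "definition": {"concept"},
}


def tokens(text):
    """Lowercased alphanumeric tokens; a stand-in for real text analysis."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, intent, objects, k=3):
    """Keep only objects of the types allowed for this intent, then rank by overlap."""
    allowed = INTENT_TYPES[intent]
    q = tokens(query)
    scored = [(len(q & tokens(o["text"])), o) for o in objects if o["type"] in allowed]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for _, o in scored[:k]]


objects = [
    {"id": "dr_014", "type": "decision_rule",
     "text": "Customers on annual discounted contracts cannot cancel "
             "without penalty before the term ends."},
    {"id": "ex_003", "type": "exception",
     "text": "Annual contracts may be cancelled without penalty on provider breach."},
    {"id": "chunk_7", "type": "retrieval_chunk",
     "text": "You can cancel an annual discounted contract any time, one customer said."},
]

hits = retrieve("can I cancel an annual discounted contract?", "permission", objects)
print([h["id"] for h in hits])
```

The narrative chunk has the highest lexical overlap with the query, yet it is never considered because its type does not match the intent.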
3.3 Provenance and auditability. In ordinary RAG, provenance is reconstructed from the retrieved chunk. In CKF, provenance is part of the retrievable object, enabling answer grounding, citation-aware generation, source-constrained decoding, confidence-weighted retrieval, human audit, contradiction analysis, and trace-based evaluation.
4. CKF and composition hallucination
A central problem CKF targets isn't hallucination in the unsupported-generation sense — it's composition error: the system retrieves true facts but combines them into a false, incomplete, or out-of-scope answer.
Classic case: rule A is true, exception B is true, condition C scopes A — but the model applies A without B or outside C. That's a failure in the chain retrieval → evidence selection → evidence composition → answer synthesis.
CKF tries to reduce that failure by making explicit the objects normally left implicit in text: rule, exception_to(rule), condition, scope, applicability, source_basis. Part of the composition becomes retrievable structure, not just generation-time inference.
5. CKF vs. query/document expansion
Query and document expansion techniques (pseudo-relevance feedback, doc2query, SPLADE-style expansion) improve query–document matchability. CKF can improve matchability too, but its main goal is broader:
| Technique | Primary goal |
|---|---|
| Query expansion | improve query recall |
| Document expansion | improve query–document match |
| Entity extraction | identify entities |
| KG extraction | build nodes/relations |
| Summarization | compress content |
| Passage chunking | create retrieval units |
| CKF | create a typed, multi-granular, provenance-aware, operationally useful intermediate representation |
CKF can include expansions, but also normative, procedural, causal, and restrictive objects.
6. CKF vs. knowledge graph
CKF is not just a KG. A KG emphasizes (entity) -[relation]-> (entity). CKF may contain relations but also includes structures not naturally expressed as simple triples: procedures, exceptions, conditional rules, playbooks, anti-patterns, Q&A, retrieval chunks, agent instructions, knowledge limits, source traces, confidence, source basis.
CKF is closer to a document-grounded knowledge package than to a canonical graph. It can feed a KG without requiring all knowledge to be normalized as triples.
7. CKF vs. GraphRAG
GraphRAG improves RAG using graph structure, entities, communities, and hierarchical summarization. CKF is complementary. GraphRAG models global structure, inter-document relations, and semantic communities. CKF models the functional decomposition of a document (or corpus) into actionable knowledge types.
A possible architecture:
Documents
→ CKF compilation
→ entity/rule/procedure extraction
→ graph construction
→ graph/community retrieval
→ CKF object retrieval
→ grounded answer synthesis
CKF can supply cleaner raw material for GraphRAG.
8. CKF as late-binding structure
The same CKF package can be indexed in multiple indexes simultaneously:
BM25 / sparse index over text fields
dense embeddings over statements
hybrid index over retrieval_chunks
symbolic index over entities/rules/exceptions
graph index over references
metadata filters over source_basis/confidence/domain
temporal index over validity/effective dates
CKF separates knowledge representation from retrieval implementation, enabling late binding: different applications choose how to index and retrieve the same objects.
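As a rough illustration of late binding, the Python sketch below builds two independent indexes from the same list of objects: a term index over statements and a type index. The function name `build_indexes` and the dictionary layout are assumptions made for this example; a real deployment would use an actual search engine or vector store.

```python
import re
from collections import defaultdict


def build_indexes(objects):
    """Build a term index and a type index from the same CKF-style objects."""
    term_index = defaultdict(set)  # token -> ids of objects whose statement has it
    type_index = defaultdict(set)  # object type -> ids of objects of that type
    for obj in objects:
        type_index[obj["type"]].add(obj["id"])
        for term in re.findall(r"[a-z0-9]+", obj["statement"].lower()):
            term_index[term].add(obj["id"])
    return term_index, type_index


objects = [
    {"id": "dr_014", "type": "decision_rule",
     "statement": "Customers on annual discounted contracts cannot cancel "
                  "without penalty before the term ends."},
    {"id": "au_102", "type": "atomic_unit",
     "statement": "The standard cancellation notice is 30 days."},
]

term_index, type_index = build_indexes(objects)

# One application may intersect both indexes; another may use only one of them.
rules_about_cancel = term_index["cancel"] & type_index["decision_rule"]
print(rules_about_cancel)
```

The objects themselves never change; only the way each application chooses to index and combine them does.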
9. Evaluation
Intrinsic — compilation quality: schema validity, source-trace accuracy, entity/concept coverage, rule extraction precision/recall, exception linkage accuracy, numeric/date preservation, atomicity, dedup quality, contradiction detection, source_basis calibration, confidence calibration.
Extrinsic — downstream impact: answer correctness, citation correctness, groundedness, faithfulness, evidence recall/precision, multi-hop success, exception handling, procedural accuracy, refusal when unsupported, token efficiency, latency/cost, robustness under context budget constraints.
The CKF thesis is only interesting if extrinsic gains exist. Intrinsic evaluation is necessary but not sufficient.
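One of the intrinsic metrics listed above, exception linkage accuracy, can be scored as precision and recall over (exception, rule) pairs. The Python sketch below assumes a hand-labeled gold set; the identifiers and the numbers are made up purely for illustration.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted (exception_id, rule_id) links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and compiler output.
gold_links = {("ex_1", "rule_1"), ("ex_2", "rule_1"), ("ex_3", "rule_2")}
predicted_links = {("ex_1", "rule_1"), ("ex_3", "rule_4")}

precision, recall = precision_recall(predicted_links, gold_links)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of the two predicted links is correct and one of the three gold links was found, so a compiler with this output would miss most exceptions even though half of what it emits is right.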
10. CKF and relevance
CKF lets you enrich relevance from semantic_similarity(query, chunk) to:
relevance(query, object) =
    semantic similarity
  + object_type compatibility
  + source reliability
  + source_basis
  + confidence
  + recency/version
  + applicability conditions
  + exception linkage
  + citation availability
For "Can I do X?", a decision_rule and its exceptions may outrank a high-similarity narrative paragraph. For "How do I do X?", procedure_steps come first. For "What does X mean?", concepts and definitions come first. For "When should I not do X?", exceptions, anti_patterns, and knowledge_limits come first.
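A simplified Python sketch of such an enriched relevance score is shown below. It uses only four of the signals listed above, combined as a weighted sum; the weights, the intent labels, the `TYPE_COMPAT` table, and the candidate values are all assumptions chosen for illustration, not tuned or prescribed values.

```python
from dataclasses import dataclass

# Assumed compatibility between query intent and object type (0.0 if absent).
TYPE_COMPAT = {
    "permission": {"decision_rule": 1.0, "exception": 1.0, "knowledge_limit": 0.8},
    "how_to": {"procedure_step": 1.0},
    "definition": {"concept": 1.0},
}


@dataclass
class Candidate:
    id: str
    object_type: str
    similarity: float  # semantic similarity to the query, in [0, 1]
    confidence: float  # compile-time confidence, in [0, 1]
    has_citation: bool


def relevance(c, intent, w_sim=0.5, w_type=0.3, w_conf=0.1, w_cite=0.1):
    """Weighted sum of similarity, type compatibility, confidence, and citation."""
    type_score = TYPE_COMPAT.get(intent, {}).get(c.object_type, 0.0)
    cite_score = 1.0 if c.has_citation else 0.0
    return (w_sim * c.similarity + w_type * type_score
            + w_conf * c.confidence + w_cite * cite_score)


candidates = [
    Candidate("chunk_7", "retrieval_chunk", similarity=0.90, confidence=1.0, has_citation=True),
    Candidate("dr_014", "decision_rule", similarity=0.70, confidence=0.92, has_citation=True),
]
ranked = sorted(candidates, key=lambda c: relevance(c, "permission"), reverse=True)
print([c.id for c in ranked])
```

With these assumed weights the decision rule outranks the more similar narrative chunk for a permission-style query, which is the behavior described above.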
11. Retrieval over CKF
1. Query understanding
   - classify intent: definition, rule, exception, procedure, troubleshooting, factual, unsupported check
2. Candidate generation
   - hybrid retrieval over retrieval_chunks and atomic_units
   - filtered retrieval over object types
   - entity linking
   - source metadata filtering
3. Evidence assembly
   - include base rule
   - include linked exceptions
   - include procedure dependencies
   - include source spans
   - include knowledge limits
4. Reranking
   - semantic relevance + object-type relevance + provenance quality + confidence + contradiction risk + constraint coverage
5. Generation
   - answer with assembled evidence; cite source_trace; abstain if unsupported
The main departure from traditional RAG is step 3: evidence assembly no longer depends only on top-k chunks.
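Part of step 3 can be sketched in a few lines of Python. The example reuses the equipment-maintenance rule and exception from the Level 3 manual example; the function name `assemble_evidence` and the dictionary layout are assumptions. It covers only the rule and exception linkage, not procedure dependencies or knowledge limits.

```python
def assemble_evidence(hit_ids, package):
    """Expand retrieved ids so a rule and its linked exceptions travel together."""
    by_id = {obj["id"]: obj for section in package.values() for obj in section}
    evidence, seen = [], set()

    def add(obj):
        if obj["id"] not in seen:
            seen.add(obj["id"])
            evidence.append(obj)

    for hit_id in hit_ids:
        obj = by_id[hit_id]
        add(obj)
        # Pull in every exception that points at this object.
        for exc in package.get("exceptions", []):
            if exc.get("exception_to") == obj["id"]:
                add(exc)
        # If the hit is itself an exception, pull in its base rule.
        base_id = obj.get("exception_to")
        if base_id in by_id:
            add(by_id[base_id])
    return evidence


package = {
    "decision_rules": [
        {"id": "rule_001",
         "rule": "The equipment must be turned off before maintenance.",
         "source": "Manual, section 4.2"},
    ],
    "exceptions": [
        {"id": "exception_001", "exception_to": "rule_001",
         "condition": "energized diagnostic procedure",
         "source": "Manual, section 7.1"},
    ],
}

from_rule = [o["id"] for o in assemble_evidence(["rule_001"], package)]
from_exception = [o["id"] for o in assemble_evidence(["exception_001"], package)]
print(from_rule, from_exception)
```

Whichever of the two objects the retriever happens to find, the generator receives both, so the rule is not applied without its exception.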
12. CKF as context control
CKF allows leaner, more precise context budgets:
base rule: 80 tokens
exception: 60 tokens
procedure step: 50 tokens
source span: 120 tokens
knowledge limit: 40 tokens
instead of four 800-token chunks. This can improve context precision, faithfulness, cost, latency, and controllability.
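The arithmetic behind that comparison, using the illustrative token counts above, can be checked with a short Python snippet:

```python
# Token counts taken from the illustrative budget above.
ckf_budget = {
    "base rule": 80,
    "exception": 60,
    "procedure step": 50,
    "source span": 120,
    "knowledge limit": 40,
}

ckf_total = sum(ckf_budget.values())   # tokens for the assembled CKF objects
chunk_total = 4 * 800                  # tokens for four 800-token chunks
saving = 1 - ckf_total / chunk_total   # fraction of context tokens saved

print(ckf_total, chunk_total, f"{saving:.1%}")
```

In this example the CKF context is 350 tokens against 3,200, roughly 89% smaller. The figures are illustrative, so the real saving depends on the documents and the objects retrieved.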
13. Atomic units
Atomic units act as small, indexable, verifiable propositions:
atomic_units:
  - id: au_102
    statement: "The standard cancellation notice is 30 days."
    source_basis: explicit
    source_trace: ...
This brings CKF close to proposition-level retrieval, claim extraction, semantic units, evidence atoms, decontextualized sentence retrieval, fact indexing — with the difference that atomic units coexist with higher-level objects like rules and procedures.
14. Passage decontextualization
Many chunks suffer because they depend on prior context:
"In those cases, the deadline is reduced to 10 days."
In isolation that's almost useless. During compilation CKF may decontextualize:
"For cancellations due to provider failure, the notice period is reduced to 10 days."
while keeping source_trace to the original. Decontextualization improves retrieval but raises distortion risk — hence the need for source_basis, confidence, trace, validation, numeric guards, and a human audit path.
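A numeric guard of the kind mentioned above can be as simple as checking that every number in the decontextualized statement also appears in the source span. The Python sketch below is one possible form of that check; the function names are hypothetical, and a real guard would also need to handle units, dates, and spelled-out numbers.

```python
import re


def numbers(text):
    """Extract integer and decimal literals from a piece of text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def numeric_guard(statement, source_span):
    """Return numbers that appear in the statement but not in the source span."""
    return numbers(statement) - numbers(source_span)


source_span = "In those cases, the deadline is reduced to 10 days."
faithful = ("For cancellations due to provider failure, "
            "the notice period is reduced to 10 days.")
distorted = ("For cancellations due to provider failure, "
             "the notice period is reduced to 15 days.")

print(numeric_guard(faithful, source_span))   # empty set: nothing unsupported
print(numeric_guard(distorted, source_span))  # flags the unsupported number
```

A non-empty result does not prove the statement is wrong, but it is a cheap signal that the item should be routed to validation or human review.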
15. Multi-vector / fielded indexing
CKF is naturally compatible with fielded retrieval. One object can produce separate embeddings for its statement, normalized statement, source excerpt, entity list, conditions, exceptions, domain tags, generated questions, and procedural title. That enables multi-vector retrieval per object, in the spirit of ColBERT-style multi-vector or field-aware dense retrieval, but at the typed-knowledge level.
16. Sparse retrieval
Because content is decontextualized and normalized, BM25 can improve: objects contain explicit terms previously implicit in headings, prior sections, or visual context. Fields like aliases, entities, concepts, conditions, and generated_questions act as controlled document expansion.
17. Hybrid
A good CKF retriever is likely hybrid: BM25/SPLADE over textual fields, dense retrieval over statements/source spans, metadata filtering over type/source/version, graph traversal over linked rules/exceptions/procedures, reranking with a cross-encoder or LLM judge.
18. Versioning
Documents change. A CKF package tracks document_id, document_hash, package_version, compiler_version, schema_version, item_id, source_span, validity, and effective date. A CKF index can then filter by effective_date <= now, superseded = false, domain = "compliance", document_version = latest. This matters for IR because "which policy is in force today?" is not only a semantic question; it is also temporal and normative.
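The temporal part of that filter can be sketched in Python as follows. The item layout, the policy identifiers, and the dates are hypothetical; the two conditions mirror effective_date <= now and superseded = false.

```python
from datetime import date


def in_force(items, today):
    """Keep items that are already effective and have not been superseded."""
    return [
        item for item in items
        if item["effective_date"] <= today and not item["superseded"]
    ]


# Hypothetical versions of the same policy rule.
items = [
    {"id": "policy_v2", "effective_date": date(2023, 1, 1), "superseded": True},
    {"id": "policy_v3", "effective_date": date(2024, 1, 1), "superseded": False},
    {"id": "policy_v4", "effective_date": date(2030, 1, 1), "superseded": False},
]

current = in_force(items, today=date(2025, 6, 1))
print([item["id"] for item in current])
```

Passing the reference date explicitly, instead of reading the clock inside the function, also makes it possible to ask which policy was in force on a past date.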
19. Abstention
CKF can improve correct non-answers. knowledge_limits can explicitly represent: the document does not cover X, the rule does not apply to Y, the source is ambiguous, the information depends on external policy, the version is insufficient, documents conflict. Ordinary RAG tends to fill gaps; CKF provides retrievable objects supporting abstention or conditioned answers.
20. Agentic IR
For agents, CKF works as a cognitive resource pack: facts, rules, procedures, constraints, allowed actions, escalation paths, source references, tool instructions, decision boundaries. That is different from handing the agent raw chunks. CKF is especially relevant when IR terminates not in answer generation but in actions.
21. Honest technical critique
Objection 1: LLM extraction can introduce error. True. CKF trades one problem for another: part of the generation error migrates to compile time. Mitigation: mandatory source trace, confidence, source_basis, validators, numeric guards, human review, intrinsic evaluation, compiler versioning.
Objection 2: recall loss. Yes. If compilation omits a rule or exception, downstream is limited. Mitigation: keep retrieval_chunks alongside structured objects, also index original text, measure coverage, fall back to source retrieval, detect low coverage.
Objection 3: schema too rigid. Possible, especially across heterogeneous domains. Mitigation: extensibility, optional fields, domain-specific sections, schema versioning, custom object types.
Objection 4: compilation cost. Yes. CKF only pays off when the document is reused, audited, or queried many times — or when errors are costly. Not ideal for disposable, simple, or low-criticality content.
22. Where CKF tends to be strong
Long documents, standards, policies, contracts, manuals, procedures, compliance, exceptions, multi-version corpora, high error cost, citation/audit requirements, decision- or workflow-executing agents.
23. Where CKF tends to be weak
Simple factual questions, short documents, ephemeral content, broad exploratory search, raw-recall-dominant tasks, very noisy corpora without good extraction, free-generation-acceptable tasks, scenarios with no reuse of the compiled package.
24. Research framing
Hypothesis: explicit typed knowledge objects with source-grounded provenance, generated at indexing time from human documents, improve downstream answer faithfulness, exception handling, and compositional correctness under constrained context budgets compared to passage-only RAG.
The relevant experiment isn't "CKF looks more organized" but: does CKF improve evidence recall/precision and answer faithfulness for compositional, procedural, and exception-sensitive queries — controlling for same documents, model, context budget, query set, evaluation rubric, and comparable retrieval infrastructure?
25. Specialist summary
CKF is best understood as an intermediate IR layer — document-grounded, typed, provenance-aware, multi-granular — for knowledge-intensive NLP. It turns documents into retrievable objects richer than chunks and less rigid than a full KG.
It promises better evidence selection, evidence assembly, grounding, citation, exception handling, procedural correctness, abstention, context efficiency, and agentic use. Its biggest weakness is compilation quality — so CKF's validity depends on rigorous intrinsic evaluation and, above all, on extrinsic gains in downstream tasks.
In one technical sentence: CKF is a semantic document compilation layer that produces typed, source-grounded, retrieval-ready knowledge objects to improve retrieval-augmented reasoning and reduce composition errors in document-based AI systems.
Closing
The document is not the knowledge. The knowledge has to be compiled. CKF is the same idea, whether you tell it to a child with a dinosaur book or to a researcher with a retrieval stack — it just gains resolution as the audience changes.