Research · May 12, 2026 · 9 min read

CKF: Toward a Compiled Intermediate Representation for Machine-Operable Knowledge

A strategic analysis of CKF as a semantic IR — beyond RAG and GraphRAG, alongside RDF/OWL and MCP, with provenance and validation as foundational primitives.

Paulo Tomazinho, Creator of CKF

Author: Paulo Tomazinho, PhD

Affiliation: CKF Research

Artificial intelligence systems have become remarkably effective at generating language. Yet most contemporary architectures still operate primarily over textual representations rather than structured operational knowledge.

This distinction matters.

Current systems retrieve, summarize, and statistically reinterpret language. They do not truly operate over explicit cognitive structures such as principles, constraints, heuristics, assumptions, causal mechanisms, or decision pathways.

The concept behind Compiled Knowledge Format (CKF) is compelling precisely because it attempts to address this gap — building on an architectural intuition that Andrej Karpathy articulated clearly in his LLM Wiki pattern (April 2026): that raw sources should be compiled once into a persistent artifact rather than re-derived at every query. CKF extends that intuition toward a different consumer — not a human reading a Markdown wiki, but an agent or retrieval system reading a typed, schema-stable package.

Rather than treating knowledge as content, CKF proposes treating knowledge as an executable infrastructure layer.

A useful analogy emerges from software architecture:

Layer	Analogy
PDF / article / book	Human source code
Semantic parsing	Compiler
CKF	Cognitive intermediate representation
Agent / LLM	Execution runtime
MCP / APIs	Operational interface

This framing may be the most valuable contribution of the CKF proposal: the transition from "knowledge as document" to "knowledge as executable intermediate representation."

Beyond Retrieval: Why CKF Matters

Most current AI systems still rely heavily on Retrieval-Augmented Generation (RAG).

Traditional RAG pipelines retrieve text chunks and inject them into prompts. This improves grounding and reduces some hallucination, but it still forces models to reinterpret natural language repeatedly. It also fails systematically on a specific class of error we have termed composition hallucination: outputs that contradict information present and retrievable in the context, not because the model lacked the data, but because the structural relations between fragments were never made explicit. The full taxonomy is given in a companion article (Tomazinho, 2026).

CKF proposes a different abstraction layer.

Instead of retrieving paragraphs, a CKF system could retrieve:

applicable principles
decision heuristics
operational constraints
assumptions
conflict structures
reasoning paths
causal relationships

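In principle, this is more expressive than retrieving text alone: the system receives typed units it can filter and combine, rather than paragraphs it must reinterpret on every query. The shift resembles the difference between searching raw source files and executing compiled semantic instructions.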
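To make the contrast with paragraph retrieval concrete, here is a minimal Python sketch of retrieval over typed units. The `UnitKind` enum, the `KnowledgeUnit` fields, and the sample refund-policy statements are illustrative assumptions, not part of the CKF specification.

```python
from dataclasses import dataclass
from enum import Enum


class UnitKind(Enum):
    # A hypothetical subset of the unit types listed above.
    PRINCIPLE = "principle"
    HEURISTIC = "heuristic"
    CONSTRAINT = "constraint"
    ASSUMPTION = "assumption"


@dataclass(frozen=True)
class KnowledgeUnit:
    kind: UnitKind
    statement: str
    topics: frozenset


def retrieve(units, kind, topic):
    """Return the units of one kind that are tagged with the given topic."""
    return [u for u in units if u.kind is kind and topic in u.topics]


# Made-up sample data for illustration only.
UNITS = [
    KnowledgeUnit(UnitKind.CONSTRAINT, "Refunds over 500 need manager approval.",
                  frozenset({"refunds"})),
    KnowledgeUnit(UnitKind.HEURISTIC, "Offer store credit before a cash refund.",
                  frozenset({"refunds"})),
    KnowledgeUnit(UnitKind.PRINCIPLE, "Resolve complaints at first contact.",
                  frozenset({"support"})),
]

constraints = retrieve(UNITS, UnitKind.CONSTRAINT, "refunds")
```

The query asks for a kind of knowledge ("which constraints apply to refunds?") instead of a text span, which is the abstraction shift the list above describes.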

However, the challenge is equally significant: while RAG is comparatively simple, scalable, and inexpensive, CKF introduces major complexity in semantic compilation, validation, ontology management, and maintenance. These costs are real and must be justified empirically, not asserted rhetorically.

CKF vs GraphRAG

GraphRAG already represents an important evolution beyond traditional retrieval systems.

Microsoft's GraphRAG approach combines graph extraction, network analysis, summarization, and hierarchical semantic organization to improve contextual retrieval and reasoning over large corpora.

But CKF appears to push beyond entity-relationship structures.

Where GraphRAG primarily organizes entities, relations, communities, and semantic neighborhoods, CKF aims to additionally encode heuristics, principles, decision logic, assumptions, causal mechanisms, and operational reasoning.

GraphRAG	CKF
Organizes semantic graphs	Organizes cognitive structures
Focuses on entities and relations	Focuses on operational reasoning
Improves contextual retrieval	Proposes cognitive execution

The question, however, is whether CKF can demonstrate measurable gains over already-functional GraphRAG implementations. Conceptual elegance alone is insufficient. Operational advantage must be empirically demonstrated. This is the central question driving the experimental program described below.

The Relationship with RDF, OWL, SKOS, and Semantic Web Standards

A major strategic question for CKF is whether it intends to compete with or interoperate with existing semantic technologies.

Technologies such as RDF, OWL, SKOS, and SPARQL already solve important portions of the semantic representation problem:

Technology	Primary Capability
RDF	Subject-predicate-object representation
OWL	Ontologies and logical inference
SPARQL	Semantic graph querying
SKOS	Taxonomies and controlled vocabularies

CKF becomes significantly more credible if positioned not as a replacement for these standards, but as a higher-level cognitive compilation layer. An ideal architecture could look like this:

Human Documents
        ↓
Semantic Extraction
        ↓
CKF (Cognitive IR)
        ↓
RDF / OWL / Knowledge Graphs
        ↓
SPARQL / GraphQL / Vector Search
        ↓
MCP Servers
        ↓
Agents

In this architecture, CKF is not the database. It is the intermediate representation — a translation layer between human-authored abstractions and executable machine operations. If CKF ignores existing semantic standards, it risks reinventing two decades of Semantic Web research. If it interoperates with them, it occupies a genuinely distinct position.
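As a rough illustration of the "CKF to RDF" step in the diagram, the sketch below lowers one knowledge unit into subject-predicate-object tuples. The `ckf:` prefix, the predicate names, and the dictionary layout are hypothetical; only `rdf:type` is a real RDF term. A real pipeline would emit proper IRIs and literals through an RDF library.

```python
def to_triples(unit_id, unit):
    """Flatten one unit (a plain dict) into subject-predicate-object tuples."""
    subject = f"ckf:{unit_id}"
    # The unit kind becomes the RDF class of the subject.
    triples = [(subject, "rdf:type", f"ckf:{unit['kind'].capitalize()}")]
    for key in ("statement", "source"):
        triples.append((subject, f"ckf:{key}", unit[key]))
    # Declared conflicts become edges to other units.
    for other in unit.get("conflicts_with", []):
        triples.append((subject, "ckf:conflictsWith", f"ckf:{other}"))
    return triples


unit = {
    "kind": "principle",
    "statement": "Prefer reversible decisions.",
    "source": "document_1",
    "conflicts_with": ["p2"],
}
triples = to_triples("p1", unit)
```

Once the unit is expressed as triples, existing graph stores and query languages can take over, which is the interoperability argument made above.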

CKF and MCP: A Natural Convergence

The Model Context Protocol standardizes how AI systems interact with external tools, APIs, files, workflows, and services. CKF addresses a different layer of the stack.

MCP asks: How does the agent access systems and tools?

CKF asks: How is knowledge itself structured for machine operation?

Together, they compose without conflict. A hypothetical MCP-compatible CKF server might expose cognitive functions such as:

get_principles(context)
detect_conflicts(strategy)
trace_reasoning_path(goal)
validate_assumptions(plan)
retrieve_applicable_heuristics(case)

This begins to resemble executable cognition rather than prompt orchestration. However, the governance implications are substantial. A CKF operational layer would require provenance tracking, permission systems, auditability, sandboxing, uncertainty management, conflict detection, and semantic versioning. Without these safeguards, the system risks operationalizing false certainty at scale.
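A minimal in-memory sketch of two of the functions listed above may help fix ideas. It does not use any MCP SDK. The record layout, the sample principles, and the choice to represent a strategy as a list of principle ids are all assumptions made for illustration.

```python
# Made-up principles; "conflicts_with" is declared by hand here.
PRINCIPLES = [
    {"id": "p1", "text": "Prefer reversible decisions.",
     "contexts": {"strategy"}, "conflicts_with": {"p2"}},
    {"id": "p2", "text": "Commit fully once a direction is set.",
     "contexts": {"strategy"}, "conflicts_with": {"p1"}},
    {"id": "p3", "text": "Document every exception.",
     "contexts": {"compliance"}, "conflicts_with": set()},
]


def get_principles(context):
    """Return the principles tagged as applicable to a context."""
    return [p for p in PRINCIPLES if context in p["contexts"]]


def detect_conflicts(strategy):
    """Return sorted pairs of principle ids in the strategy that are
    declared as conflicting. Unknown ids are ignored."""
    chosen = set(strategy)
    by_id = {p["id"]: p for p in PRINCIPLES}
    pairs = set()
    for pid in chosen:
        if pid not in by_id:
            continue
        for other in by_id[pid]["conflicts_with"] & chosen:
            pairs.add(tuple(sorted((pid, other))))
    return sorted(pairs)
```

In this toy version a conflict is only found if someone declared it. Detecting undeclared conflicts is a much harder problem and is not attempted here.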

Where CKF Is Most Promising

CKF appears especially valuable in domains where knowledge is fundamentally operational rather than merely informational — medicine, law, engineering, governance, risk analysis, scientific reasoning, enterprise strategy, compliance systems, adaptive education, and corporate playbooks.

In these environments, retrieving text is often insufficient. The system must understand which rules apply, which exceptions invalidate conclusions, which principles conflict, which assumptions hold, and which reasoning pathway should be followed.

These are precisely the domains where composition hallucination — failing to integrate conditions, exceptions, and scope across document fragments — has the most consequential impact.

The Central Challenge: Semantic Compilation

The hardest problem in CKF is not representation. It is compilation.

Compiling human language into executable semantic structures is fundamentally different from compiling software. Natural language contains ambiguity, implicit context, contradiction, uncertainty, metaphor, tacit knowledge, probabilistic reasoning, irony, and incomplete assumptions.

A viable CKF compiler therefore cannot produce rigid assertions alone. It must encode uncertainty explicitly. For example:

rule: X
confidence: 0.72
source: document_1
evidence: original_excerpt
status: inferred
validated_by: human

Without probabilistic provenance, CKF risks becoming a "false precision machine" — structurally organizing uncertainty into apparently authoritative outputs. This is not a hypothetical risk; it is the natural failure mode of any compilation system that treats its output as authoritative without tracking how it was derived.
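One way to keep that failure mode visible is to make provenance non-optional in the data structure itself. The sketch below mirrors the fields of the example record. The validation rules, the `render` helper, and the 0.8 threshold are assumptions chosen for illustration, not values from the CKF specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompiledRule:
    rule: str
    confidence: float
    source: str
    evidence: str
    status: str
    validated_by: Optional[str] = None

    def __post_init__(self):
        # Reject records that would hide how the rule was derived.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if not self.source or not self.evidence:
            raise ValueError("source and evidence are required")


def render(rule, threshold=0.8):
    """State a rule as plain fact only if it is validated and confident
    enough; otherwise carry its provenance along in the output."""
    if rule.validated_by and rule.confidence >= threshold:
        return rule.rule
    return (f"{rule.rule} ({rule.status}, "
            f"confidence {rule.confidence:.2f}, source {rule.source})")


example = CompiledRule(rule="X", confidence=0.72, source="document_1",
                       evidence="original_excerpt", status="inferred",
                       validated_by="human")
text = render(example)
```

Here the rule is human-validated but falls below the assumed threshold, so its status, confidence, and source travel with it instead of being dropped.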

Validation Will Define Success or Failure

The defining question for CKF is simple: how do we know the compiled knowledge is correct?

A robust CKF ecosystem requires traceability to original sources, confidence scoring, conflict detection, human review workflows, semantic consistency testing, benchmarking, provenance management, and version control. Without validation infrastructure, semantic compilation becomes dangerous rather than useful.
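Traceability is the easiest of these checks to automate. The sketch below flags units whose evidence excerpt cannot be found in the source they cite. The unit layout and the whitespace and case normalization are assumptions. A verbatim match is a deliberately strict criterion that a real validator would likely need to relax.

```python
def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def untraceable(units, sources):
    """Return ids of units whose evidence is missing, or is not found
    in the text of the source they cite."""
    missing = []
    for unit in units:
        evidence = normalize(unit.get("evidence", ""))
        source_text = normalize(sources.get(unit["source"], ""))
        if not evidence or evidence not in source_text:
            missing.append(unit["id"])
    return missing


# Made-up data for illustration.
sources = {"document_1": "Refunds over 500   require manager approval."}
units = [
    {"id": "u1", "source": "document_1",
     "evidence": "refunds over 500 require manager approval"},
    {"id": "u2", "source": "document_1",
     "evidence": "Refunds are never allowed"},
    {"id": "u3", "source": "document_9", "evidence": "anything"},
]
flagged = untraceable(units, sources)
```

Units that fail this check are candidates for human review, not automatic deletion, since the excerpt may have been paraphrased during compilation.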

The MVP Problem

One of the greatest risks for CKF is excessive ambition. The proposal implicitly touches semantic representation, cognitive runtimes, ontologies, reasoning systems, protocols, agent infrastructure, interoperability layers, and knowledge compilation simultaneously. Attempting all of these at once is unlikely to succeed.

The strongest strategic path is a constrained MVP. It would target a narrow domain (compliance, legal decision support, enterprise playbooks), use a minimal semantic core (concepts, principles, heuristics, constraints, relationships, reasoning paths), and measure performance against existing systems with clear metrics: retrieval precision, reasoning consistency, conflict detection, hallucination reduction, explainability, and traceability.
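Of the metrics named above, retrieval precision is the most straightforward to compute. The sketch below uses the standard set-based definitions of precision and recall. The unit ids are made up, and which units count as relevant would have to come from a human-labeled gold set.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over unit ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: the system returned four units, three were relevant.
precision, recall = precision_recall(
    retrieved=["c1", "c2", "h1", "p4"],
    relevant=["c1", "c2", "c3"],
)
```

The same function applies to a RAG or GraphRAG baseline if its retrieved chunks are mapped to the same gold labels, which keeps the comparison like for like.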

Limitations and Study in Progress

CKF is a proposal under empirical investigation, not a validated technology. Several properties of the proposal remain undemonstrated.

Efficacy vs raw text is under measurement. An initial pilot with ten questions across three formats (raw PDF, raw TXT, and CKF) produced near-ceiling scores in faithfulness and completeness that did not differentiate between formats. The pilot used a single model family as both agent and judge, a design choice that subsequent analysis identified as introducing self-evaluation bias. A pre-registered confirmatory study is in preparation, addressing these limitations with an independent judge model, smaller context budgets, multi-hop questions, and a benchmark called COMPGAP designed specifically to isolate composition hallucinations.

Semantic diff has an unsolved component. Detecting logical equivalence between paraphrased knowledge units — distinguishing "the rule was reworded" from "the rule was substantively changed" — is an open research problem. Initial implementations will route ambiguous cases to human review.
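The routing policy can be sketched even though the equivalence problem is open. The example below uses `difflib.SequenceMatcher` from the Python standard library as a crude lexical proxy. Lexical similarity is not logical equivalence, so the only case classified automatically is an exact match after normalization. The labels and the 0.6 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def triage(old, new, review_threshold=0.6):
    """Classify a changed knowledge unit.

    "unchanged":    identical after normalization
    "human_review": similar wording, could be a rewording or a real change
    "recompile":    too different to compare, treat as a new unit
    """
    a, b = normalize(old), normalize(new)
    if a == b:
        return "unchanged"
    similarity = SequenceMatcher(None, a, b).ratio()
    return "human_review" if similarity >= review_threshold else "recompile"


same = triage("Refunds over 500  require approval.",
              "refunds over 500 require approval.")
close = triage("Refunds over 500 require manager approval.",
               "Refunds over 50 require manager approval.")
far = triage("abc", "xyz")
```

The second pair shows why ambiguous cases go to a person: a single dropped digit leaves the text almost identical while changing the rule substantively.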

Compilation cost amortizes under specific conditions. The economic argument assumes documents are consulted many times across their lifetime. For documents consulted rarely or updated continuously, raw text may be preferable. Characterizing the conditions where CKF is cost-effective is future work.
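A simple break-even calculation illustrates the amortization argument. It assumes a linear cost model with a one-time compilation cost and a fixed cost per query, measured in integer token counts, and it ignores recompilation after updates. The figures in the example are invented for illustration, not measurements.

```python
def break_even_queries(compile_cost, raw_cost_per_query, ckf_cost_per_query):
    """Smallest number of queries at which CKF is strictly cheaper than
    raw text, or None if CKF never pays off under this model.

    CKF wins when compile_cost + n * ckf_cost < n * raw_cost,
    that is, when n > compile_cost / (raw_cost - ckf_cost).
    """
    saving = raw_cost_per_query - ckf_cost_per_query
    if saving <= 0:
        return None
    return compile_cost // saving + 1


# Invented figures: 120,000 tokens to compile once,
# 9,000 tokens per raw-text query, 1,000 tokens per CKF query.
n = break_even_queries(120_000, 9_000, 1_000)
```

With these invented numbers the two approaches cost the same at 15 queries and CKF is cheaper from the 16th onward. A document consulted fewer times than that, or recompiled often, would favor raw text.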

CKF as a Semantic Intermediate Representation

The most compelling interpretation of CKF is not as a universal knowledge format. It is as a semantic Intermediate Representation for machine-operable cognition — positioned between the layer where humans write knowledge and the layer where machines consume it.

For CKF to become a meaningful technology rather than an interesting proposal, three strategic decisions remain essential:

Interoperate with existing standards (RDF, OWL, GraphRAG, MCP) rather than competing against them.
Demonstrate measurable value in narrow domains before pursuing universality.
Treat provenance, uncertainty, and validation as foundational primitives, not optional metadata.

The specification and reference implementation are open source under the MIT license in the project repository. The confirmatory study is in preparation. The hypothesis remains under test.

References

ANTHROPIC. (2024). Model Context Protocol Specification. https://modelcontextprotocol.io
EDGE, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv. https://arxiv.org/abs/2404.16130
KARPATHY, A. (2026). LLM Wiki: A pattern for building personal knowledge bases using LLMs. https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
TOMAZINHO, P. (2026). Composition Hallucination in RAG, GraphRAG, and Agents. CKF Research Notes. https://compiledknowledgeformat.org/news/composition-hallucination-em-rag-graphrag-e-agentes
W3C. (2014). RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/rdf11-concepts/
