
CKF: Toward a Compiled Intermediate Representation for Machine-Operable Knowledge

A strategic analysis of CKF as a semantic IR — beyond RAG and GraphRAG, alongside RDF/OWL and MCP, with provenance and validation as foundational primitives.

Paulo Tomazinho, Creator of CKF
May 12, 2026



Author: Paulo Tomazinho, PhD

Affiliation: CKF Research


Artificial intelligence systems have become remarkably effective at generating language. Yet most contemporary architectures still operate primarily over textual representations rather than structured operational knowledge.

This distinction matters.

Current systems retrieve, summarize, and statistically reinterpret language. They do not truly operate over explicit cognitive structures such as principles, constraints, heuristics, assumptions, causal mechanisms, or decision pathways.

The concept behind Compiled Knowledge Format (CKF) is compelling precisely because it attempts to address this gap — building on an architectural intuition that Andrej Karpathy articulated clearly in his LLM Wiki pattern (April 2026): that raw sources should be compiled once into a persistent artifact rather than re-derived at every query. CKF extends that intuition toward a different consumer — not a human reading a Markdown wiki, but an agent or retrieval system reading a typed, schema-stable package.

Rather than treating knowledge as content, CKF proposes treating knowledge as an executable infrastructure layer.

A useful analogy emerges from software architecture:

| Layer | Analogy |
| --- | --- |
| PDF / article / book | Human source code |
| Semantic parsing | Compiler |
| CKF | Cognitive intermediate representation |
| Agent / LLM | Execution runtime |
| MCP / APIs | Operational interface |

This framing may be the most valuable contribution of the CKF proposal: the transition from "knowledge as document" to "knowledge as executable intermediate representation."


Beyond Retrieval: Why CKF Matters

Most current AI systems still rely heavily on Retrieval-Augmented Generation (RAG).

Traditional RAG pipelines retrieve text chunks and inject them into prompts. This improves grounding and reduces some hallucination, but it still forces models to reinterpret natural language on every query. It also fails systematically on a specific class of error we have termed composition hallucination: outputs that contradict information that is present and retrievable in the context. The model did not lack the data; the structural relations between fragments were never made explicit. The full taxonomy is given in a companion article.

CKF proposes a different abstraction layer.

Instead of retrieving paragraphs, a CKF system could retrieve:

  • applicable principles
  • decision heuristics
  • operational constraints
  • assumptions
  • conflict structures
  • reasoning paths
  • causal relationships

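If the compilation is accurate, this gives the model more to work with than text alone, because the type of each unit and its relations to other units are explicit rather than re-inferred from prose. The shift resembles the difference between searching raw source files and executing compiled semantic instructions.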
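As a rough illustration of what typed retrieval could look like, the sketch below filters a small in-memory package by unit kind and topic tag. The `KnowledgeUnit` class, its fields, the `KINDS` vocabulary, and the `retrieve` function are hypothetical names chosen for this example; they are not taken from the CKF specification.

```python
from dataclasses import dataclass, field

# Hypothetical unit kinds mirroring the list above; not an official CKF vocabulary.
KINDS = {
    "principle", "heuristic", "constraint", "assumption",
    "conflict", "reasoning_path", "causal_relation",
}

@dataclass(frozen=True)
class KnowledgeUnit:
    kind: str
    statement: str
    tags: frozenset = field(default_factory=frozenset)

def retrieve(package, kind, tag):
    """Return the units of one kind that carry the given topic tag."""
    if kind not in KINDS:
        raise ValueError(f"unknown unit kind: {kind}")
    return [u for u in package if u.kind == kind and tag in u.tags]

# Illustrative content only.
package = [
    KnowledgeUnit("constraint", "Refunds above 500 EUR need manager approval.", frozenset({"refunds"})),
    KnowledgeUnit("heuristic", "Offer store credit before a cash refund.", frozenset({"refunds"})),
    KnowledgeUnit("constraint", "Personal data is retained for 30 days.", frozenset({"privacy"})),
]

constraints = retrieve(package, "constraint", "refunds")
```

The point of the sketch is the query shape: the caller asks for "constraints about refunds" rather than for paragraphs that happen to mention refunds.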

However, the challenge is equally significant: while RAG is comparatively simple, scalable, and inexpensive, CKF introduces major complexity in semantic compilation, validation, ontology management, and maintenance. These costs are real and must be justified empirically, not asserted rhetorically.


CKF vs GraphRAG

GraphRAG already represents an important evolution beyond traditional retrieval systems.

Microsoft's GraphRAG approach combines graph extraction, network analysis, summarization, and hierarchical semantic organization to improve contextual retrieval and reasoning over large corpora.

But CKF appears to push beyond entity-relationship structures.

Where GraphRAG primarily organizes entities, relations, communities, and semantic neighborhoods, CKF aims to additionally encode heuristics, principles, decision logic, assumptions, causal mechanisms, and operational reasoning.

| GraphRAG | CKF |
| --- | --- |
| Organizes semantic graphs | Organizes cognitive structures |
| Focuses on entities and relations | Focuses on operational reasoning |
| Improves contextual retrieval | Proposes cognitive execution |

The question, however, is whether CKF can demonstrate measurable gains over already-functional GraphRAG implementations. Conceptual elegance alone is insufficient. Operational advantage must be empirically demonstrated. This is the central question driving the experimental program described below.


The Relationship with RDF, OWL, SKOS, and Semantic Web Standards

A major strategic question for CKF is whether it intends to compete with or interoperate with existing semantic technologies.

Technologies such as RDF, OWL, SKOS, and SPARQL already solve important portions of the semantic representation problem:

| Technology | Primary Capability |
| --- | --- |
| RDF | Subject-predicate-object representation |
| OWL | Ontologies and logical inference |
| SPARQL | Semantic graph querying |
| SKOS | Taxonomies and controlled vocabularies |

CKF becomes significantly more credible if positioned not as a replacement for these standards, but as a higher-level cognitive compilation layer. An ideal architecture could look like this:

Human Documents
        ↓
Semantic Extraction
        ↓
CKF (Cognitive IR)
        ↓
RDF / OWL / Knowledge Graphs
        ↓
SPARQL / GraphQL / Vector Search
        ↓
MCP Servers
        ↓
Agents

In this architecture, CKF is not the database. It is the intermediate representation — a translation layer between human-authored abstractions and executable machine operations. If CKF ignores existing semantic standards, it risks reinventing two decades of Semantic Web research. If it interoperates with them, it occupies a genuinely distinct position.
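One way to picture the step from CKF to the RDF layer is as a lowering pass that turns each unit into subject-predicate-object triples. The sketch below does this with plain tuples and no RDF library. `rdf:type` is the standard RDF typing predicate; the `ckf:` prefix, the predicate names, and the dictionary layout of a unit are assumptions made for illustration.

```python
def to_triples(unit_id, unit):
    """Lower one CKF-style unit (a plain dict) into subject-predicate-object triples."""
    subject = f"ckf:{unit_id}"
    triples = [
        (subject, "rdf:type", f"ckf:{unit['kind'].capitalize()}"),
        (subject, "ckf:statement", unit["statement"]),
    ]
    # Hypothetical relation: links to other units this one conflicts with.
    for target in unit.get("conflicts_with", []):
        triples.append((subject, "ckf:conflictsWith", f"ckf:{target}"))
    return triples

unit = {
    "kind": "principle",
    "statement": "Minimize data retention.",
    "conflicts_with": ["p2"],
}
triples = to_triples("p1", unit)
```

Triples in this shape could then be loaded into any triple store and queried with SPARQL, which is what keeps CKF from having to be the database.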


CKF and MCP: A Natural Convergence

The Model Context Protocol standardizes how AI systems interact with external tools, APIs, files, workflows, and services. CKF addresses a different layer of the stack.

MCP asks: How does the agent access systems and tools?

CKF asks: How is knowledge itself structured for machine operation?

Together, they compose without conflict. A hypothetical MCP-compatible CKF server might expose cognitive functions such as:

get_principles(context)
detect_conflicts(strategy)
trace_reasoning_path(goal)
validate_assumptions(plan)
retrieve_applicable_heuristics(case)

This begins to resemble executable cognition rather than prompt orchestration. However, the governance implications are substantial. A CKF operational layer would require provenance tracking, permission systems, auditability, sandboxing, uncertainty management, conflict detection, and semantic versioning. Without these safeguards, the system risks operationalizing false certainty at scale.
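To make two of the functions listed above concrete, here is a minimal sketch over an in-memory package, with a name-based dispatcher standing in for the protocol layer. It does not use any MCP SDK, and the data layout, the `TOOLS` registry, and `call_tool` are hypothetical; only the function names `get_principles` and `detect_conflicts` come from the list above.

```python
# Hypothetical package content: principles tagged by context, plus declared conflicts.
PRINCIPLES = {
    "p1": {"text": "Minimize data retention.", "contexts": {"privacy"}},
    "p2": {"text": "Keep audit logs for seven years.", "contexts": {"privacy", "compliance"}},
}
CONFLICTS = {frozenset({"p1", "p2"})}

def get_principles(context):
    """Return ids of principles that apply in the given context, sorted for stability."""
    return sorted(pid for pid, p in PRINCIPLES.items() if context in p["contexts"])

def detect_conflicts(strategy):
    """Return declared conflicts among the principle ids a strategy relies on."""
    ids = sorted(strategy)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if frozenset({a, b}) in CONFLICTS
    ]

TOOLS = {"get_principles": get_principles, "detect_conflicts": detect_conflicts}

def call_tool(name, argument):
    """Dispatch a tool call by name, roughly how a server might route requests."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](argument)

applicable = call_tool("get_principles", "privacy")
conflicts = call_tool("detect_conflicts", applicable)
```

Note that this sketch only reports conflicts that were declared at compile time. It does not discover new ones, which is one reason the governance requirements above matter.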


Where CKF Is Most Promising

CKF appears especially valuable in domains where knowledge is fundamentally operational rather than merely informational — medicine, law, engineering, governance, risk analysis, scientific reasoning, enterprise strategy, compliance systems, adaptive education, and corporate playbooks.

In these environments, retrieving text is often insufficient. The system must understand which rules apply, which exceptions invalidate conclusions, which principles conflict, which assumptions hold, and which reasoning pathway should be followed.

These are precisely the domains where composition hallucination — failing to integrate conditions, exceptions, and scope across document fragments — has the most consequential impact.


The Central Challenge: Semantic Compilation

The hardest problem in CKF is not representation. It is compilation.

Compiling human language into executable semantic structures is fundamentally different from compiling software. Natural language contains ambiguity, implicit context, contradiction, uncertainty, metaphor, tacit knowledge, probabilistic reasoning, irony, and incomplete assumptions.

A viable CKF compiler therefore cannot produce rigid assertions alone. It must encode uncertainty explicitly. For example:

rule: X
confidence: 0.72
source: document_1
evidence: original_excerpt
status: inferred
validated_by: human

Without probabilistic provenance, CKF risks becoming a "false precision machine" — structurally organizing uncertainty into apparently authoritative outputs. This is not a hypothetical risk; it is the natural failure mode of any compilation system that treats its output as authoritative without tracking how it was derived.
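The record above could be carried as a typed structure that rejects malformed values and makes the "is this safe to treat as authoritative?" decision explicit. The sketch below is one possible policy, not the CKF specification: the `CompiledRule` class, the `"stated"` status value, and the 0.9 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CompiledRule:
    rule: str
    confidence: float               # compiler's own estimate, in [0, 1]
    source: str
    evidence: str
    status: str                     # "stated" (assumed value) or "inferred"
    validated_by: Optional[str] = None

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.status not in ("stated", "inferred"):
            raise ValueError("status must be 'stated' or 'inferred'")

def is_authoritative(rule, min_confidence=0.9):
    """Hypothetical policy: accept human-validated rules, or directly stated
    rules with high confidence; never accept unreviewed inferences."""
    if rule.validated_by is not None:
        return True
    return rule.status == "stated" and rule.confidence >= min_confidence

example = CompiledRule("X", 0.72, "document_1", "original_excerpt", "inferred", "human")
unreviewed = CompiledRule("X", 0.95, "document_1", "original_excerpt", "inferred")
```

Under this policy the example record from the text passes because a human validated it, while an unreviewed inference is held back even at high confidence.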


Validation Will Define Success or Failure

The defining question for CKF is simple: how do we know the compiled knowledge is correct?

A robust CKF ecosystem requires traceability to original sources, confidence scoring, conflict detection, human review workflows, semantic consistency testing, benchmarking, provenance management, and version control. Without validation infrastructure, semantic compilation becomes dangerous rather than useful.
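Traceability is the easiest of these requirements to test mechanically. As a minimal sketch, the check below flags any unit whose evidence excerpt cannot be found in the source it cites. The unit layout (`id`, `source`, `evidence`) and the function names are assumptions, and exact substring matching after whitespace and case normalization is a deliberately strict simplification.

```python
def normalize(text):
    """Collapse whitespace and lowercase so line wraps do not break matching."""
    return " ".join(text.split()).lower()

def untraceable_units(units, sources):
    """Return ids of units whose evidence excerpt is not found in the cited source."""
    failures = []
    for unit in units:
        source_text = sources.get(unit["source"], "")
        excerpt = normalize(unit["evidence"])
        # An empty excerpt would match any text, so treat it as a failure.
        if not excerpt or excerpt not in normalize(source_text):
            failures.append(unit["id"])
    return failures

# Illustrative data only.
sources = {"document_1": "Refunds above 500 EUR\nrequire manager approval."}
units = [
    {"id": "u1", "source": "document_1", "evidence": "refunds above 500 EUR require manager approval"},
    {"id": "u2", "source": "document_1", "evidence": "refunds are always free"},
    {"id": "u3", "source": "document_9", "evidence": "anything"},
]
failures = untraceable_units(units, sources)
```

A check like this catches fabricated or misattributed evidence. It says nothing about whether the rule was correctly derived from that evidence, which is why human review remains on the list.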


The MVP Problem

One of the greatest risks for CKF is excessive ambition. The proposal implicitly touches semantic representation, cognitive runtimes, ontologies, reasoning systems, protocols, agent infrastructure, interoperability layers, and knowledge compilation simultaneously. Attempting all of these at once is unlikely to succeed.

The strongest strategic path is a constrained MVP. It would target a narrow domain (compliance, legal decision support, enterprise playbooks), keep the semantic core minimal (concepts, principles, heuristics, constraints, relationships, reasoning paths), and measure performance against existing systems on clear metrics: retrieval precision, reasoning consistency, conflict detection, hallucination reduction, explainability, and traceability.


Limitations and Study in Progress

CKF is a proposal under empirical investigation, not a validated technology. Several properties of the proposal remain undemonstrated.

Efficacy versus raw text is still being measured. An initial pilot with ten questions across three formats (raw PDF, raw TXT, and CKF) produced near-ceiling scores in faithfulness and completeness that did not differentiate between formats. The pilot used a single model family as both agent and judge, a design choice that subsequent analysis identified as introducing self-evaluation bias. A pre-registered confirmatory study is in preparation. It addresses these limitations with an independent judge model, smaller context budgets, multi-hop questions, and a benchmark called COMPGAP, designed specifically to isolate composition hallucinations.

Semantic diff has an unsolved component. Detecting logical equivalence between paraphrased knowledge units — distinguishing "the rule was reworded" from "the rule was substantively changed" — is an open research problem. Initial implementations will route ambiguous cases to human review.
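A conservative triage step might look like the sketch below. It resolves only the trivial cases automatically (identical text, or differences in whitespace and letter case) and sends everything else to human review. The function name and labels are hypothetical. The lexical similarity score from `difflib` is returned only as a hint for ordering the review queue; it is not evidence of logical equivalence, since a single added "not" leaves two rules lexically close but opposite in meaning.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def triage_change(old, new):
    """Return (label, lexical_similarity) for two versions of a knowledge unit."""
    if old == new:
        return ("unchanged", 1.0)
    if normalize(old) == normalize(new):
        return ("formatting_only", 1.0)
    # Anything else may be a rewording or a substantive change; do not guess.
    score = SequenceMatcher(None, normalize(old), normalize(new)).ratio()
    return ("needs_human_review", score)

cosmetic = triage_change("Refunds  are allowed.", "refunds are allowed.")
negated = triage_change("Refunds are allowed.", "Refunds are not allowed.")
```

Here the negated rule is lexically very similar to the original yet is still routed to review, which is the behavior the open problem calls for until equivalence detection improves.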

Compilation cost amortizes under specific conditions. The economic argument assumes documents are consulted many times across their lifetime. For documents consulted rarely or updated continuously, raw text may be preferable. Characterizing the conditions where CKF is cost-effective is future work.
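The amortization condition can be written down under simple assumptions. Let $C$ be the one-time cost of compiling a document, $q_{\text{raw}}$ the average cost of answering one query over raw text, $q_{\text{ckf}}$ the average cost of answering one query over the compiled package, and $n$ the number of queries served before the document changes enough to require recompilation. These symbols and the linear cost model are assumptions introduced here, not quantities measured in the study.

```latex
\begin{align*}
  C + n\, q_{\text{ckf}} &< n\, q_{\text{raw}} \\
  n &> \frac{C}{q_{\text{raw}} - q_{\text{ckf}}}
  \qquad \text{provided } q_{\text{raw}} > q_{\text{ckf}}
\end{align*}
```

If $q_{\text{ckf}} \ge q_{\text{raw}}$, compilation never pays for itself on cost alone and would have to be justified by quality gains. A document that is updated continuously keeps $n$ small, which is the case where raw text may be preferable.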


CKF as a Semantic Intermediate Representation

The most compelling interpretation of CKF is not as a universal knowledge format. It is as a semantic Intermediate Representation for machine-operable cognition — positioned between the layer where humans write knowledge and the layer where machines consume it.

For CKF to become a meaningful technology rather than an interesting proposal, three strategic decisions remain essential:

  1. Interoperate with existing standards and systems (RDF, OWL, GraphRAG, MCP) rather than competing against them.
  2. Demonstrate measurable value in narrow domains before pursuing universality.
  3. Treat provenance, uncertainty, and validation as foundational primitives, not optional metadata.

The specification and reference implementation are open under the MIT license at the project repository. The confirmatory study is in preparation. The hypothesis remains under test.

