What CKF is not (and what it actually is)

A reference page for engineers, researchers and decision-makers evaluating where the Compiled Knowledge Format fits in the current AI stack. CKF is frequently confused with technologies it actually complements. This page clarifies the boundaries, with citations to the canonical references for each adjacent technology.

CKF is not a vector database

Vector databases such as Pinecone and Weaviate, and PostgreSQL extensions such as pgvector, store embeddings (numerical representations of text) and enable similarity search over them. They answer the question "which chunks of text are semantically closest to this query?".

CKF does not store vectors and does not perform similarity search. A CKF package is a typed, structured document that can be indexed by a vector database (one entry per atomic unit, retrieval chunk, rule, or procedure), but CKF itself is the format of what gets indexed, not the index. The two layers compose: vector databases handle retrieval; CKF handles representation of the retrieved units.

When a team replaces unstructured chunks with CKF atomic units in their vector database, the database does not change. The content of each indexed row changes.
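
As an illustration of "the content of each indexed row changes", here is a minimal Python sketch. The AtomicUnit fields and the row layout are assumptions made for this example; they are not taken from the CKF specification, and the embedding step is deliberately left to whatever model the database already uses.

```python
from dataclasses import dataclass


# Hypothetical shape of a CKF atomic unit; field names are assumptions.
@dataclass(frozen=True)
class AtomicUnit:
    unit_id: str    # stable identifier
    unit_type: str  # e.g. "rule", "procedure", "definition"
    text: str       # the content that gets embedded
    source: str     # provenance pointer back to the source document


def to_index_rows(units):
    """Turn CKF units into the rows a vector database would store.

    Only the payload of each row changes; the database and its
    embedding model stay as they are.
    """
    return [
        {
            "id": u.unit_id,
            "text": u.text,
            "metadata": {"type": u.unit_type, "source": u.source},
        }
        for u in units
    ]


units = [
    AtomicUnit("rule-001", "rule",
               "Refunds above 500 EUR need manager approval.", "policy.pdf#p4"),
    AtomicUnit("proc-007", "procedure",
               "To issue a refund: verify the order, then credit the card.", "policy.pdf#p6"),
]
rows = to_index_rows(units)
```

Each row carries the unit's type and provenance as metadata, so a retrieval layer could filter by type without any change to the index itself.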

CKF is not a knowledge graph or graph database

GraphRAG from Microsoft Research, Neo4j, and similar graph-based systems represent knowledge as nodes and edges — entities connected by relations. They excel at multi-hop reasoning and global summarization across entity networks.

CKF can contain a graph layer (entities, concepts, relations) but is broader than a graph. A CKF package also contains content that graphs do not naturally model: normative content (conditional rules, exceptions, precedence), operational content (procedures, playbooks, patterns, anti-patterns), and cognitive scaffolding (mental models, canonical Q&A pairs). For domains where knowledge is normative or operational — law, medicine, compliance, engineering — the difference is between "can retrieve related concepts" and "can apply policy correctly."

CKF and graph stores compose: a CKF package can be ingested into a knowledge graph (entities and relations become nodes and edges), and the non-graph layers (rules, procedures, exceptions) remain available as typed structured data for the agent to consume directly.
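
The composition described above can be sketched as a simple split. The package below is a plain dict with invented layer names ("entities", "relations", "rules", "procedures"); a real CKF package may name and shape these layers differently.

```python
# Hypothetical CKF package; layer and key names are assumptions for illustration.
package = {
    "entities": [{"id": "e1", "name": "Customer"}, {"id": "e2", "name": "Refund"}],
    "relations": [{"from": "e1", "to": "e2", "type": "requests"}],
    "rules": [{"id": "r1", "if": "amount > 500", "then": "require_approval"}],
    "procedures": [{"id": "p1", "steps": ["verify order", "credit card"]}],
}

GRAPH_LAYERS = ("entities", "relations")


def split_for_graph(pkg):
    """Separate the graph layer from the layers a graph store does not model."""
    # Entities become nodes, relations become (source, label, target) edges.
    nodes = {e["id"]: e["name"] for e in pkg.get("entities", [])}
    edges = [(r["from"], r["type"], r["to"]) for r in pkg.get("relations", [])]
    # Everything else stays as typed structured data for the agent.
    non_graph = {k: v for k, v in pkg.items() if k not in GRAPH_LAYERS}
    return nodes, edges, non_graph


nodes, edges, non_graph = split_for_graph(package)
```

The nodes and edges could be loaded into any graph store, while the rules and procedures remain directly readable by the agent.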

CKF is not an ontology

RDF, OWL and SKOS — the foundational standards of the Semantic Web — define formal ontologies: shared vocabularies that describe entities and their relationships in machine-readable form. They aim for universal interoperability across the web.

CKF inherits from this tradition the discipline of typed structure, stable identifiers, and explicit relations. But CKF abandons three commitments that made formal ontologies hard to adopt: it does not require a universal upper ontology, it does not require manual authoring by trained ontologists, and it does not require alignment between organizations. A CKF package is local to a document or document set, automatically compiled by an LLM, and versioned alongside its source.

In short: CKF is ontology-style structure made practical now that LLMs reduce the authoring cost to near zero. It keeps RDF's discipline but drops the Semantic Web's universalism, along with the adoption friction that has confined the Semantic Web largely to specific domains (life sciences, open government data) for two decades.

CKF is not a fine-tuning system

Fine-tuning and parameter-efficient techniques like LoRA modify the weights of a language model so it behaves differently for specific tasks. The knowledge becomes embedded in the model itself.

CKF does not train or modify any model. A CKF package is data, not weights. It is consumed at inference time, the same way RAG context is consumed. This has important consequences: CKF packages can be updated, audited, versioned and replaced without retraining anything. They can be used with any model that accepts text input. They preserve provenance back to the source document — fine-tuned weights do not.

The two approaches answer different questions. Fine-tuning answers "how should the model behave by default?". CKF answers "what knowledge should the model consult for this task?".

CKF is not a replacement for RAG

Retrieval-Augmented Generation, introduced by Lewis et al. at NeurIPS 2020, is the architecture pattern of retrieving external content at query time and feeding it to a generative model. RAG is an architecture. CKF is a content format that fits inside that architecture.

A standard RAG pipeline runs: documents → chunks → embeddings → similarity search → top-k chunks → context → model. CKF changes one box in that pipeline. Instead of unstructured chunks, the retrieval substrate becomes typed CKF units. Embeddings, similarity search, top-k retrieval, and the generative model are unchanged.
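
To make "one box changes" concrete, the sketch below runs the same retrieval function over unstructured chunks and over typed units. The bag-of-words embedding is a toy stand-in for a real embedding model, and the unit fields ("id", "type") are assumptions for this example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query, items, k=1):
    """Rank items by similarity of their 'text' field; works for chunks or units."""
    q = embed(query)
    ranked = sorted(items, key=lambda it: cosine(q, embed(it["text"])), reverse=True)
    return ranked[:k]


# Unstructured chunks: text only.
chunks = [
    {"text": "refunds above 500 eur need manager approval"},
    {"text": "shipping takes three business days"},
]
# Typed units: same text, plus hypothetical id and type fields.
units = [
    {"id": "rule-001", "type": "rule", "text": "refunds above 500 eur need manager approval"},
    {"id": "fact-002", "type": "fact", "text": "shipping takes three business days"},
]

query = "who approves refunds"
best_chunk = top_k(query, chunks)[0]
best_unit = top_k(query, units)[0]
```

The retrieval code is identical in both calls. What differs is what the model receives: the unit arrives with an identifier and a type that the chunk does not have.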

The hypothesis under test in the CKF research program is whether structured retrieval units produce measurable improvements over unstructured chunks in faithfulness, precision and resistance to composition hallucinations — particularly under retrieval pressure. If the hypothesis fails, RAG continues to work exactly as today. If it succeeds, RAG continues to work, just with better input.

CKF is not a replacement for GraphRAG

GraphRAG (Edge et al., 2024, Microsoft Research) extracts entity-relation graphs from a corpus and uses community detection to enable global sensemaking queries — "what are the main themes across this entire dataset?" — that flat RAG cannot answer well.

CKF and GraphRAG address overlapping but distinct gaps. GraphRAG focuses on the graph layer: who relates to whom, which clusters of entities form coherent communities, how to summarize across a large network. CKF focuses on the operational layer: which rule applies under which condition, which exception overrides which general principle, in which sequence procedures must execute.
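
A small sketch of what "which exception overrides which general principle" could look like once it is encoded as data. The rule records, the "overrides" field, and the resolution strategy are all assumptions chosen for illustration, not the CKF rule model.

```python
# Hypothetical rule records; 'when' is a predicate over a case dict,
# 'overrides' lists the ids of rules this rule takes precedence over.
rules = [
    {
        "id": "general",
        "when": lambda c: c["amount"] > 500,
        "action": "require_approval",
        "overrides": [],
    },
    {
        "id": "vip-exception",
        "when": lambda c: c["amount"] > 500 and c["vip"],
        "action": "auto_approve",
        "overrides": ["general"],
    },
]


def decide(case, rules):
    """Return the actions of all applicable rules that are not overridden."""
    applicable = [r for r in rules if r["when"](case)]
    overridden = {rid for r in applicable for rid in r["overrides"]}
    return [r["action"] for r in applicable if r["id"] not in overridden]


standard = decide({"amount": 800, "vip": False}, rules)
vip = decide({"amount": 800, "vip": True}, rules)
small = decide({"amount": 100, "vip": True}, rules)
```

With explicit precedence the exception wins only when its own condition holds, which is the kind of behavior a graph of related entities does not express on its own.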

A useful mental model: GraphRAG asks "what is connected to what?"; CKF asks "what should the agent do under which circumstances?". Many real-world systems benefit from both. A CKF package can include an entity graph as one of its layers, and the resulting graph can be processed by GraphRAG's community detection if global summarization is needed.

CKF is not a replacement for MCP

The Model Context Protocol by Anthropic is a protocol for connecting AI assistants to external systems. It defines how clients and servers communicate, what primitives they exchange (resources, prompts, tools), and how authentication and capability negotiation work. MCP is the plumbing: how an agent reaches for capabilities outside itself.

MCP is deliberately agnostic about the format of those capabilities. A resource served via MCP can be a PDF, a JSON file, a database record, a webpage, or any other content. This is correct design — MCP standardizes the protocol, not the payload.

CKF fills a different gap: the missing schema for what MCP resources should look like when they carry structured knowledge. A CKF package can be served as an MCP resource natively. An MCP-compatible agent receiving a CKF-formatted resource can reason about it without custom integration — typed sections, stable identifiers, explicit relations, and provenance are already there.

The relationship is layered: MCP is the protocol layer; CKF is one possible content format that flows over that protocol. The two compose without conflict.
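
The layering can be sketched as wrapping a package in the result of an MCP resources/read request. The contents/uri/mimeType/text shape follows the MCP specification as understood here and should be checked against the current version; the ckf:// URI, the media type, and the package contents are assumptions, since the source names none of them.

```python
import json


def as_mcp_read_result(uri, package):
    """Wrap a CKF package as the result payload of an MCP `resources/read` call."""
    return {
        "contents": [
            {
                "uri": uri,
                # Assumption: plain JSON media type; no CKF-specific type is given.
                "mimeType": "application/json",
                "text": json.dumps(package, sort_keys=True),
            }
        ]
    }


# Hypothetical package and URI, for illustration only.
package = {
    "id": "refund-policy",
    "rules": [{"id": "r1", "if": "amount > 500", "then": "require_approval"}],
}
result = as_mcp_read_result("ckf://policies/refund-policy", package)
decoded = json.loads(result["contents"][0]["text"])
```

The protocol envelope knows nothing about rules or provenance; it only carries text. The structure lives entirely in the payload.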

CKF is not a document format for humans

The PDF standard (ISO 32000-2) describes the Portable Document Format as "a digital form for representing electronic documents to enable users to exchange and view electronic documents independent of the environment in which they were created or in which they are viewed or printed." PDF was designed to preserve visual fidelity for human readers.

CKF preserves operational fidelity for machine readers. The two formats answer opposite questions:

PDF asks: "how should this document appear to a human?"
CKF asks: "how should this knowledge operate inside an agent?"

A human reading a PDF infers structure that the document does not explicitly encode — hierarchies, exceptions, scope, precedence, applicability conditions. A language model receiving the same PDF as raw text must reconstruct that structure at inference time, every time, from the prose. CKF moves the structural inference from inference time to compile time, and encodes the result as typed data.

CKF does not replace PDF. PDFs continue to exist for human distribution. CKF is the layer between PDF and agent — the format that compiles the human-readable document into machine-operable knowledge.

CKF is not a prompt library or system prompt format

LangChain Hub, Anthropic prompt libraries and similar resources collect reusable prompts and system instructions. They standardize how to ask a model to do things.

CKF does not contain prompts and does not specify how a model should be asked. A CKF package describes what is true about a domain, not how the agent should behave when applying that truth. The CKF design explicitly separates these concerns: knowledge lives in .ckf packages; agent behavior (system instructions, conversation patterns, refusal policies) lives in separate overlays.

This separation matters because the same knowledge can be consumed by agents with very different roles — a strict compliance checker, a permissive recommender, a Socratic tutor — without recompiling the knowledge. Prompts are about how to think; CKF is about what is the case.
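
A minimal sketch of that separation, assuming a hypothetical unit shape and two invented overlays. Neither the overlay names nor the way they are combined with knowledge come from the CKF design documents.

```python
# Knowledge: what is the case. The unit shape is a hypothetical illustration.
units = [
    {"id": "rule-001", "text": "Refunds above 500 EUR need manager approval."},
]

# Behavior overlays: how the agent should act. Names and wording are invented.
overlays = {
    "compliance_checker": "Flag every action that violates a rule. Cite rule ids.",
    "tutor": "Guide the learner to the applicable rule by asking questions.",
}


def build_context(overlay_name, units):
    """Combine one behavior overlay with the shared, unchanged knowledge units."""
    knowledge = "\n".join(f"[{u['id']}] {u['text']}" for u in units)
    return {"system": overlays[overlay_name], "knowledge": knowledge}


checker = build_context("compliance_checker", units)
tutor = build_context("tutor", units)
```

Both agents receive byte-identical knowledge and differ only in the overlay, so changing an agent's role does not require recompiling anything.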

So what is CKF?

CKF is an open file format for structured representations of knowledge, compiled from human documents and consumed by AI agents.

A CKF package is a typed document containing entities, concepts, conditional rules, exceptions, procedures, principles, heuristics, causal chains, mental models, retrieval-ready atomic units, and provenance back to the source. It is serialized as YAML, JSON or Markdown. It is produced by an automated compiler (typically an LLM with structured tool-calling) and consumed by retrieval systems, agents, or MCP-compatible clients.
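
For a sense of what "typed document with provenance, serialized as JSON" could mean in practice, here is a sketch using Python dataclasses. Every class and field name below is an assumption for illustration; the actual CKF schema is not reproduced here and covers more unit types than shown.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Provenance:
    document: str  # source file the unit was compiled from
    locator: str   # where in the source, e.g. a page or section


@dataclass
class Rule:
    id: str
    condition: str
    conclusion: str
    provenance: Provenance


@dataclass
class Package:
    source: str
    version: str
    rules: list = field(default_factory=list)


pkg = Package(
    source="policy.pdf",
    version="1.0.0",
    rules=[
        Rule(
            id="rule-001",
            condition="refund amount > 500 EUR",
            conclusion="manager approval required",
            provenance=Provenance(document="policy.pdf", locator="p4"),
        )
    ],
)

# asdict() recurses through nested dataclasses and lists.
serialized = json.dumps(asdict(pkg), indent=2)
restored = json.loads(serialized)
```

The round trip through JSON keeps the provenance attached to each rule, which is what lets a consumer trace any unit back to its source.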

The role CKF plays in the broader stack is that of a compiled intermediate representation for knowledge. The analogy is to LLVM IR in software compilers: source code is human-readable but not directly executable; compiler front ends transform it into an intermediate representation that can be optimized, verified, and ported across backends; backends and runtimes then lower or execute the IR.

Applied to knowledge:

Human documents (PDF, DOCX, manuals, policies, textbooks) play the role of source code.
CKF packages play the role of intermediate representation.
Agents, retrieval systems, MCP servers play the role of runtime.

Just as LLVM IR did not replace compiler front ends such as Clang or rustc but became the substrate they share, CKF is not positioned to replace RAG, GraphRAG, vector databases or MCP. It is positioned as the format these systems can share when their input is structured knowledge rather than unstructured text.

What CKF must demonstrate empirically to justify adoption, and what the CKF research program is designed to test, is whether moving structural inference from inference time to compile time produces measurable gains in retrieval precision, response faithfulness, token efficiency, auditability, and resistance to a class of failure modes called composition hallucinations.

If the empirical results hold up under the pre-registered Study 2, CKF is the format that fits between every PDF, manual, policy or guideline and every agent that consumes them. If not, CKF will still have contributed a vocabulary — composition hallucination, schema-stable retrieval, compiled knowledge — that the field can use independently.