Introducing the Compiled Knowledge Format
An open format for compiling human documents into structured, agent-ready knowledge packages.
Talk to this article
This post exists as a CKF package. Load it into your favorite LLM and discuss, summarize or apply its ideas.
Author: Paulo Tomazinho, PhD
Affiliation: CKF Research
Documents were designed for human readers. A PDF carries headers, narrative flow, footnotes, and layout conventions that assume a reader who can bridge the gap between paragraphs, follow an implicit argument, and reconstruct structure from prose. Those assumptions served knowledge communication well for decades.
The reader has changed.
Language model agents now consume documents as input — regulations, manuals, clinical guidelines, textbooks, policy documents — and are expected to reason over them, answer questions from them, and make decisions grounded in them. These agents are not human readers. They process text token by token, pay for every layout artifact they ingest, and must reconstruct at inference time the structural relationships between concepts that the document left implicit.
Andrej Karpathy described the architectural consequence in April 2026: raw sources should function as "source code," and LLMs should act as "compilers" that build persistent, structured artifacts for repeated consultation — rather than rediscovering knowledge from scratch on every query. The LLM Wiki pattern that followed this intuition targets human readers assisted by LLMs. The Compiled Knowledge Format (CKF) extends the same compiler analogy toward the case where the primary consumer is the agent itself.
What CKF proposes
CKF is an open format — MIT license, client-side, bring-your-own-key — for transforming source documents (PDF, DOCX, Markdown, plain text) into structured packages that agents and retrieval systems can consume natively.
A compiled .ckf package is a single file serializable as .ckf.md, .ckf.yaml, or .ckf.json. It contains a metadata header plus 22 typed sections: entities, concepts, conditional rules, heuristics, procedures, principles, anti-patterns, causal chains, atomic units, retrieval chunks, and others. Each item carries span-level provenance back to the source document, a source basis label (explicit, inferred, synthesized, author_opinion, uncertain), and a confidence score.
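The package shape described above can be sketched as a small data model. This is an illustration only: the class names, field names, the character-offset representation of a span, and the 0 to 1 confidence range are assumptions, not the CKF schema. Only the five source basis labels come from the text.

```python
from dataclasses import dataclass, field
from typing import Literal

# The five source basis labels named in the text above.
SourceBasis = Literal["explicit", "inferred", "synthesized", "author_opinion", "uncertain"]


@dataclass(frozen=True)
class Provenance:
    """Character span in the source document (hypothetical representation)."""
    start: int  # inclusive offset
    end: int    # exclusive offset


@dataclass(frozen=True)
class Item:
    """One extracted unit inside a typed section (hypothetical field names)."""
    text: str
    provenance: Provenance
    source_basis: SourceBasis
    confidence: float  # assumed to lie in [0.0, 1.0]


@dataclass
class Package:
    """Metadata header plus typed sections, keyed by section name."""
    metadata: dict[str, str]
    sections: dict[str, list[Item]] = field(default_factory=dict)


def cited_span(source: str, item: Item) -> str:
    """Return the exact source passage an item points back to."""
    return source[item.provenance.start:item.provenance.end]


# Made-up source text and item, for illustration only.
source = "Refunds are issued within 30 days. Gift cards are excluded."
rule = Item(
    text="Refund window is 30 days",
    provenance=Provenance(start=0, end=34),
    source_basis="explicit",
    confidence=0.95,
)
package = Package(
    metadata={"title": "Refund policy"},
    sections={"conditional_rules": [rule]},
)
print(cited_span(source, rule))
```

Keeping provenance as a span on every item, rather than once per section, is what lets a consumer quote the supporting passage for any single claim.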
The source document is not replaced. It remains available for human reading. The compiled package is a parallel artifact: same knowledge, different form, different audience.
CKF proposes three advantages:
Structural stability. Typed sections impose schema on content that would otherwise be implicit in prose. An exception to a rule is encoded as an exception, not buried three paragraphs after the rule in a subordinate clause.
Provenance at the item level. Every extracted unit links to the exact span in the source. An agent citing a compiled package can trace each answer to a specific passage — not to a page number, to a line.
Amortized inference. Structural relationships extracted once at compile time do not need to be re-derived at every query. For documents consulted many times by many agents, the compile-once economics are favorable in principle.
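To make the first advantage concrete, here is a minimal sketch of a rule whose exception is a first-class field rather than a clause buried in prose. The `ConditionalRule` class, its fields, and the refund example are hypothetical and are not taken from the CKF specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ConditionalRule:
    """A rule with explicitly attached exceptions (hypothetical structure)."""
    condition: Callable[[dict], bool]
    outcome: str
    exceptions: list = field(default_factory=list)  # list of ConditionalRule

    def evaluate(self, facts: dict) -> Optional[str]:
        # The rule does not apply at all if its condition fails.
        if not self.condition(facts):
            return None
        # An applicable exception overrides the rule's own outcome.
        for exception in self.exceptions:
            overridden = exception.evaluate(facts)
            if overridden is not None:
                return overridden
        return self.outcome


# Made-up policy: refunds within 30 days, except for gift cards.
gift_card_exception = ConditionalRule(
    condition=lambda facts: facts["item"] == "gift_card",
    outcome="no refund",
)
refund_rule = ConditionalRule(
    condition=lambda facts: facts["days"] <= 30,
    outcome="refund",
    exceptions=[gift_card_exception],
)

print(refund_rule.evaluate({"days": 10, "item": "book"}))
print(refund_rule.evaluate({"days": 10, "item": "gift_card"}))
```

Because the exception travels with the rule, a consumer that retrieves the rule cannot miss the exception, which is the failure mode the prose form invites.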
Whether these advantages are large enough to justify the compilation cost in practice is an empirical question. The answer depends on document type, query complexity, context budget, and model family. It is the question the CKF research program is designed to answer.
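The cost side of that question can be framed as a simple break-even calculation. The sketch below assumes a one-time compile cost and a fixed per-query cost for each format, all measured in the same unit (tokens, for example). The function name and the numbers are illustrative, not measurements, and the calculation says nothing about answer quality.

```python
import math
from typing import Optional


def break_even_queries(compile_cost: float, raw_cost: float, compiled_cost: float) -> Optional[int]:
    """Smallest number of queries n at which compiling is no more expensive.

    Solves compile_cost + n * compiled_cost <= n * raw_cost for n.
    Returns None when the compiled form is not cheaper per query,
    in which case the compile cost is never recovered.
    """
    saving_per_query = raw_cost - compiled_cost
    if saving_per_query <= 0:
        return None
    return math.ceil(compile_cost / saving_per_query)


# Illustrative values only: 50,000 units to compile,
# 12,000 per raw query, 2,000 per compiled query.
print(break_even_queries(50_000, 12_000, 2_000))  # 5
```

Under these made-up numbers the compile cost is recovered at the fifth query. Real values would depend on the factors listed above.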
What exists today
Compiler — available at compiledknowledgeformat.org/compiler. Accepts PDF, TXT, and Markdown. Supports five LLM providers via BYOK (OpenAI, Anthropic, Google, DeepSeek, OpenRouter). Produces .ckf.md, .ckf.yaml, and .ckf.json output. A no-login demo is available.
CKF Viewer — renders the compiled package as an annotated structured view with entity graph, allowing audit of what was extracted and its source provenance.
Science Lab — runs controlled experiments in three parallel conditions (pdf_raw, txt_raw, ckf) across configurable question batteries with a blind judge model. Produces statistical output with Wilcoxon paired signed-rank test and bootstrap confidence intervals. Available at /lab.
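For readers unfamiliar with the two statistics named above, the following standard-library sketch computes a percentile bootstrap confidence interval for the mean paired difference and the Wilcoxon signed-rank statistic W. It is not the Lab's implementation. The scores are invented for illustration and are not pilot results, and the sketch stops at the statistic: a p-value would normally come from a statistics library such as SciPy's `scipy.stats.wilcoxon`.

```python
import random
import statistics


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def wilcoxon_w(diffs):
    """Signed-rank statistic min(W+, W-); zeros dropped, ties share average ranks."""
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    return min(w_plus, w_minus)


# Invented judge scores for the same eight questions under two conditions.
ckf_scores = [4, 5, 3, 5, 4, 4, 5, 3]
raw_scores = [3, 5, 2, 4, 4, 5, 4, 3]
diffs = [c - r for c, r in zip(ckf_scores, raw_scores)]

print(wilcoxon_w(diffs))
print(bootstrap_ci(diffs))
```

Both statistics operate on per-question differences, which is why the three conditions must be run on the same question battery.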
MCP Server — exposes CKF compilation, parsing, validation, and search as MCP tools via JSON-RPC over Streamable HTTP. Compatible with Claude Desktop, Cursor, Windsurf, and the Vercel AI SDK. Documented at /docs/mcp.
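As a rough illustration of what a tool invocation looks like on the wire, the sketch below builds a JSON-RPC 2.0 request using MCP's `tools/call` method. The tool name `ckf_search` and its `query` argument are hypothetical. The actual tool names and argument schemas are whatever the server documents at /docs/mcp. No HTTP call is made, since the endpoint URL is not given here.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per call


def tools_call_request(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = tools_call_request("ckf_search", {"query": "refund window"})
body = json.dumps(request)
print(body)
```

With Streamable HTTP, a body like this would be sent as an HTTP POST to the server's MCP endpoint. MCP clients such as those listed above normally handle this step themselves.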
Public API — two endpoints for compilation and retrieval, with rate limits and authentication documented at /docs/api.
What is proposed but not yet built
Several capabilities described in CKF documentation and earlier writing are design proposals rather than implemented features:
Incremental compilation — when a source document changes, recompiling only affected units rather than the full document. Currently, the compiler reprocesses entire files.
Semantic diff and patch preservation — distinguishing paraphrastic rewrites (preserving human corrections) from substantive logical changes (routing to review). The mechanism for detecting logical equivalence reliably is an open research problem.
Regression testing for knowledge — running a predefined Q&A battery against each new compiled version before deploying to production. The Lab implements a related capability for benchmarking; the production-grade version is in design.
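One possible starting point for the incremental compilation proposal above is content hashing: split the source into units, hash each one, and recompile only units whose hash is new. The sketch below is a hypothetical illustration, not a planned design. It treats blank-line-separated paragraphs as units and ignores the harder problem of units that depend on one another.

```python
import hashlib


def unit_hashes(text: str) -> dict:
    """Map the SHA-256 digest of each paragraph to the paragraph text."""
    units = [part.strip() for part in text.split("\n\n") if part.strip()]
    return {hashlib.sha256(unit.encode("utf-8")).hexdigest(): unit for unit in units}


def units_to_recompile(old_text: str, new_text: str) -> list:
    """Return paragraphs in new_text that did not appear verbatim in old_text."""
    old = unit_hashes(old_text)
    new = unit_hashes(new_text)
    return [unit for digest, unit in new.items() if digest not in old]


# Made-up document versions, for illustration only.
old_version = "Refunds within 30 days.\n\nGift cards excluded.\n\nContact support."
new_version = "Refunds within 45 days.\n\nGift cards excluded.\n\nContact support."

print(units_to_recompile(old_version, new_version))
```

A verbatim hash flags any edit, including a paraphrase that leaves the logic unchanged, which is why it does not address the semantic diff problem described above.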
The distinction between these two sets matters. The components listed under What exists today can be evaluated now. The items in this section are proposed direction, not current capability.
What the evidence shows so far
An initial pilot ran ten questions in three conditions (PDF raw, TXT raw, CKF) using the Science Lab. Results showed near-ceiling scores in faithfulness and completeness across all three conditions — the benchmark did not differentiate between formats. Post-hoc analysis identified two design limitations: the question battery was not calibrated for retrieval pressure, and the same model family served as both agent and judge, introducing self-evaluation bias.
A pre-registered confirmatory study is in preparation with structural changes: an independent judge model from a different model family, smaller context budgets to create retrieval pressure, multi-hop questions requiring composition across fragments, paired counterfactual questions, and a benchmark — COMPGAP — designed specifically to isolate what we term composition hallucination: failures where the model has the relevant information but fails to integrate relations between fragments correctly.
Results from this study will be reported regardless of direction.
How CKF relates to adjacent technologies
CKF does not replace retrieval-augmented generation, GraphRAG, vector databases, knowledge graphs, the Model Context Protocol, or fine-tuning. A full treatment is in What CKF is not.
The shortest version: CKF is the schema of what those systems carry when their content is structured knowledge rather than unstructured text. A .ckf package can be indexed by a vector database, traversed by a graph algorithm, or served as a native MCP resource. The format composes with the existing stack.
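As a small illustration of that composition, the sketch below flattens a package into records that a vector database or search index could ingest. The JSON layout (`metadata`, `sections`, and per-item `id` and `text` keys) is an assumption made for this example. The real schema is defined by the CKF specification.

```python
import json

# Hypothetical .ckf.json content; key names are assumptions, not the CKF schema.
raw_package = json.dumps({
    "metadata": {"title": "Refund policy"},
    "sections": {
        "retrieval_chunks": [{"id": "c1", "text": "Refunds are issued within 30 days."}],
        "entities": [{"id": "e1", "text": "Gift card"}],
    },
})


def index_records(package_json: str) -> list:
    """Flatten every typed section into (id, section, text) records."""
    package = json.loads(package_json)
    records = []
    for section_name, items in package["sections"].items():
        for item in items:
            records.append({
                "id": item["id"],
                "section": section_name,
                "text": item["text"],
            })
    return records


records = index_records(raw_package)
for record in records:
    print(record)
```

Carrying the section name into each record lets the index filter by type, for example retrieving only conditional rules, without changing how the index itself works.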
Participating
The specification, reference compiler, and MCP server are open under the MIT license at github.com/tomazinho/open-ckf-compiled-knowledge-format. Discussion happens on Discord. The pre-registration protocol and benchmark will be published when the confirmatory study is submitted.
If you compile a package from your own documents and find something that doesn't work — in the output schema, in the compiler behavior, in the MCP tools — that is actionable signal. Open an issue or bring it to Discord.