PDF vs .ckf — Why Agentic AI Needs a New Knowledge Format
An argument for why documents designed for human readers are a poor fit for machine readers, and what a machine-native format might look like.
This post exists as a CKF package. Load it into your favorite LLM and discuss, summarize or apply its ideas.
Author: Paulo Tomazinho, PhD
Affiliation: CKF Research
"The document is not the knowledge."
For three decades, the document has been the atomic unit of recorded thought. We wrote books, then PDFs, then HTML, then Markdown — and we trusted the rectangle of text to carry meaning across time, software, and minds. That trust was always implicit. The reader was assumed to be human, patient, and capable of reconstructing structure from prose.
In 2026, the reader is no longer always human. More and more often it is a model: stochastic, tireless, expensive per token, and increasingly autonomous. When that reader encounters a PDF, it does not see a chapter. It sees a flat stream of characters interrupted by page numbers, ligatures, table fragments, and decorative noise. The format was designed for an entirely different kind of reader.
This essay argues that the Compiled Knowledge Format (CKF) addresses a structural gap between documents designed for human eyes and the agents that must now reason over them. It is a proposal under empirical investigation, not a validated solution. The methodology, caveats, and current state of evidence are described below.
Why this essay exists now
In late 2024, Anthropic published the Model Context Protocol, a specification for how language models call tools and discover external capabilities. Within twelve months, MCP had been adopted by OpenAI, Google DeepMind, and most major agent frameworks. Acting is half of cognition. MCP standardized the acting half.
The knowing half has no equivalent standard.
In April 2026, Andrej Karpathy published the LLM Wiki pattern, arguing that raw documents should function as "source code" and LLMs should act as "compilers" that build persistent, structured knowledge artifacts for repeated consultation, rather than rediscovering knowledge from scratch on every query. The post reached over 16 million views in weeks. Karpathy also helped popularize the term "context engineering" for the observation that most production failures of LLM systems are not reasoning failures but context failures: the model is asked to think with the wrong material.
The substrate of that material — the file an agent ingests when it loads "the company handbook," "the regulation," "the textbook," "the case law" — is still, overwhelmingly, the PDF. A format finalized by Adobe in 1993 to make printed pages portable across operating systems. A format whose original specification dedicates more pages to font embedding and color management than to semantics, because semantics were never the point.
CKF is a bet that this substrate needs to change when the primary consumer is a machine reader. It does not replace PDFs for human readers; it proposes a parallel artifact for machine readers, building on Karpathy's compiler analogy and attempting to operationalize it as an open format.
A short history of the document
Every recording medium is shaped by its reader.
- The codex (~1st century) optimized for the human eye scanning bound pages — random access within a finite volume.
- The printing press (~1450) optimized for mechanical reproduction at scale, fixing layout as a feature rather than a constraint.
- HTML (1991) optimized for hyperlinked browsing with separable structure and style.
- The PDF (1993) optimized for display fidelity across devices: it freezes a page so it looks the same on any screen or printer.
- Markdown (2004) optimized for writers who wanted plaintext that survived rendering.
- RAG chunks (2020–) optimized for vector similarity search over arbitrary text, accepting structural amnesia as the cost.
Each format made sense for its reader. None of them was designed for an autonomous reasoner that ingests thousands of documents per session, must cite every claim, and pays per token in latency, money, and error risk.
The hidden costs of feeding PDFs to LLMs
Every PDF that enters an agent's context window incurs costs that compound across a workflow. These are not universal constants — they depend heavily on document type, parser quality, and task — but they are real enough that a substantial research literature has emerged to address them.
Structural noise. A significant fraction of tokens in a typical PDF are layout artifacts: running headers, footers, page numbers, figure captions, table cells exploded by extraction. Foundational work on layout-aware parsing — LayoutLM, Nougat, Marker — exists precisely because raw PDF text is hostile to language models.
Semantic dilution. A single fact is often distributed across a definition, an example, a footnote, and a summary. Research on long-context retrieval, including Lost in the Middle, shows that even when the right span is in context, models often fail to use it effectively.
Missing relations. Documents encode entities and propositions, but the relations between them — causal, procedural, conditional, temporal — live in the reader's head. An LLM asked "what happens if the policy lapses?" must reconstruct a graph the document never wrote down. This is the error class that knowledge-graph–augmented retrieval (GraphRAG) was designed to address. We have described a specific sub-class of this failure as composition hallucination: outputs that contradict information present and retrievable in context because structural relations between fragments were left implicit.
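To make the missing-relations point concrete, here is a minimal Python sketch of what it looks like when relations are written down as edges instead of being left for the reader to infer. The entity names and relation labels are hypothetical, invented for illustration; they come neither from a real policy nor from the CKF specification.

```python
from collections import deque

# Hypothetical typed edges: (source, relation, target). Illustrative only.
EDGES = [
    ("policy_lapse", "causes", "coverage_suspended"),
    ("coverage_suspended", "causes", "claims_denied"),
    ("policy_lapse", "triggers", "reinstatement_window"),
    ("reinstatement_window", "requires", "back_premium_payment"),
]


def consequences(start, edges, relations=("causes", "triggers")):
    """Breadth-first traversal over the selected relation types.

    Returns the reachable nodes in the order they are discovered.
    """
    adjacency = {}
    for source, relation, target in edges:
        if relation in relations:
            adjacency.setdefault(source, []).append(target)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    # "What happens if the policy lapses?" becomes a graph walk.
    print(consequences("policy_lapse", EDGES))
```

With the edges explicit, the question is answered by traversal; with prose alone, the model has to rebuild the same graph from scattered sentences on every query.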
Retrieval degradation at scale. As corpora grow, embedding-only retrieval degrades on out-of-distribution domains, as the BEIR benchmark demonstrated. Hybrid pipelines are now standard not because they are elegant but because each modality patches the other's blind spots.
The compounding effect is real in the qualitative sense: practitioners who have shipped RAG systems in regulated domains consistently report these problems. How large the effect is, and under what conditions, is the empirical question that motivates the CKF Science Lab.
What knowledge looks like to an agent
For an autonomous reasoner, knowledge is not prose. It is executable context: a representation that supports look-up by intent, traversal by relation, citation by provenance, and selective loading by budget. It has at least the following properties:
- Typed claims. Every assertion is tagged with what kind of statement it is — a definition, a procedure, a constraint, a quantity, an exception.
- Named entities. Concepts and objects have stable identifiers so two passages referring to "the policyholder" can be aligned.
- Explicit relations. "X requires Y," "A precedes B," "C is a special case of D" are first-class edges, not paragraphs the model has to infer.
- Provenance. Every claim points back to its source span, with a hash and an offset, so any answer can be audited.
- Confidence and scope. Claims carry the conditions under which they hold and the source basis under which they were extracted.
A paragraph of prose has none of these. A .ckf package proposes to have all of them.
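The properties listed above can be sketched as a small data structure. This is an illustrative Python sketch, not the CKF schema: the class names, field names, and the choice of SHA-256 over character offsets are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    start: int    # character offset into the source text (assumed convention)
    end: int      # exclusive end offset
    sha256: str   # hash of the exact source span


@dataclass(frozen=True)
class Claim:
    claim_id: str            # stable identifier
    kind: str                # e.g. "definition", "procedure", "constraint"
    text: str
    entities: tuple          # stable entity identifiers the claim refers to
    provenance: Provenance
    source_basis: str        # e.g. "explicit" or "inferred"
    confidence: float
    scope: str = ""          # conditions under which the claim holds


def span_hash(source, start, end):
    """Hash the source span so the pointer can be checked later."""
    return hashlib.sha256(source[start:end].encode("utf-8")).hexdigest()


def verify(claim, source):
    """True if the span the claim points at still has the recorded hash."""
    p = claim.provenance
    return span_hash(source, p.start, p.end) == p.sha256


# Hypothetical source sentence and the claim extracted from it.
SOURCE = "A policyholder is the person who owns the insurance policy."
claim = Claim(
    claim_id="def-001",
    kind="definition",
    text="The policyholder owns the insurance policy.",
    entities=("ent:policyholder", "ent:insurance_policy"),
    provenance=Provenance(0, len(SOURCE), span_hash(SOURCE, 0, len(SOURCE))),
    source_basis="explicit",
    confidence=0.95,
)

if __name__ == "__main__":
    print(verify(claim, SOURCE))
```

The point of the hash is auditability: if the source text changes under the claim, `verify` fails and the claim can be flagged for re-extraction.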
Introducing .ckf
The .ckf file is an open, AI-native package format. A .ckf package contains:
- a manifest with identity, version, source provenance, license, and compatibility flags
- 22 typed sections (entities, concepts, conditional rules, heuristics, procedures, principles, anti-patterns, causal chains, atomic units, retrieval chunks, and others), each item with a stable identifier
- a provenance ledger — for every extracted unit, a span pointer back to the original source plus a content hash
- a source basis label for each item (explicit, inferred, synthesized, author_opinion, uncertain) and a confidence score
The format is described in the Specification and the Protocol. A reference compiler is at /compiler.
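As a rough illustration of the package shape described above, the following Python sketch builds a toy package as a plain dictionary, checks its source basis labels, and looks up an item by section and identifier. Only the five source basis labels come from the text; the key names, the example items, and the assumption that confidence lies between 0 and 1 are hypothetical. Consult the Specification for the real schema.

```python
import json

# Labels listed in the text; everything else in this sketch is hypothetical.
SOURCE_BASIS = {"explicit", "inferred", "synthesized", "author_opinion", "uncertain"}

package = {
    "manifest": {
        "id": "example-handbook",   # hypothetical identifier
        "version": "0.1.0",
        "source": "handbook.pdf",   # hypothetical source file
        "license": "CC-BY-4.0",
    },
    "sections": {
        "concepts": [
            {"id": "con-001", "text": "Probation lasts 90 days.",
             "source_basis": "explicit", "confidence": 0.95},
        ],
        "procedures": [
            {"id": "proc-001", "text": "Request leave through the HR portal.",
             "source_basis": "inferred", "confidence": 0.7},
        ],
    },
}


def validate(pkg):
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    for section, items in pkg["sections"].items():
        for item in items:
            label = f"{section}/{item.get('id')}"
            if item.get("source_basis") not in SOURCE_BASIS:
                problems.append(f"{label}: unknown source basis")
            # Assumption: confidence is a number in [0, 1].
            if not 0.0 <= item.get("confidence", -1.0) <= 1.0:
                problems.append(f"{label}: confidence out of range")
    return problems


def get_item(pkg, section, item_id):
    """Address an item directly by section type and stable identifier."""
    for item in pkg["sections"].get(section, []):
        if item["id"] == item_id:
            return item
    return None


if __name__ == "__main__":
    print(validate(package))
    print(get_item(package, "procedures", "proc-001")["text"])
    # The structure survives a JSON round trip unchanged.
    print(json.loads(json.dumps(package)) == package)
```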
The crucial property: .ckf is not a replacement for the source document. The PDF, the textbook, the regulation — they remain. The .ckf is the compiled form of their knowledge, the way an object file is the compiled form of a source file. You distribute the source for humans and the compiled artifact for machines.
PDF vs .ckf — condensed comparison
| Dimension | PDF | .ckf |
|---|---|---|
| Primary reader | Human eye | Autonomous reasoner |
| Structure | Linear narrative + visual layout | 22 typed sections |
| Retrieval | Re-derived per query (chunk + embed) | Precomputed, addressable by type and ID |
| Provenance | Implicit (page numbers) | Explicit (span hash + source basis) |
| Updates | Reissue the file | Diff at the item level (proposed) |
| Relations | Implicit in prose | Explicit edges between typed units |
| Confidence | None | Per-item confidence score + source basis label |
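The "diff at the item level" row can be illustrated with a short Python sketch. It assumes items are dictionaries carrying a stable `id` key, which is a hypothetical layout, and it performs a plain structural comparison. It is not the semantic diff that the proposal envisions, and it is not taken from the reference compiler.

```python
def diff_items(old_items, new_items):
    """Compare two versions of a section by stable item identifier."""
    old = {item["id"]: item for item in old_items}
    new = {item["id"]: item for item in new_items}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(i for i in old.keys() & new.keys() if old[i] != new[i]),
    }


# Hypothetical rules from two versions of the same package.
v1 = [
    {"id": "rule-001", "text": "Refunds are accepted within 30 days."},
    {"id": "rule-002", "text": "A receipt is required."},
]
v2 = [
    {"id": "rule-001", "text": "Refunds are accepted within 14 days."},
    {"id": "rule-003", "text": "Gift cards are final sale."},
]

if __name__ == "__main__":
    print(diff_items(v1, v2))
```

Because identifiers are stable, a consumer can in principle refresh only the items named in the diff instead of re-ingesting the whole document.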
MCP and CKF are complementary
A common first reaction is to ask whether .ckf overlaps with MCP. It does not. The two protocols address orthogonal halves of agent design.
       ┌───────────────────────────┐
       │           AGENT           │
       └─────────────┬─────────────┘
                     │
     ┌───────────────┴───────────────┐
     │                               │
┌────▼────┐                     ┌────▼────┐
│   MCP   │                     │   CKF   │
│  verbs  │                     │  nouns  │
│  layer  │                     │  layer  │
└────┬────┘                     └────┬────┘
     │                               │
tools, APIs,                knowledge packages,
side effects                typed claims, graph
MCP standardizes the verbs an agent can invoke. CKF proposes a standard for the nouns the agent reasons over. An MCP-only agent can act on the world but must re-derive what it knows on every prompt. A CKF-only agent has structured memory but no hands. A modern agentic system needs both.
Empirical state and methodology
The Science Lab runs controlled comparisons in three conditions: raw PDF text, raw TXT text, and compiled .ckf packages — same question battery, same model, same judge, same context budget, varying only the substrate.
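The design described above, where only the substrate varies, can be written down as a run matrix. The Python sketch below is a hypothetical illustration and not the Lab's actual harness: the condition labels, the placeholder model and judge names, and the budget value are all assumptions.

```python
from itertools import product

# The three substrates named in the text.
CONDITIONS = ("pdf", "txt", "ckf")

# Everything held constant across conditions. Values are placeholders.
FIXED = {
    "model": "model-under-test",
    "judge": "judge-model",
    "context_budget_tokens": 4000,
}


def build_runs(question_ids, conditions=CONDITIONS, fixed=FIXED):
    """One run per (question, condition) pair, with identical fixed settings."""
    return [
        {"question_id": qid, "condition": cond, **fixed}
        for qid, cond in product(question_ids, conditions)
    ]


if __name__ == "__main__":
    for run in build_runs(["q1", "q2"]):
        print(run)
```

Every question appears once under each condition, so any score difference between conditions cannot be attributed to a different question set, model, judge, or budget.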
An initial pilot with ten questions produced near-ceiling scores in faithfulness and completeness across all three conditions — the benchmark did not differentiate between formats. This outcome had two likely causes: the question battery was not calibrated for retrieval pressure (questions were answerable from any single retrieved chunk), and the same model family served as both agent and judge, introducing self-evaluation bias.
A pre-registered confirmatory study is in preparation with changes addressing both limitations: an independent judge model from a different model family, smaller context budgets designed to create retrieval pressure, multi-hop questions requiring composition across fragments, paired counterfactual questions, and a benchmark — COMPGAP — designed specifically to isolate composition hallucinations from other failure modes. Results will be reported regardless of direction.
Important caveats on any future results:
- Gains will not be uniform across document types. Highly narrative material (a novel) compresses differently than highly structured material (a tax code). CKF gains are expected to be largest where structure was always implicit.
- The comparison holds model, question set, judge, and budget constant. A PDF pipeline allowed to call its retriever many more times will close part of any gap at proportional cost.
- The judge model is held out from the generator model to reduce evaluator bias, following the practice recommended by Zheng et al.
Objections, honestly
"Isn't this just RAG with extra steps?" Conventional RAG is a runtime technique: at query time, embed, retrieve, inject. CKF moves structural work to compile time: the chunking, the graph, the index are precomputed. The extra steps are paid once, by the publisher, instead of every time, by every consumer. Whether that trade-off is favorable depends on how often the compiled artifact is consulted — an empirical question.
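The compile-time versus runtime trade-off reduces to simple break-even arithmetic. The Python sketch below shows the calculation under a deliberately simplified cost model (one fixed compile cost, constant per-query costs). The numbers in the example are made up for illustration and are not measurements from the Science Lab.

```python
import math


def break_even_queries(compile_cost, runtime_cost_per_query, compiled_cost_per_query):
    """Smallest number of queries N at which
    compile_cost + N * compiled_cost_per_query <= N * runtime_cost_per_query.

    Returns None when compiling never pays off under these numbers.
    """
    saving = runtime_cost_per_query - compiled_cost_per_query
    if saving <= 0:
        return None
    return math.ceil(compile_cost / saving)


if __name__ == "__main__":
    # Illustrative token counts only: compiling costs 50,000 tokens once,
    # a runtime-only query costs 3,000 tokens, a compiled query costs 1,000.
    print(break_even_queries(50_000, 3_000, 1_000))
```

A package consulted a handful of times may never recover its compile cost, while one consulted thousands of times amortizes it quickly. Where real workloads fall is the open empirical question.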
"Won't models eventually read PDFs perfectly?" Possibly. But even a hypothetical perfect reader still pays for every layout token it ingests, still re-derives the same structure on every query, and still has no item-level provenance to cite. Better readers reduce the noise; they do not change the structural asymmetry between a format designed for human eyes and one designed for machine consumption.
"Why YAML/JSON? Why not [other format]?" The protocol is serialization-agnostic at the model level. YAML is the canonical surface because it is human-readable, diff-friendly, and round-trips cleanly to JSON. JSON, CBOR, and other encodings are valid.
"What about copyright?" A .ckf derived from a copyrighted source inherits its licence. The format includes machine-readable licence terms and use restrictions. Compiling knowledge does not change its legal status.
Limitations
The empirical case for CKF is unresolved. The pilot produced ceiling effects that did not differentiate between formats. The confirmatory study will produce directional evidence, but its results are unknown at the time of writing.
Beyond empirical questions, several technical components of the CKF proposal remain design proposals rather than implemented features: incremental compilation, semantic diff for patch preservation, and regression testing for compiled packages. The distinction between what the current compiler does and what the full proposal envisions is documented explicitly in Docs.
Adoption also depends on interoperability that the protocol cannot guarantee unilaterally. For CKF to function as an intermediate representation between human documents and machine readers, ingestion tooling needs to mature across vector databases, graph stores, MCP clients, and retrieval frameworks. The specification is open under the MIT license, but adoption velocity is outside the project's control.
Closing
Documents were optimized for human readers. The claim of this essay is not that PDFs are bad documents — they are excellent documents, designed for their intended purpose. The claim is that when the primary consumer is an agent rather than a human, a format designed for agents may be more appropriate.
Whether that design advantage is large enough to matter empirically — and under what conditions — is the question the CKF research program is trying to answer.
The format specification, reference compiler, MCP server, and Science Lab are open at compiledknowledgeformat.org. Compile a package from your own documents and evaluate the output. The methodology for comparison is in the Lab. Independent replications that disagree with our findings are as valuable as those that confirm them.
References
- Anthropic. Introducing the Model Context Protocol. 2024.
- Karpathy, A. LLM Wiki. 2026.
- Xu et al. LayoutLM. KDD 2020.
- Blecher et al. Nougat. arXiv:2308.13418, 2023.
- Liu et al. Lost in the Middle. TACL, 2024.
- Edge et al. GraphRAG. arXiv:2404.16130, 2024.
- Thakur et al. BEIR. NeurIPS Datasets, 2021.
- Zheng et al. Judging LLM-as-a-Judge. NeurIPS, 2023.
- Tomazinho, P. Composition Hallucination. CKF Research, 2026.
- W3C. Semantic Web. Ongoing.
The format is experimental. The hypothesis is under test.