Research · April 26, 2026 · 18 min read

Composition Hallucination in RAG, GraphRAG, and agents: when having the context is not enough

RAG fails in four ways: parametric, retrieval, contextual, and compositional. This article names and describes the fourth type, Composition Hallucination, shows why adding chunks or context does not solve it, and proposes a diagnosis and implications for those building RAG, GraphRAG, and agents.

Paulo Tomazinho, PhD · CKF Research

Composition Hallucination in RAG, GraphRAG, and Agents: When Having the Context Is Not Enough

Paulo Tomazinho, PhD · CKF Research · CITAP AI Lab

Abstract

RAG fails in four distinct ways: parametric (the model fabricates), retrieval (the relevant chunk was not recovered), contextual (the chunk was ignored in a long context), and compositional (the chunks are present, but the model fails to integrate the relationship between them). This paper names and describes the fourth type — Composition Hallucination — shows why increasing chunks or context size does not resolve it, and proposes a diagnostic protocol and practical implications for those building RAG, GraphRAG, and agent systems.

Think of a recipe written across separate pages:

Page 1: "Add 2 tablespoons of sugar." and "Bake for 50 minutes."
Page 7: "If the cake is for diabetics, replace sugar with sweetener."
Page 12: "Sweetener changes baking time: reduce by 10 minutes."

An assistant that read all three pages knows what each one says. But if asked "how long should a diabetic cake bake?", it needs to compose three pieces of information in sequence — and the relationship between them was never written down explicitly anywhere.

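If it answers "Bake for 50 minutes", it has failed. The correct answer is 40 minutes: the sweetener substitution on page 7 triggers the 10-minute reduction on page 12. The assistant did not fail because it had not read the pages. It failed because it did not integrate the relationships between them.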

That is Composition Hallucination.

A RAG system receives the question:

"I am a manager. Can I approve an international expense of $3,000?"

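The corpus contains two relevant clauses: one allows managers to approve operational expenses up to $5,000, and the other requires CFO approval for every international expense. Retrieval returned both chunks. Both are in the context. The model read each one. The response was: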

"Yes, because managers can approve expenses up to $5,000."

The correct information was there. The correct chunk was retrieved. The context was short. The model could explain each sentence in isolation.

And the answer was wrong.

I began observing this phenomenon repeatedly while building assistants, intelligent tutors, RAG systems, and agents grounded in authored works, instructional materials, and structured knowledge bases. It did not resemble traditional hallucination. When I reviewed the existing scientific taxonomies, I found that none of them named precisely what I was observing in practice.

Ji et al. organize a broad survey on hallucination in natural language generation. Huang et al. propose a taxonomy focused on large language models. Maynez et al. consolidate the distinction between factuality and faithfulness in abstractive summarization. Liu et al. show that models can fail to use information positioned at certain points of long contexts — the phenomenon known as lost in the middle. Works such as FActScore, RAGTruth, and HaluEval advance the evaluation of factuality and faithfulness. These works are foundational.

But the error I was observing was not simply parametric, nor purely a retrieval failure, nor merely a positional attention failure.

I began calling this phenomenon Composition Hallucination.

1. The error RAG does not resolve on its own

RAG emerged as an elegant response to an important limitation of language models: instead of relying solely on parametric memory, the system retrieves external information and injects it into the generation context. Lewis et al. formalized this architecture by combining parametric and retrievable non-parametric memory, opening the path for most modern question-answering systems over document corpora.

In practice, this led many teams to an implicit assumption:

If retrieval returns the right chunk, the model will be able to answer correctly.

This assumption is partially true, but incomplete.

In simple knowledge bases, where the answer depends on a single factual fragment, RAG often works well. The question points to a chunk. The retriever finds the chunk. The model transforms the chunk into an answer.

But real knowledge bases rarely work this way.

Legal documents, corporate policies, medical guidelines, educational materials, compliance systems, operational manuals, and institutional knowledge bases are composed of relationships: general rule, exception, exception to the exception, condition of applicability, precedence, temporal dependency, scope, hierarchy, normative conflict, update, restriction, permission, obligation, contraindication, required prior procedure.

When the answer depends on these relationships, retrieving the correct fragments is no longer sufficient.

The model must compose.

And this is precisely where Composition Hallucination occurs.

2. A minimal example

Consider the case above.

An internal policy has two clauses:

Managers may approve operational expenses up to $5,000.

And, in a separate passage:

International expenses require CFO approval, regardless of the amount.

The user asks:

I am a manager. Can I approve an international expense of $3,000?

A RAG system retrieves both passages. Both are in the context. The model understands each sentence individually.

Nevertheless, it responds:

Yes, because managers can approve expenses up to $5,000.

This response is not a parametric hallucination. The model did not fabricate an external fact.

It is not a retrieval hallucination. The relevant chunk was retrieved.

It is not necessarily a lost in the middle case. The chunks may be adjacent and the context may be short.

The error lies in the failure of composition between the clauses.

The model applied the general rule and ignored the override relationship of the exception.

The information was present. The relationship was implicit. The composition failed.
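
The gap between the wrong and the right answer can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation from the article: the `Clause` type, its `overrides` field, and both functions are hypothetical names introduced here. `naive_answer` mimics a reader that applies the first clause that matches, while `composed_answer` resolves the override relationship before answering.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Expense:
    amount: float
    international: bool
    requester_role: str


@dataclass
class Clause:
    """One policy clause. `overrides` is a hypothetical field that makes the
    exception relationship explicit instead of leaving it implicit in the text."""
    name: str
    applies: Callable[[Expense], bool]
    allows_manager_approval: bool
    overrides: list[str] = field(default_factory=list)


CLAUSES = [
    Clause(
        name="manager_limit",
        applies=lambda e: e.requester_role == "manager" and e.amount <= 5000,
        allows_manager_approval=True,
    ),
    Clause(
        name="international_cfo",
        applies=lambda e: e.international,
        allows_manager_approval=False,
        overrides=["manager_limit"],
    ),
]


def naive_answer(expense: Expense) -> bool:
    """Apply the first matching clause and ignore relationships between clauses."""
    for clause in CLAUSES:
        if clause.applies(expense):
            return clause.allows_manager_approval
    return False


def composed_answer(expense: Expense) -> bool:
    """Discard every matching clause that another matching clause overrides."""
    matching = [c for c in CLAUSES if c.applies(expense)]
    overridden = {name for c in matching for name in c.overrides}
    effective = [c for c in matching if c.name not in overridden]
    # The manager may approve only if every remaining clause allows it.
    return bool(effective) and all(c.allows_manager_approval for c in effective)


request = Expense(amount=3000, international=True, requester_role="manager")
print(naive_answer(request))     # True: general rule applied, exception ignored
print(composed_answer(request))  # False: CFO approval is required
```

Both functions see the same two clauses. Only the second one uses the relationship between them, which is the point of the example.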

3. Why this deserves a new category

Existing taxonomies are useful, but they frequently classify the error by the external appearance of the response: is the answer factual or not? Is it faithful to the context or not? Does it contradict the source?

These questions remain important. But Composition Hallucination requires a different question:

Which cognitive operation failed?

The hypothesis is that, in this case, a specific operation fails: the relational integration of fragments. The model did not fail to access the information. It failed to correctly apply the relationship between pieces of information.

This changes the taxonomy.

We can distinguish at least four operational error types:

| Error type | Where the failure occurs | Correct information in context? | General example |
| --- | --- | --- | --- |
| Parametric | Model's internal memory | No | Model fabricates a historical fact, author, method, or reference |
| Retrieval | Search, ranking, or chunk selection | No | System retrieves the general rule but not the exception |
| Contextual | Use of available context | Yes, but poorly utilized | Model ignores information present in long or poorly positioned context |
| Compositional | Implicit relationship between fragments | Yes | Model retrieves rule and exception but applies the wrong one |

📌 Operational definition

Composition Hallucination occurs when a model produces a response that contradicts information present, retrieved, and readable in the context, because it fails to correctly compose the implicit relationships between retrieved fragments.

Three simultaneous conditions define the phenomenon:

Informational sufficiency — all necessary fragments are present in the context

Local readability — the model can correctly interpret each fragment in isolation

Relational failure — the answer requires composing fragments through implicit relationships (exception, precedence, dependency, scope, sequence, condition) that the model does not apply correctly
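
The three conditions can be written as a single predicate. The sketch below is illustrative only: `Observation` and its field names are hypothetical, and each field stands for a check that an evaluator would have to perform by hand or by script.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Findings for one question. All field names are assumptions made for this sketch."""
    answer_correct: bool
    all_fragments_in_context: bool      # informational sufficiency
    each_fragment_read_correctly: bool  # local readability
    requires_implicit_relation: bool    # exception, precedence, dependency, scope, ...


def is_composition_hallucination(obs: Observation) -> bool:
    # The answer is wrong although all three conditions hold at the same time.
    return (
        not obs.answer_correct
        and obs.all_fragments_in_context
        and obs.each_fragment_read_correctly
        and obs.requires_implicit_relation
    )


expense_case = Observation(
    answer_correct=False,
    all_fragments_in_context=True,
    each_fragment_read_correctly=True,
    requires_implicit_relation=True,
)
missing_chunk_case = Observation(
    answer_correct=False,
    all_fragments_in_context=False,
    each_fragment_read_correctly=True,
    requires_implicit_relation=True,
)
print(is_composition_hallucination(expense_case))        # True
print(is_composition_hallucination(missing_chunk_case))  # False: a retrieval failure
```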

This distinction is practical, not merely conceptual.

Each error type requires a different intervention.

If the problem is parametric: use retrieval, tools, or knowledge updates.

If the problem is retrieval: improve chunking, embeddings, reranking, query rewriting, or indexing.

If the problem is contextual: improve the positioning of relevant chunks, reduce context, or use compression.

If the problem is compositional: none of these interventions resolves it. Relationships must be represented explicitly, before retrieval.

4. Why larger context does not resolve it

An intuitive response would be: just expand the context window and include the entire document. The model will read everything and can compose.

This intuition is partially correct — and partially misleading.

Expanding context can reduce retrieval failures: if the entire document is present, no chunk will be lost to imperfect search. But it does not eliminate compositional failure.

Liu et al. show that language models perform worse when the relevant information sits in the middle of a long context. The lost in the middle effect is real: performance tends to be highest when the relevant information appears at the beginning or the end of the context.

But even in short contexts, with relevant chunks explicitly adjacent, Composition Hallucination can occur. The failure is not one of access — it is one of relationship.

The relationship between a general rule and an exception, between a condition and a procedure, between a precedence and a dependency — these relationships are rarely explicit in the source text. They are implicit in the structure of the domain.

A larger model may reduce how often the error occurs. A larger context may reduce how often a needed fragment is missing. But the structure of the problem remains: the relationship still has to be inferred.

5. Why GraphRAG does not fully resolve it

GraphRAG represents an important advance over traditional RAG. Instead of retrieving isolated text chunks, GraphRAG extracts entities and relationships and organizes them into a graph. This enables reasoning over semantic neighborhoods, concept communities, and entity hierarchies.

Edge et al. report that this approach improves the comprehensiveness and diversity of answers to questions that require global synthesis over a large corpus.

But GraphRAG primarily works with structural relationships between entities: A belongs to B, C instantiates D, E is of type F.

Composition Hallucination frequently occurs in relationships of a different kind: normative relationships (A overrides B when C), procedural relationships (A precedes B when D), conditional relationships (A applies except when E), scope relationships (F applies only to G), temporal relationships (H is no longer valid since I), precedence relationships (J has priority over K in context L).

Entity graphs capture structural relationships well. They capture normative, procedural, conditional, and temporal relationships less well.

Representing these normative and procedural relationships requires richer knowledge schemas than entity-relation-entity. It requires something closer to a typed compositional representation — which the CKF proposes through its 22 sections and explicit relationship fields.
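
To make the contrast concrete, here is a minimal sketch of an entity triple next to a typed, conditional relation. It does not reproduce the CKF schema, which this article does not spell out: `RelationKind`, `EntityTriple`, and `TypedRelation` are hypothetical names, and the enum covers only a subset of the relationship kinds listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationKind(Enum):
    # Values are the connective used when the relation is rendered as text.
    OVERRIDES = "overrides"
    PRECEDES = "precedes"
    APPLIES_ONLY_TO = "applies only to"
    HAS_PRIORITY_OVER = "has priority over"


@dataclass(frozen=True)
class EntityTriple:
    """Plain entity-relation-entity edge, as in a typical entity graph."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class TypedRelation:
    """Hypothetical richer edge: a typed relation plus the condition under which it holds."""
    source: str
    kind: RelationKind
    target: str
    condition: Optional[str] = None

    def as_sentence(self) -> str:
        sentence = f"{self.source} {self.kind.value} {self.target}"
        if self.condition:
            sentence += f" when {self.condition}"
        return sentence + "."


triple = EntityTriple("international expense", "is_a", "expense")
relation = TypedRelation(
    source="The international-expense clause",
    kind=RelationKind.OVERRIDES,
    target="the manager approval limit",
    condition="the expense is international",
)
print(relation.as_sentence())
```

The triple says what an international expense is. Only the typed relation says which clause wins, and under which condition, which is the information the expense example needs.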

6. High-risk domains

Composition Hallucination is not equally likely across all domains. It is especially likely in domains where the correct answer depends on relationships between passages:

Legal and compliance documents: general rules with exceptions, exceptions to exceptions, precedence hierarchies, temporal validity, different jurisdictions, conditions of applicability.

Corporate policies: approvals conditioned by amount, type, category, role, country. Exceptions for special cases. Procedures depending on pre-conditions.

Medical and clinical guidelines: contraindications conditioned on comorbidities, drug interactions, protocols with mandatory sequences, dosages depending on multiple simultaneous factors.

Educational materials and teaching methodologies: pedagogical principles with conditions of application, strategies that work at certain moments but not others, implementation sequences that cannot be reversed, exceptions based on the student's cognitive state.

Technical knowledge bases: specifications with interdependent parameters, manuals with conditional procedures, documentation with versions and deprecations.

In all these domains, the right text is not sufficient. The relationship between the passages is what determines the correct answer.

7. The problem in agents

In conversational assistants, Composition Hallucination produces a wrong answer.

In agents, it produces a wrong action.

This difference is critical.

An agent does not merely respond. It decides, calls tools, executes workflows, sends messages, modifies records, approves requests, creates documents, triggers processes.

If the agent fails to compose rule, exception, and condition, it may:

approve what should be blocked
deny what should be permitted
execute steps out of order
ignore a pre-condition
apply an expired policy
treat an exception as a general rule
treat a recommendation as an obligation
treat a contextual permission as a universal permission

Composition Hallucination, in agents, is a governance failure — not merely a language failure.
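
One way to treat this as a governance problem is to check explicit pre-conditions and vetoes outside the model, before any tool runs. The sketch below is an assumption of this edit, not a mechanism described in the article: `Action`, `BlockedAction`, `execute_guarded`, and the fact names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


class BlockedAction(Exception):
    """Raised when an action must not run under the current facts."""


@dataclass
class Action:
    name: str
    preconditions: list[str] = field(default_factory=list)  # facts that must hold
    vetoes: list[str] = field(default_factory=list)         # facts that forbid the action


def execute_guarded(action: Action, facts: set[str], run: Callable[[], str]) -> str:
    """Run the tool call only if all pre-conditions hold and no veto is triggered."""
    missing = [p for p in action.preconditions if p not in facts]
    triggered = [v for v in action.vetoes if v in facts]
    if missing or triggered:
        raise BlockedAction(f"{action.name}: missing={missing}, vetoed_by={triggered}")
    return run()


approve = Action(
    name="approve_expense",
    preconditions=["requester_is_manager", "amount_within_manager_limit"],
    vetoes=["expense_is_international"],
)
facts = {
    "requester_is_manager",
    "amount_within_manager_limit",
    "expense_is_international",
}

try:
    execute_guarded(approve, facts, run=lambda: "approved")
except BlockedAction as err:
    print(f"blocked: {err}")
```

The guard does not make the model compose better. It only ensures that a composition error in the model cannot turn directly into an executed action.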

8. How to diagnose in your system

A minimum protocol for identifying Composition Hallucination in an existing system:

Step 1 — Select questions that require composition. Choose questions that can only be answered correctly by integrating two or more corpus fragments through a non-explicit relationship.

Step 2 — Confirm that the fragments were retrieved. Verify in the final context that all relevant passages are present. If they are not, the error is retrieval, not compositional.

Step 3 — Confirm that each fragment is locally correct. Ask the model about each passage in isolation. If it answers each correctly but fails the composition, the phenomenon is identified.

Step 4 — Apply the counterfactual. Formulate a version of the question where the relationship between fragments is explicitly written in the context. If the model succeeds in this version and fails in the implicit version, the failure is compositional.
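
Steps 2 to 4 can be sketched as a small routine. Everything here is an assumption made for illustration: the `Model` callable interface, the substring-based grading in `contains`, and the `stub_model` that stands in for a real LLM call. A real evaluation would need a proper grader and a real model client.

```python
from typing import Callable, Sequence

# Hypothetical model interface: (context, question) -> answer text.
Model = Callable[[str, str], str]


def contains(answer: str, expected: str) -> bool:
    # Crude substring grading, used only to keep the sketch self-contained.
    return expected.lower() in answer.lower()


def diagnose(ask: Model, context: str, fragments: Sequence[str],
             local_checks: Sequence[tuple[str, str]], question: str,
             expected: str, explicit_relation: str) -> str:
    # Step 2: every required fragment must be present in the final context.
    if not all(fragment in context for fragment in fragments):
        return "retrieval"
    # Nothing to diagnose if the implicit version is already answered correctly.
    if contains(ask(context, question), expected):
        return "no failure"
    # Step 3: each fragment must be read correctly in isolation.
    for fragment, (local_question, local_expected) in zip(fragments, local_checks):
        if not contains(ask(fragment, local_question), local_expected):
            return "local reading"
    # Step 4: counterfactual with the relationship written into the context.
    if contains(ask(context + "\n" + explicit_relation, question), expected):
        return "compositional"
    return "inconclusive"


def stub_model(context: str, question: str) -> str:
    """Toy stand-in for an LLM that composes only when the relation is explicit."""
    q = question.lower()
    if "limit" in q:
        return "Managers may approve up to $5,000."
    if q.startswith("who"):
        return "The CFO."
    if "overrides" in context:
        return "No, CFO approval is required."
    return "Yes, because managers can approve expenses up to $5,000."


FRAGMENTS = [
    "Managers may approve operational expenses up to $5,000.",
    "International expenses require CFO approval, regardless of the amount.",
]
result = diagnose(
    ask=stub_model,
    context="\n".join(FRAGMENTS),
    fragments=FRAGMENTS,
    local_checks=[("What is the manager approval limit?", "5,000"),
                  ("Who approves international expenses?", "CFO")],
    question="I am a manager. Can I approve an international expense of $3,000?",
    expected="CFO",
    explicit_relation="The international-expense clause overrides the manager limit.",
)
print(result)  # compositional
```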

9. Interventions that work

Explicit representation of relationships: instead of storing text, store typed relationships. "Exception X overrides rule Y when condition Z is true." This is what the CKF does with fields such as conditional_rules, exceptions, overrides, and causal_chains.

Forced chain-of-thought: instruct the model to explicitly list the relevant relationships before generating the response. This does not eliminate the problem but reduces its frequency by making compositional reasoning visible and auditable.
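
A relation-first prompt of this kind might look like the sketch below. The function name and the exact wording of the instructions are assumptions of this example. The article does not prescribe a specific prompt.

```python
def build_relation_first_prompt(fragments: list[str], question: str) -> str:
    """Build a prompt that asks for the relationships before the answer."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(fragments, start=1)
    )
    return (
        "Use only the numbered passages below.\n\n"
        f"{numbered}\n\n"
        "Before answering, complete these steps in order:\n"
        "1. List every passage that applies to the question.\n"
        "2. For each pair of applicable passages, state the relationship between "
        "them (exception, precedence, dependency, scope, sequence, condition) "
        "or write 'none'.\n"
        "3. Where passages conflict, state which one prevails and why.\n"
        "4. Give the final answer, citing passage numbers.\n\n"
        f"Question: {question}"
    )


prompt = build_relation_first_prompt(
    [
        "Managers may approve operational expenses up to $5,000.",
        "International expenses require CFO approval, regardless of the amount.",
    ],
    "I am a manager. Can I approve an international expense of $3,000?",
)
print(prompt)
```

Because the listed relationships appear in the output, a reviewer can audit step 2 and step 3 separately from the final answer.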

Pre-computation of compositions: identify the most frequent questions requiring composition and pre-compute correct answers with relationships made explicit. Store as heuristics or explicit rules.

Evaluation with paired counterfactuals: include in any evaluation pipeline questions where the relationship is implicit and a counterfactual version where it is explicit. The performance gap measures the system's compositional vulnerability.
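
The performance gap can be computed as accuracy on the explicit versions minus accuracy on the implicit versions. The sketch below assumes that definition. The `PairedResult` type, the function name, and the sample results are invented for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PairedResult:
    case_id: str
    implicit_correct: bool  # relationship left implicit in the context
    explicit_correct: bool  # same question, relationship written out


def compositional_gap(results: Sequence[PairedResult]) -> float:
    """Accuracy on explicit versions minus accuracy on implicit versions."""
    if not results:
        raise ValueError("at least one paired result is required")
    n = len(results)
    acc_implicit = sum(r.implicit_correct for r in results) / n
    acc_explicit = sum(r.explicit_correct for r in results) / n
    return acc_explicit - acc_implicit


# Made-up outcomes, used only to show the arithmetic.
results = [
    PairedResult("policy-01", implicit_correct=False, explicit_correct=True),
    PairedResult("policy-02", implicit_correct=False, explicit_correct=True),
    PairedResult("health-01", implicit_correct=True, explicit_correct=True),
    PairedResult("health-02", implicit_correct=False, explicit_correct=True),
]
gap = compositional_gap(results)
print(gap)  # 0.75 = 4/4 explicit accuracy minus 1/4 implicit accuracy
```

A gap near zero suggests that remaining errors have another cause. A large gap suggests that the system depends on relationships being spelled out.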

10. The COMPGAP benchmark

If Composition Hallucination is a real and operationally distinct category, we need benchmarks that isolate this failure.

The COMPGAP benchmark was designed specifically for this purpose. Each COMPGAP case:

guarantees that all relevant fragments are present in the context
verifies that each fragment is individually retrievable and locally correct
includes a counterfactual case to isolate the phenomenon
controls position and context budget to separate Composition Hallucination from adjacent failures

The first cases cover corporate policy and healthcare domains. Upcoming cases will cover legal, compliance, and educational knowledge bases.

The benchmark is openly available at github.com/tomazinho/open-ckf-compiled-knowledge-format for replication, contribution, and independent validation.

11. Limitations

This paper introduces a conceptual category and a diagnostic protocol. Several open questions remain.

Empirical validation is pending. The COMPGAP benchmark is in early development. No large-scale measurement has yet been conducted of how frequently Composition Hallucination, as distinct from retrieval and parametric failures, accounts for RAG failures.

The boundary with adjacent phenomena is not always sharp. Some failures may involve both retrieval and composition. The diagnostic protocol in Section 8 is designed to isolate them, but edge cases exist.

Intervention effectiveness varies by domain and model. Forced chain-of-thought and explicit relationship representation reduce error frequency in the domains we have tested. Whether they generalize to all domains is an open question.

CKF as an intervention is a proposal under empirical investigation. The 22-section schema and explicit relationship fields are a structural proposal for reducing compositional vulnerability. A pre-registered confirmatory study (COMPGAP Study 2) is in preparation and will provide directional evidence.

12. The thesis

A significant portion of current failures in RAG, GraphRAG, assistants, and agents does not occur because AI cannot retrieve knowledge, but because it cannot correctly compose the knowledge it retrieves.

This does not make retrieval less important. It makes knowledge architecture more important.

The next step in reliable AI engineering will not only be improving embeddings, rerankers, or context windows. It will be representing better: rules, exceptions, dependencies, scopes, hierarchies, conflicts, precedences, procedures, validity conditions.

In other words: moving from bases of retrievable text to bases of compositional knowledge.

13. Conclusion

Composition Hallucination names a failure that many AI practitioners have already observed, but which had no precise vocabulary.

The model does not fail because it doesn't know. It fails because it doesn't compose.

This distinction changes the diagnosis, the intervention, and the evaluation. More chunks do not solve a structural problem. More context does not resolve a relational failure. A larger model may reduce the error rate, but it does not eliminate the cause.

The most promising intervention is prior to retrieval: representing relationships explicitly so that the model does not need to infer them from scratch at every query.

Knowledge is not merely stored information. Knowledge is related information.

And without explicit relationships, even the best context may remain insufficient.

To test whether your system has this problem, use the diagnostic protocol in Section 8. The COMPGAP benchmark is available at github.com/tomazinho/open-ckf-compiled-knowledge-format.

References

Edge, D., Trinh, H., Cheng, N., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130.
Huang, L., Yu, W., Ma, W., et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
Ji, Z., Lee, N., Frieske, R., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
Liu, N. F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics.
Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). On Faithfulness and Factuality in Abstractive Summarization. ACL.
Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long-form Text Generation. EMNLP.
Niu, C., Wu, Y., Zhu, J., et al. (2024). RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. ACL.
Tomazinho, P. (2026). Compiled Knowledge Format: Specification v0.2. compiledknowledgeformat.org.
