GraphRAG: when flat-chunk retrieval hits 25.7% and what replacing it costs

On the MultiHop-RAG benchmark, flat-chunk retrieval answers temporal reasoning questions at 25.7% accuracy. The same corpus, same LLM, organized as a knowledge graph: 49.1%. That gap is not a rounding error — it is the failure mode flat RAG was designed to tolerate. 1

The failure flat RAG cannot escape

Standard RAG has one retrieval primitive: find the k chunks most similar to the query vector, hand them to the LLM. For single-hop factual lookups this is fast, cheap, and adequate. The architecture breaks when the answer requires connecting evidence scattered across multiple documents — what the literature calls multi-hop reasoning.

Consider a question like "Which contractors certified after 2021 have worked on Site A and also engaged with Vendor B?" A flat embedding search returns chunks about contractors, or about Site A, or about Vendor B. It has no mechanism to traverse the implied relationship chain. The LLM receives disconnected fragments and either hallucinates a coherent answer or correctly hedges to nothing useful.

This is not fixable by tuning chunk size, overlap, or embedding model. The information is in the relationships between entities, not in any individual chunk's text. Flat RAG has no representation of relationships.

What GraphRAG adds

GraphRAG (Edge et al., Microsoft Research, arXiv 2404.16130, Apr 2024) 2 restructures the retrieval index in two stages:

Stage 1 — graph construction (the expensive part). The corpus is chunked and each chunk is passed to an LLM with an entity-extraction prompt. The LLM returns named entities (Dorothy, Site A, Vendor B) and typed relationships (works-at, certified-by, engaged-with) as structured triplets. These are accumulated into a knowledge graph: nodes are entities, edges are relationships with a numeric strength score. A community-detection algorithm (Leiden) then clusters related entities into hierarchical communities, and a second LLM pass pre-generates a text summary for each community.

Stage 2 — query routing. Incoming queries are classified as local (specific entity lookup) or global (corpus-wide synthesis). Local queries traverse the graph from matched entity nodes, pulling multi-hop neighbors and their associated text. Global queries retrieve and aggregate community summaries, then synthesize across them. The result is that a temporal or relational question can now follow the graph's edge structure rather than chasing embedding similarity.

Measured tradeoffs

The February 2025 independent evaluation (arXiv 2502.11371) using Llama 3.1-70B against standard RAG with text-embedding-ada-002 provides the cleanest apples-to-apples numbers available: 1

Task	Flat RAG F1	GraphRAG Local F1	Delta
Single-hop QA (Natural Questions)	68.2%	65.4%	−2.8 pp
Multi-hop QA (HotpotQA)	63.9%	64.6%	+0.7 pp
MultiHop-RAG overall accuracy	65.8%	71.2%	+5.4 pp
Temporal sub-task	25.7%	49.1%	+23.4 pp

The pattern is unambiguous: GraphRAG loses on single-hop lookups, gains significantly on multi-hop and temporal reasoning. The temporal sub-task gap (+23.4 pp) is the clearest signal of what relationship traversal buys you. Community-GraphRAG Global mode performs worse than Local mode on factual QA (55.1% vs 65.4% F1 on NQ), because global summarization trades precision for diversity. For single-hop detail retrieval, it actively degrades.

Cargando gráfico…

Now the cost:

Embedding a 500-page corpus for vector RAG costs under $5 and takes minutes. 3 Building the GraphRAG index for the same corpus runs $50–$200 with GPT-4-class models, roughly 45 minutes on a single machine. About 58% of those tokens go to entity extraction — every chunk is passed through an LLM once for node/edge generation, then again for community summarization. 4

Microsoft's LazyGraphRAG (November 2024) 4 drops indexing cost to roughly 0.1% of full GraphRAG by deferring entity extraction to query time. Query latency goes up, but for infrequent-query corpora this trades expensive up-front indexing for acceptable per-query overhead. The repository's own README includes a standing warning: "GraphRAG indexing can be an expensive operation — start small." 5

github.com · Repositorio de GitHub

microsoft/graphrag

https://github.com/microsoft/graphrag

Cargando tarjeta de contenido…

Latency at query time also widens. A flat RAG retrieval is a single ANN lookup plus one LLM generation call. A GraphRAG Local query involves entity matching, graph traversal, text retrieval across multiple hops, and an LLM synthesis call. Global queries add community summary retrieval and a second synthesis pass. Expect 2–5x higher p95 latency for complex queries. There is no published end-to-end latency benchmark across both systems in controlled conditions as of June 2025 — claims in vendor blog posts are not reproducible.

GraphRAG v. flat RAG — key tradeoffs

As of Jun 2025, based on arXiv 2502.11371 (Llama 3.1-70B) and Microsoft cost data (Aug 2024)

Temporal QA accuracy lift

0.0+23.4%pp vs flat RAG

Single-hop F1 change

0.0−2.8%pp vs flat RAG

Full-index cost vs vector RAG

10–40×

LazyGraphRAG indexing cost

0.1% of full

Cargando tarjeta de estadísticas…

What it replaces and what it costs you

What it replaces: flat vector RAG, and specifically flat RAG's incapacity for multi-hop, relational, and corpus-global queries. It does not replace rerankers, hybrid lexical+dense retrieval, or fine-tuned embedding models — those remain orthogonal components that GraphRAG can incorporate.

Compute and $: full GraphRAG indexing at 10,000 documents with GPT-4o is a four-figure cost, paid once per index rebuild. LazyGraphRAG brings this to low two-figures but shifts cost to query time. Vector RAG equivalent is two-to-three figures for indexing, negligible per query.

Latency: meaningfully higher for complex queries. For latency-sensitive single-hop QA, flat RAG is strictly faster.

Index maintenance: re-indexing after corpus updates requires re-running entity extraction and community detection. Incremental update support exists but remains experimental in the open-source repository as of the v1.x releases.

When not to use it

GraphRAG is the wrong choice when:

Your queries are predominantly single-hop factual lookups. Flat RAG is faster, cheaper, and scores higher (NQ F1: 68.2% vs 65.4%).
Your corpus updates frequently and you cannot afford repeated full re-indexing.
Latency budgets are tight (sub-200 ms p95). GraphRAG's multi-step traversal is incompatible with real-time requirements at current infrastructure.
Your corpus is below ~100 documents. The graph structure has nothing to traverse; the overhead is pure waste.
You have no ground-truth multi-hop eval set. Without measuring the actual query distribution, you cannot tell whether the cost is justified. Do not assume your users ask multi-hop questions just because the domain sounds relational.

A note on the ICLR 2026 GraphRAG-Bench results (arXiv 2506.05690): that paper reports GraphRAG context relevance at 36.9%–54.6% vs vanilla RAG's 62.9%. The benchmark focuses on context precision for open-domain tasks, which systematically penalizes GraphRAG's community-summary approach. The finding is real but evaluates a specific retrieval quality metric, not end-to-end answer accuracy. Both results are useful; they measure different things. 4

The decision heuristic

Audit your query log. If more than 20–30% of queries require connecting entities across more than one document (multi-hop, temporal, relational), GraphRAG Community Local is worth benchmarking. If the majority are single-hop lookups against a stable corpus, flat RAG with a reranker will outperform at a fraction of the cost.

The February 2025 study shows a hybrid selection strategy — routing queries to RAG or GraphRAG based on query characteristics — improves the best single-method baseline by approximately 6.4 percentage points without requiring a new architecture. If you already run both, the routing layer may be the cheapest performance gain available. 1