In late 2022, retrieval-augmented generation felt like magic. You had a large language model that hallucinated freely, and suddenly you could anchor it to a corpus of real documents. Chunk the text, embed the chunks, store the vectors, retrieve the top-k nearest neighbors at query time, stuff them into the prompt. Done. It worked well enough that enterprises started shipping it to production, serving millions of tokens a day.

That era is closing. Not because RAG was a bad idea—it was a genuinely good one—but because the problems that remain after deploying vanilla RAG are exactly the problems it was structurally incapable of solving. The chunk size dilemma is real. Context loss is real. The absence of any reasoning layer is a fundamental architectural gap that no amount of prompt engineering fully closes.

What replaces it is not a single technique. It is a paradigm: agentic retrieval. Agents that plan their retrieval strategy before executing it, call multiple tools in sequence, perform multi-hop reasoning across heterogeneous data sources, and adapt based on what they find. This article breaks down exactly why RAG hit its ceiling, what agentic retrieval looks like in practice, and how we built our production implementation—SynthGraph—at Ryshe.

Why Vanilla RAG Has Hit Its Ceiling

The Chunk Size Dilemma

Every RAG practitioner has lived this Goldilocks problem. Chunks too small and you lose the context that makes a passage meaningful—a sentence referencing "the above methodology" becomes orphaned, the pronoun resolution fails, the retrieved text is gibberish to the model without its neighbors. Chunks too large and your embedding loses specificity; the vector for a 2,000-token chunk about "machine learning for financial risk modeling" becomes semantically diluted to the point where it retrieves poorly against precise user queries.

The common compromise—512 to 1,024 tokens with overlapping windows—is not a solution. It is a negotiation with a broken constraint. Overlap helps, but it multiplies your index size and introduces retrieval redundancy. Hierarchical chunking (small chunks for retrieval, large chunks for context) helps more, but it layers complexity onto what was supposed to be a simple pipeline. At no point does any of this actually solve the problem: the chunk boundary is artificial, and real knowledge does not respect it.
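The hierarchical (small-to-large) pattern can be sketched in a few lines: index small chunks for precise matching, but hand the model the enclosing section as context. Everything below is illustrative; in particular, the word-overlap scorer stands in for real embedding similarity, and `Chunk` is a made-up type.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str
    parent_id: str  # the enclosing section this small chunk came from

def split_hierarchical(sections: dict[str, str], child_size: int = 40) -> list[Chunk]:
    """Split each section into small retrieval chunks that remember their parent."""
    chunks = []
    for sec_id, text in sections.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            chunks.append(Chunk(f"{sec_id}:{i}", " ".join(words[i:i + child_size]), sec_id))
    return chunks

def retrieve_with_parent(query: str, chunks: list[Chunk], sections: dict[str, str]) -> str:
    # Stand-in scorer: word overlap. A real system would score embedding similarity.
    def score(c: Chunk) -> int:
        return len(set(query.lower().split()) & set(c.text.lower().split()))
    best = max(chunks, key=score)
    # Match on the small chunk, but return the parent section as generation context.
    return sections[best.parent_id]
```

Retrieval stays precise because the match happens over small spans; generation stays coherent because the model sees the surrounding section. The boundary problem does not go away, but its cost drops.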

Lost Context and Broken Reasoning Chains

Consider a user asking: "What were the downstream effects of the Fed's rate decision in Q3 2024 on our mortgage portfolio's duration risk, given our hedging strategy at the time?"

A vanilla RAG pipeline will retrieve the top-k chunks most semantically similar to that query. It will probably pull some chunks about the Fed rate decision, some chunks about the mortgage portfolio, maybe something about duration risk. What it will almost certainly not do is correctly reconstruct the causal chain across three temporally separated documents from different data sources, with the appropriate hedging context layered on top. The relevant information exists in the corpus. The pipeline cannot reason about how to connect it.

This is the lost context problem in its most damaging form: not that information is missing, but that the architecture has no mechanism to understand that multiple retrievals need to be composed into a coherent reasoning chain. Top-k retrieval is a single pass. It has no iteration, no reflection, no ability to ask "I found X, but I now need Y to complete my answer."

No Reasoning Layer

Vanilla RAG pipelines are, at their core, a retrieval step followed by a generation step. There is no layer in between that reasons about whether the retrieved context is sufficient, contradictory, temporally inconsistent, or missing a critical dependency. The model sees whatever the retriever surfaces and generates accordingly. When the retriever is wrong—and it will be wrong—the model has no mechanism to detect or correct this. It hallucinates confidently with retrieved scaffolding, which is in some ways more dangerous than hallucinating without it.

The failure mode is not that RAG retrieves bad documents. It is that RAG has no way to know it retrieved bad documents, and the model will use them anyway.

What Agentic Retrieval Actually Is

Agentic retrieval is what you get when you take the retrieval problem seriously as a reasoning problem rather than an indexing problem. Instead of a static pipeline—embed, store, retrieve, generate—you deploy an agent whose job is to figure out how to retrieve before it retrieves anything.

Planning Before Retrieval

An agentic retriever begins with a query decomposition step. Given a complex user question, the agent first identifies the atomic sub-questions that must be answered to address the whole. It determines which sub-questions are independent (parallelizable) and which are dependent (sequential). It selects the appropriate data sources and retrieval strategies for each. Only then does it execute retrieval.

This is not a fancy prompt. This is a structured reasoning loop with explicit intermediate state. The agent maintains a working memory of what it has retrieved, what it still needs, and what contradictions it has encountered. It can issue follow-up retrievals based on what it finds—a capability that is simply impossible in a single-pass pipeline.

Multi-Hop Reasoning

Multi-hop reasoning is the ability to chain retrievals where each retrieval is conditioned on the output of the previous one. The classic example from the research literature is: "Who was the CEO of the company that acquired DeepMind?" To answer this, you first retrieve "what company acquired DeepMind" (Google), then retrieve "who is the CEO of Google" (Sundar Pichai). Neither chunk alone answers the question. The chain is what matters.
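The two-hop example collapses to a loop in which each retrieval is conditioned on the previous hop's answer. A toy sketch, with an in-memory dict standing in for a real retriever and contents that are purely illustrative:

```python
# Toy fact store standing in for a real retriever; keys are illustrative.
FACTS = {
    "company that acquired DeepMind": "Google",
    "CEO of Google": "Sundar Pichai",
}

def retrieve(question: str) -> str:
    return FACTS[question]

def multi_hop(hops: list[str]) -> str:
    """Run hops in order; '{prev}' in a hop template is filled with the
    previous hop's answer, which is what makes the chain a chain."""
    answer = ""
    for template in hops:
        answer = retrieve(template.format(prev=answer))
    return answer
```

`multi_hop(["company that acquired DeepMind", "CEO of {prev}"])` returns "Sundar Pichai"; neither stored fact alone answers the composite question.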

In production enterprise settings, multi-hop chains routinely go five or six hops deep. A question about a client's regulatory exposure involves a chain through: the client's current holdings, the applicable regulatory framework version in effect at the relevant date, the specific rule language, internal legal interpretations, and prior compliance decisions. A single retrieval pass cannot traverse this chain. An agent can.

Tool Use and Heterogeneous Sources

Real enterprise knowledge is not a single vector index. It is a collection of SQL databases, document stores, APIs, graph databases, spreadsheets, Confluence wikis, and Slack archives. Agentic retrieval treats each of these as a tool. The agent decides at runtime which tool to invoke for each sub-question, formats the appropriate query for that tool, receives structured results, and incorporates them into its reasoning.

This means the agent might answer a complex question by: querying a vector index for relevant policy documents, calling a SQL tool to pull specific financial figures, querying a knowledge graph for entity relationships, and then synthesizing all three into a coherent response. Vanilla RAG cannot do any of this. It knows one tool and one tool only.

Knowledge Graphs + Vector Search: The Hybrid Architecture

The single most impactful architectural addition to an agentic retrieval system is the knowledge graph. This deserves its own section because it is frequently misunderstood.

Vector search is exceptionally good at semantic similarity retrieval. Given a query, it finds passages that mean something similar. What it cannot do is represent or traverse relationships between entities. It has no concept of "caused by," "reported to," "preceded by," or "contradicts." These are structural relationships, and they live in a graph, not a vector space.

A knowledge graph represents your domain as nodes (entities: people, companies, regulations, events, products) and edges (typed relationships between them). When a user asks "What regulations apply to our APAC derivatives desk given its current counterparty exposure?", a vector search retrieves documents about regulations and documents about derivatives. A knowledge graph traversal finds the specific regulatory nodes connected to the specific entity types present in the counterparty graph, filtered by jurisdiction, and surfaces exactly the applicable rules.

Graph-Vector Fusion

The hybrid architecture fuses both. The agent first traverses the knowledge graph to identify the relevant entity subgraph—the set of nodes and relationships pertinent to the query. It then uses those entity identifiers to constrain vector search, retrieving only documents that reference the identified entities. The result is a retrieval set that is both semantically relevant and structurally appropriate. Precision goes up dramatically. Noise goes down.
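A rough sketch of that fusion step, with a plain adjacency dict standing in for the knowledge graph and pre-tagged documents standing in for a vector index's candidates. The names and schema here are invented for illustration:

```python
def entity_subgraph(graph: dict[str, list[str]], seeds: list[str], depth: int = 2) -> set[str]:
    """Graph step: collect entities reachable from the query's seed entities."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(depth):
        frontier = {n for e in frontier for n in graph.get(e, []) if n not in seen}
        seen |= frontier
    return seen

def fused_retrieve(candidates: list[dict], entities: set[str]) -> list[dict]:
    """Fusion step: keep only vector-search candidates that mention an
    entity from the subgraph, discarding semantically-similar-but-wrong hits."""
    return [doc for doc in candidates if entities & set(doc["entities"])]
```

The ordering is the point: the graph narrows *which* entities matter before the vector index decides *which passages* about them are relevant.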

Graph traversal also enables something vector search cannot: reasoning about what is missing. If the graph contains a node for a company with no connected financial disclosure documents, the agent can flag this gap explicitly rather than silently returning nothing. This is a qualitatively different failure mode—one that is useful to the user.

The Ryshe Architecture: SynthGraph

At Ryshe, we spent most of 2025 building and iterating on what we now call SynthGraph—our production agentic retrieval architecture. Here is how it is structured.

Layer 1: The Query Planner

The Query Planner receives the raw user query and produces a structured retrieval plan. The plan is a directed acyclic graph (DAG) where each node is a retrieval sub-task, edges encode dependencies, and each node is annotated with the tool(s) to use and the expected output format. The planner uses a fine-tuned planning model—not the expensive frontier model used for final synthesis—specifically trained on retrieval planning tasks across our client domains.

The planner distinguishes between three sub-task types: anchor retrievals (ground truth lookups that must succeed before dependent tasks run), exploratory retrievals (best-effort semantic search to expand context), and validation retrievals (cross-checks that verify claims made by earlier retrieval steps). This taxonomy alone eliminates a large class of hallucinations because anchor retrievals either succeed or the whole plan fails gracefully with an explicit error.
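The taxonomy, including the fail-fast behavior for anchors, might be encoded along these lines. The types and method names are hypothetical, not SynthGraph's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    ANCHOR = "anchor"            # must succeed, or the whole plan aborts
    EXPLORATORY = "exploratory"  # best-effort context expansion
    VALIDATION = "validation"    # cross-checks claims from earlier steps

@dataclass
class SubTask:
    id: str
    task_type: TaskType
    result: object = None  # filled in by the executor

class AnchorFailure(Exception):
    """Raised so the plan fails gracefully with an explicit error."""

def check_anchors(tasks: list[SubTask]) -> None:
    missing = [t.id for t in tasks if t.task_type is TaskType.ANCHOR and t.result is None]
    if missing:
        raise AnchorFailure(f"unresolved anchors: {missing}")
```

An unresolved exploratory task degrades the answer; an unresolved anchor invalidates it, which is why the two deserve different failure semantics.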

Layer 2: The Retrieval Executor

The Retrieval Executor runs the DAG produced by the planner. It manages concurrency for independent branches, handles tool invocation, and maintains the agent's working memory—a structured context object that accumulates retrieved evidence, tracks confidence scores, and records any contradictions found between sources.

The executor has access to four tool categories: VectorTool (semantic search over our client's embedded document corpus), GraphTool (Cypher queries against the knowledge graph), StructuredTool (SQL against relational data sources), and MetaTool (queries about the knowledge base itself—what documents exist, when they were last updated, what entity types are represented). MetaTool is underrated. The ability to ask "do I actually have data on this?" before committing to a retrieval chain prevents enormous amounts of downstream confusion.
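A MetaTool can start as nothing more than a coverage-and-freshness check over a document catalog, consulted before committing to a retrieval chain. A hypothetical sketch; the catalog schema is invented for illustration:

```python
from datetime import datetime, timedelta

class MetaTool:
    """Answers questions about the knowledge base itself, not its contents."""

    def __init__(self, catalog: list[dict]):
        # Illustrative catalog entries: {"id", "entity_types", "updated"}
        self.catalog = catalog

    def covers(self, entity_type: str) -> bool:
        """Do we have any data on this entity type at all?"""
        return any(entity_type in doc["entity_types"] for doc in self.catalog)

    def fresh(self, entity_type: str, max_age_days: int = 90) -> bool:
        """Is at least one covering document recent enough to trust?"""
        cutoff = datetime.now() - timedelta(days=max_age_days)
        return any(
            entity_type in doc["entity_types"] and doc["updated"] >= cutoff
            for doc in self.catalog
        )
```

A planner that calls `covers()` first can decline a retrieval chain outright instead of executing it and misreading an empty result as "no such facts exist."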

Layer 3: The Critic

Before synthesis, every completed retrieval plan passes through the Critic. The Critic is a separate model call that reviews the working memory for: completeness (were all required sub-tasks answered?), consistency (do any retrieved facts directly contradict each other?), temporal validity (are all retrieved documents within the relevant time window?), and coverage (are there obvious gaps that should trigger a follow-up retrieval?).

When the Critic identifies issues, it either triggers a targeted follow-up retrieval or flags the issue explicitly for inclusion in the final response. This means SynthGraph can surface "I found conflicting information on this point from sources A and B" rather than silently picking one and hallucinating confidence.

Layer 4: The Synthesizer

Only after the Critic approves does the Synthesizer run. It receives the fully assembled, validated working memory and generates the final response. Because the evidence is pre-assembled and pre-validated, the Synthesizer can focus entirely on language quality—coherence, tone, appropriate hedging, citation formatting. The hard retrieval work is already done.

Benchmark Results: 40% Is a Floor

We benchmark SynthGraph against a well-tuned vanilla RAG baseline on a suite of tasks drawn from real enterprise use cases across three client verticals: financial services, legal, and life sciences. The vanilla RAG baseline uses hierarchical chunking, hybrid BM25 + dense retrieval with reciprocal rank fusion, and a state-of-the-art reranker. It is not a straw man. It is the best version of vanilla RAG that experienced engineers can build.

Across 1,200 evaluation queries spanning the three verticals, we score both systems on multi-hop accuracy and end-to-end answer correctness. SynthGraph improves on the tuned RAG baseline on every metric we track.

The multi-hop accuracy improvement is the most striking. It tells you where vanilla RAG most completely breaks down. For single-hop factual lookups, the gap between RAG and SynthGraph is real but narrower. For questions that require assembling an answer from multiple, dependent retrievals—which constitute the majority of high-value enterprise queries—vanilla RAG degrades severely. SynthGraph does not, because it was designed for exactly this task.

The 40% headline figure refers to Answer Correctness across the full eval set, which is the metric that most directly tracks user-perceived quality. We stand behind it, and we run it continuously as we ship new model and graph versions. It has never dipped below 38%.

Implementation Guide

You do not need to replicate SynthGraph in full to capture most of the gains. The single highest-impact intervention is introducing a planning step. Here is what a minimal agentic retrieval implementation looks like in pseudo-code:

# --- Query Planner ---
def plan_retrieval(query: str, available_tools: list[Tool]) -> RetrievalPlan:
    planner_prompt = f"""
    You are a retrieval planner. Given the user query and available tools,
    decompose the query into atomic sub-questions. For each sub-question:
    - Identify the appropriate tool
    - Identify dependencies on other sub-questions
    - Classify as: anchor | exploratory | validation

    Query: {query}
    Tools: {[t.name for t in available_tools]}

    Return a JSON retrieval plan.
    """
    return parse_plan(planner_llm.complete(planner_prompt))


# --- Retrieval Executor ---
import asyncio  # independent DAG branches run concurrently

async def execute_plan(plan: RetrievalPlan, tools: dict[str, Tool]) -> WorkingMemory:
    memory = WorkingMemory()
    dag = plan.to_dag()

    for batch in dag.topological_batches():  # parallel where possible
        results = await asyncio.gather(*[
            tools[task.tool].invoke(task.query, task.params)  # invoke is async
            for task in batch
        ])
        for task, result in zip(batch, results):
            memory.store(task.id, result, task.task_type)

    return memory


# --- Critic ---
def critique(memory: WorkingMemory) -> CriticReport:
    issues = []
    if not memory.all_anchors_resolved():
        issues.append(CriticalGap(memory.missing_anchors()))
    for pair in memory.find_contradictions(threshold=0.7):
        issues.append(Contradiction(pair))
    if memory.has_stale_documents(cutoff_days=90):
        issues.append(TemporalWarning(memory.stale_docs()))
    return CriticReport(issues=issues, approved=len(issues) == 0)


# --- Main Agent Loop ---
def agentic_retrieve(query: str) -> Response:
    plan = plan_retrieval(query, AVAILABLE_TOOLS)
    memory = asyncio.run(execute_plan(plan, TOOL_REGISTRY))
    report = critique(memory)

    if not report.approved:
        # Attempt targeted follow-up retrievals for critical gaps
        for gap in report.critical_gaps():
            follow_up = plan_retrieval(gap.description, AVAILABLE_TOOLS)
            memory.merge(asyncio.run(execute_plan(follow_up, TOOL_REGISTRY)))
        report = critique(memory)  # re-evaluate before synthesis

    return synthesizer_llm.complete(
        prompt=build_synthesis_prompt(query, memory, report)
    )

The sections that follow collect a few implementation notes from building this in production.

The Knowledge Graph Build: Practical Notes

The knowledge graph is often the part that intimidates teams. In practice, you do not need a fully hand-crafted ontology. We bootstrap our graphs using a combination of NER extraction over the existing document corpus, relation extraction using a small instruction-tuned model, and entity resolution to merge duplicates. The initial graph is noisy. It gets cleaned iteratively as we observe retrieval failures.

For most enterprise use cases, the entity types you need are limited: Person, Organization, Document, Regulation, Product, Event, Date. The edge types that matter most are temporal (PRECEDED_BY, SUPERSEDES), organizational (REPORTS_TO, SUBSIDIARY_OF), and documentary (REFERENCES, AMENDS, CONTRADICTS). Start there. You can get 80% of the graph value from these types alone.

We run our graphs in Neo4j for production, with a custom GraphTool layer that translates agent queries into parameterized Cypher. The parameterization is important—it prevents prompt injection via graph traversal, which is a real attack surface if your knowledge graph is built from untrusted input.
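The discipline looks roughly like this: user-supplied values go only into the parameter dict, never into the query text. One wrinkle worth knowing is that Cypher does not accept parameters inside a variable-length pattern bound, so the hop limit is validated as an integer before interpolation. The builder below is an illustrative sketch, not our actual GraphTool:

```python
def build_entity_query(entity_name: str, max_hops: int = 2) -> tuple[str, dict]:
    """Return a Cypher string using $parameters plus the params dict to bind.

    The entity name never enters the query text itself, so crafted input
    (from documents or users) cannot alter the traversal structure.
    """
    if not (isinstance(max_hops, int) and 1 <= max_hops <= 4):
        raise ValueError("max_hops out of range")  # bound traversal depth
    cypher = (
        f"MATCH (e:Entity {{name: $name}})-[*1..{max_hops}]-(n) "
        "RETURN DISTINCT n LIMIT 100"
    )
    return cypher, {"name": entity_name}
```

The returned pair would then be passed to the driver's parameterized execution (e.g. `session.run(cypher, params)` with the official Neo4j driver), never string-concatenated.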

Future Directions

Agentic retrieval is itself not the final state. The trajectory is clear: retrieval systems will increasingly blur with reasoning systems. A few directions we are actively pursuing at Ryshe and watching closely in the research community:

Self-Improving Retrieval Plans

Every retrieval failure is a training signal. If the critic consistently flags gaps in a particular type of query, the planner should learn to anticipate those gaps and proactively include additional sub-tasks. We are building feedback loops where critic reports are used to fine-tune the planner on a rolling basis. Early results suggest this reduces critic escalations by roughly 20% per training cycle.

Retrieval Over Latent Space

The current architecture separates retrieval from reasoning: retrieve, then reason. Emerging research on intermediate token activation retrieval—where the model retrieves not documents but activation states from prior computations—suggests a future where this boundary dissolves. We are watching this space, though it remains largely research-stage for production enterprise applications.

Collaborative Multi-Agent Retrieval

Complex enterprise queries sometimes require domain-specific expertise that a single generalist agent cannot provide. We are experimenting with retrieval agent specialization—a coordinator agent routes sub-questions to specialist agents (legal, financial, technical) that have domain-specific tools, graphs, and fine-tuned planners. Early results are promising, though orchestration complexity increases substantially.

Uncertainty Quantification

The next major gap in production agentic retrieval systems is calibrated uncertainty. Current systems can say "I found conflicting information," but they cannot say "I am 80% confident in this answer based on the quality and coverage of available evidence." Proper uncertainty quantification—propagated through the retrieval chain and surfaced to the user—is the difference between a system you can act on and a system you have to second-guess. This is hard and mostly unsolved, but it is the right problem.
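A naive but honest starting point is to propagate confidence through the chain with elementary probability, under the strong (and usually false) assumption that hops and sources are independent. A sketch, not a calibrated method:

```python
import math

def chain_confidence(hop_confidences: list[float]) -> float:
    """Confidence in a multi-hop answer where every hop must be correct,
    treating hops as independent -- a strong simplifying assumption."""
    return math.prod(hop_confidences)

def evidence_confidence(source_confidences: list[float]) -> float:
    """Noisy-OR over corroborating sources: the answer holds if any one
    source is right, again assuming independence between sources."""
    p_all_wrong = math.prod(1 - c for c in source_confidences)
    return 1 - p_all_wrong
```

Even this crude model makes one thing concrete: confidence in a chain decays multiplicatively with depth, which is exactly why deep multi-hop answers need corroborating sources rather than a single evidence path.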

Conclusion

Vanilla RAG was the right answer for 2022. It got language models grounded, reduced hallucination rates dramatically, and enabled a generation of enterprise AI applications that would not otherwise have shipped. That is a real achievement and should not be minimized.

But the bar has moved. Enterprises are now asking their AI systems genuinely hard questions—questions that require assembling answers from multiple sources, traversing entity relationships, checking temporal consistency, and reasoning about what information is missing. Vanilla RAG cannot do these things. Not with better prompts, not with better embeddings, not with better rerankers. It cannot do them because it was not designed to.

Agentic retrieval was. The planning layer, the multi-hop executor, the knowledge graph, the critic—these are not complexity for its own sake. They are the minimum architecture required to answer questions that actually matter in production.

RAG is dead. Long live agentic retrieval.