RAG Harness Engineering | Three AI APIs: See, Retrieve, Answer

RAG Harness Engineering means every visitor question triggers not a single AI prompt but three independent AI API calls: vision, retrieval, and answer. Chaining multiple sub-agents normally risks each stage poisoning the next. The Harness architecture counters this by handing every stage the visitor's original question and a clear view of the initial task goal, so no stage has to reconstruct the goal from upstream output. Accumulated noise is stopped at the entrance of every hop.

A Renaissance horseman hauls hard on the reins at the edge of a cliff, the horse's front hooves skidding to a halt a moment before the fall, solid ground behind and a deep dark gorge ahead. The horse represents the LLM's natural generative force, a high-entropy gallop in itself. The reins are the harness, a set of deliberately applied engineering constraints. The decisive instant happens at the entrance of each hop: if any one Sub Agent gives way, the whole chain of reasoning plunges into the gorge of accumulated error. The rider does not suppress the horse but takes the direction back into his own hands, which is also the role of the Diving Agent, holding the Master Sitemap to decide the dive point for the whole team. The solid ground behind corresponds to the static cache segment, where the prompt, Master Sitemap, and chunks stand unmoving.

The Vision API deliberately decouples the seeing stage and does just one thing: it turns a visitor's uploaded image into a text description most models can read, then hands it to the following AI node. A multimodal model could answer questions about an image directly, yet this architecture first translates the image into a text cheat sheet. The point is to let every sub-agent dive through the sea of text in the retrieval stage holding the same cheat sheet plus the visitor's original question.

The Stage-by-Stage Decoupling of RAG Harness

In the RAG API, the Diving Agent (the dispatcher) holds the Master Sitemap and decides, based on the visitor's question, which direction to dive and which Sub Agent (the executor, in Diving-mode) to send. Each time it reaches a new depth and faces a new scene, the Sub Agent still holds the visitor's original question and judges what content to submit on that basis. Finally, once all retrieval results have been handed to the Chat API, it produces the response according to the persona and answer rules the site owner has defined.
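The hand-off described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the `Harness` class, the `Stage` callable type, and the prompt field labels are hypothetical names, and each stage is a stub standing in for a real model call. What it shows is that the visitor's original question is re-injected into every stage's prompt rather than inferred from upstream output.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage type: any callable that maps a prompt string to text.
Stage = Callable[[str], str]


@dataclass
class Harness:
    vision: Stage  # image -> text description
    rag: Stage     # retrieval over the site
    chat: Stage    # final visitor-facing answer

    def ask(self, question: str, image_ref: Optional[str] = None) -> str:
        # Stage 1 (optional): translate the image into a text cheat sheet.
        vision_context = ""
        if image_ref is not None:
            vision_context = self.vision(
                f"QUESTION: {question}\nIMAGE: {image_ref}"
            )
        # Stage 2: retrieval sees the original question plus the cheat sheet.
        rag_context = self.rag(
            f"QUESTION: {question}\nVISION: {vision_context}"
        )
        # Stage 3: the answer stage gets results only, plus the same question.
        return self.chat(
            f"QUESTION: {question}\nVISION: {vision_context}\nRAG: {rag_context}"
        )


if __name__ == "__main__":
    seen = []

    def stub(name: str) -> Stage:
        # Stand-in for a model call; records the prompt it was given.
        def run(prompt: str) -> str:
            seen.append(prompt)
            return f"<{name} output>"
        return run

    harness = Harness(stub("vision"), stub("rag"), stub("chat"))
    harness.ask("Is this jacket waterproof?", image_ref="jacket.jpg")
    # Every stage's prompt carries the original question verbatim.
    assert all("Is this jacket waterproof?" in p for p in seen)
```

Because each stage is just a callable, swapping a stub for a real client does not change the hand-off logic.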

The three AI nodes, Vision API, RAG API, and Chat API, are independent, and each can be freely configured with a different model. Cloud options include Mistral, OpenAI, Gemini, and xAI; for local compute, serving models through vLLM is recommended. Choosing a model is no longer a single global decision but a per-stage optimization: image recognition goes to a model good at vision, retrieval routing goes to a cheap small model, and the Chat API, the node that actually faces the visitor, deserves a more expressive model to finish the job.
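A per-stage configuration might look like the sketch below. Every value is a placeholder: the `StageConfig` fields, the model names, and the endpoint URLs are hypothetical and not taken from the product. The local entry assumes a vLLM server exposing its OpenAI-compatible API on port 8000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    base_url: str  # endpoint of an OpenAI-compatible API (assumption)
    model: str     # placeholder model identifier


# Hypothetical values: one entry per AI node, each free to differ.
STAGES = {
    "vision": StageConfig("https://api.example.com/v1", "vision-capable-model"),
    "rag": StageConfig("http://localhost:8000/v1", "small-routing-model"),
    "chat": StageConfig("https://api.example.com/v1", "expressive-chat-model"),
}


def config_for(stage: str) -> StageConfig:
    """Look up the config for one node; unknown stage names are an error."""
    try:
        return STAGES[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
```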

Every AI Node Is a Clean Recipe

An AI node does not pile all of its context onto the node downstream. What sub-agents hand off is only a clear diving direction and the re-injected original question. The Vision API passes on only the interpreted semantics of the image. Across the many descents of the Diving-mode loop, the RAG API stores each round's findings in an accumulation area, and Scuba Deep Dive (the acceptance check) judges whether retrieval is sufficient; only when it is does the RAG API hand the whole accumulation area to the Chat API. The downstream node receives only the upstream team's results and never touches the process.
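The loop above can be sketched roughly as follows. The function name, both callable signatures, and the `max_dives` safety cap are assumptions made for illustration; the source describes the accumulation area and the acceptance check but not their interfaces.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures, for illustration only.
# A sub-agent receives the original question and a direction, and returns
# (finding, next_direction).
SubAgent = Callable[[str, str], Tuple[str, str]]
# The acceptance check sees the question and everything found so far.
AcceptanceCheck = Callable[[str, List[str]], bool]


def diving_loop(
    question: str,
    first_direction: str,
    sub_agent: SubAgent,
    is_sufficient: AcceptanceCheck,
    max_dives: int = 5,  # safety cap; an assumption, not from the source
) -> Tuple[List[str], bool]:
    accumulation: List[str] = []  # the accumulation area
    direction = first_direction
    for _ in range(max_dives):
        # Each hop gets only the original question and a direction,
        # never the earlier hops' working context.
        finding, direction = sub_agent(question, direction)
        accumulation.append(finding)
        if is_sufficient(question, accumulation):
            return accumulation, True  # surface: ready for the Chat API
    return accumulation, False  # cap reached without passing the check
```

The boolean in the return value lets the caller decide what to do when the cap is hit before the check passes; that policy is left open here.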

Beyond context isolation, each node's LLM task is also deliberately narrowed. On every call during retrieval, the LLM does not generate free-form long text but outputs a fixed JSON object: first a single think field, in which it reads the question and works out where to go next, then two answer fields, next_mode and next_act. The think field serves two purposes at once: it is sent to the front end so the visitor sees what the AI is thinking right now, and it makes the small model converge on a direction before answering, so its later choice of mode and content is more accurate.
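A minimal sketch of enforcing that output contract on the receiving side. The three field names come from the text, and the four mode names are those listed in the pipeline flow figure; the `HopDecision` type and `parse_hop` function are hypothetical names, and rejecting malformed output with `ValueError` is one possible policy, not necessarily the product's.

```python
import json
from dataclasses import dataclass

# Mode names as listed in the pipeline flow figure.
ALLOWED_MODES = frozenset({"strict", "recommend", "deep_dive", "investigate"})
REQUIRED_KEYS = ("think", "next_mode", "next_act")


@dataclass(frozen=True)
class HopDecision:
    think: str      # shown to the visitor and used to steer the model
    next_mode: str  # one of ALLOWED_MODES
    next_act: str   # what to do in that mode


def parse_hop(raw: str) -> HopDecision:
    """Parse one hop's model output; raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_KEYS:
        if not isinstance(data.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    if data["next_mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown next_mode: {data['next_mode']!r}")
    return HopDecision(data["think"], data["next_mode"], data["next_act"])
```

Since the model only has to pick from a closed set of modes, a failed parse is cheap to detect and the hop can simply be retried.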

The whole task converges into a multiple-choice question that tests only whether the model knows what it is doing and can make a judgment, not its ability to generate long text. Together, a clean recipe and converged output let a lightweight model like Llama 3.2 3B run the entire RAG pipeline.

Prompt Cache: From the Second Query On, Cost Drops to Loose Change

Across the whole pipeline, the biggest token consumer is input, not output, at a ratio of about 10 to 1. The largest shares (the prompt, the Master Sitemap, and every page or article chunk) are all static content, classified at the architecture level as the cache segment; #ROLE, #RULES, #OUTPUT, and #SOUL.md are handled the same way. The site owner does not have to draw this boundary: static stays static and dynamic stays dynamic, and from the second query on, the largest cost item drops to loose change.
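A minimal sketch of the static/dynamic split, assuming the serving backend reuses cached work for requests that share an identical prompt prefix (an assumption about the backend; the source does not specify the mechanism). The function names and the `### SITEMAP`, `### QUERY`, and `### VISION` labels are hypothetical. The point is only the ordering: static segments first, per-query content last.

```python
import os


def static_segment(role: str, rules: str, output_spec: str, sitemap: str) -> str:
    """Cacheable part: identical text on every query for a given site."""
    return (
        f"### ROLE\n{role}\n"
        f"### RULES\n{rules}\n"
        f"### OUTPUT\n{output_spec}\n"
        f"### SITEMAP\n{sitemap}\n"  # hypothetical section label
    )


def build_prompt(static: str, question: str, vision_context: str = "") -> str:
    """Static segment first, per-query content last, so the prefix stays stable."""
    dynamic = f"### QUERY\n{question}\n"  # hypothetical section label
    if vision_context:
        dynamic += f"### VISION\n{vision_context}\n"
    return static + dynamic


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading text of two prompts."""
    return len(os.path.commonprefix([a, b]))
```

Two prompts built from the same static segment but different questions share at least the whole static segment as a prefix, which is the part a prefix cache can reuse.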

Figure: Main pipeline flow (decoupled · DI).
Visitor Query: text + optional image, + /vision_context (optional).
Vision API context recipe: #ROLE (cache), #TASK (cache), /Master Sitemap (cache), /Image file, raw user query.
Diving Agent: read Master Sitemap, analyze query, select Diving-mode.
RAG API context recipe: #ROLE (cache), #RULES (cache), #OUTPUT (cache), /Branch Sitemap (cache), raw user query, /vision_context (optional). Prompt template: ### ROLE {{ agent_role }} ### RULES {{ agent_rules }} ### OUTPUT { "think": "...", "next_mode": "...", "next_act": "..." }
Design notes: LLMs are stateless; clean context per hop; no full-site read; progressive disclosure; edge SLMs workable.
Diving-mode Loop (follow the descent line; the agent picks a mode per hop): strict (fetch chunk), recommend (list pages), deep_dive (drill down), investigate (rethink query).
Scuba Deep Dive: AI self-scores completeness, then either another dive or surface.
Chat API: assemble /rag_context, generate final response. Context recipe: #SOUL.md (cache), #RULES (cache), /rag_context (appended from the Diving-mode Loop and Scuba results), raw user query, /vision_context (optional). Output: SSE stream.
Chat Window: browser localStorage only, no upload.

The Minimization Discipline of RAG Harness

A clean recipe at every hop, every task narrowed to the minimum, a clear boundary between static and dynamic: not one of these designs is decorative. Each AI stage is built as an MVP (Minimum Viable Product), achieving basic feasibility at the lowest cost. Each node does just enough: the recipe just clean enough, the model just small enough, the output just converged enough. The current practice configures several lightweight models across stages; in the future, if small multimodal models mature, running the whole chain on a single model would hold up just as well.

The discipline of RAG Harness lies not in how the models are configured but in how finely each hop's task is constrained. The whole architecture keeps leaning toward the minimum because where this road finally leads is moving compute off the site owner's bill and into the user's device. You play games on your own graphics card and watch video on your own screen; asking an AI will, in the end, also return to inference inside the visitor's own device.
