RAG Harness

RAG Harness Engineering | Three AI APIs: See, Retrieve, Answer

RAG Harness Engineering means every visitor question triggers not a single AI prompt but three independent AI API calls: vision, retrieval, and answer. Chaining multiple sub-agents normally risks each stage poisoning the next. The Harness architecture counters this by handing every stage the visitor's original question and a clear view of the initial task goal, so accumulated noise is stopped at the entrance of every hop instead of being passed along.

The Vision API deliberately decouples the seeing stage and does just one thing: it turns a visitor's uploaded image into a text description most models can read, then hands it to the following AI node. A multimodal model could answer questions about an image directly, yet this architecture first translates the image into a text cheat sheet. The point is to let every sub-agent dive through the sea of text in the retrieval stage holding the same cheat sheet plus the visitor's original question.

The Stage-by-Stage Decoupling of RAG Harness

In the RAG API, the Diving Agent (the dispatcher) holds the Master Sitemap and decides, based on the visitor's question, which direction to dive and which Sub Agent (the executor, in Diving-mode) to send. Each time it reaches a new depth and faces a new scene, the Sub Agent always holds the visitor's original question and judges what content to submit on that basis. Finally, once all retrieval results return to the Chat API, the response is produced through the persona and answer rules the site owner has defined.
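A minimal sketch of this hand-off, assuming each stage is an ordinary function. The names run_pipeline, see, retrieve, and answer are hypothetical stand-ins for the Vision API, RAG API, and Chat API calls, and the stub stages replace real model calls. What it tries to show is that the visitor's original question reaches every hop verbatim instead of being paraphrased by the stage before it.

```python
from typing import Callable, Optional


def run_pipeline(
    question: str,
    image: Optional[bytes],
    see: Callable[[bytes], str],
    retrieve: Callable[[str, str], list[str]],
    answer: Callable[[str, list[str]], str],
) -> str:
    """Chain the three calls; every hop gets the original question verbatim."""
    # Vision: image -> text description (skipped when no image was uploaded).
    image_text = see(image) if image is not None else ""
    # Retrieval: original question plus the image description, nothing else.
    findings = retrieve(question, image_text)
    # Answer: original question plus retrieval results, not the process.
    return answer(question, findings)


# Stub stages standing in for real model calls (hypothetical).
def fake_see(image: bytes) -> str:
    return f"an image of {len(image)} bytes"


def fake_retrieve(question: str, image_text: str) -> list[str]:
    return [f"page matching '{question}'", image_text]


def fake_answer(question: str, findings: list[str]) -> str:
    return f"Q: {question} | sources: {len(findings)}"


print(run_pipeline("What are your opening hours?", b"\x00\x01",
                   fake_see, fake_retrieve, fake_answer))
```

Because the stages are passed in as plain callables, each one can be swapped or tested on its own without touching the other two.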

The three AI nodes, Vision API, RAG API, and Chat API, are independent, and each can be freely configured with a different model. Cloud options include Mistral, OpenAI, Gemini, and xAI; for local compute, vLLM is recommended for serving. Choosing a model is no longer a single global decision but a per-stage optimization: image recognition goes to a model good at vision, retrieval routing goes to a cheap small model, and the Chat API, the node that actually faces the visitor, deserves a more expressive model to finish the job.
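Per-stage model choice can be pictured as a small lookup table. This is a sketch only: NodeConfig, PIPELINE, and config_for are hypothetical names, the provider assignments are arbitrary, and the model strings are deliberate placeholders, not real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    provider: str  # e.g. a cloud provider or a local vLLM server
    model: str     # placeholder; fill in a real model name


# One entry per AI node; each can point at a different provider and model.
PIPELINE = {
    "vision": NodeConfig(provider="gemini", model="<vision-capable-model>"),
    "rag": NodeConfig(provider="vllm", model="<small-cheap-model>"),
    "chat": NodeConfig(provider="openai", model="<expressive-model>"),
}


def config_for(stage: str) -> NodeConfig:
    """Return the configuration of one stage, or fail loudly on a typo."""
    try:
        return PIPELINE[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None


print(config_for("rag"))
```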

Every AI Node Is a Clean Recipe

An AI node does not pile all of its context onto the next node. What sub-agents hand off is only a clear diving direction and the re-injected original question. The Vision API passes on only the interpreted semantics of the image. Across the many descents of the Diving-mode loop, the RAG API stores each round's findings in an accumulation area, and Scuba Deep Dive (the acceptance check) judges whether retrieval is sufficient; only when it is does it hand the whole accumulation area to the Chat API. The downstream node receives only the upstream team's results and never touches the process that produced them.
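The accumulate-then-check loop can be sketched as follows. All names here (dive_until_sufficient, dive_once, is_sufficient) are hypothetical, and max_rounds is an assumption added as a safety cap; the article does not say what happens when retrieval never becomes sufficient, so this sketch simply reports that case to the caller.

```python
from typing import Callable


def dive_until_sufficient(
    question: str,
    dive_once: Callable[[str, int], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    max_rounds: int = 5,  # assumption: safety cap, not stated in the article
) -> tuple[bool, list[str]]:
    """Collect findings round by round until the acceptance check passes."""
    accumulation: list[str] = []
    for depth in range(max_rounds):
        # Each descent sees the original question, not earlier reasoning.
        accumulation.extend(dive_once(question, depth))
        # The acceptance check judges the whole accumulation area.
        if is_sufficient(question, accumulation):
            return True, accumulation
    return False, accumulation


# Stub descent and stub acceptance check (hypothetical).
def fake_dive(question: str, depth: int) -> list[str]:
    return [f"depth {depth}: note about '{question}'"]


def fake_check(question: str, findings: list[str]) -> bool:
    return len(findings) >= 3


ok, handoff = dive_until_sufficient("refund policy", fake_dive, fake_check)
print(ok, len(handoff))
```

Only the returned list would travel on to the answer stage; the loop counter and the intermediate checks stay inside the retrieval stage.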

Beyond context isolation, each node's LLM task is also deliberately narrowed. On every call during retrieval, the LLM does not freely generate long text but outputs fixed JSON: first a single think field, in which it reads the question and works out where to go now, then two answer fields, next_mode and next_act. The think field serves two purposes at once: it is sent to the front end so the visitor sees what the AI is thinking right now, and it makes the small model converge on a direction before answering, so its later choice of mode and content is more accurate.

The whole task converges into a multiple-choice question that tests only whether the model knows what it is doing and can make a judgment, not its ability to generate long text. Together, a clean recipe and converged output let a lightweight model like Llama 3.2 3B run the entire RAG pipeline.
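A minimal sketch of validating that fixed JSON shape. The three field names think, next_mode, and next_act come from the text above; the set of allowed next_mode values ("dive", "submit") and the example next_act value are hypothetical, since the article does not list them.

```python
import json

REQUIRED_FIELDS = ("think", "next_mode", "next_act")
ALLOWED_MODES = {"dive", "submit"}  # hypothetical values, for illustration


def parse_step(raw: str) -> dict:
    """Parse one retrieval-step reply and reject anything off-format."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["next_mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown next_mode: {data['next_mode']!r}")
    # Keep only the three expected fields; extra keys are dropped.
    return {name: data[name] for name in REQUIRED_FIELDS}


reply = '{"think": "Pricing lives under Products.", "next_mode": "dive", "next_act": "products"}'
step = parse_step(reply)
print(step["think"])      # the part that could be shown to the visitor
print(step["next_mode"], step["next_act"])
```

Rejecting off-format replies at this point keeps a small model's occasional stray text from leaking into the next hop.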

Prompt Cache: From the Second Query On, Cost Drops to Loose Change

Across the whole pipeline, the biggest token consumer is input, not output, at a ratio of about 10 to 1. The largest shares (the prompt, the Master Sitemap, and every page or article chunk) are all static content, classified at the architecture level as the cache segment; #ROLE, #RULES, #OUTPUT, and #SOUL.md are handled the same way. The site owner does not have to draw this boundary: static stays static and dynamic stays dynamic. Because the static segment is identical on every request, it can be served from the prompt cache, and from the second query on the largest cost item drops to loose change.
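The static/dynamic split can be sketched as prompt assembly with the static segments first and the visitor's question last, so that two requests share the longest possible identical prefix. The function names, the segment contents, and the numbers in input_cost (price per token and cached_rate, the fraction of the normal price charged for cached tokens) are all hypothetical; real providers define their own cache rules and prices.

```python
def build_prompt(static_segments: list[str], question: str) -> str:
    """Static content first (the cacheable prefix), the question last."""
    return "\n\n".join(static_segments + [f"QUESTION: {question}"])


def shared_prefix_length(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def input_cost(static_tokens: int, dynamic_tokens: int,
               price_per_token: float, cached_rate: float,
               cache_hit: bool) -> float:
    """Input cost of one request; cached static tokens are discounted."""
    static_price = price_per_token * (cached_rate if cache_hit else 1.0)
    return static_tokens * static_price + dynamic_tokens * price_per_token


static = ["#ROLE (placeholder)", "#RULES (placeholder)", "MASTER SITEMAP (placeholder)"]
first = build_prompt(static, "Where is the pricing page?")
second = build_prompt(static, "Do you ship abroad?")
print(shared_prefix_length(first, second) >= len("\n\n".join(static)))

# Hypothetical numbers: 9000 static tokens, 1000 dynamic, 10% cached rate.
print(input_cost(9000, 1000, 1.0, 0.1, cache_hit=False))
print(input_cost(9000, 1000, 1.0, 0.1, cache_hit=True))
```

Putting anything dynamic ahead of the static segments would shorten the shared prefix, which is why the question goes last in this sketch.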

The Minimization Discipline of RAG Harness

A clean recipe at every hop, every task narrowed to the minimum, a clear boundary between static and dynamic: not one of these designs is decorative. Each AI stage is built as an MVP (Minimum Viable Product), reaching basic feasibility at the lowest cost. Each node does just enough: the recipe just clean enough, the model just small enough, the output just converged enough. Current practice configures several lightweight models across stages; in the future, if small multimodal models mature, running the whole chain on a single model holds up just as well.

The discipline of RAG Harness lies not in how the models are configured but in how finely each hop's task is constrained. The whole architecture keeps leaning toward the minimum because where this road finally leads is moving compute off the site owner's bill and into the user's device. You play games on your own graphics card and watch video on your own screen; asking an AI will, in the end, also return to inference inside the visitor's own device.

Related articles

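A tiny humanoid robot in a plain Greek robe carries a small lamp and walks confidently into the depths of a vast, majestic stone colonnade whose columns stretch endlessly into the distance. The tiny robot is a garbage-tier small model like Llama 3B, and the small lamp in its hand lights only the ground at its feet: that is its limited world knowledge. Yet its stride is confident, because what actually leads the way is the order of the surrounding columns, not the lamp in its hand. The rows of columns extend into the depths, corresponding to the progressive disclosure of master → category → post. A small model does not matter; when the order is clear enough, every hop converges into a multiple-choice question. The protagonist of this painting is not the robot but the colonnade itself: raw capability is not the key, correct structure is.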

A Small Language Model (SLM) Actually Understood an Entire Website

The small model Llama 3.2 3B is a language model with only 3B parameters, small by today's standards. Ask it a question and it can only answer from what those 3B parameters absorbed during training. It does not know what your site says, does not know which articles you published, and knows nothing about the content you have built up recently. Using it to run website Q&A should have been a fantasy.

A small figure holds only a single flat paper map, trapped inside a huge multi-level labyrinth of stone arches and staircases, endless stairs extending up, down, and in every direction. The flat paper map is the essence of llms.txt: a catalog with only titles and no body, meant for flat reading. The three-dimensional maze is the knowledge structure of a real site, with floors, circulation routes, and interconnected depth. The figure is tiny, and the mismatch in scale between tool and environment is exactly the situation of an AI holding a 2D list against 3D content. Light enters from one side, but without the right map, light cannot replace structure. What RAG Sitemap sets out to solve is making this flat plan three-dimensional, walking down the three layers of master, category, and post.

The Good-Faith Limits of llms.txt

llms.txt is a sitemap designed for AI to read, but its limit is that it has only one layer, which is not enough for an organized, structured site. Its standard format is the site name as an H1, a short summary, and below that a list in which each line is a title: description pointing to one page. It cannot tell whether a line is a category page, a standalone page, an article, or a product page; every line is treated as the same kind of thing. The intention is not wrong: the goal is to make a site easier for AI to read. The problem is not the descriptions but that the format flattens the site into a single layer, destroying the site's original narrative power and content context.

A humanoid robot in a scholar's robe points to a line in an open book, showing it to a human author at the desk who holds a quill, and the author leans in to look closely at that spot, with drafts and notes spread across the desk. The reversal of the two figures' roles is the key to this painting: the robot is not a student being examined but a copy editor giving the site a check-up. Its finger points to an unclear line of description, and that spot is the gap itself. The quill stays in the human's hand; the right to revise has not been handed over, and the AI only lets the person see where the writing is not clear enough. The book pulled half-out on the desk hints that this is not a large-scale rewrite but a point-by-point fine-tuning of titles, category descriptions, and article placement, all the moves a site owner makes in ordinary SEO.

Use an AI Chatbot to Train Your Site for SEO

You think you are training the AI to understand your site? It should be the other way around: use an AI chatbot to train your site. Every time the AI answers wrong, it is telling you which article's title or description is unclear, helping you check your site's SEO blind spots. RAG Sitemap drops the black-box vector store and reads the titles, categories, and descriptions you already wrote in WordPress, generating a plain-text site map for the AI. You only fix things in the admin the way you already do SEO, with no new tool to learn and no algorithm to read minds.
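As a rough illustration of turning existing titles, categories, and descriptions into a plain-text site map, the sketch below groups posts under their category and indents by layer. The function name render_sitemap, the dictionary keys, and the indented output format are all assumptions made for this example; they are not the actual RAG Sitemap format.

```python
from collections import defaultdict


def render_sitemap(site_name: str, posts: list[dict]) -> str:
    """Render site -> category -> post as indented plain text."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for post in posts:
        by_category[post["category"]].append(post)

    lines = [site_name]
    for category in sorted(by_category):
        lines.append(f"  {category}")
        # Posts keep their input order within each category.
        for post in by_category[category]:
            lines.append(f"    {post['title']}: {post['description']}")
    return "\n".join(lines)


# Hypothetical content, shaped like what a site owner already maintains.
posts = [
    {"category": "Guides", "title": "Getting started", "description": "Install and first run"},
    {"category": "News", "title": "Version 2 released", "description": "What changed"},
    {"category": "Guides", "title": "Troubleshooting", "description": "Common errors"},
]
print(render_sitemap("Example Site", posts))
```

In a layout like this, an unclear title or description shows up directly in the text the model reads, which is why fixing it in the admin also fixes the answer.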