The Vision API deliberately decouples the seeing stage and does just one thing: it turns a visitor's uploaded image into a text description most models can read, then hands it to the next AI node. A multimodal model could answer questions about the image directly, yet this architecture first translates the image into a text cheat sheet. The point is that every sub-agent diving through the sea of text in the retrieval stage carries the same cheat sheet alongside the visitor's original question.
The Stage-by-Stage Decoupling of RAG Harness
In the RAG API, the Diving Agent (the dispatcher) holds the Master Sitemap and decides, based on the visitor's question, which direction to dive and which Sub Agent (the executor, in Diving-mode) to send. Each time it reaches a new depth and faces a new scene, the Sub Agent always holds the visitor's original question and judges what content to submit on that basis. Finally, once all retrieval results return to the Chat API, the response is produced through the persona and answer rules the site owner has defined.
The three AI nodes, Vision API, RAG API, and Chat API, are independent, and each can be freely configured with a different model. Cloud options include Mistral, OpenAI, Gemini, and xAI; for local compute, vLLM is recommended. Choosing a model is no longer a single global decision but a per-stage optimization: image recognition goes to a model that is good at vision, retrieval routing goes to a cheap small model, and the Chat API, the node that actually faces the visitor, deserves a more expressive model to finish the job.
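The per-stage choice described above can be pictured as a small configuration table. This is only a minimal sketch: the `NodeConfig` type, the `PIPELINE` mapping, the `model_for` helper, and every model name below are hypothetical placeholders rather than the product's actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    provider: str  # e.g. "mistral", "openai", "gemini", "xai", or "vllm" for local compute
    model: str     # placeholder name, not a real model identifier


# Hypothetical per-stage configuration: one entry per AI node.
PIPELINE = {
    "vision": NodeConfig(provider="gemini", model="vision-capable-model"),
    "rag": NodeConfig(provider="vllm", model="small-routing-model"),
    "chat": NodeConfig(provider="openai", model="expressive-chat-model"),
}


def model_for(stage: str) -> NodeConfig:
    """Look up the model configured for one stage of the pipeline."""
    try:
        return PIPELINE[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
```

Swapping the model for one stage then touches a single entry and leaves the other two nodes alone.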
Every AI Node Is a Clean Recipe
An AI node does not pile all of its context onto the node downstream. What sub-agents hand off is only a clear diving direction and the re-injected original question. The Vision API passes on only the interpreted semantics of the image. Across the many descents of the Diving-mode loop, the RAG API stores each round's findings in an accumulation area, and Scuba Deep Dive (the acceptance check) judges whether retrieval is sufficient; only when it is does it hand the whole accumulation area to the Chat API. The downstream node receives only the upstream team's results and never touches the process behind them.
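The accumulate-then-hand-off pattern above could look roughly like the following. It is a sketch under stated assumptions: `diving_loop`, `ChatHandoff`, and the `dive` and `is_sufficient` callbacks are illustrative names, and the `max_rounds` cap is a safety assumption of this example, since the text does not say what happens when retrieval never becomes sufficient.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ChatHandoff:
    """Everything the Chat API receives: results only, no process trace."""
    question: str
    findings: List[str]


def diving_loop(
    question: str,
    dive: Callable[[str, int], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    max_rounds: int = 5,  # assumed cap, not taken from the source
) -> Optional[ChatHandoff]:
    accumulation: List[str] = []  # the accumulation area
    for depth in range(max_rounds):
        # Each descent re-injects the original question, never the prior transcript.
        accumulation.extend(dive(question, depth))
        # Acceptance check: hand off only once retrieval is judged sufficient.
        if is_sufficient(question, accumulation):
            return ChatHandoff(question=question, findings=list(accumulation))
    return None  # never judged sufficient within the assumed cap
```

Note that the hand-off object has no field for intermediate reasoning or per-round state, so the downstream node cannot see the process even by accident.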
Beyond context isolation, each node's LLM task is also deliberately narrowed. On every call during retrieval, the LLM does not freely generate long text but outputs fixed JSON: first a single think field, in which it reads the question and works out where to go now, then two answer fields, next_mode and next_act. The think field serves two purposes at once: it is sent to the front end so the visitor sees what the AI is thinking right now, and it makes the small model converge on a direction before answering, so its later choice of mode and content is more accurate.
The whole task converges into a multiple-choice question that tests only whether the model knows what it is doing and can make a judgment, not its ability to generate long text. Together, a clean recipe and converged output let a lightweight model such as a 3B-parameter Llama run the entire RAG pipeline.
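A strict parser is one way to enforce the fixed JSON output described above. The field names think, next_mode, and next_act come from the text; the allowed mode values, the assumption that all three fields are strings, and the `parse_step` helper itself are hypothetical.

```python
import json

FIELDS = ("think", "next_mode", "next_act")  # think must come first
ALLOWED_MODES = {"dive", "submit"}           # hypothetical mode names


def parse_step(raw: str) -> dict:
    """Parse one retrieval-step reply and reject anything off-format."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # JSON objects parse into insertion-ordered dicts, so this also checks
    # that the model wrote its think field before the two answer fields.
    if tuple(data) != FIELDS:
        raise ValueError(f"expected fields {FIELDS}, got {tuple(data)}")
    for key in FIELDS:
        if not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string")
    if data["next_mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown next_mode: {data['next_mode']!r}")
    return data
```

Because the reply is either one of a few valid choices or an error, a failed parse can simply trigger a retry instead of any repair logic.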
Prompt Cache: From the Second Query On, Cost Drops to Loose Change
Across the whole pipeline, the biggest token consumer is input, not output, at a ratio of about 10 to 1. The largest shares (the prompt, the Master Sitemap, and every page or article chunk) are all static content, classified at the architecture level as the cache segment; #ROLE, #RULES, #OUTPUT, and #SOUL.md are handled the same way. The site owner does not have to draw this boundary; static stays static and dynamic stays dynamic. From the second query on, the static segment is served from the cache, and the largest cost item drops to loose change.
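The static/dynamic boundary above amounts to keeping the unchanging content in a byte-identical prefix and appending only the visitor's question after it. The sketch below assumes a prefix-based cache; the `#SITEMAP` and `#QUESTION` labels, both function names, and the discount and token figures in the usage note are illustrative assumptions, not real provider pricing.

```python
from typing import Dict, Tuple

STATIC_SECTIONS = ("#ROLE", "#RULES", "#OUTPUT", "#SOUL.md")


def build_prompt(static_parts: Dict[str, str], sitemap: str, question: str) -> Tuple[str, str]:
    """Return (cacheable prefix, dynamic suffix) for one query."""
    prefix = "\n\n".join(f"{name}\n{static_parts[name]}" for name in STATIC_SECTIONS)
    prefix += "\n\n#SITEMAP\n" + sitemap   # hypothetical label; static content
    suffix = "\n\n#QUESTION\n" + question  # hypothetical label; the only dynamic part
    return prefix, suffix


def input_cost(prefix_tokens: int, suffix_tokens: int, price_per_token: float,
               cached_discount: float, cache_hit: bool) -> float:
    """Estimate input cost when cached prefix tokens are billed at a reduced rate."""
    prefix_price = price_per_token * (cached_discount if cache_hit else 1.0)
    return prefix_tokens * prefix_price + suffix_tokens * price_per_token
```

With made-up figures of 9,000 prefix tokens, 1,000 suffix tokens, a unit price of 1.0, and cached tokens billed at 0.25 of the normal rate, the first query costs 10,000 units and later queries cost 3,250.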
The Minimization Discipline of RAG Harness
A clean recipe at every hop, every task narrowed to the minimum, a clear boundary between static and dynamic: not one of these designs is decorative. Each AI stage is built as an MVP (Minimum Viable Product), achieving basic feasibility at the lowest cost. Each node does just enough: the recipe just clean enough, the model just small enough, the output just converged enough. The current practice configures several lightweight models across stages; in the future, if small multimodal models mature, running the whole chain on a single model would hold up just as well.
