A Small Language Model (SLM) Actually Understood an Entire Website

Llama 3.2 3B is a language model with only 3 billion parameters, small by current standards. Ask it a question and it can only answer from what training packed into those parameters. It does not know what your site says, which articles you have published, or anything about the content you have added recently. Using it to run website Q&A ought to be a fantasy.

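A tiny humanoid robot in a plain Greek robe, carrying a small lamp, walks confidently into the depths of a vast, majestic stone colonnade whose columns stretch endlessly into the distance. The tiny robot is a garbage-tier small model like Llama 3B, and the lamp, which lights only the ground at its feet, is its limited world knowledge. Yet its stride is confident, because what actually guides it is the order of the surrounding columns, not the lamp in its hand. The rows of columns receding into the distance correspond to the progressive disclosure of master → category → post. It does not matter that the model is small; when the order is clear enough, every hop converges to a multiple-choice question. The subject of this picture is not the robot but the colonnade itself: capability is not the point, correct structure is.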

But once it is paired with the structured index map that RAG Sitemap generates in one click, Llama 3.2 3B can answer questions about a mid-sized WordPress site. Gemini Flash Lite or GPT nano runs this system without breaking a sweat. A small model can read an entire site not because it is smart but because we no longer hand it chaos; we turn the site into a low-entropy map with a clear structure that can be read at a glance.

Why Can a Small Model Do It?

The biggest difference between a large model and a small one is not IQ but world knowledge. A 3B model has just enough capacity to speak fluently; it cannot retain anywhere near as much of the web as a 70B model can. And however much world knowledge a model absorbs, that knowledge still goes out of date. But when the question is confined to your site, a small model's reasoning ability is not bad at all. The real question is: even with that reasoning ability, how does it know where the answer is?

RAG Sitemap solves this with almost no friction. Your category index, page hierarchy, and article structure already form an organized knowledge map. RAG Sitemap reads WordPress's existing content structure directly and converts it in one click into a plain-text navigation map an AI can read; you do not need to learn what a vector is or attach any database. The model does not have to understand the whole site every time it answers; it only has to walk along this ready-made map to know where to look for the answer.
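To make the idea of a plain-text navigation map concrete, here is a minimal sketch of rendering a content hierarchy as indented text. It is an illustration under stated assumptions, not RAG Sitemap's actual output format: the `Node` class, the `render_map` function, the sample titles, and the URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in the site hierarchy: a category or a post (hypothetical shape)."""
    title: str
    url: str
    children: list["Node"] = field(default_factory=list)


def render_map(node: Node, depth: int = 0) -> str:
    """Render the hierarchy as indented plain text, one line per node."""
    line = f"{'  ' * depth}- {node.title} ({node.url})"
    child_lines = [render_map(child, depth + 1) for child in node.children]
    return "\n".join([line, *child_lines])


# Hypothetical sample site; a real one would come from the CMS's own structure.
site = Node("Home", "/", [
    Node("Beverages", "/beverages/", [
        Node("Italian Coffee", "/beverages/italian-coffee/", [
            Node("Espresso", "/espresso/"),
        ]),
    ]),
    Node("Desserts", "/desserts/"),
])

site_map_text = render_map(site)
print(site_map_text)
```

The result is ordinary text whose indentation carries the hierarchy, so it can be pasted into a prompt as-is with no embedding step and no database.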

Leading a Small Model into the Existing Context: The Immersive Navigation of RAG Sitemap

The same site's content can be handed to an AI in two ways. Vector retrieval deconstructs it, slicing it into a litter of context-stripped paper scraps and throwing them into a high-dimensional abstract space, leaving the model to reassemble them blindly by similarity in the dark. RAG Sitemap, by contrast, fully preserves the organic hierarchy a human carefully arranged when publishing. The small model does not have to grope through chaos; it only has to stand at the crossroads with its eyes open, follow the signs, make choices, and arrive at the answer intuitively within an unbroken flow of order.

[Figure: RAG Sitemap pathfinding topology model. Llama 3B parses the query "Hot & Thirsty" and descends three layers. Layer 1 (Main Category): Desserts, Beverages, Main Course. Layer 2 (Subcategory): Pour-Over, Italian Coffee, Americano. Layer 3 (Final Node): Espresso, Shakerato, Macchiato.]
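The hop-by-hop descent described above can be sketched as a loop in which every step is a pick among a handful of siblings. This is a minimal sketch, not RAG Sitemap's implementation: `INDEX`, `build_prompt`, `navigate`, and `scripted_chooser` are hypothetical names, the grouping of the figure's labels into parents and children is assumed, and the chooser is a scripted stand-in for a real model call.

```python
from typing import Callable

# Hypothetical three-layer index: main category -> subcategory -> final node.
INDEX = {
    "Desserts": {},
    "Beverages": {
        "Pour-Over": {},
        "Italian Coffee": {"Espresso": {}, "Shakerato": {}, "Macchiato": {}},
        "Americano": {},
    },
    "Main Course": {},
}

Chooser = Callable[[str, list[str]], str]


def build_prompt(question: str, options: list[str]) -> str:
    """Format one hop as a multiple-choice question a small model could answer."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(options, 1))
    return f"Question: {question}\nPick the one best branch:\n{numbered}"


def navigate(question: str, index: dict, choose: Chooser) -> list[str]:
    """Walk the index one layer at a time until a node has no children."""
    path = []
    level = index
    while level:
        options = list(level)
        picked = choose(question, options)
        if picked not in level:
            raise ValueError(f"chooser returned unknown option: {picked!r}")
        path.append(picked)
        level = level[picked]
    return path


def scripted_chooser(question: str, options: list[str]) -> str:
    """Stand-in for a model call: always picks from a fixed set of answers."""
    wanted = {"Beverages", "Italian Coffee", "Shakerato"}
    return next(option for option in options if option in wanted)


path = navigate("Hot & Thirsty", INDEX, scripted_chooser)
print(path)  # ['Beverages', 'Italian Coffee', 'Shakerato']
```

In a real setup the chooser would send `build_prompt(question, options)` to the model and map its reply back to one of the options. The model never sees more than one layer of siblings at a time, which is what keeps each decision small.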

Bergson's two modes of knowing: external "analysis" and internal "intuition"

Analysis: Embedding
Staying outside, reducing to symbols.

Bergson noted that analysis is the observer staying outside the thing and reducing it to rigid symbols and spatial representations. This is just like traditional vector retrieval: it cuts flowing content into cold chunks, flattened into directionless mathematical coordinates. It suspends the text's vitality, leaving the small model to grind through distance calculations from the outside, barely piecing together knowledge fragments stripped of organic context inside a jigsaw-like maze of similarity.

Intuition: RAG Sitemap
Entering inside, grasping through intellectual sympathy.

The intuition Bergson prized breaks through all symbolic mediation and throws itself directly inside the object, producing an intellectual sympathy that grasps its unique flow of life. RAG Sitemap is exactly this path of intuition. It does not deconstruct and does not distort; it lets the small model immerse directly in the existing context the site owner wove. The model flows and reads along the ready-made intent, intuitively embracing the soul of the site's knowledge as a whole.

Related articles

AI entropy-reduction engineering is the umbrella term for every design that steers an LLM toward a usable answer. From prompt to context to agent to harness, all of them narrow the range of prediction and lower the uncertainty of an answer; only the scope of their influence differs. An LLM operates on a naturally high-entropy linguistic medium, so the core engineering of an AI application is to reduce that entropy through structured input and external knowledge, lowering uncertainty and improving output quality.

A vector database is not a requirement for RAG; it is only one way to feed data to an AI. When data is inherently messy and lacks clear boundaries, vectorization helps a model guess semantic relevance from large amounts of text, and that has its value. But when content already has order, the question is no longer how to force relevance out of chaos, but how to let the AI see the most important interpretive clues first. Effective RAG does not have to slice the full text, compress it into vectors, and then guess the answer; it can instead organize content into a path the AI understands layer by layer, lowering contextual uncertainty first and then expanding the detail.

RAG Harness Engineering means every visitor question triggers not a single AI prompt but three independent AI API calls: vision, retrieval, answer. Chaining multiple sub-agents normally risks each stage poisoning the next. The Harness architecture instead hands every stage the visitor's original question and a clear view of the initial task goal, so each stage stays anchored to the task rather than to whatever the previous stage passed along. Accumulated noise is stopped at the entrance of every hop.
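The anchoring described above can be sketched as a small pipeline runner. This is a minimal sketch under assumptions, not the actual Harness code: `StageInput`, `run_harness`, and `make_stub` are hypothetical names, the goal string is invented for the demo, and the three stages are stubs that only record what they received, since the source does not specify what each stage does internally.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StageInput:
    question: str  # the visitor's original question, passed unchanged to every stage
    goal: str      # the initial task goal, also passed unchanged
    upstream: str  # output of the previous stage ("" for the first stage)


Stage = Callable[[StageInput], str]


def run_harness(question: str, goal: str,
                stages: list[tuple[str, Stage]]) -> dict[str, str]:
    """Run stages in order; each sees the original question plus upstream output."""
    outputs: dict[str, str] = {}
    upstream = ""
    for name, stage in stages:
        upstream = stage(StageInput(question, goal, upstream))
        outputs[name] = upstream
    return outputs


seen_questions: list[str] = []


def make_stub(label: str) -> Stage:
    """Build a stub stage that records the question it was handed."""
    def stage(stage_input: StageInput) -> str:
        seen_questions.append(stage_input.question)
        return f"{label} output"
    return stage


question = "Which iced coffee do you serve?"
stages = [(name, make_stub(name)) for name in ("vision", "retrieval", "answer")]
result = run_harness(question, "Answer from site content only", stages)
print(result)
```

Because `question` and `goal` are fields the runner fills in itself, a stage can produce a noisy `upstream` value without being able to rewrite what the next stage is asked to do.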