AI-on-Chip: Moving AI Inference onto the User's Device

"AI-on-Chip" describes the point at which every device has a small AI model etched into a chip, so that the model is no longer software that must be loaded but a compute chip always on standby. The LLM inference an application needs can then run locally on the visitor's device, bringing the site owner's AI compute cost to zero. This is the end goal of RAG Chatbot.

A group of people circle a central light, each receiving a flame and cupping it in their hands, so that light spreads from one place into many separate palms. The central light is the cloud API of the past decade, where every inference had to go back to the center and be paid for. The light passed to each pair of hands stands for the trajectory of the NPU, "chip is model," and the Chrome Prompt API, which move inference back onto the visitor's own device. The flames are close in size, meaning the edge small model is already capable enough to carry a site's navigation task. The cupped hands stand for privacy and non-disclosure, which hold naturally under this architecture. The distances between people are even: this is not a new center replacing the old one but the center dissolving entirely.

Because the chatbot on a site is a free public utility offered to visitors, the AI cost falls entirely on the site owner. So we start from the smallest model and break tasks down as finely as possible, letting every AI node be handled by the lightest small model, while site knowledge is supplied by the RAG Sitemap. This architecture already runs small models smoothly today, and once edge chips become common and "chip is model" arrives, bringing the site owner's operating cost to zero will simply follow.

Why This RAG Harness Architecture Bets on the Smallest Model

This architecture starts from the smallest model not as a technical compromise but because a real constraint demands it: the chatbot has to stay affordable for the site owner over the long run. The chatbot on a site is not the owner's personal ChatGPT; it is a public utility serving visitors for free. An ordinary site owner does not have the deep pockets of Google or OpenAI, which can let hundreds of millions of people call a frontier model for free. The visitors' AI bill falls entirely on the owner, and the larger the site, the higher the cost. This cost curve decides whether a chatbot can keep running on a site over the long term.

The RAG Chatbot we built leans toward the smallest model at every AI node, because once a task is broken down finely enough, each AI stage is left with only simple work that even the cheapest small model can handle. This product treats 2024's Llama 3.2 3B as the baseline spec for development and testing, the way a game studio picks an entry-level graphics card as the minimum bar to polish against: if even the weakest model runs, swapping in a stronger one only makes the customer experience better. The latest small models at the same parameter scale are already several times stronger than back then, and the small models a site owner can use in the future will only be stronger than today's.

Device Hardware Is Also Leaning Toward Small Models

The NPU, a small chip designed specifically for AI inference, quietly entered consumer CPUs as early as 2023, but its performance was too low then to run anything meaningful. In the last year or two, compute has crossed the threshold at which small models can truly run locally, and a low-power, space-efficient NPU is now standard in new devices. On Windows, the Copilot+ PC ships with Phi Silica, a 3B small model; Apple has a Neural Engine in its chips from iPhone to Mac; and in the Android camp, Qualcomm and MediaTek each build NPUs into their phone chips. Devices are starting to ship with small models that can run inference locally. If this trajectory holds, handing a site's AI chatbot compute to the visitor's device, and so bringing the owner's cost to zero, is not wishful thinking but a direction with a real foundation.

But an opportunity appearing does not mean you can rely on it today. For a site that must serve arbitrary visitors, the problem comes down to one thing: a site cannot assume the visitor's browser has a device-side model it can call directly. First, the entry point is not yet complete. On the web, the only path that touches a device model is Chrome's built-in model, which is still experimental, off by default, and must be turned on manually; Safari and Firefox have no equivalent yet. Second, even when a browser supports it, the model still has to download several gigabytes in the background before it can be used. Your chatbot faces arbitrary browsers and cannot assume the device in the visitor's hand both opens this door and happens to have the model there.
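The check this paragraph calls for can be sketched as a small feature test. This is a minimal sketch, not RAG Chatbot's actual code. It assumes the browser exposes a `LanguageModel` object whose `availability()` method resolves to `"unavailable"`, `"downloadable"`, `"downloading"`, or `"available"`, which is how Chrome's experimental Prompt API is documented at the time of writing and may change. `pickBackend`, `Backend`, and `LanguageModelLike` are hypothetical names introduced for illustration.

```typescript
// Availability states as documented for Chrome's experimental Prompt API
// (assumption: the shape may change while the API is experimental).
type Availability = "unavailable" | "downloadable" | "downloading" | "available";

// Minimal structural type for the part of the API this sketch touches.
interface LanguageModelLike {
  availability(): Promise<Availability>;
}

type Backend = "device" | "cloud";

// Hypothetical helper: use the device model only when it is already
// present on the device; otherwise fall back to the cloud API.
async function pickBackend(
  scope: { LanguageModel?: LanguageModelLike },
): Promise<Backend> {
  const lm = scope.LanguageModel;
  if (!lm) return "cloud"; // no built-in model exposed by this browser
  try {
    const state = await lm.availability();
    // "downloadable" and "downloading" mean the weights are not on the
    // device yet, so this question goes to the cloud in the meantime.
    return state === "available" ? "device" : "cloud";
  } catch {
    return "cloud"; // any failure in an experimental API falls back
  }
}
```

In a page, the argument would be the global object. Passing it in explicitly keeps the decision testable and makes the cloud path the default whenever anything is missing or uncertain.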

So the direction is set and the opportunity has appeared, but today is still a transition. Before this entry point reaches every device, RAG Chatbot still runs on free cloud APIs and cheap small models, and that combination works today. The device side is not what we depend on now; it is the destination this architecture reserved for it from the start. What was missing was never the model's ability, since for the navigation role the cheapest small model has long been enough. What is not yet in place is an entry point, present on every device, that lets a model run inference locally on the visitor's own hardware. The next question is where this entry point will come from, and the most thorough answer is to make the model no longer software that must be loaded.

AI-on-Chip

Move AI inference back onto the user's device, and the site owner's compute cost goes to zero.

Today
Cloud API
Inference is completed in the cloud, and the answer is sent back to the visitor's device.

Inference happens in the cloud, and the site owner pays the bill. Break the task down finely enough and a cheap small model can handle it; this combination works today.

Inference location · Cloud
Inference speed · Fast
Site owner's bill · Paid

Transition
Browser built-in model
The model is built into the browser, and every inference must load the LLM into RAM.

The model is built into the browser, downloaded once and shared by every site, and inference returns to the visitor's device. But the model is still software, and every inference must load several GB of weights into memory.

Inference location · Visitor downloads model
Inference speed · Slow
Site owner's bill · Zero

Endgame
AI-on-Chip
Weights are etched into the chip's circuits: no loading, no memory footprint.

Weights are no longer data loaded into memory but circuits etched onto a chip. No loading, no memory footprint; AI stands by at all times like Wi-Fi.

Inference location · Visitor LLM chip
Inference speed · Very fast
Site owner's bill · Zero

The premise unchanged across all three stages

The model only reads and navigates; site knowledge is supplied by the RAG Sitemap. Precisely because knowledge does not rely on being built into the model, the cheapest small model is enough, and that is what lets this architecture run all the way onto edge chips.

The More Thorough Step: Making the Model the Chip Itself

Let the weights no longer be data loaded into memory but circuits etched onto a chip. Taking no memory and needing no loading, inference is as fast as a multiplication table carved into your mind, ready the moment you open your mouth. Taalas is already doing this: its first chip, HC1, etches Llama 3.1 8B into silicon. But that is only the opening act; they chose an 8B model because it is small and convenient as an MVP. The technology itself keeps advancing toward stronger models generation by generation, and the only question left is when "chip is model" reaches end devices.

What gets etched into the chip will be a small model, because the essence of an edge device is small, light, and low-power. On consumer phones and laptops, user needs are all over the map, so the model in the chip must be able to field anything: a general model that handles text, images, and speech. Etching a dedicated chip for each function would walk back into fragmentation, where each one must be developed anew at a cost the market cannot bear. Only a small-model chip that all applications can share has enough economic value to be put into every phone and every laptop, becoming a standard component like a memory module. When it is everywhere, developers finally have the motivation to build on it, and scale and an ecosystem grow from that base. A model etched into a chip cannot be changed, but this is not a flaw; devices are replaced generation by generation anyway, and the model updates with the device generation, just as CPUs, cameras, batteries, and Wi-Fi modules step up in spec each generation.

The Endgame: AI on Standby at the User's Device, the Owner's Bill at Zero

When every device has a small model etched into a chip, AI stands by in the background like Wi-Fi, and that is exactly the day this RAG Chatbot is waiting for. A small model etched into a chip has limited world knowledge, but edge scenarios never needed world knowledge; what they need is knowledge of the application and the site in front of them. Site knowledge is supplied by the RAG Sitemap, and the model only has to understand and navigate. From that moment, the chatbot can run directly on the user's edge model, with no online API needed at all, and the visitor's question is handled locally on their own device.
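The division of labor described here, where the sitemap carries the knowledge and the model only navigates, can be sketched as plain prompt assembly. This is a minimal sketch under assumptions: the real RAG Sitemap format is not shown in this article, so the `SitemapEntry` fields, the bracketed category layout, and the prompt wording are all hypothetical.

```typescript
// Hypothetical shape of one sitemap entry; the real RAG Sitemap format
// is not shown in this article.
interface SitemapEntry {
  title: string;
  category: string;
  description: string;
  url: string;
}

// Render entries as plain text grouped by category, so a small model can
// read the site's structure without any built-in knowledge of the site.
function renderSitemap(entries: SitemapEntry[]): string {
  const byCategory: Record<string, SitemapEntry[]> = {};
  for (const e of entries) {
    const group = byCategory[e.category] ?? [];
    group.push(e);
    byCategory[e.category] = group;
  }
  const lines: string[] = [];
  for (const category of Object.keys(byCategory)) {
    lines.push(`[${category}]`);
    for (const e of byCategory[category]) {
      lines.push(`- ${e.title}: ${e.description} (${e.url})`);
    }
  }
  return lines.join("\n");
}

// The model is asked only to navigate: choose the page that fits the question.
function buildNavigationPrompt(entries: SitemapEntry[], question: string): string {
  return [
    "You are a site guide. Use only the site map below.",
    "Reply with the URL of the most relevant page, or NONE if no page fits.",
    "",
    renderSitemap(entries),
    "",
    `Visitor question: ${question}`,
  ].join("\n");
}
```

Because everything the model needs arrives in the prompt, the same string could be sent to a cloud API today and to a device model later without changing how the site's knowledge is maintained.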

On the site owner's side, all that is needed is to run the content well, install RAG Chatbot, and define the site's personality (SOUL.md). The rest of the AI inference cost goes to zero, and the operating cost is squeezed further toward free. You play games on your own graphics card and watch video on your own screen; asking an AI will, in the end, also return to inference on your own device.

On the Software Side, a Step Has Already Been Taken

Etching the model into a chip is the endpoint because the weights become circuits directly, with nothing to move in and out of memory, fast and power-efficient to the limit. But before that day arrives, the software side has already taken a step forward. Chrome's Prompt API lets the browser build in a small model, downloaded once and shared by all sites, so the visitor need not supply a model and the question is computed on their own device. This is exactly the access method this RAG Chatbot is waiting for, treating "built into the browser" as the transition before "chip is model"; and technically this method already runs today.
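Treating the browser's built-in model as a transition can be sketched as a local-first call with a cloud fallback. This is a minimal sketch, not RAG Chatbot's implementation. It assumes the Prompt API's `create()` resolves to a session object with `prompt()` and `destroy()` methods, as Chrome's experimental documentation describes at the time of writing. `ask`, `PromptApi`, and the `askCloud` callback are hypothetical names.

```typescript
// Structural types for the slice of Chrome's experimental Prompt API this
// sketch relies on (assumption: create() resolves to a session exposing
// prompt() and destroy(); details may change).
interface PromptSession {
  prompt(input: string): Promise<string>;
  destroy(): void;
}
interface PromptApi {
  create(): Promise<PromptSession>;
}

// Hypothetical cloud fallback supplied by the caller.
type CloudAsk = (prompt: string) => Promise<string>;

// Try the browser's built-in model first; on any failure, use the cloud.
async function ask(
  prompt: string,
  local: PromptApi | undefined,
  askCloud: CloudAsk,
): Promise<{ answer: string; ranOn: "device" | "cloud" }> {
  if (local) {
    try {
      const session = await local.create();
      try {
        const answer = await session.prompt(prompt);
        return { answer, ranOn: "device" };
      } finally {
        session.destroy(); // release the session when done
      }
    } catch {
      // Local inference failed; fall through to the cloud path.
    }
  }
  return { answer: await askCloud(prompt), ranOn: "cloud" };
}
```

The caller sees the same result shape either way, so moving from cloud to browser to chip would change where the answer is computed without changing the chatbot's interface.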

But being able to run does not mean the time has come. The browser downloads a gigabyte-scale model in the background while ordinary applications are still measured in megabytes; it eats several gigabytes of disk in one bite, something most users never consented to, and when space runs short the model can be cleared automatically and must be downloaded again when needed. These are rough edges that will be smoothed: downloads can get leaner, consent flows clearer, storage further optimized. One thing, however, cannot be smoothed away. As long as the model is software, every inference must load those several gigabytes of weights into memory, and RAM that was already scarce loses a slice every time. Worries about privacy and disk space will settle as the technology matures, but the memory tax will not, because it is not a flaw of any particular version but the fate of a software model. What truly crosses the barrier is making the model no longer loaded software but a model etched into a chip.
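The memory tax is simple arithmetic: the number of parameters times the bytes stored per weight. The sketch below is illustrative only. It counts weights alone, ignoring the KV cache and runtime overhead, and the 16-bit and 4-bit widths are common precision levels chosen as assumptions, not measurements of any browser's built-in model.

```typescript
// Approximate size of a model's weights in gigabytes (1 GB = 1e9 bytes).
// Counts weights only; KV cache and runtime overhead are ignored.
function weightGigabytes(parameters: number, bitsPerWeight: number): number {
  return (parameters * bitsPerWeight) / 8 / 1e9;
}

const threeBillion = 3e9; // a 3B-parameter model, the article's baseline scale

console.log(weightGigabytes(threeBillion, 16)); // 6
console.log(weightGigabytes(threeBillion, 4)); // 1.5
```

Quantization shrinks the figure but does not remove it: whatever the bit width, a software model's weights have to sit in memory while it runs.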

Related articles

The small model Llama 3.2 3B is a language model with only 3B parameters, about as small as they come. Ask it a question and it can only answer from what its training left in those 3B parameters. It does not know what your site says, does not know which articles you published, and knows nothing about the content you have built up recently. Using it to run website Q&A should have been a fantasy.

You think you are training the AI to understand your site? It should be the other way around: use an AI chatbot to train your site. Every time the AI answers wrong, it is telling you which article's title or description is unclear, helping you check your site's SEO blind spots. RAG Sitemap drops the black-box vector store and reads the titles, categories, and descriptions you already wrote in WordPress, generating a plain-text site map for the AI. You only fix things in the admin the way you already do SEO, with no new tool to learn and no algorithm to read minds.

llms.txt is a sitemap designed for AI to read, but its limit is that it has only one layer, which is not enough for an organized, structured site. Its standard format is the site name as an H1, a short summary, and below that each line is a title: description pointing to one page. But it cannot tell whether a line is a category page, a standalone page, an article, or a product page; every line is treated as the same kind of thing. The intention is not wrong, and the goal is to make a site easier for AI to read. The problem is not the description but that it flattens the site into a single layer, destroying the site's original narrative power and content context.