I wanted to test whether I could add a local chatbot to arj.no/chat/ that answers questions about my website content and not the whole internet. Since my blog is built using Hugo, a static site generator, I don't have a server in the background that could host a Large Language Model (LLM). Instead, I looked for a simple solution based on standard web technologies and static files.
Running a model in the browser with WebLLM
After discussing with CoPilot, I ended up using WebLLM, which allows language models to run inside the browser. It relies on WebGPU, which makes local AI processing possible if you have a browser such as Edge or Chrome available.
WebLLM downloads a compressed model to the browser, where it runs locally. In this setup, users can choose smaller/faster or larger/better models (for example Llama 3.2 1B/3B or Phi-3.5 Mini). After the first download, the model is cached locally so it does not need to be downloaded every time.
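Switching between a smaller and a larger model is then just a matter of which model id gets passed to WebLLM. As a minimal illustration (the variable is invented, and the 1B id is assumed to follow the same naming pattern as the 3B id used below):

// Hypothetical toggle between a faster and a better model.
const modelId = useSmallModel
  ? "Llama-3.2-1B-Instruct-q4f16_1-MLC"
  : "Llama-3.2-3B-Instruct-q4f16_1-MLC";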
The chat app itself is plain JavaScript loaded from static files:
import * as webllm from "https://esm.run/@mlc-ai/web-llm";

// Download the model (or load it from the cache) and initialise the engine,
// logging download progress along the way.
const engine = await webllm.CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => console.log(p.text),
});
Replies are streamed token-by-token, giving a word-by-word effect. A token is the basic unit a language model reads and writes; roughly, one token ≈ ¾ of an English word (so “chatbot” might be one or two tokens). Models have a maximum number of tokens they can process at once, which is why RAG inserts only the most relevant chunks rather than all available text. In my code, the token-by-token streaming is handled with this JavaScript snippet:
// Ask for a streamed reply so it can be rendered word by word.
const stream = await engine.chat.completions.create({
  messages: chatHistory,
  stream: true,
});

for await (const chunk of stream) {
  // Each chunk carries the next piece of the reply.
  const delta = chunk.choices[0]?.delta?.content ?? "";
  // append delta to the displayed bubble
}
Grounding answers in local documents with RAG
The whole point of this experiment was to have a chat that can answer questions about my blog. However, the model does not know anything about my content, so the next step was to add Retrieval-Augmented Generation (RAG). This is a technique that finds relevant text from a knowledge base and inserts it into the prompt before asking the LLM to answer. This grounds the model’s reply in specific documents rather than relying solely on what it “remembers” from training. It has two phases:
Offline — build a knowledge base. A Python script scans my content (English posts, Norwegian posts, and a docs/ folder with PDFs of my articles), extracts the text, and splits it into smaller chunks (here, about 220 words each). Breaking documents into chunks makes it practical to find the most relevant passage for a given question instead of sending entire articles to the model. The script then computes a Term Frequency–Inverse Document Frequency (TF-IDF) index. The result is saved as a JSON file the browser can read quickly.
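The actual indexer is a Python script, but the core of the offline phase is small enough to sketch. Purely as an illustration, written in JavaScript for consistency with the rest of the post (the tokenisation and the function names are invented), the chunking and document-frequency counting could look roughly like this:

// Split one document's plain text into chunks of roughly `size` words.
function chunkText(text, size = 220) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += size) {
    chunks.push(words.slice(i, i + size).join(" "));
  }
  return chunks;
}

// Count how many chunks each term occurs in (the document frequency),
// which is what the "inverse document frequency" part is computed from.
function documentFrequencies(chunks) {
  const df = new Map();
  for (const chunk of chunks) {
    const terms = new Set(chunk.toLowerCase().match(/[a-zæøå0-9]+/g) ?? []);
    for (const term of terms) df.set(term, (df.get(term) ?? 0) + 1);
  }
  return df;
}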
Online — retrieve and inject context. When a user asks a question, JavaScript finds the most relevant text pieces using TF-IDF similarity. Those pieces are then inserted into the prompt sent to WebLLM:
[Source 1: welcome.md]
...retrieved chunk text...
---
Question: <user question>
The model then answers with this local context available. After each reply, the interface shows a collapsible Sources section so users can see where the answer came from.
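As a rough sketch of the online step (the index field names tf, idf, source, and text are illustrative, not the exact schema my script produces):

// Score every chunk against the question and keep the top matches.
// `index` is the JSON file produced offline: chunks plus an idf table.
function retrieve(question, index, k = 3) {
  const terms = question.toLowerCase().match(/[a-zæøå0-9]+/g) ?? [];
  const scored = index.chunks.map((chunk) => {
    let score = 0;
    for (const term of terms) {
      score += (chunk.tf[term] ?? 0) * (index.idf[term] ?? 0);
    }
    return { chunk, score };
  });
  return scored.sort((a, b) => b.score - a.score).slice(0, k);
}

// Assemble the augmented prompt in the same shape as shown above.
function buildPrompt(question, hits) {
  const context = hits
    .map((h, i) => `[Source ${i + 1}: ${h.chunk.source}]\n${h.chunk.text}`)
    .join("\n---\n");
  return `${context}\n---\nQuestion: ${question}`;
}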
Why TF-IDF instead of embeddings?
I chose TF-IDF (Term Frequency–Inverse Document Frequency) mainly for simplicity, transparency, and maintainability. Many RAG systems use embeddings, a technique that converts text into a list of numbers so that similar meanings end up close together in a mathematical “space”. Embeddings can give better retrieval quality than TF-IDF but require an extra model and more computational resources.
TF-IDF is much simpler: it is a scoring method that measures how important a word is to a particular document. Words that appear often in one chunk but rarely elsewhere score higher, making them useful signals for matching a question to relevant content. For a personal website corpus, this is often good enough, easy to understand, and lightweight to run.
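To make that concrete, assume the classic weighting tf × log(N/df): a word that appears 5 times in one chunk, in a corpus of 200 chunks of which only 2 contain it, gets a weight of 5 × ln(200/2) ≈ 23, while an equally frequent but common word found in 150 of the 200 chunks scores only 5 × ln(200/150) ≈ 1.4.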
The need for GPU support
One problem with this experiment is that it requires WebGPU. If a browser does not support it, users will see a warning instead of the chat flow. On my Ubuntu system, a Lenovo laptop with an Intel GPU, Chrome didn't get WebGPU running properly because I was missing Vulkan drivers. WebGPU uses Vulkan as its GPU backend on Linux, so without the drivers installed the browser cannot initialise a GPU adapter and WebLLM never loads. The fix was to install the Mesa Vulkan driver for my Intel GPU:
sudo apt install mesa-vulkan-drivers
For some reason, Chrome still didn’t work, so I had to force it to use the Vulkan drivers through the terminal:
google-chrome --enable-features=Vulkan,UseSkiaRenderer
Then, everything worked well.
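For reference, the kind of feature check that decides whether to show the chat or the warning only takes a few lines (a minimal sketch; the warning element is invented):

// Show the chat only if the browser exposes a usable WebGPU adapter.
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (!(await hasWebGPU())) {
  // Hypothetical warning element; WebLLM is never loaded in this case.
  document.querySelector("#chat-warning").hidden = false;
}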
Improving answer quality
The chat is available at arj.no/chat/. It works best for factual questions that are explicitly covered by the indexed content. After testing it, I found several cases where answers were poor:
- Questions like “Who is Alexander Jensenius?” returned raw lowercased chunk text, because the TF-IDF preprocessing strips punctuation and casing from the chunks.
- “What are the titles of his books?” retrieved a blog post about Springer library access rather than the about page, leading to a completely wrong answer.
- “What is he researching?” produced a vague, incomplete reply.
- “What are musical gestures?” triggered the LLM to copy a garbled PDF chunk rather than synthesise an answer.
The root cause was twofold. First, TF-IDF retrieval alone is not reliable enough for short personal-profile queries, and the about page can easily lose rank to adjacent posts when the query words also occur frequently in other content. Second, a 1B-parameter model will copy text verbatim rather than reformulate it if the system prompt does not explicitly ask it not to.
I asked CoPilot for help, and it updated the implementation in several ways:
- Hardcoded deterministic answers for well-known intents. For queries recognised as bio, work, research, books, and gesture, the chatbot returns a pre-written clean sentence immediately, without calling the LLM at all. This avoids both retrieval errors and model hallucination for the most predictable questions.
- Extended intent detection. The pattern matching that decides the “intent” of a query was broadened (e.g. /what.*he.*research/ instead of just /researching/) so follow-up questions like “what is he researching?” are caught correctly.
- TF-IDF score boost for the about page on book-related queries. A small scoring bonus was added whenever the query contains the word book or monograph and the candidate chunk comes from the about page. This prevents loosely related blog posts from outranking the actual biographical text.
- Stronger system and augmentation prompts. The LLM’s system message now explicitly says: “Write clean, complete sentences with proper capitalisation — never copy raw lowercase text from the context. Synthesise the relevant facts into a readable answer.” The same instruction is repeated in the per-query augmentation prompt so it is in the model’s immediate context window.
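To illustrate the first two of these changes, a minimal sketch of the intent short-circuit could look like this (the patterns and the placeholder answer are invented for the example, not my exact code):

// Pre-written answers for predictable questions, checked before any
// retrieval or LLM call, so these intents can never hallucinate.
const cannedAnswers = [
  {
    intent: "research",
    patterns: [/what.*he.*research/i, /researching/i],
    answer: "A pre-written, properly capitalised sentence about the research.",
  },
];

function matchIntent(question) {
  for (const entry of cannedAnswers) {
    if (entry.patterns.some((p) => p.test(question))) return entry.answer;
  }
  return null; // fall through to RAG + the LLM
}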
These are cheap to apply: no re-indexing, no extra model, no new infrastructure. For a personal site with predictable common questions, adding a small set of hardcoded answers to complement RAG works well.
Lessons learned
It has been interesting to implement a small, local LLM-based system. It is far from perfect, but going through the steps (albeit with lots of help from the much more powerful CoPilot agents) has given me a much better understanding of how to develop such systems myself.
It would require some tweaking to turn this into a genuinely helpful tool, but, on the positive side, I would know what it is doing and which sources it bases its answers on.
Thanks to CoPilot for helping implement this setup!
