Retrieval-Augmented Generation (RAG)¶
Small local models know general English. They do not know your game's lore, your product's changelog, or the contents of a customer's ticket history. RAG is the standard trick for fixing that: before the model answers, find the handful of most-relevant chunks of your data, and paste them into the prompt. This page explains how Tryll's RAG works end to end and when to use it.
The moving parts¶
flowchart LR
subgraph Session["Session"]
ESS["Embedded string storage<br>(per-session, named)"]
end
subgraph Agent["Agent turn"]
U["User message"] --> R["Retrieve node<br>embed + search"]
R -- "chunks" --> KC["Knowledge on the turn"]
KC --> P["Projection<br>weaves chunks into prompt"]
P --> G["Generate node"]
end
R -. "reads" .-> ESS
Tryll splits RAG into two phases:
- Index once. Your data is embedded and stored in an embedded string storage — a named, per-session HNSW index over the embeddings.
- Retrieve every turn. The `Retrieve` node embeds the current user message, finds the top-K closest chunks, and attaches them to the current turn. The downstream `Generate` node's projection picks them up and rewrites the user turn to include them.
Nothing else changes. There is no special "RAG model", no separate
server. A RAG agent is an agent with a Retrieve node in front of a
Generate node.
What an embedded string storage actually stores¶
Two things:
- a records file (`*.kb.json`) — an array of `{ id, text, metadata? }` entries, each a chunk you want retrievable;
- optionally an on-disk index file (`*.usearch`) — the HNSW graph for fast top-K lookup.
The index is rebuilt from scratch when it is missing, older than the records file, or for a different embedding model. When it is present and fresh, storage creation is near-instant.
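The freshness rule is simple enough to express directly. Here is a minimal sketch of the same decision logic in Python — illustrative only, not Tryll's implementation; the sidecar file recording the embedding model is an assumption made for the example:

```python
import json
import os

def needs_rebuild(index_path: str, records_path: str, embedding_model: str) -> bool:
    """Decide whether the HNSW index must be re-embedded and rebuilt.

    Mirrors the three conditions described above: the index file is missing,
    it is older than the records file, or it was built with a different
    embedding model. (How Tryll stores the model name alongside the index is
    not shown here; a hypothetical sidecar JSON file stands in for it.)
    """
    if not os.path.exists(index_path):
        return True                                    # missing index
    if os.path.getmtime(index_path) < os.path.getmtime(records_path):
        return True                                    # stale: records changed after the index was built
    meta_path = index_path + ".meta.json"              # hypothetical sidecar with the model name
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            if json.load(f).get("embedding_model") != embedding_model:
                return True                            # built for a different embedding model
    return False
```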
You can create a storage two ways:
- Path A — config on disk. Ship a tiny `*.json` with pointers to a records file and an optional index file. Use this when the data lives alongside the server. See Embedded String Storage.
- Path B — inline strings. Pass a `strings[]` array and an `embedding_model` on the wire. Use this when the data is produced at runtime. Fine for hundreds or low thousands of chunks; for bigger corpora, go with Path A and avoid re-embedding on every reconnect.
Cosine distance, top-K, and the threshold¶
The search algorithm is HNSW with cosine distance: for two unit
vectors, the distance is 1 − cosine_similarity, so:
| Distance | Similarity |
|---|---|
| 0.0 | identical direction |
| 0.2 | quite close |
| 0.5 | loosely related |
| 1.0 | orthogonal (unrelated) |
| 2.0 | opposite direction |
The Retrieve node returns the top k closest chunks. Setting a
threshold drops anything with distance above it — i.e., the
threshold is an upper bound on distance (lower bound on similarity).
If no chunks survive, the node exits via not_found instead of
found. Your graph decides what to do in that case — the most common
pattern is to still route to Generate and let the template render
gracefully with no chunks (empty {{#knowledge}} sections collapse
to nothing, leaving only system_prompt and user_message).
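To make the distance semantics concrete, here is a small self-contained sketch of top-K retrieval with a threshold cut, written in plain NumPy rather than against the HNSW index — the behaviour is what matters, not the data structure:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray,
             k: int = 3, threshold: float = 0.5) -> list[tuple[int, float]]:
    """Return (chunk_index, distance) pairs for the k nearest chunks under
    cosine distance, dropping anything with distance above `threshold`.
    Brute force stands in for the HNSW search."""
    # Normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    distances = 1.0 - c @ q                       # cosine distance = 1 - cosine similarity
    order = np.argsort(distances)[:k]             # top-K closest
    hits = [(int(i), float(distances[i])) for i in order
            if distances[i] <= threshold]         # threshold is an upper bound on distance
    return hits                                   # an empty list corresponds to the not_found exit
```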
Where the chunks land in the prompt¶
Tryll does not fuse retrieved text into the system prompt and hope.
You write a Mustache template on the
downstream Generate node's template param and pick a placement.
The projection stage builds a context from the current turn's data:
| Variable | Contents |
|---|---|
| `{{#knowledge}}…{{/knowledge}}` | Iterates over every retriever's result block. Inside: `{{name}}` (source label) and `{{#chunks}}…{{/chunks}}` (each chunk's `{{id}}`, `{{text}}`, `{{distance}}`). |
| `{{#knowledge_<source>}}…{{/knowledge_<source>}}` | Direct per-retriever iteration — e.g. `{{#knowledge_rag}}`. |
| `{{user_message}}` | The raw user message (only meaningful for `in_place_of_user` placement). |
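As a concrete illustration of the context shape, the sketch below renders a knowledge template with the `chevron` Mustache package standing in for Tryll's projection. The nested structure (one block per retriever, each with a name and its chunks) follows the table above; the source name `rag` and the lore snippets are made up for the example:

```python
import chevron  # any Mustache renderer works; chevron stands in for Tryll's projection

template = (
    "{{#knowledge}}{{name}}:\n"
    "{{#chunks}}- {{text}} (d={{distance}})\n{{/chunks}}"
    "{{/knowledge}}"
)

# Shape inferred from the variables table above: one block per retriever,
# each with a source label and its retrieved chunks.
context = {
    "knowledge": [
        {
            "name": "rag",
            "chunks": [
                {"id": "lore-017", "text": "The Ashen Gate only opens at dusk.", "distance": 0.21},
                {"id": "lore-042", "text": "Gatekeeper Irel guards the Ashen Gate.", "distance": 0.34},
            ],
        }
    ],
    "user_message": "Who guards the Ashen Gate?",
}

print(chevron.render(template, context))
# rag:
# - The Ashen Gate only opens at dusk. (d=0.21)
# - Gatekeeper Irel guards the Ashen Gate. (d=0.34)
```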
The `placement` param on `GenerateNode` controls where the rendered text appears:

| Value | Effect |
|---|---|
| `before_user_as_system` | Rendered text becomes a system turn before the user message (recommended default). |
| `before_user_as_user` | Rendered text becomes a user turn before the user message. |
| `in_place_of_user` | Replaces the user turn entirely. Must include `{{user_message}}`. |
| `after_user_as_system` / `after_user_as_user` | Appended after the user message. |
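The placements are easiest to read as list surgery on the turn sequence. A conceptual sketch (not Tryll's code — the turn representation here is a plain role/content dict invented for the example):

```python
def apply_placement(history: list[dict], rendered: str, user_message: str,
                    placement: str = "before_user_as_system") -> list[dict]:
    """Build the final turn list sent to the model. `history` is the prior
    conversation, `rendered` is the Mustache output. Conceptual sketch only;
    the real projection lives inside the Generate node."""
    user_turn = {"role": "user", "content": user_message}
    if placement == "before_user_as_system":
        return history + [{"role": "system", "content": rendered}, user_turn]
    if placement == "before_user_as_user":
        return history + [{"role": "user", "content": rendered}, user_turn]
    if placement == "in_place_of_user":
        # The template must contain {{user_message}}, so `rendered`
        # already embeds the question and replaces the user turn.
        return history + [{"role": "user", "content": rendered}]
    if placement in ("after_user_as_system", "after_user_as_user"):
        role = "system" if placement.endswith("as_system") else "user"
        return history + [user_turn, {"role": role, "content": rendered}]
    raise ValueError(f"unknown placement: {placement}")
```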
Why this is worth caring about: the single biggest lever on RAG quality is how clearly the model can tell the difference between "here is context" and "here is the question". The template gives you full control over that framing, per-model, without touching any node code.
For step-by-step examples see the Use Mustache Templates how-to.
Chunking is your problem (mostly)¶
Tryll does not chunk your source data. A record in the records file is a chunk — one thing the index can return. How you split your source text into records is up to you, and it is the part of RAG with the biggest quality impact. Rough guidelines:
- Keep chunks small enough that the top-K you intend to retrieve fits well inside the model's context window, with room for the conversation history and the answer.
- Prefer chunks with natural semantic boundaries: a paragraph, a Markdown heading section, one FAQ entry, one lore snippet. Chunks that straddle topics confuse the retriever.
- Include stable context inside the chunk itself — a section title, an entity name — so the chunk is meaningful in isolation. The retriever cannot see the surrounding document.
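One workable baseline, sketched below: split a Markdown source on headings, keep the heading inside each chunk so it stands on its own, and emit records in the `{ id, text, metadata? }` shape the records file expects. The input file name, id scheme, and metadata fields are illustrative, not a prescribed format:

```python
import json
import re

def markdown_to_records(md_text: str, source: str) -> list[dict]:
    """Split a Markdown document into one record per heading section,
    keeping the heading with its body so each chunk is meaningful alone."""
    records = []
    # Split before each ATX heading so the heading stays with its section.
    sections = re.split(r"\n(?=#{1,6} )", md_text)
    for i, section in enumerate(sections):
        section = section.strip()
        if not section:
            continue
        title = section.splitlines()[0].lstrip("# ").strip()
        records.append({
            "id": f"{source}-{i:04d}",
            "text": section,                       # heading remains inside the chunk text
            "metadata": {"source": source, "title": title},
        })
    return records

with open("lore.md") as f:                         # illustrative input document
    records = markdown_to_records(f.read(), source="lore")
with open("lore.kb.json", "w") as f:               # the records file described above
    json.dump(records, f, ensure_ascii=False, indent=2)
```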
When RAG helps, and when it hurts¶
RAG helps when:
- the answer depends on facts the base model cannot know (private data, recent events, internal documents),
- the corpus is large enough that you cannot paste it whole into the prompt, but small enough that the right few chunks will usually be among the top K,
- you want to give the user a feeling of "the assistant has read the docs".
RAG hurts when:
- the question is general reasoning and the retrieved chunks are unrelated noise — the model will often try to use them anyway and produce a worse answer than it would have without them,
- your corpus is tiny (a few hundred lines total). It is cheaper to paste the whole thing into the system prompt than to embed and retrieve.
- you expect the model to synthesise across many chunks — with top-K=3 and a small model, synthesis quality will be limited.
A sensible default for a first RAG agent: `k=3`, a threshold in the 0.3–0.6 range (tune by watching the retriever log), `placement = before_user_as_system`, a template like `{{#knowledge}}{{name}}:\n{{#chunks}}- {{text}}\n{{/chunks}}{{/knowledge}}`, and a `system_prompt` that says "Use the context above to answer the user's question. If the context does not contain the answer, say so."
Edges and pitfalls¶
- An embedding-model mismatch between the wire request and the storage config fails with error 8002. Match the two, or leave one of them unset.
- Stale indexes are rebuilt. If you bump your records file, the next storage creation detects the mtime and re-embeds. No cache poisoning.
- Per-session ownership. Storages die with the session. For a stable corpus across reconnects, re-create from the same config on connect; the on-disk `.usearch` makes that cheap.
- Knowledge does not persist across turns. Each turn, only the most recently retrieved chunks are projected. Anything retrieved on earlier turns is not re-surfaced to the model.