Retrieval-Augmented Generation (RAG)¶
Small local models know general English. They do not know your game's lore, your product's changelog, or the contents of a customer's ticket history. RAG is the standard trick for fixing that: before the model answers, find the handful of most-relevant chunks of your data, and paste them into the prompt. This page explains how Tryll's RAG works end to end and when to use it.
The moving parts¶
flowchart LR
subgraph Session["Session"]
ESS["Embedded string storage<br>(per-session, named)"]
end
subgraph Agent["Agent turn"]
U["User message"] --> R["Retrieve node<br>embed + search"]
R -- "chunks" --> KC["Knowledge on the turn"]
KC --> P["Projection<br>weaves chunks into prompt"]
P --> G["Generate node"]
end
R -. "reads" .-> ESS
Tryll splits RAG into two phases:
- Index once. Your data is embedded and stored in an embedded string storage — a named, per-session HNSW index over the embeddings.
- Retrieve every turn. The `Retrieve` node embeds the current user message, finds the top-K closest chunks, and attaches them to the current turn. The downstream `Generate` node's projection picks them up and rewrites the user turn to include them.
Nothing else changes. There is no special "RAG model", no separate
server. A RAG agent is an agent with a Retrieve node in front of a
Generate node.
What an embedded string storage actually stores¶
Two things:
- a records file (`*.kb.json`) — an array of `{ id, text, metadata? }` entries, each a chunk you want retrievable;
- optionally an on-disk index file (`*.usearch`) — the HNSW graph for fast top-K lookup.
The index is rebuilt from scratch when it is missing, older than the records file, or for a different embedding model. When it is present and fresh, storage creation is near-instant.
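The freshness rule is simple enough to express directly. Here is a minimal sketch of the same decision logic in Python — illustrative only, not Tryll's implementation; the sidecar file recording the embedding model is an assumption made for the example:

```python
import json
import os

def needs_rebuild(index_path: str, records_path: str, embedding_model: str) -> bool:
    """Decide whether the HNSW index must be re-embedded and rebuilt.

    Mirrors the three conditions described above: the index file is missing,
    it is older than the records file, or it was built with a different
    embedding model. (How Tryll stores the model name alongside the index is
    not shown here; a hypothetical sidecar JSON file stands in for it.)
    """
    if not os.path.exists(index_path):
        return True                                    # missing index
    if os.path.getmtime(index_path) < os.path.getmtime(records_path):
        return True                                    # stale: records changed after the index was built
    meta_path = index_path + ".meta.json"              # hypothetical sidecar with the model name
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            if json.load(f).get("embedding_model") != embedding_model:
                return True                            # built for a different embedding model
    return False
```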
You can create a storage two ways:
- Path A — config on disk. Ship a tiny `*.json` with pointers to a records file and an optional index file. Use this when the data lives alongside the server. See Embedded String Storage.
- Path B — inline strings. Pass a `strings[]` array and an `embedding_model` on the wire. Use this when the data is produced at runtime. Fine for hundreds or low thousands of chunks; for bigger corpora, go with Path A and avoid re-embedding on every reconnect.
Cosine distance, top-K, and the threshold¶
The search algorithm is HNSW with cosine distance: for two unit
vectors, the distance is 1 − cosine_similarity, so:
| Distance | Similarity |
|---|---|
| 0.0 | identical direction |
| 0.2 | quite close |
| 0.5 | loosely related |
| 1.0 | orthogonal (unrelated) |
| 2.0 | opposite direction |
The Retrieve node returns the top k closest chunks. Setting a
threshold drops anything with distance above it — i.e., the
threshold is an upper bound on distance (lower bound on similarity).
If no chunks survive, the node exits via not_found instead of
found. Your graph decides what to do in that case — the most common
pattern is to still route to Generate and let the template render
gracefully with no chunks (empty {{#knowledge}} sections collapse
to nothing, leaving only system_prompt and user_message).
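To make the distance semantics concrete, here is a small self-contained sketch of top-K retrieval with a threshold cut, written in plain NumPy rather than against the HNSW index — the behaviour is what matters, not the data structure:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray,
             k: int = 3, threshold: float = 0.5) -> list[tuple[int, float]]:
    """Return (chunk_index, distance) pairs for the k nearest chunks under
    cosine distance, dropping anything with distance above `threshold`.
    Brute force stands in for the HNSW search."""
    # Normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    distances = 1.0 - c @ q                       # cosine distance = 1 - cosine similarity
    order = np.argsort(distances)[:k]             # top-K closest
    hits = [(int(i), float(distances[i])) for i in order
            if distances[i] <= threshold]         # threshold is an upper bound on distance
    return hits                                   # an empty list corresponds to the not_found exit
```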
Where the chunks land in the prompt¶
Tryll does not fuse retrieved text into the system prompt and hope.
You write a Mustache template on the
downstream Generate node's template param and pick a placement.
The projection stage builds a context from the current turn's data:
| Variable | Contents |
|---|---|
| `{{#knowledge}}…{{/knowledge}}` | Iterates over every retriever's result block. Inside: `{{name}}` (source label) and `{{#chunks}}…{{/chunks}}` (each chunk's `{{id}}`, `{{text}}`, `{{distance}}`). |
| `{{#knowledge_<source>}}…{{/knowledge_<source>}}` | Direct per-retriever iteration — e.g. `{{#knowledge_rag}}`. |
| `{{user_message}}` | The raw user message (only meaningful for `in_place_of_user` placement). |
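As a concrete illustration of the context shape, the sketch below renders a knowledge template with the `chevron` Mustache package standing in for Tryll's projection. The nested structure (one block per retriever, each with a name and its chunks) follows the table above; the source name `rag` and the lore snippets are made up for the example:

```python
import chevron  # any Mustache renderer works; chevron stands in for Tryll's projection

template = (
    "{{#knowledge}}{{name}}:\n"
    "{{#chunks}}- {{text}} (d={{distance}})\n{{/chunks}}"
    "{{/knowledge}}"
)

# Shape inferred from the variables table above: one block per retriever,
# each with a source label and its retrieved chunks.
context = {
    "knowledge": [
        {
            "name": "rag",
            "chunks": [
                {"id": "lore-017", "text": "The Ashen Gate only opens at dusk.", "distance": 0.21},
                {"id": "lore-042", "text": "Gatekeeper Irel guards the Ashen Gate.", "distance": 0.34},
            ],
        }
    ],
    "user_message": "Who guards the Ashen Gate?",
}

print(chevron.render(template, context))
# rag:
# - The Ashen Gate only opens at dusk. (d=0.21)
# - Gatekeeper Irel guards the Ashen Gate. (d=0.34)
```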
The `placement` param on `GenerateNode` controls where the rendered text appears:

| Value | Effect |
|---|---|
| `before_user_as_system` | Rendered text becomes a system turn before the user message (recommended default). |
| `before_user_as_user` | Rendered text becomes a user turn before the user message. |
| `in_place_of_user` | Replaces the user turn entirely. Must include `{{user_message}}`. |
| `after_user_as_system` / `after_user_as_user` | Appended after the user message. |
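The placements are easiest to read as list surgery on the turn sequence. A conceptual sketch (not Tryll's code — the turn representation here is a plain role/content dict invented for the example):

```python
def apply_placement(history: list[dict], rendered: str, user_message: str,
                    placement: str = "before_user_as_system") -> list[dict]:
    """Build the final turn list sent to the model. `history` is the prior
    conversation, `rendered` is the Mustache output. Conceptual sketch only;
    the real projection lives inside the Generate node."""
    user_turn = {"role": "user", "content": user_message}
    if placement == "before_user_as_system":
        return history + [{"role": "system", "content": rendered}, user_turn]
    if placement == "before_user_as_user":
        return history + [{"role": "user", "content": rendered}, user_turn]
    if placement == "in_place_of_user":
        # The template must contain {{user_message}}, so `rendered`
        # already embeds the question and replaces the user turn.
        return history + [{"role": "user", "content": rendered}]
    if placement in ("after_user_as_system", "after_user_as_user"):
        role = "system" if placement.endswith("as_system") else "user"
        return history + [user_turn, {"role": role, "content": rendered}]
    raise ValueError(f"unknown placement: {placement}")
```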
Why this is worth caring about: the single biggest lever on RAG quality is how clearly the model can tell the difference between "here is context" and "here is the question". The template gives you full control over that framing, per-model, without touching any node code.
For step-by-step examples see the Use Mustache Templates how-to.
Chunking is your problem (mostly)¶
Tryll does not chunk your source data. A record in the records file is a chunk — one thing the index can return. How you split your source text into records is up to you, and it is the part of RAG with the biggest quality impact. Rough guidelines:
- Keep chunks small enough that the top-K you intend to retrieve fits well inside the model's context window, with room for the conversation history and the answer.
- Prefer chunks with natural semantic boundaries: a paragraph, a Markdown heading section, one FAQ entry, one lore snippet. Chunks that straddle topics confuse the retriever.
- Include stable context inside the chunk itself — a section title, an entity name — so the chunk is meaningful in isolation. The retriever cannot see the surrounding document.
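One workable baseline, sketched below: split a Markdown source on headings, keep the heading inside each chunk so it stands on its own, and emit records in the `{ id, text, metadata? }` shape the records file expects. The input file name, id scheme, and metadata fields are illustrative, not a prescribed format:

```python
import json
import re

def markdown_to_records(md_text: str, source: str) -> list[dict]:
    """Split a Markdown document into one record per heading section,
    keeping the heading with its body so each chunk is meaningful alone."""
    records = []
    # Split before each ATX heading so the heading stays with its section.
    sections = re.split(r"\n(?=#{1,6} )", md_text)
    for i, section in enumerate(sections):
        section = section.strip()
        if not section:
            continue
        title = section.splitlines()[0].lstrip("# ").strip()
        records.append({
            "id": f"{source}-{i:04d}",
            "text": section,                       # heading remains inside the chunk text
            "metadata": {"source": source, "title": title},
        })
    return records

with open("lore.md") as f:                         # illustrative input document
    records = markdown_to_records(f.read(), source="lore")
with open("lore.kb.json", "w") as f:               # the records file described above
    json.dump(records, f, ensure_ascii=False, indent=2)
```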
When RAG helps, and when it hurts¶
RAG helps when:
- the answer depends on facts the base model cannot know (private data, recent events, internal documents),
- the corpus is large enough that you cannot paste it whole into the prompt, but small enough that the right few chunks will usually be among the top K,
- you want to give the user a feeling of "the assistant has read the docs".
RAG hurts when:
- the question is general reasoning and the retrieved chunks are unrelated noise — the model will often try to use them anyway and produce a worse answer than it would have without them,
- your corpus is tiny (a few hundred lines total). It is cheaper to paste the whole thing into the system prompt than to embed and retrieve.
- you expect the model to synthesise across many chunks — with top-K=3 and a small model, synthesis quality will be limited.
A sensible default for a first RAG agent: `k=3`, a threshold in the 0.3–0.6 range (tune by watching the retriever log), `placement = before_user_as_system`, a template like `{{#knowledge}}{{name}}:\n{{#chunks}}- {{text}}\n{{/chunks}}{{/knowledge}}`, and a `system_prompt` that says "Use the context above to answer the user's question. If the context does not contain the answer, say so."
Edges and pitfalls¶
- An embedding-model mismatch between the wire request and the storage config fails with error 8002. Match the two, or leave one of them unset.
- Stale indexes are rebuilt. If you bump your records file, the next storage creation detects the mtime and re-embeds. No cache poisoning.
- Per-session ownership. Storages die with the session. For a stable corpus across reconnects, re-create from the same config on connect; the on-disk `.usearch` makes that cheap.
- Knowledge does not persist across turns. Each turn, only the most recently retrieved chunks are projected. Anything retrieved on earlier turns is not re-surfaced to the model.