Models and Inference Engines¶
Tryll is a model host first and a workflow engine second. This page is about the host side: what counts as a "model", how the server talks to the hardware, what the KV cache is doing under your agents, and how to think about context size and quantisation when you pick one.
Two kinds of models¶
Tryll treats two kinds of models as first-class citizens:
- Language models — generate text. These are what Generate and ToolCall nodes use.
- Embedding models — turn text into fixed-length float vectors. Retrieve uses one to embed the current user message before searching an embedded string storage.
Every entry in the catalog (models.json) declares which kind it is.
Asking for an embedding model where a language model is expected — or
vice versa — fails fast with a clear error. See
Model Management for the full
catalog schema.
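The fail-fast kind check described above can be pictured with a small sketch. The ModelKind enum and require_kind helper below are hypothetical illustrations, not Tryll's actual API:

```python
from enum import Enum

class ModelKind(Enum):
    LANGUAGE = "language"
    EMBEDDING = "embedding"

def require_kind(entry: dict, expected: ModelKind) -> None:
    """Reject a catalog entry whose declared kind does not match
    what the node expects (illustrative helper, not Tryll's code)."""
    actual = ModelKind(entry["kind"])
    if actual is not expected:
        raise ValueError(
            f"{entry['name']}: expected a {expected.value} model, "
            f"got a {actual.value} model"
        )
```

A Retrieve node would demand ModelKind.EMBEDDING; Generate and ToolCall would demand ModelKind.LANGUAGE.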
Inference engines: how the model actually runs¶
An inference engine is the runtime that executes the model against your CPU or GPU. Tryll currently defines the following:
| Engine | Backend | Status |
|---|---|---|
| Mock | In-process stub | Used for tests; no real inference. |
| LlamaCpp | llama.cpp | Default. GGUF files; CPU + Vulkan / CUDA / ROCm GPU. |
| OnnxGenAI, WindowsML, OpenVino, TensorRtLlm | Reserved | Enum slots exist; implementations land in future releases. |
You pick the engine per session with
ConfigureSession. Changing it
affects only later CreateAgent calls; existing agents keep the
engine they were built with.
A model can appear in the catalog with multiple variants, each targeting a different engine:
```json
{
  "name": "Llama-3.2-3B-Instruct",
  "variants": [
    {
      "engine": "LlamaCpp",
      "source": "bartowski/Llama-3.2-3B-Instruct-GGUF",
      "file": "Llama-3.2-3B-Instruct-Q4_K_M.gguf",
      "kv_cache_type": "q8_0"
    },
    { "engine": "OnnxGenAI", "source": "microsoft/…", "file": "…" }
  ]
}
```
The server picks the variant that matches the session's engine at load time. If no variant matches, the load fails with an error code in the 4xxx range.
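Variant selection amounts to a first-match scan over the entry's variants. This sketch is illustrative (the pick_variant helper is hypothetical; field names follow the catalog snippet above, and the real server reports an error code in the 4xxx range rather than a Python exception):

```python
def pick_variant(model: dict, session_engine: str) -> dict:
    """Return the catalog variant matching the session's engine,
    or fail if the model has no variant for that engine."""
    for variant in model["variants"]:
        if variant["engine"] == session_engine:
            return variant
    raise LookupError(
        f"no variant of {model['name']} for engine {session_engine}"
    )

entry = {
    "name": "Llama-3.2-3B-Instruct",
    "variants": [
        {"engine": "LlamaCpp", "file": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"},
    ],
}
pick_variant(entry, "LlamaCpp")  # returns the LlamaCpp variant
```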
The KV cache: why generation is fast after the first token¶
A language model's prompt can be thousands of tokens long. Naively re-running the full prompt through the model on every new token would be unworkable. Instead, transformer models keep an intermediate per-token state — the key/value cache, or KV cache — that encodes "everything the model has seen so far". Adding one more token costs one more forward pass, not N.
Tryll keeps one KV cache per node, per agent. Each turn, the node compares the new prompt against the cached token sequence:
- Unchanged prefix is kept.
- Stale tail (tokens that differ) is evicted.
- New suffix is processed in batches.
So the cost of "remembering the conversation" is the cost of appending the new user turn's tokens — not the whole history — as long as you are not rewriting the past.
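The keep/evict/append rule above can be sketched in a few lines of Python. This is illustrative pseudologic over token-id lists, not Tryll's implementation:

```python
def split_prompt(cached: list, new: list):
    """Compare the cached token sequence with the new prompt and
    split it into: unchanged prefix (kept), stale cached tail
    (evicted), and new suffix (processed in batches)."""
    i = 0
    while i < len(cached) and i < len(new) and cached[i] == new[i]:
        i += 1
    kept = new[:i]        # unchanged prefix: KV entries reused as-is
    evicted = cached[i:]  # stale tail: dropped from the cache
    suffix = new[i:]      # new tokens: the only forward passes paid
    return kept, evicted, suffix

split_prompt([1, 2, 3, 9], [1, 2, 3, 4, 5])
# → ([1, 2, 3], [9], [4, 5])
```

An unchanged history means `evicted` is empty and `suffix` is just the new user turn, which is the cheap common case.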
KV cache dtype¶
Each llama.cpp variant can declare a kv_cache_type:
| Value | Bytes/element | Notes |
|---|---|---|
| f16 | 2 | Default when nothing is set on CPU-only builds. Highest quality. |
| q8_0 | ~1 | Tryll's default for configured variants. Indistinguishable quality for most chat use. |
| q4_0 | ~0.5 | Halves memory vs. q8_0; small quality hit. Useful on memory-constrained GPUs. |
The cache size dominates VRAM / RAM for long conversations. Moving
from f16 to q8_0 roughly halves it; from q8_0 to q4_0 halves
it again.
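A back-of-envelope calculator makes the halving concrete. The formula is the standard transformer KV footprint (keys plus values, per layer, per cached token); the model dimensions below are illustrative picks for a 3B-class model, not taken from any Tryll catalog entry:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> int:
    # Two tensors (K and V) per layer, each holding n_ctx rows of
    # n_kv_heads * head_dim elements.
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)

# Illustrative 3B-class dimensions at a 4096-token context:
f16  = kv_cache_bytes(28, 8, 128, 4096, 2)  # ≈ 448 MiB
q8_0 = kv_cache_bytes(28, 8, 128, 4096, 1)  # ≈ 224 MiB
```

Note that quantised cache types carry a small per-block scale overhead in practice, which is why the table says "~1" rather than exactly 1 byte per element.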
Context size and why 4k is not 4k¶
Every language model has a maximum context size — the total
number of tokens the model can attend to at once (prompt + generation
buffer). Tryll reserves a slice at the end for the generation itself
(generation_reserve), and uses the rest as the high water mark
for projection. Tokens beyond the high water mark get trimmed by
token-budget projection before the
model ever sees them.
So a model nominally advertised as "4k context" will, in practice, give you something like 3.5k of usable history — more than enough for most dialog, but worth knowing about when you design a RAG prompt.
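The arithmetic is simple enough to sketch; a generation_reserve of 512 is an assumed value here for illustration, not Tryll's documented default:

```python
def prompt_high_water_mark(context_size: int, generation_reserve: int) -> int:
    # Usable prompt budget: total context minus the slice reserved
    # for the generation itself.
    return context_size - generation_reserve

prompt_high_water_mark(4096, 512)  # → 3584, the "3.5k of usable history"
```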
Embedding models¶
Embedding models are much smaller and simpler: a stateless encoder that turns text into fixed-dimension vectors, with mean pooling to produce one vector per input. The server calls the model once per query and once per chunk at index-build time; that is the whole story.
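Mean pooling itself is trivial; here it is in plain Python as a sketch. The real encoder produces the per-token vectors with a neural network, which this skips entirely:

```python
def mean_pool(token_vectors: list) -> list:
    """Average per-token vectors elementwise to get one
    fixed-dimension embedding for the whole input."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

mean_pool([[1.0, 2.0], [3.0, 4.0]])  # → [2.0, 3.0]
```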
An embedding model is required whenever you create an
embedded string storage —
either sent on the wire (Path B) or declared in the
*.kb.json config (Path A). The two sides must match.
The loaded-model pool¶
Language models are expensive to load and take tens of MB to tens of GB in memory. The server loads each model once per process and reference-counts it:
- Pinned retention — set by LoadModelRequest. The model stays in memory until you explicitly unload it and no active contexts reference it.
- OnDemand retention (default) — set by the lazy GetLanguageModel path. The server evicts the model as soon as the last agent using it is destroyed. This happens automatically after every DestroyAgentRequest.
This pair lets you either nail a model in place ("this is the chat
model, always keep it warm") or let the pool breathe when a session
tears an agent down. See
Lifetime and Ownership → Models
for the full reference-counting story, including what happens when
UnloadModelRequest arrives while agents are still using the model.
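The two retention modes above can be sketched as a reference-counted pool. The LoadedModelPool class is hypothetical, not the server's actual implementation:

```python
class LoadedModelPool:
    """Sketch of reference-counted model retention (illustrative)."""

    def __init__(self):
        self._models = {}  # name -> {"refs": int, "pinned": bool}

    def acquire(self, name: str, pinned: bool = False) -> None:
        # A LoadModelRequest-style load passes pinned=True; the lazy
        # GetLanguageModel path uses the OnDemand default.
        entry = self._models.setdefault(name, {"refs": 0, "pinned": False})
        entry["refs"] += 1
        entry["pinned"] = entry["pinned"] or pinned

    def release(self, name: str) -> None:
        # Called when an agent referencing the model is destroyed.
        entry = self._models[name]
        entry["refs"] -= 1
        if entry["refs"] == 0 and not entry["pinned"]:
            del self._models[name]  # OnDemand eviction

    def unload(self, name: str) -> None:
        # Explicit unload clears the pin; eviction still waits for
        # the last active reference to go away.
        entry = self._models.get(name)
        if entry is not None:
            entry["pinned"] = False
            if entry["refs"] == 0:
                del self._models[name]
```

In this sketch, an unload while agents still hold references just clears the pin, and the memory is reclaimed on the final release — mirroring the behaviour the Lifetime and Ownership page describes.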
Downloading and the "Local" status¶
Models you did not bundle with the server can be pulled from Hugging
Face via DownloadModelRequest. The server streams
DownloadProgress frames and finishes with a
DownloadComplete(success=true). After that the model has status
Local and can be loaded. See
Model Management for the full
lifecycle and the distinction between Local (disk) and Loaded
(memory).
Edges and pitfalls¶
- No model swap mid-turn. The model a context points to is resolved at agent-creation time. Pinning or unloading a different model while an agent is running is safe; unloading the model a running agent depends on is not — use pinning if you need the guarantee.
- KV cache is per node, per agent. Two Generate nodes using the same model in the same graph still each pay their own KV cost. Budget memory accordingly.
- Embedding and language models are not interchangeable. Trying to point a Retrieve node at a language model (or a Generate node at an embedding model) fails validation with a clear error.
- Context-window claims are rough. Official context sizes assume no generation buffer; the high-water mark is what actually governs your prompt budget.