Models and Inference Engines¶
Tryll is a model host first and a workflow engine second. This page is about the host side: what counts as a "model", how the server talks to the hardware, what the KV cache is doing under your agents, and how to think about context size and quantisation when you pick one.
Two kinds of models¶
Tryll treats two kinds of models as first-class citizens:
- Language models — generate text. These are what Generate and ToolCall nodes use.
- Embedding models — turn text into fixed-length float vectors. Retrieve uses one to embed the current user message before searching an embedded string storage.
Every entry in the catalog (models.json) declares which kind it is.
Asking for an embedding model where a language model is expected — or
vice versa — fails fast with a clear error. See
Model Management for the full
catalog schema.
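The fail-fast kind check described above can be pictured with a small sketch. The ModelKind enum and require_kind helper below are hypothetical illustrations, not Tryll's actual API:

```python
from enum import Enum

class ModelKind(Enum):
    LANGUAGE = "language"
    EMBEDDING = "embedding"

def require_kind(entry: dict, expected: ModelKind) -> None:
    """Reject a catalog entry whose declared kind does not match
    what the node expects (illustrative helper, not Tryll's code)."""
    actual = ModelKind(entry["kind"])
    if actual is not expected:
        raise ValueError(
            f"{entry['name']}: expected a {expected.value} model, "
            f"got a {actual.value} model"
        )
```

A Retrieve node would demand ModelKind.EMBEDDING; Generate and ToolCall would demand ModelKind.LANGUAGE.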
Inference engines: how the model actually runs¶
An inference engine is the runtime that executes the model against your CPU or GPU. Tryll currently defines the following:
| Engine | Backend | Status |
|---|---|---|
| Mock | In-process stub | Used for tests; no real inference. |
| LlamaCpp | llama.cpp | Default. GGUF files; CPU + Vulkan / CUDA / ROCm GPU. |
| OnnxGenAI, WindowsML, OpenVino, TensorRtLlm | Reserved | Enum slots exist; implementations land in future releases. |
You pick the engine per session with
ConfigureSession. Changing it
affects only later CreateAgent calls; existing agents keep the
engine they were built with.
A model can appear in the catalog with multiple variants, each targeting a different engine:
```json
{
  "name": "Llama-3.2-3B-Instruct",
  "variants": [
    {
      "engine": "LlamaCpp",
      "source": "bartowski/Llama-3.2-3B-Instruct-GGUF",
      "file": "Llama-3.2-3B-Instruct-Q4_K_M.gguf",
      "kv_cache_type": "q8_0"
    },
    { "engine": "OnnxGenAI", "source": "microsoft/…", "file": "…" }
  ]
}
```
The server picks the variant that matches the session's engine at load time. If no variant matches, the load fails with an error code in the 4xxx range.
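Variant selection amounts to a first-match scan over the entry's variants. This sketch is illustrative (the pick_variant helper is hypothetical; field names follow the catalog snippet above, and the real server reports an error code in the 4xxx range rather than a Python exception):

```python
def pick_variant(model: dict, session_engine: str) -> dict:
    """Return the catalog variant matching the session's engine,
    or fail if the model has no variant for that engine."""
    for variant in model["variants"]:
        if variant["engine"] == session_engine:
            return variant
    raise LookupError(
        f"no variant of {model['name']} for engine {session_engine}"
    )

entry = {
    "name": "Llama-3.2-3B-Instruct",
    "variants": [
        {"engine": "LlamaCpp", "file": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"},
    ],
}
pick_variant(entry, "LlamaCpp")  # returns the LlamaCpp variant
```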
The KV cache: why generation is fast after the first token¶
A language model's prompt can be thousands of tokens long. Naively re-running the full prompt through the model on every new token would be unworkable. Instead, transformer models keep an intermediate per-token state — the key/value cache, or KV cache — that encodes "everything the model has seen so far". Adding one more token costs one more forward pass, not N.
Tryll keeps one KV cache per node, per agent. Each turn, the node compares the new prompt against the cached token sequence:
- Unchanged prefix is kept.
- Stale tail (tokens that differ) is evicted.
- New suffix is processed in batches.
So the cost of "remembering the conversation" is the cost of appending the new user turn's tokens — not the whole history — as long as you are not rewriting the past.
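The keep/evict/append rule above can be sketched in a few lines of Python. This is illustrative pseudologic over token-id lists, not Tryll's implementation:

```python
def split_prompt(cached: list, new: list):
    """Compare the cached token sequence with the new prompt and
    split it into: unchanged prefix (kept), stale cached tail
    (evicted), and new suffix (processed in batches)."""
    i = 0
    while i < len(cached) and i < len(new) and cached[i] == new[i]:
        i += 1
    kept = new[:i]        # unchanged prefix: KV entries reused as-is
    evicted = cached[i:]  # stale tail: dropped from the cache
    suffix = new[i:]      # new tokens: the only forward passes paid
    return kept, evicted, suffix

split_prompt([1, 2, 3, 9], [1, 2, 3, 4, 5])
# → ([1, 2, 3], [9], [4, 5])
```

An unchanged history means `evicted` is empty and `suffix` is just the new user turn, which is the cheap common case.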
KV cache dtype¶
Each llama.cpp variant can declare a kv_cache_type:
| Value | Bytes/element | Notes |
|---|---|---|
| f16 | 2 | Default when nothing is set on CPU-only builds. Highest quality. |
| q8_0 | ~1 | Tryll's default for configured variants. Indistinguishable quality for most chat use. |
| q4_0 | ~0.5 | Halves memory vs. q8_0; small quality hit. Useful on memory-constrained GPUs. |
The cache size dominates VRAM / RAM for long conversations. Moving
from f16 to q8_0 roughly halves it; from q8_0 to q4_0 halves
it again.
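A back-of-envelope calculator makes the halving concrete. The formula is the standard transformer KV footprint (keys plus values, per layer, per cached token); the model dimensions below are illustrative picks for a 3B-class model, not taken from any Tryll catalog entry:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> int:
    # Two tensors (K and V) per layer, each holding n_ctx rows of
    # n_kv_heads * head_dim elements.
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)

# Illustrative 3B-class dimensions at a 4096-token context:
f16  = kv_cache_bytes(28, 8, 128, 4096, 2)  # ≈ 448 MiB
q8_0 = kv_cache_bytes(28, 8, 128, 4096, 1)  # ≈ 224 MiB
```

Note that quantised cache types carry a small per-block scale overhead in practice, which is why the table says "~1" rather than exactly 1 byte per element.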
Context size and why 4k is not 4k¶
Every language model has a maximum context size — the total
number of tokens the model can attend to at once (prompt + generation
buffer). Tryll reserves a slice at the end for the generation itself
(generation_reserve), and uses the rest as the high water mark
for projection. Tokens beyond the high water mark get trimmed by
token-budget projection before the
model ever sees them.
So a model nominally advertised as "4k context" will, in practice, give you something like 3.5k of usable history — more than enough for most dialog, but worth knowing about when you design a RAG prompt.
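The arithmetic is simple enough to sketch; a generation_reserve of 512 is an assumed value here for illustration, not Tryll's documented default:

```python
def prompt_high_water_mark(context_size: int, generation_reserve: int) -> int:
    # Usable prompt budget: total context minus the slice reserved
    # for the generation itself.
    return context_size - generation_reserve

prompt_high_water_mark(4096, 512)  # → 3584, the "3.5k of usable history"
```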
Embedding models¶
Embedding models are much smaller and simpler: a stateless encoder that turns text into fixed-dimension vectors, with mean pooling to produce one vector per input. The server calls the model once per query and once per chunk at index-build time; that is the whole story.
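Mean pooling itself is trivial; here it is in plain Python as a sketch. The real encoder produces the per-token vectors with a neural network, which this skips entirely:

```python
def mean_pool(token_vectors: list) -> list:
    """Average per-token vectors elementwise to get one
    fixed-dimension embedding for the whole input."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

mean_pool([[1.0, 2.0], [3.0, 4.0]])  # → [2.0, 3.0]
```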
An embedding model is required whenever you create an
embedded string storage —
either sent on the wire (Path B) or declared in the
*.kb.json config (Path A). The two sides must match.
The loaded-model pool¶
Language models are expensive to load and take tens of MB to tens of GB in memory. The server loads each model once per process and reference-counts it:
- Pinned retention — set by LoadModelRequest. The model stays in memory until you explicitly unload it and no active contexts reference it.
- OnDemand retention (default) — set by the lazy GetLanguageModel path. The server evicts the model as soon as the last agent using it is destroyed. This happens automatically after every DestroyAgentRequest.
This pair lets you either nail a model in place ("this is the chat
model, always keep it warm") or let the pool breathe when a session
tears an agent down. See
Lifetime and Ownership → Models
for the full reference-counting story, including what happens when
UnloadModelRequest arrives while agents are still using the model.
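The two retention modes above can be sketched as a reference-counted pool. The LoadedModelPool class is hypothetical, not the server's actual implementation:

```python
class LoadedModelPool:
    """Sketch of reference-counted model retention (illustrative)."""

    def __init__(self):
        self._models = {}  # name -> {"refs": int, "pinned": bool}

    def acquire(self, name: str, pinned: bool = False) -> None:
        # A LoadModelRequest-style load passes pinned=True; the lazy
        # GetLanguageModel path uses the OnDemand default.
        entry = self._models.setdefault(name, {"refs": 0, "pinned": False})
        entry["refs"] += 1
        entry["pinned"] = entry["pinned"] or pinned

    def release(self, name: str) -> None:
        # Called when an agent referencing the model is destroyed.
        entry = self._models[name]
        entry["refs"] -= 1
        if entry["refs"] == 0 and not entry["pinned"]:
            del self._models[name]  # OnDemand eviction

    def unload(self, name: str) -> None:
        # Explicit unload clears the pin; eviction still waits for
        # the last active reference to go away.
        entry = self._models.get(name)
        if entry is not None:
            entry["pinned"] = False
            if entry["refs"] == 0:
                del self._models[name]
```

In this sketch, an unload while agents still hold references just clears the pin, and the memory is reclaimed on the final release — mirroring the behaviour the Lifetime and Ownership page describes.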
Downloading and the "Local" status¶
Models you did not bundle with the server can be pulled from Hugging
Face via DownloadModelRequest. The server streams
DownloadProgress frames and finishes with a
DownloadComplete(success=true). After that the model has status
Local and can be loaded. See
Model Management for the full
lifecycle and the distinction between Local (disk) and Loaded
(memory).
Edges and pitfalls¶
- No model swap mid-turn. The model a context points to is resolved at agent-creation time. Pinning or unloading a different model while an agent is running is safe; unloading the model a running agent depends on is not — use pinning if you need the guarantee.
- KV cache is per node, per agent. Two Generate nodes using the same model in the same graph still each pay their own KV cost. Budget memory accordingly.
- Embedding and language models are not interchangeable. Trying to point a Retrieve node at a language model (or a Generate node at an embedding model) fails validation with a clear error.
- Context-window claims are rough. Official context sizes assume no generation buffer; the high-water mark is what actually governs your prompt budget.