Model Management¶
The Tryll server treats models as first-class, named resources. Every
model it can load is declared in a single catalog file, models.json,
that the server reads at startup. Clients discover, download, load, and
unload models over the wire protocol; the server
handles all disk I/O, HuggingFace downloads, and backend-specific
loading.
This page is the reference for:

- The models.json schema that tells the server what exists and how to find it.
- The model lifecycle wire-protocol flow: ListModels → DownloadModel → LoadModel / UnloadModel.
- The per-model status enum.
- The retention modes that control memory occupancy.
Server-wide model settings (download directory, catalog path) live in
server-config.json.
Development shortcut
During prototyping you can skip the explicit ListModels →
DownloadModel steps by enabling allow_auto_model_downloading on
the session — CreateAgent then downloads missing models automatically.
See Enable Auto Model Downloading.
models.json¶
Default location: data/models.json, configurable via
models_catalog_path in server-config.json. The server parses this
file once at startup.
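For orientation, the relevant server-config.json keys might look like this (a sketch: `data/models.json` is the documented default catalog path, while the download-directory value shown here is purely illustrative):

```json
{
  "models_catalog_path": "data/models.json",
  "models_download_dir": "data/models"
}
```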
Top-level structure¶
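The catalog is a single JSON object with one top-level key, `models`, holding an array of model descriptors. A skeleton inferred from the examples further down (the placeholder values stand in for the fields documented below):

```json
{
  "models": [
    {
      "name": "...",
      "default_sampling": {},
      "variants": []
    }
  ]
}
```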
Model descriptor fields¶
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Human-readable model name. This is the identifier used everywhere on the wire: `CreateAgentRequest.default_model_name`, `DownloadModelRequest.model_name`, `LoadModelRequest.model_name`. Must be unique within the catalog. |
| `default_sampling` | object | No | Default sampling params for this model. Node-level `NodeParam` overrides take precedence. Absent fields fall back to the built-in defaults listed below. |
| `variants` | array | Yes | One entry per supported inference engine. |
default_sampling fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| `temperature` | float | `0.7` | Sampling temperature; higher = more random. |
| `top_p` | float | `1.0` | Nucleus sampling threshold. `1.0` disables. |
| `top_k` | int | `0` | Top-K sampling cut-off. `0` disables. |
| `min_p` | float | `0.05` | Min-P filter. `0.0` disables. |
| `repeat_penalty` | float | `1.0` | Penalty for repeating tokens. `1.0` disables. |
| `presence_penalty` | float | `0.0` | OpenAI-style presence penalty. `0.0` disables. |
| `frequency_penalty` | float | `0.0` | OpenAI-style frequency penalty. `0.0` disables. |
| `max_tokens` | int | `2048` | Maximum generated tokens per turn. |
| `seed` | uint32 | `0` | RNG seed. `0` selects a random seed per turn. |
variants entry fields¶
| Field | Type | Required | Description |
|---|---|---|---|
| `engine` | string | Yes | Which inference engine this variant targets. Currently supported: `"llama-cpp"`. Must match a registered engine; variants for other engines are silently ignored when the session is configured for a different engine. |
| `local_path` | string | No | Absolute or server-relative path to a directory containing the model file(s) on disk. Combined with `files[0]` to resolve the full path. Takes priority over `path` and `downloads.json`. Use this for user-supplied models. |
| `path` | string | No | Legacy relative path under `models_download_dir`. Fallback when neither `local_path` nor a `downloads.json` entry resolves. |
| `huggingface_repo` | string | No | HuggingFace repository slug (`"owner/repo"`). Required for downloads. Empty means the model cannot be downloaded over the wire; it must already be on disk via `local_path`. |
| `files` | array of string | No | Filenames for this variant. For HuggingFace, these are downloaded from `huggingface_repo`; for local use, resolved relative to `local_path`. Empty disables both download and resolution. |
| `context_size` | int | No | Override the engine's default context window (in tokens). `0` or absent uses the engine default (currently 8192 for llama.cpp). |
| `kv_cache_type` | string | No | llama.cpp KV-cache dtype: `"f16"`, `"q8_0"`, or `"q4_0"`. Default `"q8_0"` (≈half the KV VRAM vs F16). Ignored by other engines. |
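A variant that overrides both knobs might look like this (repo, filename, and the 16384-token window are illustrative values, not recommendations):

```json
{
  "engine": "llama-cpp",
  "huggingface_repo": "bartowski/Llama-3.2-3B-Instruct-GGUF",
  "files": ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"],
  "context_size": 16384,
  "kv_cache_type": "q8_0"
}
```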
Minimal example — user-provided model¶
<!-- sample only; not a committed file -->
```json
{
  "models": [
    {
      "name": "My Local Model",
      "variants": [
        {
          "engine": "llama-cpp",
          "local_path": "C:/models",
          "files": ["my-model.gguf"]
        }
      ]
    }
  ]
}
```
Downloadable example — HuggingFace + tuned sampling¶
<!-- sample only; not a committed file -->
```json
{
  "models": [
    {
      "name": "Llama 3.2 3B Instruct (Q4_K_M)",
      "default_sampling": {
        "temperature": 0.6,
        "top_p": 0.9,
        "top_k": 50,
        "min_p": 0.05,
        "repeat_penalty": 1.2
      },
      "variants": [
        {
          "engine": "llama-cpp",
          "huggingface_repo": "bartowski/Llama-3.2-3B-Instruct-GGUF",
          "files": ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"]
        }
      ]
    }
  ]
}
```
Embedding models¶
Models with "purpose": "embedding" are declared the same way and
referenced by name from
embedded string storages and from
Retrieve node params. Their sampling fields are
ignored; only the file resolution and engine fields matter.
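For illustration, a catalog entry for an embedding model might look like this (the name and filename are hypothetical):

```json
{
  "name": "My Embedding Model",
  "purpose": "embedding",
  "variants": [
    {
      "engine": "llama-cpp",
      "local_path": "C:/models",
      "files": ["my-embedding-model.gguf"]
    }
  ]
}
```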
Lifecycle¶
The wire-protocol flow for putting a model under a running agent:
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: ListModelsRequest
    S-->>C: ListModelsResponse([ModelInfo])
    alt status == Absent
        C->>S: DownloadModelRequest(name)
        S-->>C: DownloadProgress × N
        S-->>C: DownloadComplete(success=true)
    end
    C->>S: LoadModelRequest(name)
    S-->>C: LoadModelResponse
    C->>S: CreateAgentRequest(default_model_name=name, ...)
    S-->>C: CreateAgentResponse(agent_id)
    note over C,S: Agent runs turns...
    C->>S: UnloadModelRequest(name)
    S-->>C: Ack
```
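The same flow through the Python bindings listed under Client bindings below. This is a sketch: the method names come from client.py as referenced there, but the constructor arguments, return shapes, blocking behaviour, and the `create_agent` helper are assumptions for illustration.

```python
from tryll_client import TryllClient

client = TryllClient("localhost:7777")  # address is illustrative

# Discover what the catalog declares.
models = client.list_models()
target = next(m for m in models if m.name == "Llama 3.2 3B Instruct (Q4_K_M)")

# Download only if the files are not on disk yet (status 0 == Absent).
if target.status == 0:
    client.download_model(target.name)  # assumed to block until DownloadComplete

# Pin the model, then create an agent against it.
client.load_model(target.name)
agent_id = client.create_agent(default_model_name=target.name)  # hypothetical helper

# ... run turns ...

# Release the pin; the server frees the model once no agent still uses it.
client.unload_model(target.name)
```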
Key rules:

- You do not have to call `LoadModelRequest` explicitly. An agent whose graph references a not-yet-loaded model triggers an on-demand load at `CreateAgent` time.
- `LoadModelRequest` pins the model. The server keeps it resident until `UnloadModelRequest`, regardless of agent count.
- `UnloadModelRequest` on a shared model is polite: if any agent still uses the model, it stays resident and is freed only when the last user goes away.
See How to pin and unpin models for the end-to-end walkthrough.
Model status¶
The ModelInfo.status field in ListModelsResponse reports where a
model is right now:
| Value | Int | Meaning |
|---|---|---|
| `Absent` | 0 | Known in catalog but not on disk and not downloading. |
| `Local` | 1 | User-provided path resolved (`local_path`); on disk; can be loaded. |
| `Downloading` | 2 | Transfer in progress. `DownloadProgress` frames are flowing. |
| `Loaded` | 3 | Currently resident in memory. |
| `Downloaded` | 4 | On disk from a HuggingFace download; can be loaded. |
Ordinals match the ModelStatus enum on the
wire protocol and ETryllModelStatus in the
Unreal client.
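For readable client-side checks you might mirror the ordinals in a small enum; a convenience sketch, not part of tryll_client:

```python
from enum import IntEnum

class ModelStatus(IntEnum):
    """Mirrors the wire-protocol ModelStatus ordinals from the table above."""
    ABSENT = 0
    LOCAL = 1
    DOWNLOADING = 2
    LOADED = 3
    DOWNLOADED = 4

# e.g. target.status == ModelStatus.ABSENT instead of a bare 0
```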
Retention modes¶
Tryll decides what to keep in memory based on a simple rule:
| Retention | How it is triggered | When the model is freed |
|---|---|---|
| Pinned | `LoadModelRequest` | On `UnloadModelRequest` (delayed until no agent uses it). |
| OnDemand | Implicit: an agent's graph references the model with no prior `LoadModelRequest`. | When the last agent that uses it is destroyed. |
Use Pinned for the single hot model on the machine — the latency cost of the first load is paid once at startup. Use OnDemand for secondary models where memory matters more than first-turn latency.
These labels match the client-facing names used in the glossary (pinned retention, on-demand retention).
See Lifetime and Ownership → Models for the reference-counting story that connects Pinned / OnDemand to what nodes inside an agent actually hold.
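In client terms, the two modes differ only in whether you call load_model up front; a sketch reusing the client object from the lifecycle example above (same assumed method names):

```python
# Pinned: pay the first-load latency once at startup; the model survives
# agent churn until an explicit unload_model call.
client.load_model("Llama 3.2 3B Instruct (Q4_K_M)")

# OnDemand: no explicit load. The model is loaded when the first agent
# referencing it is created, and freed when the last such agent is destroyed.
agent_id = client.create_agent(default_model_name="My Local Model")  # hypothetical helper
```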
Download record — downloads.json¶
The server maintains a downloads.json file inside
models_download_dir — its download ledger — to track completed
HuggingFace downloads. This file is server-managed; do not edit
it by hand. Its contents feed ModelInfo.status == Downloaded at
startup so that a previously downloaded model can be loaded without
re-downloading.
Structure (for reference only):
```json
{
  "entries": [
    {
      "name": "Llama 3.2 3B Instruct (Q4_K_M)",
      "engine": 1,
      "folder": "bartowski--Llama-3.2-3B-Instruct-GGUF",
      "files": ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"],
      "total_bytes": 2019377152
    }
  ]
}
```
`engine` is the `InferenceEngine` enum ordinal on the wire protocol (1 = LlamaCpp).
Errors¶
| Code | Cause |
|---|---|
| `6001` | Download failed (HTTP error, interrupted transfer, or checksum mismatch). |
| `6002` | No variant for the active engine in the catalog. |
| `6003` | Disk full in the download directory. |
| `6004` | A download for this model is already active. |
Client bindings¶
- C++: `Tryll::TryllClient::ListModels` / `DownloadModel` / `LoadModel` / `UnloadModel` (TryllClient.h)
- Python: `tryll_client.TryllClient.list_models` / `download_model` / `load_model` / `unload_model` (client.py)
- Unreal: `UTryllSubsystem::RequestListModels` / `RequestDownloadModel` / `RequestLoadModel` / `RequestUnloadModel` (TryllSubsystem.h)
Related¶
- Server Configuration — server-wide settings that govern model storage.
- Concept: Models and inference engines
- How to use your own local model
- How to pin and unpin models
- Agent Parameters — how agents reference models.
- Glossary