Model Management¶

The Tryll server treats models as first-class, named resources. Every model it can load is declared in a single catalog file, models.json, that the server reads at startup. Clients discover, download, load, and unload models over the wire protocol; the server handles all disk I/O, HuggingFace downloads, and backend-specific loading.

This page is the reference for:

The models.json schema that tells the server what exists and how to find it.
The model lifecycle wire-protocol flow: ListModels → DownloadModel → LoadModel / UnloadModel.
The status enum reported per-model.
The retention modes that control memory occupancy.

Server-wide model settings (download directory, catalog path) live in server-config.json.

Development shortcut

During prototyping you can skip the explicit ListModels → DownloadModel steps by enabling allow_auto_model_downloading on the session — CreateAgent then downloads missing models automatically. See Enable Auto Model Downloading.

`models.json`¶

Default location: data/models.json, configurable via models_catalog_path in server-config.json. The server parses this file once at startup.

Top-level structure¶

<!-- sample only; not a committed file -->
{
    "models": [
        { ... },
        { ... }
    ]
}

Model descriptor fields¶

Field	Type	Required	Description
`name`	string	Yes	Human-readable model name. This is the identifier used everywhere on the wire: `CreateAgentRequest.default_model_name`, `DownloadModelRequest.model_name`, `LoadModelRequest.model_name`. Must be unique within the catalog.
`default_sampling`	object	No	Default sampling parameters for this model. Node-level `NodeParam` overrides take precedence over these values. Fields absent from this object fall back to the built-in defaults listed below.
`variants`	array	Yes	One entry per supported inference engine.

`default_sampling` fields¶

Field	Type	Default	Description
`temperature`	float	`0.7`	Sampling temperature; higher = more random.
`top_p`	float	`1.0`	Nucleus sampling threshold. `1.0` disables.
`top_k`	int	`0`	Top-K sampling cut-off. `0` disables.
`min_p`	float	`0.05`	Min-P filter. `0.0` disables.
`repeat_penalty`	float	`1.0`	Penalty for repeating tokens. `1.0` disables.
`presence_penalty`	float	`0.0`	OpenAI-style presence penalty. `0.0` disables.
`frequency_penalty`	float	`0.0`	OpenAI-style frequency penalty. `0.0` disables.
`max_tokens`	int	`2048`	Maximum generated tokens per turn.
`seed`	uint32	`0`	RNG seed. `0` selects a random seed per turn.
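
The precedence described above (built-in defaults, then the model's `default_sampling`, then node-level overrides) can be illustrated with a minimal sketch. The function name `resolve_sampling` and the use of plain dictionaries are assumptions made for illustration; the server's own merge logic is not shown in this page.

```python
# Built-in defaults, copied from the table above.
BUILTIN_SAMPLING = {
    "temperature": 0.7,
    "top_p": 1.0,
    "top_k": 0,
    "min_p": 0.05,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "max_tokens": 2048,
    "seed": 0,
}


def resolve_sampling(model_defaults=None, node_overrides=None):
    """Merge sampling settings; later layers win (hypothetical helper).

    Order: built-in defaults < model `default_sampling` < node-level overrides.
    """
    resolved = dict(BUILTIN_SAMPLING)
    resolved.update(model_defaults or {})
    resolved.update(node_overrides or {})
    return resolved


effective = resolve_sampling(
    model_defaults={"temperature": 0.6, "top_p": 0.9},
    node_overrides={"temperature": 0.2},
)
```

Here `effective["temperature"]` comes from the node override, `effective["top_p"]` from the model, and every other field from the built-in defaults.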

`variants` entry fields¶

Field	Type	Required	Description
`engine`	string	Yes	Which inference engine this variant targets. Currently supported: `"llama-cpp"`. Must match a registered engine. A variant whose engine differs from the one the session is configured for is silently ignored.
`local_path`	string	No	Absolute or server-relative path to a directory containing the model file(s) on disk. Combined with `files[0]` to resolve the full path. Takes priority over `path` and `downloads.json`. Use this for user-supplied models.
`path`	string	No	Legacy relative path under `models_download_dir`. Fallback when neither `local_path` nor a `downloads.json` entry resolves.
`huggingface_repo`	string	No	HuggingFace repository slug (`"owner/repo"`). Required for downloads. Empty means the model cannot be downloaded over the wire — it must already be on disk via `local_path`.
`files`	array of string	No	Filenames for this variant. For HuggingFace variants, these are downloaded from `huggingface_repo`; for local variants, they are resolved relative to `local_path`. An empty array disables both download and resolution.
`context_size`	int	No	Override the engine's default context window (in tokens). `0` or absent uses the engine default — currently 8192 for `llama.cpp`.
`kv_cache_type`	string	No	`llama.cpp` KV-cache dtype: `"f16"`, `"q8_0"`, or `"q4_0"`. Default `"q8_0"` (≈half the KV VRAM vs F16). Ignored by other engines.
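
The resolution order implied by the table (`local_path` first, then a `downloads.json` entry, then the legacy `path`) can be sketched as follows. This is an illustration under stated assumptions, not the server's code: the helper name `resolve_model_file` is hypothetical, the ledger folder and the legacy `path` are both assumed to be directories under `models_download_dir`, and a candidate is assumed to "resolve" when the file exists on disk.

```python
from pathlib import Path
from typing import Optional


def resolve_model_file(variant: dict, download_dir: Path,
                       ledger_folder: Optional[str] = None) -> Optional[Path]:
    """Return the first existing candidate path for a variant, or None.

    `ledger_folder` stands in for the `folder` value of a matching
    downloads.json entry (assumed to live under `download_dir`).
    """
    files = variant.get("files") or []
    if not files:
        return None  # an empty `files` array disables resolution
    first = files[0]

    candidates = []
    if variant.get("local_path"):
        candidates.append(Path(variant["local_path"]) / first)   # highest priority
    if ledger_folder:
        candidates.append(download_dir / ledger_folder / first)  # downloaded copy
    if variant.get("path"):
        candidates.append(download_dir / variant["path"] / first)  # legacy fallback

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

A variant with no usable candidate returns `None`, which would correspond to a model that is not on disk.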

Minimal example — user-provided model¶

<!-- sample only; not a committed file -->
{
    "models": [
        {
            "name": "My Local Model",
            "variants": [
                {
                    "engine":     "llama-cpp",
                    "local_path": "C:/models",
                    "files":      ["my-model.gguf"]
                }
            ]
        }
    ]
}

Downloadable example — HuggingFace + tuned sampling¶

<!-- sample only; not a committed file -->
{
    "models": [
        {
            "name": "Llama 3.2 3B Instruct (Q4_K_M)",
            "default_sampling": {
                "temperature":    0.6,
                "top_p":          0.9,
                "top_k":          50,
                "min_p":          0.05,
                "repeat_penalty": 1.2
            },
            "variants": [
                {
                    "engine":           "llama-cpp",
                    "huggingface_repo": "bartowski/Llama-3.2-3B-Instruct-GGUF",
                    "files":            ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"]
                }
            ]
        }
    ]
}
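
Because the server parses the catalog only at startup, it can be useful to check the required fields before launching it. The following is a minimal sketch of such a check based on the field tables above; `validate_catalog` is a hypothetical helper, not part of Tryll, and it only covers uniqueness of `name`, the presence of `variants`, and the presence of `engine`.

```python
import json


def validate_catalog(text: str) -> list[str]:
    """Return a list of problems found in a models.json document (empty if none)."""
    problems = []
    catalog = json.loads(text)
    seen_names = set()
    for i, model in enumerate(catalog.get("models", [])):
        name = model.get("name")
        if not isinstance(name, str) or not name:
            problems.append(f"models[{i}]: missing required 'name'")
        elif name in seen_names:
            problems.append(f"models[{i}]: duplicate name {name!r}")
        else:
            seen_names.add(name)

        variants = model.get("variants")
        if not isinstance(variants, list):
            problems.append(f"models[{i}]: missing required 'variants'")
            continue
        for j, variant in enumerate(variants):
            if not variant.get("engine"):
                problems.append(f"models[{i}].variants[{j}]: missing required 'engine'")
    return problems
```

An empty result means the catalog passed these checks; it does not guarantee that the server will accept the file.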

Embedding models¶

Models with "purpose": "embedding" are declared the same way and referenced by name from embedded string storages and from Retrieve node params. Their sampling fields are ignored; only the file resolution and engine fields matter.

Lifecycle¶

The wire-protocol flow for putting a model under a running agent:

sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: ListModelsRequest
    S-->>C: ListModelsResponse([ModelInfo])
    alt status == Absent
        C->>S: DownloadModelRequest(name)
        S-->>C: DownloadProgress × N
        S-->>C: DownloadComplete(success=true)
    end
    C->>S: LoadModelRequest(name)
    S-->>C: LoadModelResponse
    C->>S: CreateAgentRequest(default_model_name=name, ...)
    S-->>C: CreateAgentResponse(agent_id)
    note over C,S: Agent runs turns...
    C->>S: UnloadModelRequest(name)
    S-->>C: Ack

Key rules:

You do not have to call LoadModelRequest explicitly. An agent whose graph references a not-yet-loaded model triggers on-demand load at CreateAgent time.
LoadModelRequest pins the model. The server keeps the model resident until UnloadModelRequest, regardless of agent count.
UnloadModelRequest on a shared model does not force an unload. If any agent still uses the model, it stays resident and is freed when the last agent using it goes away.

See How to pin and unpin models for the end-to-end walkthrough.
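
The flow and rules above can be sketched from the client side. The method names `list_models`, `download_model`, and `load_model` come from the Python binding listed under Client bindings; everything else is an assumption made for illustration: that `list_models()` returns objects with `name` and `status` attributes, that `download_model()` blocks until the download completes, and the helper name `ensure_model_pinned` itself.

```python
# Status ordinals as listed in the Model status table.
ABSENT, LOCAL, DOWNLOADING, LOADED, DOWNLOADED = range(5)


def ensure_model_pinned(client, name: str) -> None:
    """Download (if needed) and load a model by catalog name (hypothetical helper)."""
    models = {m.name: m for m in client.list_models()}
    if name not in models:
        raise KeyError(f"{name!r} is not in the server catalog")

    status = int(models[name].status)
    if status == LOADED:
        return  # already resident; nothing to do in this sketch
    if status == DOWNLOADING:
        # A second download request would be rejected while one is active.
        raise RuntimeError(f"{name!r} is still downloading")
    if status == ABSENT:
        client.download_model(name)  # assumed to block until the download completes
    client.load_model(name)  # LoadModelRequest pins the model
```

Any object exposing those three methods works with this helper, which keeps it easy to exercise against a stub before pointing it at a real server.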

Model status¶

The ModelInfo.status field in ListModelsResponse reports the current state of each model:

Value	Int	Meaning
`Absent`	0	Known in catalog but not on disk and not downloading.
`Local`	1	User-provided path resolved (`local_path`); on disk; can be loaded.
`Downloading`	2	Transfer in progress. `DownloadProgress` frames are flowing.
`Loaded`	3	Currently resident in memory.
`Downloaded`	4	On disk from a HuggingFace download; can be loaded.

Ordinals match the ModelStatus enum on the wire protocol and ETryllModelStatus in the Unreal client.
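
For client code that receives the raw integer, the table can be mirrored as an enum. This is a sketch: the class and helper names below are assumptions, and only the ordinals are taken from the table.

```python
from enum import IntEnum


class ModelStatus(IntEnum):
    """Ordinals copied from the status table above."""
    ABSENT = 0
    LOCAL = 1
    DOWNLOADING = 2
    LOADED = 3
    DOWNLOADED = 4


def is_loadable(status: int) -> bool:
    """True for the two states the table describes as 'can be loaded'."""
    return status in (ModelStatus.LOCAL, ModelStatus.DOWNLOADED)
```

Note that the ordinals are not a progress scale: `DOWNLOADED` (4) is numerically greater than `LOADED` (3), so comparisons such as `status >= LOADED` should be avoided in favor of explicit membership checks.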

Retention modes¶

Tryll decides how long to keep a model in memory based on how the model was loaded:

Retention	How it is triggered	When the model is freed
Pinned	`LoadModelRequest`	On `UnloadModelRequest` (delayed until no agent uses it).
OnDemand	Implicit: an agent's graph references the model, no prior `LoadModelRequest`	When the last agent that uses it is destroyed.

Use Pinned for the single hot model on the machine — the latency cost of the first load is paid once at startup. Use OnDemand for secondary models where memory matters more than first-turn latency.

These labels match the client-facing names used in the glossary (pinned retention, on-demand retention).

See Lifetime and Ownership → Models for the reference-counting story that connects Pinned / OnDemand to what nodes inside an agent actually hold.
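
The two retention modes can be modeled as a pin flag plus a set of agents using the model. The class below is a toy model of the rules in the table, written for illustration only; its name and methods are assumptions and it does not reflect the server's internal data structures.

```python
class ModelResidency:
    """Tracks whether one model should stay in memory (illustrative only)."""

    def __init__(self):
        self.pinned = False    # set by LoadModelRequest
        self.agents = set()    # agents whose graph references the model
        self.resident = False

    def load_request(self):
        """LoadModelRequest: pin the model and make it resident."""
        self.pinned = True
        self.resident = True

    def agent_created(self, agent_id):
        """An agent referencing the model triggers an on-demand load."""
        self.agents.add(agent_id)
        self.resident = True

    def unload_request(self):
        """UnloadModelRequest: drop the pin; free only if no agent uses it."""
        self.pinned = False
        self._free_if_unused()

    def agent_destroyed(self, agent_id):
        """Destroying the last agent frees an unpinned model."""
        self.agents.discard(agent_id)
        self._free_if_unused()

    def _free_if_unused(self):
        if not self.pinned and not self.agents:
            self.resident = False
```

In this model an on-demand load is simply a resident model with `pinned == False`, which is why destroying its last agent frees it while a pinned model survives.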

Download record — `downloads.json`¶

The server maintains a downloads.json file inside models_download_dir — its download ledger — to track completed HuggingFace downloads. This file is server-managed; do not edit it by hand. Its contents feed ModelInfo.status == Downloaded at startup so that a previously downloaded model can be loaded without re-downloading.

Structure (for reference only):

{
    "entries": [
        {
            "name":        "Llama 3.2 3B Instruct (Q4_K_M)",
            "engine":      1,
            "folder":      "bartowski--Llama-3.2-3B-Instruct-GGUF",
            "files":       ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"],
            "total_bytes": 2019377152
        }
    ]
}

engine is the InferenceEngine enum ordinal on the wire protocol (1 = LlamaCpp).
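
Reading the ledger is harmless even though editing it is not. The sketch below summarizes its entries for inspection; the helper name `summarize_ledger` is hypothetical, and only the one engine ordinal documented above is mapped to a name.

```python
import json

# Only the ordinal documented on this page; other values are reported as unknown.
ENGINE_NAMES = {1: "LlamaCpp"}


def summarize_ledger(text: str) -> dict:
    """Map each downloaded model name to its engine, folder, and size in GiB."""
    ledger = json.loads(text)
    summary = {}
    for entry in ledger.get("entries", []):
        engine = entry["engine"]
        summary[entry["name"]] = {
            "engine": ENGINE_NAMES.get(engine, f"unknown({engine})"),
            "folder": entry["folder"],
            "size_gib": entry["total_bytes"] / 2**30,
        }
    return summary
```

Applied to the sample structure above, the single entry reports the `LlamaCpp` engine and a size of roughly 1.88 GiB.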

Errors¶

Code	Cause
`6001`	Download failed (HTTP error, interrupted transfer, or checksum mismatch).
`6002`	No variant for the active engine in the catalog.
`6003`	Disk full in the download directory.
`6004`	A download for this model is already active.
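
Client code may prefer symbolic names over bare integers. The enum below is a sketch: the numeric codes and causes come from the table, while the member names and the `describe_model_error` helper are invented for illustration and are not identifiers from any Tryll binding.

```python
from enum import IntEnum


class ModelErrorCode(IntEnum):
    """Numeric values from the table above; member names are hypothetical."""
    DOWNLOAD_FAILED = 6001
    NO_VARIANT_FOR_ENGINE = 6002
    DISK_FULL = 6003
    DOWNLOAD_ALREADY_ACTIVE = 6004


_CAUSES = {
    ModelErrorCode.DOWNLOAD_FAILED: "Download failed (HTTP, interrupted, checksum).",
    ModelErrorCode.NO_VARIANT_FOR_ENGINE: "No variant for the active engine in the catalog.",
    ModelErrorCode.DISK_FULL: "Disk full in the download directory.",
    ModelErrorCode.DOWNLOAD_ALREADY_ACTIVE: "A download for this model is already active.",
}


def describe_model_error(code: int) -> str:
    """Return the documented cause for a code, or a generic message."""
    try:
        return _CAUSES[ModelErrorCode(code)]
    except ValueError:
        return f"Unrecognized model error code {code}."
```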

Client bindings¶

C++: Tryll::TryllClient::ListModels / DownloadModel / LoadModel / UnloadModel — TryllClient.h
Python: tryll_client.TryllClient.list_models / download_model / load_model / unload_model — client.py
Unreal: UTryllSubsystem::RequestListModels / RequestDownloadModel / RequestLoadModel / RequestUnloadModel — TryllSubsystem.h

Server Configuration — server-wide settings that govern model storage.
Concept: Models and inference engines
How to use your own local model
How to pin and unpin models
Agent Parameters — how agents reference models.
Glossary