Model Management

The Tryll server treats models as first-class, named resources. Every model it can load is declared in a single catalog file, models.json, that the server reads at startup. Clients discover, download, load, and unload models over the wire protocol; the server handles all disk I/O, HuggingFace downloads, and backend-specific loading.

This page is the reference for:

  • The models.json schema that tells the server what exists and how to find it.
  • The model lifecycle wire-protocol flow: ListModels → DownloadModel → LoadModel / UnloadModel.
  • The status enum reported per-model.
  • The retention modes that control memory occupancy.

Server-wide model settings (download directory, catalog path) live in server-config.json.

Development shortcut

During prototyping you can skip the explicit ListModels → DownloadModel steps by enabling allow_auto_model_downloading on the session; CreateAgent then downloads missing models automatically. See Enable Auto Model Downloading.


models.json

Default location: data/models.json, configurable via models_catalog_path in server-config.json. The server parses this file once at startup.

Top-level structure

<!-- sample only; not a committed file -->
{
    "models": [
        { ... },
        { ... }
    ]
}

Model descriptor fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Human-readable model name. This is the identifier used everywhere on the wire: CreateAgentRequest.default_model_name, DownloadModelRequest.model_name, LoadModelRequest.model_name. Must be unique within the catalog. |
| default_sampling | object | No | Default sampling params for this model. Node-level NodeParam overrides take precedence. Absent fields fall back to the built-in defaults listed below. |
| variants | array | Yes | One entry per supported inference engine. |

default_sampling fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 0.7 | Sampling temperature; higher = more random. |
| top_p | float | 1.0 | Nucleus sampling threshold. 1.0 disables. |
| top_k | int | 0 | Top-K sampling cut-off. 0 disables. |
| min_p | float | 0.05 | Min-P filter. 0.0 disables. |
| repeat_penalty | float | 1.0 | Penalty for repeating tokens. 1.0 disables. |
| presence_penalty | float | 0.0 | OpenAI-style presence penalty. 0.0 disables. |
| frequency_penalty | float | 0.0 | OpenAI-style frequency penalty. 0.0 disables. |
| max_tokens | int | 2048 | Maximum generated tokens per turn. |
| seed | uint32 | 0 | RNG seed. 0 selects a random seed per turn. |
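
The precedence described above (built-in defaults, then the model's default_sampling, then node-level overrides) can be sketched as a layered dictionary merge. This is only an illustration of the documented ordering, not the server's implementation; effective_sampling and BUILTIN_SAMPLING are hypothetical names, and rejecting unknown fields is a choice made for this sketch.

```python
# Built-in defaults copied from the default_sampling table.
BUILTIN_SAMPLING = {
    "temperature": 0.7,
    "top_p": 1.0,
    "top_k": 0,
    "min_p": 0.05,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "max_tokens": 2048,
    "seed": 0,
}


def effective_sampling(model_defaults=None, node_overrides=None):
    """Merge sampling layers; later layers win (hypothetical helper)."""
    merged = dict(BUILTIN_SAMPLING)
    for layer in (model_defaults or {}, node_overrides or {}):
        unknown = set(layer) - set(BUILTIN_SAMPLING)
        if unknown:
            # Assumption of this sketch: unknown fields are an error.
            raise KeyError(f"unknown sampling fields: {sorted(unknown)}")
        merged.update(layer)
    return merged


params = effective_sampling(
    model_defaults={"temperature": 0.6, "top_p": 0.9},
    node_overrides={"temperature": 0.2},
)
print(params["temperature"], params["top_p"], params["min_p"])
```

Here the node override wins for temperature, the catalog value wins for top_p, and min_p falls through to the built-in default.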

variants entry fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| engine | string | Yes | Which inference engine this variant targets. Currently supported: "llama-cpp". Must match a registered engine; variants for other engines are silently ignored when the session is configured for a different engine. |
| local_path | string | No | Absolute or server-relative path to a directory containing the model file(s) on disk. Combined with files[0] to resolve the full path. Takes priority over path and downloads.json. Use this for user-supplied models. |
| path | string | No | Legacy relative path under models_download_dir. Fallback when neither local_path nor a downloads.json entry resolves. |
| huggingface_repo | string | No | HuggingFace repository slug ("owner/repo"). Required for downloads. Empty means the model cannot be downloaded over the wire; it must already be on disk via local_path. |
| files | array of string | No | Filenames for this variant. For HuggingFace, these are downloaded from huggingface_repo; for local, resolved relative to local_path. Empty disables both download and resolution. |
| context_size | int | No | Override the engine's default context window (in tokens). 0 or absent uses the engine default, currently 8192 for llama.cpp. |
| kv_cache_type | string | No | llama.cpp KV-cache dtype: "f16", "q8_0", or "q4_0". Default "q8_0" (≈half the KV VRAM vs F16). Ignored by other engines. |
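
The resolution priority in the table (local_path first, then a downloads.json entry, then the legacy path) could look roughly like the sketch below. It is an illustration under assumptions, not the server's code: resolve_model_file is a hypothetical helper, the ledger folder is assumed to live directly under models_download_dir, path is assumed to name a directory, and relative paths are resolved against the current working directory rather than the server root.

```python
import tempfile
from pathlib import Path


def resolve_model_file(variant, models_download_dir, ledger_entry=None):
    """Return the first candidate path that exists on disk, or None."""
    files = variant.get("files") or []
    if not files:
        return None  # empty files disables resolution
    first = files[0]
    candidates = []
    if variant.get("local_path"):
        # Highest priority: user-supplied directory.
        candidates.append(Path(variant["local_path"]) / first)
    if ledger_entry:
        # Assumption: the ledger's folder sits under models_download_dir.
        candidates.append(Path(models_download_dir) / ledger_entry["folder"] / first)
    if variant.get("path"):
        # Assumption: the legacy path names a directory, like local_path.
        candidates.append(Path(models_download_dir) / variant["path"] / first)
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Demo with a throwaway directory standing in for a user model folder.
with tempfile.TemporaryDirectory() as tmp:
    local_dir = Path(tmp) / "user"
    local_dir.mkdir()
    (local_dir / "my-model.gguf").write_bytes(b"")
    variant = {
        "engine": "llama-cpp",
        "local_path": str(local_dir),
        "files": ["my-model.gguf"],
    }
    resolved = resolve_model_file(variant, models_download_dir=tmp)
    resolved_name = resolved.name if resolved else None
    print(resolved_name)
```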

Minimal example — user-provided model

<!-- sample only; not a committed file -->
{
    "models": [
        {
            "name": "My Local Model",
            "variants": [
                {
                    "engine":     "llama-cpp",
                    "local_path": "C:/models",
                    "files":      ["my-model.gguf"]
                }
            ]
        }
    ]
}

Downloadable example — HuggingFace + tuned sampling

<!-- sample only; not a committed file -->
{
    "models": [
        {
            "name": "Llama 3.2 3B Instruct (Q4_K_M)",
            "default_sampling": {
                "temperature":    0.6,
                "top_p":          0.9,
                "top_k":          50,
                "min_p":          0.05,
                "repeat_penalty": 1.2
            },
            "variants": [
                {
                    "engine":           "llama-cpp",
                    "huggingface_repo": "bartowski/Llama-3.2-3B-Instruct-GGUF",
                    "files":            ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"]
                }
            ]
        }
    ]
}
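
Before restarting the server with an edited catalog, it can help to check the constraints stated in the field tables: unique name, a variants array, an engine per variant, and non-empty files. The following is a minimal sketch of such a check; validate_catalog is a hypothetical helper, treating an empty variants array as a problem is this sketch's choice, and the server's own validation may differ.

```python
import json


def validate_catalog(text):
    """Return a list of human-readable problems found in a models.json document."""
    problems = []
    catalog = json.loads(text)
    seen = set()
    for index, model in enumerate(catalog.get("models", [])):
        name = model.get("name")
        if not name:
            problems.append(f"models[{index}]: missing name")
        elif name in seen:
            problems.append(f"models[{index}]: duplicate name {name!r}")
        else:
            seen.add(name)
        variants = model.get("variants")
        if not variants:
            problems.append(f"models[{index}]: missing or empty variants")
            continue
        for v_index, variant in enumerate(variants):
            where = f"models[{index}].variants[{v_index}]"
            if not variant.get("engine"):
                problems.append(f"{where}: missing engine")
            if not variant.get("files"):
                problems.append(f"{where}: empty files disables download and resolution")
            elif not any(variant.get(k) for k in ("local_path", "path", "huggingface_repo")):
                problems.append(f"{where}: no local_path, path, or huggingface_repo")
    return problems


sample = """
{
    "models": [
        {"name": "A", "variants": [{"engine": "llama-cpp", "local_path": "C:/models", "files": ["a.gguf"]}]},
        {"name": "A", "variants": [{"engine": "llama-cpp", "files": []}]}
    ]
}
"""
problems = validate_catalog(sample)
for problem in problems:
    print(problem)
```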

Embedding models

Models with "purpose": "embedding" are declared the same way and referenced by name from embedded string storages and from Retrieve node params. Their sampling fields are ignored; only the file resolution and engine fields matter.


Lifecycle

The wire-protocol flow for putting a model under a running agent:

sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: ListModelsRequest
    S-->>C: ListModelsResponse([ModelInfo])
    alt status == Absent
        C->>S: DownloadModelRequest(name)
        S-->>C: DownloadProgress × N
        S-->>C: DownloadComplete(success=true)
    end
    C->>S: LoadModelRequest(name)
    S-->>C: LoadModelResponse
    C->>S: CreateAgentRequest(default_model_name=name, ...)
    S-->>C: CreateAgentResponse(agent_id)
    note over C,S: Agent runs turns...
    C->>S: UnloadModelRequest(name)
    S-->>C: Ack

Key rules:

  • You do not have to call LoadModelRequest explicitly. An agent whose graph references a not-yet-loaded model triggers on-demand load at CreateAgent time.
  • LoadModelRequest pins the model. The server keeps the model resident until UnloadModelRequest, regardless of agent count.
  • UnloadModelRequest on a shared model is polite. If any agent still uses the model, it stays resident; it is freed when the last user goes away.

See How to pin and unpin models for the end-to-end walkthrough.
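
The sequence above can be sketched from the client side as follows. The method names list_models, download_model, and load_model come from the Python binding listed under Client bindings, but their signatures and return values are assumptions here, so the example runs against FakeClient, an in-memory stand-in rather than the real tryll_client.TryllClient. ModelInfo is likewise a simplified stand-in.

```python
from dataclasses import dataclass

# Ordinals from the Model status table.
ABSENT, LOCAL, DOWNLOADING, LOADED, DOWNLOADED = range(5)


@dataclass
class ModelInfo:
    """Simplified stand-in for the wire-protocol ModelInfo."""
    name: str
    status: int


class FakeClient:
    """In-memory stand-in for a Tryll client; not the real binding."""

    def __init__(self, models):
        self.models = {m.name: m for m in models}
        self.calls = []

    def list_models(self):
        self.calls.append("list_models")
        return list(self.models.values())

    def download_model(self, name):
        # Assumption: blocks until DownloadComplete and returns success.
        self.calls.append("download_model")
        self.models[name].status = DOWNLOADED
        return True

    def load_model(self, name):
        self.calls.append("load_model")
        self.models[name].status = LOADED


def ensure_pinned(client, name):
    """List, download if absent, then pin the model with an explicit load."""
    infos = {info.name: info for info in client.list_models()}
    if name not in infos:
        raise KeyError(f"model {name!r} is not in the catalog")
    status = infos[name].status
    if status == DOWNLOADING:
        raise RuntimeError(f"a download for {name!r} is already active")
    if status == ABSENT and not client.download_model(name):
        raise RuntimeError(f"download of {name!r} failed")
    client.load_model(name)  # LoadModelRequest pins the model


client = FakeClient([ModelInfo("My Local Model", LOCAL), ModelInfo("Llama", ABSENT)])
ensure_pinned(client, "Llama")
print(client.calls)
```

Calling load_model even when the model is already resident is deliberate in this sketch: an on-demand load does not pin, so the explicit request is what changes retention.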


Model status

The ModelInfo.status field in ListModelsResponse reports where a model is right now:

| Value | Int | Meaning |
| --- | --- | --- |
| Absent | 0 | Known in catalog but not on disk and not downloading. |
| Local | 1 | User-provided path resolved (local_path); on disk; can be loaded. |
| Downloading | 2 | Transfer in progress. DownloadProgress frames are flowing. |
| Loaded | 3 | Currently resident in memory for its model kind (language, STT, TTS, or embedding). |
| Downloaded | 4 | On disk from a HuggingFace download; can be loaded. |

Ordinals match the ModelStatus enum on the wire protocol and ETryllModelStatus in the Unreal client.
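
For client code that receives the raw integer, a small enum mirroring the table keeps comparisons readable. This is a sketch with the ordinals copied from the table; the real Python binding may already expose an equivalent type, and the helpers needs_download and can_load are hypothetical.

```python
from enum import IntEnum


class ModelStatus(IntEnum):
    """Mirror of the status ordinals listed in the table."""
    Absent = 0
    Local = 1
    Downloading = 2
    Loaded = 3
    Downloaded = 4


def needs_download(status):
    """True when the model is known to the catalog but not on disk."""
    return ModelStatus(status) is ModelStatus.Absent


def can_load(status):
    """True for the two states the table marks as 'can be loaded'."""
    return ModelStatus(status) in (ModelStatus.Local, ModelStatus.Downloaded)


print(ModelStatus(4).name, needs_download(0), can_load(2))
```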

Loaded is per model kind

Loaded means the model is resident in the server's in-memory cache for its specific kind. A language model and an STT model are tracked independently — a voice-input session that pins an STT model will show it as Loaded regardless of which language engine was selected. ConfigureSession carries separate engine fields for language, STT, TTS, and embedding; ListModels uses the matching engine per entry to determine status.


Retention modes

Tryll keeps a model in memory according to its retention mode, which depends on how the load was triggered:

| Retention | How it is triggered | When the model is freed |
| --- | --- | --- |
| Pinned | LoadModelRequest | On UnloadModelRequest (delayed until no agent uses it). |
| OnDemand | Implicit: an agent's graph references the model, no prior LoadModelRequest | When the last agent that uses it is destroyed. |

Use Pinned for the single hot model on the machine — the latency cost of the first load is paid once at startup. Use OnDemand for secondary models where memory matters more than first-turn latency.

These labels match the client-facing names used in the glossary (pinned retention, on-demand retention).

See Lifetime and Ownership → Models for the reference-counting story that connects Pinned / OnDemand to what nodes inside an agent actually hold.
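
The two rows of the retention table reduce to one condition: a model stays resident while it is pinned or while at least one agent uses it. The toy bookkeeping below illustrates that rule only; ResidentModel and its methods are hypothetical and do not describe the server's internal data structures.

```python
class ResidentModel:
    """Toy bookkeeping for one model: a pin flag plus the set of agents using it."""

    def __init__(self):
        self.pinned = False
        self.agents = set()

    def load(self):
        """Models LoadModelRequest: pin the model."""
        self.pinned = True

    def unload(self):
        """Models UnloadModelRequest: drop the pin; agents may keep it alive."""
        self.pinned = False

    def attach(self, agent_id):
        """An agent whose graph references the model starts using it."""
        self.agents.add(agent_id)

    def detach(self, agent_id):
        """The agent is destroyed or stops using the model."""
        self.agents.discard(agent_id)

    @property
    def resident(self):
        return self.pinned or bool(self.agents)

    @property
    def retention(self):
        if self.pinned:
            return "Pinned"
        return "OnDemand" if self.agents else None


model = ResidentModel()
model.load()
model.attach("agent-1")
model.unload()                      # polite: agent-1 still uses the model
still_resident = model.resident     # True, now held on demand
model.detach("agent-1")
freed = not model.resident          # True once the last user is gone
print(still_resident, freed)
```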


Download record — downloads.json

The server maintains a downloads.json file inside models_download_dir — its download ledger — to track completed HuggingFace downloads. This file is server-managed; do not edit it by hand. Its contents feed ModelInfo.status == Downloaded at startup so that a previously downloaded model can be loaded without re-downloading.

Structure (for reference only):

{
    "entries": [
        {
            "name":        "Llama 3.2 3B Instruct (Q4_K_M)",
            "engine":      1,
            "folder":      "bartowski--Llama-3.2-3B-Instruct-GGUF",
            "files":       ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"],
            "total_bytes": 2019377152
        }
    ]
}

engine is the InferenceEngine enum ordinal on the wire protocol (1 = LlamaCpp).
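
Because the ledger is server-managed, tooling should only ever read it. A read-only summary might look like the sketch below; summarize_ledger is a hypothetical helper, the field names follow the structure shown above, and only the one documented engine ordinal (1 = LlamaCpp) is mapped.

```python
import json

# Only the ordinal documented on this page; others are reported as unknown.
ENGINE_NAMES = {1: "LlamaCpp"}


def summarize_ledger(text):
    """Return (name, engine, size in GiB) for each entry; never writes the file."""
    ledger = json.loads(text)
    rows = []
    for entry in ledger.get("entries", []):
        engine = ENGINE_NAMES.get(entry["engine"], f"unknown({entry['engine']})")
        size_gib = round(entry["total_bytes"] / 2**30, 2)
        rows.append((entry["name"], engine, size_gib))
    return rows


ledger_text = """
{
    "entries": [
        {
            "name":        "Llama 3.2 3B Instruct (Q4_K_M)",
            "engine":      1,
            "folder":      "bartowski--Llama-3.2-3B-Instruct-GGUF",
            "files":       ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"],
            "total_bytes": 2019377152
        }
    ]
}
"""
rows = summarize_ledger(ledger_text)
print(rows)
```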


Errors

| Code | Cause |
| --- | --- |
| 6001 | Download failed (HTTP, interrupted, checksum). |
| 6002 | No variant for the active engine in the catalog. |
| 6003 | Disk full in the download directory. |
| 6004 | A download for this model is already active. |


Client bindings

  • C++: Tryll::TryllClient::ListModels / DownloadModel / LoadModel / UnloadModel (TryllClient.h)
  • Python: tryll_client.TryllClient.list_models / download_model / load_model / unload_model (client.py)
  • Unreal: UTryllSubsystem::RequestListModels / RequestDownloadModel / RequestLoadModel / RequestUnloadModel (TryllSubsystem.h)