
Model Management

The Tryll server treats models as first-class, named resources. Every model it can load is declared in a single catalog file, models.json, that the server reads at startup. Clients discover, download, load, and unload models over the wire protocol; the server handles all disk I/O, HuggingFace downloads, and backend-specific loading.

This page is the reference for:

  • The models.json schema that tells the server what exists and how to find it.
  • The model lifecycle wire-protocol flow: ListModels → DownloadModel → LoadModel / UnloadModel.
  • The status enum reported per-model.
  • The retention modes that control memory occupancy.

Server-wide model settings (download directory, catalog path) live in server-config.json.
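For orientation, the relevant fragment of server-config.json might look like the sketch below; the directory value is illustrative and all other server settings are omitted.

<!-- sample only; not a committed file -->
{
    "models_catalog_path": "data/models.json",
    "models_download_dir": "data/models"
}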

Development shortcut

During prototyping you can skip the explicit ListModels → DownloadModel steps by enabling allow_auto_model_downloading on the session; CreateAgent then downloads missing models automatically. See Enable Auto Model Downloading.
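A minimal client-side sketch, assuming a hypothetical Python session API — the create_session and create_agent method names and the constructor arguments below are not documented bindings; see Enable Auto Model Downloading for the real steps:

from tryll_client import TryllClient

# Sketch only: create_session / create_agent and the constructor
# arguments are hypothetical, not documented bindings.
client = TryllClient("localhost", 50051)
session = client.create_session(allow_auto_model_downloading=True)

# CreateAgent downloads the referenced model automatically if missing.
agent_id = session.create_agent(default_model_name="Llama 3.2 3B Instruct (Q4_K_M)")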


models.json

Default location: data/models.json, configurable via models_catalog_path in server-config.json. The server parses this file once at startup.

Top-level structure

<!-- sample only; not a committed file -->
{
    "models": [
        { ... },
        { ... }
    ]
}

Model descriptor fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Human-readable model name. This is the identifier used everywhere on the wire: CreateAgentRequest.default_model_name, DownloadModelRequest.model_name, LoadModelRequest.model_name. Must be unique within the catalog. |
| default_sampling | object | No | Default sampling params for this model. Node-level NodeParam overrides take precedence; absent fields fall back to the built-in defaults listed below. |
| variants | array | Yes | One entry per supported inference engine. |

default_sampling fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 0.7 | Sampling temperature; higher = more random. |
| top_p | float | 1.0 | Nucleus sampling threshold. 1.0 disables. |
| top_k | int | 0 | Top-K sampling cut-off. 0 disables. |
| min_p | float | 0.05 | Min-P filter. 0.0 disables. |
| repeat_penalty | float | 1.0 | Penalty for repeating tokens. 1.0 disables. |
| presence_penalty | float | 0.0 | OpenAI-style presence penalty. 0.0 disables. |
| frequency_penalty | float | 0.0 | OpenAI-style frequency penalty. 0.0 disables. |
| max_tokens | int | 2048 | Maximum generated tokens per turn. |
| seed | uint32 | 0 | RNG seed. 0 selects a random seed per turn. |

variants entry fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| engine | string | Yes | Inference engine this variant targets. Currently supported: "llama-cpp". Must match a registered engine; variants targeting an engine other than the session's are silently ignored. |
| local_path | string | No | Absolute or server-relative path to a directory containing the model file(s) on disk. Combined with files[0] to resolve the full path. Takes priority over path and downloads.json. Use this for user-supplied models. |
| path | string | No | Legacy relative path under models_download_dir. Fallback when neither local_path nor a downloads.json entry resolves. |
| huggingface_repo | string | No | HuggingFace repository slug ("owner/repo"). Required for downloads; if empty, the model cannot be downloaded over the wire and must already be on disk via local_path. |
| files | array of string | No | Filenames for this variant. For HuggingFace, downloaded from huggingface_repo; for local models, resolved relative to local_path. If empty, both download and path resolution are disabled. |
| context_size | int | No | Override the engine's default context window, in tokens. 0 or absent uses the engine default (currently 8192 for llama.cpp). |
| kv_cache_type | string | No | llama.cpp KV-cache dtype: "f16", "q8_0", or "q4_0". Default "q8_0" (≈half the KV VRAM vs F16). Ignored by other engines. |

Minimal example — user-provided model

<!-- sample only; not a committed file -->
{
    "models": [
        {
            "name": "My Local Model",
            "variants": [
                {
                    "engine":     "llama-cpp",
                    "local_path": "C:/models",
                    "files":      ["my-model.gguf"]
                }
            ]
        }
    ]
}

Downloadable example — HuggingFace + tuned sampling

<!-- sample only; not a committed file -->
{
    "models": [
        {
            "name": "Llama 3.2 3B Instruct (Q4_K_M)",
            "default_sampling": {
                "temperature":    0.6,
                "top_p":          0.9,
                "top_k":          50,
                "min_p":          0.05,
                "repeat_penalty": 1.2
            },
            "variants": [
                {
                    "engine":           "llama-cpp",
                    "huggingface_repo": "bartowski/Llama-3.2-3B-Instruct-GGUF",
                    "files":            ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"]
                }
            ]
        }
    ]
}

Embedding models

Models with "purpose": "embedding" are declared the same way and referenced by name from embedded string storages and from Retrieve node params. Their sampling fields are ignored; only the file resolution and engine fields matter.
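A minimal embedding-model descriptor is therefore just a named variant plus the purpose marker. A sketch, with illustrative repo and file names:

<!-- sample only; not a committed file -->
{
    "name": "My Embedding Model",
    "purpose": "embedding",
    "variants": [
        {
            "engine":           "llama-cpp",
            "huggingface_repo": "owner/embedding-repo",
            "files":            ["embedding-model.gguf"]
        }
    ]
}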


Lifecycle

The wire-protocol flow for taking a model from catalog entry to serving a running agent:

sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: ListModelsRequest
    S-->>C: ListModelsResponse([ModelInfo])
    alt status == Absent
        C->>S: DownloadModelRequest(name)
        S-->>C: DownloadProgress × N
        S-->>C: DownloadComplete(success=true)
    end
    C->>S: LoadModelRequest(name)
    S-->>C: LoadModelResponse
    C->>S: CreateAgentRequest(default_model_name=name, ...)
    S-->>C: CreateAgentResponse(agent_id)
    note over C,S: Agent runs turns...
    C->>S: UnloadModelRequest(name)
    S-->>C: Ack
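In the Python bindings the same flow is a handful of calls. A sketch, using the method names listed under Client bindings below; the constructor arguments and the shape of the returned ModelInfo entries are assumptions, not documented API:

from tryll_client import TryllClient

NAME = "Llama 3.2 3B Instruct (Q4_K_M)"

# Constructor arguments are assumptions; method names match the
# Python bindings listed at the bottom of this page.
client = TryllClient("localhost", 50051)

# 1. Discover what the catalog knows about.
info = next(m for m in client.list_models() if m.name == NAME)

# 2. Download only if the model is not on disk yet (0 = Absent).
if info.status == 0:
    client.download_model(NAME)   # streams DownloadProgress until complete

# 3. Pin the model so it stays resident across agent churn.
client.load_model(NAME)

# ... CreateAgent with default_model_name=NAME, run turns ...

# 4. Unpin; the server frees it once no agent still uses it.
client.unload_model(NAME)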

Key rules:

  • You do not have to call LoadModelRequest explicitly. An agent whose graph references a not-yet-loaded model triggers an on-demand load at CreateAgent time.
  • LoadModelRequest pins the model. The server keeps the model resident until UnloadModelRequest, regardless of agent count.
  • UnloadModelRequest on a shared model is polite. If any agent still uses the model, it stays resident; it is freed when the last user goes away.

See How to pin and unpin models for the end-to-end walkthrough.


Model status

The ModelInfo.status field in ListModelsResponse reports where a model is right now:

| Value | Int | Meaning |
| --- | --- | --- |
| Absent | 0 | Known in the catalog but not on disk and not downloading. |
| Local | 1 | User-provided path resolved (local_path); on disk; can be loaded. |
| Downloading | 2 | Transfer in progress; DownloadProgress frames are flowing. |
| Loaded | 3 | Currently resident in memory. |
| Downloaded | 4 | On disk from a HuggingFace download; can be loaded. |

Ordinals match the ModelStatus enum on the wire protocol and ETryllModelStatus in the Unreal client.
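Clients that handle raw ordinals can mirror the enum directly. A sketch in Python, with names and values taken from the table above:

from enum import IntEnum

class ModelStatus(IntEnum):
    ABSENT      = 0  # in catalog, not on disk, not downloading
    LOCAL       = 1  # resolved via local_path; loadable
    DOWNLOADING = 2  # DownloadProgress frames in flight
    LOADED      = 3  # resident in memory
    DOWNLOADED  = 4  # on disk from a HuggingFace download; loadable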


Retention modes

Tryll decides what to keep in memory based on a simple rule:

| Retention | How it is triggered | When the model is freed |
| --- | --- | --- |
| Pinned | LoadModelRequest | On UnloadModelRequest (deferred until no agent uses it). |
| OnDemand | Implicit: an agent's graph references the model with no prior LoadModelRequest. | When the last agent that uses it is destroyed. |

Use Pinned for the single hot model on the machine — the latency cost of the first load is paid once at startup. Use OnDemand for secondary models where memory matters more than first-turn latency.
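In client terms the two modes differ only in whether you call load_model. A sketch continuing the lifecycle example above; create_agent is a hypothetical stand-in for however your binding issues CreateAgentRequest:

# Pinned: pay the first-load latency once, keep the model hot.
client.load_model("Hot Model")
# ... agents referencing "Hot Model" come and go; it stays resident ...
client.unload_model("Hot Model")   # freed once no agent still uses it

# OnDemand: no explicit load. The model loads at CreateAgent time and
# is freed when the last agent referencing it is destroyed.
agent_id = client.create_agent(default_model_name="Secondary Model")  # hypothetical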

These labels match the client-facing names used in the glossary (pinned retention, on-demand retention).

See Lifetime and Ownership → Models for the reference-counting story that connects Pinned / OnDemand to what nodes inside an agent actually hold.


Download record — downloads.json

The server maintains a downloads.json file inside models_download_dir — its download ledger — to track completed HuggingFace downloads. This file is server-managed; do not edit it by hand. Its contents feed ModelInfo.status == Downloaded at startup so that a previously downloaded model can be loaded without re-downloading.

Structure (for reference only):

{
    "entries": [
        {
            "name":        "Llama 3.2 3B Instruct (Q4_K_M)",
            "engine":      1,
            "folder":      "bartowski--Llama-3.2-3B-Instruct-GGUF",
            "files":       ["Llama-3.2-3B-Instruct-Q4_K_M.gguf"],
            "total_bytes": 2019377152
        }
    ]
}

engine is the InferenceEngine enum ordinal on the wire protocol (1 = LlamaCpp).


Errors

| Code | Cause |
| --- | --- |
| 6001 | Download failed (HTTP error, interrupted transfer, or checksum mismatch). |
| 6002 | No variant for the active engine in the catalog. |
| 6003 | Disk full in the download directory. |
| 6004 | A download for this model is already active. |
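Codes can be mapped client-side for user-facing messages. A sketch with the causes from the table above; how a given binding surfaces the code (exception vs. response field) is not specified here:

# Error codes and causes from the table above.
MODEL_ERRORS = {
    6001: "Download failed (HTTP error, interrupted transfer, or checksum mismatch)",
    6002: "No variant for the active engine in the catalog",
    6003: "Disk full in the download directory",
    6004: "A download for this model is already active",
}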

Client bindings

  • C++: Tryll::TryllClient::ListModels / DownloadModel / LoadModel / UnloadModel (TryllClient.h)
  • Python: tryll_client.TryllClient.list_models / download_model / load_model / unload_model (client.py)
  • Unreal: UTryllSubsystem::RequestListModels / RequestDownloadModel / RequestLoadModel / RequestUnloadModel (TryllSubsystem.h)