Pin and Unpin Models¶
Keep a model warm in memory or explicitly release it. Useful when you want the first turn to be fast, or when memory pressure forces you to pick which model lives.
Prerequisites¶
- A session connected and configured.
- A model listed in `models.json` with status `Local` or `Downloaded` — see Use Your Own Local Model.
When to pin¶
| Situation | Pin? |
|---|---|
| Single chat model used all session long | Yes — avoids the first-turn load delay. |
| Many agents with different models, short-lived | No — let OnDemand eviction recycle memory as agents close. |
| Embedding model for RAG | Yes — embeddings run every turn; you do not want it evicted. |
| Model used only once during startup (e.g. a warm-up probe) | No — let it fall out when you destroy the probing agent. |
The two retention policies behave like this:
- `Pinned` — set by `LoadModelRequest` / Blueprint Request Load Model. Only unloaded when you send `UnloadModelRequest` and no active contexts still reference the model.
- `OnDemand` — the default. Set implicitly when you create an agent and the model is not already loaded. The server runs `EvictUnusedOnDemand` after every `DestroyAgent`, freeing any on-demand model whose last user just went away.
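The two policies can be sketched as follows. This is an illustrative model of the behaviour described above, not the real server code: the class and method names (`ModelServer`, `create_agent`, and so on) are hypothetical stand-ins for the protocol messages.

```python
# Illustrative sketch of Pinned vs OnDemand retention. All names here are
# hypothetical; only the behaviour mirrors the documentation.
from enum import Enum, auto

class Retention(Enum):
    PINNED = auto()     # survives until an explicit UnloadModelRequest
    ON_DEMAND = auto()  # eligible for eviction once no agent uses it

class ModelServer:
    def __init__(self):
        self.loaded = {}   # model name -> Retention
        self.agents = {}   # agent id -> model name

    def load_model_request(self, name):
        # Explicit load: pins the model (upgrades an on-demand load too).
        self.loaded[name] = Retention.PINNED

    def create_agent(self, agent_id, model):
        # Implicit load: on-demand retention if not already resident.
        self.loaded.setdefault(model, Retention.ON_DEMAND)
        self.agents[agent_id] = model

    def destroy_agent(self, agent_id):
        self.agents.pop(agent_id)
        self._evict_unused_on_demand()

    def _evict_unused_on_demand(self):
        # Runs after every DestroyAgent: free on-demand models with no users.
        in_use = set(self.agents.values())
        for name in list(self.loaded):
            if self.loaded[name] is Retention.ON_DEMAND and name not in in_use:
                del self.loaded[name]
```

Note that a pinned model passes through `_evict_unused_on_demand` untouched, which is exactly why a pin keeps the model warm across agent create / destroy cycles.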
Steps — pin a model¶
Call Request Load Model on `UTryllSubsystem` with the model
name. Bind On Load Model Complete to wait for
`bSuccess = true`.
After a successful pin, the model stays in memory across agent create / destroy cycles until you explicitly unload it.
Steps — unpin a model¶
`UnloadModelRequest` is safe even if the model is currently in use
by an active agent — the server completes the current turn first and
drops the model only when the last context is gone. You get an
`Ack` as soon as the unload is scheduled.
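The deferred-unload behaviour can be sketched as a small state machine. Again this is an illustrative model under assumed names, not the server's implementation: the point is that the ack is immediate while the actual free waits for the context count to hit zero.

```python
# Hypothetical sketch of deferred unload: acknowledge immediately, free
# memory only when the last active context releases the model.
class LoadedModel:
    def __init__(self, name):
        self.name = name
        self.contexts = 0          # active agent contexts using this model
        self.in_memory = True
        self.unload_requested = False

    def acquire(self):
        self.contexts += 1

    def release(self):
        # Called when an agent's context finishes its turn and goes away.
        self.contexts -= 1
        self._maybe_free()

    def unload_model_request(self):
        # Ack is returned right away; the free may happen later.
        self.unload_requested = True
        self._maybe_free()
        return "Ack"

    def _maybe_free(self):
        if self.unload_requested and self.contexts == 0:
            self.in_memory = False  # model dropped from memory here
```

This matches the log behaviour described below: an unload issued while `contexts > 0` is queued, and the model is freed the moment the last context releases it.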
Verify it worked¶
Check status with `ListModels`:
- `Status = Loaded` → model is in memory (pinned or on-demand with active users).
- `Status = Local` or `Downloaded` → model is on disk but not in memory.
Server-side log lines:
[info] LoadModelRequest "My Local Model" → Pinned
[info] UnloadModelRequest "My Local Model" → freed (contexts=0)
If you see freed (contexts=N) with N > 0, the unload was queued
behind active contexts; it will complete as soon as those agents
finish.
Common pitfalls¶
- Pinning and never unpinning. Pinned models survive session
  tear-downs (they are global). A long-lived server can accumulate
  pins if clients forget to unload. Pair every
  `LoadModelRequest` with a matching `UnloadModelRequest` on shutdown.
- Expecting pin to change the load cost of subsequent sessions. This
  expectation is actually correct — pinning is global, so a second
  client connecting will find the model already `Loaded`.
- Trying to pin an `Absent` model. Pin requires the model to be at
  least `Local`. If it is `Absent`, call `DownloadModel` first.
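The last pitfall suggests a simple client-side guard: check the status reported by the model list before attempting a pin. A minimal sketch, assuming the status strings used on this page; the helper name and the `list_models` callable are hypothetical.

```python
# Hedged sketch: only attempt a pin when the model is at least on disk.
# Status strings follow this page; function names are illustrative.
PINNABLE = {"Local", "Downloaded", "Loaded"}

def ensure_pinnable(name, list_models):
    """Return (ok, reason). `list_models` maps model name -> status string."""
    status = list_models().get(name, "Absent")
    if status in PINNABLE:
        return True, status
    # Absent: the model must be fetched before a pin can succeed.
    return False, f"{name} is {status}; call DownloadModel first"
```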