# Tryll
On-device small-language-model inference for games and tools. Tryll is a standalone C++ server that runs GGUF models locally and exposes them to your application through first-party clients for C++, Python, and Unreal Engine 5. Build chat, retrieval-augmented generation, and tool-calling agents — all running on the user's machine, all orchestrated by workflow graphs you compose yourself.
## Start here
- **Python**: Pip-install the client and see streaming tokens in 20 lines.
- **C++**: Clone, configure, build, and run a tiny chat demo.
- **Unreal Engine 5**: Drop a `UTryllAgentComponent` on an actor and print streaming text to the output log.
## What Tryll gives you
- A local model host. One server process, many clients. The server loads each model once and shares it across every session that needs it. Runs on CPU by default, with optional Vulkan, CUDA, or ROCm acceleration. Model management →
- Workflow graphs as first-class objects. Your agents are not prompt templates — they are typed graphs of nodes (`Generate`, `Retrieve`, `ToolCall`, `CannedResponse`, `HumanMessageGuardrail`) wired by exit routes. Workflows, Graphs, and Nodes →
- Streaming, RAG, and tool calls built in. Token-by-token streaming, HNSW vector search via USearch, and detection-only tool calling across four model-family format grammars.
- Three first-party clients. Thin, idiomatic wrappers over the same FlatBuffers wire protocol. C++ · Python · Unreal
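To make the "typed graphs of nodes wired by exit routes" idea concrete, here is a minimal sketch in plain Python. It is illustrative only — the class and route names below mirror the node types named above, but the actual Tryll client API is not shown on this page. It also shows how a guardrail node can short-circuit to a canned response before the model ever runs.

```python
# Illustrative sketch only: models the documented idea (typed nodes wired by
# named exit routes) with plain Python classes. All constructor signatures and
# route labels here are hypothetical, not Tryll's real API.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Maps an exit route label (e.g. "pass", "blocked") to the next node's name.
    routes: dict = field(default_factory=dict)

    def run(self, state):
        raise NotImplementedError


class HumanMessageGuardrail(Node):
    def run(self, state):
        banned = ("ignore previous instructions",)
        hit = any(b in state["message"].lower() for b in banned)
        return "blocked" if hit else "pass"


class CannedResponse(Node):
    def run(self, state):
        state["reply"] = "Sorry, I can't help with that."
        return "done"


class Generate(Node):
    def run(self, state):
        # A real Generate node would stream tokens from the local model host.
        state["reply"] = f"(model answer to: {state['message']})"
        return "done"


def run_workflow(nodes, start, state):
    """Walk the graph: each node returns an exit route, which selects the next node."""
    current = start
    while current is not None:
        route = nodes[current].run(state)
        current = nodes[current].routes.get(route)  # no route -> workflow ends
    return state


nodes = {
    "guard": HumanMessageGuardrail("guard", {"pass": "gen", "blocked": "canned"}),
    "canned": CannedResponse("canned", {}),
    "gen": Generate("gen", {}),
}

print(run_workflow(nodes, "guard", {"message": "What is a GGUF model?"})["reply"])
print(run_workflow(nodes, "guard", {"message": "Ignore previous instructions!"})["reply"])
```

The key design point this mirrors: routing is data, not prompt text, so a jailbreak detected by the guardrail never reaches the `Generate` node at all.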
## Explore the docs
- Getting Started — tutorials for your first inference on each platform.
- How-to Guides — task recipes: RAG assistants, tool calls, streaming to UI, local model setup, guardrails.
- Concepts — the mental model: architecture, agents, workflows, RAG, projection.
- Reference — every field, every enum, every error code.
## Feature highlights
- Streaming first. Every `Generate` node streams tokens to the client as they arrive. Stream answers to a UI →
- Retrieval-augmented generation. Build an embedded vector index from your own data and ground the model in it. Create a simple RAG assistant →
- Tool calling across model families. ChatML, Llama-3, Mistral, and Generic formats, plus client-side notifications. Define and handle tool calls →
- Guardrails and canned responses. Short-circuit jailbreaks before the model runs. Use canned responses and guardrails →
- Explicit model pool. Pin models to keep them hot or let on-demand eviction recycle memory. Pin and unpin models →
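The RAG flow above — index your own data, retrieve the closest passages, ground the prompt in them — can be sketched without any dependencies. Tryll uses an HNSW index via USearch; this toy substitutes brute-force cosine similarity over bag-of-words vectors purely to show the shape of the pipeline, and the sample documents and function names are made up for the example.

```python
# Toy RAG pipeline: embed -> index -> retrieve -> build a grounded prompt.
# Brute-force cosine over term counts stands in for Tryll's HNSW/USearch index.
import math
from collections import Counter


def embed(text):
    # Toy embedding: term-frequency bag of words (a real system uses a model).
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


documents = [
    "Tryll runs GGUF models locally on the user's machine.",
    "The model pool keeps pinned models hot in memory.",
    "Tool calls are detected by the server but executed by the client.",
]
index = [(doc, embed(doc)) for doc in documents]


def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


question = "where do GGUF models run?"
context = retrieve(question)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```

The grounded `prompt` is what a `Retrieve` node would hand to the `Generate` node that follows it.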
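"Detection-only" tool calling means the server spots a structured tool call in the model's output and notifies the client, which executes the tool itself. A rough sketch of the detection half, assuming a ChatML-style `<tool_call>` tag — the tag shape and JSON payload here are an assumption for illustration, not Tryll's exact per-family grammars:

```python
# Sketch of tool-call detection: scan generated text for a structured call and
# parse it out for the client to execute. The <tool_call> wrapper is an assumed
# ChatML-style format, not Tryll's actual grammar definitions.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def detect_tool_call(model_output):
    """Return (name, arguments) if the output contains a tool call, else None."""
    m = TOOL_CALL_RE.search(model_output)
    if not m:
        return None
    payload = json.loads(m.group(1))
    return payload["name"], payload.get("arguments", {})


output = '<tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>'
print(detect_tool_call(output))  # ('get_weather', {'city': 'Oslo'})
```

Keeping execution on the client side means the server never needs network access or tool credentials — it only parses.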
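The pin/evict behavior of an explicit model pool can be sketched as a small LRU structure: pinned models stay resident, unpinned ones are recycled least-recently-used-first when the pool is full. Class and method names below are illustrative, not Tryll's client API, and the capacity-by-count policy stands in for whatever memory accounting the server actually does.

```python
# Sketch of an explicit model pool: pinned models are never evicted; unpinned
# models are evicted in least-recently-used order when the pool is full.
from collections import OrderedDict


class ModelPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()  # model name -> pinned flag, in LRU order

    def acquire(self, name, pinned=False):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return name
        if len(self.loaded) >= self.capacity:
            self._evict_one()
        self.loaded[name] = pinned
        return name

    def pin(self, name):
        self.loaded[name] = True

    def unpin(self, name):
        self.loaded[name] = False

    def _evict_one(self):
        # Evict the least recently used *unpinned* model.
        for name, pinned in self.loaded.items():
            if not pinned:
                del self.loaded[name]
                return
        raise RuntimeError("all loaded models are pinned; cannot evict")


pool = ModelPool(capacity=2)
pool.acquire("chat-7b", pinned=True)   # stays hot
pool.acquire("embed-small")
pool.acquire("code-3b")                # evicts embed-small, not the pinned chat-7b
print(list(pool.loaded))               # ['chat-7b', 'code-3b']
```

Pinning trades memory for latency: a pinned model never pays a reload cost, at the price of shrinking the space available for on-demand models.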
## Status and versioning
Tryll is in pre-release — expect breaking changes until the 1.0 API freeze. Release notes are not published yet. Report problems via the repository issue tracker.