Architecture at a Glance¶
Tryll is a C++ TCP server that runs small language models on the user's own machine and exposes them to games and apps through a binary protocol. This page is the ten-minute mental model: what runs where, what owns what, and what the hot path looks like when a user hits "send".
Every other concept page zooms into one layer of this picture. Read this one first.
The five layers¶
flowchart TB
subgraph Client["Client process (your game / app / script)"]
CL["tryll-client<br>(C++ / Python / Unreal)"]
end
subgraph Server["tryll-server process"]
S[Session<br>one per TCP connection]
A[Agent<br>one per CreateAgent]
G[Graph<br>nodes + routes]
M[Model pool<br>shared models & KV caches]
end
subgraph Hardware["Local hardware"]
HW[CPU / GPU<br>llama.cpp · ONNX · WindowsML]
end
CL -- "TCP :9100<br>FlatBuffers framing" --> S
S --> A
A --> G
G --> M
M --> HW
- The server is a single C++ process. One instance, many clients. It runs small language models on-device through llama.cpp and uses USearch for vector search; everything else — the workflow engine, the wire protocol, logging — is built inside the server.
- A session is one TCP connection. It owns the client's agents, string storages, and embedded string storages. When the socket closes, everything in the session goes with it.
- An agent is one conversation. It carries its own workflow graph, its own dialog history, and its own default model. A single session can own several agents in parallel.
- The graph is a set of nodes wired by exit routes. The agent walks it once per user turn — see Workflows, Graphs, and Nodes.
- Models are loaded once per process by the server and shared across every agent that needs them. The KV cache, by contrast, is per-node, because each node sees a different slice of the dialog.
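The ownership story above can be sketched in a few lines of Python. Every name here (`ModelPool`, `Session`, `Agent`) mirrors the concept, not the real C++ API; it only shows who owns what and what is shared:

```python
class ModelPool:
    """Loads each model once per process and hands out shared references."""
    def __init__(self):
        self._models = {}

    def get(self, name):
        # Load on first request; every later request gets the same instance.
        if name not in self._models:
            self._models[name] = object()  # stands in for a loaded model
        return self._models[name]


class Agent:
    """One conversation: its own graph, dialog history, and default model."""
    def __init__(self, pool, model_name):
        self.model = pool.get(model_name)  # shared across agents
        self.history = []                  # per-agent, never shared


class Session:
    """One TCP connection: owns its agents, which die with the socket."""
    def __init__(self, pool):
        self._pool = pool
        self.agents = []

    def create_agent(self, model_name):
        agent = Agent(self._pool, model_name)
        self.agents.append(agent)
        return agent

    def close(self):
        # Socket closed: everything the session owns goes with it.
        self.agents.clear()
```

Two agents on the same session (or on different sessions) share one model instance, but each keeps its own history — exactly the split the bullets above describe.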
One turn end to end¶
The hot path when a user sends a message:
sequenceDiagram
participant C as Client
participant S as Server
participant A as Agent
participant N as Node (e.g. Generate)
participant M as Model
C->>S: SendMessageRequest(text)
S->>A: dispatch to the agent
A->>N: execute current node
N->>M: generate / stream tokens
M-->>N: tokens
N-->>S: AnswerText(delta, is_final=false)
S-->>C: AnswerText frame
Note over N,M: repeats until stop
N-->>A: exit route (default / found / …)
A->>A: follow route to next node or END
A-->>S: TurnComplete(status)
S-->>C: TurnComplete frame
A turn is always atomic at the agent level. A second SendMessage for the same agent while a turn is live is rejected with error 3001; in-flight work keeps running.
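The client's side of this loop can be sketched as follows. The `(kind, payload)` tuples are a stand-in for decoded FlatBuffers frames, and the string kinds are illustrative — the real client libraries hide all of this behind their own APIs:

```python
def consume_turn(frames):
    """Accumulate streamed AnswerText deltas until TurnComplete arrives.

    `frames` yields (kind, payload) tuples — a simplified stand-in for
    decoded wire frames, not the real FlatBuffers types.
    """
    text = []
    for kind, payload in frames:
        if kind == "AnswerText":
            text.append(payload["delta"])       # streamed token delta
        elif kind == "TurnComplete":
            return "".join(text), payload["status"]
        elif kind == "Error" and payload.get("code") == 3001:
            # A second SendMessage while a turn is live is rejected;
            # the in-flight turn keeps streaming on its own.
            raise RuntimeError("turn already in progress (3001)")
    raise ConnectionError("stream ended before TurnComplete")
```

The key property mirrored here is that AnswerText frames arrive as deltas and the turn only ends when the TurnComplete frame does.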
Concurrency, in one line¶
Turns on one session run serially — if agent A is mid-turn, agent B on the same session waits. Inference is serialised across the whole server, because there is only one GPU / CPU to talk to. Non-inference work (routing, retrieval, pattern checks) can happen in parallel.
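As a sketch (assumed lock structure, not the server's actual implementation), the rule is two locks at different scopes — one per session for turns, one per process for inference:

```python
import threading

# One GPU/CPU to talk to: inference is serialised across the whole server.
inference_lock = threading.Lock()


class Session:
    def __init__(self):
        # Turns on one session run serially: agent B waits on agent A.
        self.turn_lock = threading.Lock()

    def run_turn(self, steps, log):
        with self.turn_lock:
            for step in steps:
                if step == "infer":
                    with inference_lock:   # server-wide bottleneck
                        log.append("infer")
                else:
                    # Routing, retrieval, pattern checks: no inference
                    # lock needed, so this can overlap other sessions.
                    log.append(step)
```

The point of the sketch: non-inference steps only hold the session's own lock, so two sessions can route and retrieve concurrently even though their model calls queue up.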
The wire protocol¶
Clients speak to the server over TCP (default port 9100) using length-prefixed FlatBuffers messages. The shape every message takes — and the end-to-end session flow — is documented in Wire Protocol.
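Length-prefixed framing looks roughly like this. The 4-byte little-endian prefix is an assumption for illustration — the Wire Protocol page defines the framing the server actually expects:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a serialized message with its byte length.

    Assumes a 4-byte little-endian prefix; see Wire Protocol for
    the authoritative framing.
    """
    return struct.pack("<I", len(payload)) + payload


def read_frame(buf: bytes):
    """Split one complete message off the front of a receive buffer.

    Returns (message, rest), or (None, buf) if the buffer does not
    yet hold a complete frame — TCP gives you a byte stream, so a
    frame can arrive split across reads.
    """
    if len(buf) < 4:
        return None, buf                       # prefix not complete yet
    (length,) = struct.unpack_from("<I", buf)
    if len(buf) < 4 + length:
        return None, buf                       # body not complete yet
    return buf[4:4 + length], buf[4 + length:]
```

The `(None, buf)` return on short reads is the important part: a client must buffer bytes until a whole frame is present before handing it to the FlatBuffers decoder.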
The session lifecycle, in one line: connect → create agents and storages → exchange turns → disconnect, at which point the server destroys everything the session owned.
The three client libraries (C++, Python, Unreal) are thin wrappers over this protocol. Nothing special happens on the client side; the intelligence lives in the server.
Why it is shaped this way¶
Three product constraints drove the split:
- On-device. A game or tool that wants a local model cannot afford a Python runtime or a separate inference microservice per platform. A single C++ server binary plus tiny clients keeps every integration small.
- Workflow over prompt. Turning a raw model into useful behaviour (RAG, guardrails, tool calls, canned refusals, …) needs composition. Tryll makes the composition first-class by giving you a graph instead of a prompt template.
- Shared, not duplicated, compute. Models are big, KV caches are bigger. The server loads each model once and lets every agent share it; eviction rules are explicit (model management).
Edges and pitfalls¶
- A session is owned by its socket. If the TCP connection drops, the server destroys every agent and storage created through it. Client libraries surface this as a connection-changed event — reconnect and re-create what you need.
- A KV cache is per node, not per agent. Two nodes in the same graph that use the same model still have separate caches; that is intentional, because they see different prompt prefixes. See Projection and Token Budgets.
- The server never runs user-supplied code. Tool calls are detection-only — see Tool Calling.
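The first pitfall has a standard client-side answer: remember what you created so the connection-changed event can trigger a rebuild. This is a hypothetical wrapper — `create_agent` here is a placeholder factory, not the real client-library call:

```python
class ReconnectingClient:
    """Sketch of reconnect handling over a hypothetical client API.

    After a drop, nothing the session owned still exists server-side,
    so the wrapper keeps the recipes needed to re-create its agents.
    """
    def __init__(self, create_agent):
        self._create_agent = create_agent  # factory: spec -> server-side agent
        self._specs = []                   # recipes for rebuilding after a drop
        self.agents = {}

    def create_agent(self, name, **spec):
        self._specs.append((name, spec))
        self.agents[name] = self._create_agent(**spec)
        return self.agents[name]

    def on_connection_changed(self, connected):
        if connected:
            # The old session and its agents are gone; re-create them all.
            self.agents = {n: self._create_agent(**s) for n, s in self._specs}
```

Dialog history is part of the destroyed session state too, so an app that needs continuity across reconnects has to replay or restore it itself.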