Architecture at a Glance

Tryll is a C++ TCP server that runs small language models on the user's own machine and exposes them to games and apps through a binary protocol. This page is the ten-minute mental model: what runs where, what owns what, and what the hot path looks like when a user hits "send".

Every other concept page zooms into one layer of this picture. Read this one first.

The five layers

flowchart TB
    subgraph Client["Client process (your game / app / script)"]
        CL["tryll-client<br>(C++ / Python / Unreal)"]
    end
    subgraph Server["tryll-server process"]
        S[Session<br>one per TCP connection]
        A[Agent<br>one per CreateAgent]
        G[Graph<br>nodes + routes]
        M[Model pool<br>shared models &amp; KV caches]
    end
    subgraph Hardware["Local hardware"]
        HW[CPU / GPU<br>llama.cpp · ONNX · WindowsML]
    end
    CL -- "TCP :9100<br>FlatBuffers framing" --> S
    S --> A
    A --> G
    G --> M
    M --> HW
  • The server is a single C++ process. One instance, many clients. It runs small language models on-device through llama.cpp and uses USearch for vector search; everything else — the workflow engine, the wire protocol, logging — is built inside the server.
  • A session is one TCP connection. It owns the client's agents, string storages, and embedded string storages. When the socket closes, everything in the session goes with it.
  • An agent is one conversation. It carries its own workflow graph, its own dialog history, and its own default model. A single session can own several agents in parallel.
  • The graph is a set of nodes wired by exit routes. The agent walks it once per user turn — see Workflows, Graphs, and Nodes.
  • Models are loaded once per process by the server and shared across every agent that needs them. The KV cache, by contrast, is per-node, because each node sees a different slice of the dialog; the sketch after this list illustrates both rules.
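
The ownership rules above condense to a few lines of code. A minimal sketch, in Python for brevity (the real server is C++), with every name invented for illustration:

```python
# Illustrative only: Python for brevity (the real server is C++),
# and every name here is invented.

def load_model(name: str):
    # Stand-in for the real llama.cpp / ONNX load.
    return f"<weights:{name}>"

class ModelPool:
    """Loads each model once per process; every agent shares the result."""
    def __init__(self):
        self._models = {}

    def get(self, name: str):
        if name not in self._models:
            self._models[name] = load_model(name)
        return self._models[name]          # same object for every caller

class Node:
    """One node in a graph. The KV cache lives here, not on the agent."""
    def __init__(self, pool: ModelPool, model_name: str):
        self.model = pool.get(model_name)  # shared weights
        self.kv_cache = []                 # private: this node's prompt prefix

class Session:
    """One TCP connection. Owns its agents; closing the socket drops them."""
    def __init__(self):
        self.agents = []

    def close(self):
        self.agents.clear()                # agents, histories, storages all go
```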

One turn end to end

The hot path when a user sends a message:

sequenceDiagram
    participant C as Client
    participant S as Server
    participant A as Agent
    participant N as Node (e.g. Generate)
    participant M as Model

    C->>S: SendMessageRequest(text)
    S->>A: dispatch to the agent
    A->>N: execute current node
    N->>M: generate / stream tokens
    M-->>N: tokens
    N-->>S: AnswerText(delta, is_final=false)
    S-->>C: AnswerText frame
    Note over N,M: repeats until stop
    N-->>A: exit route (default / found / …)
    A->>A: follow route to next node or END
    A-->>S: TurnComplete(status)
    S-->>C: TurnComplete frame

A turn is always atomic at the agent level. A second SendMessage for the same agent while a turn is live is rejected with error code 3001; the turn already in flight keeps running.
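
Seen from the client, one turn is a short read loop. A hedged sketch, with invented method and field names (send_message, frame.kind, frame.delta, TryllError) rather than the actual tryll-client API:

```python
class TryllError(Exception):
    """Hypothetical client-side error carrying the server's status code."""
    def __init__(self, code: int, message: str = ""):
        super().__init__(message)
        self.code = code

def run_turn(agent, text: str) -> str:
    """Send one message and collect the streamed reply."""
    reply = []
    try:
        for frame in agent.send_message(text):    # hypothetical streaming iterator
            if frame.kind == "AnswerText":
                reply.append(frame.delta)         # partial text, is_final=False
            elif frame.kind == "TurnComplete":
                break                             # turn finished; status in frame
    except TryllError as err:
        if err.code == 3001:                      # agent already mid-turn
            raise RuntimeError("agent busy; wait for the live turn") from err
        raise
    return "".join(reply)
```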

Concurrency, in one line

Turns on one session run serially — if agent A is mid-turn, agent B on the same session waits. Inference is serialised across the whole server, because there is only one GPU / CPU to talk to. Non-inference work (routing, retrieval, pattern checks) can happen in parallel.
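
One way to picture these rules is two locks, sketched below with invented names: a per-session lock that serialises turns, and a process-wide lock that serialises inference.

```python
import threading

# Process-wide: only one model call touches the CPU/GPU at a time.
INFERENCE_LOCK = threading.Lock()

class Session:
    def __init__(self):
        # Per-session: agent B waits while agent A is mid-turn.
        self.turn_lock = threading.Lock()

    def run_turn(self, agent, text: str):
        with self.turn_lock:
            for node in agent.walk(text):       # hypothetical graph walk
                if node.needs_inference:
                    with INFERENCE_LOCK:        # serialised server-wide
                        node.generate()
                else:
                    node.run()                  # routing, retrieval, pattern
                                                # checks: no global lock needed
```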

The wire protocol

Clients speak to the server over TCP (default port 9100) using length-prefixed FlatBuffers messages. The shape every message takes — and the end-to-end session flow — is documented in Wire Protocol.
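
Framing is the only thing a client must get right before FlatBuffers takes over. A sketch assuming a 4-byte little-endian length prefix (the common FlatBuffers size-prefix convention); the Wire Protocol page is authoritative for the exact layout:

```python
import socket
import struct

def send_frame(sock: socket.socket, payload: bytes) -> None:
    # Length prefix first, then the FlatBuffers payload.
    sock.sendall(struct.pack("<I", len(payload)) + payload)

def recv_frame(sock: socket.socket) -> bytes:
    size = struct.unpack("<I", recv_exactly(sock, 4))[0]
    return recv_exactly(sock, size)

def recv_exactly(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf
```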

The session lifecycle, in one line:

SessionReady → ConfigureSession → (CreateStringStorage…)* → CreateAgent → SendMessage …

The three client libraries (C++, Python, Unreal) are thin wrappers over this protocol. Nothing special happens on the client side; the intelligence lives in the server.
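
With the framing helpers above, the lifecycle reads as a straight script. The build() function below stands in for real FlatBuffers serialisation against the Tryll schema; it is not an actual API:

```python
import socket

def build(message: str, **fields) -> bytes:
    # Stand-in: a real client serialises a FlatBuffers table here.
    raise NotImplementedError

with socket.create_connection(("127.0.0.1", 9100)) as sock:
    recv_frame(sock)                                   # wait for SessionReady
    send_frame(sock, build("ConfigureSession"))
    send_frame(sock, build("CreateAgent"))
    send_frame(sock, build("SendMessageRequest", text="hi"))
    # then read AnswerText frames until TurnComplete, as in the turn sketch
```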

Why it is shaped this way

Three product constraints drove the split:

  • On-device. A game or tool that wants a local model cannot afford a Python runtime or a separate inference microservice per platform. A single C++ server binary plus tiny clients keeps every integration small.
  • Workflow over prompt. Turning a raw model into useful behaviour (RAG, guardrails, tool calls, canned refusals, …) needs composition. Tryll makes the composition first-class by giving you a graph instead of a prompt template.
  • Shared, not duplicated, compute. Models are big, KV caches are bigger. The server loads each model once and lets every agent share it; eviction rules are explicit (see model management).

Edges and pitfalls

  • A session is owned by its socket. If the TCP connection drops, the server destroys every agent and storage created through it. Client libraries surface this as a connection-changed event; reconnect and re-create what you need (a sketch follows this list).
  • A KV cache is per node, not per agent. Two nodes in the same graph that use the same model still have separate caches; that is intentional, because they see different prompt prefixes. See Projection and Token Budgets.
  • The server never runs user-supplied code. Tool calls are detection-only — see Tool Calling.
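
A recovery handler for the first pitfall can be mechanical. A sketch with an invented client API, rebuilding state in lifecycle order (configure, then storages, then agents):

```python
def on_connection_changed(client, connected: bool) -> None:
    """Invented callback shape; every method and name here is illustrative."""
    if not connected:
        client.reconnect()                   # or schedule a retry with backoff
        return
    # Reconnected: the server destroyed all session state, so rebuild it
    # in lifecycle order.
    client.configure_session()
    client.create_string_storage("lore")     # storages before agents
    client.create_agent("npc_guard")
```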