Use Voice Input

Voice Input is the Tryll subsystem that streams raw PCM audio to the server, runs it through a speech-to-text (STT) model, and delivers transcripts back to your application. This page covers the full lifecycle: creating a VoiceInput handle, registering transcript callbacks, starting and finishing utterances, streaming audio, and cleaning up. It ends with links to the underlying wire messages.

Prerequisites

  • A connected session configured with an STT engine:
// C++ — enable Sherpa-ONNX for STT
client.ConfigureSession(Tryll::InferenceEngine::Mock, /* LLM engine */
                        Tryll::InferenceEngine::SherpaOnnx);
# Python
client.configure_session(
    engine=InferenceEngine.LlamaCpp,
    stt_engine=InferenceEngine.SherpaOnnx,
)
// Unity (C#)
await client.ConfigureSessionAsync(new SessionConfig
{
    Engine    = TryllInferenceEngine.LlamaCpp,
    SttEngine = TryllInferenceEngine.SherpaOnnx,
});
// Unreal (C++)
FTryllSessionConfig Cfg;
Cfg.Engine    = ETryllInferenceEngine::LlamaCpp;
Cfg.SttEngine = ETryllInferenceEngine::SherpaOnnx;
Subsystem->ConfigureSession(Cfg, OnDone);

1. Create a VoiceInput handle

A VoiceInput is a stateful server-side object that holds the loaded STT model and the current utterance state. Create one per microphone source; most games need only one.

C++

#include <tryll/VoiceInput.h>

Tryll::VoiceInputConfig cfg;
cfg.modelName        = "Parakeet TDT 0.6B v2 (int8)";
cfg.inputFormat      = { .sampleRate = 48000, .channels = 1, .bitsPerSample = 16 };
cfg.vadThreshold     = 0.5f;   // Silero VAD speech-probability threshold
cfg.vadMinSilenceMs  = 500;    // silence that ends a segment (ms)
cfg.vadSpeechPadMs   = 250;    // padding around detected speech (ms)

auto vi = client.CreateVoiceInput(cfg);

Python

voice_input_id = client.create_voice_input(
    model_name="Parakeet TDT 0.6B v2 (int8)",
    sample_rate=48000, channels=1, bits_per_sample=16,
)

Unity (C#)

var cfg = new VoiceInputConfig
{
    ModelName       = "Parakeet TDT 0.6B v2 (int8)",
    InputFormat     = new AudioFormat { SampleRate = 48000, Channels = 1, BitsPerSample = 16 },
    VadThreshold    = 0.5f,
    VadMinSilenceMs = 500,
    VadSpeechPadMs  = 250,
};
var (voice, err) = await client.CreateVoiceInputAsync(cfg);
if (!err.IsOk) { /* handle */ }

Unreal (C++)

FTryllVoiceInputConfig Config;
Config.ModelName              = TEXT("Parakeet TDT 0.6B v2 (int8)");
Config.InputFormat.SampleRate = 48000;

Subsystem->CreateVoiceInput(Config,
    [](TSharedPtr<FTryllVoiceInput> Voice, FTryllError Err)
    {
        if (!Err.IsOk()) { /* handle */ return; }
        // Store Voice — it is now ready for BeginUtterance.
    });

The server loads the model (or reuses a cached copy) and returns a voice_input_id. This id is embedded in the handle objects in C++, Unity, and Unreal; it is returned directly in Python.
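
To build intuition for how the three VAD fields interact, here is a simplified Python model of threshold-based segmentation. It is not the server's implementation: it assumes one speech probability per fixed-length frame, and the function name, the `frame_ms` parameter, and the padding behavior at the edges are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def segment_speech(
    probs: List[float],
    frame_ms: int = 10,          # hypothetical frame length
    threshold: float = 0.5,      # plays the role of vadThreshold
    min_silence_ms: int = 500,   # plays the role of vadMinSilenceMs
    speech_pad_ms: int = 250,    # plays the role of vadSpeechPadMs
) -> List[Tuple[int, int]]:
    """Return padded (start_ms, end_ms) speech segments."""
    total_ms = len(probs) * frame_ms
    raw: List[Tuple[int, int]] = []
    start_ms: Optional[int] = None  # start of the segment being built
    last_speech_end = 0             # end time of the latest speech frame

    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= threshold:
            if start_ms is None:
                start_ms = t        # rising edge: speech begins
            last_speech_end = t + frame_ms
        elif start_ms is not None:
            silence_ms = (t + frame_ms) - last_speech_end
            if silence_ms >= min_silence_ms:
                raw.append((start_ms, last_speech_end))
                start_ms = None     # enough silence: segment ends

    if start_ms is not None:        # audio ended while still in speech
        raw.append((start_ms, last_speech_end))

    # Pad each segment on both sides, clamped to the audio bounds.
    return [
        (max(0, s - speech_pad_ms), min(total_ms, e + speech_pad_ms))
        for s, e in raw
    ]
```

In this model, a 200 ms pause between two bursts of speech stays inside one segment, while a 600 ms pause splits them. Raising `min_silence_ms` tolerates longer pauses at the cost of a later segment end.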

2. Set up transcript callbacks

Register a callback before starting utterances so you receive transcripts as they arrive. Tryll delivers four update kinds:

Kind            When
--------------  --------------------------------------------------
SpeechStart     VAD detects rising edge; no text yet
Partial         In-progress hypothesis from online (streaming) STT
SegmentFinal    Engine committed a chunk; utterance still open
UtteranceFinal  Last update for this Begin/End cycle

C++

vi->SetOnTranscriptUpdate([](const Tryll::TranscriptUpdate& u) {
    if (u.kind == Tryll::TranscriptUpdateKind::UtteranceFinal)
        ProcessFinalTranscript(u.text);
});

Unity (C#)

voice.OnTranscriptUpdate += update =>
{
    if (update.Kind == TranscriptUpdateKind.UtteranceFinal)
        ProcessFinalTranscript(update.Text);
};

Unreal (C++)

Voice->OnTranscriptUpdate.AddLambda([](const FTryllTranscriptUpdate& U)
{
    if (U.Kind == ETryllTranscriptUpdateKind::UtteranceFinal)
        ProcessFinalTranscript(U.Text);
});
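
The examples above react only to UtteranceFinal. If you also want to show live text while the player is speaking, you need to fold all four kinds into one display string. The Python sketch below shows one way to do that. The `Kind` enum and `TranscriptView` class are hypothetical and not part of any Tryll SDK, and the sketch assumes that each Partial replaces the previous hypothesis for the current segment, that SegmentFinal carries only that segment's text, and that UtteranceFinal carries the complete transcript. Check those assumptions against the WireTranscriptUpdate reference before relying on them.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Kind(Enum):
    """Hypothetical mirror of the four update kinds."""
    SPEECH_START = auto()
    PARTIAL = auto()
    SEGMENT_FINAL = auto()
    UTTERANCE_FINAL = auto()


@dataclass
class TranscriptView:
    committed: List[str] = field(default_factory=list)  # finished segments
    partial: str = ""                  # latest in-progress hypothesis
    speaking: bool = False
    final_text: Optional[str] = None   # set once the utterance closes

    def apply(self, kind: Kind, text: str = "") -> None:
        if kind is not Kind.UTTERANCE_FINAL:
            self.final_text = None     # a new cycle invalidates the old final
        if kind is Kind.SPEECH_START:
            self.speaking = True
        elif kind is Kind.PARTIAL:
            self.partial = text        # replace, do not append
        elif kind is Kind.SEGMENT_FINAL:
            self.committed.append(text)
            self.partial = ""
        elif kind is Kind.UTTERANCE_FINAL:
            self.final_text = text
            self.committed.clear()
            self.partial = ""
            self.speaking = False

    def display(self) -> str:
        """Text to show right now: the final if present, else live text."""
        if self.final_text is not None:
            return self.final_text
        return " ".join(p for p in [*self.committed, self.partial] if p)
```

Call `apply` from your transcript callback and render `display()` each frame; the live text is replaced by the final transcript as soon as UtteranceFinal arrives.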

3. Begin an utterance

BeginUtterance opens a recording window and arms the VAD.

C++

Tryll::UtteranceOptions opts;
opts.autoSendAgentId     = myAgent->Id();  // 0 for transcribe-only
opts.autoFinishOnSilence = true;
opts.maxUtteranceMs      = 30000;
vi->BeginUtterance(opts);

Python

The Python client currently requires constructing the request manually via the wire codec (encode_begin_utterance_request in tryll_client.codec).

Unity (C#)

var opts = new UtteranceOptions
{
    AutoSendAgentId     = agentId,   // 0 = transcribe-only
    AutoFinishOnSilence = true,
    MaxUtteranceMs      = 30000,
};
voice.BeginUtterance(opts);

Unreal (C++)

FTryllUtteranceOptions Opts;
Opts.AutoSendAgentId      = AgentId;  // 0 = transcribe-only
Opts.bAutoFinishOnSilence = true;
Opts.MaxUtteranceMs       = 30000;
Voice->BeginUtterance(Opts);

When autoSendAgentId is non-zero, the server automatically forwards the final transcript to that agent as a SendMessageRequest when the utterance closes, so no extra round-trip is needed.

4. Stream audio

Send PCM chunks as soon as they arrive from the capture device. Each SendAudioBuffer call is fire-and-forget: the server sends no acknowledgement and applies no backpressure.

C++

// Called from your audio capture callback at ~20 ms intervals.
vi->SendAudioBuffer(pcmChunk.data(), pcmChunk.size());

Unity (C#)

// Called from Unity's OnAudioFilterRead or a microphone polling coroutine.
voice.SendAudioBuffer(pcmBytes);

Unreal (C++)

// Called from the audio capture thread.
Voice->SendAudioBuffer(PcmChunk);

Audio must be raw signed 16-bit PCM at the sample rate declared in inputFormat. The server resamples to the model's expected rate as needed. Compressed formats (Opus, MP3, etc.) are not supported on the voice path.
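
Many capture APIs hand you float samples rather than 16-bit integers, so a conversion and chunking step is often needed before sending. The Python sketch below uses only the standard library. The helper names are hypothetical, and little-endian byte order is an assumption, since this page does not state the expected byte order.

```python
import struct
from typing import Iterator, Sequence


def chunk_size_bytes(sample_rate: int, channels: int,
                     bits_per_sample: int, chunk_ms: int) -> int:
    """Bytes of PCM covering chunk_ms of audio in the given format."""
    bytes_per_frame = channels * bits_per_sample // 8
    return sample_rate * chunk_ms // 1000 * bytes_per_frame


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM.

    Out-of-range values are clamped. Little-endian is assumed.
    """
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

For the 48000 Hz, mono, 16-bit format used throughout this page, `chunk_size_bytes(48000, 1, 16, 20)` gives 1920 bytes per 20 ms chunk, and a full 30000 ms utterance amounts to 2880000 bytes.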

5. End or cancel an utterance

If autoFinishOnSilence is true, the server closes the utterance automatically once it detects a pause. You can also close it explicitly:

vi->EndUtterance();   // C++ — commit; produces UtteranceFinal
vi->CancelUtterance(); // C++ — discard; no UtteranceFinal fired
voice.EndUtterance();    // Unity (C#)
voice.CancelUtterance(); // Unity (C#)
Voice->EndUtterance();    // Unreal (C++)
Voice->CancelUtterance(); // Unreal (C++)

6. Input modes

Push-to-talk

Call BeginUtterance when the player presses a button and EndUtterance when they release it. Disable autoFinishOnSilence so VAD does not cut short a deliberate pause mid-sentence.

Hands-free (VAD-driven)

Keep autoFinishOnSilence = true and leave the utterance open. The server fires UtteranceFinal after each detected pause and automatically reopens for the next one.
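
Push-to-talk can be sketched as a small state machine that sits between your input system and the VoiceInput calls. In the Python sketch below, `begin`, `end`, and `send` are hypothetical callables that you would bind to whatever your SDK exposes for BeginUtterance, EndUtterance, and SendAudioBuffer. This page does not say how the server treats audio sent outside an utterance, so the sketch simply does not send it.

```python
from typing import Callable


class PushToTalk:
    """Maps button events to utterance calls (illustrative only)."""

    def __init__(self,
                 begin: Callable[[], None],
                 end: Callable[[], None],
                 send: Callable[[bytes], None]) -> None:
        self._begin = begin
        self._end = end
        self._send = send
        self._open = False  # True while an utterance is open

    def on_press(self) -> None:
        if not self._open:      # ignore key-repeat events
            self._begin()
            self._open = True

    def on_release(self) -> None:
        if self._open:          # ignore a release with no matching press
            self._end()
            self._open = False

    def on_audio(self, chunk: bytes) -> None:
        if self._open:          # drop audio while the button is up
            self._send(chunk)
```

For hands-free mode the same wrapper is unnecessary: you would open the utterance once and forward every captured chunk.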

7. Destroy the handle

Always destroy the VoiceInput when you are done with it. This frees the model slot on the server.

vi->Destroy();  // C++
voice.Dispose();  // Unity (C#)
Subsystem->RequestDestroyVoiceInput(Voice);  // Unreal (C++)
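
In languages without deterministic destructors, it is easy to skip the destroy call when an exception interrupts the session. One way to guard against that in Python is a context manager. The sketch below is generic: `create` and `destroy` are hypothetical callables, because this page does not show the Python client's destroy call.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def scoped_handle(create: Callable[[], T],
                  destroy: Callable[[T], None]) -> Iterator[T]:
    """Create a handle and guarantee destroy(handle) runs on exit."""
    handle = create()
    try:
        yield handle
    finally:
        destroy(handle)  # runs on normal exit and on exceptions
```

Wrapping the voice session in `with scoped_handle(...) as handle:` keeps the server-side model slot from leaking if the code inside the block raises.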

Next steps

  • CreateVoiceInputRequest — wire message sent to start a VoiceInput session.
  • BeginUtteranceRequest — wire message sent to open an audio capture window.
  • WireTranscriptUpdate — server push carrying incremental and final transcript text.