Use Voice Input¶

Voice Input is the Tryll subsystem that streams raw PCM audio to the server, runs it through a speech-to-text (STT) model, and delivers transcripts back to your application. This page covers the full lifecycle: creating a VoiceInput handle, registering transcript callbacks, beginning and ending utterances, streaming audio, and destroying the handle. It ends with links to hotword biasing and model management.

Prerequisites¶

A connected session configured with an STT engine:

// C++ — enable Sherpa-ONNX for STT
client.ConfigureSession(Tryll::InferenceEngine::Mock, /* LLM engine */
                        Tryll::InferenceEngine::SherpaOnnx);

# Python
client.configure_session(
    engine=InferenceEngine.LlamaCpp,
    stt_engine=InferenceEngine.SherpaOnnx,
)

// Unity (C#)
await client.ConfigureSessionAsync(new SessionConfig
{
    Engine    = TryllInferenceEngine.LlamaCpp,
    SttEngine = TryllInferenceEngine.SherpaOnnx,
});

// Unreal (C++)
FTryllSessionConfig Cfg;
Cfg.Engine    = ETryllInferenceEngine::LlamaCpp;
Cfg.SttEngine = ETryllInferenceEngine::SherpaOnnx;
Subsystem->ConfigureSession(Cfg, OnDone);

1. Create a VoiceInput handle¶

A VoiceInput is a stateful server-side object that holds the loaded STT model and the current utterance state. Create one per microphone source; most games need only one.

C++

#include <tryll/VoiceInput.h>

Tryll::VoiceInputConfig cfg;
cfg.modelName        = "Parakeet TDT 0.6B v2 (int8)";
cfg.inputFormat      = { .sampleRate = 48000, .channels = 1, .bitsPerSample = 16 };
cfg.vadThreshold     = 0.5f;   // Silero VAD speech-probability threshold
cfg.vadMinSilenceMs  = 500;    // silence that ends a segment (ms)
cfg.vadSpeechPadMs   = 250;    // padding around detected speech (ms)

auto vi = client.CreateVoiceInput(cfg);

Python

voice_input_id = client.create_voice_input(
    model_name="Parakeet TDT 0.6B v2 (int8)",
    sample_rate=48000, channels=1, bits_per_sample=16,
)

Unity (C#)

var cfg = new VoiceInputConfig
{
    ModelName       = "Parakeet TDT 0.6B v2 (int8)",
    InputFormat     = new AudioFormat { SampleRate = 48000, Channels = 1, BitsPerSample = 16 },
    VadThreshold    = 0.5f,
    VadMinSilenceMs = 500,
    VadSpeechPadMs  = 250,
};
var (voice, err) = await client.CreateVoiceInputAsync(cfg);
if (!err.IsOk) { /* handle */ }

Unreal (C++)

FTryllVoiceInputConfig Config;
Config.ModelName              = TEXT("Parakeet TDT 0.6B v2 (int8)");
Config.InputFormat.SampleRate = 48000;

Subsystem->CreateVoiceInput(Config,
    [](TSharedPtr<FTryllVoiceInput> Voice, FTryllError Err)
    {
        if (!Err.IsOk()) { /* handle */ return; }
        // Store Voice — it is now ready for BeginUtterance.
    });

The server loads the model (or reuses a cached copy) and returns a voice_input_id. This id is embedded in the handle objects in C++, Unity, and Unreal; it is returned directly in Python.

2. Set up transcript callbacks¶

Register a callback before starting utterances so you receive transcripts as they arrive. Tryll delivers four update kinds:

Kind	When
`SpeechStart`	VAD detects rising edge; no text yet
`Partial`	In-progress hypothesis from online (streaming) STT
`SegmentFinal`	Engine committed a chunk; utterance still open
`UtteranceFinal`	Last update for this Begin/End cycle

C++

vi->SetOnTranscriptUpdate([](const Tryll::TranscriptUpdate& u) {
    if (u.kind == Tryll::TranscriptUpdateKind::UtteranceFinal)
        ProcessFinalTranscript(u.text);
});

Unity (C#)

voice.OnTranscriptUpdate += update =>
{
    if (update.Kind == TranscriptUpdateKind.UtteranceFinal)
        ProcessFinalTranscript(update.Text);
};

Unreal (C++)

Voice->OnTranscriptUpdate.AddLambda([](const FTryllTranscriptUpdate& U)
{
    if (U.Kind == ETryllTranscriptUpdateKind::UtteranceFinal)
        ProcessFinalTranscript(U.Text);
});
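
This page shows no Python callback example. As a rough sketch of how the four update kinds might be folded into an on-screen caption, the block below defines its own stand-in types: `UpdateKind`, `TranscriptUpdate`, and `CaptionBuilder` are hypothetical names for illustration, not part of the Tryll Python client. It also assumes that a `SegmentFinal` carries only the text of its own segment and that `UtteranceFinal` carries the complete utterance; check the wire reference before relying on either.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class UpdateKind(Enum):
    """Hypothetical mirror of the four update kinds listed in the table."""
    SPEECH_START = auto()
    PARTIAL = auto()
    SEGMENT_FINAL = auto()
    UTTERANCE_FINAL = auto()


@dataclass
class TranscriptUpdate:
    """Hypothetical stand-in for a transcript update (kind plus text)."""
    kind: UpdateKind
    text: str = ""


@dataclass
class CaptionBuilder:
    """Folds a stream of updates into a live caption and a list of finished utterances."""
    committed: list = field(default_factory=list)  # text of committed segments
    partial: str = ""                              # current in-progress hypothesis
    finished: list = field(default_factory=list)   # one entry per UtteranceFinal

    def on_update(self, update: TranscriptUpdate) -> None:
        if update.kind is UpdateKind.SPEECH_START:
            self.partial = ""                    # speech detected, no text yet
        elif update.kind is UpdateKind.PARTIAL:
            self.partial = update.text           # newest hypothesis replaces the old one
        elif update.kind is UpdateKind.SEGMENT_FINAL:
            self.committed.append(update.text)   # assumption: text covers this segment only
            self.partial = ""
        elif update.kind is UpdateKind.UTTERANCE_FINAL:
            self.finished.append(update.text)    # assumption: text is the full utterance
            self.committed.clear()
            self.partial = ""

    def caption(self) -> str:
        """Text to display while the utterance is still open."""
        return " ".join(part for part in [*self.committed, self.partial] if part)
```

Keeping committed segments separate from the current partial lets the display replace only the unstable tail of the caption on each `Partial` update.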

3. Begin an utterance¶

BeginUtterance opens a recording window and arms the VAD.

C++

Tryll::UtteranceOptions opts;
opts.autoSendAgentId     = myAgent->Id();  // 0 for transcribe-only
opts.autoFinishOnSilence = true;
opts.maxUtteranceMs      = 30000;
vi->BeginUtterance(opts);

Python

The Python client currently requires you to construct the request manually via the wire codec (encode_begin_utterance_request in tryll_client.codec).

Unity (C#)

var opts = new UtteranceOptions
{
    AutoSendAgentId     = agentId,   // 0 = transcribe-only
    AutoFinishOnSilence = true,
    MaxUtteranceMs      = 30000,
};
voice.BeginUtterance(opts);

Unreal (C++)

FTryllUtteranceOptions Opts;
Opts.AutoSendAgentId      = AgentId;  // 0 = transcribe-only
Opts.bAutoFinishOnSilence = true;
Opts.MaxUtteranceMs       = 30000;
Voice->BeginUtterance(Opts);

When autoSendAgentId is non-zero, the server automatically forwards the final transcript to that agent as a SendMessageRequest when the utterance closes, so no extra round-trip is needed.

4. Stream audio¶

Send PCM chunks as fast as they arrive. Each send is fire-and-forget: the server returns no acknowledgement and applies no backpressure.

C++

// Called from your audio capture callback at ~20 ms intervals.
vi->SendAudioBuffer(pcmChunk.data(), pcmChunk.size());

Unity (C#)

// Called from Unity's OnAudioFilterRead or a microphone polling coroutine.
voice.SendAudioBuffer(pcmBytes);

Unreal (C++)

// Called from the audio capture thread.
Voice->SendAudioBuffer(PcmChunk);

Audio must be raw signed 16-bit PCM at the sample rate declared in inputFormat. The server resamples to the model's expected rate as needed. Compressed encodings (Opus, MP3, etc.) are not supported on the voice path.
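
To make the format requirement concrete, here is a minimal sketch of the arithmetic and conversion involved, using only the Python standard library. The helper names are hypothetical and not part of any Tryll client. The sketch assumes little-endian byte order, which this page does not state, and float input samples in the range [-1.0, 1.0], which is what many capture APIs deliver.

```python
import struct


def bytes_per_chunk(sample_rate: int, channels: int, bits_per_sample: int, chunk_ms: int) -> int:
    """Size in bytes of chunk_ms milliseconds of raw PCM in the given format."""
    frames = sample_rate * chunk_ms // 1000
    return frames * channels * (bits_per_sample // 8)


def floats_to_pcm16(samples) -> bytes:
    """Convert float samples in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    # Clamp after scaling so out-of-range input saturates instead of wrapping.
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_chunks(pcm: bytes, chunk_bytes: int):
    """Yield successive slices of chunk_bytes bytes; the last slice may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

With the 48 kHz, mono, 16-bit format used in the examples on this page, `bytes_per_chunk(48000, 1, 16, 20)` gives 1920 bytes for a 20 ms chunk.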

5. End or cancel an utterance¶

If autoFinishOnSilence is true, the server closes the utterance automatically on silence. You can also close it explicitly:

vi->EndUtterance();   // C++ — commit; produces UtteranceFinal
vi->CancelUtterance(); // C++ — discard; no UtteranceFinal fired

voice.EndUtterance();    // Unity (C#)
voice.CancelUtterance(); // Unity (C#)

Voice->EndUtterance();    // Unreal (C++)
Voice->CancelUtterance(); // Unreal (C++)

6. Input modes¶

Push-to-talk¶

Call BeginUtterance when the player presses a button and EndUtterance when they release it. Disable autoFinishOnSilence so VAD does not cut short a deliberate pause mid-sentence.
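
The press and release handling might be sketched as follows. `PushToTalk` is a hypothetical helper, and `voice` is assumed to be a wrapper you write yourself that exposes `begin_utterance(**opts)`, `end_utterance()`, and `cancel_utterance()`; the Tryll Python client does not document such methods. The `min_hold_ms` threshold is an illustrative design choice, not a Tryll setting.

```python
class PushToTalk:
    """Maps button press and release to Begin/End/Cancel on a voice-input-like object.

    `voice` is a hypothetical wrapper, not a documented Tryll class.
    """

    def __init__(self, voice, agent_id: int = 0, min_hold_ms: int = 150):
        self.voice = voice
        self.agent_id = agent_id        # 0 = transcribe-only
        self.min_hold_ms = min_hold_ms  # shorter presses are treated as accidental
        self._pressed_at = None

    def on_press(self, now_ms: int) -> None:
        if self._pressed_at is not None:
            return  # already recording; ignore key repeat
        self._pressed_at = now_ms
        self.voice.begin_utterance(
            auto_send_agent_id=self.agent_id,
            auto_finish_on_silence=False,  # a pause must not end the utterance
        )

    def on_release(self, now_ms: int) -> None:
        if self._pressed_at is None:
            return  # release without a matching press
        held = now_ms - self._pressed_at
        self._pressed_at = None
        if held < self.min_hold_ms:
            self.voice.cancel_utterance()  # discard; no UtteranceFinal
        else:
            self.voice.end_utterance()     # commit; produces UtteranceFinal
```

Cancelling very short presses avoids sending an empty or clipped utterance when the player taps the button by mistake.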

Hands-free (VAD-driven)¶

Keep autoFinishOnSilence = true and leave the utterance open. The server fires UtteranceFinal after each detected pause and automatically reopens for the next one.
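
In hands-free mode each UtteranceFinal is effectively one spoken command. A minimal sketch of handing those commands from the transcript callback to a game loop is shown below. `UtteranceInbox` is a hypothetical helper, the update kind is passed as a plain string for illustration, and the use of a thread-safe queue assumes the callback may run on a different thread from the main loop.

```python
import queue


class UtteranceInbox:
    """Thread-safe hand-off of finished utterances from a callback to the main loop."""

    def __init__(self):
        self._queue = queue.Queue()

    def on_update(self, kind: str, text: str) -> None:
        """Call from the transcript callback; keeps only non-empty final utterances."""
        if kind == "UtteranceFinal" and text.strip():
            self._queue.put(text.strip())

    def drain(self) -> list:
        """Return every utterance received since the last call, oldest first."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items
```

Calling `drain()` once per frame keeps command handling on the main thread and never blocks when nothing has been said.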

7. Destroy the handle¶

Always destroy the VoiceInput when you are done with it. This frees the model slot on the server.

vi->Destroy();  // C++

voice.Dispose();  // Unity (C#)

Subsystem->RequestDestroyVoiceInput(Voice);  // Unreal (C++)
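
This page does not show a Python destroy call. One way to guarantee cleanup, whatever that call turns out to be, is a context manager. The sketch below takes caller-supplied `create` and `destroy` callables, so it assumes nothing about the Tryll Python API; `voice_input` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def voice_input(create, destroy):
    """Create a voice input, yield its handle, and always destroy it afterwards.

    `create` returns a handle or id; `destroy` accepts that same value.
    Both are supplied by the caller and are not Tryll functions.
    """
    handle = create()
    try:
        yield handle
    finally:
        destroy(handle)  # runs on normal exit and when the body raises
```

Used as `with voice_input(make, release) as handle: ...`, the model slot is freed even if transcript handling raises partway through.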

Next steps¶

Bias Voice Input with Hotwords — nudge the STT decoder toward game-specific proper nouns and spell names without retraining.
Pin and Unpin Models — keep the STT model warm between sessions to avoid reload latency.
Enable Auto Model Downloading — let the server fetch missing STT models automatically during development.

Related reference¶

CreateVoiceInputRequest — wire message sent to start a VoiceInput session.
BeginUtteranceRequest — wire message sent to open an audio capture window.
WireTranscriptUpdate — server push carrying incremental and final transcript text.
