Use Voice Input¶
Voice Input is the Tryll subsystem that streams raw PCM audio to the
server, runs it through a speech-to-text (STT) model, and delivers
transcripts back to your application. This page covers the full
lifecycle: creating a VoiceInput handle, starting and finishing
utterances, streaming audio, and cleaning up. It ends with links to
hotword biasing and model selection.
Prerequisites¶
- A connected session configured with an STT engine:
// C++: enable Sherpa-ONNX for STT
client.ConfigureSession(Tryll::InferenceEngine::Mock, /* LLM engine */
                        Tryll::InferenceEngine::SherpaOnnx);
# Python
client.configure_session(
    engine=InferenceEngine.LlamaCpp,
    stt_engine=InferenceEngine.SherpaOnnx,
)
// Unity (C#)
await client.ConfigureSessionAsync(new SessionConfig
{
    Engine = TryllInferenceEngine.LlamaCpp,
    SttEngine = TryllInferenceEngine.SherpaOnnx,
});
// Unreal (C++)
FTryllSessionConfig Cfg;
Cfg.Engine = ETryllInferenceEngine::LlamaCpp;
Cfg.SttEngine = ETryllInferenceEngine::SherpaOnnx;
Subsystem->ConfigureSession(Cfg, OnDone);
1. Create a VoiceInput handle¶
A VoiceInput is a stateful server-side object that holds the loaded
STT model and the current utterance state. Create one per microphone
source; most games need only one.
C++
#include <tryll/VoiceInput.h>
Tryll::VoiceInputConfig cfg;
cfg.modelName = "Parakeet TDT 0.6B v2 (int8)";
cfg.inputFormat = { .sampleRate = 48000, .channels = 1, .bitsPerSample = 16 };
cfg.vadThreshold = 0.5f; // Silero VAD speech-probability threshold
cfg.vadMinSilenceMs = 500; // silence that ends a segment (ms)
cfg.vadSpeechPadMs = 250; // padding around detected speech (ms)
auto vi = client.CreateVoiceInput(cfg);
Python
voice_input_id = client.create_voice_input(
    model_name="Parakeet TDT 0.6B v2 (int8)",
    sample_rate=48000, channels=1, bits_per_sample=16,
)
Unity (C#)
var cfg = new VoiceInputConfig
{
    ModelName = "Parakeet TDT 0.6B v2 (int8)",
    InputFormat = new AudioFormat { SampleRate = 48000, Channels = 1, BitsPerSample = 16 },
    VadThreshold = 0.5f,
    VadMinSilenceMs = 500,
    VadSpeechPadMs = 250,
};
var (voice, err) = await client.CreateVoiceInputAsync(cfg);
if (!err.IsOk) { /* handle */ }
Unreal (C++)
FTryllVoiceInputConfig Config;
Config.ModelName = TEXT("Parakeet TDT 0.6B v2 (int8)");
Config.InputFormat.SampleRate = 48000;
Subsystem->CreateVoiceInput(Config,
    [](TSharedPtr<FTryllVoiceInput> Voice, FTryllError Err)
    {
        if (!Err.IsOk()) { /* handle */ return; }
        // Store Voice; it is now ready for BeginUtterance.
    });
The server loads the model (or reuses a cached copy) and returns a
voice_input_id. This id is embedded in the handle objects in C++,
Unity, and Unreal; it is returned directly in Python.
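To build intuition for how the three VAD settings interact, the following sketch segments a list of per-frame speech probabilities using a threshold, a minimum-silence duration, and padding. It is a simplified illustration, not the algorithm Tryll or Silero VAD actually runs; the function name `segment_speech` and the fixed `frame_ms` frame length are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def segment_speech(
    probs: List[float],
    frame_ms: int = 32,
    threshold: float = 0.5,
    min_silence_ms: int = 500,
    speech_pad_ms: int = 250,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) speech segments from per-frame probabilities.

    Hypothetical helper, for illustration only.
    """
    total_ms = len(probs) * frame_ms
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start time of the open segment, if any
    last_speech_end = 0          # end time of the most recent speech frame

    def close() -> None:
        # Pad both edges, clamped to the bounds of the audio.
        segments.append((max(0, start - speech_pad_ms),
                         min(total_ms, last_speech_end + speech_pad_ms)))

    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= threshold:
            if start is None:
                start = t  # rising edge: speech begins
            last_speech_end = t + frame_ms
        elif start is not None and t + frame_ms - last_speech_end >= min_silence_ms:
            close()        # enough continuous silence ends the segment
            start = None

    if start is not None:  # audio ended while a segment was still open
        close()
    return segments
```

In this sketch, raising `threshold` makes quiet speech more likely to be ignored, while raising `min_silence_ms` lets the speaker pause longer before a segment is cut.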
2. Set up transcript callbacks¶
Register a callback before starting utterances so you receive transcripts as they arrive. Tryll delivers four update kinds:
| Kind | When |
|---|---|
| SpeechStart | VAD detects rising edge; no text yet |
| Partial | In-progress hypothesis from online (streaming) STT |
| SegmentFinal | Engine committed a chunk; utterance still open |
| UtteranceFinal | Last update for this Begin/End cycle |
C++
vi->SetOnTranscriptUpdate([](const Tryll::TranscriptUpdate& u) {
    if (u.kind == Tryll::TranscriptUpdateKind::UtteranceFinal)
        ProcessFinalTranscript(u.text);
});
Unity (C#)
voice.OnTranscriptUpdate += update =>
{
    if (update.Kind == TranscriptUpdateKind.UtteranceFinal)
        ProcessFinalTranscript(update.Text);
};
Unreal (C++)
Voice->OnTranscriptUpdate.AddLambda([](const FTryllTranscriptUpdate& U)
{
    if (U.Kind == ETryllTranscriptUpdateKind::UtteranceFinal)
        ProcessFinalTranscript(U.Text);
});
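The examples above react only to UtteranceFinal. If you also want to drive a live caption or a microphone indicator, one possible approach is a small state tracker that handles all four kinds. The sketch below is plain Python and does not use the Tryll client; `UpdateKind`, `TranscriptUpdate`, and `TranscriptTracker` are hypothetical stand-ins for whatever types your SDK exposes, and it assumes each Partial carries the full current hypothesis rather than a delta.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class UpdateKind(Enum):
    """Hypothetical mirror of the four update kinds in the table."""
    SPEECH_START = auto()
    PARTIAL = auto()
    SEGMENT_FINAL = auto()
    UTTERANCE_FINAL = auto()


@dataclass
class TranscriptUpdate:
    kind: UpdateKind
    text: str = ""


@dataclass
class TranscriptTracker:
    """Tracks UI-facing state across one Begin/End cycle."""
    speaking: bool = False
    live_text: str = ""                                  # latest in-progress hypothesis
    committed: List[str] = field(default_factory=list)   # SegmentFinal chunks
    final_text: Optional[str] = None                     # set on UtteranceFinal

    def on_update(self, u: TranscriptUpdate) -> None:
        if u.kind is UpdateKind.SPEECH_START:
            self.speaking = True       # no text yet; e.g. light up a mic icon
        elif u.kind is UpdateKind.PARTIAL:
            self.live_text = u.text    # replace rather than append (assumption)
        elif u.kind is UpdateKind.SEGMENT_FINAL:
            self.committed.append(u.text)
            self.live_text = ""
        elif u.kind is UpdateKind.UTTERANCE_FINAL:
            self.final_text = u.text
            self.speaking = False
            self.live_text = ""
```

You would call `on_update` from the transcript callback and read the tracker's fields when rendering.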
3. Begin an utterance¶
BeginUtterance opens a recording window and arms the VAD.
C++
Tryll::UtteranceOptions opts;
opts.autoSendAgentId = myAgent->Id(); // 0 for transcribe-only
opts.autoFinishOnSilence = true;
opts.maxUtteranceMs = 30000;
vi->BeginUtterance(opts);
Python: the client currently requires you to construct the request manually via the wire codec (encode_begin_utterance_request in tryll_client.codec).
Unity (C#)
var opts = new UtteranceOptions
{
    AutoSendAgentId = agentId, // 0 = transcribe-only
    AutoFinishOnSilence = true,
    MaxUtteranceMs = 30000,
};
voice.BeginUtterance(opts);
Unreal (C++)
FTryllUtteranceOptions Opts;
Opts.AutoSendAgentId = AgentId; // 0 = transcribe-only
Opts.bAutoFinishOnSilence = true;
Opts.MaxUtteranceMs = 30000;
Voice->BeginUtterance(Opts);
When autoSendAgentId is non-zero, the server automatically forwards
the final transcript as a SendMessageRequest to that agent when the
utterance closes, so no extra round-trip is needed.
4. Stream audio¶
Send PCM chunks as soon as they are captured. Each send is fire-and-forget: the server returns no acknowledgement and applies no backpressure.
C++
// Called from your audio capture callback at ~20 ms intervals.
vi->SendAudioBuffer(pcmChunk.data(), pcmChunk.size());
Unity (C#)
// Called from Unity's OnAudioFilterRead or a microphone polling coroutine.
voice.SendAudioBuffer(pcmBytes);
Unreal (C++)
Audio must be raw signed 16-bit PCM at the sample rate declared in
inputFormat. The server resamples to the model's expected rate as
needed. Encoded formats (Opus, MP3, etc.) are not supported on the voice path.
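As a worked example of the format requirement, the sketch below computes how many bytes a 20 ms chunk occupies at 48 kHz mono 16-bit, converts float samples to signed 16-bit PCM, and splits a buffer into chunks. The helper names are hypothetical, and little-endian byte order is an assumption, since this page does not state the expected endianness.

```python
import struct
from typing import Iterator, Sequence


def chunk_bytes(sample_rate: int, channels: int, bits_per_sample: int, chunk_ms: int) -> int:
    """Bytes in one chunk of raw PCM covering chunk_ms milliseconds."""
    frames = sample_rate * chunk_ms // 1000
    return frames * channels * (bits_per_sample // 8)


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    # Scale, round, and clip so out-of-range input cannot overflow int16.
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield successive size-byte slices; the last one may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


# 48 kHz * 0.020 s = 960 frames; 960 frames * 1 channel * 2 bytes = 1920 bytes.
size = chunk_bytes(48000, 1, 16, 20)
```

Many microphone APIs deliver float samples, so a conversion step like `floats_to_pcm16` is typically needed before sending.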
5. End or cancel an utterance¶
If autoFinishOnSilence is true, the server closes the utterance
automatically when the VAD detects silence. You can also close it explicitly:
vi->EndUtterance(); // C++ — commit; produces UtteranceFinal
vi->CancelUtterance(); // C++ — discard; no UtteranceFinal fired
6. Input modes¶
Push-to-talk¶
Call BeginUtterance when the player presses a button and
EndUtterance when they release it. Disable autoFinishOnSilence
so VAD does not cut short a deliberate pause mid-sentence.
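The press/release mapping can be sketched as a small controller. This is an illustration only: `PushToTalk` is a hypothetical class, the `begin`, `end`, and `cancel` callables stand in for your SDK's BeginUtterance, EndUtterance, and CancelUtterance calls, and the `min_hold_ms` tap filter is an added design choice, not a Tryll feature.

```python
from typing import Callable, Optional


class PushToTalk:
    """Maps button press/release onto begin/end/cancel callables."""

    def __init__(
        self,
        begin: Callable[[], None],
        end: Callable[[], None],
        cancel: Callable[[], None],
        min_hold_ms: int = 150,
    ) -> None:
        self._begin = begin
        self._end = end
        self._cancel = cancel
        self._min_hold_ms = min_hold_ms
        self._pressed_at_ms: Optional[int] = None  # None while the button is up

    def press(self, now_ms: int) -> None:
        if self._pressed_at_ms is None:  # ignore key-repeat events
            self._pressed_at_ms = now_ms
            self._begin()

    def release(self, now_ms: int) -> None:
        if self._pressed_at_ms is None:
            return  # release without a matching press
        held = now_ms - self._pressed_at_ms
        self._pressed_at_ms = None
        if held >= self._min_hold_ms:
            self._end()     # commit the utterance
        else:
            self._cancel()  # treat a very short tap as accidental
```

Passing callables keeps the controller independent of any particular binding, so the same logic can sit behind a keyboard key, a gamepad button, or a UI toggle.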
Hands-free (VAD-driven)¶
Keep autoFinishOnSilence = true and leave the utterance open.
The server fires UtteranceFinal after each detected pause and
automatically reopens for the next one.
7. Destroy the handle¶
Always destroy the VoiceInput when you are done with it. This
frees the model slot on the server.
Next steps¶
- Bias Voice Input with Hotwords — nudge the STT decoder toward game-specific proper nouns and spell names without retraining.
- Pin and Unpin Models — keep the STT model warm between sessions to avoid reload latency.
- Enable Auto Model Downloading — let the server fetch missing STT models automatically during development.
Related reference¶
- CreateVoiceInputRequest: wire message sent to start a VoiceInput session.
- BeginUtteranceRequest: wire message sent to open an audio capture window.
- WireTranscriptUpdate: server push carrying incremental and final transcript text.