Bias Voice Input with Hotwords¶
Open-vocabulary speech-to-text engines guess every word from acoustic evidence alone, which means they often mangle game-specific proper nouns — the NPC name "Aeltharion" becomes "Al Therion", the spell "frostbolt" becomes "frost bolt". Sherpa-ONNX exposes a hotwords list that nudges the decoder toward a designer-authored set of phrases without retraining the model.
This recipe shows how to ship a phrase list with your game and apply
it to a VoiceInput.
Prerequisites¶
- A working VoiceInput session. See Use Voice Input for the full setup walkthrough.
- An STT model that supports hotwords. Most transducer / CTC families do; encoder-decoder families (Whisper, SenseVoice, Moonshine, Canary) silently ignore the list. The matrix is in Which models support hotwords below.
1. Author a phrase list¶
Hotwords live in a session-scoped StringStorage of kind List. The
simplest authoring format is one phrase per line in a UTF-8 text
file. Blank lines and lines starting with # are ignored:
# fantasy NPC names
Aeltharion
Cor'than
Bjornthor
# spells / abilities
fireball
frostbolt
magic missile
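If you would rather read such a file on the client and pass the phrases as a list, the format is easy to parse. The following is a minimal sketch under the rules stated above (UTF-8, one phrase per line, blank lines and lines starting with # skipped). `parse_hotwords` is a hypothetical helper name, not part of the SDK, and trimming surrounding whitespace is an assumption of the sketch.

```python
from typing import List


def parse_hotwords(text: str) -> List[str]:
    """Return the phrases in `text`, one per line.

    Blank lines and lines starting with '#' are skipped. Trimming
    surrounding whitespace is an assumption of this sketch.
    """
    phrases = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        phrases.append(line)
    return phrases


sample = """\
# fantasy NPC names
Aeltharion
Cor'than

# spells / abilities
magic missile
"""
print(parse_hotwords(sample))
```

For a file on disk, `pathlib.Path(path).read_text(encoding="utf-8")` yields the string to pass in. The resulting list can then be registered the same way as the inline lists in step 2.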
2. Register the storage and wire it into VoiceInput¶
C++
#include <string>
#include <vector>

#include <tryll/TryllClient.h>
#include <tryll/VoiceInput.h>

// 1. Register the lexicon under a name of your choice. Same-session lifetime.
std::vector<std::string> phrases = {
    "Aeltharion", "Cor'than", "Bjornthor",
    "fireball", "frostbolt", "magic missile",
};
client.CreateStringStorage("game-lexicon", phrases);

// 2. Reference it when creating the VoiceInput.
Tryll::VoiceInputConfig viCfg;
viCfg.modelName = "Parakeet TDT 0.6B v2 (int8)";
viCfg.inputFormat = micFormat;
viCfg.hotwordsStorageName = "game-lexicon";
viCfg.hotwordsScore = 1.8f;  // 1.0 = no bias, 2.5 = aggressive
auto vi = client.CreateVoiceInput(viCfg);
Python
# 1. Register the lexicon.
client.create_string_storage("game-lexicon", strings=[
    "Aeltharion", "Cor'than", "Bjornthor",
    "fireball", "frostbolt", "magic missile",
])

# 2. Create the VoiceInput with hotwords biasing.
voice_input_id = client.create_voice_input(
    model_name="Parakeet TDT 0.6B v2 (int8)",
    sample_rate=16000, channels=1, bits_per_sample=16,
    hotwords_storage_name="game-lexicon",
    hotwords_score=1.8,
)
Unity (C#)
// 1. Register the lexicon.
await client.CreateStringStorageAsync("game-lexicon", new List<string>
{
    "Aeltharion", "Cor'than", "Bjornthor",
    "fireball", "frostbolt", "magic missile",
});

// 2. Create the VoiceInput with hotwords biasing.
var cfg = new VoiceInputConfig
{
    ModelName = "Parakeet TDT 0.6B v2 (int8)",
    InputFormat = AudioFormat.Default,
    VadThreshold = 0.5f,
    VadMinSilenceMs = 500,
    VadSpeechPadMs = 250,
    HotwordsStorageName = "game-lexicon",
    HotwordsScore = 1.8f,
};
var (voice, err) = await client.CreateVoiceInputAsync(cfg);
Unreal (C++)
// 1. Register the lexicon.
TArray<FString> Phrases = {
    TEXT("Aeltharion"), TEXT("Cor'than"), TEXT("Bjornthor"),
    TEXT("fireball"), TEXT("frostbolt"), TEXT("magic missile"),
};
Subsystem->CreateStringStorage(TEXT("game-lexicon"), Phrases,
    [](TSharedPtr<FTryllStringStorage>, FTryllError Err)
    {
        if (!Err.IsOk()) { /* handle error */ }
    });

// 2. Create the VoiceInput with hotwords biasing.
FTryllVoiceInputConfig Config;
Config.ModelName = TEXT("Parakeet TDT 0.6B v2 (int8)");
Config.InputFormat.SampleRate = 16000;
Config.HotwordsStorageName = TEXT("game-lexicon");
Config.HotwordsScore = 1.8f;
Subsystem->CreateVoiceInput(Config,
    [](TSharedPtr<FTryllVoiceInput> Voice, FTryllError Err)
    {
        if (!Err.IsOk()) { /* handle error */ return; }
        // Voice is ready; call BeginUtterance to start capturing.
    });
If you already have the phrases in a text file on the server's
filesystem, swap step 1 for CreateStringStorageFromFile(name, path)
(C++/Unreal) or client.create_string_storage(name, file_path=path)
(Python).
Picking a score¶
| Score | Bias | Use when… |
|---|---|---|
| 1.0 | none / off | you want to confirm the hotwords path is wired up before applying any bias |
| 1.5 | gentle (default) | the lexicon is broad and you do not want false positives |
| 1.8 – 2.0 | moderate | proper nouns + spell names — typical game lexicon |
| 2.5+ | aggressive | the recognizer keeps picking a phonetically-similar real word and you accept some false positives |
The score is applied to every phrase in the list. A future revision may allow per-phrase overrides.
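One way to choose among the rows above is to start low and step up only until the phrases you care about are recognized, since higher scores bring more false positives. The sketch below illustrates that procedure and is not a feature of the SDK. `lowest_sufficient_score` and `recall_at` are hypothetical names, the 0.9 target is arbitrary, and the recall figures are invented stand-ins for measurements you would take on your own recordings.

```python
from typing import Callable, Optional, Sequence


def lowest_sufficient_score(
    candidates: Sequence[float],
    recall_at: Callable[[float], float],
    target_recall: float = 0.9,
) -> Optional[float]:
    """Return the smallest candidate score whose recall meets the target.

    `recall_at(score)` is supplied by the caller and should return the
    fraction (0.0 to 1.0) of test utterances whose hotword was
    transcribed correctly at that score.
    """
    for score in sorted(candidates):
        if recall_at(score) >= target_recall:
            return score
    return None  # no candidate was strong enough


# Stand-in measurements for illustration only; real numbers come from
# running your own recordings through a VoiceInput at each score.
fake_recall = {1.0: 0.40, 1.5: 0.70, 1.8: 0.92, 2.5: 0.97}
best = lowest_sufficient_score(list(fake_recall), fake_recall.__getitem__)
print(best)
```

Stopping at the first score that meets the target keeps the bias as gentle as the lexicon allows. In practice you would also want to listen for ordinary words being misheard as hotwords at the chosen score.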
Which models support hotwords¶
| Family | Hotwords |
|---|---|
| Zipformer transducer / Parakeet TDT (Transducer, NemoTransducer) | yes |
| Zipformer-CTC, NeMo CTC, Paraformer, Dolphin, WeNet CTC | yes |
| Omnilingual, MedASR, TeleSpeech | yes |
| Whisper | no — silently ignored |
| SenseVoice | no — silently ignored |
| Moonshine, Canary, FireRedASR, CohereTranscribe | no |
When you pass a hotwords list to an unsupported model the server
emits a warn-level log line at session creation:
[Sherpa-ONNX] STT family 'whisper' does not support hotwords;
ignoring 9 supplied phrase(s) on this session.
The session is otherwise unaffected.
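Because an unsupported model only produces a server-side log line, a client may want to check the family before supplying a lexicon. The sketch below transcribes the table above into a lookup. It is an illustration, not an SDK call: `supports_hotwords` is a hypothetical helper, the normalized keys are assumptions about how families are named, and mapping a model name to its family is left to the caller.

```python
import re
from typing import Optional

# Transcribed from the support table above. The normalized keys are an
# assumption of this sketch; check how your server names each family.
_HOTWORDS_SUPPORT = {
    "transducer": True, "nemotransducer": True,
    "zipformerctc": True, "nemoctc": True, "paraformer": True,
    "dolphin": True, "wenetctc": True,
    "omnilingual": True, "medasr": True, "telespeech": True,
    "whisper": False, "sensevoice": False, "moonshine": False,
    "canary": False, "fireredasr": False, "coheretranscribe": False,
}


def supports_hotwords(family: str) -> Optional[bool]:
    """Return True/False for families in the table, None if unknown."""
    # Lowercase and drop punctuation/spaces so "NeMo CTC" matches "nemoctc".
    key = re.sub(r"[^a-z0-9]", "", family.lower())
    return _HOTWORDS_SUPPORT.get(key)


for family in ("NeMo CTC", "whisper", "some-new-family"):
    print(family, "->", supports_hotwords(family))
```

Returning `None` for unknown families keeps the helper from guessing; the caller can then fall back to watching the server log.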
Caveats¶
- BPE encoding is approximate in v1. The server takes your grapheme phrases and prefixes each whitespace-separated word with the BPE word marker (▁). This works perfectly for words that the model's BPE tokenizer represents as a single piece, i.e. most common English verbs and nouns. Multi-piece words (the made-up fantasy name "Aeltharion" decomposes into 3–4 pieces) get partial bias, not the full effect.
- Pre-tokenize for full effect. If a particular proper noun matters and the approximate encoding is not biasing strongly enough, run sherpa-onnx's text2token.py with the model's tokenizer over your lexicon and put the ▁Ael th ar ion-style form directly in the storage. The server detects a leading ▁ and passes the phrase through unchanged.
- The score is recognizer-wide per session. It is not per-phrase.
- The lexicon lives on the server. StringStorage is session-scoped: destroying the session destroys the lexicon. To share a lexicon across sessions, re-create it on each session (cheap; it is just a list of strings).
- Out-of-vocabulary characters are dropped. The naïve encoder treats characters one-to-one. If your lexicon contains characters the tokenizer cannot represent (e.g. CJK characters against a pure-English Zipformer), they will be silently dropped from the bias.
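The first two caveats can be pictured with a small sketch. It is a reading of the behaviour described above, not the server's actual implementation: `approximate_hotword_encoding` is a hypothetical name, and joining the marked words with a single space is an assumption.

```python
BPE_WORD_MARKER = "\u2581"  # the '▁' character


def approximate_hotword_encoding(phrase: str) -> str:
    """Mimic the v1 behaviour described in the caveats above.

    Phrases that already start with the marker are treated as
    pre-tokenized and returned unchanged; otherwise every
    whitespace-separated word gets the marker as a prefix.
    """
    if phrase.startswith(BPE_WORD_MARKER):
        return phrase
    return " ".join(BPE_WORD_MARKER + word for word in phrase.split())


for phrase in ["fireball", "magic missile", "\u2581Ael th ar ion"]:
    print(approximate_hotword_encoding(phrase))
```

The sketch shows why a pre-tokenized entry matters for invented names: the approximate path can only produce one marked token per word, while the pass-through path keeps the piece boundaries you supplied.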