Bias Voice Input with Hotwords

Open-vocabulary speech-to-text engines guess every word from acoustic evidence alone, which means they often mangle game-specific proper nouns — the NPC name "Aeltharion" becomes "Al Therion", the spell "frostbolt" becomes "frost bolt". Sherpa-ONNX exposes a hotwords list that nudges the decoder toward a designer-authored set of phrases without retraining the model.

This recipe shows how to ship a phrase list with your game and apply it to a VoiceInput.

Prerequisites

  • A working VoiceInput session — see Use Voice Input for the full setup walkthrough.
  • An STT model that supports hotwords. Most transducer / CTC families do; encoder-decoder families (Whisper, SenseVoice, Moonshine, Canary) accept the list but ignore it, and the only trace is a warning in the server log. The matrix is in Which models support hotwords below.

1. Author a phrase list

Hotwords live in a session-scoped StringStorage of kind List. The simplest authoring format is one phrase per line in a UTF-8 text file. Blank lines and lines starting with # are ignored:

# fantasy NPC names
Aeltharion
Cor'than
Bjornthor

# spells / abilities
fireball
frostbolt
magic missile
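
The format above is simple enough to parse on the client when you would rather pass the phrases as an in-memory list. A minimal Python sketch, assuming surrounding whitespace should be trimmed before the blank-line and # checks; the helper name load_phrases is illustrative and not part of the Tryll API:

```python
def load_phrases(text: str) -> list[str]:
    """Return the hotword phrases from lexicon text, one phrase per line."""
    phrases = []
    for raw_line in text.splitlines():
        line = raw_line.strip()  # assumption: surrounding whitespace is not significant
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        phrases.append(line)
    return phrases


sample = """\
# fantasy NPC names
Aeltharion
Cor'than

# spells / abilities
magic missile
"""
print(load_phrases(sample))
```

To read from disk, pass the result of open(path, encoding="utf-8").read(). The returned list can then be handed to the storage-creation call in step 2.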

2. Register the storage and wire it into VoiceInput

C++

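#include <string>
#include <vector>

#include <tryll/TryllClient.h>
#include <tryll/VoiceInput.h>

// `client` and `micFormat` come from your existing VoiceInput setup (see Use Voice Input).

// 1. Register the lexicon under a name of your choice. It lives as long as the session.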
std::vector<std::string> phrases = {
    "Aeltharion", "Cor'than", "Bjornthor",
    "fireball",   "frostbolt", "magic missile",
};
client.CreateStringStorage("game-lexicon", phrases);

// 2. Reference it when creating the VoiceInput.
Tryll::VoiceInputConfig viCfg;
viCfg.modelName            = "Parakeet TDT 0.6B v2 (int8)";
viCfg.inputFormat          = micFormat;
viCfg.hotwordsStorageName  = "game-lexicon";
viCfg.hotwordsScore        = 1.8f;   // 1.0 = no bias, 2.5 = aggressive

auto vi = client.CreateVoiceInput(viCfg);

Python

# 1. Register the lexicon.
client.create_string_storage("game-lexicon", strings=[
    "Aeltharion", "Cor'than", "Bjornthor",
    "fireball", "frostbolt", "magic missile",
])

# 2. Create the VoiceInput with hotwords biasing.
voice_input_id = client.create_voice_input(
    model_name="Parakeet TDT 0.6B v2 (int8)",
    sample_rate=16000, channels=1, bits_per_sample=16,
    hotwords_storage_name="game-lexicon",
    hotwords_score=1.8,
)

Unity (C#)

// 1. Register the lexicon.
await client.CreateStringStorageAsync("game-lexicon", new List<string>
{
    "Aeltharion", "Cor'than", "Bjornthor",
    "fireball", "frostbolt", "magic missile",
});

// 2. Create the VoiceInput with hotwords biasing.
var cfg = new VoiceInputConfig
{
    ModelName            = "Parakeet TDT 0.6B v2 (int8)",
    InputFormat          = AudioFormat.Default,
    VadThreshold         = 0.5f,
    VadMinSilenceMs      = 500,
    VadSpeechPadMs       = 250,
    HotwordsStorageName  = "game-lexicon",
    HotwordsScore        = 1.8f,
};
var (voice, err) = await client.CreateVoiceInputAsync(cfg);

Unreal (C++)

// 1. Register the lexicon.
TArray<FString> Phrases = {
    TEXT("Aeltharion"), TEXT("Cor'than"), TEXT("Bjornthor"),
    TEXT("fireball"), TEXT("frostbolt"), TEXT("magic missile"),
};
Subsystem->CreateStringStorage(TEXT("game-lexicon"), Phrases,
    [](TSharedPtr<FTryllStringStorage>, FTryllError Err)
    {
        if (!Err.IsOk()) { /* handle error */ }
    });

// 2. Create the VoiceInput with hotwords biasing.
FTryllVoiceInputConfig Config;
Config.ModelName             = TEXT("Parakeet TDT 0.6B v2 (int8)");
Config.InputFormat.SampleRate = 16000;
Config.HotwordsStorageName   = TEXT("game-lexicon");
Config.HotwordsScore         = 1.8f;

Subsystem->CreateVoiceInput(Config,
    [](TSharedPtr<FTryllVoiceInput> Voice, FTryllError Err)
    {
        if (!Err.IsOk()) { /* handle error */ return; }
        // Voice is ready — call BeginUtterance to start capturing.
    });

If you already have the phrases in a text file on the server's filesystem, swap step 1 for CreateStringStorageFromFile(name, path) (C++/Unreal) or client.create_string_storage(name, file_path=path) (Python).

Picking a score

Score        Bias              Use when…
1.0          none / off        sanity check that the path is wired but you do not yet want bias
1.5          gentle (default)  the lexicon is broad and you do not want false positives
1.8 – 2.0    moderate          proper nouns + spell names (typical game lexicon)
2.5+         aggressive        the recognizer keeps picking a phonetically-similar real word and you accept some false positives

The score is applied to every phrase in the list. A future revision may allow per-phrase overrides.
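
If you expose the bias strength as a tuning knob, it can help to name the rows of the table rather than scatter raw numbers through your code. A small sketch; the preset names and the choice of 1.8 for the moderate range are illustrative assumptions, not Tryll constants:

```python
# Named presets mirroring the score table above (names are hypothetical).
HOTWORD_SCORE_PRESETS = {
    "off": 1.0,         # no bias, useful to check the wiring
    "gentle": 1.5,      # default, suited to broad lexicons
    "moderate": 1.8,    # lower end of the 1.8-2.0 range
    "aggressive": 2.5,  # accepts some false positives
}


def hotwords_score_for(preset: str) -> float:
    """Look up the score for a preset name, failing loudly on typos."""
    try:
        return HOTWORD_SCORE_PRESETS[preset]
    except KeyError:
        valid = ", ".join(sorted(HOTWORD_SCORE_PRESETS))
        raise ValueError(f"unknown preset {preset!r}; expected one of: {valid}") from None


print(hotwords_score_for("moderate"))
```

The returned value is what you would pass as hotwords_score when creating the VoiceInput.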

Which models support hotwords

Family                                                              Hotwords
Zipformer transducer / Parakeet TDT (Transducer, NemoTransducer)    yes
Zipformer-CTC, NeMo CTC, Paraformer, Dolphin, WeNet CTC             yes
Omnilingual, MedASR, TeleSpeech                                     yes
Whisper                                                             no (silently ignored)
SenseVoice                                                          no (silently ignored)
Moonshine, Canary, FireRedASR, CohereTranscribe                     no

When you pass a hotwords list to an unsupported model the server emits a warn-level log line at session creation:

[Sherpa-ONNX] STT family 'whisper' does not support hotwords;
              ignoring 9 supplied phrase(s) on this session.

The session is otherwise unaffected.
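
Because the fallback only shows up in the server log, you may want to catch an unsupported pairing on the client before creating the session. The sketch below encodes the matrix above as a lookup. The family keys and the helper name are hypothetical labels chosen for this example, and mapping a concrete model name to its family is left to you:

```python
import warnings

# Families from the support matrix above; keys are illustrative labels.
HOTWORDS_SUPPORT = {
    "zipformer-transducer": True,
    "parakeet-tdt": True,
    "zipformer-ctc": True,
    "nemo-ctc": True,
    "paraformer": True,
    "dolphin": True,
    "wenet-ctc": True,
    "omnilingual": True,
    "medasr": True,
    "telespeech": True,
    "whisper": False,
    "sensevoice": False,
    "moonshine": False,
    "canary": False,
    "fireredasr": False,
    "coheretranscribe": False,
}


def check_hotwords_support(family: str, phrases: list[str]) -> bool:
    """Return True if hotwords will take effect; warn if they would be ignored."""
    supported = HOTWORDS_SUPPORT.get(family.lower())
    if supported is None:
        raise KeyError(f"unknown STT family: {family!r}")
    if not supported and phrases:
        warnings.warn(
            f"STT family {family!r} does not support hotwords; "
            f"{len(phrases)} phrase(s) would be ignored."
        )
    return supported
```

Calling check_hotwords_support("whisper", phrases) during development surfaces the mismatch in your own tooling instead of leaving it in the server log.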

Caveats

  • BPE encoding is approximate in v1. The server takes your grapheme phrases and prefixes each whitespace-separated word with the BPE word marker (▁). This works perfectly for words that the model's BPE tokenizer represents as a single piece, i.e. most common English verbs and nouns. Multi-piece words (the made-up fantasy name "Aeltharion" decomposes into 3–4 pieces) get partial bias, not the full effect.
  • Pre-tokenize for full effect. If a particular proper noun matters and the approximate encoding is not biasing strongly enough, run sherpa-onnx's text2token.py with the model's tokenizer over your lexicon and put the ▁Ael th ar ion-style form directly in the storage. The server detects a leading ▁ and passes the phrase through unchanged.
  • The score is recognizer-wide per session. It is not per-phrase.
  • The lexicon lives on the server. StringStorage is session-scoped — destroying the session destroys the lexicon. To share a lexicon across sessions, re-create it on each session (cheap; it is just a list of strings).
  • Out-of-vocabulary characters are dropped. The naïve encoder treats characters one-to-one. If your lexicon contains characters the tokenizer cannot represent (e.g. CJK characters against a pure-English Zipformer), they will be silently dropped from the bias.
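
The first two caveats can be pictured with a small sketch of the approximate encoding as described: each whitespace-separated word gets the ▁ marker, and a phrase that already starts with ▁ is passed through. This illustrates the described behaviour only and is not the server's implementation. The function names and the toy single-piece vocabulary are assumptions made for the example:

```python
WORD_MARKER = "\u2581"  # "▁", the BPE word-boundary marker


def approximate_encode(phrase: str) -> str:
    """Mimic the described v1 encoding of a grapheme phrase."""
    if phrase.startswith(WORD_MARKER):
        return phrase  # already pre-tokenized: pass through unchanged
    return " ".join(WORD_MARKER + word for word in phrase.split())


def words_needing_pretokenization(phrase: str, single_piece_vocab: set[str]) -> list[str]:
    """List the words that the tokenizer would not represent as one piece."""
    if phrase.startswith(WORD_MARKER):
        return []  # pre-tokenized phrases need no further work
    return [w for w in phrase.split() if WORD_MARKER + w not in single_piece_vocab]


# Toy vocabulary standing in for the model's real token list.
toy_vocab = {"▁magic", "▁missile", "▁fireball"}

print(approximate_encode("magic missile"))
print(approximate_encode("▁Ael th ar ion"))
print(words_needing_pretokenization("Aeltharion fireball", toy_vocab))
```

Words flagged by the second helper are the candidates worth running through text2token.py, since they would otherwise receive only partial bias.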