Add SenseVoice local STT model support #1634
| Filename | Overview |
|---|---|
| packages/server/src/server/speech/providers/local/sherpa/model-catalog.ts | Adds SenseVoice catalog entry with directFiles for Hugging Face mirror downloads; refactors SherpaOnnxCatalogEntry to a discriminated union carrying a recognizer spec per STT model. Clean structural extension. |
| packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.ts | Extends the recognizer engine to support sense_voice configs alongside nemo_transducer. Logic is correct; two private interface aliases duplicate the union member shapes unnecessarily. |
| packages/server/src/server/speech/providers/local/sherpa/model-downloader.ts | Adds downloadDirectFiles with per-URL fallback; correctly falls back to archive on failure and uses atomic temp-file rename. Download logic is sound. |
| packages/server/src/server/speech/providers/local/worker-process.ts | Replaces hardcoded Parakeet paths with catalog-driven buildSttRecognizerModel; now uses path.join via localModelPath. The Parakeet-named provider classes now serve both model families without renaming. |
| packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts | New test file verifying SenseVoice recognizer config shape; uses vi.mock and vi.hoisted patterns previously flagged as banned by project test rules. |
| packages/server/src/server/speech/speech-config-resolver.test.ts | Adds acceptance test for SenseVoice as dictation and voice-mode STT; clean module-interface test. |
| packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts | Adds coverage for SenseVoice direct-file path; uses vi.stubGlobal for fetch and asserts both observable outcomes (files on disk) and internal URLs (previously flagged). |
| packages/server/src/server/speech/providers/local/models.ts | Adds getLocalSpeechModelSpec passthrough consistent with existing delegation pattern in the module facade. |
| packages/server/scripts/download-speech-models.ts | Adds --help/-h guard before arg parsing to fix accidental default-model downloads; correct fix. |
| packages/server/scripts/transcribe-local-wav.ts | Adds voiceTurnDetection to the required RequestedSpeechProviders shape; fixes local WAV transcription initialization. |
| public-docs/voice.md | Documents SenseVoice model ID, language support, config snippet, and HF-mirror download strategy; accurate and consistent with implementation. |
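As a rough illustration of the discriminated-union recognizer spec and the catalog-driven path resolution described in the table, here is a minimal sketch. The type names, field names, the `buildRecognizerModelSketch` helper, and the `"auto"` language value are assumptions made for illustration; only the `nemo_transducer` and `sense_voice` kinds and the `model.int8.onnx` / `tokens.txt` file names come from this PR.

```typescript
import * as path from "node:path";

// Hypothetical shapes: field names are illustrative, not the PR's actual types.
type RecognizerSpec =
  | {
      kind: "nemo_transducer";
      encoder: string;
      decoder: string;
      joiner: string;
      tokens: string;
    }
  | { kind: "sense_voice"; model: string; tokens: string; language: string };

interface SttCatalogEntry {
  id: string;
  recognizer: RecognizerSpec;
}

// Resolves the catalog's relative file names against the model directory.
// The switch narrows on `kind`, so each branch only sees its own fields.
function buildRecognizerModelSketch(
  modelDir: string,
  entry: SttCatalogEntry,
): RecognizerSpec {
  const abs = (file: string): string => path.join(modelDir, file);
  const spec = entry.recognizer;
  switch (spec.kind) {
    case "nemo_transducer":
      return {
        kind: "nemo_transducer",
        encoder: abs(spec.encoder),
        decoder: abs(spec.decoder),
        joiner: abs(spec.joiner),
        tokens: abs(spec.tokens),
      };
    case "sense_voice":
      return {
        kind: "sense_voice",
        model: abs(spec.model),
        tokens: abs(spec.tokens),
        language: spec.language, // passed through, not a path
      };
    default: {
      // Compile-time exhaustiveness check: a new kind must add a case above.
      const unreachable: never = spec;
      throw new Error(`Unsupported recognizer: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

With a union like this, adding a third model family becomes a compile error until the builder handles it, which is one plausible reason to prefer it over optional fields on a single interface.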
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[ensureSherpaOnnxModel] --> B{spec.directFiles?}
B -- yes --> C[downloadDirectFiles\nhf-mirror URLs then HF fallback]
C --> D{requiredFiles present?}
D -- yes --> E[return modelDir]
D -- no --> F[warn: fall back to archive]
C -- throws --> F
B -- no --> G[download .tar.bz2 archive]
F --> G
G --> H[extractTarArchive to modelsDir]
H --> I{requiredFiles present?}
I -- yes --> J[clean up archive, return modelDir]
I -- no --> K[throw: required files missing]
subgraph workerProcess [worker-process.ts — engine init]
L[buildSttRecognizerModel] --> M{spec.recognizer.kind}
M -- nemo_transducer --> N[absolute paths for\nencoder/decoder/joiner/tokens]
M -- sense_voice --> O[absolute paths for\nmodel/tokens + language config]
N --> P[SherpaOfflineRecognizerEngine\nnemo_transducer config]
O --> Q[SherpaOfflineRecognizerEngine\nsenseVoice config]
end
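The download half of the flowchart (direct files first, archive as fallback) can be sketched roughly as follows. This is not the PR's `downloadDirectFiles` implementation: the `DirectFile` shape, the injected `fetchBytes` and `downloadArchive` callbacks, and the function names are assumptions chosen so the control flow can be read and exercised in isolation.

```typescript
import * as fs from "node:fs/promises";
import * as path from "node:path";

// Hypothetical types: one target file with candidate URLs tried in order.
interface DirectFile {
  relativePath: string;
  urls: string[];
}
type FetchBytes = (url: string) => Promise<Uint8Array>;

// Tries each URL in turn; writes to a temp file and renames it into place so
// a partial download never appears under the final name.
async function downloadOne(dir: string, file: DirectFile, fetchBytes: FetchBytes): Promise<void> {
  const dest = path.join(dir, file.relativePath);
  await fs.mkdir(path.dirname(dest), { recursive: true });
  let lastError: unknown;
  for (const url of file.urls) {
    const tmp = `${dest}.tmp-${process.pid}`;
    try {
      await fs.writeFile(tmp, await fetchBytes(url));
      await fs.rename(tmp, dest); // atomic when tmp and dest share a filesystem
      return;
    } catch (error) {
      lastError = error;
      await fs.rm(tmp, { force: true }); // clean up before trying the next URL
    }
  }
  throw new Error(`All URLs failed for ${file.relativePath}: ${String(lastError)}`);
}

async function allPresent(dir: string, required: string[]): Promise<boolean> {
  for (const name of required) {
    try {
      await fs.access(path.join(dir, name));
    } catch {
      return false;
    }
  }
  return true;
}

async function ensureModelSketch(opts: {
  modelDir: string;
  directFiles?: DirectFile[];
  requiredFiles: string[];
  fetchBytes: FetchBytes;
  downloadArchive: () => Promise<void>; // stands in for download + extract
}): Promise<string> {
  if (opts.directFiles) {
    try {
      for (const file of opts.directFiles) {
        await downloadOne(opts.modelDir, file, opts.fetchBytes);
      }
      if (await allPresent(opts.modelDir, opts.requiredFiles)) return opts.modelDir;
      console.warn("direct files incomplete, falling back to archive");
    } catch (error) {
      console.warn("direct download failed, falling back to archive:", error);
    }
  }
  await opts.downloadArchive();
  if (!(await allPresent(opts.modelDir, opts.requiredFiles))) {
    throw new Error("required files missing after archive extraction");
  }
  return opts.modelDir;
}
```

Injecting the fetcher keeps the sketch testable without stubbing a global `fetch`, so a test can assert on files on disk rather than on internal URLs.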
Reviews (2): Last reviewed commit: "Use path.join for local speech model pat..."
Summary
Closes #1633.
Adds daemon-side local STT support for a sherpa-onnx SenseVoice int8 model:
- Adds `sense-voice-zh-en-ja-ko-yue-int8-2025-09-09` to the local speech model catalog
- Supports `senseVoice` configs as well as existing NeMo transducer configs
- Fixes `speech:download --help` so it does not accidentally start default model downloads
- Fixes `speech:transcribe:local` provider setup so local wav transcription can initialize the shared local runtime

Scope
This keeps the current mobile/iPad audio streaming architecture unchanged. STT still runs on the daemon host.
Out of scope for this PR:
Testing
- `ELECTRON_SKIP_BINARY_DOWNLOAD=1 npm install --workspaces --include-workspace-root`
- `npm run build:server-deps`
- `npx vitest run packages/server/src/server/speech/speech-config-resolver.test.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts`
- `npm run typecheck --workspace=@getpaseo/server`
- `npm run build --workspace=@getpaseo/server`
- `npm run lint -- packages/server/src/server/speech/providers/local/sherpa/model-catalog.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.ts packages/server/src/server/speech/providers/local/models.ts packages/server/src/server/speech/providers/local/worker-process.ts packages/server/src/server/speech/speech-config-resolver.test.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts packages/server/scripts/download-speech-models.ts packages/server/scripts/transcribe-local-wav.ts`
- `npm run format:check:files -- public-docs/voice.md packages/server/src/server/speech/providers/local/sherpa/model-catalog.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.ts packages/server/src/server/speech/providers/local/models.ts packages/server/src/server/speech/providers/local/worker-process.ts packages/server/src/server/speech/speech-config-resolver.test.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts packages/server/scripts/download-speech-models.ts packages/server/scripts/transcribe-local-wav.ts`
- `npm run speech:download --workspace=@getpaseo/server -- --help`

Manual local model test:
- `npm run speech:download --workspace=@getpaseo/server -- --models-dir /tmp/paseo-sensevoice-direct-test --model sense-voice-zh-en-ja-ko-yue-int8-2025-09-09`
- `https://hf-mirror.com/.../model.int8.onnx` and `tokens.txt`
- `model.int8.onnx`: 241M
- `tokens.txt`: 312K

Manual local inference test:
- `test_wavs/zh.wav` from the same HF mirror repo
- `npm run speech:transcribe:local --workspace=@getpaseo/server -- /tmp/paseo-sensevoice-zh.wav --models-dir /tmp/paseo-sensevoice-direct-test --model sense-voice-zh-en-ja-ko-yue-int8-2025-09-09`
- `放时间早上九点至下午五点` (in English, roughly: "opening hours are 9 a.m. to 5 p.m.")
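
For the `speech:download --help` fix listed in the summary, a guard of roughly this shape illustrates the idea: check for the help flag before any parsing or defaulting, so that a help request can never reach the code path that applies a default model list. The `parseDownloadArgsSketch` function and its result types are hypothetical; only the `--help`/`-h`, `--models-dir`, and `--model` flags appear in this PR.

```typescript
// Hypothetical CLI shape: flag names mirror the commands above; the parsing is a sketch.
interface DownloadArgs {
  modelsDir?: string;
  models: string[];
}
type ParsedArgs = { kind: "help" } | { kind: "run"; args: DownloadArgs };

function parseDownloadArgsSketch(argv: string[]): ParsedArgs {
  // Help is checked first, before any defaults are applied.
  if (argv.includes("--help") || argv.includes("-h")) {
    return { kind: "help" };
  }
  const args: DownloadArgs = { models: [] };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--models-dir" || flag === "--model") {
      const value = argv[++i];
      if (value === undefined) throw new Error(`Missing value for ${flag}`);
      if (flag === "--models-dir") args.modelsDir = value;
      else args.models.push(value);
    } else {
      throw new Error(`Unknown argument: ${String(flag)}`);
    }
  }
  return { kind: "run", args };
}
```

A caller would print usage and exit on `kind: "help"`, and only fall back to default models when `kind: "run"` comes back with an empty `models` list.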