Add SenseVoice local STT model support #1634

Open
waibiwaibig wants to merge 3 commits into getpaseo:main from waibiwaibig:feat/sensevoice-local-stt
Conversation

@waibiwaibig waibiwaibig commented Jun 20, 2026

Summary

Closes #1633.

Adds daemon-side local STT support for a sherpa-onnx SenseVoice int8 model:

  • adds sense-voice-zh-en-ja-ko-yue-int8-2025-09-09 to the local speech model catalog
  • downloads SenseVoice from Hugging Face mirror direct files before falling back to the GitHub release archive
  • extends the offline recognizer to initialize sherpa-onnx senseVoice configs as well as existing NeMo transducer configs
  • builds STT recognizer configs from catalog metadata instead of assuming every STT model has Parakeet transducer files
  • allows the model to resolve through existing dictation / voice-mode local STT config paths
  • documents the new Chinese/English mixed local STT option
  • fixes speech:download --help so it does not accidentally start default model downloads
  • fixes speech:transcribe:local provider setup so local wav transcription can initialize the shared local runtime
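
To illustrate the catalog-driven recognizer config described above, here is a minimal sketch. The type and function names (`RecognizerSpec`, `buildModelConfig`, `joinPath`) and the file names in the usage note are hypothetical and not copied from the PR; the `transducer` / `senseVoice` keys are an assumption about the sherpa-onnx config layout rather than a confirmed copy of it.

```typescript
// Hypothetical recognizer spec attached to each STT catalog entry.
// Field names are illustrative, not taken from model-catalog.ts.
type RecognizerSpec =
  | {
      kind: "nemo_transducer";
      encoder: string;
      decoder: string;
      joiner: string;
      tokens: string;
    }
  | { kind: "sense_voice"; model: string; tokens: string; language: string };

// Assumed shape of the model section of an offline recognizer config.
type ModelConfig =
  | {
      transducer: { encoder: string; decoder: string; joiner: string };
      tokens: string;
      modelType: "nemo_transducer";
    }
  | {
      senseVoice: { model: string; language: string };
      tokens: string;
    };

// Stand-in for path.join so the sketch has no Node dependency.
function joinPath(dir: string, file: string): string {
  return `${dir.replace(/\/+$/, "")}/${file}`;
}

// Dispatch on the spec kind instead of assuming Parakeet transducer files.
function buildModelConfig(modelDir: string, spec: RecognizerSpec): ModelConfig {
  switch (spec.kind) {
    case "nemo_transducer":
      return {
        transducer: {
          encoder: joinPath(modelDir, spec.encoder),
          decoder: joinPath(modelDir, spec.decoder),
          joiner: joinPath(modelDir, spec.joiner),
        },
        tokens: joinPath(modelDir, spec.tokens),
        modelType: "nemo_transducer",
      };
    case "sense_voice":
      return {
        senseVoice: {
          model: joinPath(modelDir, spec.model),
          language: spec.language,
        },
        tokens: joinPath(modelDir, spec.tokens),
      };
  }
}
```

Because the union is discriminated on `kind`, adding a third model family would be flagged by the compiler wherever a `switch` like this one is not yet exhaustive.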

Scope

This keeps the current mobile/iPad audio streaming architecture unchanged. STT still runs on the daemon host.

Out of scope for this PR:

  • mobile-side local inference
  • UI redesign or model-management UI
  • OpenAI speech changes
  • Chinese TTS
  • changing the default STT model for all users

Testing

  • ELECTRON_SKIP_BINARY_DOWNLOAD=1 npm install --workspaces --include-workspace-root
  • npm run build:server-deps
  • npx vitest run packages/server/src/server/speech/speech-config-resolver.test.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts
  • npm run typecheck --workspace=@getpaseo/server
  • npm run build --workspace=@getpaseo/server
  • npm run lint -- packages/server/src/server/speech/providers/local/sherpa/model-catalog.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.ts packages/server/src/server/speech/providers/local/models.ts packages/server/src/server/speech/providers/local/worker-process.ts packages/server/src/server/speech/speech-config-resolver.test.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts packages/server/scripts/download-speech-models.ts packages/server/scripts/transcribe-local-wav.ts
  • npm run format:check:files -- public-docs/voice.md packages/server/src/server/speech/providers/local/sherpa/model-catalog.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.ts packages/server/src/server/speech/providers/local/models.ts packages/server/src/server/speech/providers/local/worker-process.ts packages/server/src/server/speech/speech-config-resolver.test.ts packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts packages/server/scripts/download-speech-models.ts packages/server/scripts/transcribe-local-wav.ts
  • npm run speech:download --workspace=@getpaseo/server -- --help
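
The last command checks the `--help` fix. A minimal sketch of such a guard follows, assuming the script checks for `--help` / `-h` before any other argument handling; `parseDownloadArgs` and `ParsedArgs` are hypothetical names, and only the `--model` and `--models-dir` flags are taken from the commands in this PR.

```typescript
// Hypothetical parse result: either show help or run a download.
type ParsedArgs =
  | { action: "help" }
  | { action: "download"; modelsDir: string | undefined; models: string[] };

function parseDownloadArgs(argv: string[]): ParsedArgs {
  // Check for help first so no default download can start by accident.
  if (argv.includes("--help") || argv.includes("-h")) {
    return { action: "help" };
  }

  const models: string[] = [];
  let modelsDir: string | undefined;
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    const value = argv[i + 1];
    if (arg === "--model" && value !== undefined) {
      models.push(value);
      i++; // skip the consumed value
    } else if (arg === "--models-dir" && value !== undefined) {
      modelsDir = value;
      i++;
    }
  }
  // An empty `models` list would mean "use the defaults" in this sketch.
  return { action: "download", modelsDir, models };
}
```

Returning a tagged result instead of downloading inside the parser keeps the help path free of side effects and easy to test.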

Manual local model test:

  • npm run speech:download --workspace=@getpaseo/server -- --models-dir /tmp/paseo-sensevoice-direct-test --model sense-voice-zh-en-ja-ko-yue-int8-2025-09-09
  • Downloaded via https://hf-mirror.com/.../model.int8.onnx and tokens.txt
  • Completed in about 15 seconds on this network
  • Downloaded files:
    • model.int8.onnx: 241M
    • tokens.txt: 312K

Manual local inference test:

  • Downloaded test_wavs/zh.wav from the same HF mirror repo
  • npm run speech:transcribe:local --workspace=@getpaseo/server -- /tmp/paseo-sensevoice-zh.wav --models-dir /tmp/paseo-sensevoice-direct-test --model sense-voice-zh-en-ja-ko-yue-int8-2025-09-09
  • Output: 放时间早上九点至下午五点

@waibiwaibig waibiwaibig marked this pull request as ready for review June 20, 2026 13:22

greptile-apps Bot commented Jun 20, 2026

Greptile Summary

Adds sense-voice-zh-en-ja-ko-yue-int8-2025-09-09 as a local STT option supporting Chinese, English, Japanese, Korean, and Cantonese, and wires it through the full daemon-side pipeline: catalog metadata, direct-file download from the HF mirror with an archive fallback, recognizer config dispatch, and the config resolver.

  • Catalog + downloader: model-catalog.ts gains a discriminated union SherpaOnnxCatalogEntry that attaches a recognizer spec per STT model; model-downloader.ts adds a downloadDirectFiles path with per-URL retry and atomic renames before falling back to the archive.
  • Recognizer + worker: sherpa-offline-recognizer.ts now builds either a nemo_transducer or senseVoice config from the spec; worker-process.ts reads file paths from catalog metadata instead of hardcoding Parakeet filenames.
  • Bug fixes: speech:download --help no longer triggers default model downloads; transcribe-local-wav.ts now satisfies the voiceTurnDetection field required by the shared runtime config shape.
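
As a rough illustration of the per-URL fallback with atomic rename mentioned in the first bullet, the sketch below tries each candidate URL in order, writes to a temporary path, and renames into place only after a complete write. `FileStore`, `Fetcher`, and `downloadWithFallback` are hypothetical names; a real implementation would presumably sit on top of `fetch` and the Node file-system API, which are injected here so the sketch stays self-contained.

```typescript
// Minimal file-system surface the sketch needs (hypothetical interface).
interface FileStore {
  write(path: string, data: Uint8Array): Promise<void>;
  rename(from: string, to: string): Promise<void>;
  remove(path: string): Promise<void>;
}

// Fetches the full body of a URL or rejects on failure.
type Fetcher = (url: string) => Promise<Uint8Array>;

// Tries each URL in order and returns the one that succeeded.
async function downloadWithFallback(
  urls: string[],
  dest: string,
  fetcher: Fetcher,
  store: FileStore,
): Promise<string> {
  const tmp = `${dest}.tmp`;
  let lastError: unknown;
  for (const url of urls) {
    try {
      const data = await fetcher(url);
      await store.write(tmp, data);
      // Rename only after a full write, so `dest` is never half-written.
      await store.rename(tmp, dest);
      return url;
    } catch (error) {
      lastError = error;
      // Best-effort cleanup of the partial temp file before the next URL.
      await store.remove(tmp).catch(() => undefined);
    }
  }
  throw new Error(
    `all ${urls.length} URLs failed for ${dest}: ${String(lastError)}`,
  );
}
```

If every URL fails, the function throws, which lets the caller decide whether to fall back to the archive download.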

Confidence Score: 5/5

Safe to merge — no runtime correctness issues introduced.

The new model type flows cleanly through every layer: catalog spec, direct-file downloader, recognizer config builder, and the worker engine. The direct-download path uses atomic renames and falls back to the archive on any failure, leaving no corrupt state. The two script fixes are narrow and clearly correct. Remaining findings are naming and structural style concerns.

No files require special attention for correctness; sherpa-offline-recognizer.test.ts carries the test-pattern concerns noted in the previous review round.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/server/src/server/speech/providers/local/sherpa/model-catalog.ts | Adds SenseVoice catalog entry with directFiles for Hugging Face mirror downloads; refactors SherpaOnnxCatalogEntry to a discriminated union carrying a recognizer spec per STT model. Clean structural extension. |
| packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.ts | Extends the recognizer engine to support sense_voice configs alongside nemo_transducer. Logic is correct; two private interface aliases duplicate the union member shapes unnecessarily. |
| packages/server/src/server/speech/providers/local/sherpa/model-downloader.ts | Adds downloadDirectFiles with per-URL fallback; correctly falls back to archive on failure and uses atomic temp-file rename. Download logic is sound. |
| packages/server/src/server/speech/providers/local/worker-process.ts | Replaces hardcoded Parakeet paths with catalog-driven buildSttRecognizerModel; now uses path.join via localModelPath. The Parakeet-named provider classes now serve both model families without renaming. |
| packages/server/src/server/speech/providers/local/sherpa/sherpa-offline-recognizer.test.ts | New test file verifying SenseVoice recognizer config shape; uses vi.mock and vi.hoisted patterns previously flagged as banned by project test rules. |
| packages/server/src/server/speech/speech-config-resolver.test.ts | Adds acceptance test for SenseVoice as dictation and voice-mode STT; clean module-interface test. |
| packages/server/src/server/speech/providers/local/sherpa/model-downloader.test.ts | Adds coverage for SenseVoice direct-file path; uses vi.stubGlobal for fetch and asserts both observable outcomes (files on disk) and internal URLs (previously flagged). |
| packages/server/src/server/speech/providers/local/models.ts | Adds getLocalSpeechModelSpec passthrough consistent with existing delegation pattern in the module facade. |
| packages/server/scripts/download-speech-models.ts | Adds --help/-h guard before arg parsing to fix accidental default-model downloads; correct fix. |
| packages/server/scripts/transcribe-local-wav.ts | Adds voiceTurnDetection to the required RequestedSpeechProviders shape; fixes local WAV transcription initialization. |
| public-docs/voice.md | Documents SenseVoice model ID, language support, config snippet, and HF-mirror download strategy; accurate and consistent with implementation. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ensureSherpaOnnxModel] --> B{spec.directFiles?}
    B -- yes --> C[downloadDirectFiles\nhf-mirror URLs then HF fallback]
    C --> D{requiredFiles present?}
    D -- yes --> E[return modelDir]
    D -- no --> F[warn: fall back to archive]
    C -- throws --> F
    B -- no --> G[download .tar.bz2 archive]
    F --> G
    G --> H[extractTarArchive to modelsDir]
    H --> I{requiredFiles present?}
    I -- yes --> J[clean up archive, return modelDir]
    I -- no --> K[throw: required files missing]

    subgraph workerProcess [worker-process.ts — engine init]
        L[buildSttRecognizerModel] --> M{spec.recognizer.kind}
        M -- nemo_transducer --> N[absolute paths for\nencoder/decoder/joiner/tokens]
        M -- sense_voice --> O[absolute paths for\nmodel/tokens + language config]
        N --> P[SherpaOfflineRecognizerEngine\nnemo_transducer config]
        O --> Q[SherpaOfflineRecognizerEngine\nsenseVoice config]
    end
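
The download half of the flowchart can be read as the following control flow. This is a sketch under stated assumptions, not the code in model-downloader.ts: `EnsureSteps` and `ensureModel` are hypothetical names, and each step is injected as a callback so the branching can be exercised without network or disk access.

```typescript
// Hypothetical injected steps mirroring the flowchart nodes.
interface EnsureSteps {
  // Absent when the catalog entry has no directFiles.
  downloadDirectFiles?: () => Promise<void>;
  downloadAndExtractArchive: () => Promise<void>;
  hasRequiredFiles: () => Promise<boolean>;
  warn: (message: string) => void;
}

async function ensureModel(
  modelDir: string,
  steps: EnsureSteps,
): Promise<string> {
  if (steps.downloadDirectFiles) {
    try {
      await steps.downloadDirectFiles();
      if (await steps.hasRequiredFiles()) {
        return modelDir; // direct files were enough
      }
      steps.warn("direct files incomplete; falling back to archive");
    } catch (error) {
      steps.warn(
        `direct download failed (${String(error)}); falling back to archive`,
      );
    }
  }

  // Archive path: used when there are no direct files or they failed.
  await steps.downloadAndExtractArchive();
  if (!(await steps.hasRequiredFiles())) {
    throw new Error(`required files missing in ${modelDir}`);
  }
  return modelDir;
}
```

Both failure modes of the direct path (a thrown error and an incomplete file set) converge on the same archive branch, which matches the two arrows into the "fall back to archive" node.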

Reviews (2): Last reviewed commit: "Use path.join for local speech model pat..."

Comment thread packages/server/src/server/speech/providers/local/worker-process.ts

Development

Successfully merging this pull request may close these issues.

PRD: Local Chinese/English mixed speech support via SenseVoice

1 participant