fix: audio export quality and Whisper language selection #4

Open
ScreepCode wants to merge 1 commit into thehwang:main from ScreepCode:fix/audio-quality-and-whisper-language

@ScreepCode ScreepCode commented Jun 15, 2026

Summary

  • Audio too quiet (Audio export is too quiet #2): Raises system audio capture from 16 kHz to 48 kHz and updates the AAC encoder to 48 kHz / 128 kbps. The exported audio file now receives the original 48 kHz PCM instead of the already-downsampled 16 kHz buffer that was previously written. Before this change the Nyquist limit cut everything above 8 kHz, making recordings sound thin and quiet.
  • Whisper language hardcoded (Language selection not respected for mic transcription (Whisper) and summary #3): Adds a language property to WhisperEngine and wires it up from MeetingRecorder.startRecording() using the 2-letter ISO prefix of recognitionLanguage (e.g. "de-DE" → "de"). Previously transcribeChunk had the language hardcoded, so the mic channel always transcribed in one language regardless of the UI selection.
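
The Nyquist arithmetic behind the first bullet can be checked with a small sketch. `nyquistLimit` is a hypothetical helper, not a function from this PR; it only encodes the rule that sampled audio can represent frequencies up to half its sample rate.

```swift
/// Hypothetical helper (not part of the PR): the highest frequency, in Hz,
/// that audio sampled at `sampleRate` Hz can represent.
func nyquistLimit(sampleRate: Double) -> Double {
    return sampleRate / 2
}

let oldLimit = nyquistLimit(sampleRate: 16_000)  // capture rate before this PR
let newLimit = nyquistLimit(sampleRate: 48_000)  // capture rate after this PR

print("16 kHz capture keeps content up to \(oldLimit) Hz")
print("48 kHz capture keeps content up to \(newLimit) Hz")
```

At 16 kHz the limit is 8 kHz, which matches the cutoff described above; at 48 kHz it rises to 24 kHz.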

Changes

| File | Change |
| --- | --- |
| SystemAudioCapture.swift | config.sampleRate: 16 000 → 48 000 |
| MeetingRecorder.swift | audioFileSettings: 48 kHz / 128 kbps; write original PCM to file; fix writeMicAudio memcpy condition |
| WhisperEngine.swift | Add language property, use self.language in transcribeChunk |

Test plan

  • Record a short meeting in German (or another non-English language): the mic channel should now transcribe in the selected language rather than the previously hardcoded one
  • Remote channel (SFSpeech) continues to work as before
  • Exported audio-mic.m4a and audio-system.m4a sound noticeably louder/fuller than before
  • Switching language in UI between sessions changes Whisper transcription language correctly

- SystemAudioCapture: raise sample rate from 16 kHz to 48 kHz so
  exported audio captures the full voice frequency range (0–24 kHz)
  instead of being limited to 8 kHz (Nyquist of 16 kHz)

- MeetingRecorder: update audio file settings to 48 kHz / 128 kbps AAC;
  write original 48 kHz PCM to the audio file in handleSystemAudioBuffer
  instead of the already-downsampled 16 kHz buffer that was fed to
  SFSpeech; fix writeMicAudio memcpy fast-path to also trigger for
  stereo hardware input (was gated on channelCount == 1 unnecessarily)

- WhisperEngine: add `language` property (default "en"), use it in
  transcribeChunk instead of a hardcoded language string; set it from
  MeetingRecorder.startRecording() via the 2-letter ISO prefix of
  recognitionLanguage (e.g. "de-DE" → "de")

Fixes thehwang#2, fixes thehwang#3
@ScreepCode ScreepCode force-pushed the fix/audio-quality-and-whisper-language branch from c211121 to 0c22e3d Compare June 15, 2026 16:35

@thehwang thehwang left a comment

Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the diff against the code in context. This is a clean, correct bugfix — happy to approve. Both goals land well: higher-quality audio export, and selectable Whisper language.

Quality path is self-consistent

  • SystemAudioCapture.sampleRate = 48_000 is SCK's native rate; capturing at 16k was effectively asking SCK to downsample for us, so 48k is cleaner.
  • handleSystemAudioBuffer now writes the original pcm to disk and only downsamples a separate buffer for SFSpeech. writeSystemAudio already has a format-conversion fallback, and the file's processing format is now 48k mono, so the original buffer matches and no extra conversion happens.
  • audioFileSettings (48k / 128 kbps mono) is shared by both the mic and system writers, so sample rates stay consistent when the two tracks are merged.
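
For reference, a shared settings dictionary with the values discussed above might look roughly like the following. This is an assumed shape, since the actual `audioFileSettings` in MeetingRecorder.swift is not reproduced on this page; only the 48 kHz / 128 kbps / mono AAC values come from the PR. The keys are the standard AVFoundation audio settings keys.

```swift
import AVFoundation

// Assumed shape of the shared writer settings (the real dictionary in
// MeetingRecorder.swift may differ). Values are taken from the PR text:
// AAC, 48 kHz, mono, 128 kbps.
let audioFileSettings: [String: Any] = [
    AVFormatIDKey: kAudioFormatMPEG4AAC,   // AAC in an .m4a container
    AVSampleRateKey: 48_000.0,             // matches the 48 kHz capture rate
    AVNumberOfChannelsKey: 1,              // mono
    AVEncoderBitRateKey: 128_000,          // 128 kbps
]
```

Using one dictionary for both the mic and system writers is what keeps the two files at the same sample rate.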

Language path is correct

  • strdup(self.language) is still released by the existing free(langStr) — no leak introduced.
  • recognitionLanguage.components(separatedBy: "-").first?.lowercased() maps en-US -> en, zh-CN -> zh, matching Whisper's two-letter codes.
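
The two points above can be sketched in isolation. `whisperLanguageCode` and `withCLanguageString` are hypothetical names chosen for illustration, and the "en" fallback mirrors the default mentioned in the commit message; the real code in WhisperEngine.swift may be structured differently.

```swift
import Foundation

/// Hypothetical helper: reduces a locale identifier such as "de-DE" to the
/// lowercased primary subtag ("de"), using the same expression quoted in the
/// review. Falls back to `fallback` when the input is empty.
func whisperLanguageCode(from recognitionLanguage: String,
                         fallback: String = "en") -> String {
    guard let prefix = recognitionLanguage
            .components(separatedBy: "-").first?.lowercased(),
          !prefix.isEmpty else {
        return fallback
    }
    return prefix
}

/// Hypothetical helper: copies `language` into a heap-allocated C string,
/// passes it to `body`, and frees it afterwards, so every strdup is paired
/// with exactly one free.
func withCLanguageString<T>(_ language: String,
                            _ body: (UnsafePointer<CChar>) -> T) -> T? {
    guard let langStr = strdup(language) else { return nil }
    defer { free(langStr) }  // runs on every exit path after allocation
    return body(UnsafePointer(langStr))
}

let code = whisperLanguageCode(from: "zh-CN")
let roundTrip = withCLanguageString(code) { String(cString: $0) }
```

The `defer` is one way to keep the `strdup`/`free` pairing visible at the allocation site; the existing explicit `free(langStr)` is equally correct.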

Suggestions (non-blocking)

  1. Multilingual model is a prerequisite. Language selection only works when a multilingual model (e.g. ggml-base.bin) is loaded; with an English-only *.en model, setting e.g. zh will produce garbage. Worth a guard/warning when a non-en language is chosen but the loaded model is .en.
  2. Dropping && buffer.format.channelCount == 1 in writeMicAudio is fine here — the slow path also reads only ch0, so the result is identical. It's only safe because our buffers are non-interleaved (interleaved: false); a one-line comment noting that assumption would prevent a future interleaved-stereo memcpy(ch0) foot-gun.
  3. Multi-subtag locales like yue-Hant-HK -> yue parse fine, but yue is only supported by larger Whisper models, not base. Edge case, just flagging.
  4. self.language is read on processingQueue. It isn't mutated during recording so it's safe in practice; capturing let lang = self.language before the async block would make that explicit.
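
A minimal sketch of the guard from suggestion 1, under stated assumptions: the model is identified by file name only, English-only models follow the `*.en` naming pattern mentioned above (for example `ggml-base.en.bin`), and falling back to "en" with a warning is just one possible policy. `isEnglishOnlyModel` and `effectiveLanguage` are hypothetical names, not existing functions in this repository.

```swift
/// Hypothetical check: treats a model as English-only when its file name
/// contains an ".en" marker followed by "." or "-"
/// (e.g. "ggml-base.en.bin"). This is a naming heuristic, not a query
/// against the loaded model.
func isEnglishOnlyModel(fileName: String) -> Bool {
    let name = fileName.lowercased()
    return name.contains(".en.") || name.contains(".en-")
}

/// Hypothetical policy: if a non-English language is requested but the
/// model looks English-only, warn and fall back to "en".
func effectiveLanguage(requested: String, modelFileName: String) -> String {
    if requested != "en" && isEnglishOnlyModel(fileName: modelFileName) {
        print("warning: \(modelFileName) looks English-only; ignoring language '\(requested)'")
        return "en"
    }
    return requested
}

let withMultilingual = effectiveLanguage(requested: "zh", modelFileName: "ggml-base.bin")
let withEnglishOnly = effectiveLanguage(requested: "zh", modelFileName: "ggml-base.en.bin")
```

Surfacing the warning in the UI instead of silently falling back would be an equally reasonable choice.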

Not affected (double-checked)

  • Whisper's 16 kHz input requirement: the mic→whisper feed is unchanged; only the saved file's sample rate changed.
  • Mic/system merge: both share the 48k settings, so they stay aligned.
