
feat(tools): in-situ recording evaluation protocol #138

Merged

ToaruPen merged 8 commits into main from feat/task-8c-recording-evaluation-v3 on Apr 5, 2026
Conversation

@ToaruPen
Owner

@ToaruPen ToaruPen commented Apr 5, 2026

Summary

  • Add CLI tools for calibrating sound detection thresholds using real facility recordings (Closes Task 8c: In-Situ Recording Evaluation Protocol #35)
  • record_evaluation: Audio HAL capture → WAV + metadata JSON export
  • evaluate_detection: Precision/recall/F1 computation against annotated ground truth (a minimal metrics sketch follows below)
  • optimize_thresholds: Grid search over sensitivity, energy gate, speech prob, overlap parameters
  • Pydantic annotation schema with full validation and JSON round-trip support
  • 43 passing tests with shared test fixtures
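
The evaluation metrics follow the standard precision/recall/F1 definitions over per-event boolean predictions. A minimal sketch of that computation (illustrative only; the actual compute_metrics in evaluate_detection.py may differ in naming and structure):

from dataclasses import dataclass


@dataclass
class DetectionMetrics:
    precision: float
    recall: float
    f1: float
    false_positive_rate: float


def compute_metrics(predicted: list[bool], ground_truth: list[bool]) -> DetectionMetrics:
    """Count confusion-matrix cells in a single pass over aligned boolean lists."""
    tp = fp = fn = tn = 0
    for p, g in zip(predicted, ground_truth, strict=True):
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif not p and g:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return DetectionMetrics(precision, recall, f1, fpr)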

Test plan

  • uv run pytest tests/tools/ -v — 43 tests pass
  • uv run ruff format --check && ruff check — no issues
  • uv run mypy --strict — no issues (11 files)
  • uv run pyright — 0 errors
  • uv run python scripts/smoke_test.py — ALL CHECKS PASSED
  • CI Python quality gates
  • CI UI quality gates
  • CodeRabbit review

🤖 Generated with Claude Code

Summary by CodeRabbit

Release notes

  • New features

    • Audio annotation schema (JSON save/load, event model, and validation)
    • Detection evaluation tool (recording-pair evaluation, precision/recall/F1/FPR computation, CLI output)
    • Threshold optimization framework (grid search to select the best configuration, CLI)
    • Audio recording feature (WAV output and metadata generation, CLI)
  • Tests

    • Comprehensive tests and test utilities for the annotation, evaluation, optimization, and recording features
  • Documentation

    • Added an ADR (decision record)

@coderabbitai
Contributor

coderabbitai Bot commented Apr 5, 2026

Warning

Rate limit exceeded

@ToaruPen has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 39 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 3 minutes and 39 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1fd653f0-1abe-4d35-a251-88349220e73d

📥 Commits

Reviewing files that changed from the base of the PR and between baaf8bd and f3cee0a.

📒 Files selected for processing (4)
  • daemon/src/daemon/tools/annotation_schema.py
  • tests/tools/test_annotation_schema.py
  • tests/tools/test_evaluate_detection.py
  • tests/tools/test_optimize_thresholds.py
📝 Walkthrough

Adds a new set of tools for the recording/annotation schema, detection evaluation, threshold optimization, and recording execution, together with corresponding tests and a related ADR. Introduces the JSON annotation model, WAV I/O and decoding, a grid-search framework, and CLI entry points.

Changes

Cohort / File(s) — Summary

  • Module initialization — daemon/src/daemon/tools/__init__.py, tests/tools/__init__.py
    Introduces from __future__ import annotations and sets an empty __all__: list[str] = [].
  • Annotation schema — daemon/src/daemon/tools/annotation_schema.py
    Adds EventType ("bell", "multi_voice", "quiet", "speech", "noise") and the AnnotationEvent / RecordingAnnotation Pydantic (ContractModel) models, with end_s > time_s validation and UTF-8 indented JSON I/O via save_annotation / load_annotation. (A minimal schema sketch follows this table.)
  • Detection evaluation tools — daemon/src/daemon/tools/evaluate_detection.py
    Adds the DetectionMetrics / EvaluationResult models, compute_metrics() (including length checks), WAV decoding (8/16/32-bit) and annotation loading in load_recording_pair, evaluate_recording(), and a CLI that walks a recordings directory (the detector is injectable; the CLI uses a placeholder detector).
  • Threshold optimization tools — daemon/src/daemon/tools/optimize_thresholds.py
    Adds the ThresholdGrid / OptimizeConfig / GridSearchResult models, a run_grid_search() framework that evaluates every combination, select_best(), and a CLI. Introduces MetricName = Literal["f1", "precision", "recall"].
  • Recording tool — daemon/src/daemon/tools/record_evaluation.py
    Implements the RecordingMetadata model and record_audio(). Adds a CLI that records in chunks from an AudioInput and writes a clipped mono PCM16 WAV plus metadata JSON, including KeyboardInterrupt handling and closing of the input/output resources.
  • Test helper utilities — tests/tools/conftest.py
    Adds SAMPLE_RATE = 16000, make_wav_bytes() (float → PCM16 WAV bytes), write_wav() (silent WAV generation), and write_annotation() (a save_annotation wrapper).
  • Schema tests — tests/tools/test_annotation_schema.py
    Adds happy-path and invalid-input validation for AnnotationEvent / RecordingAnnotation, save/load round-trip checks, and a FileNotFoundError case.
  • Detection evaluation tests — tests/tools/test_evaluate_detection.py
    Adds DetectionMetrics validation, varied compute_metrics() cases, load_recording_pair WAV decoding (8/16/32-bit) with error cases, integration-level checks of evaluate_recording(), and part of the CLI flow.
  • Optimization tests — tests/tools/test_optimize_thresholds.py
    Adds ThresholdGrid / OptimizeConfig validation, select_best() behavior checks (empty / tie), run_grid_search() combination sweeps with best selection, and CLI path tests.
  • Recording tests — tests/tools/test_record_evaluation.py
    Adds RecordingMetadata validation and tests of record_audio() output (WAV/JSON), automatic ID generation, nested directory creation, and KeyboardInterrupt handling.
  • Documentation / ADR — docs/adr/decision-log.md, docs/adr/decisions/...2026-04-05-...md
    Appends a line to the ADR log and adds a new decision record file.
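
To make the annotation format concrete, here is a minimal sketch of the schema summarized in the table above, written against plain pydantic. The real module uses the project's ContractModel base, and the field names here (event_type, time_s, end_s) are assumptions for illustration only.

from __future__ import annotations

import json
from pathlib import Path
from typing import Literal

from pydantic import BaseModel, model_validator

EventType = Literal["bell", "multi_voice", "quiet", "speech", "noise"]


class AnnotationEvent(BaseModel):
    event_type: EventType  # hypothetical field name
    time_s: float
    end_s: float

    @model_validator(mode="after")
    def _check_order(self) -> "AnnotationEvent":
        # The PR enforces end_s > time_s at validation time.
        if self.end_s <= self.time_s:
            raise ValueError("end_s must be greater than time_s")
        return self


class RecordingAnnotation(BaseModel):
    recording_id: str
    events: list[AnnotationEvent]


def save_annotation(annotation: RecordingAnnotation, path: Path) -> None:
    # UTF-8, indented JSON, per the walkthrough description.
    path.write_text(annotation.model_dump_json(indent=2), encoding="utf-8")


def load_annotation(path: Path) -> RecordingAnnotation:
    if not path.exists():
        raise FileNotFoundError(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    return RecordingAnnotation.model_validate(data)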

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the tooling feature for in-facility recording evaluation and concisely summarizes the PR's main changes (the record_evaluation, evaluate_detection, and optimize_thresholds CLIs).
Linked Issues check ✅ Passed The PR satisfies all requirements of #35: recording CLI, evaluation CLI, optimization CLI, Pydantic annotation schema, comprehensive tests, strict type-check compliance, and privacy handling.
Out of Scope Changes check ✅ Passed All changes are within the scope of #35: the annotation_schema module, the record_evaluation tool, the evaluate_detection tool, the optimize_thresholds tool, the corresponding tests, and the ADR documentation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/task-8c-recording-evaluation-v3

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 22

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@daemon/src/daemon/tools/annotation_schema.py`:
- Around line 41-47: The load_annotation function currently reads and parses
JSON with json.loads then calls RecordingAnnotation.model_validate; replace that
two-step flow by calling RecordingAnnotation.model_validate_json on the file
contents (or pass the path.read_text result directly to model_validate_json) and
remove the now-unused json import; update the function to use
model_validate_json and keep the existing FileNotFoundError behavior.
- Around line 30-32: The RecordingAnnotation model currently declares events:
list[AnnotationEvent] but does not enforce whether the list can be empty; decide
and enforce policy: if empty lists are invalid, change the field to a
constrained list (e.g., use pydantic conlist or Field(..., min_items=1)) or add
a model validator on RecordingAnnotation that raises ValueError when events is
empty; if empty lists are allowed, add a clear docstring to RecordingAnnotation
explaining that events may be empty and describe downstream implications so
callers handle edge cases. Ensure you update the RecordingAnnotation declaration
and/or add the validator/docstring accordingly.

In `@daemon/src/daemon/tools/evaluate_detection.py`:
- Around line 136-142: The loop in evaluate_detection.py currently uses a
placeholder detected = [True] * len(annotation.events) which yields perfect
metrics; replace this with the real detection output or, if you cannot implement
detection in this PR, replace that line with a clear TODO and a short plan for
how to obtain detections (e.g., call to your detector function or model
inference) so evaluations are meaningful. Locate the loop that calls
load_recording_pair and compute_metrics and either (a) call your detection
function (e.g., detector.detect(wav_path) or a new detect_events(_samples,
_sample_rate, annotation)) and set detected to its boolean/event list matching
annotation.events, or (b) add a TODO comment above the placeholder referencing
the intended function name (e.g., detect_events) and why it’s deferred, then
keep the continue-on-exception behavior. Ensure detected aligns in length/format
with annotation.events before calling compute_metrics.
- Around line 45-48: The four separate generator-counts over zip(predicted,
ground_truth, strict=True) should be merged into a single loop to improve
performance: create integer counters tp, fp, fn, tn initialized to 0, iterate
once over zip(predicted, ground_truth, strict=True) and inside the loop check
the pair (p, g) and increment the appropriate counter (tp if p and g, fp if p
and not g, fn if not p and g, tn if not p and not g). Update any references to
the original tp/fp/fn/tn variables to use these counters and preserve the same
semantics as the current code.
- Around line 78-82: The code incorrectly assumes non-16-bit WAVs are float64;
update the sampwidth handling around PCM16_SAMPWIDTH and pcm so each supported
width is explicitly handled: for sampwidth==1, read as np.uint8 then convert to
float in [-1,1] by (pcm.astype(np.float32) - 128)/128; for sampwidth==2 keep the
existing np.int16 -> float conversion; for sampwidth==3, parse 24-bit
little-endian samples by reshaping the raw bytes into (N,3), sign-extending into
int32 then converting to float/divide by 2**23; for sampwidth==4, read as
np.int32 (or np.float32 if the WAV format indicates float) and convert to float
in [-1,1]; finally raise a clear error for unsupported sampwidth values.
Reference symbols: PCM16_SAMPWIDTH, sampwidth, pcm in evaluate_detection.py. (A decoding sketch along these lines appears after this comment list.)
- Around line 141-142: The bare except block "except Exception: continue" in
evaluate_detection.py swallows errors and should be changed to either log the
exception and continue or track failures and report them at the end: capture the
exception object and call the module logger (e.g., logger.exception or
logger.error with traceback) inside that except block, and/or increment a
failure counter (e.g., failed_files += 1) so the calling function (or the end of
the loop) can print a summary of successes vs failures; update the except
handler around the current try/except to include the logging call and add a
final report after the loop to surface how many files failed.
- Around line 97-99: The bug is that evaluate_recording() currently creates
ground_truth = [True] * len(annotation.events), which wrongly treats "quiet"
events as positives; change evaluate_recording() to build ground_truth by
iterating annotation.events and mapping event.type (e.g., "quiet" -> False,
"speech"/"bell" -> True) so ground_truth aligns with each event, ensure
detected_flags = detector(samples, sample_rate, annotation) still aligns
index-wise with this list, and keep compute_metrics() unchanged but add/update
tests to assert that "quiet" yields False entries so FPR/TN are computed
correctly.

In `@daemon/src/daemon/tools/optimize_thresholds.py`:
- Around line 136-144: _real_evaluator currently returns an empty list which
causes run_grid_search to always raise ValueError; replace the placeholder with
a real implementation that instantiates or calls SoundDetectionService using the
provided parameters (sensitivity, energy_gate, speech_prob, overlap_ms) to
evaluate each pair in eval_pairs and return a populated list[EvaluationResult];
if returning empty is intentional for now, mark the function with a clear TODO
referencing the issue (e.g., "TODO: Task `#35` implement optimizer
recommendations") and raise a descriptive error instead of returning [] so
callers fail fast with a helpful message.
- Around line 43-48: select_best currently uses max(results, key=lambda r:
r.score) which returns the first occurrence on ties and makes behavior dependent
on grid iteration order; update select_best to handle equal scores
deterministically (or explicitly document the current tie-breaking). Fix by
either: 1) implementing a deterministic secondary key (e.g., return max by
(r.score, deterministic_repr_of_parameters) or by comparing a stable attribute
on GridSearchResult such as params/config/id) so ties are resolved predictably,
or 2) if the current first-occurrence behavior is intended, update the
select_best docstring to state that ties are broken by iteration order and the
first matching GridSearchResult is returned. Reference: select_best,
GridSearchResult, r.score (and the result attribute you choose for the secondary
deterministic tie-breaker).
- Around line 149-157: The call to run_grid_search currently discards its return
value so CLI users never see the best parameters; change the try block in
optimize_thresholds.py to capture the return (e.g., best =
run_grid_search(...)), then surface that result to the user (print to stdout or
log via the existing logger) and/or return it from the enclosing function so the
CLI can display the optimal parameters; keep the existing ValueError handling
but ensure the successful path returns or outputs the captured best result.
- Around line 80-81: The current eval_results check only detects that the list is empty, so the ValueError raised by select_best cannot distinguish between "the evaluator returned empty results for everything" and "the recording_pairs are invalid". To fix this, in the relevant block of optimize_thresholds.py (references: eval_results, select_best, recording_pairs, evaluator), determine the concrete reason eval_results is empty and, for example, raise an explicit ValueError such as "All evaluators returned empty results" when every evaluator returned nothing, including context such as which evaluator (name or index) and the size of recording_pairs in the log/exception message. (A grid-search sketch covering run_grid_search and select_best appears after this comment list.)

In `@daemon/src/daemon/tools/record_evaluation.py`:
- Around line 54-58: The loop decrements frames_remaining by to_read, but audio_input.read_frames(to_read) may return fewer samples than requested, so decrement by the number of samples actually returned (use len(audio_frames.samples) rather than to_read), and handle the case of zero returned frames by breaking to avoid an infinite loop; update references in the loop that use to_read, including the variables chunk_frames, audio_input.read_frames, audio_frames.samples, frames_remaining, and collected.extend so the loop decrements and exits based on the actual samples read.
- Around line 31-37: The record_audio function currently allows duration_s <= 0
to reach Pydantic's RecordingMetadata validation which yields unclear errors;
add an early fail-fast check at the start of record_audio that validates
duration_s > 0 and raises a clear ValueError (or a custom exception) with a
descriptive message if the check fails, so callers get an immediate, readable
error instead of a downstream Pydantic validation error; reference the
record_audio function and the RecordingMetadata usage to place the check before
constructing or passing data to RecordingMetadata.
- Around line 126-127: The imports are wrong and will fail at runtime: replace
the non-existent module imports for create_audio_input and load_settings with
the correct exported symbols; change the import of create_audio_input from
"daemon.audio.hal" to "daemon.audio" (referencing create_audio_input) and
replace load_settings (which doesn't exist) with the appropriate config loader
such as load_config_directory (or another available symbol like
parse_persona_document, resolve_config_dir, or resolve_config_file) and update
any subsequent uses of load_settings to the chosen function name (e.g.,
load_config_directory) so references in this file match the new import.

In `@tests/tools/test_annotation_schema.py`:
- Around line 138-140: Add tests to cover invalid-JSON and schema-mismatch cases
for load_annotation: create a temporary file containing malformed JSON and
assert load_annotation raises json.JSONDecodeError, and create another file
containing JSON that doesn't match the expected schema (e.g., {"foo":"bar"}) and
assert load_annotation raises pydantic.ValidationError (or ValidationError
imported in the test). Reference the existing test function name
test_load_annotation_missing_file_raises and add two new tests named like
test_load_annotation_invalid_json_raises and
test_load_annotation_schema_mismatch_raises so they run alongside the current
tests.

In `@tests/tools/test_evaluate_detection.py`:
- Around line 115-120: Add a symmetric test for missing annotation files: create
a new test function (e.g., test_load_recording_pair_missing_annotation_raises)
mirroring test_load_recording_pair_missing_wav_raises that writes a WAV via
_write_wav to wav_path and then asserts load_recording_pair(wav_path, tmp_path /
"no.json") raises FileNotFoundError; reference load_recording_pair and
_write_wav so the test covers the missing-annotation case.
- Around line 128-150: The test only covers the perfect-match case using
test_evaluate_recording_with_dummy_detector and a _dummy_detector that always
returns True; add additional tests exercising edge cases for evaluate_recording:
(1) a partial-detection case where _dummy_detector returns a mix of True/False
to assert precision/recall/F1 values less than 1, (2) an all-False detector to
assert F1==0 (and precision/recall as expected), and (3) a detector that raises
an exception to confirm evaluate_recording either propagates or handles the
error as intended; implement these new tests in
tests/tools/test_evaluate_detection.py using the same wav/annotation helpers and
assert the EvaluationResult metrics or error behavior.
- Around line 27-32: Remove the unnecessary wrapper functions _write_wav and
_write_annotation in tests/tools/test_evaluate_detection.py and replace all
their usages with direct calls to the conftest helper functions write_wav and
write_annotation (import them at the top if not already). Delete the _write_wav
and _write_annotation definitions and update any test references to call
write_wav(path, duration_s) and write_annotation(path, annotation) directly to
keep the tests DRY.

In `@tests/tools/test_optimize_thresholds.py`:
- Around line 25-30: Remove the unnecessary wrapper functions _write_wav and
_write_annotation and use the existing conftest helpers directly: replace calls
to _write_wav(...) with write_wav(...) and calls to _write_annotation(...) with
write_annotation(...), then delete the _write_wav and _write_annotation
definitions; ensure imports for write_wav and write_annotation are present or
use the same fixtures/helpers as test_evaluate_detection.py.
- Around line 113-143: The test suite lacks a deterministic tie-breaker test for
select_best when multiple GridSearchResult entries share the highest score; add
a unit test (e.g., test_select_best_tie_returns_first) in
tests/tools/test_optimize_thresholds.py that creates at least two
GridSearchResult objects with identical score values and asserts the expected
deterministic choice (document whether select_best should return the first
matching entry), and if select_best does not already implement this behavior,
update the select_best implementation to consistently return the first
highest-scoring GridSearchResult to make the test pass.

In `@tests/tools/test_record_evaluation.py`:
- Around line 69-78: Add a negative-duration unit test to complement
test_recording_metadata_positive_duration: create a new test (e.g.,
test_recording_metadata_negative_duration_raises) that constructs
RecordingMetadata with duration_s set to a negative value (e.g., -1.0) and
asserts it raises ValidationError; place it alongside the existing tests and use
the same fields (recording_id, wav_path, device_type, sample_rate, timestamp) so
it validates the model's rejection of negative duration_s.
- Around line 86-151: Add tests covering the CLI entrypoint _main: verify normal
startup (pass known args via monkeypatching sys.argv and patching record_audio
to return expected paths) and that _main parses args correctly; add a test that
simulates record_audio raising KeyboardInterrupt and assert _main handles it
(calls cleanup if any) and exits with SystemExit/return code 1; also add a test
for invalid args to ensure argument parsing triggers SystemExit with non-zero
code. Use unique identifiers _main, record_audio, KeyboardInterrupt, and check
exit codes and any logged messages to validate behavior.
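
For the sample-width comment above (evaluate_detection.py lines 78-82), a sketch of the per-width normalization being requested, assuming numpy and the standard-library wave module; the function name is illustrative and not part of the PR:

import wave

import numpy as np


def decode_wav_to_float(path: str) -> tuple[np.ndarray, int]:
    """Decode a PCM WAV file into float samples in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        sampwidth = wf.getsampwidth()
        sample_rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())

    if sampwidth == 1:
        # 8-bit WAV is unsigned; centre on 128 before scaling.
        pcm = np.frombuffer(raw, dtype=np.uint8)
        samples = (pcm.astype(np.float32) - 128.0) / 128.0
    elif sampwidth == 2:
        pcm = np.frombuffer(raw, dtype=np.int16)
        samples = pcm.astype(np.float32) / 32768.0
    elif sampwidth == 3:
        # 24-bit little-endian: reshape to (N, 3), then sign-extend into int32.
        triples = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 3)
        as_int32 = (
            triples[:, 0].astype(np.int32)
            | (triples[:, 1].astype(np.int32) << 8)
            | (triples[:, 2].astype(np.int32) << 16)
        )
        as_int32 = np.where(as_int32 & 0x800000, as_int32 - (1 << 24), as_int32)
        samples = as_int32.astype(np.float32) / float(2**23)
    elif sampwidth == 4:
        pcm = np.frombuffer(raw, dtype=np.int32)
        samples = pcm.astype(np.float32) / float(2**31)
    else:
        raise ValueError(f"Unsupported WAV sample width: {sampwidth}")

    return samples, sample_rate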
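
Likewise, a small sketch of the grid-search shape discussed in the optimize_thresholds comments: exhaustive parameter combinations, a score-maximising select_best (ties resolve to the first occurrence), and an explicit error on empty results. Parameter and field names are assumptions, not the PR's actual API.

from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass
class GridSearchResult:
    sensitivity: float
    energy_gate: float
    speech_prob: float
    overlap_ms: int
    score: float


def select_best(results: list[GridSearchResult]) -> GridSearchResult:
    if not results:
        raise ValueError("Cannot select best from empty results list")
    # max() keeps the first occurrence on ties, so iteration order decides.
    return max(results, key=lambda r: r.score)


def run_grid_search(
    sensitivities: list[float],
    energy_gates: list[float],
    speech_probs: list[float],
    overlaps_ms: list[int],
    score_fn: Callable[[float, float, float, int], float],
) -> GridSearchResult:
    results = [
        GridSearchResult(s, e, p, o, score_fn(s, e, p, o))
        for s, e, p, o in product(sensitivities, energy_gates, speech_probs, overlaps_ms)
    ]
    if not results:
        raise ValueError("Grid search produced no results: empty parameter grid")
    return select_best(results)
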
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bf4f663d-e854-41f0-8683-56f230a4507a

📥 Commits

Reviewing files that changed from the base of the PR and between 1280bd1 and dfcf17b.

📒 Files selected for processing (12)
  • daemon/src/daemon/tools/__init__.py
  • daemon/src/daemon/tools/annotation_schema.py
  • daemon/src/daemon/tools/evaluate_detection.py
  • daemon/src/daemon/tools/optimize_thresholds.py
  • daemon/src/daemon/tools/record_evaluation.py
  • data/evaluation/.gitkeep
  • tests/tools/__init__.py
  • tests/tools/conftest.py
  • tests/tools/test_annotation_schema.py
  • tests/tools/test_evaluate_detection.py
  • tests/tools/test_optimize_thresholds.py
  • tests/tools/test_record_evaluation.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (6)
daemon/src/daemon/tools/record_evaluation.py (2)

31-75: ⚠️ Potential issue | 🟠 Major

Reject duration_s before the WAV is written.

If 0 or negative values are allowed, the current code writes an empty WAV at Line 61 and then fails at the RecordingMetadata validation on Lines 65-72. A failed run leaves recording artifacts behind, so it is safer to fail fast at the entry point.

💡 Suggested fix
 def record_audio(
     *,
     audio_input: AudioInput,
     duration_s: float,
     output_dir: Path,
     recording_id: str | None = None,
 ) -> tuple[Path, Path]:
     """Record audio for *duration_s* seconds and save WAV + metadata JSON.
 
     Returns the (wav_path, metadata_json_path) tuple.
     """
+    if duration_s <= 0:
+        raise ValueError("duration_s must be positive")
     output_dir.mkdir(parents=True, exist_ok=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/record_evaluation.py` around lines 31 - 75, The
record_audio function should validate the duration_s input up-front and
fail-fast to avoid creating empty WAV/metadata files: check that duration_s is a
positive number (e.g., > 0) at the start of record_audio and raise a ValueError
(or appropriate exception) before any I/O; do not proceed to compute frames,
call audio_input.read_frames, or invoke _write_wav/RecordingMetadata if the
duration is invalid. Reference: record_audio, audio_input.read_frames,
_write_wav, and RecordingMetadata to locate where to add the validation.

54-58: ⚠️ Potential issue | 🟠 Major

A short read_frames() is wrongly treated as complete.

Because Line 58 subtracts to_read, the recording silently ends up shorter whenever the HAL returns a chunk smaller than requested. Subtract the number of frames actually read, and fail explicitly if zero frames are returned.

💡 Suggested fix
     while frames_remaining > 0:
         to_read = min(chunk_frames, frames_remaining)
         audio_frames = audio_input.read_frames(to_read)
-        collected.extend(audio_frames.samples)
-        frames_remaining -= to_read
+        frames_read = len(audio_frames.samples)
+        if frames_read == 0:
+            raise RuntimeError("audio input returned no frames during recording")
+        collected.extend(audio_frames.samples)
+        frames_remaining -= frames_read
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/record_evaluation.py` around lines 54 - 58, The loop
incorrectly subtracts the requested to_read from frames_remaining which hides
short reads; change the loop in record_evaluation.py to use the actual number of
frames returned by audio_input.read_frames (e.g., use audio_frames.num_frames if
available or compute frames_read from len(audio_frames.samples)/channels) to
decrement frames_remaining, and if frames_read == 0 raise/return an explicit
error instead of silently finishing; update references to
audio_input.read_frames, audio_frames, frames_remaining and collected
accordingly so collected.extend still uses the actual samples read.
daemon/src/daemon/tools/evaluate_detection.py (1)

157-166: ⚠️ Potential issue | 🔴 Critical

_main never uses the detector.

Lines 159-161 pass ground_truth straight into compute_metrics(), so the CLI always evaluates a perfect match. That is not a meaningful accuracy or VAD evaluation on real recordings, and returning 0 as-is is misleading. At minimum, do not treat the run as successful until each pair is processed via evaluate_recording(..., detector=...) and the results are aggregated and reported.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/evaluate_detection.py` around lines 157 - 166, The
loop in _main currently computes metrics by passing ground_truth to
compute_metrics directly, never using the detector; change it to call
evaluate_recording(wav_path, ann_path, detector=detector) (or pass the loaded
samples to evaluate_recording) to obtain the detected_event list, then call
compute_metrics(ground_truth, detected_events) for each pair, aggregate the
per-file results (e.g., accumulate counts/metrics or store results) and only
treat the run as successful (non-zero/zero exit) after producing a combined
summary output; update the exception path to continue on errors but ensure the
final return code reflects whether any recordings were actually evaluated.
daemon/src/daemon/tools/optimize_thresholds.py (3)

84-85: ⚠️ Potential issue | 🟠 Major

When the evaluation results are empty for every parameter combination, the cause of the exception is unclear.

This is currently delegated to select_best's generic message, which makes it hard to tell that the evaluator returned empty results for everything. Fail explicitly inside run_grid_search.

♻️ Suggested patch
-    return select_best(all_results)
+    if not all_results:
+        msg = (
+            "No valid evaluation results: evaluator returned empty results for "
+            f"all parameter combinations (recording_pairs={len(recording_pairs)})"
+        )
+        raise ValueError(msg)
+    return select_best(all_results)

Also applies to: 102-102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/optimize_thresholds.py` around lines 84 - 85, When
run_grid_search finds that eval_results is empty for a parameter trial (the "if
not eval_results: continue" branch), change behavior to raise a clear exception
instead of silently continuing so callers don't get a generic select_best error;
update the logic in run_grid_search to detect if every trial produced empty
eval_results and raise a RuntimeError (or custom exception) with a message like
"All evaluations returned empty results — evaluator returned no metrics for any
parameter set" so the failure is explicit (also update the analogous check at
the other occurrence around the lines referenced). Ensure the exception
references the evaluator/context (e.g., include evaluator name or grid id) to
aid debugging and leave select_best unchanged.

140-148: ⚠️ Potential issue | 🔴 Critical

_real_evaluator always returns an empty list, so the CLI effectively always fails.

With this implementation, run_grid_search never accumulates a single valid result and the optimization feature cannot work. At minimum, make the unimplemented state explicit and fail fast (ideally, provide a real implementation).

🛠 Minimal fail-fast fix example
     def _real_evaluator(
         sensitivity: float,  # noqa: ARG001
         energy_gate: float,  # noqa: ARG001
         speech_prob: float,  # noqa: ARG001
         overlap_ms: int,  # noqa: ARG001
         eval_pairs: list[tuple[Path, Path]],  # noqa: ARG001
     ) -> list[EvaluationResult]:
-        # Placeholder: in production wire up SoundDetectionService with these params
-        return []
+        msg = (
+            "Optimizer evaluator is not implemented. "
+            "Wire SoundDetectionService evaluation for Task `#35`."
+        )
+        raise NotImplementedError(msg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/optimize_thresholds.py` around lines 140 - 148, The
_real_evaluator function currently always returns an empty list causing
run_grid_search to never receive results; replace the placeholder by failing
fast: raise a NotImplementedError (or a RuntimeError with a clear message) from
_real_evaluator so the CLI surfaces the unimplemented evaluator instead of
silently succeeding, or preferably implement the evaluator by invoking the real
SoundDetectionService evaluation path (e.g., call the service/evaluate method to
produce a list[EvaluationResult] from eval_pairs using the sensitivity,
energy_gate, speech_prob, and overlap_ms parameters) so run_grid_search receives
actual results; update any callers/tests expecting list behavior accordingly.

153-163: ⚠️ Potential issue | 🟠 Major

The optimization result is discarded, so CLI users cannot obtain the recommended values.

Print the return value of run_grid_search on success.

♻️ Suggested patch
     try:
-        run_grid_search(
+        best = run_grid_search(
             recording_pairs=pairs,
             grid=grid,
             config=cfg,
             evaluator=_real_evaluator,
         )
     except ValueError:
         return 1
 
+    print(best.model_dump_json(indent=2))
     return 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/optimize_thresholds.py` around lines 153 - 163,
the current optimization flow discards the return value of run_grid_search, so CLI users never receive the recommended values. In the relevant block of optimize_thresholds.py, capture the return value of run_grid_search(...) in a variable (e.g., result or best_params) and, on success, print it to stdout or log it in a format readable from the CLI (e.g., via print/json.dump/processLogger.info). Keep the error branch as-is and preserve the existing return 1 on ValueError. Ensure to reference the run_grid_search call and use the returned value for output.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@daemon/src/daemon/tools/evaluate_detection.py`:
- Around line 40-41: The current DetectorCallable signature
(Callable[[FloatArray, int, RecordingAnnotation], list[bool]]) only returns
per-annotation hit/miss and cannot represent detection timestamps, latency,
false-positive times or VAD interval matches; update the detector API to return
a structured DetectionResult (e.g., include fields: detected_times or
detected_intervals, confidence, matched_annotation_id or None,
detection_latency_ms, vad_start/end_overlap) and change all usages in
evaluate_recording and EvaluationResult to consume that structure (modify
evaluate_recording, any code handling detector outputs, and extend
EvaluationResult to aggregate latency, FP times, and VAD comparisons
accordingly); ensure RecordingAnnotation stays as input and adapt code paths
referenced around DetectorCallable and the blocks at lines ~109-123 to handle
the new DetectionResult fields.

In `@daemon/src/daemon/tools/record_evaluation.py`:
- Around line 99-115: The parser in _build_parser currently only accepts
--duration and --output and lacks any retention controls; add CLI options such
as a boolean --ephemeral (to auto-delete files after evaluation) and a numeric
--retention-days (or --retain-seconds) with sensible defaults, update the
argument parsing usage in the recording flow to honor these flags (if
--ephemeral delete files immediately after processing; if --retention-days
schedule or record an expiry timestamp in metadata and ensure cleanup logic
removes expired files), and ensure the output Path default remains but that
downstream save/cleanup code checks the new args before persisting files.

In `@tests/tools/test_evaluate_detection.py`:
- Around line 288-326: The test only asserts exit code and an error log, so
change it to also verify that per-pair evaluation actually ran and emitted
metrics: monkeypatch or spy on evaluate_detection.evaluate_recording (or the
detector function used by _main) to record calls and return a realistic metrics
dict (e.g., non-empty metrics) for the good pair, then call _main and assert
evaluate_recording was invoked for the good WAV (and not for the bad WAV which
triggers the loader error) and that the returned/printed metrics string appears
in caplog.text (or stdout capture) instead of relying solely on exit code;
reference evaluate_detection._main and evaluate_detection.evaluate_recording (or
the detector function used inside _main) when adding the spy and the assertions.

In `@tests/tools/test_optimize_thresholds.py`:
- Around line 178-227: The test test_run_grid_search_returns_best currently only
varies sensitivity_threshold so other grid axes aren't exercised; update the
ThresholdGrid in this test to provide multiple values for at least one
additional axis (e.g., energy_gate_threshold and/or speech_prob_threshold and/or
overlap_hold_ms) so the grid is truly multi-dimensional, then update
expected_call_count to equal the product of the lengths of all threshold lists
and keep using run_grid_search and the _mock_evaluator to verify the evaluator
was called that many times and that best.sensitivity_threshold remains the
expected best value; reference: test_run_grid_search_returns_best,
ThresholdGrid, run_grid_search, and _mock_evaluator.
- Around line 262-316: The tests only assert exit codes and therefore miss
regressions in the CLI output (the "current vs recommended" table and
recommended thresholds); update the two tests
test_main_returns_0_when_grid_search_succeeds and
test_main_returns_1_when_grid_search_raises to capture stdout/stderr (use
pytest's capsys) and assert that the selected metric string (e.g., "precision")
and both the "current" and "recommended" threshold values or table headers
appear in stdout when run_grid_search returns a GridSearchResult, and in the
failing case assert the appropriate error/failure message is printed to
stderr/stdout; locate the CLI entrypoint function _main and monkeypatched
optimize_thresholds.run_grid_search to validate output content rather than only
exit codes.

In `@tests/tools/test_record_evaluation.py`:
- Around line 72-94: Add an invalid-input test that exercises record_audio()
directly (not just RecordingMetadata) to ensure it immediately raises for
duration_s <= 0 and produces no files; call record_audio(...) with a zero or
negative duration and assert it raises the appropriate exception, then assert
the expected output paths (WAV and JSON) do not exist; reference the
record_audio function and the RecordingMetadata model so the test builds the
same input object but verifies record_audio's error-path behavior and absence of
any created artifacts.
- Around line 101-167: The tests only check artifact presence but not
privacy/retention behavior; update the record_audio tests (e.g.,
test_record_audio_creates_wav_and_metadata, test_record_audio_auto_generates_id,
test_record_audio_output_dir_created) to assert explicit privacy behavior: call
record_audio with the retention/opt_in parameter (or simulate the cleanup flow)
and then assert either that the metadata includes retention fields like
"retention_policy" or "expires_at" and/or that files (wav_path and meta_path)
are deleted after invoking the cleanup/rollback routine; use the same symbols
from the diff (record_audio, wav_path, meta_path, and the test helpers like
_FakeAudioInput) so the tests validate deletion/expiration or required opt-in
rather than only existence.

---

Duplicate comments:
In `@daemon/src/daemon/tools/evaluate_detection.py`:
- Around line 157-166: The loop in _main currently computes metrics by passing
ground_truth to compute_metrics directly, never using the detector; change it to
call evaluate_recording(wav_path, ann_path, detector=detector) (or pass the
loaded samples to evaluate_recording) to obtain the detected_event list, then
call compute_metrics(ground_truth, detected_events) for each pair, aggregate the
per-file results (e.g., accumulate counts/metrics or store results) and only
treat the run as successful (non-zero/zero exit) after producing a combined
summary output; update the exception path to continue on errors but ensure the
final return code reflects whether any recordings were actually evaluated.

In `@daemon/src/daemon/tools/optimize_thresholds.py`:
- Around line 84-85: When run_grid_search finds that eval_results is empty for a
parameter trial (the "if not eval_results: continue" branch), change behavior to
raise a clear exception instead of silently continuing so callers don't get a
generic select_best error; update the logic in run_grid_search to detect if
every trial produced empty eval_results and raise a RuntimeError (or custom
exception) with a message like "All evaluations returned empty results —
evaluator returned no metrics for any parameter set" so the failure is explicit
(also update the analogous check at the other occurrence around the lines
referenced). Ensure the exception references the evaluator/context (e.g.,
include evaluator name or grid id) to aid debugging and leave select_best
unchanged.
- Around line 140-148: The _real_evaluator function currently always returns an
empty list causing run_grid_search to never receive results; replace the
placeholder by failing fast: raise a NotImplementedError (or a RuntimeError with
a clear message) from _real_evaluator so the CLI surfaces the unimplemented
evaluator instead of silently succeeding, or preferably implement the evaluator
by invoking the real SoundDetectionService evaluation path (e.g., call the
service/evaluate method to produce a list[EvaluationResult] from eval_pairs
using the sensitivity, energy_gate, speech_prob, and overlap_ms parameters) so
run_grid_search receives actual results; update any callers/tests expecting list
behavior accordingly.
- Around line 153-163: The current optimization flow discards the return value of run_grid_search, so CLI users never receive the recommended values. In the relevant block of optimize_thresholds.py, capture the return value of run_grid_search(...) in a variable (e.g., result or best_params) and, on success, print it to stdout or log it in a format readable from the CLI (e.g., via print/json.dump/processLogger.info). Keep the error branch as-is and preserve the existing return 1 on ValueError. Ensure to reference the run_grid_search call and use the returned value for output.

In `@daemon/src/daemon/tools/record_evaluation.py`:
- Around line 31-75: The record_audio function should validate the duration_s
input up-front and fail-fast to avoid creating empty WAV/metadata files: check
that duration_s is a positive number (e.g., > 0) at the start of record_audio
and raise a ValueError (or appropriate exception) before any I/O; do not proceed
to compute frames, call audio_input.read_frames, or invoke
_write_wav/RecordingMetadata if the duration is invalid. Reference:
record_audio, audio_input.read_frames, _write_wav, and RecordingMetadata to
locate where to add the validation.
- Around line 54-58: The loop incorrectly subtracts the requested to_read from
frames_remaining which hides short reads; change the loop in
record_evaluation.py to use the actual number of frames returned by
audio_input.read_frames (e.g., use audio_frames.num_frames if available or
compute frames_read from len(audio_frames.samples)/channels) to decrement
frames_remaining, and if frames_read == 0 raise/return an explicit error instead
of silently finishing; update references to audio_input.read_frames,
audio_frames, frames_remaining and collected accordingly so collected.extend
still uses the actual samples read.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0d407e9c-c102-4480-85b2-856ba1bd8819

📥 Commits

Reviewing files that changed from the base of the PR and between dfcf17b and d8b7131.

📒 Files selected for processing (9)
  • daemon/src/daemon/tools/annotation_schema.py
  • daemon/src/daemon/tools/evaluate_detection.py
  • daemon/src/daemon/tools/optimize_thresholds.py
  • daemon/src/daemon/tools/record_evaluation.py
  • docs/adr/decision-log.md
  • docs/adr/decisions/2026-04-05-address-pr-138-review-findings-for-recording-evaluation-tools-coverage-wav-decoding-ground-truth-mapping-and-tests.md
  • tests/tools/test_evaluate_detection.py
  • tests/tools/test_optimize_thresholds.py
  • tests/tools/test_record_evaluation.py

@ToaruPen
Owner Author

ToaruPen commented Apr 5, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
daemon/src/daemon/tools/record_evaluation.py (3)

129-133: ⚠️ Potential issue | 🔴 Critical

The code depends on imports that do not exist and will fail at runtime.

daemon.audio.hal on Line 129 and load_settings on Line 130 are inconsistent with the related code shown. Fix the imports to reference public symbols that actually exist, and update the subsequent calls accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/record_evaluation.py` around lines 129 - 133, The
imports in record_evaluation.py reference non-existent symbols; replace the
incorrect imports (create_audio_input from daemon.audio.hal and load_settings)
with the actual public symbols used elsewhere (e.g., the real audio factory and
config loader used in the project) and update the calls accordingly: import the
correct config loader (the project's actual load_settings or equivalent) to
obtain settings, and import the real audio input factory (the module/class that
exposes the audio input constructor used elsewhere) so the call audio_input =
create_audio_input(settings.audio) uses the real factory name; search for the
project's canonical config loader and audio factory (symbols referenced
elsewhere) and swap the import names and function call in record_evaluation.py
to match those existing public symbols.

102-118: ⚠️ Potential issue | 🟠 Major

There is no way to specify a recording retention policy, so the privacy requirements cannot be met.

--duration / --output alone cannot support "ephemeral or configurable retention" operation. Support --ephemeral and a retention period (e.g., --retention-days) in both the CLI and the metadata. (A minimal argparse sketch appears after this review's comment list.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/record_evaluation.py` around lines 102 - 118, The CLI
parser in _build_parser currently only exposes --duration and --output and lacks
retention policy options; add two new arguments --ephemeral (store_true) and
--retention-days (int, default maybe 0) and make them mutually
exclusive/validated so users choose ephemeral OR a retention period; update any
downstream metadata creation code that writes evaluation metadata (the
function(s) that save WAV and metadata in this module) to include a
retention_policy field reflecting either "ephemeral": true or "retention_days":
N so the on-disk metadata contains the chosen policy; ensure argument names
referenced are --ephemeral and --retention-days and that validation occurs after
parsing in the same module.

57-61: ⚠️ Potential issue | 🟠 Major

Unless progress is updated with the number of frames actually read, the recording length breaks.

Because Line 61 subtracts a fixed to_read, the actual recording time falls short in environments where read_frames() returns fewer frames. The loop also needs to exit when zero frames are returned.

♻️ Suggested fix
     while frames_remaining > 0:
         to_read = min(chunk_frames, frames_remaining)
         audio_frames = audio_input.read_frames(to_read)
-        collected.extend(audio_frames.samples)
-        frames_remaining -= to_read
+        actual_read = len(audio_frames.samples)
+        if actual_read == 0:
+            break
+        collected.extend(audio_frames.samples)
+        frames_remaining -= actual_read
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/record_evaluation.py` around lines 57 - 61,
the loop updates progress by subtracting a fixed to_read, so the actual recording becomes shorter in environments where audio_input.read_frames(to_read) returns fewer frames. Compute the number of frames actually read from the return value of audio_input.read_frames (for example audio_frames.num_frames, or the length of samples divided by the channel count), subtract that actual count from frames_remaining after collected.extend(...), and break out of the loop if the actual count is 0 (references: while frames_remaining > 0, to_read, audio_input.read_frames, audio_frames.samples).
tests/tools/test_record_evaluation.py (1)

96-105: ⚠️ Potential issue | 🟠 Major

In the invalid-input tests, also verify directly that no artifacts are left behind.

Currently only the exception is checked; the side-effect safety (no WAV/JSON created) is unverified. Additionally assert that tmp_path remains empty.
As per coding guidelines, tests/**/*.py: Core backend behavior should cover happy-path, invalid-input, and error-path cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tools/test_record_evaluation.py` around lines 96 - 105, Update the two
invalid-input tests (test_record_audio_zero_duration_raises and
test_record_audio_negative_duration_raises) to also verify there are no
side-effects by asserting the temporary output directory remains empty after the
exception from record_audio; after the pytest.raises block use
tmp_path.iterdir() (or list(tmp_path.iterdir())) to assert no files were created
and reference the same _FakeAudioInput instance and output_dir=tmp_path
parameters to ensure you’re checking the correct location.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@daemon/src/daemon/tools/record_evaluation.py`:
- Around line 52-53: The calculation for total_frames can become zero for very
small positive duration_s, causing empty WAVs and later failures; change the
computation to guarantee at least one frame (e.g., set total_frames = max(1,
int(sample_rate * duration_s)) or enforce duration_s >= 1/sample_rate) and
update any later places that derive frames/duration (the block that creates
RecordingMetadata(duration_s=...)) to use the adjusted non-zero duration/frames;
apply the same minimum-frame guard to the duplicate logic referenced (lines
~66-76) so both code paths never produce total_frames == 0 and errors are
avoided.

In `@tests/tools/test_record_evaluation.py`:
- Around line 187-224: The test lacks verification that the audio input was
closed; update the stub used in test_main_records_and_closes_audio_input so its
_FakeAudioInput exposes a closed flag (e.g., self.closed = False and a close()
method that sets it True), ensure the monkeypatched
daemon.audio.hal.create_audio_input returns that stub instance, and after
calling record_evaluation._main([...]) assert that audio_input.closed is True
(in addition to the existing calls/asserts) to confirm close() was invoked.

---

Duplicate comments:
In `@daemon/src/daemon/tools/record_evaluation.py`:
- Around line 129-133: The imports in record_evaluation.py reference
non-existent symbols; replace the incorrect imports (create_audio_input from
daemon.audio.hal and load_settings) with the actual public symbols used
elsewhere (e.g., the real audio factory and config loader used in the project)
and update the calls accordingly: import the correct config loader (the
project's actual load_settings or equivalent) to obtain settings, and import the
real audio input factory (the module/class that exposes the audio input
constructor used elsewhere) so the call audio_input =
create_audio_input(settings.audio) uses the real factory name; search for the
project's canonical config loader and audio factory (symbols referenced
elsewhere) and swap the import names and function call in record_evaluation.py
to match those existing public symbols.
- Around line 102-118: The CLI parser in _build_parser currently only exposes
--duration and --output and lacks retention policy options; add two new
arguments --ephemeral (store_true) and --retention-days (int, default maybe 0)
and make them mutually exclusive/validated so users choose ephemeral OR a
retention period; update any downstream metadata creation code that writes
evaluation metadata (the function(s) that save WAV and metadata in this module)
to include a retention_policy field reflecting either "ephemeral": true or
"retention_days": N so the on-disk metadata contains the chosen policy; ensure
argument names referenced are --ephemeral and --retention-days and that
validation occurs after parsing in the same module.
- Around line 57-61: The loop updates progress by subtracting a fixed to_read, so the actual recording becomes shorter in environments where audio_input.read_frames(to_read) returns fewer frames. Compute the number of frames actually read from the return value of audio_input.read_frames (for example audio_frames.num_frames, or the length of samples divided by the channel count), subtract that actual count from frames_remaining after collected.extend(...), and break out of the loop if the actual count is 0 (references: while frames_remaining > 0, to_read, audio_input.read_frames, audio_frames.samples).

In `@tests/tools/test_record_evaluation.py`:
- Around line 96-105: Update the two invalid-input tests
(test_record_audio_zero_duration_raises and
test_record_audio_negative_duration_raises) to also verify there are no
side-effects by asserting the temporary output directory remains empty after the
exception from record_audio; after the pytest.raises block use
tmp_path.iterdir() (or list(tmp_path.iterdir())) to assert no files were created
and reference the same _FakeAudioInput instance and output_dir=tmp_path
parameters to ensure you’re checking the correct location.
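
As an illustration of the retention-policy suggestion in this review, a minimal argparse sketch with the proposed --ephemeral / --retention-days flags. These options do not exist in the PR yet; the flag names come from the review comment, and the defaults here are assumptions.

import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Record an evaluation clip")
    parser.add_argument("--duration", type=float, required=True, help="Recording length in seconds")
    parser.add_argument("--output", type=Path, default=Path("data/evaluation"), help="Output directory")
    # Proposed retention controls: delete right after evaluation, or keep for N days.
    retention = parser.add_mutually_exclusive_group()
    retention.add_argument("--ephemeral", action="store_true", help="Delete files after evaluation")
    retention.add_argument("--retention-days", type=int, default=None, help="Keep files for N days")
    return parser
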
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a4005bed-c5ce-449e-97df-c5c7eba381cc

📥 Commits

Reviewing files that changed from the base of the PR and between d8b7131 and c93e9df.

📒 Files selected for processing (2)
  • daemon/src/daemon/tools/record_evaluation.py
  • tests/tools/test_record_evaluation.py

@ToaruPen
Owner Author

ToaruPen commented Apr 5, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ToaruPen
Owner Author

ToaruPen commented Apr 5, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ToaruPen ToaruPen closed this Apr 5, 2026
@ToaruPen ToaruPen reopened this Apr 5, 2026
@ToaruPen ToaruPen force-pushed the feat/task-8c-recording-evaluation-v3 branch 3 times, most recently from e4d18a9 to c9a2c21 on April 5, 2026 08:58
ToaruPen and others added 6 commits April 5, 2026 17:59
…etrics, and threshold optimizer

Add CLI tools for calibrating sound detection and VAD thresholds using real
facility recordings: record_evaluation (capture + WAV export), evaluate_detection
(precision/recall/F1 against annotated ground truth), and optimize_thresholds
(grid search over detection parameters). Includes Pydantic annotation schema
with validation, shared test fixtures, and 43 passing tests.

Closes #35

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ToaruPen ToaruPen force-pushed the feat/task-8c-recording-evaluation-v3 branch from c9a2c21 to 51de3e8 on April 5, 2026 08:59
coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 5, 2026
@ToaruPen
Owner Author

ToaruPen commented Apr 5, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (5)
tests/tools/test_optimize_thresholds.py (1)

29-35: 🧹 Nitpick | 🔵 Trivial

The wrapper functions are unnecessary.

Already flagged in a previous review: _write_wav and _write_annotation simply call the conftest helpers. Importing and using those helpers directly is DRYer.

♻️ Suggested change
-def _write_wav(path: Path, duration_s: float = 1.0) -> None:
-    write_wav(path, duration_s)
-
-
-def _write_annotation(path: Path, annotation: RecordingAnnotation) -> None:
-    write_annotation(path, annotation)

# Use write_wav and write_annotation directly inside the tests
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tools/test_optimize_thresholds.py` around lines 29 - 35, Remove the
unnecessary wrapper functions _write_wav and _write_annotation and update the
test to call the conftest helper functions directly; locate the two wrappers in
tests/tools/test_optimize_thresholds.py (functions named _write_wav and
_write_annotation) and delete them, then import the corresponding helper symbols
(write_wav and write_annotation or the actual helper names from conftest) at the
top of the test file and replace any calls to _write_wav/_write_annotation with
direct calls to those imported helpers.
daemon/src/daemon/tools/annotation_schema.py (1)

29-31: 🧹 Nitpick | 🔵 Trivial

Recommend stating the empty-events policy explicitly in the docstring.

Already flagged in a previous review: if an empty events list is allowed by design, saying so in the docstring makes downstream handling of edge cases such as division by zero explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/annotation_schema.py` around lines 29 - 31, Add a
docstring to the RecordingAnnotation model that explicitly states the events
list may be empty and the expected semantics (e.g., empty list is valid,
consumers must handle zero-length events to avoid divide-by-zero or aggregation
errors); update the RecordingAnnotation class docstring to mention the type of
items (AnnotationEvent), whether None is allowed (no), and any recommended
handling for downstream code that consumes events so behavior is clear to
readers of the ContractModel.
tests/tools/test_annotation_schema.py (1)

138-140: ⚠️ Potential issue | 🟡 Minor

Error-path tests for invalid JSON and schema mismatches are missing.

Already flagged in a previous review: there are no tests for load_annotation reading invalid (unparseable) JSON or JSON that does not match the schema.

💚 Example additional tests
def test_load_annotation_invalid_json_raises(tmp_path: Path) -> None:
    filepath = tmp_path / "invalid.json"
    filepath.write_text("not valid json", encoding="utf-8")
    with pytest.raises(json.JSONDecodeError):
        load_annotation(filepath)


def test_load_annotation_schema_mismatch_raises(tmp_path: Path) -> None:
    filepath = tmp_path / "bad_schema.json"
    filepath.write_text('{"foo": "bar"}', encoding="utf-8")
    with pytest.raises(ValidationError):
        load_annotation(filepath)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tools/test_annotation_schema.py` around lines 138 - 140, Add two tests
to tests/tools/test_annotation_schema.py that verify load_annotation raises the
correct exceptions for malformed input: create
test_load_annotation_invalid_json_raises which writes an invalid JSON string to
a temp file and asserts load_annotation(filepath) raises json.JSONDecodeError,
and create test_load_annotation_schema_mismatch_raises which writes a JSON that
doesn't match the schema (e.g., {"foo":"bar"}) and asserts
load_annotation(filepath) raises pydantic.ValidationError; reference the
existing test_load_annotation_missing_file_raises for style and use tmp_path and
load_annotation as in the file.
tests/tools/test_evaluate_detection.py (1)

32-38: 🧹 Nitpick | 🔵 Trivial

The wrapper functions are unnecessary.

Already flagged in a previous review: _write_wav and _write_annotation simply forward to the conftest helpers.

♻️ Suggestion
-def _write_wav(path: Path, duration_s: float = 1.0) -> None:
-    write_wav(path, duration_s)
-
-
-def _write_annotation(path: Path, annotation: RecordingAnnotation) -> None:
-    write_annotation(path, annotation)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tools/test_evaluate_detection.py` around lines 32 - 38, Remove the
unnecessary wrapper functions _write_wav and _write_annotation in
tests/tools/test_evaluate_detection.py: these only forward to the conftest
helpers (write_wav, write_annotation), so delete the two functions and update
any test code in the file that calls _write_wav or _write_annotation to call
write_wav and write_annotation directly (ensure the conftest helpers are
imported or available in the test context).
daemon/src/daemon/tools/optimize_thresholds.py (1)

84-85: 🧹 Nitpick | 🔵 Trivial

The error message is unclear when every evaluator returns an empty result.

When all evaluation results are empty, select_best raises ValueError("Cannot select best from empty results list"), which gives no way to tell whether every evaluator returned empty results or recording_pairs itself was invalid. A fuller sketch of the suggested guard follows the AI prompt below.

♻️ Suggested fix
+    if not all_results:
+        msg = "Grid search produced no results: all evaluator calls returned empty"
+        raise ValueError(msg)
     return select_best(all_results)

Also applies to: 102-102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@daemon/src/daemon/tools/optimize_thresholds.py` around lines 84 - 85, When
all evaluators return empty results the later call to select_best raises a
generic ValueError; change the logic around the eval_results collection (the
block using eval_results, evaluators, and recording_pairs) to detect the case
where every evaluator produced no results and raise a clearer ValueError before
calling select_best. Specifically, after iterating evaluators and before
invoking select_best, check if the aggregated list of eval_results is empty and
raise ValueError("No evaluation results: all evaluators returned empty for given
recording_pairs; evaluators=[...], recording_pairs_summary=...") — include
evaluator identifiers (from evaluators) and a brief summary of recording_pairs
to aid debugging.
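For reference, a fuller sketch of the suggested guard in context. The function shape and the names run_grid_search, recording_pairs, and evaluators are assumed from the review comment, not copied from the actual optimize_thresholds.py, and select_best is reduced to a trivial stand-in.

from typing import Any, Callable


def select_best(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Trivial stand-in: pick the result with the highest F1 score."""
    return max(results, key=lambda r: r.get("f1", 0.0))


def run_grid_search(
    recording_pairs: list[tuple[str, str]],
    evaluators: list[Callable[[list[tuple[str, str]]], list[dict[str, Any]]]],
) -> dict[str, Any]:
    """Collect results from every evaluator, then select the best one."""
    all_results: list[dict[str, Any]] = []
    for evaluator in evaluators:
        # An evaluator may legitimately return an empty list for some pairs.
        all_results.extend(evaluator(recording_pairs))
    if not all_results:
        # Fail early with a message that names the actual cause instead of
        # letting select_best raise a generic "empty results list" error.
        msg = (
            "Grid search produced no results: all evaluator calls returned "
            f"empty for {len(recording_pairs)} recording pair(s)"
        )
        raise ValueError(msg)
    return select_best(all_results)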
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@daemon/src/daemon/tools/evaluate_detection.py`:
- Around line 187-214: Add a clear TODO inside the _placeholder_detector
function to mark it as a temporary stub that always returns detected=False and
must be replaced with real detection logic; update the _placeholder_detector
docstring or top of the function body to include a TODO note (e.g., "TODO:
implement real detection logic; currently returns all events as not detected")
and optionally reference evaluate_recording and DetectionEventResult so future
implementers know this stub is used by evaluate_recording and should produce
proper DetectionEventResult values when implemented.
- Around line 106-114: The current branch in evaluate_detection.py treats
sampwidth==4 as always int32 (PCM32_SAMPWIDTH) which misinterprets 32-bit IEEE
float WAV data; update the loading logic for sampwidth==PCM32_SAMPWIDTH to
detect whether the WAV is 32-bit float vs 32-bit int by using a reliable source
of format info (e.g., switch to soundfile.read or scipy.io.wavfile.read which
returns dtype, or parse the WAV fmt chunk manually) and then call
np.frombuffer(raw, dtype=np.float32).astype(np.float64) / 1.0 for float32 or
np.int32 conversion + PCM32_SCALE for int32; adjust the branch handling around
sampwidth, PCM32_SAMPWIDTH, and the conversion code accordingly and add a unit
test covering a 32-bit-float WAV sample to ensure correct behavior.

In `@daemon/src/daemon/tools/optimize_thresholds.py`:
- Around line 158-163: The code mixes the field names energy_gate_threshold
(used in ThresholdGrid and GridSearchResult) and gate_rms_threshold (the actual
field on SoundDetectionSettings), which is confusing; pick a single canonical
name (either energy_gate_threshold or gate_rms_threshold) and make them
consistent across ThresholdGrid, GridSearchResult and any code that reads/writes
settings, or add an explicit mapping layer and comment where values are
translated (for example update current_thresholds in optimize_thresholds.py to
map settings.sound_detection.gate_rms_threshold -> energy_gate_threshold or
rename ThresholdGrid/GridSearchResult fields to gate_rms_threshold) and add a
short comment documenting the chosen correspondence so future maintainers can
find the translation between SoundDetectionSettings.gate_rms_threshold and the
grid/result field.

In `@daemon/src/daemon/tools/record_evaluation.py`:
- Around line 96-100: The function currently returns wav_path and meta_path
after deleting them when ephemeral is True, causing callers to get paths that no
longer exist; update the function (in record_evaluation.py) so that when
ephemeral is True it returns (None, None) instead of deleted wav_path/meta_path
and add a clear docstring note for the function describing that when
ephemeral=True the file paths will be deleted and None is returned for both
wav_path and meta_path; ensure callers that rely on wav_path/meta_path handle
None accordingly and reference the parameters ephemeral and the return values
wav_path/meta_path in the docstring.

In
`@docs/adr/decisions/2026-04-05-address-pr-138-review-findings-for-recording-evaluation-tools-coverage-wav-decoding-ground-truth-mapping-and-tests.md`:
- Around line 7-8: The files and adr_paths arrays in this ADR are empty; please list the
files actually changed or added in PR #138 and any related ADR files. Concretely, add the
modified source/test/schema paths (e.g. daemon/src/daemon/tools/... or tests/tools/...)
to the files array, and add the names/paths of the ADR files related to this decision to
adr_paths so the document states clearly which files it affects (the fields to edit are
files and adr_paths).

---

Duplicate comments:
In `@daemon/src/daemon/tools/annotation_schema.py`:
- Around line 29-31: Add a docstring to the RecordingAnnotation model that
explicitly states the events list may be empty and the expected semantics (e.g.,
empty list is valid, consumers must handle zero-length events to avoid
divide-by-zero or aggregation errors); update the RecordingAnnotation class
docstring to mention the type of items (AnnotationEvent), whether None is
allowed (no), and any recommended handling for downstream code that consumes
events so behavior is clear to readers of the ContractModel.

In `@daemon/src/daemon/tools/optimize_thresholds.py`:
- Around line 84-85: When all evaluators return empty results the later call to
select_best raises a generic ValueError; change the logic around the
eval_results collection (the block using eval_results, evaluators, and
recording_pairs) to detect the case where every evaluator produced no results
and raise a clearer ValueError before calling select_best. Specifically, after
iterating evaluators and before invoking select_best, check if the aggregated
list of eval_results is empty and raise ValueError("No evaluation results: all
evaluators returned empty for given recording_pairs; evaluators=[...],
recording_pairs_summary=...") — include evaluator identifiers (from evaluators)
and a brief summary of recording_pairs to aid debugging.

In `@tests/tools/test_annotation_schema.py`:
- Around line 138-140: Add two tests to tests/tools/test_annotation_schema.py
that verify load_annotation raises the correct exceptions for malformed input:
create test_load_annotation_invalid_json_raises which writes an invalid JSON
string to a temp file and asserts load_annotation(filepath) raises
json.JSONDecodeError, and create test_load_annotation_schema_mismatch_raises
which writes a JSON that doesn't match the schema (e.g., {"foo":"bar"}) and
asserts load_annotation(filepath) raises pydantic.ValidationError; reference the
existing test_load_annotation_missing_file_raises for style and use tmp_path and
load_annotation as in the file.

In `@tests/tools/test_evaluate_detection.py`:
- Around line 32-38: Remove the unnecessary wrapper functions _write_wav and
_write_annotation in tests/tools/test_evaluate_detection.py: these only forward
to the conftest helpers (write_wav, write_annotation), so delete the two
functions and update any test code in the file that calls _write_wav or
_write_annotation to call write_wav and write_annotation directly (ensure the
conftest helpers are imported or available in the test context).

In `@tests/tools/test_optimize_thresholds.py`:
- Around line 29-35: Remove the unnecessary wrapper functions _write_wav and
_write_annotation and update the test to call the conftest helper functions
directly; locate the two wrappers in tests/tools/test_optimize_thresholds.py
(functions named _write_wav and _write_annotation) and delete them, then import
the corresponding helper symbols (write_wav and write_annotation or the actual
helper names from conftest) at the top of the test file and replace any calls to
_write_wav/_write_annotation with direct calls to those imported helpers.
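On the 32-bit WAV finding above: one possible shape for the fix is to let a format-aware reader report the subtype instead of inferring it from sampwidth alone. This is a sketch only; the use of the soundfile library and the name load_wav_as_float64 are assumptions for illustration, not the actual evaluate_detection.py API, and the project may prefer to parse the WAV fmt chunk directly.

import numpy as np
import soundfile as sf


def load_wav_as_float64(path: str) -> tuple[np.ndarray, int]:
    """Return samples as float64 in [-1.0, 1.0] plus the sample rate."""
    info = sf.info(path)
    if info.subtype == "FLOAT":
        # 32-bit IEEE float WAV: read natively, no integer scaling involved.
        data, rate = sf.read(path, dtype="float32")
        return data.astype(np.float64), rate
    # Integer PCM (16/24/32-bit): soundfile scales to [-1.0, 1.0] when a
    # float dtype is requested, so no manual PCM32_SCALE division is needed.
    data, rate = sf.read(path, dtype="float64")
    return data, rate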
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 98b3caf5-d08c-4d6c-b3bd-ec9ac031476c

📥 Commits

Reviewing files that changed from the base of the PR and between c93e9df and baaf8bd.

📒 Files selected for processing (14)
  • daemon/src/daemon/tools/__init__.py
  • daemon/src/daemon/tools/annotation_schema.py
  • daemon/src/daemon/tools/evaluate_detection.py
  • daemon/src/daemon/tools/optimize_thresholds.py
  • daemon/src/daemon/tools/record_evaluation.py
  • data/evaluation/.gitkeep
  • docs/adr/decision-log.md
  • docs/adr/decisions/2026-04-05-address-pr-138-review-findings-for-recording-evaluation-tools-coverage-wav-decoding-ground-truth-mapping-and-tests.md
  • tests/tools/__init__.py
  • tests/tools/conftest.py
  • tests/tools/test_annotation_schema.py
  • tests/tools/test_evaluate_detection.py
  • tests/tools/test_optimize_thresholds.py
  • tests/tools/test_record_evaluation.py

Comment thread daemon/src/daemon/tools/evaluate_detection.py
Comment thread daemon/src/daemon/tools/evaluate_detection.py
Comment thread daemon/src/daemon/tools/optimize_thresholds.py
Comment thread daemon/src/daemon/tools/record_evaluation.py
@ToaruPen
Copy link
Copy Markdown
Owner Author

ToaruPen commented Apr 5, 2026

@coderabbitai full review - all CI checks are passing, conflicts resolved, and all previous review findings have been addressed. Please re-review the latest changes.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 5, 2026

@ToaruPen: Understood — triggering a full re-review of PR #138. I'll evaluate all changes fresh, including the recording evaluation tools, annotation schema, WAV decoding, ground truth mapping, and associated tests.

Re-review requested: all CI checks passing, conflicts resolved, and all previous review findings addressed. Please perform a complete re-review of all changes.


🧠 Learnings used
Learnt from: CR
Repo: ToaruPen/Wooly-Fluffy PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-30T23:28:19.180Z
Learning: Do not dismiss CodeRabbit `CHANGES_REQUESTED` reviews
✅ Actions performed

Full review triggered.

…nd invalid JSON test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ToaruPen
Copy link
Copy Markdown
Owner Author

ToaruPen commented Apr 5, 2026

@coderabbitai review — all 5 duplicate findings addressed: removed wrapper functions, added RecordingAnnotation docstring, added invalid JSON test.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 5, 2026

@ToaruPen: Acknowledged — triggering a review of the latest commits. I'll evaluate the changes covering the removal of wrapper functions, the added RecordingAnnotation docstring, and the new invalid JSON test, along with any other updates in the PR.


🧠 Learnings used
Learnt from: CR
Repo: ToaruPen/Wooly-Fluffy PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-30T23:28:19.180Z
Learning: Do not dismiss CodeRabbit `CHANGES_REQUESTED` reviews
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ToaruPen ToaruPen merged commit 3aec5a4 into main Apr 5, 2026
7 checks passed
@ToaruPen ToaruPen deleted the feat/task-8c-recording-evaluation-v3 branch April 5, 2026 09:25


Development

Successfully merging this pull request may close these issues.

Task 8c: In-Situ Recording Evaluation Protocol

1 participant