Add analysis app #40
Merged
Changes from 14 commits
2af1fb6 Add Streamlit analysis app (gabegma)
3f03701 Add dataset explorer app (gabegma)
cf1dc49 Fix turn_taking metric category (gabegma)
4abeaac Update documentation (gabegma)
732a283 Update README.md (gabegma)
ec44fac Apply suggestions from code review (gabegma)
b58708e Show which metric button is selected (JosephMarinier)
5f61f03 Use dict/list comprehensions and ternary if/else (JosephMarinier)
81ebcc6 Apply pre-commit (JosephMarinier)
3c79496 Use `st.html()` (JosephMarinier)
ca605c1 Replace deprecated `use_container_width=True` with `width="stretch"` (JosephMarinier)
3548294 Remove data app (gabegma)
8addedb Edit README (gabegma)
43873e5 Apply pre-commit (gabegma)
c342926 Apply suggestion from @gabegma (gabegma)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| # EVA Apps | ||
|
|
||
| Streamlit applications for exploring EVA data and results. | ||
|
|
||
| ## Analysis App | ||
|
|
||
| Interactive dashboard for visualizing and comparing results. | ||
|
|
||
| ### Usage | ||
|
|
||
| ```bash | ||
| streamlit run apps/analysis.py | ||
| ``` | ||
|
|
||
| By default, the app looks for runs in the `output/` directory. You can change this in the sidebar or by setting the `EVA_OUTPUT_DIR` environment variable: | ||
|
|
||
| ```bash | ||
| EVA_OUTPUT_DIR=path/to/results streamlit run apps/analysis.py | ||
| ``` | ||
|
|
||
| ### Views | ||
|
|
||
| **Cross-Run Comparison** — Compare aggregate metrics across multiple runs. Filter by model, provider, and pipeline type. Includes an EVA scatter plot (accuracy vs. experience) and per-metric bar charts. | ||
|
|
||
|  | ||
|
|
||
| **Run Overview** — Drill into a single run: per-category metric breakdowns, score distributions, and a full records table with per-metric scores. | ||
|
|
||
|  | ||
|
|
||
| **Record Detail** — Deep-dive into individual conversation records: | ||
| - Audio playback (mixed recording) | ||
| - Transcript with color-coded speaker turns | ||
| - Metric scores with explanations | ||
| - Conversation trace: tool calls, LLM calls, and audit log entries with a timeline view | ||
| - Database state diff (expected vs. actual) | ||
| - User goal, persona, and ground truth from the evaluation record | ||
|
|
||
|  | ||
|
|
||
| ### Sidebar Navigation | ||
|
|
||
| 1. **Output Directory** — Path to the directory containing run folders | ||
| 2. **View** — Switch between the three views above | ||
| 3. **Run Selection** — Pick a run (with metadata summary) | ||
| 4. **Record Selection** — Pick a record within the selected run | ||
| 5. **Trial Selection** — If a record has multiple trials, pick one | ||
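
The README in this diff says the app looks for runs in `output/` unless `EVA_OUTPUT_DIR` is set, and that the directory contains run folders. A minimal sketch of that lookup is given here for illustration. It is not the app's actual code: the helper names `resolve_output_dir` and `list_run_dirs` are hypothetical, and treating every immediate subdirectory as one run is an assumption.

```python
import os
from pathlib import Path


def resolve_output_dir(env=None, default="output"):
    """Return the directory to scan for runs (hypothetical helper)."""
    env = os.environ if env is None else env
    # EVA_OUTPUT_DIR and the output/ default come from the README;
    # an unset or empty variable falls back to the default.
    return Path(env.get("EVA_OUTPUT_DIR") or default)


def list_run_dirs(output_dir):
    """List run folder names, assuming one subdirectory per run."""
    if not output_dir.is_dir():
        return []
    return sorted(p.name for p in output_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    out = resolve_output_dir()
    print(f"Scanning {out} for runs: {list_run_dirs(out)}")
```

Passing the environment as a parameter keeps the helper easy to test. In the app itself the sidebar's Output Directory field can also change the path, which this sketch does not model.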
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| """EVA analysis app.""" |