feat: optional term_filter for run-inference and latest-inference-cohort endpoint #201
- Add GET /institutions/{inst_id}/latest-inference-cohort endpoint returning
LatestInferenceCohortResponse (status, cohort_label, valid_student_count,
batch_name, reason). Always 200; status=invalid with reason when no batch,
missing config, missing cohort_term, or no cohort has course data.
- Define "most recent cohort" as the most recent cohort term (e.g. Fall 2024,
Spring 2025); the ordering is sketched after this list. Require a cohort_term
column; support cohort+cohort_term or entry_year+cohort_term. Use config
preprocessing.selection.cohort_term to filter allowed terms. Loop back to the
next cohort term when the current one has no course data for students.
- Add DatabricksControl.read_volume_training_config to read config.toml from
institution bronze volume and return [preprocessing.selection] for cohort
term filtering and student_criteria.
- Add get_ordered_cohort_terms, filter_to_most_recent_cohort,
_resolve_latest_inference_cohort_from_dataframes, and helpers; integrate
apply_student_criteria_count with course-data restriction.
- Add unit tests for get_ordered_cohort_terms and filter_to_most_recent_cohort;
API tests for latest-inference-cohort (unauthorized, missing cohort/term/config,
valid, loop-back, missing student id, valid_count=0, missing criteria column);
apply_student_criteria_count tests; tighten response shape assertions.
- Remove unused imports (joinedload, LatestInferenceCohortResponse); apply black.
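A minimal sketch of the cohort-term ordering and loop-back described above, assuming term labels of the form "Fall 2024"; the real get_ordered_cohort_terms and filter_to_most_recent_cohort work over dataframes and config-driven allowed terms, so the names and shapes here are illustrative.

```python
# Illustrative only: the label format and season ordering are assumptions.
SEASON_ORDER = {"Spring": 0, "Summer": 1, "Fall": 2}

def get_ordered_cohort_terms(terms: list[str]) -> list[str]:
    """Order cohort terms newest-first, e.g. Spring 2025 before Fall 2024."""
    def sort_key(term: str) -> tuple[int, int]:
        season, year = term.split()
        return (int(year), SEASON_ORDER[season])
    return sorted(set(terms), key=sort_key, reverse=True)

def first_term_with_course_data(terms, has_course_data):
    """Loop back through the terms, newest first, until one has course data."""
    for term in get_ordered_cohort_terms(terms):
        if has_course_data(term):
            return term
    return None  # the caller maps this to status="invalid" with a reason

print(get_ordered_cohort_terms(["Fall 2024", "Spring 2025", "Fall 2023"]))
# ['Spring 2025', 'Fall 2024', 'Fall 2023']
```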
- Add optional cohort: list[str] to InferenceRunRequest and DatabricksInferenceRunRequest
- Validate cohort when provided: reject empty list and empty/whitespace-only labels
- Serialize cohort as JSON in job_parameters when set; omit key when None (backward compatible)
- Log user, cohorts, and job_run_id once in API after successful run
- Fix batch error message to use len(batch_result) instead of len(inst_result)
- Add tests: backward compat (no cohort), single/multiple cohorts, empty list, empty string, whitespace-only
- Centralize ENV_TO_VOLUME_SCHEMA in config; use in front_end_tables and databricks
- Add read_volume_training_config and _parse_config_toml_to_selection for volume TOML
- Refactor _resolve_latest_inference_cohort_from_dataframes into helpers (≤50 lines)
- Improve run_pdp_inference logging (%-style, exception context); guard None response.contents
- Rename res to inference_run_response in trigger_inference_run
- Add tests: get_latest_batch when no batch has both STUDENT and COURSE; read_volume_training_config error paths

Co-authored-by: Cursor <[email protected]>
…ection
- Add optional batch_name query param; resolve batch by name or latest with student+course
- Resolve model from single registered (valid, non-deleted) when model_name omitted
- Validate model exists for institution when model_name is provided
- Read training config from Databricks volumes for cohort selection
- Refactor endpoint into focused helpers; add mypy type-narrowing assertions
- Add tests for batch_name, model/batch validation, and registered-model resolution

Co-authored-by: Cursor <[email protected]>
- Add term_filter to InferenceRunRequest and DatabricksInferenceRunRequest
- Pass term_filter in job_params to Databricks job
- Update models router validation and tests
- Extend databricks tests and remove unused import

Co-authored-by: Cursor <[email protected]>
kaylawilding (Contributor) left a comment:
adding comments but this is on hold
Force-pushed from f138227 to c3233c4.
changes
Run-inference: optional term_filter list (API → Databricks)
API (routers/models.py)
- InferenceRunRequest: added optional term_filter: list[str] | None = None.
- When term_filter is present: require at least one non-empty label; reject an empty list (400) and empty or whitespace-only labels (400), with logging before each raise.
- Build DatabricksInferenceRunRequest with term_filter=req.term_filter and pass it to run_pdp_inference via its term_filter field.
- Fix the batch error message to use len(batch_result) instead of len(inst_result).
- Rename res to inference_run_response in trigger_inference_run for clarity.
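The validation path can be pictured as below, assuming FastAPI's HTTPException and a module-level logger; the helper name and detail strings are illustrative, not the PR's exact code.

```python
import logging

from fastapi import HTTPException

logger = logging.getLogger(__name__)

def _validate_term_filter(term_filter: list[str] | None) -> None:
    """Illustrative validation: None is fine; empty lists and blank labels are 400s."""
    if term_filter is None:
        return  # omitted entirely: backward compatible, no filtering requested
    if not term_filter:
        logger.error("run-inference called with an empty term_filter list")
        raise HTTPException(
            status_code=400, detail="term_filter must contain at least one term"
        )
    for label in term_filter:
        if not label.strip():
            logger.error("run-inference called with a blank term_filter label")
            raise HTTPException(
                status_code=400, detail="term_filter labels must be non-empty"
            )
```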
Databricks (databricks.py)
- DatabricksInferenceRunRequest: added optional term_filter: list[str] | None = None.
- run_pdp_inference(): build a job_params dict; when req.term_filter is not None, set job_params["term_filter"] = json.dumps(req.term_filter). Omit the key when term_filter is None.
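The pass-through, sketched in isolation; a sketch only, since run_pdp_inference builds more than this one key.

```python
import json

def build_job_params(term_filter: list[str] | None) -> dict[str, str]:
    job_params: dict[str, str] = {}  # other job parameters elided
    if term_filter is not None:
        # JSON-encode the list so the Databricks job receives a parseable
        # value rather than Python's str() of a list
        job_params["term_filter"] = json.dumps(term_filter)
    # when term_filter is None the key is absent, so existing jobs are unaffected
    return job_params
```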
Tests (routers/models_test.py)
- No term_filter in request → 200 and the mock is called with term_filter=None.
- Valid term_filter → 200 and the mock receives the correct list.
- Invalid term_filter → 400 with the expected detail messages.
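The real tests exercise the endpoint with a mocked Databricks layer; the lighter sketch below drives only the _validate_term_filter helper sketched in the API section, to show the 400 paths.

```python
import pytest
from fastapi import HTTPException

@pytest.mark.parametrize("bad", [[], [""], ["  "]])
def test_invalid_term_filter_rejected(bad):
    # exercises the _validate_term_filter sketch above, not the PR's test suite
    with pytest.raises(HTTPException) as exc:
        _validate_term_filter(bad)
    assert exc.value.status_code == 400

def test_omitted_term_filter_is_allowed():
    assert _validate_term_filter(None) is None  # no-op: backward compatible
```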
Latest-inference-cohort endpoint and config from Databricks
API (routers/data.py)
- New GET /{inst_id}/latest-inference-cohort returning LatestInferenceCohortResponse (cohort_label, valid_student_count, status, batch_name, reason; a sketch of this shape follows this section). Always returns 200; uses status='invalid' with reason when there is no batch, the config is missing, or no cohort has course data.
- model_name (optional): if omitted, the institution must have exactly one registered model (valid, non-deleted); otherwise invalid. When provided, the model must exist for the institution or the endpoint returns invalid.
- batch_name (optional): when provided (URL-decoded), cohort selection uses that batch; when omitted, uses the most recent batch that has both STUDENT and COURSE files.
- Training config is read via DatabricksControl.read_volume_training_config(inst.name, model_run_id); model_run_id is the latest model version's run ID for the chosen model.
- New helpers: _resolve_model_name_and_run_id_for_inference (resolve model name, validate existence, fetch run ID), _load_batch_and_dataframes_for_inference (resolve batch, load config, read batch files, validate STUDENT/COURSE), get_batch_for_inference, get_batch_by_name_with_student_and_course, _batch_has_student_and_course, plus the existing cohort logic split into _latest_inference_cohort_validation_response, _first_valid_from_ordered_terms, and _resolve_latest_inference_cohort_from_dataframes (≤50 lines per function). Mypy type-narrowing assertions added where optional values are guaranteed after error checks.
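The response shape, sketched as a Pydantic model; the field list comes from the description above, but the optionality and defaults are assumptions.

```python
from pydantic import BaseModel

class LatestInferenceCohortResponse(BaseModel):
    # field set from the PR description; types/defaults are assumptions
    status: str                           # "valid" or "invalid"
    cohort_label: str | None = None       # e.g. "Fall 2024"
    valid_student_count: int | None = None
    batch_name: str | None = None
    reason: str | None = None             # populated when status == "invalid"
```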
Config (config.py)
- Added ENV_TO_VOLUME_SCHEMA = {"DEV": "dev_sst_02", "STAGING": "staging_sst_01"} for Databricks volume paths. Used by databricks.read_volume_training_config and front_end_tables.get_model_cards (replacing a local dict there).
Databricks (databricks.py)
- read_volume_training_config(inst_name, model_run_id): reads from /Volumes/{env_schema}/{slug}_silver/silver_volume/{model_run_id}/, finds any .toml under that path that contains [preprocessing.selection], and returns that section only (or None if missing/unreadable). Uses ENV_TO_VOLUME_SCHEMA; returns None for unsupported ENV (e.g. LOCAL).
- New helpers: _parse_config_toml_to_selection(raw: bytes) for TOML parsing and section extraction, and _find_selection_in_toml_under(w, directory_path, inst_name) to list the directory and scan .toml files. Guard for response.contents is None before calling .read() (mypy-safe).
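A sketch of the parse-and-scan flow using stdlib tomllib and the Databricks SDK files API. The real helpers also take inst_name, resolve the volume path via ENV_TO_VOLUME_SCHEMA and databricksify, and return None for unsupported ENV; all of that is elided here, and the signature below is simplified accordingly.

```python
import tomllib

from databricks.sdk import WorkspaceClient

def _parse_config_toml_to_selection(raw: bytes) -> dict | None:
    """Return only the [preprocessing.selection] table, or None if missing/unparseable."""
    try:
        data = tomllib.loads(raw.decode("utf-8"))
    except (tomllib.TOMLDecodeError, UnicodeDecodeError):
        return None
    return data.get("preprocessing", {}).get("selection")

def _find_selection_in_toml_under(w: WorkspaceClient, directory_path: str) -> dict | None:
    """Scan .toml files in a volume directory for a [preprocessing.selection] section."""
    for entry in w.files.list_directory_contents(directory_path):
        if entry.path is None or not entry.path.endswith(".toml"):
            continue
        response = w.files.download(entry.path)
        if response.contents is None:  # mypy-safe guard before .read()
            continue
        selection = _parse_config_toml_to_selection(response.contents.read())
        if selection is not None:
            return selection
    return None
```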
Front-end (routers/front_end_tables.py)
- get_model_cards now uses ENV_TO_VOLUME_SCHEMA from config instead of a local schema dict.
Tests
- routers/data_test.py: tests for get_latest_inference_cohort (unauthorized; no model_name and no models / models exist but none registered; missing cohort column/term/config; valid; valid with batch_name; model_name not found; batch_name not found; loop-back when no course data for the most recent term; no cohort has course data; missing student id column; zero students meeting criteria; student_criteria references a missing column); for get_latest_batch_with_student_and_course (returns the batch when it has STUDENT and COURSE, returns None when there is no batch or no batch has both); and for apply_student_criteria_count and the cohort term ordering/filtering helpers.
- databricks_test.py: tests for _parse_config_toml_to_selection (valid TOML with the section, invalid/missing section); for read_volume_training_config (empty inst_name, empty model_run_id, ENV not DEV/STAGING, databricksify raises, WorkspaceClient raises, list raises, download raises, TOML missing [preprocessing.selection], success when a .toml is found under the run dir). Added return type -> list[DirectoryEntry] to _one_toml_entry for mypy.
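Illustrative unit tests for the parsing helper sketched above, mirroring the valid/missing/invalid TOML cases listed for databricks_test.py; not the PR's actual test bodies.

```python
def test_parse_config_toml_returns_selection_section():
    raw = b'[preprocessing.selection]\ncohort_term = ["Fall 2024", "Spring 2025"]\n'
    assert _parse_config_toml_to_selection(raw) == {
        "cohort_term": ["Fall 2024", "Spring 2025"]
    }

def test_parse_config_toml_missing_section_returns_none():
    assert _parse_config_toml_to_selection(b"[other]\nvalue = 1\n") is None

def test_parse_config_toml_invalid_toml_returns_none():
    assert _parse_config_toml_to_selection(b"not [valid toml") is None
```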
context
Run-inference term_filter: Callers need to be able to choose which cohorts (e.g. "fall 2024-25") to run inference on. This PR adds an optional term_filter list to the run-inference request and passes it through to the Databricks job so the pipeline can filter by those terms/cohorts. When term_filter is omitted, no term_filter key is sent to Databricks and behavior is unchanged (the pipeline uses its config default).

Latest-inference-cohort: The UI needs a single "latest inference-ready cohort" (label, valid student count, status) for an institution. This endpoint supports optional model_name and batch_name so the UI can show a specific model/batch or rely on defaults (single registered model, latest batch with student and course files). It reads the chosen model's training/preprocessing selection config from Databricks (silver volume, per model run) and returns the first cohort term that has course data and meets the config criteria. Refactors and tests align with project standards (e.g. function length, naming, error-path coverage).

questions