feat: webapp file ingestion changes#206

Open
Mesh-ach wants to merge 13 commits into develop from WebappFileIngestionChanges

Conversation


@Mesh-ach Mesh-ach commented Feb 19, 2026

Adds functionality to list available bronze datasets and upload selected CSVs into the GCS bucket, removing the need for manual uploads.

Changes

  • Added GET /institutions/{inst_id}/input/bronze-datasets to list available .csv datasets in the institution’s Databricks bronze volume root (no subfolders), filtered by VALID_BRONZE_FILE_RE.
  • Added POST /institutions/{inst_id}/input/import-from-bronze to import a selected bronze dataset into the institution’s GCS bucket under unvalidated/ using the existing signed-upload URL flow (same overwrite checks as manual uploads).
  • Added unit tests covering the new endpoints and mocked Databricks/GCS interactions.
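The listing behaviour described above can be sketched as a small filter. This is a minimal illustration, not the PR's implementation: the real `VALID_BRONZE_FILE_RE` pattern lives in the backend and is assumed here, and only top-level `.csv` entries are kept, matching the "volume root (no subfolders)" rule.

```python
import re

# Assumed pattern -- the real VALID_BRONZE_FILE_RE is defined in the backend;
# here we accept top-level file names made of safe characters ending in .csv.
VALID_BRONZE_FILE_RE = re.compile(r"^[\w\-. ]+\.csv$", re.IGNORECASE)


def list_bronze_datasets(volume_entries: list[str]) -> list[str]:
    """Filter a bronze-volume directory listing down to valid CSV datasets.

    Entries containing "/" live in subfolders and are skipped, mirroring the
    "volume root (no subfolders)" behaviour of the endpoint.
    """
    return sorted(
        name
        for name in volume_entries
        if "/" not in name and VALID_BRONZE_FILE_RE.match(name)
    )
```

For example, `list_bronze_datasets(["cohort.csv", "raw/course.csv", "notes.txt"])` keeps only `cohort.csv`.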

Context

  • Today users must upload input datasets from the UI into GCS before they can run inference.
  • This change enables a “dataset select” workflow where datasets already present in Databricks bronze can be selected and pulled into the existing GCS validation/batching/inference pipeline, eliminating manual uploads while keeping downstream behaviour unchanged.

Questions

No questions at this time


@Mesh-ach Mesh-ach requested a review from chapmanhk February 19, 2026 18:15
@Mesh-ach Mesh-ach marked this pull request as ready for review February 19, 2026 18:19
"/{inst_id}/input/upload-from-volume-to-gcs-bucket",
response_model=BronzeImportResponse,
)
def upload_from_volume_to_gcs_bucket(
@vishpillai123 vishpillai123 Feb 24, 2026

So the frontend flow will be: the FE first lists available datasets through "/{inst_id}/input/bronze-datasets", then the user selects a CSV and clicks upload or something (which then makes a call to "/{inst_id}/input/upload-from-volume-to-gcs-bucket"), and this creates an unvalidated batch? Then we proceed with validation to create a batch, correct?

@Mesh-ach Mesh-ach Mar 9, 2026

Exactly. Frontend would implement something like a dropdown to select the course file and another to select a cohort file. Then, when you click upload, it calls the endpoint /{inst_id}/input/upload-from-volume-to-gcs-bucket to pull both files into the GCS bucket as unvalidated.

storage_control: Annotated[StorageControl, Depends(StorageControl)],
databricks_control: Annotated[DatabricksControl, Depends(DatabricksControl)],
) -> Any:
"""Import a selected dataset from the institution's bronze volume into GCS unvalidated/."""
Contributor

Can a user select multiple datasets? For example a cohort and a course file?

Collaborator Author

Yes, that’s something the frontend would need to handle. Essentially, it would just be calling the endpoint multiple times.
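The "calling the endpoint multiple times" approach can be sketched as a helper that builds one request per selected file. The endpoint path matches the one discussed in this thread; the JSON key `file_name` is a hypothetical payload shape used for illustration, not the confirmed API contract.

```python
def build_import_requests(inst_id: str, selected_files: list[str]) -> list[dict]:
    """Build one POST request spec per selected bronze dataset.

    The frontend would issue these one after another, e.g. once for the
    cohort file and once for the course file. The "file_name" key is a
    hypothetical payload field used here for illustration.
    """
    url = f"/institutions/{inst_id}/input/upload-from-volume-to-gcs-bucket"
    return [
        {"method": "POST", "url": url, "json": {"file_name": name}}
        for name in selected_files
    ]
```

Selecting a cohort file and a course file would then produce two request specs against the same endpoint.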

@vishpillai123 vishpillai123 left a comment

Just to confirm that I understand the process -

For PDP, we will be ingesting via SFTP in Databricks. Then, we will be uploading files from Databricks bronze into GCS unvalidated?

@vishpillai123 vishpillai123 marked this pull request as draft March 2, 2026 22:17
@vishpillai123

Keeping this out of develop until our next merge! Tracking it by leaving it as a draft.

@Mesh-ach

Mesh-ach commented Mar 9, 2026

Just to confirm that I understand the process -

For PDP, we will be ingesting via SFTP in Databricks. Then, we will be uploading files from Databricks bronze into GCS unvalidated?

Sorry Vish, I somehow missed this comment. Yes, that’s the process. We do not want to pull files directly from the SFTP because it is NSC’s SFTP, so it is outside our control, and files do not persist there. The idea is to keep Databricks as our source of truth and pull files from there.

@Mesh-ach Mesh-ach marked this pull request as ready for review March 30, 2026 15:04
@Mesh-ach

All endpoints here have been tested successfully. The idea is to add a dropdown to the UI and use https://dev-sst.datakind.org/api/v1/institutions/{inst_id}/input/bronze-datasets to retrieve datasets from the bronze volume. Currently, the endpoint only supports retrieving PDP files, so the dropdown is filtered to ensure that only PDP files appear.

The second endpoint, https://dev-sst.datakind.org/api/v1/institutions/{inst_id}/input/upload-from-volume-to-gcs-bucket, uploads the selected files directly to the GCS bucket, placing them in the unvalidated folder by default.

These endpoints will be used by the frontend, so no further backend adjustments are needed.
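Putting the two endpoints together, the end-to-end flow might look like the sketch below. The URLs come from the comment above; the response and payload shapes (a "datasets" list, a "file_name" key) are assumptions for illustration, and `session` can be any requests-style HTTP client.

```python
API_BASE = "https://dev-sst.datakind.org/api/v1"


def import_bronze_dataset(session, inst_id: str, file_name: str) -> dict:
    """List bronze datasets, then import the chosen one into GCS unvalidated/.

    `session` is any object with requests-style .get/.post methods. The
    "datasets" key and "file_name" payload are hypothetical shapes, not the
    confirmed API contract.
    """
    # Step 1: fetch the available bronze datasets for this institution.
    listing = session.get(
        f"{API_BASE}/institutions/{inst_id}/input/bronze-datasets"
    ).json()
    if file_name not in listing.get("datasets", []):
        raise ValueError(f"{file_name!r} is not an available bronze dataset")
    # Step 2: ask the backend to pull the file into the GCS unvalidated/ folder.
    resp = session.post(
        f"{API_BASE}/institutions/{inst_id}/input/upload-from-volume-to-gcs-bucket",
        json={"file_name": file_name},
    )
    return resp.json()
```

From here the file sits in unvalidated/ and flows through the existing validation/batching/inference pipeline unchanged.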

@vishpillai123 vishpillai123 changed the title Webapp file ingestion changes feat: webapp file ingestion changes Mar 31, 2026