feat: webapp file ingestion changes#206

Open
Mesh-ach wants to merge 13 commits into develop from WebappFileIngestionChanges

Conversation


@Mesh-ach Mesh-ach commented Feb 19, 2026

Adds functionality to list available bronze datasets and upload selected CSVs into the GCS bucket, removing the need for manual uploads.

Changes

  • Added GET /institutions/{inst_id}/input/bronze-datasets to list available .csv datasets in the institution’s Databricks bronze volume root (no subfolders), filtered by VALID_BRONZE_FILE_RE.
  • Added POST /institutions/{inst_id}/input/import-from-bronze to import a selected bronze dataset into the institution’s GCS bucket under unvalidated/ using the existing signed-upload URL flow (same overwrite checks as manual uploads).
  • Added unit tests covering the new endpoints and mocked Databricks/GCS interactions.
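The listing behaviour described above can be sketched as a small filter. This is a minimal illustration, not the PR's implementation: the real `VALID_BRONZE_FILE_RE` pattern lives in the backend and is assumed here, and only top-level `.csv` entries are kept, matching the "volume root (no subfolders)" rule.

```python
import re

# Assumed pattern -- the real VALID_BRONZE_FILE_RE is defined in the backend;
# here we accept top-level file names made of safe characters ending in .csv.
VALID_BRONZE_FILE_RE = re.compile(r"^[\w\-. ]+\.csv$", re.IGNORECASE)


def list_bronze_datasets(volume_entries: list[str]) -> list[str]:
    """Filter a bronze-volume directory listing down to valid CSV datasets.

    Entries containing "/" live in subfolders and are skipped, mirroring the
    "volume root (no subfolders)" behaviour of the endpoint.
    """
    return sorted(
        name
        for name in volume_entries
        if "/" not in name and VALID_BRONZE_FILE_RE.match(name)
    )
```

For example, `list_bronze_datasets(["cohort.csv", "raw/course.csv", "notes.txt"])` keeps only `cohort.csv`.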

Context

  • Today users must upload input datasets from the UI into GCS before they can run inference.
  • This change enables a “dataset select” workflow where datasets already present in Databricks bronze can be selected and pulled into the existing GCS validation/batching/inference pipeline, eliminating manual uploads while keeping downstream behaviour unchanged.

Questions

No questions at this time


@Mesh-ach Mesh-ach requested a review from chapmanhk February 19, 2026 18:15
@Mesh-ach Mesh-ach marked this pull request as ready for review February 19, 2026 18:19
"/{inst_id}/input/upload-from-volume-to-gcs-bucket",
response_model=BronzeImportResponse,
)
def upload_from_volume_to_gcs_bucket(
@vishpillai123 vishpillai123 Feb 24, 2026

So the frontend flow will be: the FE first lists available datasets through "/{inst_id}/input/bronze-datasets", then the user selects a CSV and clicks upload or something (which then makes a call to "/{inst_id}/input/upload-from-volume-to-gcs-bucket"), and this creates an unvalidated batch? Then we proceed with validation to create a batch, correct?

@Mesh-ach Mesh-ach Mar 9, 2026

Exactly. Frontend would implement something like a dropdown to select the course file and another to select a cohort file. Then, when you click upload, it calls the endpoint /{inst_id}/input/upload-from-volume-to-gcs-bucket to pull both files into the GCS bucket as unvalidated.

storage_control: Annotated[StorageControl, Depends(StorageControl)],
databricks_control: Annotated[DatabricksControl, Depends(DatabricksControl)],
) -> Any:
"""Import a selected dataset from the institution's bronze volume into GCS unvalidated/."""
Contributor

Can a user select multiple datasets? For example a cohort and a course file?

Collaborator Author

Yes, that’s something the frontend would need to handle. Essentially, it would just be calling the endpoint multiple times.
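The "calling the endpoint multiple times" approach can be sketched as a helper that builds one request per selected file. The endpoint path matches the one discussed in this thread; the JSON key `file_name` is a hypothetical payload shape used for illustration, not the confirmed API contract.

```python
def build_import_requests(inst_id: str, selected_files: list[str]) -> list[dict]:
    """Build one POST request spec per selected bronze dataset.

    The frontend would issue these one after another, e.g. once for the
    cohort file and once for the course file. The "file_name" key is a
    hypothetical payload field used here for illustration.
    """
    url = f"/institutions/{inst_id}/input/upload-from-volume-to-gcs-bucket"
    return [
        {"method": "POST", "url": url, "json": {"file_name": name}}
        for name in selected_files
    ]
```

Selecting a cohort file and a course file would then produce two request specs against the same endpoint.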

@vishpillai123 vishpillai123 left a comment

Just to confirm that I understand the process -

For PDP, we will be ingesting via SFTP in Databricks. Then, we will be uploading files from Databricks bronze into GCS unvalidated?

@vishpillai123 vishpillai123 marked this pull request as draft March 2, 2026 22:17
@vishpillai123

Keeping this out of develop until our next merge! Tracking it by leaving it as a draft.

@Mesh-ach

Mesh-ach commented Mar 9, 2026

Just to confirm that I understand the process -

For PDP, we will be ingesting via SFTP in Databricks. Then, we will be uploading files from Databricks bronze into GCS unvalidated?

Sorry Vish, I somehow missed this comment. Yes, that’s the process. We do not want to pull files directly from the SFTP because it is NSC’s SFTP, so it is outside our control, and files do not persist there. The idea is to keep Databricks as our source of truth and pull files from there.

@Mesh-ach Mesh-ach marked this pull request as ready for review March 30, 2026 15:04
@Mesh-ach

All endpoints here have been tested successfully. The idea is to add a dropdown to the UI and use https://dev-sst.datakind.org/api/v1/institutions/{inst_id}/input/bronze-datasets to retrieve datasets from the bronze volume. Currently, the endpoint only supports retrieving PDP files, so the dropdown is filtered to ensure that only PDP files appear.

The second endpoint, https://dev-sst.datakind.org/api/v1/institutions/{inst_id}/input/upload-from-volume-to-gcs-bucket, uploads the selected files directly to the GCS bucket, placing them in the unvalidated folder by default.

These endpoints will be used by the frontend, so no further backend adjustments are needed.
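Putting the two endpoints together, the end-to-end flow might look like the sketch below. The URLs come from the comment above; the response and payload shapes (a "datasets" list, a "file_name" key) are assumptions for illustration, and `session` can be any requests-style HTTP client.

```python
API_BASE = "https://dev-sst.datakind.org/api/v1"


def import_bronze_dataset(session, inst_id: str, file_name: str) -> dict:
    """List bronze datasets, then import the chosen one into GCS unvalidated/.

    `session` is any object with requests-style .get/.post methods. The
    "datasets" key and "file_name" payload are hypothetical shapes, not the
    confirmed API contract.
    """
    # Step 1: fetch the available bronze datasets for this institution.
    listing = session.get(
        f"{API_BASE}/institutions/{inst_id}/input/bronze-datasets"
    ).json()
    if file_name not in listing.get("datasets", []):
        raise ValueError(f"{file_name!r} is not an available bronze dataset")
    # Step 2: ask the backend to pull the file into the GCS unvalidated/ folder.
    resp = session.post(
        f"{API_BASE}/institutions/{inst_id}/input/upload-from-volume-to-gcs-bucket",
        json={"file_name": file_name},
    )
    return resp.json()
```

From here the file sits in unvalidated/ and flows through the existing validation/batching/inference pipeline unchanged.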

@vishpillai123 vishpillai123 changed the title Webapp file ingestion changes feat: webapp file ingestion changes Mar 31, 2026