feat: added automated ingestion workflow #113
Conversation
vishpillai123 left a comment
@Mesh-ach can we move this notebook style design to python modules, scripts, and github actions?
So I’m trying to move some of the functions and reusable pieces into helper.py. Ultimately, I don’t think we want this to be a GitHub Action, given how sensitive and unstable NSC data ingestion is. I think we should trigger this manually from the Databricks workflow rather than have it run automatically as a fully automated pipeline from GitHub.
@Mesh-ach I'm going to hop into this branch and work on converting it into the DAB format; I'll let you know when I'm ready and then we can test together. Using Cursor, I think this is quite doable.
…tils/databricks

- Move both functions and their helper functions from api_requests.py to databricks.py
- Update all imports across the codebase (tests, notebooks, api_requests.py)
- Functions now live in their logical location (Databricks utilities)
- Maintain backward compatibility by updating all call sites
- Create new test file tests/utils/test_databricks.py
- Move TestDatabricksifyInstName and TestReverseDatabricksifyInstName from test_api_requests.py
- Tests are now organized with the module they test (databricks.py)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
name = name.replace(old, new)

# Replace spaces with underscores
final_name = name.replace(" ", "_")
Special character replacement order breaks abbreviation matching
High Severity
databricksify_inst_name applies abbreviation replacements before special character normalization. This means & in institution names is never converted before abbreviation matching occurs. For example, "University of Science & Technology" won't match the pattern "of science and technology" → "st", producing "uni_of_science_technology" instead of the docstring's claimed "uni_of_st_technology". Since this function is used in the bronze ingestion notebook to resolve schema names via find_bronze_schema, any institution whose API-returned name uses & instead of and will fail to locate its schema and error out.
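A minimal sketch of the fix the report implies, assuming hypothetical replacement tables (the real `databricksify_inst_name` in `utils.databricks` has its own lists): run special-character normalization before abbreviation matching, so `&` has already become `and` when the abbreviation patterns are applied.

```python
def databricksify_inst_name(name: str) -> str:
    """Sketch of the corrected ordering; replacement tables are illustrative."""
    name = name.lower()
    # 1) Normalize special characters FIRST, so "&" becomes "and"
    #    before any abbreviation pattern is matched.
    for old, new in [("&", "and"), ("-", " "), (".", "")]:
        name = name.replace(old, new)
    # 2) Abbreviation replacements now see the normalized text.
    for old, new in [("science and technology", "st technology"), ("university", "uni")]:
        name = name.replace(old, new)
    # 3) Replace spaces with underscores.
    return name.replace(" ", "_")
```

With this ordering, "University of Science & Technology" normalizes to "university of science and technology" before abbreviation matching, yielding the docstring's claimed "uni_of_st_technology".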


changes
This workflow ingests PDP data from the NSC SFTP using multiple focused notebooks/scripts (scan → stage → plan → ingest → registry sync), ensuring all bronze volumes stay up to date with the most recent data available on the NSC SFTP.
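The stage sequence can be sketched as a small manifest state machine; the status names and transitions below are assumptions for illustration, not the actual `ingestion_manifest` values:

```python
from enum import Enum

class ManifestStatus(str, Enum):
    # Hypothetical status values; the real ingestion_manifest may differ.
    NEW = "NEW"
    STAGED = "STAGED"
    PLANNED = "PLANNED"
    INGESTED = "INGESTED"
    FAILED = "FAILED"

# One forward transition per workflow stage (scan -> stage -> plan -> ingest);
# any stage may also mark a file FAILED.
TRANSITIONS = {
    ManifestStatus.NEW: {ManifestStatus.STAGED, ManifestStatus.FAILED},
    ManifestStatus.STAGED: {ManifestStatus.PLANNED, ManifestStatus.FAILED},
    ManifestStatus.PLANNED: {ManifestStatus.INGESTED, ManifestStatus.FAILED},
}

def advance(current: ManifestStatus, target: ManifestStatus) -> ManifestStatus:
    """Reject illegal jumps so reruns of a stage stay idempotent."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the transitions explicitly makes a rerun of any notebook a no-op for files already past that stage, rather than a duplicate write.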
01_sftp_receive_scan:
Per-institution ingestion works by expanding staged files into institution-specific work items and ingesting the filtered rows for each institution using the existing ingestion logic.
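A hedged sketch of the expansion step, assuming each row carries a `pdp_id` field (the actual work-item schema in `02_file_institution_expand` may differ):

```python
from collections import defaultdict

def expand_to_work_items(rows, pdp_id_field="pdp_id"):
    """Group dataset rows into per-institution work items.

    Illustrative only: the field name and the work-item shape are
    assumptions, not the notebook's actual schema.
    """
    items = defaultdict(list)
    for row in rows:
        # Each institution gets its own slice of the staged file.
        items[row[pdp_id_field]].append(row)
    return dict(items)
```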
Added/updated Delta tables to support workflow state + UI:
context
We needed a more reliable and operable ingestion pipeline for NSC files that:
This PR addresses current pain points with:
This workflow has been tested and performs as expected. The only remaining step is to convert it into a scheduled workflow that runs periodically (most likely three times daily) to scan the NSC SFTP and ingest any new files.
This workflow does not overwrite any existing files in the bronze volume. It only ingests a file if it is not already present in the bronze volume for the institution whose PDP ID is found in the dataset.
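The skip-if-present behavior can be sketched with local paths (illustrative only; the real workflow writes to a Unity Catalog volume and keys on the institution's PDP ID):

```python
import shutil
from pathlib import Path

def ingest_if_absent(src: Path, bronze_dir: Path) -> bool:
    """Copy src into bronze_dir only if no file of that name is present.

    Sketch with local paths; the actual workflow targets a Databricks
    UC volume, but the never-overwrite rule is the same.
    """
    dest = bronze_dir / src.name
    if dest.exists():
        return False  # already ingested: never overwrite
    bronze_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return True
```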
Relevant tables for verification
questions
Please let me know if you have any questions.
Note
Medium Risk
Introduces new SFTP download + Databricks volume/Delta-table write paths and SST API token/lookup logic, so misconfiguration or edge cases could affect ingestion state and bronze outputs despite being largely additive.
Overview
Implements a new NSC SFTP ingestion workflow via Databricks notebooks that (1) scans SFTP and records unseen files in `ingestion_manifest`, (2) stages NEW files to a UC volume and queues them, (3) expands staged files into per-institution work items, and (4) ingests per-institution slices to the appropriate bronze volume while updating manifest status.

Adds a new `edvise.ingestion` module with shared constants and helpers (fingerprinting, idempotent `MERGE` upserts, staged-download verification, manifest updates), plus new `utils.sftp` atomic/resumable downloads and `utils.databricks` schema/volume discovery + institution-name mapping utilities. Updates `utils.api_requests` with an `EdviseAPIClient` that caches and refreshes bearer tokens and supports PDP-id institution lookup; adds tests and minor logging improvements, and ignores local notebook configs/tmp files.

Written by Cursor Bugbot for commit 25af159. This will update automatically on new commits.
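One way the idempotent `MERGE` upsert could look, sketched as a SQL-string builder; the table, view, and column names here are assumptions, not the module's actual schema:

```python
def build_manifest_merge_sql(table: str, updates_view: str,
                             key: str = "file_fingerprint") -> str:
    """Build an idempotent Delta MERGE keyed on a file fingerprint.

    Illustrative sketch: re-running with the same source rows updates
    matched rows in place instead of inserting duplicates.
    """
    return (
        f"MERGE INTO {table} AS t\n"
        f"USING {updates_view} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )
```

In a notebook this string would be passed to `spark.sql(...)`; keying the merge on a content fingerprint rather than a filename is what makes re-scans safe.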