feat: added automated ingestion workflow #113

Merged
vishpillai123 merged 45 commits into develop from Automated_Ingestion_Workflow
Mar 2, 2026
Conversation

@Mesh-ach (Collaborator) commented Feb 19, 2026

changes

  • This workflow ingests PDP data from the NSC website using multiple focused notebooks/scripts (scan → stage → plan → ingest → registry sync), ensuring all bronze volumes stay up to date with the most recent data available on the NSC SFTP.

  • 01_sftp_receive_scan:

    • scans ./receive, computes a stable file_fingerprint, and records status=NEW in staging_sst_02.default.ingestion_manifest
    • downloads/stages only files that are NEW and not already queued, and upserts them into staging_sst_02.default.pending_ingest_queue (idempotent + rerun-safe)
    • exits gracefully when there is no work so downstream workflow tasks can skip.
  • Per-institution ingestion is driven by expanding staged files into institution-specific work items and ingesting each institution's filtered rows using the existing ingestion logic.

  • Added/updated Delta tables to support workflow state + UI:

    • ingestion_manifest table to track file status progression (NEW → DOWNLOADED → BRONZE_WRITTEN / FAILED) and errors
    • pending_ingest_queue to persist staged file paths so downstream notebooks don’t reconnect to SFTP
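The stable file_fingerprint mentioned above can be derived from the SFTP listing metadata, so a rescan of ./receive yields the same ID for an unchanged file and a new ID when the file changes. The sketch below is an assumption for illustration; the actual fingerprint inputs and helper name are not shown in this PR.

```python
import hashlib

def file_fingerprint(path: str, size: int, mtime: int) -> str:
    """Hypothetical sketch: hash the SFTP listing metadata (path, size,
    modification time) into a stable fingerprint. Re-listing an unchanged
    file reproduces the same fingerprint, so the scan step can record
    status=NEW exactly once per distinct file version."""
    payload = f"{path}|{size}|{mtime}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because the fingerprint is deterministic, the MERGE upsert into the manifest stays idempotent across reruns: the same file version always maps to the same key.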

context

We needed a more reliable and operable ingestion pipeline for NSC files that:

  • Removes the manual effort of ingesting PDP files from the NSC SFTP
  • Supports ingesting per institution from the multi-institution PDP files (filter rows by institution_id and ingest using the existing method)
  • Can automatically pull institution metadata from the SST staging APIs

This PR addresses those pain points.

This workflow has been tested and performs as expected. The only remaining step is to convert it into a scheduled workflow that runs periodically (most likely three times daily) to scan the NSC SFTP and ingest any new files.

This workflow does not overwrite any existing files in the bronze volume. It only ingests a file if it is not already present in the bronze volume for the institution whose PDP ID is found in the dataset.
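The no-overwrite guarantee amounts to an existence check before ingesting. A minimal sketch, with a hypothetical helper name and assuming the institution's bronze volume is addressable as a filesystem path:

```python
from pathlib import Path

def should_ingest(bronze_volume: Path, file_name: str) -> bool:
    """Hypothetical sketch: only ingest a file if it is not already
    present in the target institution's bronze volume. Existing bronze
    files are never overwritten."""
    return not (bronze_volume / file_name).exists()
```

In the real workflow the target volume is resolved per institution (via the PDP ID found in the dataset), but the guard itself is just this presence check.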

Relevant tables for verification: staging_sst_02.default.ingestion_manifest and staging_sst_02.default.pending_ingest_queue

questions

Please let me know if you have any questions


Note

Medium Risk
Introduces new SFTP download + Databricks volume/Delta-table write paths and SST API token/lookup logic, so misconfiguration or edge cases could affect ingestion state and bronze outputs despite being largely additive.

Overview
Implements a new NSC SFTP ingestion workflow via Databricks notebooks that (1) scans SFTP and records unseen files in ingestion_manifest, (2) stages NEW files to a UC volume and queues them, (3) expands staged files into per-institution work items, and (4) ingests per-institution slices to the appropriate bronze volume while updating manifest status.
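The manifest status updates these steps drive (NEW → DOWNLOADED → BRONZE_WRITTEN / FAILED, per the table description above) can be modeled as a small state machine. The transition table below is inferred from the description, not taken from the actual implementation:

```python
# Assumed legal transitions for ingestion_manifest.status, inferred from
# the PR description: a file starts NEW, is staged (DOWNLOADED), then is
# either ingested (BRONZE_WRITTEN) or errors out (FAILED) at any step.
ALLOWED_TRANSITIONS = {
    "NEW": {"DOWNLOADED", "FAILED"},
    "DOWNLOADED": {"BRONZE_WRITTEN", "FAILED"},
    "BRONZE_WRITTEN": set(),  # terminal
    "FAILED": set(),          # terminal
}

def advance(current: str, new: str) -> str:
    """Validate a status transition before writing it to the manifest."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Guarding transitions like this keeps reruns safe: a task that crashes mid-flight cannot silently move a file backwards in the progression.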

Adds a new edvise.ingestion module with shared constants and helpers (fingerprinting, idempotent MERGE upserts, staged-download verification, manifest updates), plus new utils.sftp atomic/resumable downloads and utils.databricks schema/volume discovery + institution-name mapping utilities. Updates utils.api_requests with an EdviseAPIClient that caches and refreshes bearer tokens and supports PDP-id institution lookup; adds tests and minor logging improvements, and ignores local notebook configs/tmp files.
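The token caching and refresh behavior described for EdviseAPIClient can be illustrated with a minimal sketch. The class name, constructor parameters, and TTL-based expiry below are assumptions for illustration, not the actual EdviseAPIClient API:

```python
import time

class BearerTokenCache:
    """Hypothetical sketch of bearer-token caching: fetch a token once,
    reuse it until it expires, then refresh transparently on next use."""

    def __init__(self, fetch_token, ttl_seconds: float = 3600.0,
                 clock=time.monotonic):
        self._fetch = fetch_token   # callable returning a fresh token string
        self._ttl = ttl_seconds     # assumed fixed lifetime for the sketch
        self._clock = clock         # injectable for testing
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

Injecting the clock keeps the refresh logic unit-testable without real waits, which matches the PR's addition of tests around the client.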

Written by Cursor Bugbot for commit 25af159.


@vishpillai123 (Collaborator) left a comment


@Mesh-ach can we move this notebook style design to python modules, scripts, and github actions?

@Mesh-ach changed the base branch from main to develop on February 19, 2026 18:59
@Mesh-ach (Collaborator, Author) replied:

@Mesh-ach can we move this notebook style design to python modules, scripts, and github actions?

So I’m trying to move some of the functions and reusable pieces into helper.py. Ultimately, I don’t think we want this to be a GitHub action, given how sensitive and unstable NSC data ingestion is.

I think we should trigger this manually from the Databricks workflow, rather than having it run as a fully automated pipeline from GitHub.

@vishpillai123 (Collaborator) replied:

@Mesh-ach I'm going to hop into this branch and work on converting it into the DAB format; I'll let you know when I'm ready and then we can test together. Using Cursor, I think this is quite doable.

Vishakh Pillai added 7 commits February 24, 2026 17:45
…tils/databricks

- Move both functions and helper functions from api_requests.py to databricks.py
- Update all imports across codebase (tests, notebooks, api_requests.py)
- Functions are now in their logical location (databricks utilities)
- Maintains backward compatibility by updating all call sites
- Create new test file tests/utils/test_databricks.py
- Move TestDatabricksifyInstName and TestReverseDatabricksifyInstName from test_api_requests.py
- Tests are now organized with the module they test (databricks.py)
@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

name = name.replace(old, new)

# Replace spaces with underscores
final_name = name.replace(" ", "_")

Special character replacement order breaks abbreviation matching

High Severity

databricksify_inst_name applies abbreviation replacements before special-character normalization. This means & in institution names is never converted before abbreviation matching occurs. For example, "University of Science & Technology" won't match the pattern "of science and technology" → "st", producing "uni_of_science_technology" instead of the docstring's claimed "uni_of_st_technology". Since this function is used in the bronze ingestion notebook to resolve schema names via find_bronze_schema, any institution whose API-returned name uses & instead of and will fail to locate its schema and error out.
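The fix is an ordering change: normalize special characters first, then apply abbreviation replacements. The sketch below is illustrative only; the real abbreviation table lives in utils/databricks and isn't shown in this PR, so the single mapping here is chosen merely so the docstring's example holds.

```python
import re

# Illustrative abbreviation table (assumed); the actual mapping in
# utils/databricks may differ.
ABBREVIATIONS = {"science and": "st"}

def databricksify_inst_name(name: str) -> str:
    """Sketch of the corrected order: special-character normalization
    (e.g. "&" -> "and") happens BEFORE abbreviation matching, so names
    spelled with "&" still match the abbreviation patterns."""
    name = name.lower().replace("&", "and")
    name = re.sub(r"\s+", " ", name).strip()  # collapse doubled spaces
    for old, new in ABBREVIATIONS.items():
        name = name.replace(old, new)
    name = name.replace("university", "uni")
    return name.replace(" ", "_")
```

With this ordering, "University of Science & Technology" and "University of Science and Technology" both resolve to the same schema name, so find_bronze_schema sees one canonical form regardless of how the API spells the institution.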


@vishpillai123 merged commit f8d21e4 into develop Mar 2, 2026
7 checks passed
@vishpillai123 deleted the Automated_Ingestion_Workflow branch on March 2, 2026 15:30