feat: added automated ingestion workflow #113

Merged
vishpillai123 merged 45 commits into develop from Automated_Ingestion_Workflow
Mar 2, 2026
Conversation

@Mesh-ach (Collaborator) commented Feb 19, 2026

changes

  • This workflow ingests PDP data from the NSC website using multiple focused notebooks/scripts (scan → stage → plan → ingest → registry sync), ensuring all bronze volumes stay up to date with the most recent data available on the NSC SFTP.

  • 01_sftp_receive_scan:

    • scans ./receive, computes a stable file_fingerprint, and records status=NEW in staging_sst_02.default.ingestion_manifest
    • downloads/stages only files that are NEW and not already queued, and upserts them into staging_sst_02.default.pending_ingest_queue (idempotent + rerun-safe)
    • exits gracefully when there is no work so downstream workflow tasks can skip.
  • Per-institution ingestion is driven by expanding staged files into institution-specific work items and ingesting each institution's filtered rows using the existing ingestion logic.

  • Added/updated Delta tables to support workflow state + UI:

    • ingestion_manifest table to track file status progression (NEW → DOWNLOADED → BRONZE_WRITTEN / FAILED) and errors
    • pending_ingest_queue to persist staged file paths so downstream notebooks don’t reconnect to SFTP
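The stable file_fingerprint mentioned above can be derived from the SFTP listing metadata, so a rescan of ./receive yields the same ID for an unchanged file and a new ID when the file changes. The sketch below is an assumption for illustration; the actual fingerprint inputs and helper name are not shown in this PR.

```python
import hashlib

def file_fingerprint(path: str, size: int, mtime: int) -> str:
    """Hypothetical sketch: hash the SFTP listing metadata (path, size,
    modification time) into a stable fingerprint. Re-listing an unchanged
    file reproduces the same fingerprint, so the scan step can record
    status=NEW exactly once per distinct file version."""
    payload = f"{path}|{size}|{mtime}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because the fingerprint is deterministic, the MERGE upsert into the manifest stays idempotent across reruns: the same file version always maps to the same key.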

context

We needed a more reliable and operable ingestion pipeline for NSC files that:

  • Removes the manual effort of ingesting PDP files from the NSC SFTP
  • Supports ingesting per institution from the multi-institution PDP files (filter rows by institution_id and ingest using the existing method)
  • Can automatically pull institution metadata from the SST staging APIs

This PR addresses those pain points.

This workflow has been tested and performs as expected. The only remaining step is to convert it into a scheduled workflow that runs periodically (most likely three times daily) to scan the NSC SFTP and ingest any new files.

This workflow does not overwrite any existing files in the bronze volume. It only ingests a file if it is not already present in the bronze volume for the institution whose PDP ID is found in the dataset.
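The no-overwrite guarantee amounts to an existence check before ingesting. A minimal sketch, with a hypothetical helper name and assuming the institution's bronze volume is addressable as a filesystem path:

```python
from pathlib import Path

def should_ingest(bronze_volume: Path, file_name: str) -> bool:
    """Hypothetical sketch: only ingest a file if it is not already
    present in the target institution's bronze volume. Existing bronze
    files are never overwritten."""
    return not (bronze_volume / file_name).exists()
```

In the real workflow the target volume is resolved per institution (via the PDP ID found in the dataset), but the guard itself is just this presence check.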

Relevant tables for verification: staging_sst_02.default.ingestion_manifest and staging_sst_02.default.pending_ingest_queue

questions

Please let me know if you have any questions


Note

Medium Risk
Introduces new SFTP download + Databricks volume/Delta-table write paths and SST API token/lookup logic, so misconfiguration or edge cases could affect ingestion state and bronze outputs despite being largely additive.

Overview
Implements a new NSC SFTP ingestion workflow via Databricks notebooks that (1) scans SFTP and records unseen files in ingestion_manifest, (2) stages NEW files to a UC volume and queues them, (3) expands staged files into per-institution work items, and (4) ingests per-institution slices to the appropriate bronze volume while updating manifest status.
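The manifest status updates these steps drive (NEW → DOWNLOADED → BRONZE_WRITTEN / FAILED, per the table description above) can be modeled as a small state machine. The transition table below is inferred from the description, not taken from the actual implementation:

```python
# Assumed legal transitions for ingestion_manifest.status, inferred from
# the PR description: a file starts NEW, is staged (DOWNLOADED), then is
# either ingested (BRONZE_WRITTEN) or errors out (FAILED) at any step.
ALLOWED_TRANSITIONS = {
    "NEW": {"DOWNLOADED", "FAILED"},
    "DOWNLOADED": {"BRONZE_WRITTEN", "FAILED"},
    "BRONZE_WRITTEN": set(),  # terminal
    "FAILED": set(),          # terminal
}

def advance(current: str, new: str) -> str:
    """Validate a status transition before writing it to the manifest."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Guarding transitions like this keeps reruns safe: a task that crashes mid-flight cannot silently move a file backwards in the progression.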

Adds a new edvise.ingestion module with shared constants and helpers (fingerprinting, idempotent MERGE upserts, staged-download verification, manifest updates), plus new utils.sftp atomic/resumable downloads and utils.databricks schema/volume discovery + institution-name mapping utilities. Updates utils.api_requests with an EdviseAPIClient that caches and refreshes bearer tokens and supports PDP-id institution lookup; adds tests and minor logging improvements, and ignores local notebook configs/tmp files.
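The token caching and refresh behavior described for EdviseAPIClient can be illustrated with a minimal sketch. The class name, constructor parameters, and TTL-based expiry below are assumptions for illustration, not the actual EdviseAPIClient API:

```python
import time

class BearerTokenCache:
    """Hypothetical sketch of bearer-token caching: fetch a token once,
    reuse it until it expires, then refresh transparently on next use."""

    def __init__(self, fetch_token, ttl_seconds: float = 3600.0,
                 clock=time.monotonic):
        self._fetch = fetch_token   # callable returning a fresh token string
        self._ttl = ttl_seconds     # assumed fixed lifetime for the sketch
        self._clock = clock         # injectable for testing
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

Injecting the clock keeps the refresh logic unit-testable without real waits, which matches the PR's addition of tests around the client.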

Written by Cursor Bugbot for commit 25af159.


@vishpillai123 (Collaborator) left a comment


@Mesh-ach can we move this notebook style design to python modules, scripts, and github actions?

@Mesh-ach changed the base branch from main to develop on February 19, 2026 18:59
@Mesh-ach (Collaborator, Author) replied:

@Mesh-ach can we move this notebook style design to python modules, scripts, and github actions?

So I’m trying to move some of the functions and reusable pieces into helper.py. Ultimately, I don’t think we want this to be a GitHub action, given how sensitive and unstable NSC data ingestion is.

I think we should trigger this manually from the Databricks workflow, rather than having it run as a fully automated pipeline from GitHub.

@vishpillai123 (Collaborator) replied:

@Mesh-ach I'm going to hop into this branch and work on converting it into the DAB format; I'll let you know when I'm ready and then we can test together. Using Cursor, I think this is quite doable.

Vishakh Pillai added 7 commits February 24, 2026 17:45
…tils/databricks

- Move both functions and helper functions from api_requests.py to databricks.py
- Update all imports across codebase (tests, notebooks, api_requests.py)
- Functions are now in their logical location (databricks utilities)
- Maintains backward compatibility by updating all call sites
- Create new test file tests/utils/test_databricks.py
- Move TestDatabricksifyInstName and TestReverseDatabricksifyInstName from test_api_requests.py
- Tests are now organized with the module they test (databricks.py)
@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

name = name.replace(old, new)

# Replace spaces with underscores
final_name = name.replace(" ", "_")

Special character replacement order breaks abbreviation matching

High Severity

databricksify_inst_name applies abbreviation replacements before special-character normalization. This means & in institution names is never converted before abbreviation matching occurs. For example, "University of Science & Technology" won't match the pattern "of science and technology" → "st", producing "uni_of_science_technology" instead of the docstring's claimed "uni_of_st_technology". Since this function is used in the bronze ingestion notebook to resolve schema names via find_bronze_schema, any institution whose API-returned name uses & instead of and will fail to locate its schema and error out.
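The fix is an ordering change: normalize special characters first, then apply abbreviation replacements. The sketch below is illustrative only; the real abbreviation table lives in utils/databricks and isn't shown in this PR, so the single mapping here is chosen merely so the docstring's example holds.

```python
import re

# Illustrative abbreviation table (assumed); the actual mapping in
# utils/databricks may differ.
ABBREVIATIONS = {"science and": "st"}

def databricksify_inst_name(name: str) -> str:
    """Sketch of the corrected order: special-character normalization
    (e.g. "&" -> "and") happens BEFORE abbreviation matching, so names
    spelled with "&" still match the abbreviation patterns."""
    name = name.lower().replace("&", "and")
    name = re.sub(r"\s+", " ", name).strip()  # collapse doubled spaces
    for old, new in ABBREVIATIONS.items():
        name = name.replace(old, new)
    name = name.replace("university", "uni")
    return name.replace(" ", "_")
```

With this ordering, "University of Science & Technology" and "University of Science and Technology" both resolve to the same schema name, so find_bronze_schema sees one canonical form regardless of how the API spells the institution.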


@vishpillai123 merged commit f8d21e4 into develop Mar 2, 2026
7 checks passed
@vishpillai123 deleted the Automated_Ingestion_Workflow branch on March 2, 2026 15:30