Skip to content

Prerequisites for Digitized Theses workflow#225

Merged
jonavellecuerdo merged 7 commits intomainfrom
IN-1626-digitized-theses-prereqs
Apr 24, 2026
Merged

Prerequisites for Digitized Theses workflow#225
jonavellecuerdo merged 7 commits intomainfrom
IN-1626-digitized-theses-prereqs

Conversation

@jonavellecuerdo
Copy link
Copy Markdown
Contributor

@jonavellecuerdo jonavellecuerdo commented Apr 3, 2026

Purpose and background context

This PR comprises of a series of updates to the DSC app in preparation for the Digitized Theses workflow. At high level, these updates are:

How can a reviewer manually see the effects of these changes?

Full DSC OpenCourseWare workflow run:

  1. create (step function execution, CloudWatch log stream)
  2. submit without collection-handle (step function execution, CloudWatch log stream)
  3. submit with collection-handle (step function execution, CloudWatch log stream)
  4. ingest with DSS (step function execution, CloudWatch log stream)
  5. finalize (step function execution, CloudWatch log stream)

See ingested item at: https://mit-dspace.eks.prod.4science.cloud/handle/1721.1/164758

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review April 3, 2026 18:31
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner April 3, 2026 18:31
Copy link
Copy Markdown
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! A few questions

Comment thread pyproject.toml Outdated
Comment thread dsc/workflows/opencourseware/workflow.py
Comment thread dsc/utils/aws/s3.py
Comment thread dsc/utils/aws/s3.py
Comment thread dsc/utils/aws/s3.py
Comment thread dsc/cli.py
Comment thread dsc/utils/aws/s3.py Outdated
Comment thread dsc/workflows/base/workflow.py
Comment thread dsc/workflows/base/workflow.py
Copy link
Copy Markdown

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are some thoughtful and targeted changes, nice work!

Please see my comments about considering a new private _submit_item() (singular), how this exposes collection_handle as kind of awkward, and a possible doubling down on the getting of collection handles as a first class action that might deserve a method that is always called, not just when collection_handle = None.

Comment thread dsc/workflows/opencourseware/workflow.py
Comment thread dsc/utils/aws/s3.py
Comment thread dsc/cli.py
Comment thread dsc/utils/aws/s3.py Outdated

@final
def submit_items(self, collection_handle: str) -> list:
def submit_items(self, collection_handle: str | None = None) -> list:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here, but not a formal request for this PR.

I think this method might benefit from leaning on another like _submit_item(...). Then, this could be simplified as a for loop that orchestrates, but the private _submit_item(...) method has most of the business logic.

You'd end up with something more like this, with the try only attemping one statement:

for item_submission in ItemSubmission.get_batch(self.batch_id):
    self.submission_summary["total"] += 1

    if not item_submission.ready_to_submit():
        self.submission_summary["skipped"] += 1
        continue

    try:
        items.append(
            self._submit_item(item_submission, collection_handle, batch_metadata)
        )

    except NotImplementedError:
        raise

    except Exception as exception:  # noqa: BLE001
        self.submission_summary["errors"] += 1
        item_submission.status = ItemSubmissionStatus.SUBMIT_FAILED
        item_submission.status_details = str(exception)
        item_submission.submit_attempts += 1
        item_submission.upsert_db()

What I feel like that refactor work does is expose how collection_handle and batch_metadata continue to feel kind of awkward.

It would appear that for any given invocation of submit_items() they are shared across all items, so why couldn't they be saved to the workflow instance? or, attached to the ItemSubmission instances that are getting passed around?

Feel free to ignore this for now, but I may reference it for the collection handle resolution if not provided in another comment.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm circling back to this now given my continued reading and comment down below.

I get how this works, but am finding there is some cognitive dissonance here.

Is it correct that for most workflows, collection_handle is static / fixed / same for all items, and that's how it gets passed to submit_items()? But for some -- e.g. digi theses -- it is not, and can theoretically vary per item?

If so, I think I'd double down and always use a dedicated method for getting the collection handle. And, make all workflows implement it. Maybe 90% of them just return a single, hardcoded, static string, but then there is parity on how a single item gets its collection handle: it calls get_collection_handle(<ItemSubmission>) or something.

This still leaves the awkward batch_metadata in my comment above about refactoring to a private _submit_item(), but it knocks out the collection_handle passing weirdness!

Copy link
Copy Markdown
Contributor Author

@jonavellecuerdo jonavellecuerdo Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To recap, the collection_handle is passed into DSC ad follows:

  • Users can provide a collection_handle in the input payload for submit command when invoking the Step Function.
  • The collection_handle is set for the DSC submit CLI command.
  • The collection_handle is passed to Workflow.submit_items() call.
  • The collection_handle is set on the ItemSubmission object.

In my initial passes at defining this method, I was planning for the following signature: _get_item_collection_handle(item_metadata: dict), rather than accepting an ItemSubmission instance (for context, item metadata is not saved to the ItemSubmission object and is thus not present when the object is loaded from DynamoDB). Thus, item metadata has to come from somewhere, and currently it is from this line (which is less than ideal).

Continuing the assumption that item metadata is not saved to the ItemSubmission object, if we were to define the method for all workflows, a workflow like opencourseware or sccs would look like this:

class Workflow:
   def submit_items(self, collection_handle: str):
      get_item_collection_handle(collection_handle)

   # where
   def get_item_collection_handle(collection_handle: str): 
      return collection_handle

The sample code above feels redundant/somewhat of an anti-pattern. 🤔 For this reason, I proposed:

class Workflow:
   def submit_items(self, collection_handle: str | None = None): 
      item_submission.collection_handle = (
                    collection_handle or self._get_item_collection_handle(<item metadata accessed from somewhere>)
                )

I hope this provides some context into the updates proposed here! Lastly, I selected the name _get_item_collection_handle in hopes it would imply how, if the method is called, this collection handle is at the item-level and not intended to be used across all the item submissions for a given batch.

Let me know what you think about keeping the derivation as is. Fully acknowledging that we still should tackle how we access item metadata in a follow-up PR!

Tagging @ehanson8 to request additional feedback and thoughts, too.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonavellecuerdo - you're definitely much closer to the codebase, so I say go for it if it's feeling right still. I can always dig back in post PR and, if something still feels off, propose a small new PR at that time. Thanks for hearing me out!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will merge as-is for now but continue to ponder in next PR!

Comment thread dsc/workflows/base/workflow.py
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 23, 2026

Coverage Report for CI Build 24908887711

Warning

No base build found for commit cf71ca5 on main.
Coverage changes can't be calculated without a base build.
If a base build is processing, this comment will update automatically when it completes.

Coverage: 96.282%

Details

  • Patch coverage: 5 uncovered changes across 1 file (41 of 46 lines covered, 89.13%).

Uncovered Changes

File Changed Covered %
dsc/utils/aws/s3.py 32 27 84.38%

Coverage Regressions

Requires a base build to compare against. How to fix this →


Coverage Stats

Coverage Status
Relevant Lines: 1049
Covered Lines: 1010
Line Coverage: 96.28%
Coverage Strength: 0.96 hits per line

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo force-pushed the IN-1626-digitized-theses-prereqs branch from ab5c628 to 46a53ea Compare April 24, 2026 19:33
* Resolve CVEs
* Add dependencies for handling MARC XML metadata files
* Add ipython and dspace-rest-client as deps
Why these changes are being introduced:
* With the introduction of the digitized theses workflow, the sync
command will now be required (a) when DSC needs to copy data from
one bucket to another in different environments and (b) when
DSC needs to copy data from the Imaging Lab S3 bucket to a batch
folder in the DSC S3 bucket.

How this addresses that need:
* Create function for running 'sync' AWS CLI command in utils.aws.s3
* Update DSC 'sync' CLI command to call helper function instead

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1626
Why these changes are being introduced:
The collection handle is not required when the item submissions in
the batch belong to different collections and the handle must be
derived dynamically (usually from the metadata) or if DSC
needs to send a submission message with the 'update' operation,
which requires that the item handle is provided instead.

How this addresses that need:
* Remove 'required' setting for collection_handle in 'submit' CLI command
* Collection handle is made optional in base workflow submit_items method
* Add optional sub method for getting an item collection handle,
raise NotImplementedError for base workflow

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1688
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1626-digitized-theses-prereqs branch from 795e0d2 to e1bf1e9 Compare April 24, 2026 19:51
@jonavellecuerdo jonavellecuerdo merged commit 3cc6e39 into main Apr 24, 2026
6 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1626-digitized-theses-prereqs branch April 24, 2026 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants