Prerequisites for Digitized Theses workflow#225
Conversation
ehanson8
left a comment
There was a problem hiding this comment.
Excellent work! A few questions
ghukill
left a comment
There was a problem hiding this comment.
These are some thoughtful and targeted changes, nice work!
Please see my comments about considering a new private _submit_item() (singular), how this exposes collection_handle as kind of awkward, and a possible doubling down on the getting of collection handles as a first class action that might deserve a method that is always called, not just when collection_handle = None.
|
|
||
| @final | ||
| def submit_items(self, collection_handle: str) -> list: | ||
| def submit_items(self, collection_handle: str | None = None) -> list: |
There was a problem hiding this comment.
Commenting here, but not a formal request for this PR.
I think this method might benefit from leaning on another like _submit_item(...). Then, this could be simplified as a for loop that orchestrates, but the private _submit_item(...) method has most of the business logic.
You'd end up with something more like this, with the try only attemping one statement:
for item_submission in ItemSubmission.get_batch(self.batch_id):
self.submission_summary["total"] += 1
if not item_submission.ready_to_submit():
self.submission_summary["skipped"] += 1
continue
try:
items.append(
self._submit_item(item_submission, collection_handle, batch_metadata)
)
except NotImplementedError:
raise
except Exception as exception: # noqa: BLE001
self.submission_summary["errors"] += 1
item_submission.status = ItemSubmissionStatus.SUBMIT_FAILED
item_submission.status_details = str(exception)
item_submission.submit_attempts += 1
item_submission.upsert_db()What I feel like that refactor work does is expose how collection_handle and batch_metadata continue to feel kind of awkward.
It would appear that for any given invocation of submit_items() they are shared across all items, so why couldn't they be saved to the workflow instance? or, attached to the ItemSubmission instances that are getting passed around?
Feel free to ignore this for now, but I may reference it for the collection handle resolution if not provided in another comment.
There was a problem hiding this comment.
I'm circling back to this now given my continued reading and comment down below.
I get how this works, but am finding there is some cognitive dissonance here.
Is it correct that for most workflows, collection_handle is static / fixed / same for all items, and that's how it gets passed to submit_items()? But for some -- e.g. digi theses -- it is not, and can theoretically vary per item?
If so, I think I'd double down and always use a dedicated method for getting the collection handle. And, make all workflows implement it. Maybe 90% of them just return a single, hardcoded, static string, but then there is parity on how a single item gets its collection handle: it calls get_collection_handle(<ItemSubmission>) or something.
This still leaves the awkward batch_metadata in my comment above about refactoring to a private _submit_item(), but it knocks out the collection_handle passing weirdness!
There was a problem hiding this comment.
To recap, the collection_handle is passed into DSC ad follows:
- Users can provide a
collection_handlein the input payload forsubmitcommand when invoking the Step Function. - The
collection_handleis set for the DSCsubmitCLI command. - The
collection_handleis passed toWorkflow.submit_items()call. - The
collection_handleis set on theItemSubmissionobject.
In my initial passes at defining this method, I was planning for the following signature: _get_item_collection_handle(item_metadata: dict), rather than accepting an ItemSubmission instance (for context, item metadata is not saved to the ItemSubmission object and is thus not present when the object is loaded from DynamoDB). Thus, item metadata has to come from somewhere, and currently it is from this line (which is less than ideal).
Continuing the assumption that item metadata is not saved to the ItemSubmission object, if we were to define the method for all workflows, a workflow like opencourseware or sccs would look like this:
class Workflow:
def submit_items(self, collection_handle: str):
get_item_collection_handle(collection_handle)
# where
def get_item_collection_handle(collection_handle: str):
return collection_handleThe sample code above feels redundant/somewhat of an anti-pattern. 🤔 For this reason, I proposed:
class Workflow:
def submit_items(self, collection_handle: str | None = None):
item_submission.collection_handle = (
collection_handle or self._get_item_collection_handle(<item metadata accessed from somewhere>)
)I hope this provides some context into the updates proposed here! Lastly, I selected the name _get_item_collection_handle in hopes it would imply how, if the method is called, this collection handle is at the item-level and not intended to be used across all the item submissions for a given batch.
Let me know what you think about keeping the derivation as is. Fully acknowledging that we still should tackle how we access item metadata in a follow-up PR!
Tagging @ehanson8 to request additional feedback and thoughts, too.
There was a problem hiding this comment.
@jonavellecuerdo - you're definitely much closer to the codebase, so I say go for it if it's feeling right still. I can always dig back in post PR and, if something still feels off, propose a small new PR at that time. Thanks for hearing me out!
There was a problem hiding this comment.
I will merge as-is for now but continue to ponder in next PR!
Coverage Report for CI Build 24908887711Warning No base build found for commit Coverage: 96.282%Details
Uncovered Changes
Coverage RegressionsRequires a base build to compare against. How to fix this → Coverage Stats
💛 - Coveralls |
ab5c628 to
46a53ea
Compare
* Resolve CVEs * Add dependencies for handling MARC XML metadata files * Add ipython and dspace-rest-client as deps
Why these changes are being introduced: * With the introduction of the digitized theses workflow, the sync command will now be required (a) when DSC needs to copy data from one bucket to another in different environments and (b) when DSC needs to copy data from the Imaging Lab S3 bucket to a batch folder in the DSC S3 bucket. How this addresses that need: * Create function for running 'sync' AWS CLI command in utils.aws.s3 * Update DSC 'sync' CLI command to call helper function instead Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1626
Why these changes are being introduced: The collection handle is not required when the item submissions in the batch belong to different collections and the handle must be derived dynamically (usually from the metadata) or if DSC needs to send a submission message with the 'update' operation, which requires that the item handle is provided instead. How this addresses that need: * Remove 'required' setting for collection_handle in 'submit' CLI command * Collection handle is made optional in base workflow submit_items method * Add optional sub method for getting an item collection handle, raise NotImplementedError for base workflow Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1688
795e0d2 to
e1bf1e9
Compare
Purpose and background context
This PR comprises of a series of updates to the DSC app in preparation for the Digitized Theses workflow. At high level, these updates are:
Workflow.prepare_batchto returnItemSubmissionobjectsItemSubmissionobjectsyncAWS CLI commandsyncDSC CLI commanddigitized-theses-workspaceto a minted batch folder indsc-uploadin S3.archive_file_with_new_keyas general method for moving files in S3replacement-theses/andnew_theses/subfolders (prefixes)submitCLI commandHow can a reviewer manually see the effects of these changes?
Full DSC
OpenCourseWareworkflow run:collection-handle(step function execution, CloudWatch log stream)collection-handle(step function execution, CloudWatch log stream)See ingested item at: https://mit-dspace.eks.prod.4science.cloud/handle/1721.1/164758
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Code review