Dsc refactor p3 #213
Conversation
…bility

Why these changes are being introduced:
* The restructuring of the app's project directory was required to better accommodate the introduction of new DSC workflows. As we've started to plan the implementation of other workflows, it's becoming clear that each workflow is different. This new structure allows us to standardize the way each workflow is organized, while also providing a space for workflow-specific assets that may be needed.

How this addresses that need:
* Create folders for each DSC workflow implementation
* Organize workflow and transformer Python files and metadata mapping JSON files under corresponding workflow dirs
* Clean up imports
* Update DSC version to 1.3

Side effects of this change:
* None. It's worth highlighting that this commit focuses on structural changes to the app's project directory and does not introduce any functional changes to the app.

Relevant ticket(s):
* TBD
Why these changes are being introduced:
* Prior to OpenCourseWare, DSC's metadata transformation capabilities were limited to simple 1-1 mappings and splitting strings on delimiters for multi-valued fields. However, with OpenCourseWare and other workflows planned for implementation, transformations can be more complex. This extends the transformer approach originally developed for OpenCourseWare to other workflows as an option.

How this addresses that need:
* Create module for Transformer in dsc.workflows.base
* Update OpenCourseWareTransformer to inherit from base Transformer

Side effects of this change:
* None

Relevant ticket(s): *
Why these changes are being introduced:
* During batch creation, we perform checks to see if both metadata and bitstreams are provided for a given item submission. When determining if metadata is available, what we should really ask is whether the metadata is sufficient and can be transformed to the format needed for DSpace ingests.

How this addresses that need:
* Update ItemSubmission.prepare_dspace_metadata to clarify what is being created (the 'MetadataEntry' object payload for the DSpace REST API)
* Deprecate ItemSubmission.validate_dspace_metadata
* Add property methods for retrieving DSpace metadata S3 URIs to base Workflow
* Add final method for creating DSpace metadata JSON to base Workflow
* Remove calls to create DSpace metadata JSON from submit command

Side effects of this change:
* A successful run of batch creation will include the creation of a 'dspace_metadata/' subfolder in the batch folder that contains JSON files of Dublin Core formatted metadata for each item submission.

Relevant ticket(s):
* TBD
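To make the 'MetadataEntry' payload described above concrete, here is a minimal sketch of what such a serialization could look like. The function name, input shape, and field values are illustrative assumptions, not DSC's actual code; only the `{"key": ..., "value": ..., "language": ...}` entry shape comes from the DSpace 6 REST API docs.

```python
def prepare_dspace_metadata(dublin_core: dict[str, list[str]]) -> dict:
    """Build a DSpace 6 REST API payload of MetadataEntry objects.

    Each entry has the shape {"key": ..., "value": ..., "language": ...},
    with one entry per value for multi-valued Dublin Core fields.
    """
    return {
        "metadata": [
            {"key": field, "value": value, "language": None}
            for field, values in dublin_core.items()
            for value in values
        ]
    }
```

For example, a two-field Dublin Core dict with a multi-valued `dc.subject` would flatten into three MetadataEntry objects in the resulting payload.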
…etadata

Why these changes are being introduced:
* This allows alignment of DSC workflows such that metadata transformations are handled via a shared module/abstract class: the Transformer class. Prior to these changes, some workflows relied solely on the metadata mapping JSON files. The Transformer class pulls inspiration from the Transmogrifier app (https://github.com/MITLibraries/transmogrifier). This will allow us to consolidate any and all complex transformations inside the Transformer class, while still standardizing the "how" across all DSC workflows.

How this addresses that need:
* Define transformer modules for ArchivesSpace and SCCS
* Translate mapping from metadata mapping JSON files to class methods on Transformer classes
* Add 'metadata_transformer' property to base Workflow class

Side effects of this change:
* This allows deprecation of metadata mapping JSON files.

Relevant ticket(s):
* TBD
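As a rough sketch of what "translating mapping JSON files to class methods" could mean in practice, assuming hypothetical names (`Transformer`, `SimpleWorkflowTransformer`) and an invented mapping, not DSC's actual classes:

```python
from abc import ABC, abstractmethod
from typing import Any


class Transformer(ABC):
    """Shared abstraction: source metadata -> Dublin Core fields."""

    @abstractmethod
    def transform(self, source_metadata: dict[str, Any]) -> dict[str, list[str]]:
        """Return Dublin Core fields mapped from the source metadata."""


class SimpleWorkflowTransformer(Transformer):
    """Hypothetical transformer for a simple workflow.

    The same mapping might previously have lived in a metadata mapping
    JSON file, e.g. {"dc.title": {"source_field": "title"}}; expressing
    it as a method leaves room for arbitrarily complex logic later.
    """

    def transform(self, source_metadata: dict[str, Any]) -> dict[str, list[str]]:
        return {
            "dc.title": [source_metadata["title"]],
            "dc.subject": source_metadata.get("subjects", "").split("|"),
        }
```

A simple 1-1 mapping stays a one-liner per field, while complex transformations get ordinary Python instead of stretching the JSON mapping format.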
I made a version of this comment that seems to have fallen into a black hole, so repeating: I really like this organization! The previous approach worked, but I appreciate having a workflow's files collocated like this.
I don't necessarily have a problem with this change but more the timing in relation to the DSpace 8 migration. My DSS updates and testing are getting to the point where I want DSC to produce DSpace 8 metadata and we likely will soon need to handle configurable entities re: department names which could be a significant shift.
As I said on Slack, I think it would be wise to hold off on any changes to metadata transformation until DSC is comfortably producing DSpace 8 metadata.
I know your goal is to make DSC more accommodating for Digitized Theses but DSpace 8 is coming before that so my hope is we can do DSpace 8 metadata first. I think it will also be easier to envision a better approach to DSC metadata transformation after we are comfortable with DSC producing DSpace 8 metadata.
Thank you for bringing up DSpace 8 as I see that I hadn't mentioned it yet in any of my messages regarding this proposed refactor!
Given that we know we already need to make updates to DSC to produce DSpace 8 metadata, the hope is that we can solve this under the new structure proposed in this refactor. More specifically, figure out how to produce DSpace 8 metadata using:
- the `Transformer` class, which is responsible for transforming source metadata to Dublin Core format
- methods on the `ItemSubmission` class, which is responsible for taking the transformed, Dublin Core metadata and creating a JSON file with the format required by the DSpace 6/8 REST API
Worth noting here: this no longer relies on the metadata mapping JSON files.
The main motivation for part 3 is:
- Metadata transformations in DSC, in its current state, are somewhat difficult to follow because we have transforms that occur via a transformer (`OpenCourseWare` and, although tabled for now, `Wiley`) and via a metadata mapping JSON file.
- The metadata mapping JSON file has limitations on complex transformations, which makes it seem like it would not solve for configurable entities either. 🤔
- The use of a `Transformer` class solves for complex transformations and, as this PR demonstrates, it would be easy to create `Transformer`s for simpler workflows like SCCS.
In summary, I feel confident that aligning how we transform metadata across all workflows will make it easier to solve for "how DSC creates DSpace 8 metadata", whether that is handled via the `Transformer` or the methods on `ItemSubmission`!
```python
logger = logging.getLogger(__name__)


class Transformer(ABC):
```
Despite my comment about timing, I do think the Transformer approach could be a smart update.
```python
    "item_identifier": item_metadata["item_identifier"],
    "workflow_name": self.workflow_name,
}
ItemSubmission(
```
This seems like a good change!
```
This method follows the specs for a 'MetadataEntry Object' for
the DSpace 6 REST API:
https://wiki.lyrasis.org/display/DSDOC6x/REST+API#RESTAPI-MetadataEntryO
```
I did confirm that the MetadataEntry concept is not reflected in the DSpace 7+ REST API (REST contract) so I have reservations about a refactor incorporating DSpace 6 structures when we're so close to DSpace 8.
However, reviewing the REST contract did reveal that we can access metadata schemas, which I think will be critical for meaningful metadata validation.
Hmm, see #213 (comment). 🤔
Similar to our approach for DSS, DSC will need to support DS6 and DS8 until the migration is complete. I feel confident that aligning how we transform metadata / generate the metadata required for the DSpace REST API across all our workflows will make it easier to solve the case of supporting multiple DSpace versions!
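One hedged sketch of supporting both versions behind a single interface: the function name and dispatch-by-version-number design are assumptions, while the two payload shapes follow the DSpace 6 REST API (flat MetadataEntry list) and the DSpace 7+ REST contract (metadata keyed by field name).

```python
def serialize_metadata(dublin_core: dict[str, list[str]], dspace_version: int) -> dict:
    """Serialize Dublin Core metadata for the target DSpace version."""
    if dspace_version == 6:
        # DSpace 6 REST API: a flat list of MetadataEntry objects.
        return {
            "metadata": [
                {"key": field, "value": value, "language": None}
                for field, values in dublin_core.items()
                for value in values
            ]
        }
    # DSpace 7+/8 REST contract: metadata keyed by field name, with
    # each value expressed as an object.
    return {
        "metadata": {
            field: [{"value": value, "language": None} for value in values]
            for field, values in dublin_core.items()
        }
    }
```

Keeping the upstream transformation (`Transformer`) version-agnostic and confining version differences to this final serialization step is one way to support both until the migration completes.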
```python
self._create_batch_in_db(item_submissions)


@abstractmethod
def prepare_batch(self, *, synced: bool = False) -> tuple[list, ...]:
```
As shared in an informal Slack huddle, I'd propose reworking the `create_batch()` / `_prepare_batch()` relationship. Instead of requiring that workflows define a `_prepare_batch()` method, perhaps all workflows should be required to define their own `create_batch()` method, and then the base class fires a post-create hook that performs this work:
```python
if errors:
    raise BatchCreationFailedError(errors)
_dspace_metadata_uris, errors = self._create_dspace_metadata_json(
    item_submissions
)
if errors:
    raise BatchCreationFailedError(errors)
self._create_batch_in_db(item_submissions)
```
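A minimal sketch of the proposed template-method shape, assuming hypothetical names (`BaseWorkflow`, `run`, `_post_create_hook`) and placeholder implementations for the shared steps; this is one reading of the proposal, not DSC's actual code:

```python
from abc import ABC, abstractmethod


class BatchCreationFailedError(Exception):
    """Raised when any step of batch creation reports errors."""


class BaseWorkflow(ABC):
    """Workflows define create_batch(); the base class runs the shared
    finalization steps in a post-create hook."""

    @abstractmethod
    def create_batch(self) -> tuple[list, list]:
        """Workflow-specific logic; returns (item_submissions, errors)."""

    def run(self) -> list:
        item_submissions, errors = self.create_batch()
        if errors:
            raise BatchCreationFailedError(errors)
        self._post_create_hook(item_submissions)
        return item_submissions

    def _post_create_hook(self, item_submissions: list) -> None:
        _dspace_metadata_uris, errors = self._create_dspace_metadata_json(
            item_submissions
        )
        if errors:
            raise BatchCreationFailedError(errors)
        self._create_batch_in_db(item_submissions)

    def _create_dspace_metadata_json(self, item_submissions: list) -> tuple[list, list]:
        # Placeholder: in DSC this would write Dublin Core JSON (e.g. to S3).
        return ([f"dspace_metadata/{item}.json" for item in item_submissions], [])

    def _create_batch_in_db(self, item_submissions: list) -> None:
        # Placeholder for database persistence.
        self.saved = list(item_submissions)
```

This keeps the error handling and DB/metadata steps in one place while each workflow owns only its batch-assembly logic.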
Purpose and background context
DO NOT MERGE
The purpose of this draft PR is to allow DataEng to leave comments + questions in preparation for our meeting to discuss proposed refactoring updates next Tuesday (2/24). Currently, all the refactor work is organized under three branches, prefixed with `dsc-refactor-p*`. Once DataEng has approved the general ideas for the refactor, the plan is to create individual PRs to merge changes from each branch into `main` in an effort to ease the review process!

Code review