Dsc refactor p3 #213

Draft

jonavellecuerdo wants to merge 8 commits into main from dsc-refactor-p3

Conversation

@jonavellecuerdo
Contributor

Purpose and background context

DO NOT MERGE

The purpose of this draft PR is to allow DataEng to leave comments + questions in preparation for our meeting to discuss proposed refactoring updates next Tuesday (2/24). Currently, all the refactor work is organized under three branches, prefixed with dsc-refactor-p*. Once DataEng has approved the general ideas for the refactor, the plan is to create individual PRs to merge changes from each branch into main in an effort to ease the review process!


Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

…bility

Why these changes are being introduced:
* The restructuring of the app's project directory was required to better accommodate
the introduction of new DSC workflows. As we've started to plan the implementation of
other workflows, it has become clear that each workflow has different requirements. This
new structure allows us to standardize the way each workflow is organized, while also
providing a space for any workflow-specific assets that may be needed.

How this addresses that need:
* Create folders for each DSC workflow implementation
* Organize workflow and transformer Python files and metadata mapping JSON files
under their corresponding workflow directories (see the layout sketch below)
* Clean up imports
* Update DSC version to 1.3

Side effects of this change:
* None. It's worth highlighting that this commit focuses on structural changes
to the app's project directory and does not introduce any functional changes
to the app.

Relevant ticket(s):
* TBD
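
For illustration, a rough sketch of the kind of layout this restructuring describes; the folder and file names below are assumptions, not the exact contents of the branch:

dsc/
  workflows/
    base/                      # shared Workflow (and Transformer) abstractions
    opencourseware/
      workflow.py              # workflow class
      transformer.py           # workflow-specific transformer
    archivesspace/
      workflow.py
      metadata_mapping.json    # workflow-specific assets live alongside the code
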
Why these changes are being introduced:
* Prior to OpenCourseWare, DSC's metadata transformation capabilities were limited
to simple 1-1 mappings and splitting strings on delimiters for multi-valued fields.
However, with OpenCourseWare and other workflows planned for implementation,
transformations can be more complex. These changes extend the transformer approach
originally developed for OpenCourseWare so that other workflows can use it as an option.

How this addresses that need:
* Create module for Transformer in dsc.workflows.base
* Update OpenCourseWareTransformer to inherit from base Transformer (see the sketch below)

Side effects of this change:
* None

Relevant ticket(s):
*
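
For illustration, a minimal sketch of the shape this could take, assuming hypothetical class, method, and field names (the actual module in dsc.workflows.base may differ):

from abc import ABC, abstractmethod
from typing import Any


class Transformer(ABC):
    """Transforms source metadata records into Dublin Core dictionaries."""

    @abstractmethod
    def transform(self, source_record: dict[str, Any]) -> dict[str, Any]:
        """Return a Dublin Core metadata dict for a single source record."""

    def transform_all(self, source_records: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Shared iteration logic lives on the base class.
        return [self.transform(record) for record in source_records]


class OpenCourseWareTransformer(Transformer):
    def transform(self, source_record: dict[str, Any]) -> dict[str, Any]:
        # Workflow-specific mappings are implemented as plain Python,
        # so complex transformations are not constrained by a JSON mapping file.
        return {"dc.title": source_record.get("course_title")}
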
Why these changes are being introduced:
* During batch creation, we are performing checks to see if
both metadata and bitstreams are provided for a given item
submission. When determining if metadata is available,
what we should really ask is whether the metadata is sufficient
and can be transformed to the format needed for
DSpace ingests.

How this addresses that need:
* Update ItemSubmission.prepare_dspace_metadata to clarify
what is being created (the 'MetadataEntry' object payload for the DSpace
REST API; see the sketch below)
* Deprecate ItemSubmission.validate_dspace_metadata
* Add property methods for retrieving dspace metadata S3 URIs
to base Workflow
* Add final method for creating DSpace metadata JSON to base workflow
* Remove calls to create DSpace metadata JSON from submit command

Side effects of this change:
* A successful run of batch creation will include the
creation of a 'dspace_metadata/' subfolder in the batch folder
that contains JSON files of Dublin Core formatted metadata
for each item submission.

Relevant ticket(s):
* TBD
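
For illustration, a sketch of the kind of 'MetadataEntry' payload prepare_dspace_metadata produces, written here as a standalone function; the signature is an assumption, but the {"key", "value", "language"} entry shape follows the DSpace 6 REST API documentation:

def prepare_dspace_metadata(dublin_core: dict[str, list[str]]) -> dict:
    """Build the 'metadata' JSON payload expected by the DSpace 6 REST API."""
    return {
        "metadata": [
            {"key": field, "value": value, "language": None}
            for field, values in dublin_core.items()
            for value in values
        ]
    }

# e.g. prepare_dspace_metadata({"dc.title": ["A Sample Title"]})
# -> {"metadata": [{"key": "dc.title", "value": "A Sample Title", "language": None}]}
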
…etadata

Why these changes are being introduced:
* This allows alignment of DSC workflows such that metadata transformations
are handled via a shared abstract class -- the Transformer. Prior to
these changes, some workflows were relying solely on the metadata mapping JSON
files. The Transformer class pulls inspiration from the Transmogrifier
app (https://github.com/MITLibraries/transmogrifier). This will allow
us to consolidate any and all complex transformations inside the Transformer
class, while still standardizing the "how" across all DSC workflows.

How this addresses that need:
* Define transformer modules for ArchivesSpace and SCCS
* Translate mappings from the metadata mapping JSON files to class methods
on the Transformer classes (see the sketch below)
* Add 'metadata_transformer' property to base Workflow class

Side effects of this change:
* This allows deprecation of metadata mapping JSON files.

Relevant ticket(s):
* TBD
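
For illustration, a sketch of what translating a simple mapping into a Transformer might look like; the JSON structure, class names, and source fields below are assumptions:

# metadata mapping JSON (before):
#   {"dc.title": {"source_field": "title"},
#    "dc.contributor.author": {"source_field": "authors", "delimiter": "|"}}

from dsc.workflows.base import Transformer  # assumed import path, per the earlier commit


class SCCSTransformer(Transformer):
    def transform(self, source_record: dict) -> dict:
        return {
            "dc.title": source_record["title"],
            # a split-on-delimiter mapping becomes plain Python
            "dc.contributor.author": source_record["authors"].split("|"),
        }
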
Contributor

I made a version of this comment that seemed to fall into a black hole, so repeating: I really like this organization. The previous approach worked, but I appreciate having a workflow's files collocated like this!

Contributor

I don't necessarily have a problem with this change, but more with its timing in relation to the DSpace 8 migration. My DSS updates and testing are getting to the point where I want DSC to produce DSpace 8 metadata, and we likely will soon need to handle configurable entities re: department names, which could be a significant shift.

As I said on Slack, I think it would be wise to hold off on any changes to metadata transformation until DSC is comfortably producing DSpace 8 metadata.

I know your goal is to make DSC more accommodating for Digitized Theses but DSpace 8 is coming before that so my hope is we can do DSpace 8 metadata first. I think it will also be easier to envision a better approach to DSC metadata transformation after we are comfortable with DSC producing DSpace 8 metadata.

Contributor Author

Thank you for bringing up DSpace 8 as I see that I hadn't mentioned it yet in any of my messages regarding this proposed refactor!

Given that we know we already need to make updates to DSC to produce DSpace 8 metadata, the hope is that we can solve this under the new structure proposed in this refactor. More specifically, figure out how to produce DSpace 8 metadata using:

  • the Transformer class
    • which is responsible for transforming source metadata to Dublin Core format
  • methods on the ItemSubmission class
    • which is responsible for taking the transformed Dublin Core metadata and creating a JSON file in the format required by the DSpace 6/8 REST API

Worth noting here: under this approach, we would no longer rely on the metadata mapping JSON files.


The main motivation for part 3 is:

  • Metadata transformations in DSC--in its current state--are somewhat difficult to follow because we have transforms that occur via a transformer (OpenCourseWare and--although tabled for now--Wiley) and via a metadata mapping JSON file.
  • The metadata mapping JSON file has limitations on complex transformations, which makes it seem like it would not solve for configurable entities either. 🤔
  • The Transformer class solves for complex transformations and, as this PR demonstrates, it would also be easy to create Transformers for simpler workflows like SCCS.

In summary, I feel confident that aligning how we transform metadata across all workflows will make it easier to solve the case for "how DSC creates DSpace 8 metadata"--whether that is handled via the Transformer or the methods on ItemSubmission!

logger = logging.getLogger(__name__)


class Transformer(ABC):
Contributor

Despite my comment about timing, I do think the Transformer approach could be a smart update.

"item_identifier": item_metadata["item_identifier"],
"workflow_name": self.workflow_name,
}
ItemSubmission(
Contributor

This seems like a good change!

Comment thread dsc/item_submission.py
Comment on lines +258 to +260
This method follows the specs for a 'MetadataEntry Object' for
the DSpace 6 REST API:
https://wiki.lyrasis.org/display/DSDOC6x/REST+API#RESTAPI-MetadataEntryO
Contributor

I did confirm that the MetadataEntry concept is not reflected in the DSpace 7+ REST API (REST contract) so I have reservations about a refactor incorporating DSpace 6 structures when we're so close to DSpace 8.

However, reviewing the REST contract did reveal that we can access metadata schemas, which I think will be critical for meaningful metadata validation.

Contributor Author

Hmm, see #213 (comment). 🤔

Similar to our approach for DSS, DSC will need to support DS6 and DS8 until the migration is complete. I feel confident that aligning how we transform metadata / generate the metadata required for the DSpace REST API across all our workflows will make it easier to solve the case of supporting multiple DSpace versions!
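
(Not something this PR does -- just a sketch of how a single ItemSubmission method might serve both versions from the same Dublin Core dict. The DSpace 6 MetadataEntry shape comes from the docs linked above; the DSpace 7/8 shape and the dspace_version parameter are assumptions based on the REST contract.)

def prepare_dspace_metadata(dublin_core: dict[str, list[str]], dspace_version: int) -> dict:
    if dspace_version < 7:
        # DSpace 6 REST API: a flat list of MetadataEntry objects
        return {
            "metadata": [
                {"key": field, "value": value, "language": None}
                for field, values in dublin_core.items()
                for value in values
            ]
        }
    # DSpace 7/8 REST contract: a map of field -> list of value objects
    return {
        "metadata": {
            field: [{"value": value, "language": None} for value in values]
            for field, values in dublin_core.items()
        }
    }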

Comment thread dsc/workflows/base/workflow.py Outdated
self._create_batch_in_db(item_submissions)

@abstractmethod
def prepare_batch(self, *, synced: bool = False) -> tuple[list, ...]:

As shared in an informal Slack huddle, I'd propose reworking the create_batch() / _prepare_batch() relationship.

Instead of requiring that workflows define a _prepare_batch() method, perhaps all workflows should be required to define their create_batch() method, and then the base workflow class fires a post-create hook that performs this work:

# (assuming the workflow-defined create_batch() has just returned item_submissions
# and any accumulated errors)
if errors:
    raise BatchCreationFailedError(errors)

_dspace_metadata_uris, errors = self._create_dspace_metadata_json(
    item_submissions
)
if errors:
    raise BatchCreationFailedError(errors)

self._create_batch_in_db(item_submissions)
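
For clarity, a minimal sketch of the template-method shape this suggestion implies; the method names other than create_batch() are assumptions:

from abc import ABC, abstractmethod
from typing import final


class Workflow(ABC):
    @final
    def run_batch_creation(self, *, synced: bool = False) -> None:
        # Each workflow defines create_batch(); the base class then fires the shared
        # post-create hook (DSpace metadata JSON creation, database writes, etc.).
        item_submissions = self.create_batch(synced=synced)
        self._post_create_batch(item_submissions)

    @abstractmethod
    def create_batch(self, *, synced: bool = False) -> list:
        """Build and return the item submissions for this workflow's batch."""

    def _post_create_batch(self, item_submissions: list) -> None:
        ...  # the error checks, _create_dspace_metadata_json, and _create_batch_in_db calls above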
