Add per-asset dataStandard, HED standard, and extensions to StandardsType #371
base: master
170eb62
721bd11
702f464
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,3 +14,5 @@ sandbox/ | |
| venv/ | ||
| venvs/ | ||
| dandischema/_version.py | ||
| uv.lock | ||
| .cache/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,138 @@ | ||
| # CLAUDE.md | ||
|
|
||
| This file provides guidance to Claude Code when working with code in this | ||
| repository. | ||
|
|
||
| ## Project Overview | ||
|
|
||
| dandischema defines the Pydantic v2 metadata models for the DANDI | ||
| neurophysiology data archive. It is used by both the dandi-cli client and the | ||
| dandi-archive server. Key concerns: model definitions, JSON Schema generation, | ||
| metadata validation, schema migration between versions, and asset metadata | ||
| aggregation. | ||
|
|
||
| ## Build/Test Commands | ||
|
|
||
| ```bash | ||
| tox -e py3 # Run full test suite (preferred) | ||
| pytest dandischema/ # Run tests directly in active venv | ||
| pytest dandischema/tests/test_metadata.py -v -k "test_name" # Single test | ||
| tox -e lint # codespell + flake8 | ||
| tox -e typing # mypy (strict, with pydantic plugin) | ||
| ``` | ||
|
|
||
| - `filterwarnings = error` is active — new warnings will fail tests. | ||
| - Coverage is collected by default (`--cov=dandischema`). | ||
|
|
||
| ## Code Style | ||
|
|
||
| - **Formatter**: Black (no explicit line-length override → default 88) | ||
| - **Import sorting**: isort with `profile = "black"`, `force_sort_within_sections`, | ||
| `reverse_relative` | ||
| - **Linting**: flake8 (max-line-length=100, ignores E203/W503) | ||
| - **Type checking**: mypy strict — `no_implicit_optional`, `warn_return_any`, | ||
| `warn_unreachable`, pydantic plugin enabled | ||
| - **Pre-commit hooks**: trailing-whitespace, end-of-file-fixer, check-yaml, | ||
| check-added-large-files, black, isort, codespell, flake8 | ||
| - Imports at top of file; avoid function-level imports unless there is a | ||
| concrete reason (circular deps, heavy transitive imports) | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### Key Modules | ||
|
|
||
| | File | Role | | ||
| |------|------| | ||
| | `models.py` | All Pydantic models (~2000 lines). Class hierarchy rooted at `DandiBaseModel`. | | ||
| | `metadata.py` | `validate()`, `migrate()`, `aggregate_assets_summary()`. | | ||
| | `consts.py` | `DANDI_SCHEMA_VERSION`, `ALLOWED_INPUT_SCHEMAS`, `ALLOWED_TARGET_SCHEMAS`. | | ||
| | `conf.py` | Instance configuration via env vars (`DANDI_INSTANCE_NAME`, etc.). | | ||
| | `types.py` | Custom Pydantic types (`ByteSizeJsonSchema`). | | ||
| | `utils.py` | JSON schema helpers, `version2tuple()`, `name2title()`. | | ||
| | `exceptions.py` | `ValidationError`, `JsonschemaValidationError`, `PydanticValidationError`. | | ||
| | `digests/` | `DandiETag` multipart-upload checksum calculation. | | ||
| | `datacite/` | DataCite DOI metadata conversion. | | ||
|
|
||
| ### Model Hierarchy (simplified) | ||
|
|
||
| ``` | ||
| DandiBaseModel | ||
| ├── PropertyValue # recursive (self-referencing) | ||
| ├── BaseType | ||
| │ ├── StandardsType # name, identifier, version, extensions (recursive) | ||
| │ ├── ApproachType, AssayType, SampleType, Anatomy, ... | ||
| │ └── MeasurementTechniqueType | ||
| ├── Person, Organization # Contributor subclasses | ||
| ├── BioSample # recursive (wasDerivedFrom) | ||
| ├── AssetsSummary # aggregated stats | ||
| └── CommonModel | ||
| ├── Dandiset → PublishedDandiset | ||
| └── BareAsset → Asset → PublishedAsset | ||
| ``` | ||
|
|
||
| Several models are **self-referencing** (PropertyValue, BioSample, | ||
| StandardsType). These require `model_rebuild()` after the class definition. | ||
|
|
||
| ### Data Flow: Asset Metadata Aggregation | ||
|
|
||
| 1. dandi-cli calls `asset.get_metadata()` → populates `BareAsset` including | ||
| per-asset `dataStandard` list | ||
| 2. Asset metadata is serialized via `model_dump(mode="json", exclude_none=True)` | ||
| 3. Server calls `aggregate_assets_summary(assets)` → | ||
| `_add_asset_to_stats()` per asset → `AssetsSummary` | ||
| 4. `_add_asset_to_stats()` collects: numberOfBytes, numberOfFiles, approach, | ||
| measurementTechnique, variableMeasured, species, subjects, dataStandard | ||
| 5. `dataStandard` has deprecated path/encoding heuristic fallbacks for old | ||
| clients (remove after 2026-12-01) | ||
|
|
||
| ### Pre-instantiated Standard Constants | ||
|
|
||
| ```python | ||
| nwb_standard # RRID:SCR_015242 | ||
| bids_standard # RRID:SCR_016124 | ||
| ome_ngff_standard # DOI:10.25504/FAIRsharing.9af712 | ||
| hed_standard # RRID:SCR_014074 | ||
| ``` | ||
|
|
||
| These are dicts (`model_dump(mode="json", exclude_none=True)`) used by both | ||
| dandischema (heuristic fallbacks) and dandi-cli (per-asset population). | ||
|
|
||
| ### Vendorization | ||
|
|
||
| The schema supports deployment for different DANDI instances. Environment | ||
| variables (`DANDI_INSTANCE_NAME`, `DANDI_INSTANCE_IDENTIFIER`, | ||
| `DANDI_DOI_PREFIX`, etc.) must be set **before** importing | ||
| `dandischema.models`. This dynamically adjusts identifier patterns, DOI | ||
| prefixes, license enums, and URL patterns. CI tests multiple vendored | ||
| configurations. | ||
|
|
||
| ## Schema Change Checklist | ||
|
|
||
| When adding or removing fields from any model (BareAsset, Dandiset, | ||
| AssetsSummary, etc.): | ||
|
|
||
| 1. **Update `_FIELDS_INTRODUCED` in `metadata.py:migrate()`** if adding a new | ||
| **top-level field to Dandiset metadata** — `migrate()` only processes | ||
| Dandiset-level dicts (not Asset metadata). Fields on BareAsset or nested | ||
| inside existing structures (e.g. new fields on StandardsType) do not need | ||
| entries here. | ||
|
|
||
| 2. **Update `consts.py`** if bumping `DANDI_SCHEMA_VERSION` or adding to | ||
| `ALLOWED_INPUT_SCHEMAS`. | ||
|
|
||
| 3. **Add tests** covering migration/aggregation with the new field. | ||
|
|
||
| 4. **Coordinate with dandi-cli** — new fields that dandi-cli populates need | ||
| backward-compat guards there (check `"field" in Model.model_fields`) until | ||
| the minimum dandischema dependency is bumped. | ||
|
|
||
| ## Testing Notes | ||
|
|
||
| - Tests use `filterwarnings = error` — any new deprecation warning will fail. | ||
| - The `clear_dandischema_modules_and_set_env_vars` fixture (conftest.py) | ||
| supports testing vendored configurations by clearing cached modules and | ||
| setting env vars. | ||
| - Network-dependent tests are skipped when `DANDI_TESTS_NONETWORK` is set. | ||
| - DataCite tests require `DATACITE_DEV_LOGIN` / `DATACITE_DEV_PASSWORD`. | ||
| - `test_models.py:test_duplicate_classes` checks for duplicate field qnames | ||
| across models; allowed duplicates are listed explicitly. |
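The aggregation flow described in the CLAUDE.md diff above (per-asset `dataStandard` lists collected into a summary) can be sketched on plain dicts as follows. This is only an illustration, not the actual `_add_asset_to_stats()` implementation: the helper name `collect_data_standards`, the sample asset paths, and the choice to treat two entries as equal when their canonical JSON forms match are all assumptions.

```python
import json
from typing import Any, Dict, Iterable, List


def collect_data_standards(
    assets: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Collect unique dataStandard entries across asset metadata dicts.

    Hypothetical helper for illustration only. Uniqueness is decided by the
    canonical (sorted-key) JSON form of each entry, which is an assumption
    of this sketch rather than the behavior of dandischema.
    """
    seen = set()
    collected: List[Dict[str, Any]] = []
    for asset in assets:
        # Metadata from older clients may omit dataStandard entirely.
        for standard in asset.get("dataStandard") or []:
            key = json.dumps(standard, sort_keys=True)
            if key not in seen:
                seen.add(key)
                collected.append(standard)
    return collected


# Identifiers below are the ones listed for nwb_standard and hed_standard.
nwb = {
    "schemaKey": "StandardsType",
    "name": "Neurodata Without Borders (NWB)",
    "identifier": "RRID:SCR_015242",
}
hed = {
    "schemaKey": "StandardsType",
    "name": "Hierarchical Event Descriptors (HED)",
    "identifier": "RRID:SCR_014074",
}

# Made-up asset dicts; only the dataStandard key matters here.
assets = [
    {"path": "sub-01/a.nwb", "dataStandard": [nwb]},
    {"path": "sub-01/b.nwb", "dataStandard": [nwb, hed]},
    {"path": "README.md"},
]
summary_standards = collect_data_standards(assets)
```

Keeping first-seen order makes the result deterministic for a given asset ordering; whether the real aggregation preserves order is not stated in this PR.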
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -860,11 +860,28 @@ class MeasurementTechniqueType(BaseType): | |
| class StandardsType(BaseType): | ||
| """Identifier for data standard used""" | ||
|
|
||
| version: Optional[str] = Field( | ||
| None, | ||
| description="Version of the standard used.", | ||
| json_schema_extra={"nskey": "schema"}, | ||
| ) | ||
| extensions: Optional[List[StandardsType]] = Field( | ||
| None, | ||
| description="Extensions to the standard used " | ||
| "(e.g. NWB extensions like ndx-*, HED library schemas).", | ||
| json_schema_extra={"nskey": DANDI_NSKEY}, | ||
| ) | ||
| # TODO: consider how to formalize BIDS extensions (BEPs) once BIDS | ||
| # has a machine-readable way to declare them. | ||
| schemaKey: Literal["StandardsType"] = Field( | ||
| "StandardsType", validate_default=True, json_schema_extra={"readOnly": True} | ||
| ) | ||
|
|
||
|
|
||
| # Self-referencing model needs rebuild after class definition | ||
| # https://docs.pydantic.dev/latest/concepts/postponed_annotations/#self-referencing-or-recursive-models | ||
| StandardsType.model_rebuild() | ||
|
|
||
| nwb_standard = StandardsType( | ||
| name="Neurodata Without Borders (NWB)", | ||
| identifier="RRID:SCR_015242", | ||
|
|
@@ -880,6 +897,11 @@ class StandardsType(BaseType): | |
| identifier="DOI:10.25504/FAIRsharing.9af712", | ||
| ).model_dump(mode="json", exclude_none=True) | ||
|
Member
Author
@kabilar please add additional constructs for anything LINC related -- like those for bigtiff etc, and then here as to use them in dandi-cli
||
|
|
||
| hed_standard = StandardsType( | ||
|
Member
Defining I think that should be done in a separate PR though.
Collaborator
Agreed. Keeping lowercase here for consistency with the existing pattern; the rename to UPPER_CASE can be done in a separate PR as you suggest.
||
| name="Hierarchical Event Descriptors (HED)", | ||
| identifier="RRID:SCR_014074", | ||
| ).model_dump(mode="json", exclude_none=True) | ||
|
|
||
|
|
||
| class ContactPoint(DandiBaseModel): | ||
| email: Optional[EmailStr] = Field( | ||
|
|
@@ -1841,6 +1863,12 @@ class BareAsset(CommonModel): | |
| json_schema_extra={"nskey": "prov"}, | ||
| ) | ||
|
|
||
| dataStandard: Optional[List[StandardsType]] = Field( | ||
| None, | ||
| description="Data standard(s) applicable to this asset.", | ||
| json_schema_extra={"nskey": DANDI_NSKEY}, | ||
| ) | ||
|
|
||
| # Bare asset is to be just Asset. | ||
| schemaKey: Literal["Asset"] = Field( | ||
| "Asset", validate_default=True, json_schema_extra={"readOnly": True} | ||
|
|
||
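Because `dataStandard` is a new field on `BareAsset`, a client that must run against both older and newer dandischema releases could check for the field before populating it, as the checklist in CLAUDE.md suggests. A minimal sketch, assuming Pydantic v2 style classes that expose a `model_fields` mapping: `OldBareAsset` and `NewBareAsset` are hypothetical stand-ins so the example runs without dandischema installed, and `build_asset_metadata` is an illustrative helper, not dandi-cli code.

```python
from typing import Any, Dict, List, Optional


class OldBareAsset:
    """Stand-in for a model class from a release without dataStandard."""

    model_fields = {"path": None, "contentSize": None}


class NewBareAsset:
    """Stand-in for a model class from a release that has dataStandard."""

    model_fields = {"path": None, "contentSize": None, "dataStandard": None}


def build_asset_metadata(
    model_cls: Any,
    path: str,
    standards: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build a metadata dict, adding dataStandard only when supported."""
    metadata: Dict[str, Any] = {"path": path}
    # Guard: only populate the field if the installed model declares it.
    if standards and "dataStandard" in model_cls.model_fields:
        metadata["dataStandard"] = standards
    return metadata


nwb = {"name": "Neurodata Without Borders (NWB)", "identifier": "RRID:SCR_015242"}
old_meta = build_asset_metadata(OldBareAsset, "sub-01/a.nwb", [nwb])
new_meta = build_asset_metadata(NewBareAsset, "sub-01/a.nwb", [nwb])
```

Once the minimum dandischema dependency is bumped, the guard becomes dead code and can be removed.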
Shouldn't we set the `"nskey"` key in the `json_schema_extra` dict? If `extensions` is not defined in a known ontology, I think we should use ours, i.e.,

Agreed. Added `"nskey": DANDI_NSKEY` to extensions in 702f464 (RF: address PR review - replace readOnly with nskey, fix extensions field).
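To make the recursive `extensions` field concrete, here is a sketch of what a serialized `dataStandard` entry with nested extensions might look like, and how a consumer could walk it. The extension names and versions, and the helper `iter_standard_names`, are illustrative assumptions and not values taken from this PR.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_standard_names(
    standard: Dict[str, Any], depth: int = 0
) -> Iterator[Tuple[int, str]]:
    """Yield (depth, name) for a standard and all nested extensions.

    Hypothetical helper: walks the self-referencing "extensions" list
    depth-first, in the order the entries appear.
    """
    yield depth, standard.get("name", "<unnamed>")
    for extension in standard.get("extensions") or []:
        yield from iter_standard_names(extension, depth + 1)


# Made-up payload: only the NWB name and identifier come from the diff.
example_standard = {
    "schemaKey": "StandardsType",
    "name": "Neurodata Without Borders (NWB)",
    "identifier": "RRID:SCR_015242",
    "extensions": [
        {"schemaKey": "StandardsType", "name": "ndx-example", "version": "0.1.0"},
        {
            "schemaKey": "StandardsType",
            "name": "ndx-other-example",
            "version": "1.0.0",
            "extensions": [
                {"schemaKey": "StandardsType", "name": "nested-example"},
            ],
        },
    ],
}

walked = list(iter_standard_names(example_standard))
```

Treating a missing or `None` `extensions` value as an empty list matches payloads dumped with `exclude_none=True`, where the key is simply absent.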