refactor: define codec and data type classes upstream in a subpackage #3875
Draft: d-v-b wants to merge 2 commits into zarr-developers:main from d-v-b:refactor/upstream-apis
Changes from all commits (2 commits)
`README.md` (new file, 17 additions)

````markdown
# zarr-interfaces

Interface definitions (ABCs and protocols) for zarr codecs and data types.

This package provides the abstract base classes and protocols that external
codec and data type implementations should subclass or implement. It has
minimal dependencies (only numpy) and does not depend on zarr-python itself.

## Usage

```python
from zarr_interfaces.codec.v1 import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec
from zarr_interfaces.data_type.v1 import ZDType
```

Interfaces are versioned under a `v1` namespace to support future evolution
without breaking existing implementations.
````
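The "protocols" half of that description means structural typing: a conforming class never has to import or subclass anything from this package. A minimal self-contained illustration of this, using the `Buffer` protocol from `codec/v1.py` (restated inline so the snippet runs without `zarr_interfaces` installed; `MemoryBuffer` is a made-up class, not part of the PR):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Buffer(Protocol):
    # Restated from zarr_interfaces.codec.v1 so this runs standalone.
    def __len__(self) -> int: ...
    def __getitem__(self, key: slice) -> "Buffer": ...


class MemoryBuffer:
    """Hypothetical flat buffer backed by plain bytes."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, key: slice) -> "MemoryBuffer":
        return MemoryBuffer(self._data[key])


buf = MemoryBuffer(b"chunk-bytes")
# Structural match: MemoryBuffer never names Buffer, yet satisfies it.
assert isinstance(buf, Buffer)
assert len(buf[0:5]) == 5
```

Because `Buffer` is `runtime_checkable`, the `isinstance` check only verifies that the required methods exist; static type checkers verify the full signatures.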
`pyproject.toml` (new file, 17 additions)

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "zarr-interfaces"
version = "0.1.0"
description = "Interface definitions (ABCs and protocols) for zarr codecs and data types"
readme = "README.md"
license = "BSD-3-Clause"
requires-python = ">=3.12"
dependencies = [
    "numpy>=2",
]

[project.urls]
homepage = "https://github.com/zarr-developers/zarr-python"
```
`packages/zarr-interfaces/src/zarr_interfaces/codec/v1.py` (new file, 202 additions)

```python
"""Codec interface definitions (v1).

This module defines the abstract interfaces for zarr codecs.
External codec implementations should subclass ``ArrayArrayCodec``,
``ArrayBytesCodec``, or ``BytesBytesCodec`` from this module.

The ``Buffer`` and ``NDBuffer`` types here are protocols — they define
the structural interface that zarr's concrete buffer types implement.
Codec authors should type against these protocols, not zarr's concrete
buffer classes.
"""

from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import TYPE_CHECKING, ClassVar, Protocol, Self, runtime_checkable

if TYPE_CHECKING:
    from collections.abc import Iterable

    import numpy as np
    import numpy.typing as npt

    from zarr_interfaces.data_type.v1 import JSON, TBaseDType, TBaseScalar, ZDType


# ---------------------------------------------------------------------------
# Buffer protocols
# ---------------------------------------------------------------------------


class Buffer(Protocol):
    """Protocol for a flat contiguous memory block (bytes-like)."""

    def __len__(self) -> int: ...
    def __getitem__(self, key: slice) -> Buffer: ...


class NDBuffer(Protocol):
    """Protocol for an N-dimensional array buffer."""

    @property
    def dtype(self) -> np.dtype[np.generic]: ...

    @property
    def shape(self) -> tuple[int, ...]: ...

    def as_ndarray_like(self) -> npt.NDArray[np.generic]: ...

    @classmethod
    def from_ndarray_like(cls, data: npt.NDArray[np.generic]) -> NDBuffer: ...

    def transpose(self, axes: tuple[int, ...]) -> NDBuffer: ...

    def __getitem__(self, key: object) -> NDBuffer: ...

    def __setitem__(self, key: object, value: object) -> None: ...


# ---------------------------------------------------------------------------
# ArraySpec protocol
# ---------------------------------------------------------------------------


class ArraySpec(Protocol):
    """Protocol for the specification of a chunk's metadata."""

    @property
    def shape(self) -> tuple[int, ...]: ...

    @property
    def dtype(self) -> ZDType[TBaseDType, TBaseScalar]: ...

    @property
    def fill_value(self) -> object: ...

    @property
    def ndim(self) -> int: ...


# ---------------------------------------------------------------------------
# Codec input/output type aliases
# ---------------------------------------------------------------------------

type CodecInput = NDBuffer | Buffer
type CodecOutput = NDBuffer | Buffer


# ---------------------------------------------------------------------------
# Sync codec protocol
# ---------------------------------------------------------------------------


@runtime_checkable
class SupportsSyncCodec[CI: CodecInput, CO: CodecOutput](Protocol):
    """Protocol for codecs that support synchronous encode/decode.

    The type parameters mirror ``BaseCodec``: ``CI`` is the decoded type
    and ``CO`` is the encoded type.
    """

    def _decode_sync(self, chunk_data: CO, chunk_spec: ArraySpec) -> CI: ...

    def _encode_sync(self, chunk_data: CI, chunk_spec: ArraySpec) -> CO | None: ...


# ---------------------------------------------------------------------------
# Codec ABCs
# ---------------------------------------------------------------------------


@dataclass(frozen=True)
class BaseCodec[CI: CodecInput, CO: CodecOutput](ABC):
    """Generic base class for codecs.

    Subclass ``ArrayArrayCodec``, ``ArrayBytesCodec``, or
    ``BytesBytesCodec`` instead of this class directly.
    """

    is_fixed_size: ClassVar[bool]

    @classmethod
    def from_dict(cls, data: dict[str, JSON]) -> Self:
        """Create an instance from a JSON dictionary."""
        return cls(**data)  # type: ignore[arg-type]

    def to_dict(self) -> dict[str, JSON]:
        """Serialize this codec to a JSON dictionary."""
        raise NotImplementedError

    @abstractmethod
    def compute_encoded_size(self, input_byte_length: int, chunk_spec: ArraySpec) -> int:
        """Return the encoded byte length for a given input byte length."""
        ...

    def resolve_metadata(self, chunk_spec: ArraySpec) -> ArraySpec:
        """Return the chunk spec after encoding by this codec.

        Override this for codecs that change shape, dtype, or fill value.
        """
        return chunk_spec

    def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
        """Fill in codec parameters that can be inferred from array metadata."""
        return self

    def validate(
        self,
        *,
        shape: tuple[int, ...],
        dtype: ZDType[TBaseDType, TBaseScalar],
        chunk_grid: object,
    ) -> None:
        """Validate that this codec is compatible with the array metadata.

        The default implementation does nothing. Override to add checks.
        """

    async def _decode_single(self, chunk_data: CO, chunk_spec: ArraySpec) -> CI:
        """Decode a single chunk. Override this or ``_decode_sync``."""
        raise NotImplementedError

    async def decode(
        self,
        chunks_and_specs: Iterable[tuple[CO | None, ArraySpec]],
    ) -> Iterable[CI | None]:
        """Decode a batch of chunks."""
        results: list[CI | None] = []
        for chunk_data, chunk_spec in chunks_and_specs:
            if chunk_data is not None:
                results.append(await self._decode_single(chunk_data, chunk_spec))
            else:
                results.append(None)
        return results

    async def _encode_single(self, chunk_data: CI, chunk_spec: ArraySpec) -> CO | None:
        """Encode a single chunk. Override this or ``_encode_sync``."""
        raise NotImplementedError

    async def encode(
        self,
        chunks_and_specs: Iterable[tuple[CI | None, ArraySpec]],
    ) -> Iterable[CO | None]:
        """Encode a batch of chunks."""
        results: list[CO | None] = []
        for chunk_data, chunk_spec in chunks_and_specs:
            if chunk_data is not None:
                results.append(await self._encode_single(chunk_data, chunk_spec))
            else:
                results.append(None)
        return results


class ArrayArrayCodec(BaseCodec[NDBuffer, NDBuffer]):
    """Base class for array-to-array codecs (e.g. transpose, scale_offset)."""


class ArrayBytesCodec(BaseCodec[NDBuffer, Buffer]):
    """Base class for array-to-bytes codecs (e.g. bytes, sharding)."""


class BytesBytesCodec(BaseCodec[Buffer, Buffer]):
    """Base class for bytes-to-bytes codecs (e.g. gzip, zstd)."""
```

(Note: `Iterable` is used in the `decode`/`encode` signatures; the import from `collections.abc` has been added to the `TYPE_CHECKING` block, which the original diff was missing.)
`packages/zarr-interfaces/src/zarr_interfaces/data_type/v1.py` (new file, 121 additions)

```python
"""Data type interface definitions (v1).

This module defines the abstract interface for zarr data types.
External data type implementations should subclass ``ZDType`` from this
module. The interface is intentionally minimal and stable.
"""

from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import (
    ClassVar,
    Literal,
    Self,
    TypeGuard,
    overload,
)

import numpy as np

# JSON-like type for serialization
type JSON = str | int | float | bool | dict[str, JSON] | list[JSON] | None
type ZarrFormat = Literal[2, 3]

# Bounds for the scalar and dtype type parameters
type TBaseScalar = np.generic | str | bytes
type TBaseDType = np.dtype[np.generic]

# JSON representations of data types
type DTypeJSON = JSON
type DTypeSpec_V2 = str | list[tuple[str, DTypeJSON]]
type DTypeSpec_V3 = str | dict[str, JSON]


@dataclass(frozen=True, kw_only=True, slots=True)
class ZDType[DType: TBaseDType, Scalar: TBaseScalar](ABC):
    """Abstract base class for wrapping native array data types.

    Subclasses must implement all abstract methods to support serialization,
    deserialization, and scalar handling for their native data type.

    Type Parameters
    ---------------
    DType
        The native data type (e.g. ``np.dtype[np.float64]``).
    Scalar
        The scalar type produced by this data type (e.g. ``np.float64``).
    """

    dtype_cls: ClassVar[type[TBaseDType]]
    _zarr_v3_name: ClassVar[str]

    @classmethod
    def _check_native_dtype(cls: type[Self], dtype: TBaseDType) -> TypeGuard[DType]:
        """Check that a native data type matches ``dtype_cls``."""
        return type(dtype) is cls.dtype_cls

    @classmethod
    @abstractmethod
    def from_native_dtype(cls: type[Self], dtype: TBaseDType) -> Self:
        """Create an instance from a native data type."""
        ...

    @abstractmethod
    def to_native_dtype(self: Self) -> DType:
        """Return the native data type wrapped by this instance."""
        ...

    @classmethod
    @abstractmethod
    def _from_json_v2(cls: type[Self], data: DTypeJSON) -> Self: ...

    @classmethod
    @abstractmethod
    def _from_json_v3(cls: type[Self], data: DTypeJSON) -> Self: ...

    @classmethod
    def from_json(cls: type[Self], data: DTypeJSON, *, zarr_format: ZarrFormat) -> Self:
        """Create an instance from JSON metadata."""
        if zarr_format == 2:
            return cls._from_json_v2(data)
        if zarr_format == 3:
            return cls._from_json_v3(data)
        raise ValueError(f"zarr_format must be 2 or 3, got {zarr_format}")

    @overload
    def to_json(self, zarr_format: Literal[2]) -> DTypeSpec_V2: ...

    @overload
    def to_json(self, zarr_format: Literal[3]) -> DTypeSpec_V3: ...

    @abstractmethod
    def to_json(self, zarr_format: ZarrFormat) -> DTypeSpec_V2 | DTypeSpec_V3:
        """Serialize this data type to JSON."""
        ...

    @abstractmethod
    def _check_scalar(self, data: object) -> bool:
        """Check that a python object is a valid scalar for this data type."""
        ...

    @abstractmethod
    def cast_scalar(self, data: object) -> Scalar:
        """Cast a python object to the scalar type of this data type."""
        ...

    @abstractmethod
    def default_scalar(self) -> Scalar:
        """Return the default scalar value for this data type."""
        ...

    @abstractmethod
    def from_json_scalar(self: Self, data: JSON, *, zarr_format: ZarrFormat) -> Scalar:
        """Deserialize a JSON value to a scalar."""
        ...

    @abstractmethod
    def to_json_scalar(self, data: object, *, zarr_format: ZarrFormat) -> JSON:
        """Serialize a scalar value to JSON."""
        ...
```
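To make the `ZDType` contract concrete, here is a hedged sketch of the JSON-dispatch half of the interface. To keep it runnable without numpy (the real `DType` bound is `np.dtype`), the ABC is trimmed to the `from_json` machinery and a couple of scalar methods, and the names `SketchDType` and `Bool` are hypothetical. The `"|b1"` and `"bool"` strings are the actual zarr v2 and v3 names for a boolean dtype:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Literal

ZarrFormat = Literal[2, 3]


@dataclass(frozen=True)
class SketchDType(ABC):
    """Trimmed stand-in for ZDType, keeping only the JSON-dispatch machinery."""

    @classmethod
    @abstractmethod
    def _from_json_v2(cls, data: object) -> "SketchDType": ...

    @classmethod
    @abstractmethod
    def _from_json_v3(cls, data: object) -> "SketchDType": ...

    @classmethod
    def from_json(cls, data: object, *, zarr_format: ZarrFormat) -> "SketchDType":
        # Mirrors ZDType.from_json: dispatch on the zarr format version.
        if zarr_format == 2:
            return cls._from_json_v2(data)
        if zarr_format == 3:
            return cls._from_json_v3(data)
        raise ValueError(f"zarr_format must be 2 or 3, got {zarr_format}")


@dataclass(frozen=True)
class Bool(SketchDType):
    """Hypothetical boolean data type: '|b1' in zarr v2, 'bool' in zarr v3."""

    @classmethod
    def _from_json_v2(cls, data: object) -> "Bool":
        if data != "|b1":
            raise ValueError(f"expected '|b1', got {data!r}")
        return cls()

    @classmethod
    def _from_json_v3(cls, data: object) -> "Bool":
        if data != "bool":
            raise ValueError(f"expected 'bool', got {data!r}")
        return cls()

    def to_json(self, zarr_format: ZarrFormat) -> str:
        return "|b1" if zarr_format == 2 else "bool"

    def default_scalar(self) -> bool:
        return False


# Both format spellings deserialize to the same dtype instance.
assert Bool.from_json("|b1", zarr_format=2) == Bool.from_json("bool", zarr_format=3)
assert Bool().to_json(zarr_format=2) == "|b1"
assert Bool().default_scalar() is False
```

A real implementation would additionally wrap `np.dtype("bool")` via `from_native_dtype`/`to_native_dtype` and implement the scalar casting and JSON-scalar methods.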
`packages/zarr-interfaces/src/zarr_interfaces/metadata/v1.py` (new file, 25 additions)

```python
"""Metadata protocol (v1).

Defines the structural interface for objects that can be serialized
to and deserialized from JSON dictionaries.
"""

from __future__ import annotations

from typing import Protocol, Self, runtime_checkable

type JSON = str | int | float | bool | dict[str, JSON] | list[JSON] | None


@runtime_checkable
class Metadata(Protocol):
    """Protocol for objects that round-trip through JSON dictionaries."""

    @classmethod
    def from_dict(cls, data: dict[str, JSON]) -> Self:
        """Create an instance from a JSON dictionary."""
        ...

    def to_dict(self) -> dict[str, JSON]:
        """Serialize to a JSON dictionary."""
        ...
```
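Because `Metadata` is `runtime_checkable`, any class with matching `from_dict`/`to_dict` methods passes an `isinstance` check without subclassing. A minimal self-contained demo (the protocol is restated inline with loosened annotations so it runs standalone, and `GzipConfig` is a made-up example class):

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class Metadata(Protocol):
    # Restated from metadata/v1.py so the snippet runs standalone.
    @classmethod
    def from_dict(cls, data: dict) -> "Metadata": ...
    def to_dict(self) -> dict: ...


@dataclass(frozen=True)
class GzipConfig:
    """Hypothetical codec configuration; never imports or names Metadata."""

    level: int = 5

    @classmethod
    def from_dict(cls, data: dict) -> "GzipConfig":
        return cls(**data)

    def to_dict(self) -> dict:
        return asdict(self)


cfg = GzipConfig(level=9)
assert isinstance(cfg, Metadata)                   # structural match, no inheritance
assert GzipConfig.from_dict(cfg.to_dict()) == cfg  # round-trips through a JSON-able dict
```

This is presumably the shape that `BaseCodec.from_dict`/`to_dict` in `codec/v1.py` is meant to satisfy as well.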
I wonder if the versioning will create confusion, because it is another version apart from the zarr package and the zarr data format versions.
I hope it's not confusing! The goal here is to allow zarr-python to gracefully evolve things like the codec API. Since different codec APIs would not interact, we could define the current ABC-based API under `v1`, and a newer protocol-based API under `v2`. I think only codec and data type developers would need to know about this, and I would count on that crowd being able to work out what the versions mean.