Add versioned Zarr design doc #2702
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| # Versioned Zarrs | ||
|
|
||
| This document discusses the design of a *versioned Zarr* concept, which is needed to enable publishing of Zarr-bearing Dandisets. | ||
|
|
||
| ## Executive Summary | ||
|
|
||
| Zarr Archives are simply too big and too complex to copy when modified to create new versions (as is done with “blob” assets). Instead, we propose to only change the stored objects for individual shards (in Zarr version 3) or chunks (in Zarr version 2), using S3 bucket versioning to maintain previous versions, while tracking the association between shard/chunk paths and S3 objects in a database table. This will enable a lightweight model representing an immutable snapshot of a Zarr Archive suitable for publishing in a Dandiset, as well as optimized access to the latest version of the Zarr, and other services such as Zarr manifest files. | ||
|
|
||
|
|
||
| ## Current Situation | ||
|
|
||
| Zarr Archives (“Zarrs” for short) are a strategy for storing large multidimensional numeric array datasets in the cloud. They are optimized for parking data in one place once and then bringing computation to it, rather than moving such large amounts of data around, which can be unwieldy, slow, and expensive. Conceptually, Zarrs consist of several “chunks” containing data or metadata and control information, organized into a standardized “folder” structure. The chunks may be thought of as “files” (particularly when a Zarr is stored on a filesystem), but the “files” and “folders” need not be literal files and folders; this in turn enables many “Zarr backends”, including systems that do not actually store folders and files (such as S3). | ||
|
|
||
|
|
||
| DANDI considers Zarrs to be a special type of asset, one that is not associated with an “asset blob” (i.e., a single file) but rather with a specialized Zarr record that knows how to refer to an S3 prefix containing all of the chunks for that Zarr. Because Zarrs are large and complex, making a copy of the Zarr when it is updated (as is done for blob assets) is not feasible. That copy step is what makes publishing a Dandiset possible, since a published version must contain an immutable set of assets (which may go on to be “edited” in copy-on-write fashion in future versions); as such, DANDI currently does not allow publishing of Zarr-bearing Dandisets. | ||
|
|
||
| This design offers a way of handling Zarrs that enables making lightweight snapshots of a Zarr Archive that are suitable for publishing. | ||
|
Member
FWIW, I think it might be worth noting that so far here there is nothing Zarr specific in our need from Zarrs for this design doc. It is rather to support a "folder container" (with multiple files in a hierarchy) as a single asset, as opposed to a single "file blob". I think thinking of it this way could help to avoid "overfitting for Zarr", and potentially then later allow for other, non-Zarr, use cases demanding similar "directories" support.
Member
Author
That's a good point. Do you have examples of other directory-based storage formats that behave as a single entity? Still, I think it's better to keep the Zarr-specific framing for this doc. We have just the Zarr type to worry about now, and it's a heavy enough lift to get it right that I think generalizing to "folder assets" and then respecializing to Zarr will get us into trouble. But I can leave a note here (with examples of data formats) that the technology we're developing for Zarrs might generalize to "folder assets". |
||
|
|
||
| ## Requirements | ||
|
|
||
| 1. **Lightweight snapshots of Zarr Archives.** The fundamental need for publishing of Zarr-bearing Dandisets is an immutable snapshot of a Zarr that does not involve a naive copying of all the objects the Zarr comprises. | ||
| 2. **Publishing of Zarr-bearing Dandisets.** Once we are able to create lightweight Zarr snapshots, we will be able to update the publishing logic to enable published versions for Dandisets that contain Zarrs. | ||
| 3. **Zarr backend for versioned Zarrs.** The Zarr snapshots will enable the DANDI API to act as a Zarr backend by redirecting requests for “paths” within the Zarr to the appropriate object in S3. This will enable applications expecting a Zarr to receive an appropriate DANDI API endpoint (and function equivalently to Zarrs hosted on S3, etc.); in particular, this opens the possibility for previous versions of Zarrs to be analyzed this way, even if there are active updates to that Zarr occurring simultaneously. | ||
| 4. **Access to the latest version of a Zarr via S3.** As an optimization, the latest version of a Zarr will be available to access directly via S3. Under the assumption that most work on an evolving Zarr occurs at its cutting edge, making this version of the Zarr available via S3 avoids an excess burden on the DANDI API. Clients must be actively directed to this URL, rather than selecting a DANDI API URL that would bypass the optimization. | ||
|
|
||
| 5. **Manifest files for Zarr versions.** When a Zarr snapshot is created and finalized, a manifest file containing information about the Zarr, including the associations between its chunk paths and corresponding S3 objects, must be published in S3. This manifest file serves as a valid snapshot of the Zarr, which clients can use to interact with it even as others are modifying the same Zarr to create newer versions thereof. | ||
| 6. **CLI support for Zarr modification.** The DANDI CLI will need to perform appropriate analysis to communicate to the API local changes made to a Zarr in order to make use of the proposed infrastructure for versioned Zarrs. | ||
| 7. **CLI support for Zarr download.** An optimization the CLI can offer would be to deal directly with Zarr manifest files when downloading or otherwise interacting with a Zarr. Currently, the CLI uses the DANDI API to download Zarr chunks; in this proposal, it could instead retrieve a manifest file and use that to drive its own logic around downloading the chunks, thus bypassing the API entirely (and sparing excess load on it). | ||
|
|
||
| ## Proposals | ||
|
|
||
| ### Zarr chunk storage | ||
|
|
||
| Zarr chunks will continue to be stored in the DANDI S3 bucket as normal (in particular, enabling S3 to continue serving as a Zarr backend for third-party applications). Whenever changes are made to a Zarr and it is then finalized, an additional step will occur to record the current makeup of the Zarr chunks as an immutable set of database rows mapping each chunk’s path to its versioned S3 object. As the Zarr continues to mutate, new such immutable snapshots will be produced; the latest such snapshot will become a permanent part of any published Dandiset. Garbage collection routines will work to clean up unreachable versions of Zarrs. As an example, consider a Zarr with the following chunks stored in S3 (listed with a notional version ID in parentheses): | ||
|
Member
so is GC part of this design or not -- should be made explicit
Member
Author
The sentence
covers this, no?
Member
Member
Author
Yes. Will update the doc to clarify.
Good idea.
Generally, I think we can have a policy that will GC such half-baked entities. That applies to unfinished multipart uploads just as it does unfinalized Zarrs. But I do think it's a policy issue, in that we probably don't want to clear up large uploads without, e.g., warning the uploader before we do so, etc.
Comment on lines +34 to +38
Member
@waxlamp What about if we use S3 object tagging to denote the objects and their version id that belong to a given Zarr version? When a new Dandiset is published the DANDI API would add a tag to the objects that belong to the Zarr at that time point. S3 object tagging combined with an AWS From the AWS docs: the AWS
Member
Author
This is an interesting idea but it's not clear to me what benefit this provides over storing our own manifest files. The reasons we keep manifest files whose schema we control are:
We could serve (2) with this proposed mechanism, but I think (1) might be harder (perhaps not). Tagging the large number of objects comprising a Zarr archive might itself be difficult or error-prone, and it would need to be justified by the benefits of doing so. We can discuss further. |
||
|
|
||
| - /zarr/foobar/.zattrs (0) | ||
| - /zarr/foobar/.zgroups (0) | ||
| - /zarr/foobar/0/0 (0) | ||
| - /zarr/foobar/0/1 (0) | ||
|
|
||
| Database rows tracking these four objects would comprise Version 0 of the Zarr: | ||
|
|
||
| - (0, .zattrs) → (/zarr/foobar/.zattrs, 0) | ||
| - (0, .zgroups) → (/zarr/foobar/.zgroups, 0) | ||
| - (0, 0/0) → (/zarr/foobar/0/0, 0) | ||
| - (0, 0/1) → (/zarr/foobar/0/1, 0) | ||
|
|
||
| Now imagine that a Dandiset owner modifies the Zarr by changing the data content of `0/0`, deleting `0/1` (for the sake of argument, even if that is not a realistic operation for a Zarr), and adding a new dimension’s worth of data in `1/0` and `1/1`, resulting in the following objects in S3 (with version *history* noted in parentheses): | ||
|
|
||
| - /zarr/foobar/.zattrs (0) | ||
| - /zarr/foobar/.zgroups (0) | ||
| - /zarr/foobar/0/0 (0, 1) | ||
| - /zarr/foobar/0/1 (0, D) | ||
| - /zarr/foobar/1/0 (0) | ||
| - /zarr/foobar/1/1 (0) | ||
|
|
||
| Once the Zarr is finalized, a new set of rows can be created as a snapshot of Version 1 of the same Zarr: | ||
|
|
||
| - (1, .zattrs) → (/zarr/foobar/.zattrs, 0) | ||
| - (1, .zgroups) → (/zarr/foobar/.zgroups, 0) | ||
| - (1, 0/0) → (/zarr/foobar/0/0, 1) | ||
| - (1, 1/0) → (/zarr/foobar/1/0, 0) | ||
| - (1, 1/1) → (/zarr/foobar/1/1, 0) | ||
|
|
||
| Note three things: first, that the view of the Zarr as it exists in S3 remains valid (reflecting the new Version 1 of the Zarr, and including the delete marker for the chunk at `0/1`); second, that the database rows for Version 1 reflect the same structure (specifically, that `0/0` now refers to Version 1 of the appropriate S3 object, and that no row appears for the deleted `0/1` chunk); third, that Version 0 of the Zarr is still available for use through the appropriate database rows, since the explicit object versions have been captured there as well. | ||
|
|
||
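The snapshot step in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SnapshotRow` type, the `snapshot` function, and the convention of representing a delete marker as `None` are hypothetical names chosen here, not part of the DANDI codebase, and a real implementation would obtain the current version IDs from S3 rather than from a hand-written dict.

```python
from typing import NamedTuple, Optional


class SnapshotRow(NamedTuple):
    """One immutable row: (Zarr version, chunk path) -> (S3 key, S3 version)."""

    zarr_version: int  # snapshot number of the Zarr
    chunk_path: str    # path relative to the Zarr root
    s3_key: str        # full object key in the bucket
    s3_version: str    # notional S3 version ID of that object


def snapshot(zarr_version: int, prefix: str,
             latest: dict[str, Optional[str]]) -> list[SnapshotRow]:
    """Build the rows for one snapshot of a finalized Zarr.

    `latest` maps each chunk path to its current S3 version ID, or to None
    when the newest version of the object is a delete marker (hypothetical
    convention for this sketch). Deleted chunks produce no row.
    """
    return [
        SnapshotRow(zarr_version, path, f"{prefix}/{path}", version_id)
        for path, version_id in sorted(latest.items())
        if version_id is not None
    ]


# State of the bucket after the modifications described above.
v1 = snapshot(1, "/zarr/foobar", {
    ".zattrs": "0",
    ".zgroups": "0",
    "0/0": "1",   # content changed, so the object has a new version
    "0/1": None,  # deleted: latest version is a delete marker
    "1/0": "0",
    "1/1": "0",
})
```

Because rows for earlier snapshots are never rewritten, Version 0 stays resolvable even though the latest view in S3 has moved on.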
| ### Versioned Zarrs and Dandiset publishing | ||
|
|
||
| A “versioned Zarr” can now be formulated as a database model that records some appropriate metadata about the Zarr archive, and contains links to the rows of the database table discussed above, thus aggregating all the chunks of that version of the Zarr (along with how to find the data objects in S3). | ||
|
|
||
| Changes to the publishing procedure for Dandisets will accommodate these versioned Zarr models, manufacturing appropriate Asset model instances that point to a versioned Zarr model to facilitate a published Dandiset version containing that Zarr. | ||
|
|
||
| ### DANDI Zarr backend | ||
|
|
||
| Furthermore, a new DANDI API endpoint can now serve as a Zarr backend using the data described above. For a given versioned Zarr, an endpoint such as `/zarr/{id}/{version}/` would serve as the virtual root for chunks in the Zarr. A request such as `GET /zarr/{id}/{version}/0/1` would then redirect to the S3 chunk named in the database row for that Zarr, version, and chunk. Handing the endpoint to a third-party application would enable it to work with a specific version of a Zarr as it exists in, for example, a published Dandiset version. | ||
|
|
||
| The Zarr backend base URL will be discovered by asking the API for information about the Zarr; that response will contain a `location` that will be a DANDI API URL (for versioned Zarrs) or an S3 URL (for the latest, draft version of a Zarr). Giving the client this URL will, as previously noted, avoid excess burden on DANDI to provide redirects for each chunk. | ||
|
|
||
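A minimal sketch of the lookup behind such a redirect follows, assuming the snapshot rows for a single Zarr are available as a mapping keyed by `(version, chunk path)`. The bucket name, the `ROWS` data, and the `resolve_chunk_url` helper are hypothetical; the only S3 behavior relied upon is that a specific object version can be requested by adding a `versionId` query parameter to the object URL.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BUCKET = "example-dandi-bucket"  # hypothetical bucket name

# (zarr version, chunk path) -> (S3 key, S3 version ID), for one Zarr.
ROWS = {
    (0, "0/0"): ("/zarr/foobar/0/0", "0"),
    (0, "0/1"): ("/zarr/foobar/0/1", "0"),
    (1, "0/0"): ("/zarr/foobar/0/0", "1"),
}


def resolve_chunk_url(rows, zarr_version: int, chunk_path: str,
                      bucket: str = BUCKET) -> Optional[str]:
    """Return the versioned S3 URL to redirect to, or None if no row exists."""
    try:
        key, s3_version = rows[(zarr_version, chunk_path)]
    except KeyError:
        return None  # the API would answer 404 rather than redirect
    query = urlencode({"versionId": s3_version})
    # quote() keeps "/" intact by default, so the key hierarchy is preserved.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key.lstrip('/'))}?{query}"


url_v0 = resolve_chunk_url(ROWS, 0, "0/0")
url_v1 = resolve_chunk_url(ROWS, 1, "0/0")
missing = resolve_chunk_url(ROWS, 1, "0/1")  # chunk deleted in Version 1
```

The same chunk path resolves to different object versions depending on the Zarr version in the request, which is what lets an older snapshot be read while the draft keeps changing.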
| ### Zarr manifest files | ||
|
|
||
| Because versioned Zarrs are not usually available directly via S3 (since many of their chunks will be “buried” under more current versions of that object), keeping a manifest file for each version of a Zarr appearing in a published Dandiset preserves the structure of these Zarrs directly in S3 (similarly to the manifest files currently recording which assets belong to a published Dandiset). This preserves the integrity of the bucket itself as a standalone data store (independent of the DANDI API and web application), and also provides an optimization for clients. | ||
|
|
||
| Specifically, clients such as the DANDI CLI currently interact with Zarrs by asking the API for each chunk. For an operation such as “download a Zarr”, this can result in hundreds of requests per second to the API. However, if the CLI learned to first retrieve the manifest file and then issue requests directly to S3 to retrieve the chunks, all of the redirection burden would be relieved (both for the API and for the client itself). | ||
|
Member
I think it might be best to immediately identify properties or even a schema of the manifest file(s) and what metadata is to be contained therein, so that we could inform design of client in the scope of There we reviewed the aspect of storing a hierarchical checksum (so not only ETags on keys but ETags on folders), which apparently used to be dumped to S3 but that facility was removed in
since it was not used yet. But in scope of RF dandi-cli and overall integrity checking I think it would be great to reintroduce that. Note that our currently implemented prototype of manifests (used for webdav) includes an ETag per key but not per "folder": an example dump of a manifest head
❯ curl --silent https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json | jq . | head -n 40
{
"schemaVersion": 2,
"fields": [
"versionId",
"lastModified",
"size",
"ETag"
],
"statistics": {
"entries": 113328,
"depth": 5,
"totalSize": 97037755426,
"lastModified": "2022-04-23T23:08:58+00:00",
"zarrChecksum": "326273bcc8730474323a66ea4e3daa49-113328--97037755426"
},
"entries": {
".zattrs": [
"JKOglMKYg0dIr1ngNR_dguFZUioWE8MZ",
"2022-04-23T23:08:58+00:00",
7901,
"b8421735714f196291810afd48aa012d"
],
".zgroup": [
"joh9Hhu.uOtmHTewXyAXbI4mzi8AKjcb",
"2022-03-16T02:39:41+00:00",
24,
"e20297935e73dd0154104d4ea53040ab"
],
".zmetadata": [
"lFCKAw3nHHT66S14.hEaGXm9sBL9eYvW",
"2022-04-23T23:08:58+00:00",
14970,
"358d9e012ae0154f66d5e6d73ced977c"
],
"0": {
".zarray": [
"I5ucOkyHjZcqQJhurzfi307VmxSSkm1M",
"2022-03-16T02:39:43+00:00",
449,
"a45bc329195c9d6fea8cdd9a46562771" |
||
|
|
||
| ### DANDI CLI changes | ||
|
|
||
| The CLI will need to update its logic to reflect the above proposals for how versioning of Zarrs will work. Specifically, it may need to compute a diff of sorts between Zarr archives on disk and the last version in the API; this diff can drive a series of changes requested to the API to update the appropriate objects making up a Zarr in S3. Together with changes to the server-side logic, the client will be able to process versioned Zarrs as well. | ||
|
|
||
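The “diff of sorts” could look roughly like the following Python sketch. It assumes both sides can be summarized as a mapping from chunk path to some content digest; the digest values, the `diff_zarr` name, and the two-list result are hypothetical choices for illustration, not the CLI's actual design.

```python
def diff_zarr(local: dict[str, str], remote: dict[str, str]):
    """Compare a local Zarr against the last version known to the API.

    Both arguments map chunk path -> content digest. Returns the paths to
    upload (new or changed locally) and the paths to delete (absent locally).
    """
    to_upload = sorted(
        path for path, digest in local.items() if remote.get(path) != digest
    )
    to_delete = sorted(path for path in remote if path not in local)
    return to_upload, to_delete


# Hypothetical digests mirroring the Version 0 -> Version 1 example above.
remote_v0 = {".zattrs": "a1", ".zgroups": "b1", "0/0": "c1", "0/1": "d1"}
local_now = {".zattrs": "a1", ".zgroups": "b1", "0/0": "c2", "1/0": "e1", "1/1": "f1"}

to_upload, to_delete = diff_zarr(local_now, remote_v0)
```

Only the changed and new chunks are sent, and unchanged objects keep their existing S3 versions, which is what keeps each new snapshot lightweight.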
| As mentioned in the previous section, the CLI may also learn to process published Zarrs by parsing the manifest file. This moves significant logic from the server to the client, but results in higher performance overall (the client retrieves chunks faster, and the API does not have to process a sustained volume of requests to redirect to S3). If this logic is valuable enough, it may be transferred into a standalone library (much as happened for `dandi-schema`). | ||
|
|
||
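As a rough illustration of manifest-driven processing, the sketch below flattens a manifest into per-chunk records. It assumes the layout of the prototype manifest quoted in the review comment above: a top-level `fields` list naming the columns, and a nested `entries` mapping in which a dict is a “folder” and a list is a single object. That reading is inferred from a truncated dump and the final schema may differ; `iter_manifest_entries` is a hypothetical helper name.

```python
def iter_manifest_entries(manifest: dict):
    """Yield (chunk_path, record) pairs from a nested manifest.

    Assumption: dict values under "entries" are folders and list values are
    leaf objects whose items line up with manifest["fields"].
    """
    fields = manifest["fields"]

    def walk(node: dict, prefix: str):
        for name, value in node.items():
            path = f"{prefix}{name}"
            if isinstance(value, dict):
                yield from walk(value, path + "/")
            else:
                yield path, dict(zip(fields, value))

    yield from walk(manifest["entries"], "")


# A small excerpt using values from the prototype dump quoted above.
sample = {
    "schemaVersion": 2,
    "fields": ["versionId", "lastModified", "size", "ETag"],
    "entries": {
        ".zattrs": ["JKOglMKYg0dIr1ngNR_dguFZUioWE8MZ",
                    "2022-04-23T23:08:58+00:00", 7901,
                    "b8421735714f196291810afd48aa012d"],
        "0": {
            ".zarray": ["I5ucOkyHjZcqQJhurzfi307VmxSSkm1M",
                        "2022-03-16T02:39:43+00:00", 449,
                        "a45bc329195c9d6fea8cdd9a46562771"],
        },
    },
}

records = dict(iter_manifest_entries(sample))
```

Each record carries the `versionId` needed to request that exact object version from S3, and the `size` and `ETag` fields give the client something to verify a download against.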
| ## Other Approaches | ||
|
|
||
| ### Icechunk | ||
|
|
||
| [Icechunk](https://icechunk.io/en/stable/overview/) is an open-source library that provides services for Zarrs similar to those described in this document, plus many others designed to provide database-style safety guarantees for Zarr chunk storage. Integrating Icechunk with DANDI would likely be prohibitively complex and costly due to its heavier nature and its own unique schema for storing chunks. However, it remains an important point of comparison to DANDI’s approach. | ||
|
|
||
| ### Standalone Web Service | ||
|
|
||
| A proposal was raised to build the redirection logic into a standalone web service, in order to overcome perceived performance problems if many rows end up stored in DANDI’s Postgres database. | ||
|
|
||
| A quick sketch of this hypothetical service follows: as manifest files are published in the S3 bucket, the service would learn the association between individual Zarrs and their manifests. A user request to learn about a Zarr would result in the service retrieving that manifest and loading its contents into a cache (either in-memory, or using some attached service such as Redis). The user would query for chunk paths and receive S3 object URLs, just as in our proposal. | ||
|
|
||
| The perception is that caching this data in memory would result in faster responses. However, this hypothetical service has many disadvantages compared to a system like Postgres. Specifically, its backing data store is S3, which is very slow to access compared to Postgres’s own data stores; it seeks to reinvent the core function of Postgres itself, but does not have the decades of experience and optimizations that Postgres does; and very specifically, the perception that an in-memory caching system for manifest files would be faster than Postgres fails against the bevy of multilevel caching strategies already employed by Postgres. | ||
|
|
||
| The recommendation is to implement the design using Postgres directly, and only consider other options if and when performance problems do arise. | ||