feat: Add huggingface.LocalHFDataset to kedro-datasets #1373

iwhalen wants to merge 16 commits into kedro-org:main

Conversation
deepyaman
left a comment
High-level concerns:
- Use of fsspec.
- Bespoke split/partition/multi-file handling (whatever you want to call it).
- Handling too many types in one.
I'd personally rather it be a lightweight wrapper that delegates more to the underlying Hugging Face APIs.
Unless you're convinced otherwise, or if you disagree, it's probably worth getting a second opinion/review on the above.
```python
if protocol == "file":
    _fs_args.setdefault("auto_mkdir", True)

self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)
```
The problem with this approach is that Hugging Face also supports remote URIs natively (e.g. https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_from_disk). I think we don't want to use fsspec in those cases, because Hugging Face very well could have a more efficient native path.
This is a hard problem to solve; we have similar concerns on the Ibis side (e.g. #1298), but intuitively I'd err on the side of leaving it up to the Hugging Face APIs.
See comment below on needing a filesystem.
In order to save a DatasetDict we have to be able to access it.
Hugging Face uses fsspec under the hood anyway 🤷
```python
if self._fs.isdir(load_path):
    paths = {
        PurePosixPath(p).stem: p for p in self._fs.glob(f"{load_path}/*{ext}")
```
Does it have to have the extension, or is this too narrow? Will people come complaining that they have HDF5 files with a .hdf5 extension instead of .h5?
Actually, look at the C4 example under https://huggingface.co/docs/datasets/en/loading#hugging-face-hub; there's an example with .json.gz extension.
Hmmm... maybe I'll just glob everything in the directory then?
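For what it's worth, a quick stdlib demonstration of why stem-based split naming breaks on a compound extension like the `.json.gz` from that C4 example (the `split_name` helper is illustrative, not the PR's code):

```python
from pathlib import PurePosixPath

def split_name(path: str) -> str:
    # PurePosixPath.stem strips only the final suffix, so compound
    # extensions such as ".json.gz" leave part of the extension behind.
    return PurePosixPath(path).stem

print(split_name("data/train.csv"))      # train
print(split_name("data/train.json.gz"))  # train.json, not "train"
```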
```python
    }
    return DatasetDict(
        {
            split: loader(path, **self._load_args)
```
It seems you can specify splits in the data_files argument of load_dataset (e.g. https://huggingface.co/docs/datasets/loading#json); would this be preferable to constructing the DatasetDict with multiple Dataset.from_* calls? I haven't looked into the implementation, but I'd be curious if there's a good reason to not use the higher-level API.
Referencing the same C4 example from above (https://huggingface.co/docs/datasets/en/loading#hugging-face-hub), it seems you can also pass the wildcard directly under data_files.
There are two things happening here:

- Loading a dataset from the Hub (handled by HFDataset now; it's not perfect, but I'm ignoring it in this PR).
- Loading/saving from a filesystem (what I'm trying to implement here).
Loading

load_dataset can handle all our needed types with a call like:

```python
load_dataset("csv", data_files="path/to/data.csv")
```

The same goes for parquet, lance, hdf5, and json. Arrow datasets don't play nicely like this, though, and use load_from_disk("path/to/arrow").
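The split described above can be sketched as a small routing helper (hypothetical names, not the PR's actual code): Arrow goes through `load_from_disk`, the flat-file formats through `load_dataset` with `data_files`:

```python
# Hypothetical routing sketch: decide which Hugging Face entry point a
# given format would use, per the behavior described above.
def pick_loader(file_format: str) -> tuple[str, bool]:
    """Return (loader name, whether data_files must be passed)."""
    if file_format == "arrow":
        # Arrow datasets are directories written by save_to_disk.
        return ("load_from_disk", False)
    if file_format in {"csv", "json", "parquet", "lance", "hdf5"}:
        return ("load_dataset", True)
    raise ValueError(f"unsupported format: {file_format}")

print(pick_loader("arrow"))  # ('load_from_disk', False)
print(pick_loader("csv"))    # ('load_dataset', True)
```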
Saving

This procedure works for everything but Arrow:

```python
data = Dataset.from_dict({"a": [1, 2], "b": [3, 4]})
data.to_parquet("data.parquet")
load_dataset("parquet", data_files="data.parquet")
```

I couldn't find a way to save a DatasetDict to a particular format.
This loop is what happens in the DatasetDict itself here:

Which eventually makes it down to this private method, which will only save to Arrow format.

The same goes for the individual Dataset objects: you have to use the to_<format> methods; there's no single method to rule them all like there is with loading. Lance and HDF5 don't actually have save methods handled by Hugging Face :(
Agree with your point on file name format strictness though.
```python
    glob_function=self._fs.glob,
)

def _load(self) -> DatasetLike:
```
How do you get an IterableDataset, for example? At the bottom of https://huggingface.co/docs/datasets/en/filesystems, it seems like you could do IterableDataset.from_dict(), but it's not clear you're ever doing that, right?
Thanks for the review @deepyaman! This was a little sloppy, you're right. My first thought was also to try to get the Hugging Face API to handle everything. I'll try again on this and update. Otherwise, I'll just make a separate dataset for each format HF supports. Thanks again!
Force-pushed 130d51c to 28c80df
Ok changes are all up. I think this is a bit better.
Unchanged:
Thanks for the contribution and for being so responsive to feedback! The split into separate dataset classes makes sense, and it matches how the rest of kedro-datasets is organized (e.g. pandas.CSVDataset, pandas.ParquetDataset). Also, Arrow genuinely behaves differently from the other formats.
A few high-level suggestions before we iterate on the details:
Scope this PR to the four round-trip formats. I'd suggest removing HDF5Dataset and LanceDataset (and their tests/docs) from this PR. Focus on Arrow, Parquet, CSV, and JSON — these all support full save + load round-trips and share the same base class, so they belong together. The read-only formats are a separate concern that deserves a small follow-up PR where we can get the save() error handling right (e.g. overriding save() directly so the error is raised immediately, rather than letting the base class do type checking, iterable materialization, and path resolution before the error surfaces in _save_dataset).
Deduplicate the tests. The test files for CSV, JSON, and Parquet are near-identical (differing only in class name and extension). Consider a parametrized shared test instead.
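A parametrized skeleton in that spirit might look like this (the class names and fixture wiring are illustrative, not the PR's actual test code):

```python
import pytest

# Hypothetical parametrization: one shared round-trip test instead of
# three near-identical copies for CSV, JSON, and Parquet.
FORMAT_CASES = [
    ("CSVDataset", ".csv"),
    ("JSONDataset", ".json"),
    ("ParquetDataset", ".parquet"),
]

@pytest.mark.parametrize("class_name, extension", FORMAT_CASES)
def test_round_trip(class_name, extension, tmp_path):
    # The shared body would construct the dataset class, save a small
    # DatasetDict under tmp_path, reload it, and compare contents.
    assert extension.startswith(".")
```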
```python
from huggingface_hub import HfApi
from kedro.io import AbstractDataset

DatasetLike: TypeAlias = Dataset | DatasetDict | IterableDataset | IterableDatasetDict
```
DatasetLike is now defined identically in both _base.py and here. One of them should import from the other.
Still defined in both _base.py and here.
Force-pushed 0b4bb94 to 212769f
@ElenaKhaustova Ok! I think I addressed most of the changes. There's still a couple of things I'm not in love with. I see my tests are failing, so I'll get to those. Just wanted to send a note on some higher-level things.

Checking for directory in non-Arrow datasets

I thought it was reasonable that, if the provided path is a directory, and the user doesn't tell us what we're looking for in that directory, we throw an error. In other words, we can't call the loader without knowing which files to pass. I'm not sure if there's a smarter way to do this. Let me know what you think.
ElenaKhaustova
left a comment
Thanks for the updates @iwhalen — this is much improved!
On your two questions:
Directory loading with data_files: You're right — my earlier suggestion to use data_dir was too optimistic about what HF handles automatically for non-Arrow formats. Your approach of requiring explicit data_files is the correct middle ground: it removes the fragile glob-based discovery while still delegating the actual loading to load_dataset.
_build_data_files helper — this is a good UX call.
A few remaining items in inline comments below.
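For illustration, a helper in that spirit might look like this (purely a sketch: `build_data_files` and its signature are hypothetical, and the PR's actual `_build_data_files` may differ):

```python
from pathlib import Path

def build_data_files(load_path: str, extension: str) -> dict[str, str]:
    # Expand a directory into the {split_name: file_path} mapping that
    # load_dataset's data_files argument accepts; pass a single file
    # through as a lone "train" split.
    path = Path(load_path)
    if path.is_dir():
        return {f.stem: str(f) for f in sorted(path.glob(f"*{extension}"))}
    return {"train": str(path)}
```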
```python
from huggingface_hub import HfApi
from kedro.io import AbstractDataset

DatasetLike: TypeAlias = Dataset | DatasetDict | IterableDataset | IterableDatasetDict
```
Still defined in both _base.py and here.
Signed-off-by: iwhalen <[email protected]>
Add OpikEvaluationDataset (…edro-org#1364) Signed-off-by: iwhalen <[email protected]>
…asetdict key names. Signed-off-by: iwhalen <[email protected]>
Force-pushed 5910c0a to 8dbdc32
@ElenaKhaustova seems like we're getting closer! I addressed the changes you had above and fixed my doctest issues. Now I'm seeing some failures in CI that seem to be unrelated to the changes in this branch. Any advice?
Description
Adds datasets for interacting with Hugging Face datasets on a file system.
Development notes
Added docs, tests, ran in a fresh pipeline.
Iterable and in-memory versions have both been tested as well.
Note

I couldn't figure out a good way to save an IterableDataset without looping through it entirely first. Maybe there's a better way someone knows about.
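For reference, the "loop through it entirely" approach boils down to pivoting rows into the column-oriented dict shape that Dataset.from_dict accepts (hypothetical helper; this materializes the whole stream in memory, which defeats the point of streaming):

```python
def collect_columns(rows):
    # Consume an iterable of row dicts and pivot them into columns.
    # Every row is held in memory, hence the full loop the note mentions.
    columns: dict = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

print(collect_columns([{"a": 1, "b": 3}, {"a": 2, "b": 4}]))
# {'a': [1, 2], 'b': [3, 4]}
```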
Updated jsonschema/kedro-catalog.1.00.json.

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist

- jsonschema/kedro-catalog-X.XX.json updated if necessary
- RELEASE.md file updated