4 changes: 2 additions & 2 deletions docs/conf.py
@@ -97,5 +97,5 @@
}

html_js_files = [
-    ('https://scripts.simpleanalyticscdn.com/latest.js', {'async': 'async', 'defer': 'defer'}),
-]
+    ("https://scripts.simpleanalyticscdn.com/latest.js", {"async": "async", "defer": "defer"}),
+]
29 changes: 29 additions & 0 deletions docs/ml/datasets/datasets.rst
@@ -71,6 +71,35 @@ Basic Usage Examples
)
)

**ASE LMDB datasets** (for OMol25/OMat24/OPoly26-style ``*.aselmdb`` shards):

.. code-block:: python

    from atomworks.ml.datasets import ASELMDBDataset
    from atomworks.ml.datasets.loaders import create_ase_atoms_loader, create_ase_materials_loader

    dataset = ASELMDBDataset.from_directory(
        directory="/path/to/omol25/train",
        name="omol25_train",
        loader=create_ase_atoms_loader(),
    )

    example = dataset[0]
    atoms = example["atoms"]  # ASE Atoms
    atom_array = example["atom_array"]  # Biotite AtomArray

    materials_dataset = ASELMDBDataset.from_directory(
        directory="/path/to/omat24/rattled-300-subsampled",
        name="omat24_rattled",
        loader=create_ase_materials_loader(),
    )

    material = materials_dataset[0]
    fractional_coordinates = material["fractional_coordinates"]
    lattice_lengths = material["lattice_lengths"]
    lattice_angles = material["lattice_angles"]
    space_group = material["space_group"]


**Custom loaders** for specialized use cases:

.. code-block:: python
53 changes: 53 additions & 0 deletions src/atomworks/ml/datasets/README.md
@@ -45,6 +45,35 @@ dataset = PandasDataset(
)
```

### ASE LMDB Dataset

```python
from atomworks.ml.datasets import ASELMDBDataset
from atomworks.ml.datasets.loaders import create_ase_atoms_loader, create_ase_materials_loader

# OMol25/OMat24-style datasets are ASE DB-compatible LMDB files (*.aselmdb).
dataset = ASELMDBDataset.from_directory(
    directory="/path/to/omol25/train",
    name="omol25_train",
    loader=create_ase_atoms_loader(),  # Optional: adds an AtomArray for AtomWorks transforms
)

example = dataset[0]
atoms = example["atoms"]  # ASE Atoms object with flattened atoms.info metadata
atom_array = example["atom_array"]  # Biotite AtomArray created by the loader

materials_dataset = ASELMDBDataset.from_directory(
    directory="/path/to/omat24/rattled-300-subsampled",
    name="omat24_rattled",
    loader=create_ase_materials_loader(),
)

material = materials_dataset[0]
fractional_coordinates = material["fractional_coordinates"]
lattice_lengths = material["lattice_lengths"]
lattice_angles = material["lattice_angles"]
space_group = material["space_group"]
```

## Core Concepts

### The Three-Step Pipeline
@@ -125,6 +154,30 @@ dataset = PandasDataset(

**ID-Based Access:** Set an `id_column` to enable `dataset.id_to_idx()` and `idx_to_id()` methods.

#### `ASELMDBDataset`

Reads ASE DB-compatible LMDB shards (`*.aselmdb`), including FAIR Chemistry datasets such as OMol25, OMat24, and OPoly26.

```python
from atomworks.ml.datasets import ASELMDBDataset

dataset = ASELMDBDataset(
    paths="/data/omol25/train",  # Directory scanned recursively for *.aselmdb files
    name="omol25_train",
    return_type="record",  # "record" (default), "atoms", or "row"
    readonly=True,
    readahead=False,
)
```
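
The recursive scan mentioned in the `paths` comment can be pictured with a small standard-library sketch. This is not the `ASELMDBDataset` implementation; `find_aselmdb_shards` is a hypothetical helper, and the sorted order and the single-file shortcut are assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path


def find_aselmdb_shards(root: str | Path) -> list[Path]:
    """Return every *.aselmdb file under ``root`` (hypothetical helper).

    Assumption: shards are returned in sorted order so that shard indices
    stay stable between runs. The real dataset may order them differently.
    """
    root = Path(root)
    if root.is_file():
        # Assumption: a single shard path is accepted as-is.
        return [root]
    # rglob descends into all subdirectories, e.g. train/part_00/*.aselmdb
    return sorted(root.rglob("*.aselmdb"))
```

A stable shard order matters if example IDs embed a shard index, because the same index must point at the same file on every run.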

**Record output:** The default output is a dictionary containing `atoms` (ASE `Atoms`), `key_value_pairs`, `data`, `calculator_results`, and `extra_info`.

**AtomArray loading:** Use `create_ase_atoms_loader()` when ASE molecule records need to flow through AtomWorks transform pipelines.

**ID Mapping:** By default, IDs are generated as `<shard_id>:<ase_row_id>` for fast reversible lookup. To use an OMol/OPoly metadata field such as `sid`, pass `example_id_key="sid"` and `build_id_index=True`.
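
To illustrate why the default `<shard_id>:<ase_row_id>` form is reversible without an index, while a metadata key such as `sid` presumably needs one, here is a minimal sketch. All three helpers are hypothetical, and treating `shard_id` as an integer shard index is an assumption; the library's own types and internals may differ.

```python
from __future__ import annotations

from typing import Iterable


def make_example_id(shard_id: int, ase_row_id: int) -> str:
    """Compose the default-style ID (assumes an integer shard index)."""
    return f"{shard_id}:{ase_row_id}"


def parse_example_id(example_id: str) -> tuple[int, int]:
    """Invert make_example_id by splitting on the first colon."""
    shard_id, ase_row_id = example_id.split(":", 1)
    return int(shard_id), int(ase_row_id)


def build_custom_id_index(
    rows: Iterable[tuple[int, int, str]],
) -> dict[str, tuple[int, int]]:
    """Map a custom ID (e.g. an ``sid`` value) to (shard_id, ase_row_id).

    A custom ID carries no location information, so a lookup table has to be
    built by visiting every row once.
    """
    index: dict[str, tuple[int, int]] = {}
    for shard_id, ase_row_id, custom_id in rows:
        if custom_id in index:
            raise ValueError(f"duplicate example id: {custom_id!r}")
        index[custom_id] = (shard_id, ase_row_id)
    return index
```

The trade-off in this sketch is that the default form costs nothing to resolve, whereas the custom-key form pays a one-time scan over all rows.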

**Materials loading:** Use `create_ase_materials_loader()` for periodic materials datasets. It adds `fractional_coordinates`, `lattice_vectors`, `lattice_lengths`, `lattice_angles`, `cell_parameters`, `space_group`, and `parent_space_group`.
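
As a rough picture of how lattice lengths, lattice angles, and fractional coordinates relate to a 3x3 cell, the sketch below derives them with the standard library only. It is not the code behind `create_ase_materials_loader()`; the function names are hypothetical. It assumes lattice vectors are stored as rows (as in ASE) and that angles are reported in degrees in the order alpha (b, c), beta (a, c), gamma (a, b). Space-group determination needs a symmetry library and is not covered.

```python
from __future__ import annotations

import math

Vector = tuple[float, float, float]


def _dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def _cross(u: Vector, v: Vector) -> Vector:
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def _angle_deg(u: Vector, v: Vector) -> float:
    cos_theta = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def lattice_lengths_and_angles(cell):
    """Return ((a, b, c), (alpha, beta, gamma)) for row-vector ``cell``."""
    a, b, c = cell
    lengths = tuple(math.sqrt(_dot(v, v)) for v in (a, b, c))
    angles = (_angle_deg(b, c), _angle_deg(a, c), _angle_deg(a, b))
    return lengths, angles


def cartesian_to_fractional(position: Vector, cell) -> Vector:
    """Solve position = fa*a + fb*b + fc*c using triple products."""
    a, b, c = cell
    volume = _dot(a, _cross(b, c))  # signed cell volume
    return (
        _dot(position, _cross(b, c)) / volume,
        _dot(position, _cross(c, a)) / volume,
        _dot(position, _cross(a, b)) / volume,
    )
```

For an orthorhombic cell with edge lengths 2, 3 and 4, this yields three 90-degree angles, and the Cartesian point (1, 1.5, 1) maps to fractional coordinates (0.5, 0.5, 0.25).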

**Optional dependencies:** Install with `atomworks[ase]` to enable ASE LMDB support.

### Loader Functions

Loaders are functions that convert dataset-specific raw data into a standard format for Transforms.
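
The factory style of `create_ase_atoms_loader()` suggests that a loader is a plain callable produced by a configuring function. The sketch below shows that pattern in isolation. Both the one-argument `raw -> dict` signature and the name `create_key_renaming_loader` are assumptions for illustration, not the AtomWorks API.

```python
from __future__ import annotations

from typing import Any, Callable

Record = dict[str, Any]


def create_key_renaming_loader(key_map: dict[str, str]) -> Callable[[Record], Record]:
    """Build a loader that renames dataset-specific keys (hypothetical).

    The returned function copies the raw record so the input is left
    untouched, then moves each value from its old key to the new one.
    """

    def loader(raw: Record) -> Record:
        out = dict(raw)
        for old_key, new_key in key_map.items():
            if old_key in out:
                out[new_key] = out.pop(old_key)
        return out

    return loader


# Usage: adapt a record whose structure is stored under a dataset-specific key.
rename = create_key_renaming_loader({"structure": "atoms"})
standardized = rename({"structure": "<raw structure>", "energy": -1.5})
```

Configuring the loader once in a factory keeps the per-example call cheap and lets the same dataset class serve differently shaped sources.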
3 changes: 3 additions & 0 deletions src/atomworks/ml/datasets/__init__.py
@@ -5,15 +5,18 @@
from .base import ExampleIDMixin, MolecularDataset
from .concat_dataset import ConcatDatasetWithID, FallbackDatasetWrapper, get_row_and_index_by_example_id
from .file_dataset import FileDataset
from .lmdb_dataset import ASELMDBDataset, LMDBDataset
from .pandas_dataset import PandasDataset, StructuralDatasetWrapper

logger = logging.getLogger("datasets")

__all__ = [
    "ASELMDBDataset",
    "ConcatDatasetWithID",
    "ExampleIDMixin",
    "FallbackDatasetWrapper",
    "FileDataset",
    "LMDBDataset",
    "MolecularDataset",
    "PandasDataset",
    "StructuralDatasetWrapper",