
Add multimodal detector datasets (LArTPC + Water Cherenkov) #2

Merged

youngsm merged 4 commits into DeepLearnPhysics:main from OmarAlterkait:multimodal-datasets on Apr 18, 2026

Conversation

@OmarAlterkait
Contributor

Add LArTPCDataset and WCDataset for loading multimodal detector simulation data through pimm's existing pipeline. Each detector type has dedicated readers that produce flat dicts consumed by the standard transform/collation/Point infrastructure.

LArTPC (JAXTPC production):

  • LArTPCSegReader: 3D truth deposits from seg files
  • LArTPCRespReader: sparse wire signals from resp files
  • LArTPCLablReader: per-volume track_id->label lookup from labl files
  • LArTPCCorrReader: 3D->2D correspondence from corr files (vectorized)
  • Modality-driven coord ownership: seg->3D, corr+labl->2D labeled, resp->2D merged
  • Both resp and corr available as separate point clouds (resp_coord, corr_coord)
  • Volume filter for single-volume loading

Water Cherenkov (PMT-based):

  • WCSegReader: 3D track segments from flat CSR format (see the sketch after this list)
  • WCSensorReader: PMT response + per-particle sparse decomposition
  • Output modes: response (per-sensor), labels (per-particle sparse), separate
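
For reference, a rough sketch of the flat CSR idea behind the WC seg files; the array names below are illustrative placeholders, not the actual HDF5 keys:

    import numpy as np

    # Flat CSR layout: every event's segments are concatenated into one flat
    # array, and a cumulative offsets array says which rows belong to which
    # event. Array names are placeholders, not the real dataset keys.
    seg_xyz = np.random.rand(10, 3)      # all segments from all events
    offsets = np.array([0, 4, 7, 10])    # event i owns rows offsets[i]:offsets[i+1]

    def event_segments(i):
        return seg_xyz[offsets[i]:offsets[i + 1]]

    print(event_segments(1).shape)       # (3, 3): the 3 segments of event 1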

Minimal changes to existing pimm code (3 files, 19 lines):

  • index_valid_keys extended for LArTPC keys
  • collate_fn skips _-prefixed metadata keys (see the sketch after this list)
  • Dataset imports added
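
For illustration, a minimal sketch of the collate rule described in the list above (not the actual pimm collate_fn): per-sample point arrays are concatenated, while keys starting with "_" are treated as metadata and passed through as plain Python lists. The "_run_id" key is a made-up example.

    import numpy as np
    import torch

    def collate_skip_metadata(batch):
        """Sketch only: concatenate point arrays, pass "_"-prefixed keys through."""
        out = {}
        for key in batch[0]:
            if key.startswith("_"):
                out[key] = [sample[key] for sample in batch]   # metadata, not stacked
            else:
                out[key] = torch.cat(
                    [torch.as_tensor(sample[key]) for sample in batch], dim=0
                )
        return out

    batch = [
        {"coord": np.zeros((5, 3)), "energy": np.ones((5, 1)), "_run_id": 7},
        {"coord": np.zeros((3, 3)), "energy": np.ones((3, 1)), "_run_id": 8},
    ]
    out = collate_skip_metadata(batch)
    print(out["coord"].shape, out["_run_id"])   # torch.Size([8, 3]) [7, 8]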

70 tests across both detector types verify all modality combinations, transform pipelines, collation, DataLoader workers, and toy model forward/backward passes.

Comment thread on docs/DETECTOR_DATASET.md
Member

May be better to make the name of this specific data format less generic than "detector dataset"? Would be helpful to have a name for the dataset now, lol. Maybe call it a jaxtpc dataset? For now until we figure out the big name.

Contributor Author

So the idea here is to have this for both JAXTPC, LUCiD, ...
Hence why I named it a generic detector dataset, but I could be more specific and change the lartpc one to jaxtpc if you think that's better. But a dataset name would be convenient, I agree

Member

@youngsm Apr 16, 2026

Yeah, that's understandable. But I think it would be a good idea to be more specific. We could try to persuade people to use this dataset format by naming this something like DefaultLArTPCDataset and throw it in datasets/defaults.py, but we shouldn't force it, if that makes sense.

My hope is that this repo will be usable with any already-made dataset anyone brings in, without needing to remake it to fit this specific format. Instead, they would just be required to make a single dataset object in a single file where the output of the dataloader is the same. I think this is slightly divergent from your instructions in the "adding a new detector" section of this doc.

Member

Again, LArTPCDataset is too generic of a name, I think.

Member

Same as comment above

Comment thread on docs/DETECTOR_DATASET.md (Outdated)
Member

Oh, also if you're wanting to add unit tests (which is a great idea) can you make a folder called tests in the root dir and put them there?

…irectly

HEPDataset was a 33-line fake-abstract class that added nothing: it claimed
a dict-with-coord/energy contract that half the subclasses already violated,
and inherited torch.utils.data.Dataset purely as a type decoration.

The three dataset classes share ~15 LOC of trivial plumbing (__init__ storing
transform/loop/max_len, __len__, __getitem__ dispatch). Extracting that into
a base class costs more in abstraction overhead than it saves in duplication.
The test-mode fragment_list logic is LArTPC-specific (WC's PMT arrays don't
slide, PILArNet built but never used the result), so forcing it into a base
imposed a contract that only one subclass actually uses.

Changes:
- Delete pimm/datasets/hepdataset.py
- LArTPCDataset, WCDataset, PILArNetH5Dataset now inherit torch.utils.data.Dataset
- Remove ignore_index kwarg from LArTPC/WC (dead code — never used by dataset,
  belongs in loss config). PILArNet unchanged (pre-existing code).
- Each dataset class is now self-contained and readable top-to-bottom:
  open the file, see the whole data flow without jumping to a parent class.

Real code reuse lives where it's justified: readers, transforms, utility
functions. Dataset classes are ~300-450 LOC wrappers that orchestrate readers
and apply transforms. No forced abstractions.

70 tests pass (38 LArTPC + 32 WC).
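
For illustration, a minimal sketch of the ~15 LOC of plumbing described above once a dataset inherits torch.utils.data.Dataset directly; the class name, constructor arguments, and get_data body are placeholders, not the actual LArTPCDataset/WCDataset code:

    import torch

    class SketchDataset(torch.utils.data.Dataset):
        """Placeholder showing the shared __init__/__len__/__getitem__ plumbing."""

        def __init__(self, num_events, transform=None, loop=1, max_len=None):
            self.num_events = num_events
            self.transform = transform   # pimm transform pipeline (callable or None)
            self.loop = loop             # repeat the event list `loop` times per epoch
            self.max_len = max_len       # optional cap on the reported length

        def __len__(self):
            n = self.num_events * self.loop
            return min(n, self.max_len) if self.max_len is not None else n

        def __getitem__(self, idx):
            data = self.get_data(idx % self.num_events)   # flat dict from the readers
            return self.transform(data) if self.transform is not None else data

        def get_data(self, idx):
            # detector-specific reader calls would go here
            return {"coord": torch.zeros(4, 3), "energy": torch.ones(4, 1)}

    print(len(SketchDataset(num_events=10, loop=2)))   # 20
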
…ests to /tests

Datasets and readers are specific to the HDF5 schemas produced by their
upstream production pipelines. Naming them by source rather than generic
detector type is more honest and matches the existing PILArNetH5Dataset
precedent.

Renames:
- LArTPCDataset     -> JAXTPCDataset     (lartpc_dataset.py -> jaxtpc_dataset.py)
- WCDataset         -> LUCiDDataset      (wc_dataset.py     -> lucid_dataset.py)
- LArTPCSegReader   -> JAXTPCSegReader   (lartpc_seg_reader.py -> jaxtpc_seg_reader.py)
- LArTPCRespReader  -> JAXTPCRespReader  (similar)
- LArTPCLablReader  -> JAXTPCLablReader  (similar)
- LArTPCCorrReader  -> JAXTPCCorrReader  (similar)
- WCSegReader       -> LUCiDSegReader    (wc_seg_reader.py     -> lucid_seg_reader.py)
- WCSensorReader    -> LUCiDSensorReader (wc_sensor_reader.py  -> lucid_sensor_reader.py)

File names use _dataset.py / _reader.py suffix so it's clear these are
pimm integration modules, not the upstream projects themselves.

Tests moved from tools/ to tests/ at repo root (standard Python layout):
- tools/test_detector_dataset.py -> tests/test_jaxtpc_dataset.py
- tools/test_wc_dataset.py       -> tests/test_lucid_dataset.py

Updates to all imports, configs, docs, and transform index_valid_keys comment.
All 70 tests pass (38 JAXTPC + 32 LUCiD).
@youngsm
Member

youngsm commented Apr 17, 2026

Thanks a lot for the changes.

One last thing... it would be amazing if you could add a docstring like the one below to the get_data methods for the jaxtpc and lucid datasets. Would be helpful for skimming purposes.

    def get_data(self, idx):
        """Load a point cloud from h5 file.
        
        Output dictionary:
        - coord: (N, 3) array of coordinates
        - energy: (N, 1) array of energies
        - momentum: (N, 1) array of particle momentum (v2 only)
        - vertex: (N, 3) array of vertices (v2 only)
        - segment_motif: (N, 1) array of motif labels
        - segment_pid: (N, 1) array of PID labels (v2 only)
        - instance_particle: (N, 1) array of particle instance labels
        - instance_interaction: (N, 1) array of interaction instance labels
        - segment_interaction: (N, 1) array of interaction labels
        """

@youngsm merged commit 9211a47 into DeepLearnPhysics:main on Apr 18, 2026