
Add multimodal detector datasets (LArTPC + Water Cherenkov) #2

Merged

youngsm merged 4 commits into DeepLearnPhysics:main from OmarAlterkait:multimodal-datasets on Apr 18, 2026

Conversation

@OmarAlterkait
Contributor

Add LArTPCDataset and WCDataset for loading multimodal detector simulation data through pimm's existing pipeline. Each detector type has dedicated readers that produce flat dicts consumed by the standard transform/collation/Point infrastructure.

LArTPC (JAXTPC production):

  • LArTPCSegReader: 3D truth deposits from seg files
  • LArTPCRespReader: sparse wire signals from resp files
  • LArTPCLablReader: per-volume track_id->label lookup from labl files
  • LArTPCCorrReader: 3D->2D correspondence from corr files (vectorized)
  • Modality-driven coord ownership: seg->3D, corr+labl->2D labeled, resp->2D merged
  • Both resp and corr available as separate point clouds (resp_coord, corr_coord)
  • Volume filter for single-volume loading

Water Cherenkov (PMT-based):

  • WCSegReader: 3D track segments from flat CSR format (see the sketch after this list)
  • WCSensorReader: PMT response + per-particle sparse decomposition
  • Output modes: response (per-sensor), labels (per-particle sparse), separate
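
For reference, a rough sketch of the flat CSR idea behind the WC seg files; the array names below are illustrative placeholders, not the actual HDF5 keys:

    import numpy as np

    # Flat CSR layout: every event's segments are concatenated into one flat
    # array, and a cumulative offsets array says which rows belong to which
    # event. Array names are placeholders, not the real dataset keys.
    seg_xyz = np.random.rand(10, 3)      # all segments from all events
    offsets = np.array([0, 4, 7, 10])    # event i owns rows offsets[i]:offsets[i+1]

    def event_segments(i):
        return seg_xyz[offsets[i]:offsets[i + 1]]

    print(event_segments(1).shape)       # (3, 3): the 3 segments of event 1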

Minimal changes to existing pimm code (3 files, 19 lines):

  • index_valid_keys extended for LArTPC keys
  • collate_fn skips _-prefixed metadata keys (see the sketch after this list)
  • Dataset imports added
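
For illustration, a minimal sketch of the collate rule described in the list above (not the actual pimm collate_fn): per-sample point arrays are concatenated, while keys starting with "_" are treated as metadata and passed through as plain Python lists. The "_run_id" key is a made-up example.

    import numpy as np
    import torch

    def collate_skip_metadata(batch):
        """Sketch only: concatenate point arrays, pass "_"-prefixed keys through."""
        out = {}
        for key in batch[0]:
            if key.startswith("_"):
                out[key] = [sample[key] for sample in batch]   # metadata, not stacked
            else:
                out[key] = torch.cat(
                    [torch.as_tensor(sample[key]) for sample in batch], dim=0
                )
        return out

    batch = [
        {"coord": np.zeros((5, 3)), "energy": np.ones((5, 1)), "_run_id": 7},
        {"coord": np.zeros((3, 3)), "energy": np.ones((3, 1)), "_run_id": 8},
    ]
    out = collate_skip_metadata(batch)
    print(out["coord"].shape, out["_run_id"])   # torch.Size([8, 3]) [7, 8]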

70 tests across both detector types verify all modality combinations, transform pipelines, collation, DataLoader workers, and toy model forward/backward passes.

Comment thread on docs/DETECTOR_DATASET.md
Member

May be better to make the name of this specific data format less generic than "detector dataset"? Would be helpful to have a name for the dataset now, lol. Maybe call it a jaxtpc dataset? For now until we figure out the big name.

Contributor Author

So the idea here is to have this for both JAXTPC, LUCiD, ...
Hence why I named it a generic detector dataset, but I could be more specific and change the lartpc one to jaxtpc if you think that's better. But a dataset name would be convenient, I agree

Member

@youngsm Apr 16, 2026

Yeah, that's understandable. But I think it would be a good idea to be more specific. We could try to persuade people to use this dataset format by naming this something like DefaultLArTPCDataset and throw it in datasets/defaults.py, but we shouldn't force it, if that makes sense.

My hope is that this repo will be usable with any already-made dataset anyone brings in, without needing to remake it to fit this specific format. Instead, they would just be required to make a single dataset object in a single file where the output of the dataloader is the same. I think this is slightly divergent from your instructions in the "adding a new detector" section of this doc.

Member

Again, LArTPCDataset is too generic of a name, I think.

Member

Same as comment above

Comment thread on docs/DETECTOR_DATASET.md (Outdated)
Member

Oh, also if you're wanting to add unit tests (which is a great idea) can you make a folder called tests in the root dir and put them there?

…irectly

HEPDataset was a 33-line fake-abstract class that added nothing: it claimed
a dict-with-coord/energy contract that half the subclasses already violated,
and inherited torch.utils.data.Dataset purely as a type decoration.

The three dataset classes share ~15 LOC of trivial plumbing (__init__ storing
transform/loop/max_len, __len__, __getitem__ dispatch). Extracting that into
a base class costs more in abstraction overhead than it saves in duplication.
The test-mode fragment_list logic is LArTPC-specific (WC's PMT arrays don't
slide, PILArNet built but never used the result), so forcing it into a base
imposed a contract that only one subclass actually uses.

Changes:
- Delete pimm/datasets/hepdataset.py
- LArTPCDataset, WCDataset, PILArNetH5Dataset now inherit torch.utils.data.Dataset
- Remove ignore_index kwarg from LArTPC/WC (dead code — never used by dataset,
  belongs in loss config). PILArNet unchanged (pre-existing code).
- Each dataset class is now self-contained and readable top-to-bottom:
  open the file, see the whole data flow without jumping to a parent class.

Real code reuse lives where it's justified: readers, transforms, utility
functions. Dataset classes are ~300-450 LOC wrappers that orchestrate readers
and apply transforms. No forced abstractions.

70 tests pass (38 LArTPC + 32 WC).
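
For illustration, a minimal sketch of the ~15 LOC of plumbing described above once a dataset inherits torch.utils.data.Dataset directly; the class name, constructor arguments, and get_data body are placeholders, not the actual LArTPCDataset/WCDataset code:

    import torch

    class SketchDataset(torch.utils.data.Dataset):
        """Placeholder showing the shared __init__/__len__/__getitem__ plumbing."""

        def __init__(self, num_events, transform=None, loop=1, max_len=None):
            self.num_events = num_events
            self.transform = transform   # pimm transform pipeline (callable or None)
            self.loop = loop             # repeat the event list `loop` times per epoch
            self.max_len = max_len       # optional cap on the reported length

        def __len__(self):
            n = self.num_events * self.loop
            return min(n, self.max_len) if self.max_len is not None else n

        def __getitem__(self, idx):
            data = self.get_data(idx % self.num_events)   # flat dict from the readers
            return self.transform(data) if self.transform is not None else data

        def get_data(self, idx):
            # detector-specific reader calls would go here
            return {"coord": torch.zeros(4, 3), "energy": torch.ones(4, 1)}

    print(len(SketchDataset(num_events=10, loop=2)))   # 20
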
…ests to /tests

Datasets and readers are specific to the HDF5 schemas produced by their
upstream production pipelines. Naming them by source rather than generic
detector type is more honest and matches the existing PILArNetH5Dataset
precedent.

Renames:
- LArTPCDataset     -> JAXTPCDataset     (lartpc_dataset.py -> jaxtpc_dataset.py)
- WCDataset         -> LUCiDDataset      (wc_dataset.py     -> lucid_dataset.py)
- LArTPCSegReader   -> JAXTPCSegReader   (lartpc_seg_reader.py -> jaxtpc_seg_reader.py)
- LArTPCRespReader  -> JAXTPCRespReader  (similar)
- LArTPCLablReader  -> JAXTPCLablReader  (similar)
- LArTPCCorrReader  -> JAXTPCCorrReader  (similar)
- WCSegReader       -> LUCiDSegReader    (wc_seg_reader.py     -> lucid_seg_reader.py)
- WCSensorReader    -> LUCiDSensorReader (wc_sensor_reader.py  -> lucid_sensor_reader.py)

File names use _dataset.py / _reader.py suffix so it's clear these are
pimm integration modules, not the upstream projects themselves.

Tests moved from tools/ to tests/ at repo root (standard Python layout):
- tools/test_detector_dataset.py -> tests/test_jaxtpc_dataset.py
- tools/test_wc_dataset.py       -> tests/test_lucid_dataset.py

Updates to all imports, configs, docs, and transform index_valid_keys comment.
All 70 tests pass (38 JAXTPC + 32 LUCiD).
@youngsm
Member

youngsm commented Apr 17, 2026

Thanks a lot for the changes.

One last thing... it would be amazing if you could add a docstring like the one below to the get_data methods for the jaxtpc and lucid datasets. Would be helpful for skimming purposes.

    def get_data(self, idx):
        """Load a point cloud from h5 file.
        
        Output dictionary:
        - coord: (N, 3) array of coordinates
        - energy: (N, 1) array of energies
        - momentum: (N, 1) array of particle momentum (v2 only)
        - vertex: (N, 3) array of vertices (v2 only)
        - segment_motif: (N, 1) array of motif labels
        - segment_pid: (N, 1) array of PID labels (v2 only)
        - instance_particle: (N, 1) array of particle instance labels
        - instance_interaction: (N, 1) array of interaction instance labels
        - segment_interaction: (N, 1) array of interaction labels
        """

@youngsm merged commit 9211a47 into DeepLearnPhysics:main on Apr 18, 2026