Add multimodal detector datasets (LArTPC + Water Cherenkov) #2
Add LArTPCDataset and WCDataset for loading multimodal detector simulation data through pimm's existing pipeline. Each detector type has dedicated readers that produce flat dicts consumed by the standard transform/collation/Point infrastructure.

LArTPC (JAXTPC production):

- LArTPCSegReader: 3D truth deposits from seg files
- LArTPCRespReader: sparse wire signals from resp files
- LArTPCLablReader: per-volume track_id->label lookup from labl files
- LArTPCCorrReader: 3D->2D correspondence from corr files (vectorized)
- Modality-driven coord ownership: seg->3D, corr+labl->2D labeled, resp->2D merged
- Both resp and corr available as separate point clouds (resp_coord, corr_coord)
- Volume filter for single-volume loading

Water Cherenkov (PMT-based):

- WCSegReader: 3D track segments from flat CSR format
- WCSensorReader: PMT response + per-particle sparse decomposition
- Output modes: response (per-sensor), labels (per-particle sparse), separate

Minimal changes to existing pimm code (3 files, 19 lines):

- index_valid_keys extended for LArTPC keys
- collate_fn skips _-prefixed metadata keys
- Dataset imports added

70 tests across both detector types verify all modality combinations, transform pipelines, collation, DataLoader workers, and toy model forward/backward passes.
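The "collate_fn skips _-prefixed metadata keys" change can be sketched roughly as below. This is a hypothetical stand-in, not pimm's actual collate function, and the key names (`coord`, `_volume`) are illustrative; the real implementation would stack features into tensors rather than lists.

```python
def collate_fn(samples):
    """Collate a list of flat per-event dicts into one batch dict,
    dropping any key that starts with '_' (metadata, not features)."""
    batch = {}
    for key in samples[0]:
        if key.startswith("_"):
            continue  # skip metadata keys, per the convention above
        # stand-in for tensor stacking: gather per-event values in a list
        batch[key] = [s[key] for s in samples]
    return batch


events = [
    {"coord": [0.0, 1.0], "_volume": 3},
    {"coord": [2.0, 3.0], "_volume": 7},
]
print(collate_fn(events))  # metadata key '_volume' is absent from the batch
```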
May be better to make the name of this specific data format less generic than "detector dataset"? Would be helpful to have a name for the dataset now, lol. Maybe call it a jaxtpc dataset? For now until we figure out the big name.
So the idea here is to have this for both JAXTPC, LUCiD, ...
Hence why I named it a generic detector dataset, but I could be more specific and change the lartpc one to jaxtpc if you think that's better. But a dataset name would be convenient, I agree
Yeah, that's understandable. But I think it would be a good idea to be more specific. We could try to persuade people to use this dataset format by naming this something like DefaultLArTPCDataset and throw it in datasets/defaults.py, but we shouldn't force it, if that makes sense.
The hope to me is that this repo will be usable with any already-made dataset anyone is bringing in without needing to remake it to fit this specific format. Instead, they would just be required to make a single dataset object in a single file where the output of the dataloader is the same. I think this is slightly divergent from your instructions in the "adding a new detector" section of this doc.
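A minimal sketch of what such a "same dataloader output" contract could look like: a validator over the per-event flat dict. The key names (`coord`, the `_` metadata prefix) and the function itself are hypothetical, not an actual pimm requirement.

```python
def check_event_contract(event):
    """Hypothetical validator for a per-event flat-dict contract:
    a dict with string keys, a 'coord' entry for point coordinates,
    and '_'-prefixed keys reserved for unbatched metadata."""
    if not isinstance(event, dict):
        raise TypeError("each event must be a flat dict")
    if "coord" not in event:
        raise KeyError("event is missing the 'coord' point-coordinate key")
    for key in event:
        if not isinstance(key, str):
            raise TypeError(f"non-string key {key!r} in event dict")
    return True
```

Any incoming dataset whose `__getitem__` passes such a check would then flow through the shared transform/collation code unchanged, which is the interoperability the comment above is asking for.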
Again LArTPCDataset is too generic of a name I think.
Oh, also if you're wanting to add unit tests (which is a great idea) can you make a folder called tests in the root dir and put them there?
…irectly

HEPDataset was a 33-line fake-abstract class that added nothing: it claimed a dict-with-coord/energy contract that half the subclasses already violated, and inherited torch.utils.data.Dataset purely as a type decoration. The three dataset classes share ~15 LOC of trivial plumbing (__init__ storing transform/loop/max_len, __len__, __getitem__ dispatch). Extracting that into a base class costs more in abstraction overhead than it saves in duplication. The test-mode fragment_list logic is LArTPC-specific (WC's PMT arrays don't slide, PILArNet built but never used the result), so forcing it into a base imposed a contract that only one subclass actually uses.

Changes:

- Delete pimm/datasets/hepdataset.py
- LArTPCDataset, WCDataset, PILArNetH5Dataset now inherit torch.utils.data.Dataset
- Remove ignore_index kwarg from LArTPC/WC (dead code: never used by the dataset, belongs in loss config). PILArNet unchanged (pre-existing code).
- Each dataset class is now self-contained and readable top-to-bottom: open the file, see the whole data flow without jumping to a parent class.

Real code reuse lives where it's justified: readers, transforms, utility functions. Dataset classes are ~300-450 LOC wrappers that orchestrate readers and apply transforms. No forced abstractions.

70 tests pass (38 LArTPC + 32 WC).
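The ~15 LOC of shared plumbing described above (store transform/loop/max_len, dispatch in __getitem__) can be sketched as a toy class. This is an illustrative duck-typed dataset, not one of the actual pimm classes: torch's DataLoader only requires __len__ and __getitem__ on map-style datasets, so the sketch needs no torch import.

```python
class ToyDetectorDataset:
    """Illustrative map-style dataset mirroring the shared plumbing:
    __init__ stores transform/loop, __len__ and __getitem__ dispatch
    directly, with no abstract base class in between."""

    def __init__(self, events, transform=None, loop=1):
        self.events = events        # pre-loaded flat dicts, one per event
        self.transform = transform  # optional callable applied per sample
        self.loop = loop            # virtual repetition of the dataset

    def __len__(self):
        return len(self.events) * self.loop

    def __getitem__(self, idx):
        sample = dict(self.events[idx % len(self.events)])
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
```

At this size, duplicating these three methods per dataset keeps each file readable top-to-bottom, which is the trade the commit message argues for.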
…ests to /tests

Datasets and readers are specific to the HDF5 schemas produced by their upstream production pipelines. Naming them by source rather than generic detector type is more honest and matches the existing PILArNetH5Dataset precedent.

Renames:

- LArTPCDataset -> JAXTPCDataset (lartpc_dataset.py -> jaxtpc_dataset.py)
- WCDataset -> LUCiDDataset (wc_dataset.py -> lucid_dataset.py)
- LArTPCSegReader -> JAXTPCSegReader (lartpc_seg_reader.py -> jaxtpc_seg_reader.py)
- LArTPCRespReader -> JAXTPCRespReader (similar)
- LArTPCLablReader -> JAXTPCLablReader (similar)
- LArTPCCorrReader -> JAXTPCCorrReader (similar)
- WCSegReader -> LUCiDSegReader (wc_seg_reader.py -> lucid_seg_reader.py)
- WCSensorReader -> LUCiDSensorReader (wc_sensor_reader.py -> lucid_sensor_reader.py)

File names use the _dataset.py / _reader.py suffix so it's clear these are pimm integration modules, not the upstream projects themselves.

Tests moved from tools/ to tests/ at the repo root (standard Python layout):

- tools/test_detector_dataset.py -> tests/test_jaxtpc_dataset.py
- tools/test_wc_dataset.py -> tests/test_lucid_dataset.py

Updates to all imports, configs, docs, and the transform index_valid_keys comment.

All 70 tests pass (38 JAXTPC + 32 LUCiD).
Thanks a lot for the changes. One last thing... it would be amazing if you could add a docstring like below to the