feat: store backends (HDF5/Parquet/Zarr), store_format in manifest (v1.1.0)#388

Open
benjello wants to merge 18 commits into master from feature/backend

benjello commented Mar 3, 2026

Summary (v1.1.0)

Store backends (choosing the storage format for tables)

  • io/backends: HDF5, Parquet and Zarr backends (StoreBackend abstraction); get_backend(name), get_available_backend_names(), and register_backend() for extension.
  • Zarr: optional backend (pip install openfisca-survey-manager[zarr]); one table = one zarr group inside a per-survey .zarr directory.
  • Survey: zarr_file_path attribute; fill_store(store_format="zarr") and reading via get_values for zarr.
  • Table: writing/reading and _is_stored are delegated to the backends; _get_store_path_and_format() unifies path handling.
  • build-collection: --zarr option alongside --parquet; defaults to HDF5 with a warning.
  • Docs: docs/ZARR-BACKEND.md (Zarr usage, compression, parallelization).
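
As an illustration of the registry pattern described above, here is a minimal sketch. The three function names and `StoreBackend` come from this PR, but the method names (`write`, `read`, `is_stored`), their signatures, and the toy `CsvBackend` are assumptions made for the example and are not the code in io/backends.

```python
from abc import ABC, abstractmethod
import csv
from pathlib import Path


class StoreBackend(ABC):
    """Assumed minimal interface; the PR's real abstraction may differ."""

    @abstractmethod
    def write(self, rows, path, table_name): ...

    @abstractmethod
    def read(self, path, table_name): ...

    @abstractmethod
    def is_stored(self, path, table_name): ...


# Registry mapping a lower-case backend name to its class.
_BACKENDS = {}


def register_backend(name, backend_cls):
    """Make a backend class available under `name`."""
    _BACKENDS[name.lower()] = backend_cls


def get_available_backend_names():
    """Return the registered backend names, sorted."""
    return sorted(_BACKENDS)


def get_backend(name):
    """Instantiate the backend registered under `name`."""
    try:
        backend_cls = _BACKENDS[name.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown store backend {name!r}; available: {get_available_backend_names()}"
        ) from None
    return backend_cls()


class CsvBackend(StoreBackend):
    """Hypothetical toy backend: one CSV file per table inside a directory."""

    def _file(self, path, table_name):
        return Path(path) / f"{table_name}.csv"

    def write(self, rows, path, table_name):
        Path(path).mkdir(parents=True, exist_ok=True)
        with self._file(path, table_name).open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    def read(self, path, table_name):
        # csv returns every value as a string; a real backend would keep dtypes.
        with self._file(path, table_name).open(newline="") as f:
            return list(csv.DictReader(f))

    def is_stored(self, path, table_name):
        return self._file(path, table_name).exists()


register_backend("csv", CsvBackend)
```

With this shape, Table only needs a backend name and a path; adding a new format means registering one more class rather than touching the read/write call sites.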

Manifest (RFC-002): store_format

  • manifest.yaml: optional store_format key (hdf5, parquet, zarr) at the dataset level; defaults to parquet on load.
  • SurveyCollection.load: when loading from a manifest, applies store_format and derives store paths from default_output_dir.
  • Migration script: infers store_format from the legacy JSON and writes it to the generated manifest.
  • RFC-002: example manifest with store_format; sections 3.5 and 4.2 updated.
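
The loading rule above (optional key, parquet default, paths derived from default_output_dir) can be sketched as follows. The manifest is shown as an already-parsed dict; the `surveys` key, the `resolve_store` helper, and the file-extension layout are assumptions for illustration, not the layout SurveyCollection.load actually uses.

```python
from pathlib import Path

VALID_STORE_FORMATS = ("hdf5", "parquet", "zarr")

# Assumed on-disk layout. The PR only states that Zarr uses one .zarr
# directory per survey; the other two entries are guesses.
_SUFFIX = {"hdf5": ".h5", "parquet": "", "zarr": ".zarr"}


def resolve_store(manifest, default_output_dir):
    """Return {survey_name: (store_format, store_path)} for a parsed manifest."""
    # store_format is optional at the dataset level; parquet is the default.
    store_format = manifest.get("store_format", "parquet")
    if store_format not in VALID_STORE_FORMATS:
        raise ValueError(
            f"store_format must be one of {VALID_STORE_FORMATS}, got {store_format!r}"
        )
    base = Path(default_output_dir)
    return {
        name: (store_format, base / f"{name}{_SUFFIX[store_format]}")
        for name in manifest.get("surveys", {})  # 'surveys' key is hypothetical
    }
```

For example, `resolve_store({"surveys": {"demo_2020": {}}}, "/data/out")` yields a parquet entry, while adding `"store_format": "zarr"` switches the path to `demo_2020.zarr`.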

Already in place (previous commits)

  • RFC-002: config.yaml + manifests, config.ini → config.yaml migration script, DeprecationWarning for legacy loading.

Version: 1.1.0 (CHANGELOG + pyproject.toml)

benjello added 16 commits March 10, 2026 18:29
…gration doc (v6.4.0)

- io/hdf: extract write_table_to_hdf5 to io.hdf, writers re-exports
- processing/harmonization: harmonize_data_frame_columns, used in Survey.get_values
- core: add core.table, core.survey, core.dataset; tables/surveys/survey_collections re-export
- logging: add to configuration.models, google_colab, statshelpers; fix typo in core.table
- docs: MIGRATION_IMPORTS.md (import mapping when re-exports removed), REFACTORING_PLAN §3.4
- CHANGELOG + bump to 6.4.0

Made-with: Cursor
- core: type hints Table, Survey, SurveyCollection (TYPE_CHECKING for circular refs)
- io: type hints readers (read_dbf cols), writers/hdf already typed
- processing: type hints cleaning, harmonization, calmar, Calibration
- REFACTORING_PLAN §3.3 updated; version 6.5.0

Made-with: Cursor
- Implement _nnd_hotdeck_python and _create_fused_python (Manhattan/Euclidean,
  donor_classes, random tie-breaking)
- Unified nnd_hotdeck() API, Python by default, use_r=True for StatMatch
- nnd_hotdeck_using_rpy2 kept as an alias for compatibility
- Lint fixes (variable names, lambda->def, duplicate removed, print->log)

Made-with: Cursor
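
For readers unfamiliar with nearest-neighbour distance hot deck matching, the commit above can be pictured with the following pure-Python sketch. It is not the code of _nnd_hotdeck_python or _create_fused_python: the function names, the list-of-dicts inputs, and the parameter names are assumptions, and only the ingredients named in the commit (Manhattan/Euclidean distance, donor classes, random tie-breaking) are taken from the source.

```python
import random


def nnd_hotdeck_sketch(receivers, donors, match_vars, donor_class=None,
                       metric="manhattan", seed=None):
    """Return, for each receiver, the index of its nearest donor (or None).

    receivers and donors are lists of dicts. When donor_class is given,
    a receiver is matched only with donors sharing its value for that key.
    Ties on the minimum distance are broken at random.
    """
    rng = random.Random(seed)

    def distance(a, b):
        diffs = [abs(a[v] - b[v]) for v in match_vars]
        if metric == "manhattan":
            return sum(diffs)
        if metric == "euclidean":
            return sum(d * d for d in diffs) ** 0.5
        raise ValueError(f"Unsupported metric: {metric!r}")

    matches = []
    for rec in receivers:
        # Restrict the donor pool to the receiver's class, if any.
        pool = [
            (i, don) for i, don in enumerate(donors)
            if donor_class is None or don[donor_class] == rec[donor_class]
        ]
        if not pool:
            matches.append(None)
            continue
        dists = [(distance(rec, don), i) for i, don in pool]
        best = min(d for d, _ in dists)
        nearest = [i for d, i in dists if d == best]
        matches.append(rng.choice(nearest))
    return matches


def create_fused_sketch(receivers, donors, matches, z_vars):
    """Copy the donated variables z_vars from each matched donor to its receiver."""
    fused = []
    for rec, idx in zip(receivers, matches):
        row = dict(rec)
        for z in z_vars:
            row[z] = donors[idx][z] if idx is not None else None
        fused.append(row)
    return fused
```

Passing a seed makes the random tie-breaking reproducible, which is convenient when comparing results against the R StatMatch path.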
- Remove the function from common.misc
- Remove the exports from common and utils
- Update MIGRATION_IMPORTS and REFACTORING_PLAN

Made-with: Cursor
- Add openfisca_survey_manager.policy (simulations, simulation_builder, aggregates)
- Keep root modules as DeprecationWarning placeholders re-exporting from policy
- Move policy-related tests to policy/tests and update imports
- Add policy.legislation_asof and deprecate common.misc / utils helpers

Made-with: Cursor
…atshelpers/variables to policy

- input_dataframe_generator: moved to tests/ (used only by tests)
- coicop, matching, statshelpers, variables: moved to policy/
- update all imports accordingly
- coicop: paths -> configuration.paths

Made-with: Cursor
…precationWarnings

- calmar, calibration: processing/weights -> policy/
- processing/weights: re-export from policy + DeprecationWarning
- processing/__init__: lazy import for weights (avoid circular import)
- placeholders at root: coicop, input_dataframe_generator, matching,
  statshelpers, variables (with DeprecationWarnings)
- calmar, calibration placeholders: add DeprecationWarnings

Made-with: Cursor
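
The "lazy import for weights" item relies on a module-level `__getattr__` (PEP 562): the submodule is imported on first attribute access instead of at package import time, which is what breaks the cycle. A minimal sketch follows, using a stand-in module object so it runs on its own; `make_lazy_getattr`, the `legacy_processing` name, and the use of `json` as the relocated module are all placeholders, not what processing/__init__ contains.

```python
import importlib
import types


def make_lazy_getattr(module_name, moved):
    """Build a PEP 562 module-level __getattr__.

    `moved` maps an attribute name to the dotted path of the module that
    now provides it. Nothing is imported until the attribute is accessed.
    """

    def __getattr__(name):
        if name not in moved:
            raise AttributeError(f"module {module_name!r} has no attribute {name!r}")
        return importlib.import_module(moved[name])

    return __getattr__


# Stand-in for a package __init__: in a real package the function would be
# defined directly at module level under the name __getattr__.
legacy = types.ModuleType("legacy_processing")
legacy.__getattr__ = make_lazy_getattr("legacy_processing", {"weights": "json"})

# First access triggers the import of the target module.
weights = legacy.weights
```

Unknown names still raise AttributeError, so `hasattr` checks and typos behave as they would on an ordinary module.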
…Warning

- scenarios/ -> policy/scenarios/ (abstract_scenario, reform_scenario)
- Placeholders at openfisca_survey_manager/scenarios/ with DeprecationWarning
- Remove common, processing/weights, root placeholders (coicop, matching, etc.)
- Update all imports to policy.scenarios
- Add missing Survey import in abstract_scenario

Made-with: Cursor
- policy: add py.typed marker; type legislation_asof, variables, coicop,
  matching, statshelpers, calmar, calibration, simulation_builder,
  aggregates, simulations, scenarios (abstract + reform)
- configuration/models: type Config.__init__ and save
- processing/__init__: type __getattr__ return

Made-with: Cursor
- Remove compatibility modules (config, paths, tables, surveys,
  survey_collections, read_sas, read_spss, read_dbf, calibration, calmar, utils)
- load_table moved to core.dataset; imports migrated according to MIGRATION_IMPORTS.md
- Version 1.0.0, CHANGELOG and docs (REFACTORING_PLAN, RFC-001)

Made-with: Cursor
- Delete root simulations/simulation_builder/aggregates/utils modules in favor of policy.*
- Keep common.* as thin aliases to policy.legislation_asof (no DeprecationWarning)
- Update scenarios and tests to import from policy and core.dataset/core.survey
- Wire policy.simulations to SurveyCollection/load_table from core.dataset; tests still all pass

Made-with: Cursor
- Delete scenarios/abstract_scenario.py, scenarios/reform_scenario.py
- Remove DeprecationWarning in core/table.py (HDF5), keep log.warning
- build_collection: keep log.warning only (no DeprecationWarning)

Made-with: Cursor
- policy/simulations: log.warn -> log.warning (3), groupby(..., observed=False)
- io/hdf: hdf5_safe_key() for PyTables NaturalNameWarning; to_hdf key= keyword
- core/survey: use hdf5_safe_key when reading HDF5, backward compat fallback
- tests: fix PytestReturnNotNoneWarning (assert instead of return)

Made-with: Cursor
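
PyTables emits NaturalNameWarning when a node name is not a valid Python identifier or is a Python keyword. The sketch below shows one plausible way a helper like hdf5_safe_key() could sanitize table names, together with a read-side fallback for stores written under the old, unsanitized key. Both functions are assumptions written for illustration; the actual rules in io/hdf may differ.

```python
import keyword
import re


def hdf5_safe_key_sketch(name):
    """Turn an arbitrary table name into a valid Python identifier."""
    # Replace anything outside [A-Za-z0-9_] with an underscore.
    key = re.sub(r"\W", "_", name, flags=re.ASCII)
    # Identifiers cannot be empty or start with a digit.
    if not key or key[0].isdigit():
        key = f"_{key}"
    # Keywords such as 'class' are also rejected by natural naming.
    if keyword.iskeyword(key):
        key = f"{key}_"
    return key


def resolve_key(available_keys, name):
    """Prefer the sanitized key, fall back to the raw name for older stores."""
    for candidate in (hdf5_safe_key_sketch(name), name):
        if candidate in available_keys:
            return candidate
    raise KeyError(name)
```

Trying the sanitized key first and the raw name second lets newly written stores and stores created before the change be read through the same code path.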
- Add config_loader (get_config_dir, load_config, load_manifest, manifest_survey_to_json)
- SurveyCollection.load() tries config.yaml + manifest first, else legacy config.ini + JSON
- Add migrate_config_to_rfc002 script (config.ini/raw_data.ini/JSON -> config.yaml + manifests)
- Emit DeprecationWarning when loading via legacy config.ini + JSON
- Add tests for RFC-002 and migration; legacy load test expects deprecation warning
- Add docs/RFC-002-METADATA-AND-CONFIG.md

Made-with: Cursor
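
The "new format first, legacy with a warning" behaviour described in this commit can be sketched as below. `choose_config_source` is a hypothetical helper, not part of config_loader; only the file names config.yaml and config.ini and the use of DeprecationWarning come from the commit message.

```python
import warnings
from pathlib import Path


def choose_config_source(config_dir):
    """Return ("rfc002", path) if config.yaml exists, else ("legacy", path)."""
    config_dir = Path(config_dir)

    # Preferred: RFC-002 layout (config.yaml + manifests).
    yaml_path = config_dir / "config.yaml"
    if yaml_path.exists():
        return "rfc002", yaml_path

    # Fallback: legacy config.ini + JSON, flagged as deprecated.
    ini_path = config_dir / "config.ini"
    if ini_path.exists():
        warnings.warn(
            "Loading from config.ini + JSON is deprecated; "
            "migrate to config.yaml + manifests.",
            DeprecationWarning,
            stacklevel=2,
        )
        return "legacy", ini_path

    raise FileNotFoundError(f"No config.yaml or config.ini found in {config_dir}")
```

Because the warning is only raised on the fallback branch, a test for the legacy path can assert that a DeprecationWarning is emitted while the RFC-002 path stays silent.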
…igration (v1.1.0)

- io/backends: backend registry (hdf5, parquet, zarr), get_backend, register_backend
- Survey: zarr_file_path, fill_store/get_values for zarr; build-collection --zarr
- Table: delegate write/read to backends via _get_store_path_and_format
- Manifest: store_format (hdf5|parquet|zarr) at dataset level; load applies it and sets store paths
- Migration script: infer store_format from legacy JSON and write in manifest
- Docs: ZARR-BACKEND.md, RFC-002 store_format example and migration note
- Changelog 1.1.0, pyproject 1.1.0

Made-with: Cursor