diff --git a/.gitignore b/.gitignore
index b188e6f0..b619b74b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -102,3 +102,4 @@ tags
 # Tests files
 *.parquet
 test_*.json
+.claude/settings.json
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8bf85a4b..4cbe1dd6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,49 @@
-# Changelog
+# Changelog
+
+# 8.0.0
+
+* **Store backends** (choice of storage format for survey tables)
+  - **io/backends**: HDF5, Parquet and Zarr backends (`StoreBackend` abstraction); `get_backend(name)`, `get_available_backend_names()`, `register_backend()` to extend.
+  - **Zarr**: optional backend (`pip install openfisca-survey-manager[zarr]`); one table = one zarr group in a per-survey `.zarr` directory.
+  - **Survey**: `zarr_file_path` attribute; `fill_store(store_format="zarr")`, and reads through `get_values` for zarr.
+  - **Table**: reads, writes and `_is_stored` are delegated to the backends; `_get_store_path_and_format()` unifies path resolution.
+  - **build-collection**: `--zarr` option in addition to `--parquet`; HDF5 remains the default, with a warning.
+  - **Docs**: `docs/ZARR-BACKEND.md` (Zarr usage, compression, parallelism).
+
+* **Manifest (RFC-002): store_format**
+  - **manifest.yaml**: optional dataset-level `store_format` key (hdf5, parquet, zarr); defaults to `parquet` at load time.
+  - **SurveyCollection.load**: when loading from a manifest, applies `store_format` and derives the store paths (`hdf5_file_path`, `parquet_file_path`, `zarr_file_path`) from `default_output_dir`.
+  - **Migration script**: infers `store_format` from the legacy JSON (`parquet_file_path` / `zarr_file_path` / `hdf5_file_path`) and writes it into the generated manifest.
+  - **RFC-002**: manifest example with `store_format`; sections 3.5 and 4.2 updated.
+
+# 7.0.0
+
+* **Breaking**: version 7.0 removes the re-exports and the DeprecationWarnings
+  - **Compatibility modules removed**: `config`, `paths`, `tables`, `surveys`, `survey_collections`, `read_sas`, `read_spss`, `read_dbf`, `calibration`, `calmar` and `utils` are deleted. Use the canonical imports (see `docs/MIGRATION_IMPORTS.md`).
+  - **`load_table`**: moved from `utils` to `openfisca_survey_manager.core.dataset` (and exported from `core`).
+  - All internal imports have been migrated to `configuration.paths`, `configuration.models`, `core.dataset`, `core.survey`, `core.table`, `io.readers`, `processing.weights`, `common.misc`.
+
+* **Typing** (no breaking API changes)
+  - **policy**: Add `py.typed` marker; type hints on `legislation_asof`, `variables`, `coicop`, `matching`, `statshelpers`, `calmar`, `calibration`, `simulation_builder`, `aggregates`, `simulations`, and the scenarios (`abstract_scenario`, `reform_scenario`).
+  - **configuration**: Type hints on `Config.__init__` and `save` in `configuration.models`.
+  - **processing**: Type the return of `__getattr__` in `processing/__init__.py`.
+
+# 6.5.0
+
+* Typing (no breaking API changes)
+  - **core**: Type hints on `core.table` (Table), `core.survey` (Survey, NoMoreDataError), `core.dataset` (SurveyCollection); `TYPE_CHECKING` for circular refs; class attributes with defaults where needed
+  - **io**: Type hints on `io.readers` (read_sas, read_spss, read_dbf with `Optional[list[str]]` for cols); `io.writers` and `io.hdf` already typed
+  - **processing**: Type hints on `processing.cleaning`, `processing.harmonization`, `processing.weights.calmar` (linear, logit, calmar, check_calmar, etc.), `processing.weights.calibration` (Calibration class and methods)
+  - **Docs**: Update `REFACTORING_PLAN.md` §3.3 (typing of core, io and processing done)
+
+# 6.4.0
+
+* Refactor (no breaking API changes)
+  - **io/hdf**: Extract HDF5 write logic into `io.hdf` (`write_table_to_hdf5`); `io.writers` re-exports for compatibility
+  - **processing/harmonization**: Add `harmonize_data_frame_columns` (lowercase, rename ident); used in `Survey.get_values`; exported from `processing`
+  - **core**: Add `core.table`, `core.survey`, `core.dataset` (Table, Survey, NoMoreDataError, SurveyCollection); root `tables.py`, `surveys.py`, `survey_collections.py` re-export for compatibility
+  - **Logging**: Extend to all modules: add a logger to `configuration.models`, `google_colab`, `statshelpers`; fix typo "folloging" → "following" in `core.table`
+  - **Docs**: Add `docs/MIGRATION_IMPORTS.md` (import mapping and steps for when the re-exports are removed, with a breaking-change warning); update `REFACTORING_PLAN.md` (§3.4 Logging done)

 # 6.3.1
diff --git a/docs/MIGRATION_IMPORTS.md b/docs/MIGRATION_IMPORTS.md
new file mode 100644
index 00000000..f6a9da10
--- /dev/null
+++ b/docs/MIGRATION_IMPORTS.md
@@ -0,0 +1,154 @@
# Migrating imports after the re-exports are removed

This document describes the changes to make **when the re-exports** (compatibility files) at the package root are removed: update every import to the new paths, then delete the old modules.

**Reference**: `docs/REFACTORING_PLAN.md`.

---

## Warning

Removing the re-exports is a **breaking change**: any code (internal or external) that imports from the old paths (`config`, `paths`, `tables`, `surveys`, `survey_collections`, `read_sas`, `read_spss`, `read_dbf`, `calibration`, `calmar`, `utils`) will see its imports **fail** (`ModuleNotFoundError`). All imports must be migrated **before** deleting the files listed in section 3, and the change must be documented in the CHANGELOG for dependent projects (e.g. openfisca-france-data).
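For a dependent project that must run against survey-manager versions on both sides of the removal, an import guard can bridge the transition. A minimal sketch (hypothetical downstream code, not part of this package):

```python
# Try the canonical path first; fall back to the legacy re-export on
# survey-manager versions that predate io.readers. The same pattern works
# for any symbol in the mapping of section 1.
try:
    from openfisca_survey_manager.io.readers import read_sas
except ModuleNotFoundError:
    from openfisca_survey_manager.read_sas import read_sas
```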
---

## 1. Old → new import mapping

| Old import (to remove) | New import (to use) |
|------------------------|---------------------|
| `from openfisca_survey_manager.config import Config` | `from openfisca_survey_manager.configuration.models import Config` |
| `from openfisca_survey_manager.paths import ...` | `from openfisca_survey_manager.configuration.paths import ...` |
| `from openfisca_survey_manager.tables import Table` | `from openfisca_survey_manager.core.table import Table` |
| `from openfisca_survey_manager.surveys import Survey` | `from openfisca_survey_manager.core.survey import Survey` |
| `from openfisca_survey_manager.surveys import NoMoreDataError` | `from openfisca_survey_manager.core.survey import NoMoreDataError` |
| `from openfisca_survey_manager.survey_collections import SurveyCollection` | `from openfisca_survey_manager.core.dataset import SurveyCollection` |
| `from openfisca_survey_manager.read_sas import read_sas` | `from openfisca_survey_manager.io.readers import read_sas` |
| `from openfisca_survey_manager.read_spss import read_spss` | `from openfisca_survey_manager.io.readers import read_spss` |
| `from openfisca_survey_manager.read_dbf import read_dbf` | `from openfisca_survey_manager.io.readers import read_dbf` |
| `from openfisca_survey_manager.calibration import Calibration` | `from openfisca_survey_manager.processing.weights import Calibration` |
| `from openfisca_survey_manager.calmar import calmar` | `from openfisca_survey_manager.processing.weights import calmar` |
| `from openfisca_survey_manager.calmar import check_calmar` | `from openfisca_survey_manager.processing.weights import check_calmar` |
| `from openfisca_survey_manager.utils import do_nothing, load_table, ...` | See section 2 (utils) |

**Symbols exported by `paths`** (same names in `configuration.paths`):
`config_ini`, `default_config_files_directory`, `is_in_ci`, `openfisca_survey_manager_location`, `private_run_with_data`, `test_config_files_directory`.

**Symbols exported by `utils`**:
- From `common.misc`: `asof`, `do_nothing`, `inflate_parameter_leaf`, `inflate_parameters`, `parameters_asof`, `variables_asof`.
- Defined in `utils.py`: `load_table` (to be moved to a suitable module, e.g. `core` or `io`, before `utils.py` is deleted).

---

## 2. Files to update when the re-exports are removed

Before (or at the same time as) deleting the files listed in section 3, update the imports in the following files.
### 2.1 Imports from `config`, `paths`

| File | Replace |
|------|---------|
| `tests/input_dataframe_generator.py` | `paths` → `configuration.paths` (module moved into `tests/`) |
| `scripts/build_collection.py` | `paths` → `configuration.paths` |
| `temporary.py` | `paths` → `configuration.paths` |
| `google_colab.py` | `paths` → `configuration.paths` |
| `coicop.py` | `paths` → `configuration.paths` |
| `matching.py` | `paths` → `configuration.paths` |
| `tests/test_read_sas.py` | `paths` → `configuration.paths`; `read_sas` → `io.readers` |
| `tests/test_quantile.py` | `paths` → `configuration.paths` |
| `tests/test_scenario.py` | `paths` → `configuration.paths` |

### 2.2 Imports from `survey_collections`, `surveys`, `tables`

| File | Replace |
|------|---------|
| `tests/input_dataframe_generator.py` | `survey_collections`, `surveys` → `core.dataset`, `core.survey` |
| `simulations.py` | `survey_collections` → `core.dataset`; `utils` → `common.misc` plus the module holding `load_table` |
| `utils.py` | `survey_collections` → `core.dataset` (for `load_table`) |
| `scripts/build_collection.py` | `survey_collections`, `surveys` → `core.dataset`, `core.survey` |
| `scenarios/abstract_scenario.py` | `calibration`, `surveys` → `processing.weights`, `core.survey` |
| `tests/test_surveys.py` | `survey_collections`, `surveys` → `core.dataset`, `core.survey` |
| `tests/test_coverage_boost.py` | `survey_collections`, `surveys`, `utils` → same as above |
| `tests/test_add_survey_to_collection.py` | `survey_collections` → `core.dataset` |
| `tests/test_parquet.py` | `survey_collections` → `core.dataset`; `surveys` (NoMoreDataError) → `core.survey` |

### 2.3 Imports from `read_sas`, `read_spss`, `read_dbf`

| File | Replace |
|------|---------|
| `core/table.py` | `from openfisca_survey_manager import read_sas` → `from openfisca_survey_manager.io.readers import read_sas`; `read_sas.read_sas` → `read_sas` in `reader_by_source_format`. Then `from openfisca_survey_manager.read_spss import read_spss` → `from openfisca_survey_manager.io.readers import read_spss` (inside the try/except). |
| `tests/test_read_sas.py` | `from ...paths import ...` → `configuration.paths`; `from ...read_sas import read_sas` → `from ...io.readers import read_sas` |

### 2.4 Imports from `calibration`, `calmar`

| File | Replace |
|------|---------|
| `scenarios/abstract_scenario.py` | `calibration` → `processing.weights` |
| `tests/test_calibration.py` | `calibration` → `processing.weights` |
| `tests/test_calmar.py` | `calmar` → `processing.weights` |

### 2.5 Imports from `utils`

| File | Replace |
|------|---------|
| `simulations.py` | `utils.do_nothing`, `utils.load_table` → `common.misc.do_nothing` plus the module holding `load_table` |
| `tests/test_coverage_boost.py` | `utils.do_nothing` → `common.misc.do_nothing` |
| `tests/test_legislation_inflator.py` | `utils.inflate_parameters`, `parameters_asof` → `common.misc` |
| `tests/test_tax_benefit_system_asof.py` | `utils.parameters_asof`, `variables_asof` → `common.misc` |

**Note**: `load_table` depends on `SurveyCollection`; it must either live in a module that imports `core.dataset`, or be moved (e.g. to `core.dataset` or an `io.loaders` module) before `utils.py` is deleted.
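On this branch `load_table` indeed landed next to `SurveyCollection` (see the 7.0.0 changelog entry), so the migrated import becomes:

```python
# Before (legacy re-export, removed in 7.0.0):
#     from openfisca_survey_manager.utils import load_table
# After: load_table lives next to SurveyCollection, which removes the
# utils -> survey_collections dependency described in the note above.
from openfisca_survey_manager.core.dataset import SurveyCollection, load_table
```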
---

## 3. Files to delete (re-exports)

Once all imports are updated according to sections 1 and 2, the following files can be deleted (they contain nothing but re-exports):

- `config.py`
- `paths.py`
- `tables.py`
- `surveys.py`
- `survey_collections.py`
- `read_sas.py`
- `read_spss.py`
- `read_dbf.py`
- `calibration.py`
- `calmar.py`
- `utils.py` (after moving `load_table` and updating the imports listed in 2.5)

---

## 4. Modules without a re-export (canonical imports)

These modules have no re-export file at the package root; internal code already uses them. For external code or documentation, the canonical imports are:

| Symbol | Canonical import |
|--------|------------------|
| `harmonize_data_frame_columns` | `from openfisca_survey_manager.processing.harmonization import harmonize_data_frame_columns` (or `from openfisca_survey_manager.processing import harmonize_data_frame_columns`) |
| `write_table_to_hdf5` | `from openfisca_survey_manager.io.hdf import write_table_to_hdf5` (or `from openfisca_survey_manager.io.writers import write_table_to_hdf5`) |
| `write_table_to_parquet` | `from openfisca_survey_manager.io.writers import write_table_to_parquet` |

---

## 5. Root package `openfisca_survey_manager`

Today the package's `__init__.py` exposes only the exceptions. If external code does, say, `from openfisca_survey_manager import read_sas`, it relies on the `read_sas.py` submodule. **After the re-exports are removed**, these import paths are no longer valid (they fail at import time); migrate them to `from openfisca_survey_manager.io.readers import read_sas` (see section 1).

Before or alongside the migration: check this repository and the dependent projects (openfisca-france-data, etc.) for imports from the package root or from the old modules listed in section 3.

---

## 6. Recommended migration order

1. **Move `load_table`** to its final module (e.g. `core.dataset` or `io.loaders`) and update the call sites (section 2.5).
2. **Update all internal imports** (section 2) to the new paths, file by file.
3. **Run the test suite**: `pytest`; fix anything missed until there are 0 failures. A cold-import pass also helps (a sketch follows at the end of this document).
4. **Delete the re-export files** listed in section 3.
5. **Check external usages** (section 5) and document the changes in the CHANGELOG (breaking changes).

---

## 7. Optional follow-ups

- Rename the `common/` directory to `utils/` once `utils.py` is deleted (as planned in the refactoring plan).
- Rename `configuration/` to `config/` for a shorter name (consistent with the plan).
- These renames will require another round of import updates (configuration → config, common → utils).
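For step 3 of section 6, each migrated submodule can be imported in a fresh interpreter, so that no cached module masks a forgotten legacy import or a cycle. A minimal sketch (the module list is illustrative, extend it as needed):

```python
# Cold-import check: one fresh interpreter per module.
import subprocess
import sys

modules = [
    "openfisca_survey_manager.core.table",
    "openfisca_survey_manager.core.survey",
    "openfisca_survey_manager.core.dataset",
    "openfisca_survey_manager.io.readers",
    "openfisca_survey_manager.processing.weights",
]
for module in modules:
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
```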
diff --git a/docs/REFACTORING_PLAN.md b/docs/REFACTORING_PLAN.md
index bfd2494d..bd56bf6d 100644
--- a/docs/REFACTORING_PLAN.md
+++ b/docs/REFACTORING_PLAN.md
@@ -35,9 +35,10 @@ openfisca_survey_manager/
 │   └── misc.py            # shared helpers (avoid circular imports)
 │
 ├── scenarios/             # unchanged for now
+├── policy/                # simulations, simulation_builder, aggregates (eventually a separate package)
 ├── scripts/
 ├── tests/
-└── ... (simulations, aggregates, etc. to be placed by responsibility)
+└── ...
 ```

 **Current state**: the following directories exist with preparatory `__init__.py` files (no code moved yet):

@@ -49,10 +50,12 @@ Le déplacement effectif des modules se fera par étapes pour garder la compatib
 **Done**:

 - `io/readers.py`: `read_sas`, `read_spss`, `read_dbf` (old modules now re-export).
-- `common/misc.py`: survey-independent helpers (`do_nothing`, `inflate_parameters`, `asof`, `parameters_asof`, `variables_asof`, `stata_files_to_data_frames`); `utils.py` imports from `common.misc` and keeps `load_table`.
+- `common/misc.py`: survey-independent helpers (`do_nothing`, `inflate_parameters`, `asof`, `parameters_asof`, `variables_asof`); `utils.py` imports from `common.misc` and keeps `load_table`.
 - **Cleanup**: `print()` calls replaced with `logging` (matching, calmar, scenarios, scripts/build_collection, simulations). Generic exceptions replaced with `SurveyManagerError` / `SurveyConfigError` / `SurveyIOError` (survey_collections, tables, simulations, simulation_builder, surveys, scenarios, calmar).
 - **processing/weights**: `calmar` and `Calibration` moved to `processing/weights/calmar.py` and `processing/weights/calibration.py`; root `calibration.py` and `calmar.py` are compatibility re-exports.
 - **processing/cleaning**: `clean_data_frame` moved to `processing/cleaning.py`; `tables.py` imports from `processing.cleaning` (compatibility preserved).
+- **policy/**: directory created for `simulations`, `simulation_builder`, `aggregates` (to be moved to a dedicated package eventually). The root modules (`simulations.py`, `simulation_builder.py`, `aggregates.py`) are placeholders with a `DeprecationWarning` that re-export from `policy`.
+- **policy/tests/**: tests for the policy package (test_aggregates, test_compute_aggregate, test_compute_pivot_table, test_compute_winners_losers, test_create_data_frame_by_entity, test_marginal_tax_rate, test_summarize_variables). They import from `openfisca_survey_manager.policy` and use `create_randomly_initialized_survey_scenario` from `openfisca_survey_manager.tests.test_scenario`.

 ---

@@ -83,24 +86,26 @@ Aujourd’hui ces couches sont entremêlées (ex. lecture + nettoyage dans `tabl
 ### 3.1 Long functions (> 100 lines)

-- Split the big functions into named steps, for example:
-  - `load_survey()` → `_parse_config()`, `_load_raw_data()`, `_transform()`, `_store()`.
-- Goal: readability and testability, without changing behavior.
+- **Started**: splitting into named steps without changing behavior.
+  - `core.table.Table.read_source` → `_read_csv_with_inferred_encoding()`, `_apply_stata_categorical_strategy()`; `read_source()` orchestrates.
+  - `core.survey.Survey.get_values` → `_get_values_from_hdf5()`, `_get_values_from_parquet()`; `get_values()` orchestrates and applies the harmonization.
+- Still to do: other modules (simulations, scenarios, scripts, processing/weights/calmar, etc.).

 ### 3.2 Circular dependencies

+- **Verified** (cold imports): no circular dependency. The chain is consistent: `exceptions` → `configuration` → `io`/`processing` → `core.table` → `core.survey` → `core.dataset`; `utils` → `common.misc`, `survey_collections`; `core.table` only imports `Survey` lazily, in `Table.__init__`. If cycles appear: extract the shared logic into `common/` or `configuration/`.
 - If modules import each other, extract the shared logic into `utils/` (or `config/`) and make both sides depend on that shared module.
 - Check with cold imports (start the app and import the submodules).

 ### 3.3 Python typing

-- Progressively add type hints on the public signatures (arguments and return types).
-- Priority: `core/`, `io/`, then `processing/`.
+- **Started**: type hints on the public signatures of `core/`, `io/` and `processing/` (cleaning, harmonization, weights/calmar, weights/calibration).
+- Still to do: the rest of the package (scenarios, simulations, etc.).

 ### 3.4 Logging

-- Replace `print()` calls with structured `logging` (already started in matching, calmar).
-- Extend to all modules (readers, writers, calibration, etc.).
+- **Done**: `print()` calls replaced with structured `logging` (matching, calmar, scenarios, scripts/build_collection, simulations, readers, writers, calibration, core, processing, etc.).
+- **Done**: logging extended to all business modules (configuration/models, google_colab, statshelpers, and every other module concerned).

 ### 3.5 Centralized error handling

diff --git a/docs/RFC-001-OPENFISCA-DATA-STACK.md b/docs/RFC-001-OPENFISCA-DATA-STACK.md
new file mode 100644
index 00000000..168b603f
--- /dev/null
+++ b/docs/RFC-001-OPENFISCA-DATA-STACK.md
@@ -0,0 +1,134 @@
# RFC-001: OpenFisca Data Stack

**Status**: Draft
**Issue**: [#381](https://github.com/openfisca/openfisca-survey-manager/issues/381)
**Author(s)**: OpenFisca team
**Date**: 2025-01

---

## Summary

This RFC formalizes an **OpenFisca data stack** with explicit roles and boundaries. It defines the target evolution of the current `openfisca-survey-manager` into a reusable data core (`openfisca-data-manager`), the place of the country repositories (`openfisca-<country>-data`) and of the analysis layer (`openfisca-policy-analysis`). It is the reference for refactor PRs and for upcoming changes.

---

## 1. Context and motivation

### 1.1 Problem

Today, survey data management and policy analysis are tightly coupled inside `openfisca-survey-manager`. As a result:

- the boundary between "microdata access" and "analysis (scenarios, reforms, aggregates)" is blurry;
- the data core is rarely reused outside policy use cases;
- the stack is hard to evolve (backend, schemas, reproducibility) without impacting everything at once.

### 1.2 Goal

Define a clear **OpenFisca Data Stack**: building blocks, responsibilities, target APIs and dependency rules, to guide the refactor and future work.

---

## 2. Goals and non-goals

### 2.1 Goals

- Separate conceptually (and eventually in code): **data** (access, storage, schemas) vs **analysis** (scenarios, reforms, aggregates).
- Propose a minimal, stable data API (v1.0) for microdata access.
- Clarify the role of each building block (data-manager, country-data, policy-analysis) and their dependencies.
- Align the refactor PRs (survey-manager) and the long-term decisions on this vision.

### 2.2 Non-goals

- This RFC sets no implementation timeline and no precise migration order.
- It does not detail the technical implementation (library choices, internal formats) beyond principles and target APIs.

---
## 3. Specification: the target OpenFisca Data Stack

### 3.1 Overview

```
OpenFisca Data Stack
├── openfisca-data-manager (data core, country-agnostic)
├── openfisca-<country>-data (per-country microdata → OpenFisca adaptation)
├── openfisca-policy-analysis (scenarios, reforms, aggregates, indicators)
└── OpenFisca Core (computation engine)
```

### 3.2 Building block 1: openfisca-data-manager

**Target evolution of the current openfisca-survey-manager (data core).**

- **Role**: a **universal**, **country-independent** building block.
- **Responsibilities**:
  - backend abstraction (parquet by default, HDF during the transition);
  - versioned dataset management;
  - data pipeline (read, clean, write);
  - schema validation;
  - reproducible metadata;
  - a **stable microdata access API**.
- **What it must not do**:
  - depend on a tax and benefit system;
  - know about OpenFisca variables;
  - contain policy analysis.

**Minimal stable target API (v1.0)**:

```python
dataset = DataManager.load("lfs", year=2019)
df = dataset.to_pandas(columns=["income", "weight"])
```

Also exposed:

- `dataset.metadata`
- `dataset.schema`
- `dataset.hash`

### 3.3 Building block 2: openfisca-<country>-data

Examples: `openfisca-france-data`, `openfisca-tunisia-data`.

- **Role**: prepare the **microdata for OpenFisca ingestion**.
- **Dependencies**: `openfisca-data-manager`, `openfisca-<country>` (Core).
- **Responsibilities**:
  - mapping survey variables → OpenFisca variables;
  - building the entities and periods;
  - consistency validation against the tax and benefit system.

**Possible API**:

```python
adapter = CountryDataAdapter(dataset)
of_input = adapter.to_openfisca_entities()
```

### 3.4 Building block 3: openfisca-policy-analysis

Current survey-manager content to **migrate or extract** into this block (or a dedicated module):

- survey scenarios (baseline vs reform);
- weighted aggregations;
- inequality indicators;
- diagnostics.

This block relies on the microdata (through data-manager or country-data) and on OpenFisca Core for the computations.

---

## 4. Compatibility and link with the refactor

- The **openfisca-survey-manager refactor** PRs (reorganization, cleanup, processing/weights, core/io, typing, etc.) remain compatible with this RFC: they prepare the separation of layers without forcing a big bang.
- Later changes (decoupling data-manager / policy-analysis, exposing the v1.0 API) can reference this RFC (and issue #381) as the long-term target.
- No public API break is required in the short term; the RFC describes a target and a direction.

---

## 5. References

- [Issue #381](https://github.com/openfisca/openfisca-survey-manager/issues/381) (Data Stack vision).
- `docs/REFACTORING_PLAN.md` (internal reorganization of the survey-manager).
- `docs/MIGRATION_IMPORTS.md` (import migration after the re-exports are removed).
- `docs/TICKET_OPENFISCA_DATA_STACK.md` (original ticket version, superseded by this RFC).
diff --git a/docs/RFC-002-METADATA-AND-CONFIG.md b/docs/RFC-002-METADATA-AND-CONFIG.md
new file mode 100644
index 00000000..e803f9de
--- /dev/null
+++ b/docs/RFC-002-METADATA-AND-CONFIG.md
@@ -0,0 +1,213 @@
# RFC-002: Metadata and configuration architecture

**Status**: Implemented (config.yaml + manifest loading, legacy compatibility)
**Branch**: feature/backend
**Date**: 2025-01

---
## 1. Summary

This RFC proposes a **simpler and more standard** architecture for metadata and path management in openfisca-survey-manager, based on the XDG conventions, a single configuration format per directory, and a predictable directory layout. It plans a **progressive migration** from the current state.

---

## 2. Current state (to migrate)

### 2.1 Where is the config?

The configuration directory ("config_files_directory") is resolved in `configuration/paths.py` through a **cascade of hacks**:

| Priority | Condition | Directory |
|----------|-----------|-----------|
| 1 | `taxipp` package importable and the directory exists | `taxipp_install/.config/openfisca-survey-manager` |
| 2 | `openfisca_france_data` package importable and the directory exists | `BaseDirectory.save_config_path("openfisca-survey-manager")` → **~/.config/openfisca-survey-manager** |
| 3 | CI or pytest | `openfisca_survey_manager/tests/data_files` |
| 4 | Fallback | `~/.config/openfisca-survey-manager` (XDG) |

Problems: the order depends on which packages are importable, config.ini may be written at import time by the tests, and an assertion fails if the directory does not exist.

### 2.2 Files in the config directory

Today, **two INI files** plus external **JSON** files:

- **config.ini** (required in the directory)
  - `[collections]`: `collections_directory`, plus `collection_name = path` pairs pointing to a JSON file.
  - `[data]`: `output_directory`, `tmp_directory` (and, in tests, `input_directory`).
- **raw_data.ini** (used only by the `build-collection` script)
  - One section per collection: `[collection_name]`.
  - Keys = survey names, values = paths to the raw data directory or file.
- **JSON files** (one per collection; path given in config.ini or under `collections_directory`)
  - Content: `name`, `label`, `surveys`: { `survey_name` → survey metadata (tables, hdf5_file_path, parquet_file_path, **informations** including `csv_files`, `sas_files`, etc.) }.

Metadata is thus spread across config.ini (where to find the JSON files), raw_data.ini (where the raw data lives, only for build) and the JSON files (survey descriptions, storage paths, source file lists). Redundant, and with two different INI formats.

### 2.3 Usage in the code

- **SurveyCollection**: reads `Config(config_files_directory)` → `config.ini`; gets/sets `collections` (name → json_path); `config.get("data", "output_directory")` for `fill_store`.
- **Survey**: `informations` (dict) contains e.g. `csv_files`, `sas_files`; used in `fill_store` to know which files to read.
- **build_collection**: reads `raw_data.ini` to know which raw directories map to which surveys, then creates/updates the JSON collection and the data.

---

## 3. Proposal: target architecture

### 3.1 Principles

1. **A single configuration directory**: XDG only by default, or an explicit path (environment variable or argument). No more resolution based on `taxipp` / `openfisca_france_data`.
2. **A single config file per directory**: everything that is "global config" (base paths, options) goes into one file (see 3.2).
3. **Dataset metadata next to the data**: a "dataset" (formerly a collection) = a dedicated directory with a manifest (metadata) inside, rather than a scattered JSON referenced from an INI file.
4. **Standard and readable**: YAML or plain INI for the config; YAML or JSON for the manifests (possible alignment with RFC-001 Data Stack).

### 3.2 Configuration directory (XDG)

**Default location**: `$XDG_CONFIG_HOME/openfisca-survey-manager/` (otherwise `~/.config/openfisca-survey-manager/`).

Proposed content:

```
~/.config/openfisca-survey-manager/
├── config.yaml   # single config file (replaces config.ini + raw_data.ini for the "where things are" part)
```

**config.yaml** (example):

```yaml
# Directory holding the collections/datasets (manifests + derived data)
collections_dir: ~/.local/share/openfisca-survey-manager/collections

# Default output directory for build / fill_store (optional, can be overridden per dataset)
default_output_dir: ~/.local/share/openfisca-survey-manager/output

# Temporary directory (optional)
tmp_dir: /tmp/openfisca-survey-manager
```

Alternative, if INI is kept: a single **config.ini** with clear sections, e.g.:

```ini
[paths]
collections_dir = ~/.local/share/openfisca-survey-manager/collections
default_output_dir = ~/.local/share/openfisca-survey-manager/output
tmp_dir = /tmp/openfisca-survey-manager
```

Dropped: the `[collections]` section with one entry per collection (the manifests live inside each dataset, see 3.3). Also dropped: **raw_data.ini** (the raw sources are described in the dataset manifest).

### 3.3 Layout of a dataset (formerly a collection)

A dataset = a directory under `collections_dir` (or a configured absolute path), with a **manifest** inside:

```
collections_dir/
└── erfs/
    ├── manifest.yaml   # dataset metadata + list of surveys
    ├── erfs_2019/      # (optional) derived data per survey
    │   ├── data.parquet
    │   └── ...
    └── erfs_2020/
        └── ...
```

**manifest.yaml** (example):

```yaml
name: erfs
label: "Enquête Revenus Fiscaux et Sociaux"

# Storage backend for the tables (hdf5, parquet, zarr); defaults to parquet
store_format: parquet

# Per survey: raw sources (replaces raw_data.ini + informations)
surveys:
  erfs_2019:
    label: "ERFS 2019"
    source:
      format: sas            # or csv, stata, parquet
      path: /data/erfs/2019  # directory or file
    # optional: output paths relative to the dataset
    output_subdir: erfs_2019

  erfs_2020:
    label: "ERFS 2020"
    source:
      format: parquet
      path: /data/erfs/2020
    output_subdir: erfs_2020
```

This replaces the `[erfs]` section of raw_data.ini and the "informations" part (csv_files, sas_files, ...) of the collection JSON. One single place for "where the raw data lives" and "where to write the outputs".

For backward compatibility, an **adapter** can read the legacy JSON + raw_data.ini and produce (or expose) an equivalent manifest.

### 3.4 Config directory resolution (simplified)

- **Explicit value**: a `config_dir` (or `config_files_directory`) argument can always be passed to the APIs and the CLI.
- **Default**: `os.environ.get("OPENFISCA_SURVEY_CONFIG_DIR")`, otherwise `xdg_config_home() / "openfisca-survey-manager"`.
- **Tests**: a dedicated directory (e.g. `tests/data_files`) passed explicitly by the tests; no more import-time side effect (no config.ini written when `paths` is loaded).

The directory is **no longer** resolved based on the presence of `taxipp` or `openfisca_france_data`. Projects (france-data, taxipp) can:

- either set `OPENFISCA_SURVEY_CONFIG_DIR` to their directory,
- or pass the config path on every call.
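This resolution order is what `configuration.config_loader.get_config_dir` (added on this branch) implements; for instance:

```python
import os

from openfisca_survey_manager.configuration import get_config_dir

# An explicit argument wins over everything else.
print(get_config_dir("/tmp/my-config"))  # /tmp/my-config

# Otherwise the OPENFISCA_SURVEY_CONFIG_DIR environment variable is used,
# with the XDG config directory as the final fallback.
os.environ["OPENFISCA_SURVEY_CONFIG_DIR"] = "/tmp/env-config"
print(get_config_dir())  # /tmp/env-config
```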
### 3.5 Storage backends (store)

Survey tables can be stored through different **backends** (chosen at build / `fill_store` time):

| Backend | Format | Usage |
|---------|--------|-------|
| **hdf5** | One .h5 file | Historical (to be deprecated eventually) |
| **parquet** | Directory, one .parquet per table | Recommended (interoperability, columnar) |
| **zarr** | .zarr directory, one group per table | Optional (`[zarr]` extra) |

- **API**: `io.backends.get_backend(name)`, `get_available_backend_names()`, `register_backend(name, backend)` to extend.
- **CLI**: `build-collection --parquet` or `build-collection --zarr`; HDF5 by default (with a warning).
- **Survey**: `store_format`, and `hdf5_file_path` / `parquet_file_path` / `zarr_file_path` depending on the backend.
- **Zarr (compression, parallelism)**: see [docs/ZARR-BACKEND.md](ZARR-BACKEND.md).

### 3.6 Target API (aligned with RFC-001)

- Load a dataset by name: `DataManager.load("erfs", config_dir=...)` → reads `collections_dir/erfs/manifest.yaml` and the associated data.
- Metadata access: `dataset.metadata` (from the manifest), `dataset.schema` (if exposed), derived paths computed deterministically from `collections_dir` + `name` + `output_subdir`.

The current "SurveyCollection.load(collection=...)" API is kept during the transition, backed internally by the new config + manifests (possibly through a bridge from the legacy JSON).

---

## 4. Migrating the current state

### 4.1 Keep the legacy behavior in parallel

- Keep reading **config.ini** and **raw_data.ini** as long as the new config is not present.
- If `config.yaml` (or the new config.ini [paths]) exists in the config directory: use the new layout (manifests under `collections_dir`).
- Otherwise: current behavior (config.ini [collections] + [data], raw_data.ini, external JSON files).

### 4.2 Migration script

A script migrates the current state to the new layout:

- **Location**: `openfisca_survey_manager.scripts.migrate_config_to_rfc002`
- **Usage**:
  ```bash
  python -m openfisca_survey_manager.scripts.migrate_config_to_rfc002 [--config-dir PATH] [--dry-run] [-v]
  ```
- **Behavior**: reads `config.ini` ([collections] + [data]) and, if present, `raw_data.ini`; for each collection, loads the JSON, derives `source.format` and `source.path` from `informations` (csv_files, sas_files, etc.) or from the matching raw_data.ini section; **infers `store_format`** (parquet, hdf5 or zarr) from the `parquet_file_path` / `zarr_file_path` / `hdf5_file_path` fields of the legacy JSON surveys, and writes it into the manifest; creates `config.yaml` and `collections_dir/<collection>/manifest.yaml` for each collection. With `--dry-run`, nothing is written.
- **Default config directory**: the one returned by `get_config_dir()` (env `OPENFISCA_SURVEY_CONFIG_DIR` or XDG). A directory can be forced with `--config-dir`.

### 4.3 Deprecation

- Eventually: deprecate config.ini's `[collections]` section (name → JSON mapping), `raw_data.ini`, and the legacy collection JSON files. Document the migration in MIGRATION_IMPORTS.md or a new MIGRATION_CONFIG.md.

---
## 5. Summary of the proposed changes

| Current | Target |
|---------|--------|
| Config resolution via taxipp / france_data / CI / XDG | XDG, or env `OPENFISCA_SURVEY_CONFIG_DIR`, or an explicit argument |
| config.ini + raw_data.ini | A single file (config.yaml or config.ini [paths]) |
| Collection JSON outside the directory, referenced from the config | A manifest (YAML/JSON) per dataset in `collections_dir/<name>/manifest.yaml` |
| Raw sources in raw_data.ini + informations (JSON) | Sources in the dataset manifest (`surveys.*.source`) |
| config.ini written when the paths load (tests) | No more import-time writes; tests pass an explicit config_dir |

The result is a **simpler** architecture (one format, one place per dataset), a **more standard** one (XDG, explicit paths), and a **migratable** one, since the legacy behavior is kept until the new config is in place.
diff --git a/docs/TICKET_OPENFISCA_DATA_STACK.md b/docs/TICKET_OPENFISCA_DATA_STACK.md
index fac3c70b..9939c187 100644
--- a/docs/TICKET_OPENFISCA_DATA_STACK.md
+++ b/docs/TICKET_OPENFISCA_DATA_STACK.md
@@ -1,4 +1,6 @@
-# 🏗️ Vision: an official OpenFisca Data Stack
+# 🏗️ Vision: an official OpenFisca Data Stack ([issue #381](https://github.com/openfisca/openfisca-survey-manager/issues/381))
+
+> **See the RFC version**: [RFC-001: OpenFisca Data Stack](RFC-001-OPENFISCA-DATA-STACK.md). This document keeps the original ticket wording; the RFC is the normative form.

 **Goal**: Formalize a clear OpenFisca data stack, with well-defined roles and boundaries. This ticket is the reference for the refactor PRs (survey-manager → data-manager, decoupling, etc.) and for upcoming changes.
diff --git a/docs/ZARR-BACKEND.md b/docs/ZARR-BACKEND.md
new file mode 100644
index 00000000..f375f609
--- /dev/null
+++ b/docs/ZARR-BACKEND.md
@@ -0,0 +1,126 @@
# Using Zarr with OpenFisca Survey Manager

This document explains **whether and how** to use the Zarr backend to store surveys, and where things stand on **compression** and on **read/write parallelism**.

---

## 1. Using Zarr with OpenFisca

### Yes, it is possible

The **zarr** backend is available in `openfisca-survey-manager` provided the optional dependency is installed:

```bash
pip install openfisca-survey-manager[zarr]
# or
pip install openfisca-survey-manager zarr numcodecs
```

(pandas 2.x is used through `to_zarr` / `read_zarr`; the **zarr** package is required.)

### From the command line (build-collection)

To build a collection with the tables stored in Zarr format:

```bash
build-collection -c ma_collection --zarr
```

Without `--zarr`, the default format remains HDF5 (with a warning), or you can use `--parquet`.

### From Python (fill_store)

```python
from openfisca_survey_manager.core.dataset import SurveyCollection

col = SurveyCollection.load(collection="ma_collection", config_files_directory="...")
col.fill_store(
    source_format="sas",  # or csv, parquet, etc.
    store_format="zarr",
)
```

After that, each survey has a `{output}/{survey.name}.zarr` directory, and each table is a **zarr group** (a subdirectory) in that store. Reading works as usual with `survey.get_values(table=..., variables=...)`; the code automatically picks the zarr backend when `store_format == "zarr"`.
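Reading one table back from the zarr store then looks like this (survey, table and variable names are illustrative):

```python
# The surveys list is populated by SurveyCollection.load above.
survey = next(s for s in col.surveys if s.name == "ma_collection_2019")
df = survey.get_values(table="menage", variables=["ident", "weight"])
```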
### Checking that Zarr is available

```python
from openfisca_survey_manager.io.backends import get_available_backend_names, get_backend

print(get_available_backend_names())  # contains "zarr" if the package is installed
backend = get_backend("zarr")  # raises ValueError if zarr is missing
```

---

## 2. Compression

### Current behavior

In the current implementation, Zarr writes go through `pandas.DataFrame.to_zarr(path, mode="w")` **without explicit compression options**. Zarr/pandas may therefore apply a default behavior (e.g. light compression, or none, depending on versions).

### What Zarr supports in general

Zarr compresses **per block (chunk)** through **numcodecs**. Typical choices:

- **Blosc** (LZ4, Zstd, Zlib): good speed/ratio trade-off, very widely used
- **Zstd**: good ratio, fast decompression
- **LZ4**: very fast, lower ratio
- **Gzip**: standard, slower

These options are configured when the zarr array is **created** (compressor, chunks). With **pandas**:

- `df.to_zarr(path, ...)` may accept extra arguments forwarded to the underlying zarr store (depending on the pandas version).
- For fine-grained control (compression, chunking), you can create the zarr store yourself with the right `compressor` and write the columns into it, or extend the backend (see below).

### Possible evolution in the survey-manager

The Zarr backend could be extended to accept options (compression, chunks), either:

- through **kwargs** in `fill_store(..., store_format="zarr", **zarr_options)` forwarded to `to_zarr`,
- or through the **config** (manifest or config.yaml) to define a default compressor for the zarr format.

Today, if you need a specific compression setup, you can:

1. **Register a custom backend** (`register_backend`) that calls `to_zarr` with the `compressor` (and possibly the chunks) of your choice.
2. Or **post-process** the generated `.zarr` directories (rewriting them with other zarr options) outside the survey-manager.

---

## 3. Read/write parallelism

### Zarr in general

- **Per-chunk parallelism**: Zarr is designed so that **different chunks** can be read or written in parallel without a global lock (each chunk is independent).
- **In Python**: the **GIL** limits what threads gain on the compression/decompression side; effective parallelism usually means **multiple processes**, or runtimes that release the GIL (Cython, the C extensions behind numcodecs/blosc).
- **Bottleneck**: in practice, **compression/decompression** can saturate the CPU (~1 GB/s) while the disk or network can go faster; improvements (batch encode/decode, GPU) are in progress in the zarr ecosystem.

### In the survey-manager today

- **Writing**: `fill_store(store_format="zarr")` calls `to_zarr` for each table, **sequentially** (one table after another, no internal parallelism exposed).
- **Reading**: `get_values()` uses `read_zarr` for a given table, also **sequentially** per call.

So, **by default**: no multi-table or multi-chunk parallelism is exposed by the current API.

### How to parallelize anyway

1. **Several tables / several surveys**
   You can parallelize at the application level yourself: launch several processes or threads that call `fill_store` (or `get_values`) on different collections/surveys/tables; each process writes/reads its own files or zarr groups without conflict (a sketch follows this list).

2. **Dask**
   For zarr arrays, **Dask** (dask.array, or loading the zarr stores into Dask) handles parallel, per-chunk loading. This does not go through the current Survey/SurveyCollection API: either export the `.zarr` paths and open them with Dask, or add an integration layer (e.g. a function returning a Dask DataFrame from a zarr survey).

3. **Backend evolution**
   A "parallel write per table" mode (threads/processes), or a read option returning a Dask object to exploit zarr's per-chunk parallelism, could be added later.
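A sketch of option 1 for reads, with one process per table (collection, survey and table names are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

from openfisca_survey_manager.core.dataset import SurveyCollection

TABLES = ["menage", "individu", "foyer"]

def read_table(table: str):
    # Each worker opens its own collection: zarr chunks are independent,
    # so concurrent reads of different tables do not conflict.
    col = SurveyCollection.load(collection="ma_collection", config_files_directory="...")
    survey = next(s for s in col.surveys if s.name == "ma_collection_2019")
    return survey.get_values(table=table)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        frames = dict(zip(TABLES, pool.map(read_table, TABLES)))
```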
---

## 4. Practical summary

| Question | Answer |
|----------|--------|
| **Use Zarr with OpenFisca?** | Yes: `pip install openfisca-survey-manager[zarr]`, then `build-collection --zarr` or `fill_store(store_format="zarr")`. |
| **Compression?** | By default: the zarr/pandas behavior (often light). For more control: a custom backend calling `to_zarr(..., compressor=...)`, or post-processing of the zarr stores. |
| **Read/write parallelism?** | Not exposed by the current API (one table at a time). Possible anyway: roll your own across tables/surveys, or use Dask on the generated zarr paths. |

A follow-up proposal could detail an API for passing compression (and possibly chunking) options to the Zarr backend through `fill_store` or the config.
diff --git a/openfisca_survey_manager/calibration.py b/openfisca_survey_manager/calibration.py
deleted file mode 100644
index 1670538c..00000000
--- a/openfisca_survey_manager/calibration.py
+++ /dev/null
@@ -1,8 +0,0 @@
"""Re-export for backward compatibility.

Prefer: from openfisca_survey_manager.processing.weights import Calibration.
"""

from openfisca_survey_manager.processing.weights import Calibration

__all__ = ["Calibration"]
diff --git a/openfisca_survey_manager/calmar.py b/openfisca_survey_manager/calmar.py
deleted file mode 100644
index 77c2ba53..00000000
--- a/openfisca_survey_manager/calmar.py
+++ /dev/null
@@ -1,5 +0,0 @@
"""Re-export for backward compatibility. Prefer: from openfisca_survey_manager.processing.weights import calmar."""

from openfisca_survey_manager.processing.weights import calmar, check_calmar

__all__ = ["calmar", "check_calmar"]
diff --git a/openfisca_survey_manager/common/__init__.py b/openfisca_survey_manager/common/__init__.py
index 5ca887c8..cc8c0061 100644
--- a/openfisca_survey_manager/common/__init__.py
+++ b/openfisca_survey_manager/common/__init__.py
@@ -1,13 +1,9 @@
-# Target: shared helpers to avoid circular imports (from utils.py, paths, etc.).
-# Final name will be utils/ once utils.py is migrated. See docs/REFACTORING_PLAN.md.
-
-from openfisca_survey_manager.common.misc import (
+from openfisca_survey_manager.policy.legislation_asof import (  # Backward-compat imports
     asof,
     do_nothing,
     inflate_parameter_leaf,
     inflate_parameters,
     parameters_asof,
-    stata_files_to_data_frames,
     variables_asof,
 )

@@ -17,6 +13,5 @@
     "inflate_parameter_leaf",
     "inflate_parameters",
     "parameters_asof",
-    "stata_files_to_data_frames",
     "variables_asof",
 ]
diff --git a/openfisca_survey_manager/common/misc.py b/openfisca_survey_manager/common/misc.py
index 69ce1400..fe95bf70 100644
--- a/openfisca_survey_manager/common/misc.py
+++ b/openfisca_survey_manager/common/misc.py
@@ -1,251 +1,24 @@
-"""Shared helpers (no survey collection dependency) to avoid circular imports."""
-
-import logging
-from pathlib import Path
-
-import pandas as pd
-from openfisca_core import periods
-from openfisca_core.parameters import ParameterNode, Scale
-
-log = logging.getLogger(__name__)
-
-
-def do_nothing(*args, **kwargs):
-    return None
-
-
-def inflate_parameters(
-    parameters,
-    inflator,
-    base_year,
-    last_year=None,
-    ignore_missing_units=False,
-    start_instant=None,
-    round_ndigits=2,
-):
-    """
-    Inflate a Parameter node or a Parameter leaf for the years between base_year and last_year.
-
-    ::parameters:: a Parameter node or a Parameter leaf
-    ::inflator:: rate used to inflate the parameter. The rate is unique for all the years
-    ::base_year:: base year of the parameter
-    ::last_year:: last year of inflation
-    ::ignore_missing_units:: if True, a parameter leaf without unit in metadata will not be inflated
-    ::start_instant:: Instant of the year when the update should start, if None will be January 1st
-    ::round_ndigits:: Number of digits to keep in the rounded result
-    """
-    if (last_year is not None) and (last_year > base_year + 1):
-        for year in range(base_year + 1, last_year + 1):
-            inflate_parameters(
-                parameters,
-                inflator,
-                year - 1,
-                last_year=year,
-                ignore_missing_units=ignore_missing_units,
-                start_instant=start_instant,
-                round_ndigits=round_ndigits,
-            )
-    else:
-        if last_year is None:
-            last_year = base_year + 1
-
-        assert last_year == base_year + 1
-
-        if isinstance(parameters, ParameterNode):
-            for sub_parameter in parameters.children.values():
-                inflate_parameters(
-                    sub_parameter,
-                    inflator,
-                    base_year,
-                    last_year,
-                    ignore_missing_units=ignore_missing_units,
-                    start_instant=start_instant,
-                    round_ndigits=round_ndigits,
-                )
-        else:
-            acceptable_units = [
-                "rate_unit",
-                "threshold_unit",
-                "unit",
-            ]
-            if ignore_missing_units:
-                if not hasattr(parameters, "metadata"):
-                    return
-                if not bool(set(parameters.metadata.keys()) & set(acceptable_units)):
-                    return
-            assert hasattr(parameters, "metadata"), f"{parameters.name} doesn't have metadata"
-            unit_types = set(parameters.metadata.keys()).intersection(set(acceptable_units))
-            assert unit_types, (
-                f"No admissible unit in metadata for parameter {parameters.name}. You may consider using "
-                "the option 'ignore_missing_units' from the inflate_parameters() function."
-            )
-            if len(unit_types) > 1:
-                assert unit_types == {"threshold_unit", "rate_unit"}, (
-                    f"Too much admissible units in metadata for parameter {parameters.name}"
-                )
-            unit_by_type = {unit_type: parameters.metadata[unit_type] for unit_type in unit_types}
-            for unit_type in unit_by_type:
-                if parameters.metadata[unit_type].startswith("currency"):
-                    inflate_parameter_leaf(
-                        parameters,
-                        base_year,
-                        inflator,
-                        unit_type=unit_type,
-                        start_instant=start_instant,
-                        round_ndigits=round_ndigits,
-                    )
-
-
-def inflate_parameter_leaf(sub_parameter, base_year, inflator, unit_type="unit", start_instant=None, round_ndigits=2):
-    """
-    Inflate a Parameter leaf according to unit type for the year after base_year.
-
-    ::sub_parameter:: a Parameter leaf
-    ::base_year:: base year of the parameter
-    ::inflator:: rate used to inflate the parameter
-    ::unit_type:: unit supposed by default. Other admissible unit types are threshold_unit and rate_unit
-    ::start_instant:: Instant of the year when the update should start, if None will be January 1st
-    ::round_ndigits:: Number of digits to keep in the rounded result
-    """
-    if isinstance(sub_parameter, Scale):
-        if unit_type == "threshold_unit":
-            for bracket in sub_parameter.brackets:
-                threshold = bracket.children["threshold"]
-                inflate_parameter_leaf(
-                    threshold, base_year, inflator, start_instant=start_instant, round_ndigits=round_ndigits
-                )
-            return
-    else:
-        kept_instants_str = [
-            parameter_at_instant.instant_str
-            for parameter_at_instant in sub_parameter.values_list
-            if periods.instant(parameter_at_instant.instant_str).year <= base_year
-        ]
-        if not kept_instants_str:
-            return
-
-        last_admissible_instant_str = max(kept_instants_str)
-        sub_parameter.update(start=last_admissible_instant_str, value=sub_parameter(last_admissible_instant_str))
-        if start_instant is not None:
-            assert periods.instant(start_instant).year == (base_year + 1), (
-                "Year of start_instant should be base_year + 1"
-            )
-            value = (
-                round(sub_parameter(f"{base_year}-12-31") * (1 + inflator), round_ndigits)
-                if sub_parameter(f"{base_year}-12-31") is not None
-                else None
-            )
-            sub_parameter.update(
-                start=start_instant,
-                value=value,
-            )
-        else:
-            restricted_to_base_year_value_list = [
-                parameter_at_instant
-                for parameter_at_instant in sub_parameter.values_list
-                if periods.instant(parameter_at_instant.instant_str).year == base_year
-            ]
-            if restricted_to_base_year_value_list:
-                for parameter_at_instant in reversed(restricted_to_base_year_value_list):
-                    if parameter_at_instant.instant_str.startswith(str(base_year)):
-                        value = (
-                            round(parameter_at_instant.value * (1 + inflator), round_ndigits)
-                            if parameter_at_instant.value is not None
-                            else None
-                        )
-                        sub_parameter.update(
-                            start=parameter_at_instant.instant_str.replace(str(base_year), str(base_year + 1)),
-                            value=value,
-                        )
-            else:
-                value = (
-                    round(sub_parameter(f"{base_year}-12-31") * (1 + inflator), round_ndigits)
-                    if sub_parameter(f"{base_year}-12-31") is not None
-                    else None
-                )
-                sub_parameter.update(
-                    start=f"{base_year + 1}-01-01",
-                    value=value,
-                )
-
-
-def asof(tax_benefit_system, instant):
-    parameters = tax_benefit_system.parameters
-    parameters_asof(parameters, instant)
-    variables_asof(tax_benefit_system, instant)
-
-
-def leaf_asof(sub_parameter, instant):
-    kept_instants_str = [
-        parameter_at_instant.instant_str
-        for parameter_at_instant in sub_parameter.values_list
-        if periods.instant(parameter_at_instant.instant_str) <= instant
-    ]
-    if not kept_instants_str:
-        sub_parameter.values_list = []
-        return
-
-    last_admissible_instant_str = max(kept_instants_str)
-    sub_parameter.update(start=last_admissible_instant_str, value=sub_parameter(last_admissible_instant_str))
-
-
-def parameters_asof(parameters, instant):
-    if isinstance(instant, str):
-        instant = periods.instant(instant)
-    assert isinstance(instant, periods.Instant)
-
-    for sub_parameter in parameters.children.values():
-        if isinstance(sub_parameter, ParameterNode):
-            parameters_asof(sub_parameter, instant)
-        else:
-            if isinstance(sub_parameter, Scale):
-                for bracket in sub_parameter.brackets:
-                    threshold = bracket.children["threshold"]
-                    rate = bracket.children.get("rate")
-                    amount = bracket.children.get("amount")
-                    leaf_asof(threshold, instant)
-                    if rate:
-                        leaf_asof(rate, instant)
-                    if amount:
-                        leaf_asof(amount, instant)
-            else:
-                leaf_asof(sub_parameter, instant)
-
-
-def variables_asof(tax_benefit_system, instant, variables_list=None):
-    if isinstance(instant, str):
-        instant = periods.instant(instant)
-    assert isinstance(instant, periods.Instant)
-
-    if variables_list is None:
-        variables_list = tax_benefit_system.variables.keys()
-
-    for variable_name, variable in tax_benefit_system.variables.items():
-        if variable_name in variables_list:
-            formulas = variable.formulas
-            for instant_str in list(formulas.keys()):
-                if periods.instant(instant_str) > instant:
-                    del formulas[instant_str]
-
-            if variable.end is not None and periods.instant(variable.end) >= instant:
-                variable.end = None
-
-
-def stata_files_to_data_frames(data, period=None):
-    assert period is not None
-    period = periods.period(period)
-
-    stata_file_by_entity = data.get("stata_file_by_entity")
-    if stata_file_by_entity is None:
-        return
-
-    variables_from_stata_files = []
-    input_data_frame_by_entity_by_period = {}
-    input_data_frame_by_entity_by_period[periods.period(period)] = input_data_frame_by_entity = {}
-    for entity, file_path in stata_file_by_entity.items():
-        assert Path(file_path).exists(), f"Invalid file path: {file_path}"
-        entity_data_frame = input_data_frame_by_entity[entity] = pd.read_stata(file_path)
-        variables_from_stata_files += list(entity_data_frame.columns)
-    data["input_data_frame_by_entity_by_period"] = input_data_frame_by_entity_by_period
-
-    return variables_from_stata_files
+"""Backward-compatibility wrapper for legislation helpers.
+
+Use ``openfisca_survey_manager.policy.legislation_asof`` as the canonical import path.
+"""
+
+from openfisca_survey_manager.policy.legislation_asof import (
+    asof,
+    do_nothing,
+    inflate_parameter_leaf,
+    inflate_parameters,
+    leaf_asof,
+    parameters_asof,
+    variables_asof,
+)
+
+__all__ = [
+    "asof",
+    "do_nothing",
+    "inflate_parameter_leaf",
+    "inflate_parameters",
+    "leaf_asof",
+    "parameters_asof",
+    "variables_asof",
+]
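During the transition both import paths expose the same objects, which a quick check confirms:

```python
from openfisca_survey_manager.common.misc import parameters_asof as legacy
from openfisca_survey_manager.policy.legislation_asof import parameters_asof as canonical

assert legacy is canonical  # the wrapper only re-exports
```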
diff --git a/openfisca_survey_manager/config.py b/openfisca_survey_manager/config.py
deleted file mode 100644
index 2e09a43c..00000000
--- a/openfisca_survey_manager/config.py
+++ /dev/null
@@ -1,8 +0,0 @@
"""Re-export for backward compatibility.

Prefer: from openfisca_survey_manager.configuration.models import Config.
"""

from openfisca_survey_manager.configuration.models import Config

__all__ = ["Config"]
diff --git a/openfisca_survey_manager/configuration/__init__.py b/openfisca_survey_manager/configuration/__init__.py
index b8cab758..137908e1 100644
--- a/openfisca_survey_manager/configuration/__init__.py
+++ b/openfisca_survey_manager/configuration/__init__.py
@@ -1,6 +1,11 @@
 # Config and paths; config.py and paths.py re-export for compatibility.
-# See docs/REFACTORING_PLAN.md.
+# See docs/REFACTORING_PLAN.md. RFC-002: config_loader for config.yaml + manifest.

+from openfisca_survey_manager.configuration.config_loader import (
+    get_config_dir,
+    load_config,
+    load_manifest,
+)
 from openfisca_survey_manager.configuration.models import Config
 from openfisca_survey_manager.configuration.paths import (
     config_ini,
@@ -15,7 +20,10 @@
     "Config",
     "config_ini",
     "default_config_files_directory",
+    "get_config_dir",
     "is_in_ci",
+    "load_config",
+    "load_manifest",
     "openfisca_survey_manager_location",
     "private_run_with_data",
     "test_config_files_directory",
diff --git a/openfisca_survey_manager/configuration/config_loader.py b/openfisca_survey_manager/configuration/config_loader.py
new file mode 100644
index 00000000..44c982cf
--- /dev/null
+++ b/openfisca_survey_manager/configuration/config_loader.py
@@ -0,0 +1,97 @@
"""RFC-002: New config and manifest loading (YAML-based)."""

from __future__ import annotations

import logging
import os
from pathlib import Path
from typing import Any, Optional

import yaml
from xdg import BaseDirectory

log = logging.getLogger(__name__)

CONFIG_FILENAME = "config.yaml"
MANIFEST_FILENAME = "manifest.yaml"
ENV_CONFIG_DIR = "OPENFISCA_SURVEY_CONFIG_DIR"


def get_config_dir(explicit: Optional[Path | str] = None) -> Path:
    """Return the config directory: explicit path, or env OPENFISCA_SURVEY_CONFIG_DIR, or XDG."""
    if explicit is not None:
        return Path(explicit).expanduser().resolve()
    env_path = os.environ.get(ENV_CONFIG_DIR)
    if env_path:
        return Path(env_path).expanduser().resolve()
    return Path(BaseDirectory.save_config_path("openfisca-survey-manager"))


def load_config(config_dir: Path) -> Optional[dict[str, Any]]:
    """
    Load the new-style config from config_dir/config.yaml.

    Returns a dict with collections_dir, default_output_dir, tmp_dir (paths expanded),
    or None if config.yaml is missing or invalid.
    """
    config_path = config_dir / CONFIG_FILENAME
    if not config_path.is_file():
        return None
    try:
        with config_path.open() as f:
            data = yaml.safe_load(f)
    except Exception as e:
        log.warning("Failed to load %s: %s", config_path, e)
        return None
    if not data or not isinstance(data, dict):
        return None
    collections_dir = data.get("collections_dir")
    if not collections_dir:
        return None
    return {
        "collections_dir": Path(collections_dir).expanduser().resolve(),
        "default_output_dir": Path(data.get("default_output_dir", ".")).expanduser().resolve(),
        "tmp_dir": Path(data.get("tmp_dir", "/tmp")).expanduser().resolve(),
    }


def load_manifest(collections_dir: Path, name: str) -> Optional[dict[str, Any]]:
    """
    Load a dataset manifest from collections_dir/name/manifest.yaml.

    Returns the manifest dict (name, label, surveys) or None if missing.
    """
    manifest_path = collections_dir / name / MANIFEST_FILENAME
    if not manifest_path.is_file():
        return None
    try:
        with manifest_path.open() as f:
            data = yaml.safe_load(f)
    except Exception as e:
        log.warning("Failed to load manifest %s: %s", manifest_path, e)
        return None
    if not data or not isinstance(data, dict) or "surveys" not in data:
        return None
    return data
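# Example (hypothetical paths): resolve the config directory, then load the
# "erfs" dataset manifest described in RFC-002:
#
#     cfg = load_config(get_config_dir())
#     if cfg is not None:
#         manifest = load_manifest(cfg["collections_dir"], "erfs")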
} + """ + source = entry.get("source") or {} + fmt = source.get("format", "csv") + path = source.get("path", "") + # Survey expects e.g. csv_files, sas_files list in informations + files_key = f"{fmt}_files" + informations = {files_key: [path] if path else []} + return { + "name": survey_name, + "label": entry.get("label", survey_name), + "hdf5_file_path": None, + "parquet_file_path": None, + "zarr_file_path": None, + "tables": entry.get("tables"), + "informations": informations, + } diff --git a/openfisca_survey_manager/configuration/models.py b/openfisca_survey_manager/configuration/models.py index 571a6ac8..7a52d399 100644 --- a/openfisca_survey_manager/configuration/models.py +++ b/openfisca_survey_manager/configuration/models.py @@ -1,25 +1,36 @@ """Configuration model (Config class from config.ini).""" +from __future__ import annotations + import configparser +import logging from pathlib import Path +from typing import Optional, Union + +log = logging.getLogger(__name__) class Config(configparser.ConfigParser): """Parser for config.ini; used by SurveyCollection and build scripts.""" - config_ini = None + config_ini: Optional[Path] = None - def __init__(self, config_files_directory=None): + def __init__( + self, + config_files_directory: Optional[Union[Path, str]] = None, + ) -> None: configparser.ConfigParser.__init__(self) if config_files_directory is not None: config_ini = Path(config_files_directory) / "config.ini" assert config_ini.exists(), f"{config_ini} is not a valid path" self.config_ini = config_ini self.read([config_ini]) + log.debug("Loaded config from %s", config_ini) - def save(self): + def save(self) -> None: assert self.config_ini, "configuration file path is not defined" assert self.config_ini.exists() config_file = self.config_ini.open("w") self.write(config_file) config_file.close() + log.debug("Saved config to %s", self.config_ini) diff --git a/openfisca_survey_manager/core/__init__.py b/openfisca_survey_manager/core/__init__.py index c7be67cc..4ea97a4d 100644 --- a/openfisca_survey_manager/core/__init__.py +++ b/openfisca_survey_manager/core/__init__.py @@ -1,2 +1,7 @@ -# Target: Survey (surveys.py), SurveyCollection, dataset orchestration. -# See docs/REFACTORING_PLAN.md for migration steps. +# Survey, Table, SurveyCollection, load_table. 
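+# Example (illustrative): from openfisca_survey_manager.core import Survey, SurveyCollection, load_table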
+ +from openfisca_survey_manager.core.dataset import SurveyCollection, load_table +from openfisca_survey_manager.core.survey import NoMoreDataError, Survey +from openfisca_survey_manager.core.table import Table + +__all__ = ["NoMoreDataError", "Survey", "SurveyCollection", "Table", "load_table"] diff --git a/openfisca_survey_manager/core/dataset.py b/openfisca_survey_manager/core/dataset.py new file mode 100644 index 00000000..4eb008fe --- /dev/null +++ b/openfisca_survey_manager/core/dataset.py @@ -0,0 +1,248 @@ +"""SurveyCollection: collection of surveys (dataset orchestration).""" + +from __future__ import annotations + +import codecs +import collections +import configparser +import json +import logging +import warnings +from pathlib import Path +from typing import List, Optional, Union + +import pandas as pd + +from openfisca_survey_manager.configuration.config_loader import ( + load_config, + load_manifest, + manifest_survey_to_json, +) +from openfisca_survey_manager.configuration.models import Config +from openfisca_survey_manager.configuration.paths import default_config_files_directory +from openfisca_survey_manager.core.survey import Survey +from openfisca_survey_manager.exceptions import SurveyConfigError + +log = logging.getLogger(__name__) + + +class SurveyCollection: + """A collection of Surveys.""" + + name: Optional[str] = None + label: Optional[str] = None + json_file_path: Optional[str] = None + surveys: List[Survey] # set in __init__ + config: Optional[Config] = None + output_directory: Optional[str] = None # RFC-002: used when config is None (manifest-based) + + def __init__( + self, + config_files_directory: Optional[Union[Path, str]] = default_config_files_directory, + label: Optional[str] = None, + name: Optional[str] = None, + json_file_path: Optional[str] = None, + ) -> None: + self.name = name + self.label = label + self.json_file_path = json_file_path + self.surveys = [] + log.debug(f"Initializing SurveyCollection from config file found in {config_files_directory} ..") + config = Config(config_files_directory=config_files_directory) + if label is not None: + self.label = label + if name is not None: + self.name = name + if json_file_path is not None: + self.json_file_path = json_file_path + if "collections" not in config.sections(): + config["collections"] = {} + config.set("collections", self.name, str(self.json_file_path)) + config.save() + elif config is not None: + if config.has_option("collections", self.name): + self.json_file_path = config.get("collections", self.name) + elif config.get("collections", "collections_directory") is not None: + self.json_file_path = str(Path(config.get("collections", "collections_directory")) / (name + ".json")) + + self.config = config + + def __repr__(self) -> str: + header = f"""{self.name} +Survey collection of {self.label} +Contains the following surveys : +""" + surveys = [f" {survey.name} : {survey.label} \n" for survey in self.surveys] + return header + "".join(surveys) + + def dump( + self, + config_files_directory: Optional[Union[Path, str]] = None, + json_file_path: Optional[str] = None, + ) -> None: + if json_file_path is not None: + self.json_file_path = json_file_path + + if self.config is None: + # RFC-002: manifest-based collection; no config.ini to update + return + + config = self.config + if self.json_file_path is None: + assert self.json_file_path is not None, "A json_file_path should be provided" + + config.set("collections", self.name, str(self.json_file_path)) + config.save() + with 
codecs.open(str(self.json_file_path), "w", encoding="utf-8") as _file: + json.dump(self.to_json(), _file, ensure_ascii=False, indent=2) + + def fill_store( + self, + source_format: Optional[str] = None, + surveys: Optional[List[Survey]] = None, + tables: Optional[List[str]] = None, + overwrite: bool = False, + keep_original_parquet_file: bool = False, + encoding: Optional[str] = None, + store_format: str = "hdf5", + categorical_strategy: str = "unique_labels", + ) -> None: + if surveys is None: + surveys = self.surveys + for survey in surveys: + survey.fill_store( + source_format=source_format, + tables=tables, + overwrite=overwrite, + keep_original_parquet_file=keep_original_parquet_file, + encoding=encoding, + store_format=store_format, + categorical_strategy=categorical_strategy, + ) + self.dump() + + def get_survey(self, survey_name: str) -> Survey: + available_surveys_names = [survey.name for survey in self.surveys] + assert survey_name in available_surveys_names, ( + f"Survey {survey_name} cannot be found for survey collection {self.name}.\n" + f"Available surveys are :{available_surveys_names}" + ) + return [survey for survey in self.surveys if survey.name == survey_name].pop() + + @classmethod + def load( + cls, + json_file_path: Optional[str] = None, + collection: Optional[str] = None, + config_files_directory: Optional[Union[Path, str]] = default_config_files_directory, + ) -> SurveyCollection: + config_dir = Path(config_files_directory).expanduser().resolve() + assert config_dir.exists(), f"Config directory does not exist: {config_dir}" + + # RFC-002: try new config.yaml + manifest first + new_cfg = load_config(config_dir) + if json_file_path is None and collection is not None and new_cfg is not None: + manifest = load_manifest(new_cfg["collections_dir"], collection) + if manifest is not None: + self = cls.__new__(cls) + self.name = manifest.get("name", collection) + self.label = manifest.get("label", self.name) + self.json_file_path = str(new_cfg["collections_dir"] / collection / "manifest.yaml") + self.config = None + self.output_directory = str(new_cfg["default_output_dir"]) + self.surveys = [] + store_format = manifest.get("store_format", "parquet") + output_dir = Path(self.output_directory) + for survey_name, entry in manifest.get("surveys", {}).items(): + survey_json = manifest_survey_to_json(survey_name, entry) + survey = Survey(name=survey_name) + survey = survey.create_from_json(survey_json) + survey.survey_collection = self + survey.store_format = store_format + if store_format == "hdf5": + survey.hdf5_file_path = str(output_dir / (survey.name + ".h5")) + elif store_format == "parquet": + survey.parquet_file_path = str(output_dir / survey.name) + elif store_format == "zarr": + survey.zarr_file_path = str(output_dir / (survey.name + ".zarr")) + self.surveys.append(survey) + return self + + # Legacy: config.ini + JSON + warnings.warn( + "Loading collections from config.ini and JSON files is deprecated. 
" + "Migrate to config.yaml and manifest.yaml using: " + "python -m openfisca_survey_manager.scripts.migrate_config_to_rfc002 --config-dir " + "See docs/RFC-002-METADATA-AND-CONFIG.md.", + DeprecationWarning, + stacklevel=2, + ) + config = Config(config_files_directory=config_files_directory) + if json_file_path is None: + assert collection is not None, "A collection is needed" + try: + json_file_path = config.get("collections", collection) + except (configparser.NoOptionError, configparser.NoSectionError) as error: + msg = f"Looking for config file in {config_files_directory}" + log.debug(msg) + log.error(error) + raise error + except Exception as error: + msg = f"Looking for config file in {config_files_directory}" + log.debug(msg) + log.error(error) + raise SurveyConfigError(msg) from error + + with Path(json_file_path).open("r") as _file: + self_json = json.load(_file) + name = self_json["name"] + + self = cls(config_files_directory=config_files_directory, name=name) + self.config = config + with Path(json_file_path).open("r") as _file: + self_json = json.load(_file) + self.json_file_path = json_file_path + self.label = self_json.get("label") + self.name = self_json.get("name") + + surveys = self_json["surveys"] + for survey_name, survey_json in surveys.items(): + survey = Survey(name=survey_name) + self.surveys.append(survey.create_from_json(survey_json)) + return self + + def to_json(self) -> dict: + self_json = collections.OrderedDict(()) + self_json["name"] = self.name + self_json["surveys"] = collections.OrderedDict(()) + for survey in self.surveys: + self_json["surveys"][survey.name] = survey.to_json() + return self_json + + +def load_table( + config_files_directory, + variables: Optional[list] = None, + collection: Optional[str] = None, + survey: Optional[str] = None, + input_data_survey_prefix: Optional[str] = None, + data_year=None, + table: Optional[str] = None, + batch_size=None, + batch_index=0, + filter_by=None, +) -> pd.DataFrame: + """Load table from a survey in a collection.""" + survey_collection = SurveyCollection.load(collection=collection, config_files_directory=config_files_directory) + survey_name = survey if survey is not None else f"{input_data_survey_prefix}_{data_year}" + survey_ = survey_collection.get_survey(survey_name) + log.debug("Loading table %s in survey %s from collection %s", table, survey_name, collection) + if batch_size: + return survey_.get_values( + table=table, + variables=variables, + batch_size=batch_size, + batch_index=batch_index, + filter_by=filter_by, + ) + return survey_.get_values(table=table, variables=variables, filter_by=filter_by) diff --git a/openfisca_survey_manager/core/survey.py b/openfisca_survey_manager/core/survey.py new file mode 100644 index 00000000..05375ad9 --- /dev/null +++ b/openfisca_survey_manager/core/survey.py @@ -0,0 +1,376 @@ +"""Survey: describes survey data and tables.""" + +from __future__ import annotations + +import collections +import logging +import re +from pathlib import Path +from typing import TYPE_CHECKING, Any, List, Optional, Union + +import pandas +import pyarrow as pa +import pyarrow.parquet as pq +import yaml + +from openfisca_survey_manager.core.table import Table +from openfisca_survey_manager.exceptions import SurveyIOError, SurveyManagerError +from openfisca_survey_manager.io.backends import get_backend +from openfisca_survey_manager.io.hdf import hdf5_safe_key +from openfisca_survey_manager.processing.harmonization import harmonize_data_frame_columns + +if TYPE_CHECKING: + from 
openfisca_survey_manager.core.dataset import SurveyCollection
+
+log = logging.getLogger(__name__)
+
+source_format_by_extension = {
+    "csv": "csv",
+    "sas7bdat": "sas",
+    "dta": "stata",
+    "Rdata": "Rdata",
+    "spss": "sav",
+    "parquet": "parquet",
+}
+
+admissible_source_formats = list(source_format_by_extension.values())
+
+
+class NoMoreDataError(Exception):
+    """Raised when the user asks for more data than is available in the file."""
+
+    pass
+
+
+class Survey:
+    """An object to describe survey data."""
+
+    hdf5_file_path: Optional[str] = None
+    parquet_file_path: Optional[str] = None
+    zarr_file_path: Optional[str] = None
+    label: Optional[str] = None
+    name: Optional[str] = None
+    survey_collection: Optional[SurveyCollection] = None
+    store_format: Optional[str] = None
+
+    def __init__(
+        self,
+        name: Optional[str] = None,
+        label: Optional[str] = None,
+        hdf5_file_path: Optional[str] = None,
+        parquet_file_path: Optional[str] = None,
+        zarr_file_path: Optional[str] = None,
+        survey_collection: Optional[SurveyCollection] = None,
+        **kwargs: Any,
+    ) -> None:
+        assert name is not None, "A survey should have a name"
+        self.name = name
+        self.tables = collections.OrderedDict()
+        self.informations = {}
+        self.tables_index = {}
+
+        if label is not None:
+            self.label = label
+
+        if hdf5_file_path is not None:
+            self.hdf5_file_path = hdf5_file_path
+
+        if parquet_file_path is not None:
+            self.parquet_file_path = parquet_file_path
+
+        # Accept zarr_file_path explicitly so that create_from_json does not leak it into informations via kwargs
+        if zarr_file_path is not None:
+            self.zarr_file_path = zarr_file_path
+
+        if survey_collection is not None:
+            self.survey_collection = survey_collection
+
+        self.informations = kwargs
+
+    def __repr__(self) -> str:
+        header = f"""{self.name} : survey data {self.label}
+Contains the following tables : \n"""
+        tables = yaml.safe_dump(list(self.tables.keys()), default_flow_style=False)
+        informations = yaml.safe_dump(self.informations, default_flow_style=False)
+        return header + tables + informations
+
+    @classmethod
+    def create_from_json(cls, survey_json: dict) -> Survey:
+        # Top-level store paths; exclude from informations to avoid duplicate kwargs
+        store_path_keys = {"hdf5_file_path", "parquet_file_path", "zarr_file_path"}
+        infos = {k: v for k, v in survey_json.get("informations", {}).items() if k not in store_path_keys}
+        self = cls(
+            name=survey_json.get("name"),
+            label=survey_json.get("label"),
+            hdf5_file_path=survey_json.get("hdf5_file_path"),
+            parquet_file_path=survey_json.get("parquet_file_path"),
+            zarr_file_path=survey_json.get("zarr_file_path"),
+            **infos,
+        )
+        self.tables = survey_json.get("tables")
+        return self
+
+    def dump(self) -> None:
+        assert self.survey_collection is not None
+        self.survey_collection.dump()
+
+    def fill_store(
+        self,
+        source_format: Optional[str] = None,
+        tables: Optional[List[str]] = None,
+        overwrite: Union[bool, List[str]] = True,
+        keep_original_parquet_file: bool = False,
+        encoding: Optional[str] = None,
+        store_format: str = "hdf5",
+        categorical_strategy: str = "unique_labels",
+    ) -> None:
+        assert self.survey_collection is not None
+        assert isinstance(overwrite, (bool, list))
+        survey = self
+        sc = survey.survey_collection
+        if sc.config is not None:
+            directory_path = sc.config.get("data", "output_directory")
+        else:
+            directory_path = getattr(sc, "output_directory", None)
+            assert directory_path is not None, "SurveyCollection has no config and no output_directory; cannot fill_store"
+        if not Path(directory_path).is_dir():
+            log.warning(
+                f"{directory_path}, which should be the store data directory, does not exist: creating it"
+            )
+            Path(directory_path).mkdir(parents=True)
+
+        if source_format == "parquet":
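+            # Parquet sources are registered in place (columns read via read_parquet_columns below), so the requested store format is overridden.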
+            store_format = "parquet"
+
+        if store_format == "hdf5" and survey.hdf5_file_path is None:
+            survey.hdf5_file_path = str(Path(directory_path) / (survey.name + ".h5"))
+
+        if store_format == "parquet" and survey.parquet_file_path is None:
+            survey.parquet_file_path = str(Path(directory_path) / survey.name)
+
+        if store_format == "zarr" and survey.zarr_file_path is None:
+            survey.zarr_file_path = str(Path(directory_path) / (survey.name + ".zarr"))
+
+        self.store_format = store_format
+
+        if source_format is not None:
+            assert source_format in admissible_source_formats, f"Data source format {source_format} is unknown"
+            source_formats = [source_format]
+        else:
+            source_formats = admissible_source_formats
+
+        for source_format in source_formats:
+            files = f"{source_format}_files"
+            for data_file in survey.informations.get(files, []):
+                name = Path(data_file).stem
+                extension = Path(data_file).suffix
+                if tables is None or name in tables:
+                    if keep_original_parquet_file:
+                        if re.match(r".*-\d$", name):
+                            name = name.split("-")[0]
+                            parquet_file = str(Path(data_file).parent)
+                            survey.parquet_file_path = str(Path(data_file).parent.parent)
+                        else:
+                            parquet_file = data_file
+                            survey.parquet_file_path = str(Path(data_file).parent)
+                        table = Table(
+                            label=name,
+                            name=name,
+                            source_format=source_format_by_extension[extension[1:]],
+                            survey=survey,
+                            parquet_file=parquet_file,
+                        )
+                        table.read_parquet_columns(data_file)
+
+                    else:
+                        table = Table(
+                            label=name,
+                            name=name,
+                            source_format=source_format_by_extension[extension[1:]],
+                            survey=survey,
+                        )
+                        table.fill_store(
+                            data_file,
+                            clean=True,
+                            overwrite=overwrite if isinstance(overwrite, bool) else table.name in overwrite,
+                            encoding=encoding,
+                            categorical_strategy=categorical_strategy,
+                        )
+        self.dump()
+
+    def get_value(
+        self,
+        variable: str,
+        table: Optional[str] = None,
+        lowercase: bool = False,
+        ignorecase: bool = False,
+    ) -> pandas.DataFrame:
+        return self.get_values([variable], table, lowercase=lowercase, ignorecase=ignorecase)
+
+    def _get_values_from_hdf5(self, table: str, ignorecase: bool = False) -> tuple[pandas.DataFrame, str]:
+        """Read table from HDF5 store. Returns (df, resolved_table_name)."""
+        assert Path(self.hdf5_file_path).exists(), (
+            f"{self.hdf5_file_path} is not a valid path. This could happen because "
+            "your data have not been built yet. Please consider using a rebuild option in your code."
+        )
+        store = pandas.HDFStore(self.hdf5_file_path, "r")
+        try:
+            # Use same key normalization as at write time (PyTables NaturalNameWarning)
+            hdf5_key = hdf5_safe_key(table)
+            if ignorecase:
+                keys = store.keys()
+                eligible_tables = [k for k in keys if hdf5_safe_key(k.lstrip("/")).lower() == hdf5_key.lower()]
+                if len(eligible_tables) > 1:
+                    raise SurveyManagerError(
+                        f"{table} is ambiguous since the following tables are available: {eligible_tables}"
+                    )
+                if len(eligible_tables) == 0:
+                    raise SurveyIOError(f"No eligible table found in {keys}")
+                hdf5_key = eligible_tables[0].lstrip("/")
+            try:
+                df = store.select(hdf5_key)
+            except KeyError:
+                # Backward compat: try raw table name (old files may have keys with hyphens)
+                df = store.select(table)
+            return df, table
+        except KeyError:
+            log.error("No table %s in the file %s", table, self.hdf5_file_path)
+            log.error(
+                "This could happen because your data have not been built yet. 
Available tables are: %s", + store.keys(), + ) + raise + finally: + store.close() + + def _get_values_from_parquet( + self, + table: str, + variables: Optional[List[str]], + filter_by: Optional[List[tuple]], + batch_size: Optional[int], + batch_index: int, + ) -> pandas.DataFrame: + """Read table from parquet. Resolves variables from table content if None.""" + if table is None: + raise SurveyIOError("A table name is needed to retrieve data from a parquet file") + for table_name, table_content in self.tables.items(): + if table != table_name: + continue + parquet_file = table_content.get("parquet_file") + if Path(parquet_file).is_dir(): + for file in Path(parquet_file).iterdir(): + if file.suffix == ".parquet": + one_parquet_file = str(Path(parquet_file) / file) + break + else: + raise SurveyIOError(f"No parquet file found in {parquet_file}") + else: + one_parquet_file = parquet_file + parquet_schema = pq.read_schema(one_parquet_file) + assert len(parquet_schema.names) >= 1, f"The parquet file {table_content.get('parquet_file')} is empty" + if variables is None: + variables = table_content.get("variables") + if filter_by: + return pq.ParquetDataset(parquet_file, filters=filter_by).read(columns=variables).to_pandas() + if batch_size is not None: + paths = ( + [str(p) for p in Path(parquet_file).glob("*.parquet")] + if Path(parquet_file).is_dir() + else [parquet_file] + ) + tables_list = [pq.read_table(fp, columns=variables) for fp in paths] + final_table = pa.concat_tables(tables_list) if len(tables_list) > 1 else tables_list[0] + record_batches = final_table.to_batches(max_chunksize=batch_size) + if len(record_batches) <= batch_index: + raise NoMoreDataError( + f"Batch {batch_index} not found in {table_name}. Max index is {len(record_batches)}" + ) + return record_batches[batch_index].to_pandas() + return pq.ParquetDataset(parquet_file).read(columns=variables).to_pandas() + raise SurveyIOError(f"No table {table} found in {self.parquet_file_path}") + + def _get_values_from_zarr( + self, + table: str, + variables: Optional[List[str]] = None, + **kwargs: Any, + ) -> pandas.DataFrame: + """Read table from zarr store.""" + if self.zarr_file_path is None: + raise SurveyIOError("No zarr store path for survey") + backend = get_backend("zarr") + return backend.read_table( + self.zarr_file_path, + table, + variables=variables, + **kwargs, + ) + + def get_values( + self, + variables: Optional[List[str]] = None, + table: Optional[str] = None, + lowercase: bool = False, + ignorecase: bool = False, + rename_ident: bool = True, + batch_size: Optional[int] = None, + batch_index: int = 0, + filter_by: Optional[List[tuple]] = None, + ) -> pandas.DataFrame: + if self.parquet_file_path is None and self.hdf5_file_path is None and self.zarr_file_path is None: + raise SurveyIOError(f"No data file found for survey {self.name}") + if self.store_format == "zarr" and self.zarr_file_path is not None: + df = self._get_values_from_zarr(table or "", variables=variables) + elif self.hdf5_file_path is not None: + df, _ = self._get_values_from_hdf5(table or "", ignorecase=ignorecase) + else: + df = self._get_values_from_parquet(table, variables, filter_by, batch_size, batch_index) + harmonize_data_frame_columns(df, lowercase=lowercase, rename_ident=rename_ident) + if variables is None: + return df + diff = set(variables) - set(df.columns) + if diff: + raise SurveyIOError(f"The following variable(s) {diff} are missing") + variables = list(set(variables).intersection(df.columns)) + return df[variables] + + def insert_table( 
+ self, + label: Optional[str] = None, + name: Optional[str] = None, + **kwargs: Any, + ) -> None: + parquet_file = kwargs.pop("parquet_file", None) + data_frame = kwargs.pop("data_frame", None) + if data_frame is None: + data_frame = kwargs.pop("dataframe", None) + + if data_frame is not None: + assert isinstance(data_frame, pandas.DataFrame) + variables = kwargs.pop("variables", None) + if variables is not None: + assert set(variables) < set(data_frame.columns) + else: + variables = list(data_frame.columns) + if label is None: + label = name + table = Table(label=label, name=name, survey=self, variables=variables, parquet_file=parquet_file) + assert (table.survey.hdf5_file_path is not None) or (table.survey.parquet_file_path is not None) + if parquet_file is not None: + log.debug(f"Saving table {name} in {table.survey.parquet_file_path}") + data_frame.to_parquet(parquet_file) + else: + log.debug(f"Saving table {name} in {table.survey.hdf5_file_path}") + to_hdf_kwargs = kwargs.pop("to_hdf_kwargs", {}) + table.save_data_frame_to_hdf5(data_frame, **to_hdf_kwargs) + + if name not in self.tables: + self.tables[name] = {} + for key, val in kwargs.items(): + self.tables[name][key] = val + + def to_json(self) -> dict: + self_json = collections.OrderedDict(()) + self_json["hdf5_file_path"] = str(self.hdf5_file_path) if self.hdf5_file_path else None + self_json["parquet_file_path"] = str(self.parquet_file_path) if self.parquet_file_path else None + self_json["label"] = self.label + self_json["name"] = self.name + self_json["tables"] = self.tables + self_json["informations"] = collections.OrderedDict(sorted(self.informations.items())) + return self_json diff --git a/openfisca_survey_manager/core/table.py b/openfisca_survey_manager/core/table.py new file mode 100644 index 00000000..4fe3fa76 --- /dev/null +++ b/openfisca_survey_manager/core/table.py @@ -0,0 +1,341 @@ +"""Table: a table of a survey (core I/O and storage).""" + +from __future__ import annotations + +import collections +import csv +import datetime +import errno +import gc +import logging +import os +from pathlib import Path +from typing import TYPE_CHECKING, Any, Optional + +import pandas +from chardet.universaldetector import UniversalDetector +from pyarrow import parquet as pq + +from openfisca_survey_manager.exceptions import SurveyIOError +from openfisca_survey_manager.io.backends import get_backend +from openfisca_survey_manager.io.readers import read_sas +from openfisca_survey_manager.io.writers import write_table_to_hdf5, write_table_to_parquet +from openfisca_survey_manager.processing.cleaning import clean_data_frame + +try: + from openfisca_survey_manager.io.readers import read_spss +except ImportError: + read_spss = None # optional dependency (savReaderWriter) + +if TYPE_CHECKING: + from openfisca_survey_manager.core.survey import Survey + +log = logging.getLogger(__name__) + +reader_by_source_format = { + "csv": pandas.read_csv, + "sas": read_sas, + "spss": read_spss, + "stata": pandas.read_stata, + "parquet": pandas.read_parquet, +} + + +class Table: + """A table of a survey.""" + + label: Optional[str] = None + name: Optional[str] = None + source_format: Optional[str] = None + survey: Optional[Survey] = None + variables: Optional[list[str]] = None + parquet_file: Optional[str] = None + + def __init__( + self, + survey: Optional[Survey] = None, + name: Optional[str] = None, + label: Optional[str] = None, + source_format: Optional[str] = None, + variables: Optional[list[str]] = None, + parquet_file: Optional[str] = None, + 
**kwargs: Any, + ) -> None: + assert name is not None, "A table should have a name" + self.name = name + self.label = label + self.source_format = source_format + self.variables = variables + self.parquet_file = parquet_file + self.informations = kwargs + + from openfisca_survey_manager.core.survey import Survey + + assert isinstance(survey, Survey), f"survey is of type {type(survey)} and not {Survey}" + self.survey = survey + if not survey.tables: + survey.tables = collections.OrderedDict() + + survey.tables[name] = collections.OrderedDict( + source_format=source_format, + variables=variables, + parquet_file=parquet_file, + ) + + def _check_and_log(self, data_file_path: str, store_file_path: Optional[str]) -> None: + assert store_file_path is not None, "Store file path cannot be None" + if not Path(data_file_path).is_file(): + raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), data_file_path) + + log.info( + f"Inserting table {self.name} from file {data_file_path} in store file {store_file_path} " + f"at point {self.name}" + ) + + def _get_store_path_and_format(self) -> Optional[tuple[str, str]]: + """Return (store_path, store_format) for the survey's current backend, or None.""" + fmt = getattr(self.survey, "store_format", None) or "hdf5" + if fmt == "hdf5" and self.survey.hdf5_file_path is not None: + return (self.survey.hdf5_file_path, "hdf5") + if fmt == "parquet" and self.survey.parquet_file_path is not None: + return (self.survey.parquet_file_path, "parquet") + if fmt == "zarr" and getattr(self.survey, "zarr_file_path", None) is not None: + return (self.survey.zarr_file_path, "zarr") + if self.survey.hdf5_file_path is not None: + return (self.survey.hdf5_file_path, "hdf5") + if self.survey.parquet_file_path is not None: + return (self.survey.parquet_file_path, "parquet") + return None + + def _is_stored(self) -> bool: + path_fmt = self._get_store_path_and_format() + if path_fmt is None: + return False + store_path, store_format = path_fmt + backend = get_backend(store_format) + if backend.table_exists(store_path, self.name): + log.info( + "Exiting without overwriting %s in %s (%s)", + self.name, + store_path, + store_format, + ) + return True + return False + + def _save( + self, + data_frame: Optional[pandas.DataFrame] = None, + store_format: str = "hdf5", + ) -> None: + assert data_frame is not None + variables = self.variables + + if variables: + stored_variables = list(set(variables).intersection(set(data_frame.columns))) + log.info("The following variables are stored: %s", stored_variables) + if set(stored_variables) != set(variables): + log.info( + "variables wanted by the user that were not available: " + f"{list(set(variables) - set(stored_variables))}" + ) + data_frame = data_frame[stored_variables].copy() + + path_fmt = self._get_store_path_and_format() + if path_fmt is None: + raise SurveyIOError( + f"No store path set for survey (store_format={store_format}). " + "Set hdf5_file_path, parquet_file_path, or zarr_file_path." + ) + store_path, resolved_format = path_fmt + if store_format != resolved_format: + store_format = resolved_format + backend = get_backend(store_format) + if store_format == "hdf5": + log.warning( + "HDF5 will no longer be the default format in a future version. " + "Please use parquet or zarr format instead." 
+            )
+        log.info("Inserting table %s in %s store %s", self.name, store_format, store_path)
+        backend.write_table(store_path, self.name, data_frame)
+        self.variables = list(data_frame.columns)
+        self.survey.tables[self.name]["variables"] = self.variables
+        if store_format == "parquet":
+            self.parquet_file = str(Path(store_path) / f"{self.name}.parquet")
+            self.survey.tables[self.name]["parquet_file"] = self.parquet_file
+        gc.collect()
+
+    def fill_store(
+        self,
+        data_file: str,
+        overwrite: bool = False,
+        clean: bool = False,
+        **kwargs: Any,
+    ) -> None:
+        if not overwrite and self._is_stored():
+            return
+
+        start_table_time = datetime.datetime.now()
+        if self.source_format in ["sas", "parquet"] and "encoding" in kwargs:
+            del kwargs["encoding"]
+        data_frame = self.read_source(data_file, **kwargs)
+        try:
+            if clean:
+                clean_data_frame(data_frame)
+            self._save(data_frame=data_frame, store_format=self.survey.store_format)
+            log.info(f"File {data_file} has been processed in {datetime.datetime.now() - start_table_time}")
+        except Exception as e:
+            log.error(f"Could not process file {data_file} because of the following error:\n {e}")
+            raise
+
+    def read_parquet_columns(self, parquet_file: Optional[str] = None) -> list[str]:
+        if parquet_file is None:
+            parquet_file = self.parquet_file
+        log.info(f"Initializing table {self.name} from parquet file {parquet_file}")
+        self.source_format = "parquet"
+        parquet_schema = pq.read_schema(parquet_file)
+        self.variables = parquet_schema.names
+        self.survey.tables[self.name]["variables"] = self.variables
+        return self.variables
+
+    def _read_csv_with_inferred_encoding(
+        self, data_file: str, reader: Any, kwargs: dict[str, Any]
+    ) -> pandas.DataFrame:
+        """Read CSV, inferring encoding and dialect if the default read fails."""
+        log.debug("Failed to read %s with default settings; inferring encoding and dialect/separator", data_file)
+        detector = UniversalDetector()
+        with Path(data_file).open("rb") as csvfile:
+            for line in csvfile:
+                detector.feed(line)
+                if detector.done:
+                    break
+        detector.close()
+        encoding = detector.result["encoding"]
+        confidence = detector.result["confidence"]
+        delimiter = ";"  # fallback if dialect sniffing fails
+        try:
+            with Path(data_file).open("r", newline="", encoding=encoding) as csvfile:
+                dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
+        except Exception:
+            dialect = None
+        log.debug(
+            "dialect.delimiter = %s, encoding = %s, confidence = %s",
+            dialect.delimiter if dialect is not None else delimiter,
+            encoding,
+            confidence,
+        )
+        kwargs = {**kwargs, "engine": "python", "encoding": encoding}
+        if dialect:
+            kwargs["dialect"] = dialect
+        else:
+            kwargs["delimiter"] = delimiter
+        return reader(data_file, **kwargs)
+
+    def _apply_stata_categorical_strategy(
+        self,
+        data_frame: pandas.DataFrame,
+        data_file: str,
+        categorical_strategy: str,
+    ) -> None:
+        """Apply categorical_strategy (unique_labels, codes, skip) to Stata value labels in place."""
+        from pandas.io.stata import StataReader
+
+        stata_reader = StataReader(data_file)
+        value_labels = stata_reader.value_labels()
+        for col_name, labels in value_labels.items():
+            if col_name not in data_frame.columns:
+                continue
+            if categorical_strategy == "unique_labels":
+                unique_labels = {}
+                seen_labels = {}
+                for code, label in labels.items():
+                    if pandas.isna(code):
+                        unique_labels[code] = label
+                    elif label in seen_labels:
+                        unique_labels[code] = f"{label} ({code})"
+                    else:
+                        unique_labels[code] = label
+                        seen_labels[label] = code
+                code_to_label = {code: unique_labels[code] for code in sorted(labels.keys())}
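+                    # Swap codes for their de-duplicated labels; category order follows the sorted codes.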
data_frame[col_name] = data_frame[col_name].map(code_to_label) + data_frame[col_name] = pandas.Categorical( + data_frame[col_name], + categories=list(code_to_label.values()), + ordered=False, + ) + elif categorical_strategy == "codes": + codes = sorted([c for c in labels if pandas.notna(c)]) + if codes: + data_frame[col_name] = pandas.Categorical(data_frame[col_name], categories=codes, ordered=False) + elif categorical_strategy != "skip": + log.warning("Unknown categorical_strategy %r, using 'skip'", categorical_strategy) + + def read_source(self, data_file: str, **kwargs: Any) -> pandas.DataFrame: + source_format = self.source_format + path_fmt = self._get_store_path_and_format() + store_file_path = path_fmt[0] if path_fmt else None + self._check_and_log(data_file, store_file_path=store_file_path) + reader = reader_by_source_format[source_format] + categorical_strategy = ( + kwargs.pop("categorical_strategy", "unique_labels") + if source_format == "stata" + else kwargs.pop("categorical_strategy", None) + ) + try: + if source_format == "csv": + try: + data_frame = reader(data_file, **kwargs) + if len(data_frame.columns) == 1 and ";" in data_frame.columns[0]: + raise SurveyIOError( + "A ';' is present in the unique column name. Looks like we got the wrong separator." + ) + except Exception: + data_frame = self._read_csv_with_inferred_encoding(data_file, reader, kwargs) + elif source_format == "stata": + if "encoding" in kwargs: + kwargs.pop("encoding") + try: + data_frame = reader(data_file, **kwargs) + except ValueError as e: + if "not unique" not in str(e) and "Categorical categories must be unique" not in str(e): + raise + log.info( + "Non-unique value labels detected in %s, using strategy %r", + data_file, + categorical_strategy, + ) + kwargs_no_cat = {**kwargs, "convert_categoricals": False} + data_frame = reader(data_file, **kwargs_no_cat) + self._apply_stata_categorical_strategy(data_frame, data_file, categorical_strategy) + else: + data_frame = reader(data_file, **kwargs) + except Exception as e: + log.info("Error while reading %s", data_file) + raise e + gc.collect() + return data_frame + + def save_data_frame_to_hdf5(self, data_frame: pandas.DataFrame, **kwargs: Any) -> None: + hdf5_file_path = self.survey.hdf5_file_path + log.info(f"Inserting table {self.name} in HDF file {hdf5_file_path}") + store_path = self.name + write_table_to_hdf5( + data_frame, + hdf5_file_path=hdf5_file_path, + store_path=store_path, + **kwargs, + ) + + self.variables = list(data_frame.columns) + + def save_data_frame_to_parquet(self, data_frame: pandas.DataFrame) -> None: + parquet_file_path = self.survey.parquet_file_path + self.parquet_file = write_table_to_parquet( + data_frame, + parquet_dir_path=parquet_file_path, + table_name=self.name, + ) + self.variables = list(data_frame.columns) + + self.survey.tables[self.name]["parquet_file"] = self.parquet_file + self.survey.tables[self.name]["variables"] = self.variables diff --git a/openfisca_survey_manager/google_colab.py b/openfisca_survey_manager/google_colab.py deleted file mode 100644 index b5cf98a7..00000000 --- a/openfisca_survey_manager/google_colab.py +++ /dev/null @@ -1,23 +0,0 @@ -import configparser -from pathlib import Path - -from openfisca_survey_manager.paths import default_config_files_directory as config_files_directory - - -def create_raw_data_ini(value_by_option_by_section=None): - """Creates raw_data.ini configureation file - - Args: - value_by_option_by_section(dict(dict)): Options value by section (Default value = None) - - """ - 
config_parser = configparser.ConfigParser() - - if value_by_option_by_section is not None: - for section, value_by_option in value_by_option_by_section.items(): - config_parser.add_section(section) - for option, value in value_by_option.items(): - config_parser.set(section, option, value) - - with (Path(config_files_directory) / "raw_data.ini").open("w") as raw_data_config_file: - config_parser.write(raw_data_config_file) diff --git a/openfisca_survey_manager/io/backends.py b/openfisca_survey_manager/io/backends.py new file mode 100644 index 00000000..a6f8273e --- /dev/null +++ b/openfisca_survey_manager/io/backends.py @@ -0,0 +1,227 @@ +"""Store backends for survey tables: HDF5, Parquet, Zarr. + +Allows choosing the storage format (backend) when building or filling the store. +""" + +from __future__ import annotations + +import logging +from pathlib import Path +from typing import Any, Optional, Protocol + +import pandas as pd + +from openfisca_survey_manager.io.hdf import hdf5_safe_key, write_table_to_hdf5 +from openfisca_survey_manager.io.writers import write_table_to_parquet + +log = logging.getLogger(__name__) + + +# Supported store format names (zarr only if zarr package is installed) +def get_available_backend_names() -> tuple[str, ...]: + return tuple(_backends.keys()) + + +class StoreBackend(Protocol): + """Protocol for a store backend (write/read tables).""" + + def write_table( + self, + store_path: str, + table_name: str, + data_frame: pd.DataFrame, + **kwargs: Any, + ) -> Optional[str]: + """Write a table. Returns path used for the table (e.g. file path) or None.""" + ... + + def read_table( + self, + store_path: str, + table_name: str, + variables: Optional[list[str]] = None, + **kwargs: Any, + ) -> pd.DataFrame: + """Read a table as DataFrame.""" + ... + + def table_exists(self, store_path: str, table_name: str) -> bool: + """Return True if the table exists in the store.""" + ... 
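Not part of the patch itself: since `StoreBackend` is a `typing.Protocol`, any object with these three methods can be plugged in through `register_backend` (defined at the bottom of this module), with no inheritance required. As an illustrative sketch, a throwaway in-memory backend, e.g. for tests (the `InMemoryBackend` class and the `"memory"` key are hypothetical, not part of this diff):

```python
from typing import Any, Optional

import pandas as pd

from openfisca_survey_manager.io.backends import register_backend


class InMemoryBackend:
    """Keep tables in a plain dict, keyed by (store_path, table_name)."""

    def __init__(self) -> None:
        self._tables: dict[tuple[str, str], pd.DataFrame] = {}

    def write_table(self, store_path: str, table_name: str, data_frame: pd.DataFrame, **kwargs: Any) -> Optional[str]:
        # Copy so later mutations of the caller's frame do not leak into the store.
        self._tables[(store_path, table_name)] = data_frame.copy()
        return None

    def read_table(self, store_path: str, table_name: str, variables: Optional[list[str]] = None, **kwargs: Any) -> pd.DataFrame:
        df = self._tables[(store_path, table_name)]
        return df if variables is None else df[[c for c in variables if c in df.columns]]

    def table_exists(self, store_path: str, table_name: str) -> bool:
        return (store_path, table_name) in self._tables


register_backend("memory", InMemoryBackend())
```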
+
+
+class HDF5Backend:
+    """Store tables in a single HDF5 file."""
+
+    def write_table(
+        self,
+        store_path: str,
+        table_name: str,
+        data_frame: pd.DataFrame,
+        **kwargs: Any,
+    ) -> Optional[str]:
+        write_table_to_hdf5(
+            data_frame,
+            hdf5_file_path=store_path,
+            store_path=table_name,
+            **kwargs,
+        )
+        return None
+
+    def read_table(
+        self,
+        store_path: str,
+        table_name: str,
+        variables: Optional[list[str]] = None,
+        **kwargs: Any,
+    ) -> pd.DataFrame:
+        key = hdf5_safe_key(table_name)
+        store = pd.HDFStore(store_path, "r")
+        try:
+            df = store.select(key)
+        finally:
+            store.close()
+        if variables is not None:
+            df = df[[c for c in variables if c in df.columns]]
+        return df
+
+    def table_exists(self, store_path: str, table_name: str) -> bool:
+        if not Path(store_path).is_file():
+            return False
+        key = hdf5_safe_key(table_name)
+        store = pd.HDFStore(store_path, "r")
+        try:
+            keys = store.keys()
+            return key in keys or any(k.lstrip("/") == key for k in keys)
+        finally:
+            store.close()
+
+
+class ParquetBackend:
+    """Store each table as a parquet file in a directory (store_path/table_name.parquet)."""
+
+    def write_table(
+        self,
+        store_path: str,
+        table_name: str,
+        data_frame: pd.DataFrame,
+        **kwargs: Any,
+    ) -> Optional[str]:
+        return write_table_to_parquet(
+            data_frame,
+            parquet_dir_path=store_path,
+            table_name=table_name,
+        )
+
+    def read_table(
+        self,
+        store_path: str,
+        table_name: str,
+        variables: Optional[list[str]] = None,
+        **kwargs: Any,
+    ) -> pd.DataFrame:
+        path = Path(store_path) / f"{table_name}.parquet"
+        if not path.is_file():
+            raise FileNotFoundError(f"No table {table_name} at {path}")
+        return pd.read_parquet(path, columns=variables)
+
+    def table_exists(self, store_path: str, table_name: str) -> bool:
+        return (Path(store_path) / f"{table_name}.parquet").is_file()
+
+
+def _write_table_to_zarr(
+    data_frame: pd.DataFrame,
+    zarr_dir_path: str,
+    table_name: str,
+) -> str:
+    """Write a DataFrame to a zarr group (store_path/table_name), one zarr array per column."""
+    import zarr
+
+    zarr_path = str(Path(zarr_dir_path) / table_name)
+    Path(zarr_path).parent.mkdir(parents=True, exist_ok=True)
+    # pandas has no DataFrame.to_zarr: store one zarr array per column in a group
+    root = zarr.open_group(zarr_path, mode="w")
+    for col in data_frame.columns:
+        values = data_frame[col].to_numpy()
+        if values.dtype == object:
+            # Object columns can cause issues; coerce to string like the parquet backend
+            values = values.astype(str)
+        root.create_dataset(str(col), data=values)
+    # Keep the original column order, which zarr group listings do not preserve
+    root.attrs["columns"] = [str(col) for col in data_frame.columns]
+    return zarr_path
+
+
+def _read_table_from_zarr(
+    zarr_dir_path: str,
+    table_name: str,
+    variables: Optional[list[str]] = None,
+) -> pd.DataFrame:
+    """Read a table from a zarr group."""
+    import zarr
+
+    zarr_path = str(Path(zarr_dir_path) / table_name)
+    root = zarr.open_group(zarr_path, mode="r")
+    columns = root.attrs.get("columns", sorted(root.array_keys()))
+    df = pd.DataFrame({name: root[name][...] for name in columns})
+    if variables is not None:
+        df = df[[c for c in variables if c in df.columns]]
+    return df
+
+
+class ZarrBackend:
+    """Store each table as a zarr group in a directory (store_path/table_name)."""
+
+    def write_table(
+        self,
+        store_path: str,
+        table_name: str,
+        data_frame: pd.DataFrame,
+        **kwargs: Any,
+    ) -> Optional[str]:
+        return _write_table_to_zarr(data_frame, store_path, table_name)
+
+    def read_table(
+        self,
+        store_path: str,
+        table_name: str,
+        variables: Optional[list[str]] = None,
+        **kwargs: Any,
+    ) -> pd.DataFrame:
+        return _read_table_from_zarr(store_path, table_name, variables)
+
+    def table_exists(self, store_path: str, table_name: str) -> bool:
+        return (Path(store_path) / table_name).is_dir()
+
+
+def _build_backends() -> dict[str, 
StoreBackend]: + backends: dict[str, StoreBackend] = { + "hdf5": HDF5Backend(), + "parquet": ParquetBackend(), + } + try: + import zarr # noqa: F401 + + backends["zarr"] = ZarrBackend() + except ImportError: + log.debug("zarr not installed; zarr store backend unavailable") + return backends + + +_backends = _build_backends() + +STORE_BACKEND_NAMES = get_available_backend_names() + +__all__ = [ + "get_backend", + "get_available_backend_names", + "register_backend", + "StoreBackend", + "STORE_BACKEND_NAMES", +] + + +def get_backend(name: str) -> StoreBackend: + """Return the store backend for the given format name.""" + if name not in _backends: + raise ValueError(f"Unknown store backend: {name}. Choose from {list(_backends.keys())}") + return _backends[name] + + +def register_backend(name: str, backend: StoreBackend) -> None: + """Register a custom store backend (e.g. for testing or extensions).""" + _backends[name] = backend diff --git a/openfisca_survey_manager/io/hdf.py b/openfisca_survey_manager/io/hdf.py new file mode 100644 index 00000000..33d74ee7 --- /dev/null +++ b/openfisca_survey_manager/io/hdf.py @@ -0,0 +1,56 @@ +"""HDF5 write support for survey tables.""" + +from __future__ import annotations + +import logging +import re +from typing import Any + +import pandas as pd + +log = logging.getLogger(__name__) + +# PyTables / pandas-HDF5 require node names to match ^[a-zA-Z_][a-zA-Z0-9_]*$ +# to avoid NaturalNameWarning. We normalize table names (e.g. person_2017-01 -> person_2017_01). +_HDF5_SAFE_PATTERN = re.compile(r"[^a-zA-Z0-9_]") + + +def hdf5_safe_key(name: str) -> str: + """Return an HDF5 node name safe for PyTables (valid Python identifier).""" + return _HDF5_SAFE_PATTERN.sub("_", name) + + +def write_table_to_hdf5( + data_frame: pd.DataFrame, + *, + hdf5_file_path: str, + store_path: str, + **kwargs: Any, +) -> None: + """Write a DataFrame to HDF5. + + Mirrors historical behavior from `tables.Table.save_data_frame_to_hdf5`. + May mutate `data_frame` (type conversions) to ensure it can be written. 
+ """ + key = hdf5_safe_key(store_path) + try: + data_frame.to_hdf(hdf5_file_path, key=key, append=False, **kwargs) + except (TypeError, NotImplementedError): + log.info("Type problem(s) when creating %s in %s", store_path, hdf5_file_path) + dtypes = data_frame.dtypes + # Checking for strings + converted_dtypes = dtypes.isin(["mixed", "unicode"]) + if converted_dtypes.any(): + log.info("The following types are converted to strings %s", dtypes[converted_dtypes]) + for column in dtypes[converted_dtypes].index: + data_frame[column] = data_frame[column].copy().astype(str) + + # Checking for remaining categories + dtypes = data_frame.dtypes + converted_dtypes = dtypes.isin(["category"]) + if not converted_dtypes.empty: # With category table format is needed + log.info( + "The following types are added as category using the table format %s", + dtypes[converted_dtypes], + ) + data_frame.to_hdf(hdf5_file_path, key=key, append=False, format="table", **kwargs) diff --git a/openfisca_survey_manager/io/readers.py b/openfisca_survey_manager/io/readers.py index b372c957..5feccab4 100644 --- a/openfisca_survey_manager/io/readers.py +++ b/openfisca_survey_manager/io/readers.py @@ -1,5 +1,7 @@ """Readers for survey data (SAS, SPSS, DBF, etc.).""" +from __future__ import annotations + import logging from typing import Optional @@ -45,7 +47,7 @@ def read_spss(spss_file_path: str) -> DataFrame: def read_dbf( dbf_path: str, index: Optional[str] = None, - cols: Optional[list] = None, + cols: Optional[list[str]] = None, incl_index: bool = False, ) -> DataFrame: """ diff --git a/openfisca_survey_manager/io/writers.py b/openfisca_survey_manager/io/writers.py index 8481fe47..25f7063f 100644 --- a/openfisca_survey_manager/io/writers.py +++ b/openfisca_survey_manager/io/writers.py @@ -4,46 +4,14 @@ import logging from pathlib import Path -from typing import Any import pandas as pd -log = logging.getLogger(__name__) - +from openfisca_survey_manager.io.hdf import write_table_to_hdf5 -def write_table_to_hdf5( - data_frame: pd.DataFrame, - *, - hdf5_file_path: str, - store_path: str, - **kwargs: Any, -) -> None: - """Write a DataFrame to HDF5. - - Mirrors historical behavior from `tables.Table.save_data_frame_to_hdf5`. - May mutate `data_frame` (type conversions) to ensure it can be written. 
- """ - try: - data_frame.to_hdf(hdf5_file_path, store_path, append=False, **kwargs) - except (TypeError, NotImplementedError): - log.info("Type problem(s) when creating %s in %s", store_path, hdf5_file_path) - dtypes = data_frame.dtypes - # Checking for strings - converted_dtypes = dtypes.isin(["mixed", "unicode"]) - if converted_dtypes.any(): - log.info("The following types are converted to strings %s", dtypes[converted_dtypes]) - for column in dtypes[converted_dtypes].index: - data_frame[column] = data_frame[column].copy().astype(str) +log = logging.getLogger(__name__) - # Checking for remaining categories - dtypes = data_frame.dtypes - converted_dtypes = dtypes.isin(["category"]) - if not converted_dtypes.empty: # With category table format is needed - log.info( - "The following types are added as category using the table format %s", - dtypes[converted_dtypes], - ) - data_frame.to_hdf(hdf5_file_path, store_path, append=False, format="table", **kwargs) +__all__ = ["write_table_to_hdf5", "write_table_to_parquet"] def write_table_to_parquet( diff --git a/openfisca_survey_manager/matching.py b/openfisca_survey_manager/matching.py deleted file mode 100644 index 8fa93756..00000000 --- a/openfisca_survey_manager/matching.py +++ /dev/null @@ -1,145 +0,0 @@ -import logging -from pathlib import Path - -import pandas as pd - -from openfisca_survey_manager.paths import openfisca_survey_manager_location - -log = logging.getLogger(__name__) - - -config_files_directory = Path(openfisca_survey_manager_location) - - -def nnd_hotdeck_using_feather(receiver=None, donor=None, matching_variables=None, z_variables=None): - """ - Not working - """ - import feather - - assert receiver is not None and donor is not None - assert matching_variables is not None - - temporary_directory_path = config_files_directory / "tmp" - assert temporary_directory_path.exists() - receiver_path = temporary_directory_path / "receiver.feather" - donor_path = temporary_directory_path / "donor.feather" - feather.write_dataframe(receiver, receiver_path) - feather.write_dataframe(donor, donor_path) - if isinstance(matching_variables, str): - match_vars = f'"{matching_variables}"' - elif len(matching_variables) == 1: - match_vars = f'"{matching_variables[0]}"' - else: - match_vars = '"{}"'.format("todo") - - r_script = f""" -rm(list=ls()) -gc() -devtools::install_github("wesm/feather/R") -library(feather) -library(StatMatch) - -receiver <- read_feather({receiver_path}) -donor <- read_feather({donor_path}) -summary(receiver) -summary(donor) - -# variables -receiver = as.data.frame(receiver) -donor = as.data.frame(donor) -gc() -match_vars = {match_vars} -# don_class = c("sexe") -out.nnd <- NND.hotdeck( - data.rec = receiver, data.don = donor, match.vars = match_vars - ) - -# out.nndsummary(out.nnd$mtc.ids) -# head(out.nnd$mtc.ids, 10) -# head(receiver, 10) - -fused.nnd.m <- create.fused( - data.rec = receiver, data.don = donor, - mtc.ids = out.nnd$mtc.ids, - z.vars = "{z_variables}" - ) -summary(fused.nnd.m) -""" - log.debug("%s", r_script) - - -def nnd_hotdeck_using_rpy2(receiver=None, donor=None, matching_variables=None, z_variables=None, donor_classes=None): - from rpy2.robjects import pandas2ri - from rpy2.robjects.packages import importr - - assert receiver is not None and donor is not None - assert matching_variables is not None - - pandas2ri.activate() - stat_match = importr("StatMatch") - - if isinstance(donor_classes, str): - assert donor_classes in receiver, "Donor class not present in receiver" - assert donor_classes in 
donor, "Donor class not present in donor" - - try: - if donor_classes: - out_nnd = stat_match.NND_hotdeck( - data_rec=receiver, - data_don=donor, - match_vars=pd.Series(matching_variables), - don_class=pd.Series(donor_classes), - ) - else: - out_nnd = stat_match.NND_hotdeck( - data_rec=receiver, - data_don=donor, - match_vars=pd.Series(matching_variables), - # don_class = pd.Series(donor_classes) - ) - except Exception as e: - log.debug("receiver: %s", receiver) - log.debug("donor: %s", donor) - log.debug("matching_variables: %s", pd.Series(matching_variables)) - log.exception("NND hotdeck failed: %s", e) - raise - - # create synthetic data.set, without the - # duplication of the matching variables - - fused_0 = pandas2ri.ri2py( - stat_match.create_fused(data_rec=receiver, data_don=donor, mtc_ids=out_nnd[0], z_vars=pd.Series(z_variables)) - ) - - # create synthetic data.set, with the "duplication" - # of the matching variables - - fused_1 = pandas2ri.ri2py( - stat_match.create_fused( - data_rec=receiver, - data_don=donor, - mtc_ids=out_nnd[0], - z_vars=pd.Series(z_variables), - dup_x=True, - match_vars=pd.Series(matching_variables), - ) - ) - - return fused_0, fused_1 - - -if __name__ == "__main__": - log.setLevel(logging.INFO) - - receiver = pd.DataFrame() - donor = pd.DataFrame() - matching_variables = "sexe" - z_variables = "ident" - - nnd_hotdeck_using_feather( - receiver=receiver, - donor=donor, - matching_variables=matching_variables, - z_variables=z_variables, - ) diff --git a/openfisca_survey_manager/paths.py b/openfisca_survey_manager/paths.py deleted file mode 100644 index 288c0a4a..00000000 --- a/openfisca_survey_manager/paths.py +++ /dev/null @@ -1,22 +0,0 @@ -"""Re-export for backward compatibility. - -Prefer: from openfisca_survey_manager.configuration.paths import ... -""" - -from openfisca_survey_manager.configuration.paths import ( - config_ini, - default_config_files_directory, - is_in_ci, - openfisca_survey_manager_location, - private_run_with_data, - test_config_files_directory, -) - -__all__ = [ - "config_ini", - "default_config_files_directory", - "is_in_ci", - "openfisca_survey_manager_location", - "private_run_with_data", - "test_config_files_directory", -] diff --git a/openfisca_survey_manager/policy/__init__.py b/openfisca_survey_manager/policy/__init__.py new file mode 100644 index 00000000..60888238 --- /dev/null +++ b/openfisca_survey_manager/policy/__init__.py @@ -0,0 +1,53 @@ +# Policy-related modules (simulations, simulation_builder, aggregates). +# À terme ces briques pourront être déplacées dans un paquet dédié. +# Voir docs/REFACTORING_PLAN.md. 
+ +from openfisca_survey_manager.policy.aggregates import AbstractAggregates +from openfisca_survey_manager.policy.simulation_builder import ( + SimulationBuilder, + diagnose_variable_mismatch, +) +from openfisca_survey_manager.policy.simulations import ( + SecretViolationError, + Simulation, + adaptative_calculate_variable, + assert_variables_in_same_entity, + compute_aggregate, + compute_pivot_table, + compute_quantiles, + compute_winners_losers, + create_data_frame_by_entity, + get_words, + inflate, + init_entity_data, + init_simulation, + init_variable_in_entity, + new_from_tax_benefit_system, + print_memory_usage, + set_weight_variable_by_entity, + summarize_variable, +) + +__all__ = [ + "AbstractAggregates", + "Simulation", + "SimulationBuilder", + "SecretViolationError", + "adaptative_calculate_variable", + "assert_variables_in_same_entity", + "compute_aggregate", + "compute_pivot_table", + "compute_quantiles", + "compute_winners_losers", + "create_data_frame_by_entity", + "diagnose_variable_mismatch", + "get_words", + "inflate", + "init_entity_data", + "init_simulation", + "init_variable_in_entity", + "new_from_tax_benefit_system", + "print_memory_usage", + "set_weight_variable_by_entity", + "summarize_variable", +] diff --git a/openfisca_survey_manager/aggregates.py b/openfisca_survey_manager/policy/aggregates.py similarity index 91% rename from openfisca_survey_manager/aggregates.py rename to openfisca_survey_manager/policy/aggregates.py index 7d8059c4..7ed440fc 100644 --- a/openfisca_survey_manager/aggregates.py +++ b/openfisca_survey_manager/policy/aggregates.py @@ -1,8 +1,12 @@ +"""Aggregates computation for survey scenarios.""" + +from __future__ import annotations + import collections import logging from datetime import datetime from pathlib import Path -from typing import Optional +from typing import Any, Optional, Union import numpy as np import pandas as pd @@ -29,11 +33,11 @@ class AbstractAggregates: def __init__( self, - survey_scenario=None, - absolute_minimal_detected_variation=0, - relative_minimal_detected_variation=0, - observations_threshold=0, - ): + survey_scenario: Any = None, + absolute_minimal_detected_variation: float = 0, + relative_minimal_detected_variation: float = 0, + observations_threshold: float = 0, + ) -> None: assert survey_scenario is not None self.period = survey_scenario.period @@ -270,7 +274,7 @@ def compute_variable_aggregates( return variable_data_frame - def create_description(self): + def create_description(self) -> pd.DataFrame: """Create a description dataframe.""" now = datetime.now() return pd.DataFrame( @@ -284,14 +288,14 @@ def create_description(self): def to_csv( self, - path=None, - absolute=True, - amount=True, - beneficiaries=True, - default="actual", - relative=True, - target="reform", - ): + path: Optional[Union[Path, str]] = None, + absolute: bool = True, + amount: bool = True, + beneficiaries: bool = True, + default: str = "actual", + relative: bool = True, + target: str = "reform", + ) -> None: """Saves the table to csv.""" assert path is not None @@ -313,14 +317,14 @@ def to_csv( def to_excel( self, - path=None, - absolute=True, - amount=True, - beneficiaries=True, - default="actual", - relative=True, - target="reform", - ): + path: Optional[Union[Path, str]] = None, + absolute: bool = True, + amount: bool = True, + beneficiaries: bool = True, + default: str = "actual", + relative: bool = True, + target: str = "reform", + ) -> None: """Save the table to excel.""" assert path is not None @@ -346,14 +350,14 @@ def to_excel( 
def to_html( self, - path=None, - absolute=True, - amount=True, - beneficiaries=True, - default="actual", - relative=True, - target="reform", - ): + path: Optional[Union[Path, str]] = None, + absolute: bool = True, + amount: bool = True, + beneficiaries: bool = True, + default: str = "actual", + relative: bool = True, + target: str = "reform", + ) -> str: """Get or saves the table to html format.""" df = self.get_data_frame( absolute=absolute, @@ -377,14 +381,14 @@ def to_html( def to_markdown( self, - path=None, - absolute=True, - amount=True, - beneficiaries=True, - default="actual", - relative=True, - target="reform", - ): + path: Optional[Union[Path, str]] = None, + absolute: bool = True, + amount: bool = True, + beneficiaries: bool = True, + default: str = "actual", + relative: bool = True, + target: str = "reform", + ) -> str: """Get or saves the table to markdown format.""" df = self.get_data_frame( absolute=absolute, @@ -425,7 +429,7 @@ def get_data_frame( relative: bool = True, target: str = "reform", ignore_labels: bool = False, - ): + ) -> pd.DataFrame: assert target is None or target in ["reform", "baseline"] columns = self.labels.keys() @@ -509,10 +513,14 @@ def get_data_frame( return df - def load_actual_data(self, period=None): + def load_actual_data(self, period: Any = None) -> None: pass - def compute_winners_losers(self, variable: str, filter_by: Optional[str] = None): + def compute_winners_losers( + self, + variable: str, + filter_by: Optional[str] = None, + ) -> pd.DataFrame: if "reform" not in self.simulations or "baseline" not in self.simulations: log.warning("Cannot compute winners and losers without a reform and a baseline simulation.") return pd.DataFrame() @@ -546,7 +554,10 @@ def compute_winners_losers(self, variable: str, filter_by: Optional[str] = None) ) return winners_losers_df - def compute_all_winners_losers(self, filter_by: Optional[str] = None): + def compute_all_winners_losers( + self, + filter_by: Optional[str] = None, + ) -> pd.DataFrame: all_winners_losers = pd.DataFrame() for variable in self.aggregate_variables: winners_losers = self.compute_winners_losers(variable, filter_by=filter_by) diff --git a/openfisca_survey_manager/processing/weights/calibration.py b/openfisca_survey_manager/policy/calibration.py similarity index 89% rename from openfisca_survey_manager/processing/weights/calibration.py rename to openfisca_survey_manager/policy/calibration.py index acf6b212..a4ee84aa 100644 --- a/openfisca_survey_manager/processing/weights/calibration.py +++ b/openfisca_survey_manager/policy/calibration.py @@ -1,14 +1,18 @@ """Calibration of survey weights (SurveyScenario).""" +from __future__ import annotations + import logging import re +from typing import Any, Optional import numpy +import numpy as np import pandas as pd from numpy import logical_not from openfisca_core.model_api import Enum -from openfisca_survey_manager.processing.weights.calmar import calmar +from openfisca_survey_manager.policy.calmar import calmar log = logging.getLogger(__name__) @@ -16,32 +20,32 @@ class Calibration: """An object to calibrate survey data of a SurveyScenario.""" - filter_by = None - initial_entity_count = None - _initial_weight_name = None - initial_weight_by_entity = None - target_margins = None - margins_by_variable = None - parameters = None - period = None - simulation = None - target_entity_count = None - other_entity_count = None - target_entity = None - weight_name = None - entities = None + filter_by: Any = None + initial_entity_count: Optional[float] = 
None + _initial_weight_name: Optional[str] = None + initial_weight_by_entity: dict[str, Any] # set in __init__ + target_margins: Optional[dict[str, Any]] = None + margins_by_variable: Optional[dict[str, Any]] = None + parameters: Optional[dict[str, Any]] = None + period: Any = None + simulation: Any = None + target_entity_count: Optional[float] = None + other_entity_count: Optional[float] = None + target_entity: Optional[str] = None + weight_name: Optional[str] = None + entities: Optional[list[str]] = None def __init__( self, - simulation, - target_margins, - period, - target_entity_count=None, - other_entity_count=None, - parameters=None, - filter_by=None, - entity=None, - ): + simulation: Any, + target_margins: dict[str, Any], + period: Any, + target_entity_count: Optional[float] = None, + other_entity_count: Optional[float] = None, + parameters: Optional[dict[str, Any]] = None, + filter_by: Any = None, + entity: Optional[str] = None, + ) -> None: target_entity = entity self.parameters = parameters or { "use_proportions": True, @@ -197,14 +201,14 @@ def _build_calmar_data(self) -> dict: return data - def calibrate(self, inplace=False): + def calibrate(self, inplace: bool = False) -> Optional[np.ndarray]: """Apply the calibrations by updating weights and margins. Args: - inplace (bool, optional): Whether to return the calibrated or to setthem inplace. Defaults to False. + inplace: Whether to apply in place (no return) or return calibrated weights. Returns: - numpy.array: calibrated weights + Calibrated weights array, or None if inplace=True. """ assert self.margins_by_variable is not None, "Margins by variable should be set" margins_by_variable = self.margins_by_variable @@ -230,7 +234,7 @@ def calibrate(self, inplace=False): return self.weight - def get_parameters(self) -> dict: + def get_parameters(self) -> dict[str, Any]: """Get the parameters. Returns: @@ -252,12 +256,12 @@ def get_parameters(self) -> dict: p["initial_weight"] = self.weight_name + "" return p - def set_target_margin(self, variable, target): + def set_target_margin(self, variable: str, target: Any) -> None: """Set variable target margin. Args: - variable: Target variable - target: Target value + variable: Target variable name or expression. + target: Target value (scalar or dict of category -> value). 
""" simulation = self.simulation period = self.period @@ -293,13 +297,13 @@ def set_target_margin(self, variable, target): self.margins_by_variable[variable]["target"] = target_by_category or target self._update_margins() - def reset(self): + def reset(self) -> None: """Reset the calibration to its initial state.""" simulation = self.simulation simulation.delete_arrays(self.weight_name, self.period) simulation.set_input(self.weight_name, self.period, numpy.array(self.initial_weight)) - def set_calibrated_weights(self): + def set_calibrated_weights(self) -> None: """Modify the weights to use the calibrated weights.""" period = self.period simulation = self.simulation @@ -314,7 +318,7 @@ def set_calibrated_weights(self): elif weight_variable.formulas: simulation.delete_arrays(weight_variable.name, period) - def summary(self): + def summary(self) -> pd.DataFrame: """Summarize margins.""" margins_df = pd.DataFrame.from_dict(self.margins_by_variable).T margins_df.loc["entity_count", "actual"] = (self.weight * self.filter_by).sum() @@ -322,7 +326,7 @@ def summary(self): margins_df.loc["entity_count", "target"] = self.target_entity_count return margins_df - def _update_margins(self): + def _update_margins(self) -> None: """Update margins.""" for variable in self.margins_by_variable: simulation = self.simulation @@ -381,17 +385,12 @@ def _update_margins(self): } self.margins_by_variable[variable].update(margin_by_type) - def _update_weights(self, margins, parameters=None): - """Run calmar, stores new weights and returns adjusted margins. - - Args: - margins: margins - parameters: Parameters (Default value = {}) - - Returns: - dict: Updated margins - - """ + def _update_weights( + self, + margins: dict[str, Any], + parameters: Optional[dict[str, Any]] = None, + ) -> dict[str, Any]: + """Run calmar, store new weights and return adjusted margins.""" if parameters is None: parameters = {} diff --git a/openfisca_survey_manager/processing/weights/calmar.py b/openfisca_survey_manager/policy/calmar.py similarity index 88% rename from openfisca_survey_manager/processing/weights/calmar.py rename to openfisca_survey_manager/policy/calmar.py index dd4986d4..c54d0497 100644 --- a/openfisca_survey_manager/processing/weights/calmar.py +++ b/openfisca_survey_manager/policy/calmar.py @@ -1,8 +1,12 @@ """CALMAR: Calibrates weights to satisfy margins constraints.""" +from __future__ import annotations + import logging import operator +from typing import Any, Optional +import numpy as np import pandas as pd from numpy import array, dot, exp, float64, ones, sqrt, unique, zeros from numpy import log as ln @@ -12,7 +16,7 @@ log = logging.getLogger(__name__) -def linear(u): +def linear(u: np.ndarray) -> np.ndarray: """ Args: @@ -24,68 +28,27 @@ def linear(u): return 1 + u -def linear_prime(u): - """ - - Args: - u: - - Returns: - - """ +def linear_prime(u: np.ndarray) -> np.ndarray: + """Derivative of linear (constant 1).""" return ones(u.shape, dtype=float) -def raking_ratio(u): - """ - - Args: - u: - - Returns: - - """ +def raking_ratio(u: np.ndarray) -> np.ndarray: + """Raking ratio (exponential) calibration function.""" return exp(u) -def raking_ratio_prime(u): - """ - - Args: - u: - - Returns: - - """ +def raking_ratio_prime(u: np.ndarray) -> np.ndarray: + """Derivative of raking_ratio.""" return exp(u) -def logit(u, low, up): - """ - - Args: - u: - low: - up: - - Returns: - - """ +def logit(u: np.ndarray, low: float, up: float) -> np.ndarray: a = (up - low) / ((1 - low) * (up - 1)) return (low * (up - 1) + up * (1 - 
low) * exp(a * u)) / (up - 1 + (1 - low) * exp(a * u)) -def logit_prime(u, low, up): - """ - - Args: - u: - low: - up: - - Returns: - - """ +def logit_prime(u: np.ndarray, low: float, up: float) -> np.ndarray: a = (up - low) / ((1 - low) * (up - 1)) return ( (a * up * (1 - low) * exp(a * u)) * (up - 1 + (1 - low) * exp(a * u)) @@ -93,12 +56,12 @@ def logit_prime(u, low, up): ) / (up - 1 + (1 - low) * exp(a * u)) ** 2 -def hyperbolic_sinus(u, alpha): +def hyperbolic_sinus(u: np.ndarray, alpha: float) -> np.ndarray: logarithm = ln(2 * alpha * u + sqrt(4 * (alpha**2) * (u**2) + 1)) return 0.5 * (logarithm / alpha + sqrt((logarithm / alpha) ** 2 + 4)) -def hyperbolic_sinus_prime(u, alpha): +def hyperbolic_sinus_prime(u: np.ndarray, alpha: float) -> np.ndarray: square = sqrt(4 * (alpha**2) * (u**2) + 1) return 0.5 * ( ((4 * (alpha**2) * u) / square + 2 * alpha) / (alpha * (square + 2 * alpha * u)) @@ -107,16 +70,8 @@ def hyperbolic_sinus_prime(u, alpha): ) -def build_dummies_dict(data): - """ - - Args: - data: - - Returns: - - - """ +def build_dummies_dict(data: np.ndarray | pd.Series) -> dict[Any, np.ndarray | pd.Series]: + """Build a dict mapping each unique value to a boolean mask (data == value).""" unique_val_list = unique(data) output = {} for val in unique_val_list: @@ -125,17 +80,17 @@ def build_dummies_dict(data): def calmar( - data_in, - margins: dict, + data_in: dict[str, Any], + margins: dict[str, Any], initial_weight: str, - method="linear", - lo=None, - up=None, - alpha=None, + method: str = "linear", + lo: Optional[float] = None, + up: Optional[float] = None, + alpha: Optional[float] = None, use_proportions: bool = False, xtol: float = 1.49012e-08, maxfev: int = 256, -): +) -> tuple[np.ndarray, np.ndarray, dict[str, Any]]: """Calibrates weights to satisfy margins constraints. 
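    A minimal sketch of a call (illustrative data; assumes data_in maps variable
    names to 1-D numpy arrays, including the initial-weight column):

        data_in = {"initial_weight": ones(3), "x": array([0.0, 1.0, 1.0])}
        pondfin, lambdasol, margins_new_dict = calmar(
            data_in, margins={"x": 3.0}, initial_weight="initial_weight"
        )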
Args: @@ -235,20 +190,20 @@ def calmar( assert lo is not None, "When method == 'logit', a value < 1 for lo is mandatory" assert lo < 1, "lo should be < 1" - def f(x): + def f(x: np.ndarray) -> np.ndarray: return logit(x, lo, up) - def f_prime(x): + def f_prime(x: np.ndarray) -> np.ndarray: return logit_prime(x, lo, up) elif method == "hyperbolic sinus": assert alpha is not None, "When method == 'hyperbolic sinus', a value > 0 for alpha is mandatory" assert alpha > 0, "alpha should be > 0" - def f(x): + def f(x: np.ndarray) -> np.ndarray: return hyperbolic_sinus(x, alpha) - def f_prime(x): + def f_prime(x: np.ndarray) -> np.ndarray: return hyperbolic_sinus_prime(x, alpha) margins = margins.copy() @@ -345,10 +300,10 @@ def f_prime(x): margins_dict[var] = val # Solve the first-order equations - def constraint(lambda_): + def constraint(lambda_: np.ndarray) -> np.ndarray: return dot(d * f(dot(x, lambda_)), x) - xmargins - def constraint_prime(lambda_): + def constraint_prime(lambda_: np.ndarray) -> np.ndarray: return dot(d * (x.T * f_prime(dot(x, lambda_))), x) # the Jacobian above is constraintprime = @(lambda) x*(d.*Fprime(x'*lambda)*x'); @@ -388,16 +343,13 @@ def constraint_prime(lambda_): return pondfin_out, lambdasol, margins_new_dict -def check_calmar(margins, margins_new_dict=None): - """ - - Args: - margins: - margins_new_dict: (Default value = None) - - Returns: - - """ +def check_calmar( + margins: dict[str, Any], + margins_new_dict: Optional[dict[str, Any]] = None, +) -> None: + """Log relative difference between initial margins and calibrated margins.""" + if margins_new_dict is None: + return for variable, margin in margins.items(): if variable != "total_population": rel_diff = abs(margin - margins_new_dict[variable]) / abs(margin) diff --git a/openfisca_survey_manager/coicop.py b/openfisca_survey_manager/policy/coicop.py similarity index 85% rename from openfisca_survey_manager/coicop.py rename to openfisca_survey_manager/policy/coicop.py index b30223d5..f6785ed5 100644 --- a/openfisca_survey_manager/coicop.py +++ b/openfisca_survey_manager/policy/coicop.py @@ -1,9 +1,14 @@ +"""COICOP nomenclature helpers.""" + +from __future__ import annotations + import logging from pathlib import Path +from typing import Literal import pandas as pd -from openfisca_survey_manager.paths import openfisca_survey_manager_location +from openfisca_survey_manager.configuration.paths import openfisca_survey_manager_location log = logging.getLogger(__name__) @@ -11,11 +16,18 @@ legislation_directory = Path(openfisca_survey_manager_location) / "openfisca_survey_manager" / "assets" -sub_levels = ["divisions", "groupes", "classes", "sous_classes", "postes"] +sub_levels: tuple[str, ...] 
= ("divisions", "groupes", "classes", "sous_classes", "postes") divisions = [f"0{i}" for i in range(1, 10)] + ["11", "12"] +CoicopLevel = Literal["divisions", "groupes", "classes", "sous_classes", "postes"] + -def build_coicop_level_nomenclature(level, year=2016, keep_code=False, to_csv=False): +def build_coicop_level_nomenclature( + level: CoicopLevel, + year: int = 2016, + keep_code: bool = False, + to_csv: bool = False, +) -> pd.DataFrame: assert level in sub_levels log.debug(f"Reading nomenclature coicop {year} source data for level {level}") try: @@ -69,7 +81,7 @@ def build_coicop_level_nomenclature(level, year=2016, keep_code=False, to_csv=Fa return data_frame -def build_raw_coicop_nomenclature(year=2016): +def build_raw_coicop_nomenclature(year: int = 2016) -> pd.DataFrame: """Builds raw COICOP nomenclature from ecoicop levels""" coicop_nomenclature = None diff --git a/openfisca_survey_manager/policy/legislation_asof.py b/openfisca_survey_manager/policy/legislation_asof.py new file mode 100644 index 00000000..362fa6e4 --- /dev/null +++ b/openfisca_survey_manager/policy/legislation_asof.py @@ -0,0 +1,243 @@ +"""Shared helpers (no survey collection dependency) to avoid circular imports.""" + +from __future__ import annotations + +import logging +from typing import Any, Optional + +from openfisca_core import periods +from openfisca_core.parameters import ParameterNode, Scale + +log = logging.getLogger(__name__) + + +def do_nothing(*args: Any, **kwargs: Any) -> None: + return None + + +def inflate_parameters( + parameters: ParameterNode | Scale | Any, + inflator: float, + base_year: int, + last_year: Optional[int] = None, + ignore_missing_units: bool = False, + start_instant: Optional[str] = None, + round_ndigits: int = 2, +) -> None: + """ + Inflate a Parameter node or a Parameter leaf for the years between base_year and last_year. + + ::parameters:: a Parameter node or a Parameter leaf + ::inflator:: rate used to inflate the parameter. The rate is unique for all the years + ::base_year:: base year of the parameter + ::last_year:: last year of inflation + ::ignore_missing_units:: if True, a parameter leaf without unit in metadata will not be inflated + ::start_instant:: Instant of the year when the update should start, if None will be January 1st + ::round_ndigits:: Number of digits to keep in the rounded result + """ + if (last_year is not None) and (last_year > base_year + 1): + for year in range(base_year + 1, last_year + 1): + inflate_parameters( + parameters, + inflator, + year - 1, + last_year=year, + ignore_missing_units=ignore_missing_units, + start_instant=start_instant, + round_ndigits=round_ndigits, + ) + else: + if last_year is None: + last_year = base_year + 1 + + assert last_year == base_year + 1 + + if isinstance(parameters, ParameterNode): + for sub_parameter in parameters.children.values(): + inflate_parameters( + sub_parameter, + inflator, + base_year, + last_year, + ignore_missing_units=ignore_missing_units, + start_instant=start_instant, + round_ndigits=round_ndigits, + ) + else: + acceptable_units = [ + "rate_unit", + "threshold_unit", + "unit", + ] + if ignore_missing_units: + if not hasattr(parameters, "metadata"): + return + if not bool(set(parameters.metadata.keys()) & set(acceptable_units)): + return + assert hasattr(parameters, "metadata"), f"{parameters.name} doesn't have metadata" + unit_types = set(parameters.metadata.keys()).intersection(set(acceptable_units)) + assert unit_types, ( + f"No admissible unit in metadata for parameter {parameters.name}. 
You may consider using " + "the option 'ignore_missing_units' from the inflate_parameters() function." ) + if len(unit_types) > 1: + assert unit_types == {"threshold_unit", "rate_unit"}, ( + f"Too many admissible units in metadata for parameter {parameters.name}" ) + unit_by_type = {unit_type: parameters.metadata[unit_type] for unit_type in unit_types} + for unit_type in unit_by_type: + if parameters.metadata[unit_type].startswith("currency"): + inflate_parameter_leaf( + parameters, + base_year, + inflator, + unit_type=unit_type, + start_instant=start_instant, + round_ndigits=round_ndigits, + ) + + +def inflate_parameter_leaf( + sub_parameter: Any, + base_year: int, + inflator: float, + unit_type: str = "unit", + start_instant: Optional[str] = None, + round_ndigits: int = 2, +) -> None: + """ + Inflate a Parameter leaf according to unit type for the year after base_year. + + ::sub_parameter:: a Parameter leaf + ::base_year:: base year of the parameter + ::inflator:: rate used to inflate the parameter + ::unit_type:: unit type assumed by default. Other admissible unit types are threshold_unit and rate_unit + ::start_instant:: Instant of the year when the update should start, if None will be January 1st + ::round_ndigits:: Number of digits to keep in the rounded result + """ + if isinstance(sub_parameter, Scale): + if unit_type == "threshold_unit": + for bracket in sub_parameter.brackets: + threshold = bracket.children["threshold"] + inflate_parameter_leaf( + threshold, base_year, inflator, start_instant=start_instant, round_ndigits=round_ndigits + ) + return + else: + kept_instants_str = [ + parameter_at_instant.instant_str + for parameter_at_instant in sub_parameter.values_list + if periods.instant(parameter_at_instant.instant_str).year <= base_year + ] + if not kept_instants_str: + return + + last_admissible_instant_str = max(kept_instants_str) + sub_parameter.update(start=last_admissible_instant_str, value=sub_parameter(last_admissible_instant_str)) + if start_instant is not None: + assert periods.instant(start_instant).year == (base_year + 1), ( + "Year of start_instant should be base_year + 1" + ) + value = ( + round(sub_parameter(f"{base_year}-12-31") * (1 + inflator), round_ndigits) + if sub_parameter(f"{base_year}-12-31") is not None + else None + ) + sub_parameter.update( + start=start_instant, + value=value, + ) + else: + restricted_to_base_year_value_list = [ + parameter_at_instant + for parameter_at_instant in sub_parameter.values_list + if periods.instant(parameter_at_instant.instant_str).year == base_year + ] + if restricted_to_base_year_value_list: + for parameter_at_instant in reversed(restricted_to_base_year_value_list): + if parameter_at_instant.instant_str.startswith(str(base_year)): + value = ( + round(parameter_at_instant.value * (1 + inflator), round_ndigits) + if parameter_at_instant.value is not None + else None + ) + sub_parameter.update( + start=parameter_at_instant.instant_str.replace(str(base_year), str(base_year + 1)), + value=value, + ) + else: + value = ( + round(sub_parameter(f"{base_year}-12-31") * (1 + inflator), round_ndigits) + if sub_parameter(f"{base_year}-12-31") is not None + else None + ) + sub_parameter.update( + start=f"{base_year + 1}-01-01", + value=value, + ) + + +def asof(tax_benefit_system: Any, instant: str | periods.Instant) -> None: + parameters = tax_benefit_system.parameters + parameters_asof(parameters, instant) + variables_asof(tax_benefit_system, instant) + + +def leaf_asof(sub_parameter: Any, instant: periods.Instant) -> None: 
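+    # Keep only the values dated at or before `instant`, then pin the parameter to the latest admissible value (emptying values_list when none remains).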
kept_instants_str = [ + parameter_at_instant.instant_str + for parameter_at_instant in sub_parameter.values_list + if periods.instant(parameter_at_instant.instant_str) <= instant + ] + if not kept_instants_str: + sub_parameter.values_list = [] + return + + last_admissible_instant_str = max(kept_instants_str) + sub_parameter.update(start=last_admissible_instant_str, value=sub_parameter(last_admissible_instant_str)) + + +def parameters_asof(parameters: ParameterNode | Any, instant: str | periods.Instant) -> None: + if isinstance(instant, str): + instant = periods.instant(instant) + assert isinstance(instant, periods.Instant) + + for sub_parameter in parameters.children.values(): + if isinstance(sub_parameter, ParameterNode): + parameters_asof(sub_parameter, instant) + else: + if isinstance(sub_parameter, Scale): + for bracket in sub_parameter.brackets: + threshold = bracket.children["threshold"] + rate = bracket.children.get("rate") + amount = bracket.children.get("amount") + leaf_asof(threshold, instant) + if rate: + leaf_asof(rate, instant) + if amount: + leaf_asof(amount, instant) + else: + leaf_asof(sub_parameter, instant) + + +def variables_asof( + tax_benefit_system: Any, + instant: str | periods.Instant, + variables_list: Optional[list[str]] = None, +) -> None: + if isinstance(instant, str): + instant = periods.instant(instant) + assert isinstance(instant, periods.Instant) + + if variables_list is None: + variables_list = tax_benefit_system.variables.keys() + + for variable_name, variable in tax_benefit_system.variables.items(): + if variable_name in variables_list: + formulas = variable.formulas + for instant_str in list(formulas.keys()): + if periods.instant(instant_str) > instant: + del formulas[instant_str] + + if variable.end is not None and periods.instant(variable.end) >= instant: + variable.end = None diff --git a/openfisca_survey_manager/policy/matching.py b/openfisca_survey_manager/policy/matching.py new file mode 100644 index 00000000..06f8e996 --- /dev/null +++ b/openfisca_survey_manager/policy/matching.py @@ -0,0 +1,380 @@ +"""Nearest-neighbor donor (NND) hot deck matching — pure Python or R (StatMatch).""" + +from __future__ import annotations + +import logging +from pathlib import Path +from typing import List, Optional, Union + +import numpy as np +import pandas as pd + +from openfisca_survey_manager.configuration.paths import openfisca_survey_manager_location + +log = logging.getLogger(__name__) + +config_files_directory = Path(openfisca_survey_manager_location) + + +def _normalize_list( + x: Optional[Union[str, List[str]]], + name: str = "variables", +) -> Optional[list[str]]: + """Return a list of variable names from str or list.""" + if x is None: + return None + if isinstance(x, str): + return [x] + return list(x) + + +def _nnd_hotdeck_python( + receiver: pd.DataFrame, + donor: pd.DataFrame, + matching_variables: list[str], + donor_classes: list[str] | str | None = None, + dist_fun: str = "Manhattan", + random_state: int | None = None, +) -> np.ndarray: + """ + Nearest-neighbor donor matching in pure Python (pandas + numpy). + + For each receiver row, finds the donor row that minimizes distance on + `matching_variables`. Optionally restricts to donors in the same + `donor_classes`. Ties are broken at random. + + Parameters + ---------- + receiver, donor : DataFrame + Recipient and donor datasets; must contain `matching_variables` + (and `donor_classes` if provided). Matching variables must be numeric + for Manhattan/Euclidean. 
+ matching_variables : list of str + Column names used to compute distance. + donor_classes : str or list of str, optional + Columns defining donation classes; matching is done only within + the same class. Must not contain missing values. + dist_fun : str + "Manhattan" (default) or "Euclidean". + random_state : int, optional + Seed for breaking ties. + + Returns + ------- + mtc_ids : ndarray of int + Shape (len(receiver), 2): (receiver_index, donor_index) for each row. + Receiver index is 0..n_rec-1, donor index is 0..n_don-1. + """ + rng = np.random.default_rng(random_state) + match_vars = _normalize_list(matching_variables) + don_class = _normalize_list(donor_classes) if donor_classes is not None else None + + for col in match_vars: + if col not in receiver.columns or col not in donor.columns: + raise ValueError(f"Matching variable {col!r} missing in receiver or donor") + if don_class: + for col in don_class: + if col not in receiver.columns or col not in donor.columns: + raise ValueError(f"Donor class variable {col!r} missing in receiver or donor") + + x_rec = receiver[match_vars].astype(float).values + x_don = donor[match_vars].astype(float).values + n_rec, n_don = len(receiver), len(donor) + if n_don == 0: + raise ValueError("Donor dataframe is empty") + + if dist_fun == "Manhattan": + + def dist_fn(donors: np.ndarray, rec_row: np.ndarray) -> np.ndarray: + return np.sum(np.abs(donors - rec_row), axis=1) + + elif dist_fun == "Euclidean": + + def dist_fn(donors: np.ndarray, rec_row: np.ndarray) -> np.ndarray: + return np.sqrt(np.sum((donors - rec_row) ** 2, axis=1)) + + else: + raise ValueError(f"dist_fun must be 'Manhattan' or 'Euclidean', got {dist_fun!r}") + + if don_class is None: + # Global matching: for each receiver row, min distance over all donors + donor_ix = np.zeros(n_rec, dtype=np.intp) + for i in range(n_rec): + d = dist_fn(x_don, x_rec[i]) + min_d = np.min(d) + candidates = np.where(d == min_d)[0] + donor_ix[i] = rng.choice(candidates) + mtc_ids = np.column_stack([np.arange(n_rec), donor_ix]) + return mtc_ids + + # Within-class matching: for each group, match receiver rows to donors in same group + rec_groups = receiver.groupby(don_class, sort=False) + don_groups = donor.groupby(don_class, sort=False) + donor_iloc = np.zeros(n_rec, dtype=np.intp) + + for key, rec_grp in rec_groups: + try: + don_grp = don_groups.get_group(key) + except KeyError: + log.warning("No donors for class %s; receiver rows get donor 0", key) + continue + x_r = rec_grp[match_vars].astype(float).values + x_d = don_grp[match_vars].astype(float).values + n_r, n_d = len(rec_grp), len(don_grp) + if n_d == 0: + continue + # Receiver global ilocs for this group + rec_global_ilocs = receiver.index.get_indexer(rec_grp.index) + for j in range(n_r): + d = dist_fn(x_d, x_r[j]) + min_d = np.min(d) + candidates = np.where(d == min_d)[0] + don_local = rng.choice(candidates) + don_global_iloc = donor.index.get_loc(don_grp.index[don_local]) + donor_iloc[rec_global_ilocs[j]] = don_global_iloc + mtc_ids = np.column_stack([np.arange(n_rec), donor_iloc]) + return mtc_ids + + +def _create_fused_python( + receiver: pd.DataFrame, + donor: pd.DataFrame, + mtc_ids: np.ndarray, + z_variables: list[str], + dup_x: bool = False, + matching_variables: list[str] | None = None, +) -> pd.DataFrame: + """ + Build fused dataset: receiver plus z_variables from matched donors. + + mtc_ids : shape (n_receiver, 2), second column is donor position (integer). 
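+    When `dup_x` is true and `matching_variables` is given, the donor's
+    matching variables are additionally copied under a `_donor` suffix.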
+ """ + z_vars = _normalize_list(z_variables) + for col in z_vars: + if col not in donor.columns: + raise ValueError(f"z_variable {col!r} not in donor") + fused = receiver.copy() + don_pos = mtc_ids[:, 1] + for col in z_vars: + fused[col] = donor[col].iloc[don_pos].values + if dup_x and matching_variables: + match_vars = _normalize_list(matching_variables) + for col in match_vars: + if col in donor.columns: + fused[col + "_donor"] = donor[col].iloc[don_pos].values + return fused + + +def nnd_hotdeck( + receiver: pd.DataFrame | None = None, + donor: pd.DataFrame | None = None, + matching_variables: str | list[str] | None = None, + z_variables: str | list[str] | None = None, + donor_classes: str | list[str] | None = None, + dist_fun: str = "Manhattan", + use_r: bool = False, + random_state: int | None = None, +) -> tuple[pd.DataFrame, pd.DataFrame]: + """ + Nearest-neighbor donor (NND) hot deck: match each receiver row to a donor, + then fuse z_variables from donor into receiver. + + By default uses a **pure Python** implementation (pandas + numpy). + Set `use_r=True` to use R's StatMatch via rpy2 (same API as before). + + Parameters + ---------- + receiver, donor : DataFrame + Recipient and donor datasets. + matching_variables : str or list of str + Columns used to compute distance (must be numeric for Manhattan/Euclidean). + z_variables : str or list of str + Donor columns to copy into the fused dataset. + donor_classes : str or list of str, optional + Match only within the same class (e.g. same sex). + dist_fun : str + "Manhattan" (default) or "Euclidean" (pure Python); R supports more. + use_r : bool + If True, use R StatMatch via rpy2; otherwise use pure Python. + random_state : int, optional + Seed for tie-breaking (pure Python only). + + Returns + ------- + fused_0, fused_1 : DataFrame + fused_0: receiver + z_variables from donor (no duplicate match vars). + fused_1: same with matching variables duplicated as _donor (if applicable). 
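+
+    Example
+    -------
+    A minimal sketch (hypothetical column names)::
+
+        receiver = pd.DataFrame({"age": [25.0, 40.0]})
+        donor = pd.DataFrame({"age": [24.0, 41.0], "income": [1200.0, 2300.0]})
+        fused_0, _ = nnd_hotdeck(
+            receiver=receiver,
+            donor=donor,
+            matching_variables="age",
+            z_variables="income",
+            random_state=0,
+        )
+        # fused_0 now carries an "income" column taken from each nearest donor.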
+ """ + assert receiver is not None and donor is not None + assert matching_variables is not None and z_variables is not None + match_vars = _normalize_list(matching_variables) + z_vars = _normalize_list(z_variables) + + if use_r: + return _nnd_hotdeck_rpy2( + receiver=receiver, + donor=donor, + matching_variables=match_vars, + z_variables=z_vars, + donor_classes=donor_classes, + ) + + mtc_ids = _nnd_hotdeck_python( + receiver, + donor, + match_vars, + donor_classes=donor_classes, + dist_fun=dist_fun, + random_state=random_state, + ) + fused_0 = _create_fused_python(receiver, donor, mtc_ids, z_vars, dup_x=False) + fused_1 = _create_fused_python(receiver, donor, mtc_ids, z_vars, dup_x=True, matching_variables=match_vars) + return fused_0, fused_1 + + +def _nnd_hotdeck_rpy2( + receiver: pd.DataFrame, + donor: pd.DataFrame, + matching_variables: list[str], + z_variables: list[str], + donor_classes: str | list[str] | None = None, +) -> tuple[pd.DataFrame, pd.DataFrame]: + """R (StatMatch) implementation via rpy2; same return as nnd_hotdeck.""" + from rpy2.robjects import pandas2ri + from rpy2.robjects.packages import importr + + pandas2ri.activate() + stat_match = importr("StatMatch") + + if donor_classes is not None: + don_class = _normalize_list(donor_classes) + for col in don_class: + if col not in receiver.columns or col not in donor.columns: + raise ValueError(f"Donor class variable {col!r} missing") + out_nnd = stat_match.NND_hotdeck( + data_rec=receiver, + data_don=donor, + match_vars=pd.Series(matching_variables), + don_class=pd.Series(don_class), + ) + else: + out_nnd = stat_match.NND_hotdeck( + data_rec=receiver, + data_don=donor, + match_vars=pd.Series(matching_variables), + ) + + fused_0 = pandas2ri.ri2py( + stat_match.create_fused(data_rec=receiver, data_don=donor, mtc_ids=out_nnd[0], z_vars=pd.Series(z_variables)) + ) + fused_1 = pandas2ri.ri2py( + stat_match.create_fused( + data_rec=receiver, + data_don=donor, + mtc_ids=out_nnd[0], + z_vars=pd.Series(z_variables), + dup_x=True, + match_vars=pd.Series(matching_variables), + ) + ) + return fused_0, fused_1 + + +def nnd_hotdeck_using_rpy2( + receiver: pd.DataFrame | None = None, + donor: pd.DataFrame | None = None, + matching_variables: str | list[str] | None = None, + z_variables: str | list[str] | None = None, + donor_classes: str | list[str] | None = None, +) -> tuple[pd.DataFrame, pd.DataFrame]: + """ + NND hot deck via R (StatMatch). Prefer `nnd_hotdeck(..., use_r=True)`. 
+ """ + return nnd_hotdeck( + receiver=receiver, + donor=donor, + matching_variables=matching_variables, + z_variables=z_variables, + donor_classes=donor_classes, + use_r=True, + ) + + +def nnd_hotdeck_using_feather( + receiver: pd.DataFrame | None = None, + donor: pd.DataFrame | None = None, + matching_variables: str | list[str] | None = None, + z_variables: str | list[str] | None = None, +) -> None: + """ + Not working + """ + import feather + + assert receiver is not None and donor is not None + assert matching_variables is not None + + temporary_directory_path = config_files_directory / "tmp" + assert temporary_directory_path.exists() + receiver_path = temporary_directory_path / "receiver.feather" + donor_path = temporary_directory_path / "donor.feather" + feather.write_dataframe(receiver, receiver_path) + feather.write_dataframe(donor, donor_path) + if isinstance(matching_variables, str): + match_vars = f'"{matching_variables}"' + elif len(matching_variables) == 1: + match_vars = f'"{matching_variables[0]}"' + else: + match_vars = '"{}"'.format("todo") + + r_script = f""" +rm(list=ls()) +gc() +devtools::install_github("wesm/feather/R") +library(feather) +library(StatMatch) + +receiver <- read_feather({receiver_path}) +donor <- read_feather({donor_path}) +summary(receiver) +summary(donor) + +# variables +receiver = as.data.frame(receiver) +donor = as.data.frame(donor) +gc() +match_vars = {match_vars} +# don_class = c("sexe") +out.nnd <- NND.hotdeck( + data.rec = receiver, data.don = donor, match.vars = match_vars + ) + +# out.nndsummary(out.nnd$mtc.ids) +# head(out.nnd$mtc.ids, 10) +# head(receiver, 10) + +fused.nnd.m <- create.fused( + data.rec = receiver, data.don = donor, + mtc.ids = out.nnd$mtc.ids, + z.vars = "{z_variables}" + ) +summary(fused.nnd.m) +""" + log.debug("%s", r_script) + + +if __name__ == "__main__": + log.setLevel(logging.INFO) + # Minimal example: pure Python NND hot deck (no R required) + np.random.seed(42) + receiver = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}) + donor = pd.DataFrame({"x": [1.1, 2.2, 2.9], "y": [10.5, 19.0, 31.0], "ident": [100, 200, 300]}) + fused_0, fused_1 = nnd_hotdeck( + receiver=receiver, + donor=donor, + matching_variables=["x", "y"], + z_variables="ident", + random_state=42, + ) + log.info("fused_0 (receiver + z from donor):\n%s", fused_0) + log.info("fused_1 (with _donor dup):\n%s", fused_1) diff --git a/openfisca_survey_manager/scenarios/__init__.py b/openfisca_survey_manager/policy/py.typed similarity index 100% rename from openfisca_survey_manager/scenarios/__init__.py rename to openfisca_survey_manager/policy/py.typed diff --git a/openfisca_survey_manager/policy/scenarios/__init__.py b/openfisca_survey_manager/policy/scenarios/__init__.py new file mode 100644 index 00000000..aa248f89 --- /dev/null +++ b/openfisca_survey_manager/policy/scenarios/__init__.py @@ -0,0 +1,4 @@ +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.policy.scenarios.reform_scenario import ReformScenario + +__all__ = ["AbstractSurveyScenario", "ReformScenario"] diff --git a/openfisca_survey_manager/scenarios/abstract_scenario.py b/openfisca_survey_manager/policy/scenarios/abstract_scenario.py similarity index 87% rename from openfisca_survey_manager/scenarios/abstract_scenario.py rename to openfisca_survey_manager/policy/scenarios/abstract_scenario.py index 57fe9493..6c5ef32e 100644 --- a/openfisca_survey_manager/scenarios/abstract_scenario.py +++ 
b/openfisca_survey_manager/policy/scenarios/abstract_scenario.py @@ -2,7 +2,7 @@ import logging from pathlib import Path -from typing import Optional, Union +from typing import Any, Optional, Union import numpy as np import pandas as pd @@ -11,10 +11,10 @@ from openfisca_core.tools.simulation_dumper import dump_simulation, restore_simulation from openfisca_core.types import Array, Period, TaxBenefitSystem -from openfisca_survey_manager.calibration import Calibration +from openfisca_survey_manager.core.survey import Survey from openfisca_survey_manager.exceptions import SurveyManagerError -from openfisca_survey_manager.simulations import Simulation -from openfisca_survey_manager.surveys import Survey +from openfisca_survey_manager.policy import Simulation +from openfisca_survey_manager.policy.calibration import Calibration log = logging.getLogger(__name__) @@ -44,11 +44,16 @@ class AbstractSurveyScenario: varying_variable = None weight_variable_by_entity = None - def build_input_data(self, **kwargs): + def build_input_data(self, **kwargs: Any) -> None: """Build input data.""" raise NotImplementedError - def calculate_series(self, variable, period=None, simulation=None): + def calculate_series( + self, + variable: str, + period: Optional[Union[int, str, Period]] = None, + simulation: Optional[str] = None, + ) -> pd.Series: """Compute variable values for period for a given simulation. Args: @@ -65,7 +70,12 @@ def calculate_series(self, variable, period=None, simulation=None): name=variable, ) - def calculate_variable(self, variable, period=None, simulation=None): + def calculate_variable( + self, + variable: str, + period: Optional[Union[int, str, Period]] = None, + simulation: Optional[str] = None, + ) -> Array: """Compute variable values for period for a given simulation. Args: @@ -303,7 +313,7 @@ def compute_marginal_tax_rate( else: target_variable_entity_key = variables[target_variable].entity.key - def cast_to_target_entity(simulation: Simulation): + def cast_to_target_entity(simulation: Simulation) -> np.ndarray: population = simulation.populations[target_variable_entity_key] df = ( pd.DataFrame( @@ -337,21 +347,21 @@ def cast_to_target_entity(simulation: Simulation): def compute_pivot_table( self, - aggfunc="mean", - columns=None, - baseline_simulation=None, - filter_by=None, - index=None, - period=None, - simulation=None, - difference=False, - use_baseline_for_columns=None, - values=None, - missing_variable_default_value=np.nan, - concat_axis=None, - weighted=True, - alternative_weights=None, - ): + aggfunc: str = "mean", + columns: Optional[list[str]] = None, + baseline_simulation: Optional[str] = None, + filter_by: Optional[str] = None, + index: Optional[list[str]] = None, + period: Optional[Union[int, str, Period]] = None, + simulation: Optional[str] = None, + difference: bool = False, + use_baseline_for_columns: Optional[bool] = None, + values: Optional[list[str]] = None, + missing_variable_default_value: Any = np.nan, + concat_axis: Optional[int] = None, + weighted: bool = True, + alternative_weights: Optional[Union[str, int, float, Array]] = None, + ) -> pd.DataFrame: """Compute a pivot table of aggregated values cast along the specified index and columns. 
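        For example, index=["age_group"], columns=["decile"] and values=["income_tax"]
        (illustrative variable names) would yield the mean income tax cross-tabulated
        by age group and decile, since aggfunc defaults to "mean".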
Args: @@ -405,17 +415,17 @@ def compute_pivot_table( def compute_winners_losers( self, - variable, - simulation, - baseline_simulation=None, - filter_by=None, - period=None, - absolute_minimal_detected_variation=0, - relative_minimal_detected_variation=0.01, - observations_threshold=None, - weighted=True, - alternative_weights=None, - ): + variable: str, + simulation: str, + baseline_simulation: Optional[str] = None, + filter_by: Optional[str] = None, + period: Optional[Union[int, str, Period]] = None, + absolute_minimal_detected_variation: float = 0, + relative_minimal_detected_variation: float = 0.01, + observations_threshold: Optional[int] = None, + weighted: bool = True, + alternative_weights: Optional[Union[str, int, float, Array]] = None, + ) -> dict[str, Union[int, float]]: simulation = self.simulations[simulation] if baseline_simulation: baseline_simulation = self.simulations[baseline_simulation] @@ -434,8 +444,15 @@ def compute_winners_losers( ) def create_data_frame_by_entity( - self, variables=None, expressions=None, filter_by=None, index=False, period=None, simulation=None, merge=False - ): + self, + variables: Optional[list[str]] = None, + expressions: Optional[list[str]] = None, + filter_by: Optional[str] = None, + index: bool = False, + period: Optional[Union[int, str, Period]] = None, + simulation: Optional[str] = None, + merge: bool = False, + ) -> Union[pd.DataFrame, dict[str, pd.DataFrame]]: """Create dataframe(s) of computed variables for every entity (optionally merged into a single dataframe). Args: @@ -466,7 +483,11 @@ def create_data_frame_by_entity( merge=merge, ) - def custom_input_data_frame(self, input_data_frame, **kwargs): + def custom_input_data_frame( + self, + input_data_frame: pd.DataFrame, + **kwargs: Any, + ) -> None: """Customize input data frame. Args: @@ -475,7 +496,12 @@ def custom_input_data_frame(self, input_data_frame, **kwargs): """ pass - def dump_data_frame_by_entity(self, variables=None, survey_collection=None, survey_name=None): + def dump_data_frame_by_entity( + self, + variables: Optional[list[str]] = None, + survey_collection: Optional[Any] = None, + survey_name: Optional[str] = None, + ) -> None: assert survey_collection is not None assert survey_name is not None assert variables is not None @@ -486,7 +512,7 @@ def dump_data_frame_by_entity(self, variables=None, survey_collection=None, surv survey_collection.surveys.append(survey) survey_collection.dump(collection="openfisca") - def dump_simulations(self, directory: str): + def dump_simulations(self, directory: str) -> None: """ Dump simulations. 
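        A minimal sketch (directory path is illustrative):

            scenario.dump_simulations("/tmp/simulation_dumps")
            # ...later, restore into a scenario built on the same tax and benefit systems:
            scenario.restore_simulations("/tmp/simulation_dumps")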
@@ -504,7 +530,7 @@ def dump_simulations(self, directory: str): simulation = next(iter(self.simulations.values())) dump_simulation(simulation, directory) - def generate_performance_data(self, output_dir: str): + def generate_performance_data(self, output_dir: str) -> None: if not self.trace: raise SurveyManagerError("Method generate_performance_data cannot be used if trace hasn't been activated.") @@ -517,7 +543,12 @@ def generate_performance_data(self, output_dir: str): simulation.tracer.generate_performance_graph(simulation_dir) simulation.tracer.generate_performance_tables(simulation_dir) - def inflate(self, inflator_by_variable=None, period=None, target_by_variable=None): + def inflate( + self, + inflator_by_variable: Optional[dict[str, float]] = None, + period: Optional[Union[int, str, Period]] = None, + target_by_variable: Optional[dict[str, float]] = None, + ) -> None: assert inflator_by_variable or target_by_variable assert period is not None inflator_by_variable = {} if inflator_by_variable is None else inflator_by_variable @@ -530,14 +561,14 @@ def inflate(self, inflator_by_variable=None, period=None, target_by_variable=Non def init_from_data( self, - calibration_kwargs=None, - inflation_kwargs=None, - rebuild_input_data=False, - rebuild_kwargs=None, - data=None, - memory_config=None, - use_marginal_tax_rate=False, - ): + calibration_kwargs: Optional[dict[str, Any]] = None, + inflation_kwargs: Optional[dict[str, Any]] = None, + rebuild_input_data: bool = False, + rebuild_kwargs: Optional[dict[str, Any]] = None, + data: Optional[dict[str, Any]] = None, + memory_config: Optional[Any] = None, + use_marginal_tax_rate: bool = False, + ) -> None: """Initialise a survey scenario from data. Args: @@ -601,8 +632,14 @@ def init_from_data( self.inflate(**inflation_kwargs) def new_simulation( - self, simulation_name, debug=False, trace=False, data=None, memory_config=None, marginal_tax_rate_only=False - ): + self, + simulation_name: str, + debug: bool = False, + trace: bool = False, + data: Optional[dict[str, Any]] = None, + memory_config: Optional[Any] = None, + marginal_tax_rate_only: bool = False, + ) -> Simulation: tax_benefit_system = self.tax_benefit_systems[simulation_name] assert tax_benefit_system is not None @@ -643,13 +680,13 @@ def new_simulation( return simulation - def memory_usage(self): + def memory_usage(self) -> None: """Log memory usage.""" for simulation_name, simulation in self.simulations.items(): log.info("simulation: %s", simulation_name) simulation.print_memory_usage() - def neutralize_variables(self, tax_benefit_system): + def neutralize_variables(self, tax_benefit_system: TaxBenefitSystem) -> None: """Neutralizes input variables not in the input dataframe and keeps some crucial variables. Args: @@ -668,7 +705,7 @@ def neutralize_variables(self, tax_benefit_system): tax_benefit_system.neutralize_variable(variable_name) - def restore_simulations(self, directory, **kwargs): + def restore_simulations(self, directory: Union[str, Path], **kwargs: Any) -> None: """Restores SurveyScenario's simulations. Args: @@ -690,7 +727,7 @@ def restore_simulations(self, directory, **kwargs): simulation.id_variable_by_entity_key = self.id_variable_by_entity_key self.simulations["unique_simulation"] = simulation - def set_input_data_frame(self, input_data_frame): + def set_input_data_frame(self, input_data_frame: pd.DataFrame) -> None: """Set the input dataframe. 
Args: @@ -713,7 +750,10 @@ def set_tax_benefit_systems(self, tax_benefit_systems: dict[str, TaxBenefitSyste # self.tax_benefit_systems = tax_benefit_systems - def set_weight_variable_by_entity(self, weight_variable_by_entity=None): + def set_weight_variable_by_entity( + self, + weight_variable_by_entity: Optional[dict[str, str]] = None, + ) -> None: if weight_variable_by_entity is not None: self.weight_variable_by_entity = weight_variable_by_entity @@ -721,7 +761,12 @@ def set_weight_variable_by_entity(self, weight_variable_by_entity=None): for simulation in self.simulations.values(): simulation.set_weight_variable_by_entity(self.weight_variable_by_entity) - def summarize_variable(self, variable=None, weighted=False, force_compute=False): + def summarize_variable( + self, + variable: Optional[str] = None, + weighted: bool = False, + force_compute: bool = False, + ) -> None: """Log a summary of a variable including its memory usage for all the simulations. Args: @@ -740,12 +785,20 @@ def summarize_variable(self, variable=None, weighted=False, force_compute=False) for _simulation_name, simulation in self.simulations.items(): simulation.summarize_variable(variable, weighted, force_compute) - def _apply_modification(self, simulation, period): + def _apply_modification( + self, + simulation: Simulation, + period: Union[int, str, Period], + ) -> None: period = periods.period(period) varying_variable = self.varying_variable definition_period = simulation.tax_benefit_system.variables[varying_variable].definition_period - def set_variable(varying_variable, varying_variable_value, period_): + def set_variable( + varying_variable: str, + varying_variable_value: np.ndarray, + period_: Period, + ) -> None: delta = self.variation_factor * varying_variable_value new_variable_value = varying_variable_value + delta simulation.delete_arrays(varying_variable, period_) diff --git a/openfisca_survey_manager/scenarios/reform_scenario.py b/openfisca_survey_manager/policy/scenarios/reform_scenario.py similarity index 82% rename from openfisca_survey_manager/scenarios/reform_scenario.py rename to openfisca_survey_manager/policy/scenarios/reform_scenario.py index 5c4673b2..cd056a56 100644 --- a/openfisca_survey_manager/scenarios/reform_scenario.py +++ b/openfisca_survey_manager/policy/scenarios/reform_scenario.py @@ -1,14 +1,14 @@ """Reform survey scenario definition.""" import logging -from typing import Optional, Union +from typing import Any, Optional, Union import numpy as np import pandas as pd from openfisca_core.types import Array, Period -from openfisca_survey_manager.scenarios.abstract_scenario import AbstractSurveyScenario -from openfisca_survey_manager.simulations import Simulation +from openfisca_survey_manager.policy import Simulation +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario log = logging.getLogger(__name__) @@ -16,7 +16,7 @@ class ReformScenario(AbstractSurveyScenario): """Reform survey scenario.""" - def _get_simulation(self, use_baseline: bool = False): + def _get_simulation(self, use_baseline: bool = False) -> Simulation: """ Get the relevant simulation @@ -32,11 +32,16 @@ def _get_simulation(self, use_baseline: bool = False): assert simulation is not None, f"{simulation_name} does not exist" return simulation - def build_input_data(self, **kwargs): + def build_input_data(self, **kwargs: Any) -> None: """Build input data.""" raise NotImplementedError - def calculate_series(self, variable, period=None, use_baseline=False): + def calculate_series( 
+        self, + variable: str, + period: Optional[Union[int, str, Period]] = None, + use_baseline: bool = False, + ) -> pd.Series: """Compute variable values for period, using the baseline or the reform tax and benefit system. Args: @@ -53,7 +58,12 @@ def calculate_series(self, variable, period=None, use_baseline=False): name=variable, ) - def calculate_variable(self, variable, period=None, use_baseline=False): + def calculate_variable( + self, + variable: str, + period: Optional[Union[int, str, Period]] = None, + use_baseline: bool = False, + ) -> Array: """Compute variable values for period, using the baseline or the reform tax and benefit system. Args: @@ -195,20 +205,20 @@ def compute_marginal_tax_rate( def compute_pivot_table( self, - aggfunc="mean", - columns=None, - difference=False, - filter_by=None, - index=None, - period=None, - use_baseline=False, - use_baseline_for_columns=None, - values=None, - missing_variable_default_value=np.nan, - concat_axis=None, - weighted=True, - alternative_weights=None, - ): + aggfunc: str = "mean", + columns: Optional[list[str]] = None, + difference: bool = False, + filter_by: Optional[str] = None, + index: Optional[list[str]] = None, + period: Optional[Union[int, str, Period]] = None, + use_baseline: bool = False, + use_baseline_for_columns: Optional[bool] = None, + values: Optional[list[str]] = None, + missing_variable_default_value: Any = np.nan, + concat_axis: Optional[int] = None, + weighted: bool = True, + alternative_weights: Optional[Union[str, int, float, Array]] = None, + ) -> pd.DataFrame: filtering_variable_by_entity = self.filtering_variable_by_entity return Simulation.compute_pivot_table( @@ -231,15 +241,15 @@ def compute_pivot_table( def compute_winners_losers( self, - variable=None, - filter_by=None, - period=None, - absolute_minimal_detected_variation=0, - relative_minimal_detected_variation=0.01, - observations_threshold=None, - weighted=True, - alternative_weights=None, - ): + variable: Optional[str] = None, + filter_by: Optional[str] = None, + period: Optional[Union[int, str, Period]] = None, + absolute_minimal_detected_variation: float = 0, + relative_minimal_detected_variation: float = 0.01, + observations_threshold: Optional[int] = None, + weighted: bool = True, + alternative_weights: Optional[Union[str, int, float, Array]] = None, + ) -> dict[str, Union[int, float]]: return super().compute_winners_losers( simulation="reform", baseline_simulation="baseline", @@ -259,14 +269,14 @@ def compute_winners_losers( def create_data_frame_by_entity( self, - variables=None, - expressions=None, - filter_by=None, - index=False, - period=None, - use_baseline=False, - merge=False, - ): + variables: Optional[list[str]] = None, + expressions: Optional[list[str]] = None, + filter_by: Optional[str] = None, + index: bool = False, + period: Optional[Union[int, str, Period]] = None, + use_baseline: bool = False, + merge: bool = False, + ) -> Union[pd.DataFrame, dict[str, pd.DataFrame]]: """Create dataframe(s) of computed variables for every entity (optionally merged into a single dataframe). 
Args: diff --git a/openfisca_survey_manager/simulation_builder.py b/openfisca_survey_manager/policy/simulation_builder.py similarity index 90% rename from openfisca_survey_manager/simulation_builder.py rename to openfisca_survey_manager/policy/simulation_builder.py index 1feaad1f..9b9594fb 100644 --- a/openfisca_survey_manager/simulation_builder.py +++ b/openfisca_survey_manager/policy/simulation_builder.py @@ -1,7 +1,14 @@ +"""Simulation builder extensions for survey manager.""" + +from __future__ import annotations + import logging +from typing import Any, Optional +import pandas as pd from openfisca_core.model_api import MONTH, YEAR from openfisca_core.simulations.simulation_builder import SimulationBuilder +from openfisca_core.types import TaxBenefitSystem from openfisca_survey_manager.exceptions import SurveyManagerError @@ -17,7 +24,10 @@ # Helpers -def diagnose_variable_mismatch(used_as_input_variables, input_data_frame): +def diagnose_variable_mismatch( + used_as_input_variables: Optional[list[str]], + input_data_frame: pd.DataFrame, +) -> None: """Diagnose variable mismatch. Args: @@ -41,7 +51,7 @@ def diagnose_variable_mismatch(used_as_input_variables, input_data_frame): # SimulationBuilder monkey-patched methods -def _set_id_variable_by_entity_key(builder) -> dict[str, str]: +def _set_id_variable_by_entity_key(builder: SimulationBuilder) -> dict[str, str]: """Identify and set the correct ids for the different entities.""" if builder.id_variable_by_entity_key is None: log.debug("Use default id_variable names") @@ -52,7 +62,7 @@ def _set_id_variable_by_entity_key(builder) -> dict[str, str]: return builder.id_variable_by_entity_key -def _set_role_variable_by_entity_key(builder) -> dict[str, str]: +def _set_role_variable_by_entity_key(builder: SimulationBuilder) -> dict[str, str]: """Identify and set the correct roles for the different entities.""" if builder.role_variable_by_entity_key is None: builder.role_variable_by_entity_key = { @@ -62,7 +72,7 @@ def _set_role_variable_by_entity_key(builder) -> dict[str, str]: return builder.role_variable_by_entity_key -def _set_used_as_input_variables_by_entity(builder) -> dict[str, list[str]]: +def _set_used_as_input_variables_by_entity(builder: SimulationBuilder) -> Optional[dict[str, list[str]]]: """Identify and set the correct input variables for the different entities.""" if builder.used_as_input_variables_by_entity is not None: return @@ -88,7 +98,7 @@ def _set_used_as_input_variables_by_entity(builder) -> dict[str, list[str]]: return builder.used_as_input_variables_by_entity -def filter_input_variables(builder, input_data_frame, tax_benefit_system): +def filter_input_variables( + builder: SimulationBuilder, + input_data_frame: pd.DataFrame, + tax_benefit_system: TaxBenefitSystem, +) -> pd.DataFrame: """Filter out of the input data frame the variables that won't be used or that are set to be computed. 
Args: @@ -152,7 +166,11 @@ def filter_input_variables(builder, input_data_frame, tax_benefit_system): return input_data_frame -def init_all_entities(builder, input_data_frame, period=None): +def init_all_entities( + builder: SimulationBuilder, + input_data_frame: pd.DataFrame, + period: Any = None, +) -> Any: assert period is not None log.debug(f"Initialising simulation using input_data_frame for period {period}") builder._set_id_variable_by_entity_key() @@ -182,7 +200,11 @@ def init_all_entities(builder, input_data_frame, period=None): return simulation -def init_entity_structure(builder, entity, input_data_frame): +def init_entity_structure( + builder: SimulationBuilder, + entity: Any, + input_data_frame: pd.DataFrame, +) -> None: """Initializes the simulation with tax_benefit_system entities and input_data_frame. Args: @@ -225,7 +247,11 @@ def init_entity_structure(builder, entity, input_data_frame): ) -def init_simulation_with_data_frame(builder, input_data_frame, period): +def init_simulation_with_data_frame( + builder: SimulationBuilder, + input_data_frame: pd.DataFrame, + period: Any, +) -> Any: """Initialize the simulation period with current input_data_frame for an entity if specified.""" used_as_input_variables = builder.used_as_input_variables id_variable_by_entity_key = builder.id_variable_by_entity_key diff --git a/openfisca_survey_manager/simulations.py b/openfisca_survey_manager/policy/simulations.py similarity index 97% rename from openfisca_survey_manager/simulations.py rename to openfisca_survey_manager/policy/simulations.py index 090ce1b8..fb9f3ffc 100644 --- a/openfisca_survey_manager/simulations.py +++ b/openfisca_survey_manager/policy/simulations.py @@ -17,11 +17,14 @@ from openfisca_core.types import Array, Period, TaxBenefitSystem from openfisca_core.types import CoreEntity as Entity +from openfisca_survey_manager.core.dataset import SurveyCollection, load_table from openfisca_survey_manager.exceptions import SurveyManagerError -from openfisca_survey_manager.simulation_builder import SimulationBuilder, diagnose_variable_mismatch -from openfisca_survey_manager.statshelpers import mark_weighted_percentiles -from openfisca_survey_manager.survey_collections import SurveyCollection -from openfisca_survey_manager.utils import do_nothing, load_table +from openfisca_survey_manager.policy.legislation_asof import do_nothing +from openfisca_survey_manager.policy.simulation_builder import ( + SimulationBuilder, + diagnose_variable_mismatch, +) +from openfisca_survey_manager.policy.statshelpers import mark_weighted_percentiles log = logging.getLogger(__name__) @@ -29,7 +32,10 @@ # Helpers -def assert_variables_in_same_entity(tax_benefit_system: TaxBenefitSystem, variables: list): +def assert_variables_in_same_entity( + tax_benefit_system: TaxBenefitSystem, + variables: list[str], +) -> str: """ Assert that variables are in the same entity. @@ -52,7 +58,7 @@ def assert_variables_in_same_entity(tax_benefit_system: TaxBenefitSystem, variab return entity.key -def get_words(text: str): +def get_words(text: str) -> list[str]: return re.compile("[A-Za-z_]+[A-Za-z0-9_]*").findall(text) @@ -113,7 +119,7 @@ def compute_aggregate( missing_variable_default_value: Any = np.nan, weighted: bool = True, alternative_weights: Optional[Union[str, int, float, Array]] = None, - filtering_variable_by_entity: Optional[dict] = None, + filtering_variable_by_entity: Optional[dict[str, str]] = None, ) -> Optional[Union[float, int]]: """ Compute aggregate of a variable. 
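    Weighting uses the entity's weight variable unless weighted=False or
    alternative_weights is given; when neither weights nor alternatives are
    available the computation falls back to unweighted values.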
@@ -244,7 +250,7 @@ def compute_quantiles( filter_by: Optional[str] = None, weighted: bool = True, alternative_weights: Optional[Union[str, int, float, Array]] = None, - filtering_variable_by_entity: Optional[dict] = None, + filtering_variable_by_entity: Optional[dict[str, str]] = None, ) -> list[float]: """ Compute quantiles of a variable. @@ -296,8 +302,8 @@ def compute_quantiles( def compute_pivot_table( - simulation: Simulation = None, - baseline_simulation: Simulation = None, + simulation: Optional[Simulation] = None, + baseline_simulation: Optional[Simulation] = None, aggfunc: str = "mean", columns: Optional[list[str]] = None, difference: bool = False, @@ -310,8 +316,8 @@ def compute_pivot_table( concat_axis: Optional[int] = None, weighted: bool = True, alternative_weights: Optional[Union[str, int, float, Array]] = None, - filtering_variable_by_entity: Optional[dict] = None, -): + filtering_variable_by_entity: Optional[dict[str, str]] = None, +) -> pd.DataFrame: """ Compute pivot table. @@ -399,7 +405,7 @@ def compute_pivot_table( variables.add(weight_variable) else: - log.warn( + log.warning( f"There is no weight variable for entity {entity_key} nor alternative weights. " "Switch to unweighted" ) @@ -687,7 +693,7 @@ def compute_winners_losers( observations_threshold: Optional[int] = None, weighted: bool = True, alternative_weights: Optional[Union[str, int, float, Array]] = None, - filtering_variable_by_entity: Optional[dict] = None, + filtering_variable_by_entity: Optional[dict[str, str]] = None, ) -> dict[str, Union[int, float]]: """ Compute the number of winners and losers for a given variable. @@ -744,7 +750,7 @@ def compute_winners_losers( weight_variable = weight_variable_by_entity[entity_key] weight = baseline_simulation.calculate(weight_variable, period=period) else: - log.warn( + log.warning( f"There is no weight variable for entity {entity_key} nor alternative weights. Switch to unweighted" ) @@ -803,8 +809,8 @@ def init_entity_data( entity: Entity, filtered_input_data_frame: pd.DataFrame, period: Period, - used_as_input_variables_by_entity: dict, -): + used_as_input_variables_by_entity: dict[str, list[str]], +) -> None: """ Initialize entity in simulation at some period with input provided by a dataframe. 
@@ -837,9 +843,9 @@ def init_entity_data( def inflate( simulation: Simulation, - inflator_by_variable: Optional[dict] = None, + inflator_by_variable: Optional[dict[str, float]] = None, period: Optional[Union[int, str, Period]] = None, - target_by_variable: Optional[dict] = None, + target_by_variable: Optional[dict[str, float]] = None, ) -> None: tax_benefit_system = simulation.tax_benefit_system for variable_name in set(inflator_by_variable.keys()).union(set(target_by_variable.keys())): @@ -874,12 +880,12 @@ def inflate( def _load_table_for_survey( config_files_directory: str, collection: str, - survey: str, - table: str, + survey: Optional[str] = None, + table: Optional[str] = None, batch_size: Optional[int] = None, batch_index: Optional[int] = None, filter_by: Optional[str] = None, -): +) -> pd.DataFrame: if survey is not None: input_data_frame = load_table( config_files_directory=config_files_directory, @@ -905,15 +911,15 @@ def _load_table_for_survey( def _input_data_table_by_entity_by_period_monolithic( tax_benefit_system: TaxBenefitSystem, - simulation: Simulation, + simulation: Optional[Simulation], period: Period, - input_data_table_by_entity: dict, + input_data_table_by_entity: dict[str, Any], builder: SimulationBuilder, - custom_input_data_frame: Callable, + custom_input_data_frame: Callable[..., Any], config_files_directory: str, collection: str, survey: Optional[str] = None, -): +) -> Simulation: """ Initialize simulation with input data from a table for each entity and period. """ @@ -961,15 +967,15 @@ def _input_data_table_by_entity_by_period_monolithic( def _input_data_table_by_entity_by_period_batch( tax_benefit_system: TaxBenefitSystem, - simulation: Simulation, + simulation: Optional[Simulation], period: Period, - input_data_table_by_entity: dict, + input_data_table_by_entity: dict[str, Any], builder: SimulationBuilder, - custom_input_data_frame: Callable, + custom_input_data_frame: Callable[..., Any], config_files_directory: str, collection: str, survey: Optional[str] = None, -): +) -> Simulation: """ Initialize simulation with input data from a table for each entity and period. """ @@ -1047,8 +1053,8 @@ def _input_data_table_by_entity_by_period_batch( def init_simulation( tax_benefit_system: TaxBenefitSystem, period: Union[str, int, Period], - data: dict, -): + data: dict[str, Any], +) -> Simulation: builder = SimulationBuilder() builder.create_entities(tax_benefit_system) @@ -1186,7 +1192,7 @@ def init_variable_in_entity( variable_name: str, series: pd.Series, period: Period, -): +) -> None: variable = simulation.tax_benefit_system.variables[variable_name] # np.issubdtype cannot handle categorical variables @@ -1243,7 +1249,7 @@ def init_variable_in_entity( if variable.definition_period == YEAR and period.unit == MONTH: # Some variables defined for a year are present in month/quarter dataframes # Cleaning the dataframe would probably be better in the long run - log.warn( + log.warning( f"Trying to set a monthly value for variable {variable_name}, which is defined on a year. " "The monthly values you provided will be summed."
) @@ -1263,7 +1269,7 @@ def new_from_tax_benefit_system( debug: bool = False, trace: bool = False, data: Optional[dict] = None, - memory_config: MemoryConfig = None, + memory_config: Optional[MemoryConfig] = None, period: Optional[Union[int, str, Period]] = None, custom_initialize: Optional[Callable] = None, ) -> Simulation: @@ -1295,7 +1301,7 @@ def new_from_tax_benefit_system( return simulation -def print_memory_usage(simulation: Simulation): +def print_memory_usage(simulation: Simulation) -> None: """ Print memory usage. @@ -1337,7 +1343,7 @@ def print_memory_usage(simulation: Simulation): def set_weight_variable_by_entity( simulation: Simulation, - weight_variable_by_entity: dict, + weight_variable_by_entity: Optional[dict[str, str]], ) -> None: """ Set weight variable for each entity. @@ -1354,7 +1360,7 @@ def summarize_variable( variable: Optional[str] = None, weighted: bool = False, force_compute: bool = False, -): +) -> None: """Print a summary of a variable including its memory usage. Args: @@ -1428,7 +1434,7 @@ def summarize_variable( ) df = pd.DataFrame({variable: array}).replace(categories_by_index).astype(categories_type) df["weights"] = weights if weighted else 1 - groupby = df.groupby(variable)["weights"].sum() + groupby = df.groupby(variable, observed=False)["weights"].sum() total = groupby.sum() expr = [f" {index} = {row:.2e} ({row / total:.1%})" for index, row in groupby.items()] log.info("%s: %s.", period, ",".join(expr)) diff --git a/openfisca_survey_manager/statshelpers.py b/openfisca_survey_manager/policy/statshelpers.py similarity index 85% rename from openfisca_survey_manager/statshelpers.py rename to openfisca_survey_manager/policy/statshelpers.py index ade97abb..a55e2464 100644 --- a/openfisca_survey_manager/statshelpers.py +++ b/openfisca_survey_manager/policy/statshelpers.py @@ -1,3 +1,10 @@ +"""Statistical helpers (Gini, Lorenz, weighted percentiles, etc.).""" + +from __future__ import annotations + +import logging +from typing import Optional + import numpy as np import pandas as pd import weightedcalcs as wc @@ -5,8 +12,13 @@ from numpy import argsort, asarray, cumsum, linspace, ones, repeat, zeros from numpy import logical_and as and_ +log = logging.getLogger(__name__) + -def gini(values, weights=None): +def gini( + values: np.ndarray | pd.Series, + weights: Optional[np.ndarray | pd.Series] = None, +) -> float: """Computes Gini coefficient (normalized to 1). 
# Using fastgini formula: # i=N j=i @@ -43,7 +55,11 @@ def gini(values, weights=None): return gini -def kakwani(values, ineq_axis, weights=None): +def kakwani( + values: np.ndarray | pd.Series, + ineq_axis: np.ndarray | pd.Series, + weights: Optional[np.ndarray | pd.Series] = None, +) -> float: """Computes the Kakwani index Args: @@ -67,7 +83,10 @@ def kakwani(values, ineq_axis, weights=None): return simps((lcy - plcy), lcx) -def lorenz(values, weights=None): +def lorenz( + values: np.ndarray | pd.Series, + weights: Optional[np.ndarray | pd.Series] = None, +) -> tuple[np.ndarray, np.ndarray]: """Computes Lorenz curve coordinates (x, y) Args: @@ -90,7 +109,13 @@ def lorenz(values, weights=None): return x, y -def mark_weighted_percentiles(a, labels, weights, method, return_quantiles=False): +def mark_weighted_percentiles( + a: np.ndarray | pd.Series, + labels: np.ndarray | list, + weights: np.ndarray | pd.Series, + method: int, + return_quantiles: bool = False, +) -> np.ndarray | tuple[np.ndarray, list[float]]: """ Args: @@ -253,7 +278,11 @@ def mark_weighted_percentiles(a, labels, weights, method, return_quantiles=False return ret -def pseudo_lorenz(values, ineq_axis, weights=None): +def pseudo_lorenz( + values: np.ndarray | pd.Series, + ineq_axis: np.ndarray | pd.Series, + weights: Optional[np.ndarray | pd.Series] = None, +) -> tuple[np.ndarray, np.ndarray]: """Computes the pseudo Lorenz curve coordinates Args: @@ -276,7 +305,11 @@ def pseudo_lorenz(values, ineq_axis, weights=None): return x, y -def bottom_share(values, rank_from_bottom, weights=None): +def bottom_share( + values: np.ndarray | pd.Series, + rank_from_bottom: float, + weights: Optional[np.ndarray | pd.Series] = None, +) -> float: """ Args: @@ -304,7 +337,11 @@ def bottom_share(values, rank_from_bottom, weights=None): ).sum() -def top_share(values, rank_from_top, weights=None): +def top_share( + values: np.ndarray | pd.Series, + rank_from_top: float, + weights: Optional[np.ndarray | pd.Series] = None, +) -> float: """ Args: @@ -331,7 +368,12 @@ def top_share(values, rank_from_top, weights=None): ).sum() -def weighted_quantiles(data, labels, weights, return_quantiles=False): +def weighted_quantiles( + data: np.ndarray | pd.Series, + labels: np.ndarray | list, + weights: np.ndarray | pd.Series, + return_quantiles: bool = False, +) -> np.ndarray | tuple[np.ndarray, list[float]]: num_categories = len(labels) breaks = linspace(0, 1, num_categories + 1) quantiles = [wquantiles.quantile_1D(data, weights, mybreak) for mybreak in breaks[1:]] @@ -347,7 +389,12 @@ def weighted_quantiles(data, labels, weights, return_quantiles=False): return ret + 1 -def weightedcalcs_quantiles(data, labels, weights, return_quantiles=False): +def weightedcalcs_quantiles( + data: np.ndarray | pd.Series, + labels: np.ndarray | list, + weights: np.ndarray | pd.Series, + return_quantiles: bool = False, +) -> np.ndarray | tuple[np.ndarray, list[float]]: calc = wc.Calculator("weights") num_categories = len(labels) breaks = linspace(0, 1, num_categories + 1) diff --git a/openfisca_survey_manager/policy/tests/__init__.py b/openfisca_survey_manager/policy/tests/__init__.py new file mode 100644 index 00000000..f9d9e925 --- /dev/null +++ b/openfisca_survey_manager/policy/tests/__init__.py @@ -0,0 +1 @@ +# Tests for policy package (simulations, simulation_builder, aggregates).
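To make the statshelpers move concrete, a small smoke sketch of the new import path and the `np.ndarray | pd.Series` signatures typed above; the data is a toy array, assuming the package is installed:

```python
import numpy as np

from openfisca_survey_manager.policy.statshelpers import gini, lorenz

values = np.array([10.0, 20.0, 30.0, 40.0])
weights = np.array([1.0, 1.0, 2.0, 1.0])

gini_coefficient = gini(values, weights)  # -> float per the new annotation
x, y = lorenz(values, weights)            # -> tuple[np.ndarray, np.ndarray]
```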
diff --git a/openfisca_survey_manager/tests/test_aggregates.py b/openfisca_survey_manager/policy/tests/test_aggregates.py similarity index 96% rename from openfisca_survey_manager/tests/test_aggregates.py rename to openfisca_survey_manager/policy/tests/test_aggregates.py index 38c461d4..7ec83109 100644 --- a/openfisca_survey_manager/tests/test_aggregates.py +++ b/openfisca_survey_manager/policy/tests/test_aggregates.py @@ -1,7 +1,7 @@ import pytest from openfisca_country_template.reforms.modify_social_security_taxation import modify_social_security_taxation -from openfisca_survey_manager.aggregates import AbstractAggregates +from openfisca_survey_manager.policy import AbstractAggregates from openfisca_survey_manager.tests.test_scenario import create_randomly_initialized_survey_scenario diff --git a/openfisca_survey_manager/tests/test_compute_aggregate.py b/openfisca_survey_manager/policy/tests/test_compute_aggregate.py similarity index 100% rename from openfisca_survey_manager/tests/test_compute_aggregate.py rename to openfisca_survey_manager/policy/tests/test_compute_aggregate.py diff --git a/openfisca_survey_manager/tests/test_compute_pivot_table.py b/openfisca_survey_manager/policy/tests/test_compute_pivot_table.py similarity index 89% rename from openfisca_survey_manager/tests/test_compute_pivot_table.py rename to openfisca_survey_manager/policy/tests/test_compute_pivot_table.py index 75706e21..4c20ff94 100644 --- a/openfisca_survey_manager/tests/test_compute_pivot_table.py +++ b/openfisca_survey_manager/policy/tests/test_compute_pivot_table.py @@ -7,7 +7,7 @@ def test_compute_pivot_table(): survey_scenario = create_randomly_initialized_survey_scenario(reform=modify_social_security_taxation) period = "2017-01" - return survey_scenario.compute_pivot_table( + pivot_table = survey_scenario.compute_pivot_table( aggfunc="mean", columns=["age"], difference=False, @@ -22,3 +22,4 @@ def test_compute_pivot_table(): weighted=True, alternative_weights=None, ) + assert pivot_table is not None diff --git a/openfisca_survey_manager/tests/test_compute_winners_losers.py b/openfisca_survey_manager/policy/tests/test_compute_winners_losers.py similarity index 98% rename from openfisca_survey_manager/tests/test_compute_winners_losers.py rename to openfisca_survey_manager/policy/tests/test_compute_winners_losers.py index 3e50d93a..63fd7867 100644 --- a/openfisca_survey_manager/tests/test_compute_winners_losers.py +++ b/openfisca_survey_manager/policy/tests/test_compute_winners_losers.py @@ -1,7 +1,7 @@ import pytest from openfisca_country_template.reforms.modify_social_security_taxation import modify_social_security_taxation -from openfisca_survey_manager.simulations import SecretViolationError +from openfisca_survey_manager.policy import SecretViolationError from openfisca_survey_manager.tests.test_scenario import create_randomly_initialized_survey_scenario diff --git a/openfisca_survey_manager/tests/test_create_data_frame_by_entity.py b/openfisca_survey_manager/policy/tests/test_create_data_frame_by_entity.py similarity index 100% rename from openfisca_survey_manager/tests/test_create_data_frame_by_entity.py rename to openfisca_survey_manager/policy/tests/test_create_data_frame_by_entity.py diff --git a/openfisca_survey_manager/tests/test_marginal_tax_rate.py b/openfisca_survey_manager/policy/tests/test_marginal_tax_rate.py similarity index 100% rename from openfisca_survey_manager/tests/test_marginal_tax_rate.py rename to openfisca_survey_manager/policy/tests/test_marginal_tax_rate.py diff --git 
a/openfisca_survey_manager/tests/test_summarize_variables.py b/openfisca_survey_manager/policy/tests/test_summarize_variables.py similarity index 95% rename from openfisca_survey_manager/tests/test_summarize_variables.py rename to openfisca_survey_manager/policy/tests/test_summarize_variables.py index 5e7c2ebb..4b62f548 100644 --- a/openfisca_survey_manager/tests/test_summarize_variables.py +++ b/openfisca_survey_manager/policy/tests/test_summarize_variables.py @@ -19,7 +19,7 @@ def test_summarize_variable_log_output(caplog): The doctest used to check stdout; we now send that output to the logging system. This test captures logs and verifies the expected content is present. """ - with caplog.at_level(logging.INFO, logger="openfisca_survey_manager.simulations"): + with caplog.at_level(logging.INFO, logger="openfisca_survey_manager.policy.simulations"): survey_scenario = create_randomly_initialized_survey_scenario(collection=None) survey_scenario.summarize_variable(variable="housing_occupancy_status", force_compute=True) @@ -32,7 +32,7 @@ def test_summarize_variable_log_output(caplog): assert "owner" in text or "tenant" in text or "free_lodger" in text or "homeless" in text caplog.clear() - with caplog.at_level(logging.INFO, logger="openfisca_survey_manager.simulations"): + with caplog.at_level(logging.INFO, logger="openfisca_survey_manager.policy.simulations"): survey_scenario.summarize_variable(variable="rent", force_compute=True) messages = [r.message for r in caplog.records] @@ -43,7 +43,7 @@ def test_summarize_variable_log_output(caplog): survey_scenario.tax_benefit_systems["baseline"].neutralize_variable("age") caplog.clear() - with caplog.at_level(logging.INFO, logger="openfisca_survey_manager.simulations"): + with caplog.at_level(logging.INFO, logger="openfisca_survey_manager.policy.simulations"): survey_scenario.summarize_variable(variable="age") messages = [r.message for r in caplog.records] diff --git a/openfisca_survey_manager/variables.py b/openfisca_survey_manager/policy/variables.py similarity index 78% rename from openfisca_survey_manager/variables.py rename to openfisca_survey_manager/policy/variables.py index f91cbcc3..fc6f8fc1 100644 --- a/openfisca_survey_manager/variables.py +++ b/openfisca_survey_manager/policy/variables.py @@ -1,14 +1,24 @@ +"""Policy variables helpers (quantile formulas).""" + +from __future__ import annotations + import logging +from typing import Any, Callable, Optional from numpy import arange from openfisca_core.model_api import ADD, YEAR, Variable, where -from openfisca_survey_manager.statshelpers import mark_weighted_percentiles, weightedcalcs_quantiles +from openfisca_survey_manager.policy.statshelpers import mark_weighted_percentiles, weightedcalcs_quantiles log = logging.getLogger(__name__) -def create_quantile(x, nquantiles, weight_variable, entity_name): +def create_quantile( + x: str, + nquantiles: int, + weight_variable: str, + entity_name: Any, +) -> type[Variable]: class quantile(Variable): value_type = int entity = entity_name @@ -35,7 +45,12 @@ def formula(entity, period): return quantile -def quantile(q, variable, weight_variable=None, filter_variable=None): +def quantile( + q: int, + variable: str, + weight_variable: Optional[str] = None, + filter_variable: Optional[str] = None, +) -> Callable[..., Any]: """ Return the quantile of a variable, weighted by a specific weight variable and potentially filtered. """ @@ -63,7 +78,12 @@ def formula(entity, period): return formula -def old_quantile(q, variable, weight_variable=None,
filter_variable=None): +def old_quantile( + q: int, + variable: str, + weight_variable: Optional[str] = None, + filter_variable: Optional[str] = None, +) -> Callable[..., Any]: def formula(entity, period): value = entity(variable, period) if weight_variable is not None: diff --git a/openfisca_survey_manager/processing/__init__.py b/openfisca_survey_manager/processing/__init__.py index ce2da410..b6e95546 100644 --- a/openfisca_survey_manager/processing/__init__.py +++ b/openfisca_survey_manager/processing/__init__.py @@ -2,6 +2,23 @@ # See docs/REFACTORING_PLAN.md for migration steps. from openfisca_survey_manager.processing.cleaning import clean_data_frame -from openfisca_survey_manager.processing.weights import Calibration, calmar, check_calmar +from openfisca_survey_manager.processing.harmonization import harmonize_data_frame_columns -__all__ = ["Calibration", "calmar", "check_calmar", "clean_data_frame"] + +# Lazy import to avoid circular dependency (processing -> policy -> survey_collections -> core) +def __getattr__(name: str) -> object: + if name in ("Calibration", "calmar", "check_calmar"): + from openfisca_survey_manager.policy.calibration import Calibration + from openfisca_survey_manager.policy.calmar import calmar, check_calmar + + return {"Calibration": Calibration, "calmar": calmar, "check_calmar": check_calmar}[name] + raise AttributeError(f"module {__name__!r} has no attribute {name!r}") + + +__all__ = [ + "Calibration", + "calmar", + "check_calmar", + "clean_data_frame", + "harmonize_data_frame_columns", +] diff --git a/openfisca_survey_manager/processing/cleaning.py b/openfisca_survey_manager/processing/cleaning.py index 6a0b81fa..5aa88b7e 100644 --- a/openfisca_survey_manager/processing/cleaning.py +++ b/openfisca_survey_manager/processing/cleaning.py @@ -1,5 +1,7 @@ """Data frame cleaning (column normalization, empty handling).""" +from __future__ import annotations + import logging import pandas as pd diff --git a/openfisca_survey_manager/processing/harmonization.py b/openfisca_survey_manager/processing/harmonization.py new file mode 100644 index 00000000..4d30c6b4 --- /dev/null +++ b/openfisca_survey_manager/processing/harmonization.py @@ -0,0 +1,36 @@ +"""Column harmonization for survey data (lowercase, ident renaming).""" + +from __future__ import annotations + +import logging +import re + +import pandas as pd + +log = logging.getLogger(__name__) + +# Column names matching this pattern (e.g. ident01, ident2019) are renamed to "ident" +IDENT_COLUMN_PATTERN = re.compile(r"(?i)ident\d{2,4}$") + + +def harmonize_data_frame_columns( + data_frame: pd.DataFrame, + *, + lowercase: bool = False, + rename_ident: bool = True, +) -> None: + """Harmonize column names in place. + + - If lowercase: rename all columns to lowercase. + - If rename_ident: rename the first column matching ident pattern (e.g. ident01, ident2019) to "ident". 
+ """ + if lowercase: + columns = {col: col.lower() for col in data_frame.columns} + data_frame.rename(columns=columns, inplace=True) + + if rename_ident: + for column_name in data_frame.columns: + if IDENT_COLUMN_PATTERN.match(str(column_name)) is not None: + data_frame.rename(columns={column_name: "ident"}, inplace=True) + log.info("%s column have been replaced by ident", column_name) + break diff --git a/openfisca_survey_manager/processing/weights/__init__.py b/openfisca_survey_manager/processing/weights/__init__.py deleted file mode 100644 index 66b4484b..00000000 --- a/openfisca_survey_manager/processing/weights/__init__.py +++ /dev/null @@ -1,6 +0,0 @@ -# Calibration and CALMAR weight calibration. See docs/REFACTORING_PLAN.md. - -from openfisca_survey_manager.processing.weights.calibration import Calibration -from openfisca_survey_manager.processing.weights.calmar import calmar, check_calmar - -__all__ = ["Calibration", "calmar", "check_calmar"] diff --git a/openfisca_survey_manager/read_dbf.py b/openfisca_survey_manager/read_dbf.py deleted file mode 100644 index 3ff17ae2..00000000 --- a/openfisca_survey_manager/read_dbf.py +++ /dev/null @@ -1,5 +0,0 @@ -"""Re-export for backward compatibility. Prefer: from openfisca_survey_manager.io.readers import read_dbf.""" - -from openfisca_survey_manager.io.readers import read_dbf - -__all__ = ["read_dbf"] diff --git a/openfisca_survey_manager/read_sas.py b/openfisca_survey_manager/read_sas.py deleted file mode 100644 index 58168894..00000000 --- a/openfisca_survey_manager/read_sas.py +++ /dev/null @@ -1,5 +0,0 @@ -"""Re-export for backward compatibility. Prefer: from openfisca_survey_manager.io.readers import read_sas.""" - -from openfisca_survey_manager.io.readers import read_sas - -__all__ = ["read_sas"] diff --git a/openfisca_survey_manager/read_spss.py b/openfisca_survey_manager/read_spss.py deleted file mode 100644 index ee4da5cb..00000000 --- a/openfisca_survey_manager/read_spss.py +++ /dev/null @@ -1,5 +0,0 @@ -"""Re-export for backward compatibility. 
Prefer: from openfisca_survey_manager.io.readers import read_spss.""" - -from openfisca_survey_manager.io.readers import read_spss - -__all__ = ["read_spss"] diff --git a/openfisca_survey_manager/scripts/build_collection.py b/openfisca_survey_manager/scripts/build_collection.py index 4aa6472b..51b368e0 100755 --- a/openfisca_survey_manager/scripts/build_collection.py +++ b/openfisca_survey_manager/scripts/build_collection.py @@ -13,9 +13,12 @@ import sys from pathlib import Path -from openfisca_survey_manager.paths import default_config_files_directory, openfisca_survey_manager_location -from openfisca_survey_manager.survey_collections import SurveyCollection -from openfisca_survey_manager.surveys import Survey +from openfisca_survey_manager.configuration.paths import ( + default_config_files_directory, + openfisca_survey_manager_location, +) +from openfisca_survey_manager.core.dataset import SurveyCollection +from openfisca_survey_manager.core.survey import Survey app_name = Path(__file__).stem log = logging.getLogger(app_name) @@ -240,9 +243,14 @@ def main(): "--parquet", action="store_true", default=False, + help="save data in parquet format (directory with one .parquet file per table)", + ) + parser.add_argument( + "--zarr", + action="store_true", + default=False, help=( - "save data in parquet format instead of HDF5 " - "(HDF5 will no longer be the default format in a future version)" + "save data in zarr format (one zarr group per table); requires: pip install openfisca-survey-manager[zarr]" ), ) parser.add_argument( @@ -280,22 +288,16 @@ def main(): start_time = datetime.datetime.now() - # Determine store format based on argument - store_format = "parquet" if args.parquet else "hdf5" - - # Deprecation warning for HDF5 format - if not args.parquet: - import warnings - - warnings.warn( - "HDF5 will no longer be the default format in a future version. " - "Please use --parquet option to save data in parquet format.", - DeprecationWarning, - stacklevel=2, - ) + # Determine store format based on argument (--zarr > --parquet > default hdf5) + if args.zarr: + store_format = "zarr" + elif args.parquet: + store_format = "parquet" + else: + store_format = "hdf5" log.warning( "HDF5 will no longer be the default format in a future version. " - "Please use --parquet option to save data in parquet format." + "Please use --parquet or --zarr to save data in parquet or zarr format." ) try: diff --git a/openfisca_survey_manager/scripts/migrate_config_to_rfc002.py b/openfisca_survey_manager/scripts/migrate_config_to_rfc002.py new file mode 100644 index 00000000..acd42d1f --- /dev/null +++ b/openfisca_survey_manager/scripts/migrate_config_to_rfc002.py @@ -0,0 +1,252 @@ +#!/usr/bin/env python +""" +Migrate existing config (config.ini + raw_data.ini + JSON collections) to RFC-002 layout. 
+ +Produces: + - config.yaml (collections_dir, default_output_dir, tmp_dir) + - collections_dir//manifest.yaml per collection + +Usage: + python -m openfisca_survey_manager.scripts.migrate_config_to_rfc002 [--config-dir PATH] [--dry-run] +""" + +from __future__ import annotations + +import argparse +import configparser +import json +import logging +import sys +from pathlib import Path + +import yaml + +# Allow running as __main__ or as script +try: + from openfisca_survey_manager.configuration.config_loader import ( + CONFIG_FILENAME, + MANIFEST_FILENAME, + ) +except ImportError: + CONFIG_FILENAME = "config.yaml" + MANIFEST_FILENAME = "manifest.yaml" + +log = logging.getLogger(__name__) + +SOURCE_FORMAT_KEYS = ("csv_files", "sas_files", "stata_files", "parquet_files") + + +def _informations_to_source(informations: dict) -> tuple[str, str]: + """From Survey.informations (e.g. csv_files, sas_files), return (format, path).""" + if not informations: + return "csv", "" + for key in SOURCE_FORMAT_KEYS: + paths = informations.get(key) + if paths and isinstance(paths, list) and len(paths) > 0: + fmt = key.replace("_files", "") + path = paths[0] if isinstance(paths[0], str) else str(paths[0]) + return fmt, path + return "csv", "" + + +def build_manifest_from_json( + json_path: Path, + raw_data_section: dict[str, str] | None = None, +) -> dict: + """ + Build RFC-002 manifest dict from a legacy collection JSON file. + raw_data_section: optional dict survey_name -> path from raw_data.ini [collection_name]. + """ + with json_path.open(encoding="utf-8") as f: + data = json.load(f) + name = data.get("name", json_path.stem) + label = data.get("label", name) + surveys_data = data.get("surveys", {}) + if not isinstance(surveys_data, dict): + surveys_data = {} + surveys = {} + for survey_name, survey_obj in surveys_data.items(): + if not isinstance(survey_obj, dict): + continue + infos = survey_obj.get("informations", {}) or {} + if raw_data_section and survey_name in raw_data_section: + path = raw_data_section[survey_name] + fmt = "csv" + for k in SOURCE_FORMAT_KEYS: + if infos.get(k): + fmt = k.replace("_files", "") + break + else: + fmt, path = _informations_to_source(infos) + surveys[survey_name] = { + "label": survey_obj.get("label", survey_name), + "source": {"format": fmt, "path": path}, + } + if survey_obj.get("output_subdir"): + surveys[survey_name]["output_subdir"] = survey_obj["output_subdir"] + + store_format = _infer_store_format_from_legacy(surveys_data) + return {"name": name, "label": label, "store_format": store_format, "surveys": surveys} + + +def _infer_store_format_from_legacy(surveys_data: dict) -> str: + """Infer store_format from legacy JSON surveys (parquet_file_path, zarr_file_path, hdf5_file_path).""" + if not isinstance(surveys_data, dict): + return "parquet" + for survey_obj in surveys_data.values(): + if not isinstance(survey_obj, dict): + continue + if survey_obj.get("zarr_file_path"): + return "zarr" + if survey_obj.get("parquet_file_path"): + return "parquet" + if survey_obj.get("hdf5_file_path"): + return "hdf5" + return "parquet" + + +def load_raw_data_ini(config_dir: Path) -> configparser.ConfigParser | None: + """Load raw_data.ini if present.""" + path = config_dir / "raw_data.ini" + if not path.is_file(): + return None + parser = configparser.ConfigParser() + parser.read(path, encoding="utf-8") + return parser + + +def migrate( + config_dir: Path, + *, + dry_run: bool = False, +) -> bool: + """ + Migrate config_dir from config.ini (+ raw_data.ini + JSON) to config.yaml + 
manifests. + Returns True if migration was done (or dry_run succeeded). + """ + config_ini = config_dir / "config.ini" + if not config_ini.is_file(): + log.error("No config.ini found in %s", config_dir) + return False + + parser = configparser.ConfigParser() + parser.read(config_ini, encoding="utf-8") + if "collections" not in parser.sections(): + log.error("config.ini has no [collections] section") + return False + + collections_dir_str = parser.get("collections", "collections_directory", fallback=None) + if not collections_dir_str: + collections_dir_str = str(config_dir / "collections") + collections_dir = Path(collections_dir_str).expanduser().resolve() + + output_dir = parser.get("data", "output_directory", fallback=str(config_dir / "output")) + tmp_dir = parser.get("data", "tmp_directory", fallback="/tmp") + if "data" not in parser.sections(): + output_dir = str(config_dir / "output") + tmp_dir = "/tmp" + + raw_data = load_raw_data_ini(config_dir) + collection_names: list[str] = [] + for key in parser.options("collections"): + if key == "collections_directory": + continue + collection_names.append(key) + + if not collection_names: + log.warning("No collection entries in config.ini (only collections_directory)") + # Still write config.yaml so the dir is ready for new-style use + else: + if not dry_run: + collections_dir.mkdir(parents=True, exist_ok=True) + for name in collection_names: + try: + json_path_str = parser.get("collections", name) + except configparser.NoOptionError: + continue + json_path = Path(json_path_str).expanduser().resolve() + if not json_path.is_file(): + log.warning("Collection %s: JSON file not found %s", name, json_path) + continue + raw_section = None + if raw_data and raw_data.has_section(name): + raw_section = dict(raw_data.items(name)) + manifest = build_manifest_from_json(json_path, raw_section) + manifest_path = collections_dir / name / MANIFEST_FILENAME + if dry_run: + log.info("[dry-run] Would write %s", manifest_path) + continue + manifest_path.parent.mkdir(parents=True, exist_ok=True) + with manifest_path.open("w", encoding="utf-8") as f: + yaml.safe_dump( + manifest, + f, + default_flow_style=False, + allow_unicode=True, + sort_keys=False, + ) + log.info("Wrote %s", manifest_path) + + config_yaml_path = config_dir / CONFIG_FILENAME + new_config = { + "collections_dir": str(collections_dir), + "default_output_dir": str(Path(output_dir).expanduser().resolve()), + "tmp_dir": str(Path(tmp_dir).expanduser().resolve()), + } + if dry_run: + log.info("[dry-run] Would write %s with %s", config_yaml_path, new_config) + return True + with config_yaml_path.open("w", encoding="utf-8") as f: + yaml.safe_dump(new_config, f, default_flow_style=False, sort_keys=False) + log.info("Wrote %s", config_yaml_path) + return True + + +def main() -> int: + ap = argparse.ArgumentParser( + description="Migrate config.ini + raw_data.ini + JSON to RFC-002 (config.yaml + manifests)", + ) + ap.add_argument( + "--config-dir", + type=Path, + default=None, + help="Directory containing config.ini (default: XDG or OPENFISCA_SURVEY_CONFIG_DIR)", + ) + ap.add_argument( + "--dry-run", + action="store_true", + help="Only log what would be done", + ) + ap.add_argument( + "-v", + "--verbose", + action="store_true", + help="Verbose logging", + ) + args = ap.parse_args() + logging.basicConfig( + level=logging.DEBUG if args.verbose else logging.INFO, + format="%(message)s", + stream=sys.stdout, + ) + if args.config_dir is None: + try: + from openfisca_survey_manager.configuration.config_loader 
import ( + get_config_dir, + ) + + config_dir = get_config_dir() + except Exception: + log.error("Provide --config-dir or set OPENFISCA_SURVEY_CONFIG_DIR") + return 1 + else: + config_dir = args.config_dir.expanduser().resolve() + if not config_dir.is_dir(): + log.error("Config directory does not exist: %s", config_dir) + return 1 + ok = migrate(config_dir, dry_run=args.dry_run) + return 0 if ok else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/openfisca_survey_manager/survey_collections.py b/openfisca_survey_manager/survey_collections.py deleted file mode 100644 index ba25168b..00000000 --- a/openfisca_survey_manager/survey_collections.py +++ /dev/null @@ -1,154 +0,0 @@ -import codecs -import collections -import configparser -import json -import logging -from pathlib import Path - -from openfisca_survey_manager.config import Config -from openfisca_survey_manager.exceptions import SurveyConfigError -from openfisca_survey_manager.paths import default_config_files_directory -from openfisca_survey_manager.surveys import Survey - -log = logging.getLogger(__name__) - - -class SurveyCollection: - """A collection of Surveys""" - - def __init__( - self, config_files_directory=default_config_files_directory, label=None, name=None, json_file_path=None - ): - self.name = name - self.label = label - self.json_file_path = json_file_path - self.surveys = [] - log.debug(f"Initializing SurveyCollection from config file found in {config_files_directory} ..") - config = Config(config_files_directory=config_files_directory) - if label is not None: - self.label = label - if name is not None: - self.name = name - if json_file_path is not None: - self.json_file_path = json_file_path - if "collections" not in config.sections(): - config["collections"] = {} - config.set("collections", self.name, str(self.json_file_path)) - config.save() - elif config is not None: - if config.has_option("collections", self.name): - self.json_file_path = config.get("collections", self.name) - elif config.get("collections", "collections_directory") is not None: - self.json_file_path = str(Path(config.get("collections", "collections_directory")) / (name + ".json")) - - self.config = config - - def __repr__(self): - header = f"""{self.name} -Survey collection of {self.label} -Contains the following surveys : -""" - surveys = [f" {survey.name} : {survey.label} \n" for survey in self.surveys] - return header + "".join(surveys) - - def dump(self, config_files_directory=None, json_file_path=None): - """ - Dump the survey collection to a json file - And set the json file path in the config file - """ - if self.config is not None: - config = self.config - else: - if config_files_directory is not None: - pass - else: - config_files_directory = default_config_files_directory - self.config = Config(config_files_directory=config_files_directory) - - if json_file_path is None: - assert self.json_file_path is not None, "A json_file_path should be provided" - else: - self.json_file_path = json_file_path - - config.set("collections", self.name, str(self.json_file_path)) - config.save() - with codecs.open(str(self.json_file_path), "w", encoding="utf-8") as _file: - json.dump(self.to_json(), _file, ensure_ascii=False, indent=2) - - def fill_store( - self, - source_format=None, - surveys=None, - tables=None, - overwrite=False, - keep_original_parquet_file=False, - encoding=None, - store_format="hdf5", - categorical_strategy="unique_labels", - ): - if surveys is None: - surveys = self.surveys - for survey in surveys: - 
survey.fill_store( - source_format=source_format, - tables=tables, - overwrite=overwrite, - keep_original_parquet_file=keep_original_parquet_file, - encoding=encoding, - store_format=store_format, - categorical_strategy=categorical_strategy, - ) - self.dump() - - def get_survey(self, survey_name): - available_surveys_names = [survey.name for survey in self.surveys] - assert survey_name in available_surveys_names, ( - f"Survey {survey_name} cannot be found for survey collection {self.name}.\n" - f"Available surveys are :{available_surveys_names}" - ) - return [survey for survey in self.surveys if survey.name == survey_name].pop() - - @classmethod - def load(cls, json_file_path=None, collection=None, config_files_directory=default_config_files_directory): - assert Path(config_files_directory).exists() - config = Config(config_files_directory=config_files_directory) - if json_file_path is None: - assert collection is not None, "A collection is needed" - try: - json_file_path = config.get("collections", collection) - except (configparser.NoOptionError, configparser.NoSectionError) as error: - msg = f"Looking for config file in {config_files_directory}" - log.debug(msg) - log.error(error) - raise error - except Exception as error: - msg = f"Looking for config file in {config_files_directory}" - log.debug(msg) - log.error(error) - raise SurveyConfigError(msg) from error - - with Path(json_file_path).open("r") as _file: - self_json = json.load(_file) - name = self_json["name"] - - self = cls(config_files_directory=config_files_directory, name=name) - self.config = config - with Path(json_file_path).open("r") as _file: - self_json = json.load(_file) - self.json_file_path = json_file_path - self.label = self_json.get("label") - self.name = self_json.get("name") - - surveys = self_json["surveys"] - for survey_name, survey_json in surveys.items(): - survey = Survey(name=survey_name) - self.surveys.append(survey.create_from_json(survey_json)) - return self - - def to_json(self): - self_json = collections.OrderedDict(()) - self_json["name"] = self.name - self_json["surveys"] = collections.OrderedDict(()) - for survey in self.surveys: - self_json["surveys"][survey.name] = survey.to_json() - return self_json diff --git a/openfisca_survey_manager/surveys.py b/openfisca_survey_manager/surveys.py deleted file mode 100644 index 77dd566e..00000000 --- a/openfisca_survey_manager/surveys.py +++ /dev/null @@ -1,388 +0,0 @@ -#! 
/usr/bin/env python - - -import collections -import logging -import re -from pathlib import Path - -import pandas -import pyarrow as pa -import pyarrow.parquet as pq -import yaml - -from openfisca_survey_manager.exceptions import SurveyIOError, SurveyManagerError - -from .tables import Table - -ident_re = re.compile(r"(?i)ident\d{2,4}$") - -log = logging.getLogger(__name__) - - -source_format_by_extension = { - "csv": "csv", - "sas7bdat": "sas", - "dta": "stata", - "Rdata": "Rdata", - "spss": "sav", - "parquet": "parquet", -} - -admissible_source_formats = list(source_format_by_extension.values()) - - -class NoMoreDataError(Exception): - # Exception when the user ask for more data than available in file - pass - - -class Survey: - """An object to describe survey data""" - - hdf5_file_path = None - parquet_file_path = None - label = None - name = None - survey_collection = None - - def __init__( - self, name=None, label=None, hdf5_file_path=None, parquet_file_path=None, survey_collection=None, **kwargs - ): - assert name is not None, "A survey should have a name" - self.name = name - self.tables = collections.OrderedDict() - self.informations = {} - self.tables_index = {} - - if label is not None: - self.label = label - - if hdf5_file_path is not None: - self.hdf5_file_path = hdf5_file_path - - if parquet_file_path is not None: - self.parquet_file_path = parquet_file_path - - if survey_collection is not None: - self.survey_collection = survey_collection - - self.informations = kwargs - - def __repr__(self): - header = f"""{self.name} : survey data {self.label} -Contains the following tables : \n""" - tables = yaml.safe_dump(list(self.tables.keys()), default_flow_style=False) - informations = yaml.safe_dump(self.informations, default_flow_style=False) - return header + tables + informations - - @classmethod - def create_from_json(cls, survey_json): - self = cls( - name=survey_json.get("name"), - label=survey_json.get("label"), - hdf5_file_path=survey_json.get("hdf5_file_path"), - parquet_file_path=survey_json.get("parquet_file_path"), - **survey_json.get("informations", {}), - ) - self.tables = survey_json.get("tables") - return self - - def dump(self): - assert self.survey_collection is not None - self.survey_collection.dump() - - def fill_store( - self, - source_format=None, - tables=None, - overwrite=True, - keep_original_parquet_file=False, - encoding=None, - store_format="hdf5", - categorical_strategy="unique_labels", - ): - """ - Convert data from the source files to store format either hdf5 or parquet. - If the source is in parquet, the data is not converted. 
- """ - assert self.survey_collection is not None - assert isinstance(overwrite, (bool, list)) - survey = self - # Create folder if it does not exist - config = survey.survey_collection.config - directory_path = config.get("data", "output_directory") - if not Path(directory_path).is_dir(): - log.warn( - f"{directory_path} who should be the store data directory does not exist: we create the directory" - ) - Path(directory_path).mkdir(parents=True) - - if source_format == "parquet": - store_format = "parquet" - - if store_format == "hdf5" and survey.hdf5_file_path is None: - survey.hdf5_file_path = str(Path(directory_path) / (survey.name + ".h5")) - - if store_format == "parquet" and survey.parquet_file_path is None: - survey.parquet_file_path = str(Path(directory_path) / survey.name) - - self.store_format = store_format - - if source_format is not None: - assert source_format in admissible_source_formats, f"Data source format {source_format} is unknown" - source_formats = [source_format] - else: - source_formats = admissible_source_formats - - for source_format in source_formats: - files = f"{source_format}_files" - for data_file in survey.informations.get(files, []): - name = Path(data_file).stem - extension = Path(data_file).suffix - if tables is None or name in tables: - if keep_original_parquet_file: - # Use folder instead of files if numeric at end of file - if re.match(r".*-\d$", name): - name = name.split("-")[0] - parquet_file = str(Path(data_file).parent) - # Get the parent folder - survey.parquet_file_path = str(Path(data_file).parent.parent) - else: - parquet_file = data_file - survey.parquet_file_path = str(Path(data_file).parent) - table = Table( - label=name, - name=name, - source_format=source_format_by_extension[extension[1:]], - survey=survey, - parquet_file=parquet_file, - ) - table.read_parquet_columns(data_file) - - else: - table = Table( - label=name, - name=name, - source_format=source_format_by_extension[extension[1:]], - survey=survey, - ) - table.fill_store( - data_file, - clean=True, - overwrite=overwrite if isinstance(overwrite, bool) else table.name in overwrite, - encoding=encoding, - categorical_strategy=categorical_strategy, - ) - self.dump() - - def get_value(self, variable, table, lowercase=False, ignorecase=False): - """Get variable value from a survey table. - - Args: - variable: variable to retrieve - table(str): name of the table - lowercase(bool, optional, optional): lowercase variable names, defaults to False - ignorecase: ignore case of table name, defaults to False - - Returns: - pd.DataFrame: dataframe containing the variable - - """ - return self.get_values([variable], table) - - def get_values( - self, - variables=None, - table=None, - lowercase=False, - ignorecase=False, - rename_ident=True, - batch_size=None, - batch_index=0, - filter_by=None, - ) -> pandas.DataFrame: - """Get variables values from a survey table. - - Args: - variables(list, optional, optional): variables to retrieve, defaults to None (retrieve all variables) - table(str, optional, optional): name of the table, defaults to None - ignorecase: ignore case of table name, defaults to False - lowercase(bool, optional, optional): lowercase variable names, defaults to False - rename_ident(bool, optional, optional): rename ident+yr (e.g. 
ident08) into ident, defaults to True - batch_size(int, optional, optional): batch size for parquet file, defaults to None - batch_index(int, optional, optional): batch index for parquet file, defaults to 0 - - Returns: - pd.DataFrame: dataframe containing the variables - - Raises: - Exception: - - """ - if self.parquet_file_path is None and self.hdf5_file_path is None: - raise SurveyIOError(f"No data file found for survey {self.name}") - if self.hdf5_file_path is not None: - assert Path(self.hdf5_file_path).exists(), ( - f"{self.hdf5_file_path} is not a valid path. This could happen because " - "your data were not builded yet. Please consider using a rebuild option in your code." - ) - store = pandas.HDFStore(self.hdf5_file_path, "r") - if ignorecase: - keys = store.keys() - eligible_tables = [] - for string in keys: - match = re.findall(table, string, re.IGNORECASE) - if match: - eligible_tables.append(match[0]) - if len(eligible_tables) > 1: - raise SurveyManagerError( - f"{table} is ambiguous since the following tables are available: {eligible_tables}" - ) - elif len(eligible_tables) == 0: - raise SurveyIOError(f"No eligible available table in {keys}") - else: - table = eligible_tables[0] - try: - df = store.select(table) - except KeyError: - log.error(f"No table {table} in the file {self.hdf5_file_path}") - log.error( - f"This could happen because your data were not builded yet. Available tables are: {store.keys()}" - ) - store.close() - raise - - store.close() - - elif self.parquet_file_path is not None: - if table is None: - raise SurveyIOError("A table name is needed to retrieve data from a parquet file") - for table_name, table_content in self.tables.items(): - if table == table_name: - parquet_file = table_content.get("parquet_file") - # Is parquet_file a folder or a file? - if Path(parquet_file).is_dir(): - # find first parquet file in folder - for file in Path(parquet_file).iterdir(): - if file.suffix == ".parquet": - one_parquet_file = str(Path(parquet_file) / file) - break - else: - raise SurveyIOError(f"No parquet file found in {parquet_file}") - else: - one_parquet_file = parquet_file - parquet_schema = pq.read_schema(one_parquet_file) - assert len(parquet_schema.names) >= 1, ( - f"The parquet file {table_content.get('parquet_file')} is empty" - ) - if variables is None: - variables = table_content.get("variables") - if filter_by: - df = pq.ParquetDataset(parquet_file, filters=filter_by).read(columns=variables).to_pandas() - elif batch_size: - if Path(parquet_file).is_dir(): - parquet_file = [str(p) for p in Path(parquet_file).glob("*.parquet")] - else: - parquet_file = [parquet_file] - # Initialize an empty list to store the Parquet tables - tables = [] - # Loop through the file paths and read each Parquet file - for file_path in parquet_file: - table = pq.read_table(file_path, columns=variables) - tables.append(table) - - # Concatenate the tables if needed - final_table = pa.concat_tables(tables) if len(tables) > 1 else tables[0] - record_batches = final_table.to_batches(max_chunksize=batch_size) - if len(record_batches) <= batch_index: - raise NoMoreDataError( - f"Batch {batch_index} not found in {table_name}. Max index is {len(record_batches)}" - ) - df = record_batches[batch_index].to_pandas() - # iter_parquet = parquet_file.iter_batches(batch_size=batch_size, columns=variables) - # index = 0 - # while True: - # try: - # batch = next(iter_parquet) - # except StopIteration: - # raise NoMoreDataError( - # f"Batch {batch_index} not found in {table_name}. 
Max index is {index}" - # ) - # break - # if batch_index == index: - # df = batch.to_pandas() - # break - # index += 1 - else: - df = pq.ParquetDataset(parquet_file).read(columns=variables).to_pandas() - break - else: - raise SurveyIOError(f"No table {table} found in {self.parquet_file_path}") - - if lowercase: - columns = {column_name: column_name.lower() for column_name in df} - df.rename(columns=columns, inplace=True) - - if rename_ident is True: - for column_name in df: - if ident_re.match(str(column_name)) is not None: - df.rename(columns={column_name: "ident"}, inplace=True) - log.info(f"{column_name} column have been replaced by ident") - break - - if variables is None: - return df - else: - diff = set(variables) - set(df.columns) - if diff: - raise SurveyIOError(f"The following variable(s) {diff} are missing") - variables = list(set(variables).intersection(df.columns)) - df = df[variables] - return df - - def insert_table(self, label=None, name=None, **kwargs): - """Insert a table in the Survey object. - - If a pandas dataframe is provided, it is saved in the store file - """ - parquet_file = kwargs.pop("parquet_file", None) - data_frame = kwargs.pop("data_frame", None) - if data_frame is None: - # Try without underscore - data_frame = kwargs.pop("dataframe", None) - - if data_frame is not None: - assert isinstance(data_frame, pandas.DataFrame) - variables = kwargs.pop("variables", None) - if variables is not None: - assert set(variables) < set(data_frame.columns) - else: - variables = list(data_frame.columns) - if label is None: - label = name - table = Table(label=label, name=name, survey=self, variables=variables, parquet_file=parquet_file) - assert (table.survey.hdf5_file_path is not None) or (table.survey.parquet_file_path is not None) - if parquet_file is not None: - log.debug(f"Saving table {name} in {table.survey.parquet_file_path}") - data_frame.to_parquet(parquet_file) - else: - log.debug(f"Saving table {name} in {table.survey.hdf5_file_path}") - to_hdf_kwargs = kwargs.pop("to_hdf_kwargs", {}) - table.save_data_frame_to_hdf5(data_frame, **to_hdf_kwargs) - - if name not in self.tables: - self.tables[name] = {} - for key, val in kwargs.items(): - self.tables[name][key] = val - - def to_json(self): - """Convert the survey to a JSON object.""" - self_json = collections.OrderedDict(()) - self_json["hdf5_file_path"] = str(self.hdf5_file_path) if self.hdf5_file_path else None - self_json["parquet_file_path"] = str(self.parquet_file_path) if self.parquet_file_path else None - self_json["label"] = self.label - self_json["name"] = self.name - self_json["tables"] = self.tables - self_json["informations"] = collections.OrderedDict(sorted(self.informations.items())) - return self_json diff --git a/openfisca_survey_manager/tables.py b/openfisca_survey_manager/tables.py deleted file mode 100644 index 89951176..00000000 --- a/openfisca_survey_manager/tables.py +++ /dev/null @@ -1,365 +0,0 @@ -"""Tables.""" - -import collections -import csv -import datetime -import errno -import gc -import logging -import os -from pathlib import Path - -import pandas -from chardet.universaldetector import UniversalDetector -from pyarrow import parquet as pq - -from openfisca_survey_manager import read_sas -from openfisca_survey_manager.exceptions import SurveyIOError -from openfisca_survey_manager.io.writers import write_table_to_hdf5, write_table_to_parquet -from openfisca_survey_manager.processing.cleaning import clean_data_frame - -try: - from openfisca_survey_manager.read_spss import read_spss -except 
ImportError: - read_spss = None - - -log = logging.getLogger(__name__) - - -reader_by_source_format = { - # Rdata = pandas.rpy.common.load_data, - "csv": pandas.read_csv, - "sas": read_sas.read_sas, - "spss": read_spss, - "stata": pandas.read_stata, - "parquet": pandas.read_parquet, -} - - -class Table: - """A table of a survey.""" - - label = None - name = None - source_format = None - survey = None - variables = None - parquet_file = None - - def __init__( - self, survey=None, name=None, label=None, source_format=None, variables=None, parquet_file=None, **kwargs - ): - assert name is not None, "A table should have a name" - self.name = name - self.label = label - self.source_format = source_format - self.variables = variables - self.parquet_file = parquet_file - self.informations = kwargs - - from .surveys import Survey # Keep it here to avoid infinite recursion - - assert isinstance(survey, Survey), f"survey is of type {type(survey)} and not {Survey}" - self.survey = survey - if not survey.tables: - survey.tables = collections.OrderedDict() - - survey.tables[name] = collections.OrderedDict( - source_format=source_format, - variables=variables, - parquet_file=parquet_file, - ) - - def _check_and_log(self, data_file_path, store_file_path): - """ - Check if the file exists and log the insertion. - - Args: - data_file_path: Data file path - store_file_path: Store file or dir path - - Raises: - Exception: File not found - """ - assert store_file_path is not None, "Store file path cannot be None" - if not Path(data_file_path).is_file(): - raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), data_file_path) - - log.info( - f"Inserting table {self.name} from file {data_file_path} in store file {store_file_path} " - f"at point {self.name}" - ) - - def _is_stored(self): - if self.survey.hdf5_file_path is not None: - store = pandas.HDFStore(self.survey.hdf5_file_path) - if self.name in store: - log.info(f"Exiting without overwriting {self.name} in {self.survey.hdf5_file_path}") - store.close() - return True - - store.close() - return False - else: - return False - - def _save(self, data_frame: pandas.DataFrame = None, store_format="hdf5"): - """ - Save a data frame in the store according to is format (HDF5 or Parque). - """ - assert data_frame is not None - variables = self.variables - - if variables: - stored_variables = list(set(variables).intersection(set(data_frame.columns))) - log.info(f"The folloging variables are stored: {stored_variables}") - if set(stored_variables) != set(variables): - log.info( - "variables wanted by the user that were not available: " - f"{list(set(variables) - set(stored_variables))}" - ) - data_frame = data_frame[stored_variables].copy() - - assert store_format in ["hdf5", "parquet"], f"invalid store_format: {store_format}" - if store_format == "hdf5": - import warnings - - warnings.warn( - "HDF5 will no longer be the default format in a future version. Please use parquet format instead.", - DeprecationWarning, - stacklevel=3, - ) - log.warning( - "HDF5 will no longer be the default format in a future version. Please use parquet format instead." - ) - self.save_data_frame_to_hdf5(data_frame) - else: - parquet_file_path = self.survey.parquet_file_path - log.info(f"Inserting table {self.name} in Parquet file {parquet_file_path}") - self.save_data_frame_to_parquet(data_frame) - gc.collect() - - def fill_store(self, data_file, overwrite: bool = False, clean: bool = False, **kwargs): - """ - Fill the store (HDF5 or parquet file) with the table. 
- Read the `data_file` in parameter and save it to the store. - - Args: - data_file (_type_, optional): The data file path. Defaults to None. - overwrite (bool, optional): Overwrite the data. Defaults to False. - clean (bool, optional): Clean the raw data befoe saving. Defaults to False. - store_format (str, optional): _description_. Defaults to "hdf5". - - Raises: - e: Skip file if error - """ - if not overwrite and self._is_stored(): - log.info(f"Exiting without overwriting {self.name} in {self.survey.hdf5_file_path}") - return - - start_table_time = datetime.datetime.now() - if self.source_format in ["sas", "parquet"] and "encoding" in kwargs: - del kwargs["encoding"] - data_frame = self.read_source(data_file, **kwargs) - try: - if clean: - clean_data_frame(data_frame) - self._save(data_frame=data_frame, store_format=self.survey.store_format) - log.info(f"File {data_file} has been processed in {datetime.datetime.now() - start_table_time}") - except Exception as e: - log.info(f"Skipping file {data_file} because of following error \n {e}") - raise e - - def read_parquet_columns(self, parquet_file=None) -> list: - """ - Initialize the table from a parquet file. - """ - if parquet_file is None: - parquet_file = self.parquet_file - log.info(f"Initializing table {self.name} from parquet file {parquet_file}") - self.source_format = "parquet" - parquet_schema = pq.read_schema(parquet_file) - self.variables = parquet_schema.names - self.survey.tables[self.name]["variables"] = self.variables - return self.variables - - def read_source(self, data_file, **kwargs): - source_format = self.source_format - store_file_path = ( - self.survey.hdf5_file_path if self.survey.store_format == "hdf5" else self.survey.parquet_file_path - ) - - self._check_and_log(data_file, store_file_path=store_file_path) - reader = reader_by_source_format[source_format] - # Extract categorical_strategy early - only stata format uses it - # Other formats (parquet, csv, etc.) don't support it and will error if passed - categorical_strategy = ( - kwargs.pop("categorical_strategy", "unique_labels") - if source_format == "stata" - else kwargs.pop("categorical_strategy", None) - ) - try: - if source_format == "csv": - try: - data_frame = reader(data_file, **kwargs) - - if len(data_frame.columns) == 1 and ";" in data_frame.columns[0]: - raise SurveyIOError( - "A ';' is present in the unique column name. Looks like we got the wrong separator." - ) - - except Exception: - log.debug(f"Failing to read {data_file}, Trying to infer encoding and dialect/separator") - - # Detect encoding - detector = UniversalDetector() - with Path(data_file).open("rb") as csvfile: - for line in csvfile: - detector.feed(line) - if detector.done: - break - detector.close() - - encoding = detector.result["encoding"] - confidence = detector.result["confidence"] - - # Sniff dialect - try: - with Path(data_file).open("r", newline="", encoding=encoding) as csvfile: - dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,") - except Exception: - # Sometimes the sniffer fails, we switch back to the default ... 
of French statistical data - dialect = None - delimiter = ";" - - log.debug( - f"dialect.delimiter = {dialect.delimiter if dialect is not None else delimiter}, " - f"encoding = {encoding}, confidence = {confidence}" - ) - kwargs["engine"] = "python" - if dialect: - kwargs["dialect"] = dialect - else: - kwargs["delimiter"] = delimiter - kwargs["encoding"] = encoding - data_frame = reader(data_file, **kwargs) - - else: - # Remove encoding parameter for pandas 2.0+ compatibility (not supported in read_stata) - if "encoding" in kwargs and source_format == "stata": - kwargs.pop("encoding") - # Try to read with categoricals, handle non-unique labels with configurable strategy - if source_format == "stata": - # categorical_strategy already extracted above - - try: - # Read with pandas' default convert_categoricals (True); both former branches were identical, so read once - data_frame = reader(data_file, **kwargs) - except ValueError as e: - if "not unique" in str(e) or "Categorical categories must be unique" in str(e): - log.info( - f"Non-unique value labels detected in {data_file}, " - f"using strategy '{categorical_strategy}'" - ) - - # Read without categoricals first - kwargs_no_cat = kwargs.copy() - kwargs_no_cat["convert_categoricals"] = False - data_frame = reader(data_file, **kwargs_no_cat) - - # Apply categorical strategy - if categorical_strategy == "unique_labels": - # Solution 2: Make labels unique by adding code suffix - from pandas.io.stata import StataReader - - stata_reader = StataReader(data_file) - value_labels = stata_reader.value_labels() - - for col_name, labels in value_labels.items(): - if col_name in data_frame.columns: - unique_labels = {} - seen_labels = {} - - for code, label in labels.items(): - if pandas.isna(code): - unique_labels[code] = label - elif label in seen_labels: - # Duplicate label: add code as suffix - unique_labels[code] = f"{label} ({code})" - else: - unique_labels[code] = label - seen_labels[label] = code - - # Create mapping code -> unique label - code_to_label = {code: unique_labels[code] for code in sorted(labels.keys())} - - # Map codes to unique labels and create categories - data_frame[col_name] = data_frame[col_name].map(code_to_label) - data_frame[col_name] = pandas.Categorical( - data_frame[col_name], - categories=list(code_to_label.values()), - ordered=False, - ) - - elif categorical_strategy == "codes": - # Solution 1: Use codes as categories - from pandas.io.stata import StataReader - - stata_reader = StataReader(data_file) - value_labels = stata_reader.value_labels() - - for col_name, labels in value_labels.items(): - if col_name in data_frame.columns: - codes = sorted([c for c in labels if pandas.notna(c)]) - if codes: - data_frame[col_name] = pandas.Categorical( - data_frame[col_name], categories=codes, ordered=False - ) - - elif categorical_strategy == "skip": - # Keep as-is (no categories) - pass - else: - log.warning(f"Unknown categorical_strategy '{categorical_strategy}', using 'skip'") - else: - raise - else: - data_frame = reader(data_file, **kwargs) - - except Exception as e: - log.info(f"Error while reading {data_file}") - raise e - - gc.collect() - return data_frame - - def save_data_frame_to_hdf5(self, data_frame, **kwargs): - """Save a data frame in the HDF5 file format.""" - hdf5_file_path = self.survey.hdf5_file_path - log.info(f"Inserting table {self.name} in HDF file {hdf5_file_path}") - store_path = self.name - write_table_to_hdf5( - data_frame, - hdf5_file_path=hdf5_file_path, - store_path=store_path, - **kwargs, - ) - - self.variables = list(data_frame.columns) - - def save_data_frame_to_parquet(self, data_frame): - """Save a data frame in the Parquet file format.""" - parquet_file_path = self.survey.parquet_file_path - self.parquet_file = write_table_to_parquet( - data_frame, - parquet_dir_path=parquet_file_path, - table_name=self.name, - ) - self.variables = list(data_frame.columns) - - self.survey.tables[self.name]["parquet_file"] = self.parquet_file - self.survey.tables[self.name]["variables"] = self.variables
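For orientation, a hedged sketch of the minimal legacy layout that the conftest fixture above guarantees, runnable outside pytest; every path and name here is illustrative:

import json
from pathlib import Path

from openfisca_survey_manager.core.dataset import SurveyCollection

base = Path("/tmp/osm_legacy_demo")  # illustrative scratch directory
collections_dir = base / "collections"
collections_dir.mkdir(parents=True, exist_ok=True)

# Minimal collection JSON, the same shape the fixture writes
json_path = collections_dir / "demo.json"
json_path.write_text(json.dumps({"name": "demo", "surveys": {}}, indent=2), encoding="utf-8")

# Legacy config.ini pointing at the collection (mirrors the legacy_config_dir fixture further down)
(base / "config.ini").write_text(
    f"[collections]\ncollections_directory = {collections_dir}\ndemo = {json_path}\n\n"
    f"[data]\noutput_directory = {base / 'output'}\ntmp_directory = /tmp\n",
    encoding="utf-8",
)

# Since 8.0 this legacy path emits a DeprecationWarning (see test_survey_collection_load_legacy_unchanged below)
collection = SurveyCollection.load(collection="demo", config_files_directory=base)
assert collection.name == "demo"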
@@ -27,7 +32,7 @@ def make_input_dataframe_by_entity(tax_benefit_system, nb_persons, nb_groups): Example: - >>> from openfisca_survey_manager.input_dataframe_generator import make_input_dataframe_by_entity + >>> from openfisca_survey_manager.tests.input_dataframe_generator import make_input_dataframe_by_entity >>> from openfisca_country_template import CountryTaxBenefitSystem >>> tbs = CountryTaxBenefitSystem() >>> input_dataframe_by_entity = make_input_dataframe_by_entity(tbs, 400, 100) @@ -150,7 +155,7 @@ def randomly_init_variable( seed: Random seed used when drawing the values (Default value = None) Examples - >>> from openfisca_survey_manager.input_dataframe_generator import make_input_dataframe_by_entity + >>> from openfisca_survey_manager.tests.input_dataframe_generator import make_input_dataframe_by_entity >>> from openfisca_country_template import CountryTaxBenefitSystem >>> tbs = CountryTaxBenefitSystem() >>> input_dataframe_by_entity = make_input_dataframe_by_entity(tbs, 400, 100) diff --git a/openfisca_survey_manager/tests/test_add_survey_to_collection.py b/openfisca_survey_manager/tests/test_add_survey_to_collection.py index 3d2c968d..a417f29d 100644 --- a/openfisca_survey_manager/tests/test_add_survey_to_collection.py +++ b/openfisca_survey_manager/tests/test_add_survey_to_collection.py @@ -1,11 +1,11 @@ import pandas as pd import pytest -from openfisca_survey_manager.input_dataframe_generator import set_table_in_survey +from openfisca_survey_manager.core.dataset import SurveyCollection from openfisca_survey_manager.scripts.build_collection import ( add_survey_to_collection, ) -from openfisca_survey_manager.survey_collections import SurveyCollection +from openfisca_survey_manager.tests.input_dataframe_generator import set_table_in_survey @pytest.mark.order(after="test_write_parquet.py::test_write_parquet_one_file_per_entity") diff --git a/openfisca_survey_manager/tests/test_calibration.py b/openfisca_survey_manager/tests/test_calibration.py index 9e2bc37b..05f23fcb 100644 --- a/openfisca_survey_manager/tests/test_calibration.py +++ b/openfisca_survey_manager/tests/test_calibration.py @@ -1,8 +1,8 @@ from openfisca_core import periods from openfisca_core.tools import assert_near -from openfisca_survey_manager.calibration import Calibration -from openfisca_survey_manager.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.policy.calibration import Calibration +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario from openfisca_survey_manager.tests import tax_benefit_system from openfisca_survey_manager.tests.test_scenario import ( create_randomly_initialized_survey_scenario, @@ -164,4 +164,4 @@ def test_simulation_calibration_input_from_data(tmp_path): f"{simulation_name} weight_variable_by_entity does not match {weight_variable_by_entity}" ) assert (survey_scenario.calculate_series("household_weight", period, simulation=simulation_name) != 0).all() - return survey_scenario + assert survey_scenario is not None diff --git a/openfisca_survey_manager/tests/test_calmar.py b/openfisca_survey_manager/tests/test_calmar.py index 93a914d4..884f3997 100644 --- a/openfisca_survey_manager/tests/test_calmar.py +++ b/openfisca_survey_manager/tests/test_calmar.py @@ -6,7 +6,7 @@ import numpy as np import pandas as pd -from openfisca_survey_manager.calmar import calmar +from openfisca_survey_manager.policy.calmar import calmar def create_input_dataframe(entities=1): diff --git
a/openfisca_survey_manager/tests/test_config_manifest_rfc002.py b/openfisca_survey_manager/tests/test_config_manifest_rfc002.py new file mode 100644 index 00000000..9f5da00c --- /dev/null +++ b/openfisca_survey_manager/tests/test_config_manifest_rfc002.py @@ -0,0 +1,302 @@ +"""Tests for RFC-002: config.yaml and manifest.yaml (new metadata architecture).""" + +from pathlib import Path + +import pytest + +from openfisca_survey_manager.configuration.config_loader import ( + get_config_dir, + load_config, + load_manifest, + manifest_survey_to_json, +) +from openfisca_survey_manager.configuration.paths import openfisca_survey_manager_location +from openfisca_survey_manager.core.dataset import SurveyCollection + + +@pytest.fixture +def rfc002_config_dir(tmp_path): + """Create a config dir with config.yaml and a dataset with manifest.yaml.""" + config_dir = tmp_path / "config" + config_dir.mkdir() + (config_dir / "config.yaml").write_text( + """ +collections_dir: {collections} +default_output_dir: {output} +tmp_dir: {tmp} +""".format( + collections=tmp_path / "collections", + output=tmp_path / "output", + tmp=tmp_path / "tmp", + ) + ) + collections_dir = tmp_path / "collections" + collections_dir.mkdir() + dataset_dir = collections_dir / "test_dataset" + dataset_dir.mkdir() + (dataset_dir / "manifest.yaml").write_text( + """ +name: test_dataset +label: "Test dataset (RFC-002)" + +surveys: + survey_a: + label: "Survey A" + source: + format: csv + path: /data/survey_a + survey_b: + label: "Survey B" + source: + format: sas + path: /data/survey_b +""" + ) + return config_dir + + +def test_get_config_dir_explicit(tmp_path): + assert get_config_dir(tmp_path) == tmp_path.resolve() + + +def test_get_config_dir_env(monkeypatch, tmp_path): + monkeypatch.setenv("OPENFISCA_SURVEY_CONFIG_DIR", str(tmp_path)) + assert get_config_dir() == tmp_path.resolve() + + +def test_load_config_missing(tmp_path): + assert load_config(tmp_path) is None + + +def test_load_config_present(rfc002_config_dir): + cfg = load_config(rfc002_config_dir) + assert cfg is not None + assert "collections_dir" in cfg + assert "default_output_dir" in cfg + assert cfg["collections_dir"].is_dir() + assert (cfg["collections_dir"] / "test_dataset").is_dir() + + +def test_load_manifest_missing(tmp_path): + assert load_manifest(tmp_path, "nonexistent") is None + + +def test_load_manifest_present(rfc002_config_dir): + cfg = load_config(rfc002_config_dir) + assert cfg is not None + manifest = load_manifest(cfg["collections_dir"], "test_dataset") + assert manifest is not None + assert manifest["name"] == "test_dataset" + assert manifest["label"] == "Test dataset (RFC-002)" + assert "survey_a" in manifest["surveys"] + assert manifest["surveys"]["survey_a"]["source"]["format"] == "csv" + assert manifest["surveys"]["survey_a"]["source"]["path"] == "/data/survey_a" + + +def test_manifest_survey_to_json(): + entry = { + "label": "My survey", + "source": {"format": "sas", "path": "/path/to/data"}, + } + out = manifest_survey_to_json("my_survey", entry) + assert out["name"] == "my_survey" + assert out["label"] == "My survey" + assert out["informations"]["sas_files"] == ["/path/to/data"] + + +def test_survey_collection_load_from_manifest(rfc002_config_dir): + """SurveyCollection.load(collection=..., config_files_directory=...) 
uses manifest when config.yaml exists.""" + col = SurveyCollection.load( + collection="test_dataset", + config_files_directory=rfc002_config_dir, + ) + assert col.name == "test_dataset" + assert col.label == "Test dataset (RFC-002)" + assert col.config is None + assert col.output_directory is not None + assert len(col.surveys) == 2 + names = {s.name for s in col.surveys} + assert names == {"survey_a", "survey_b"} + survey_a = col.get_survey("survey_a") + assert survey_a.label == "Survey A" + assert survey_a.informations.get("csv_files") == ["/data/survey_a"] + # Default store_format when missing in manifest is parquet + assert survey_a.store_format == "parquet" + assert survey_a.parquet_file_path is not None + assert "survey_a" in survey_a.parquet_file_path + + +def test_survey_collection_load_from_manifest_store_format_zarr(tmp_path): + """When manifest has store_format: zarr, surveys get zarr_file_path set.""" + config_dir = tmp_path / "config" + config_dir.mkdir() + (config_dir / "config.yaml").write_text( + f""" +collections_dir: {tmp_path / "collections"} +default_output_dir: {tmp_path / "output"} +tmp_dir: {tmp_path / "tmp"} +""" + ) + collections_dir = tmp_path / "collections" + collections_dir.mkdir() + dataset_dir = collections_dir / "zarr_dataset" + dataset_dir.mkdir() + (dataset_dir / "manifest.yaml").write_text( + """ +name: zarr_dataset +label: "Zarr dataset" +store_format: zarr +surveys: + s1: + label: "Survey 1" + source: + format: csv + path: /data/s1 +""" + ) + col = SurveyCollection.load( + collection="zarr_dataset", + config_files_directory=config_dir, + ) + assert col.output_directory is not None + survey_s1 = col.get_survey("s1") + assert survey_s1.store_format == "zarr" + assert survey_s1.zarr_file_path is not None + assert survey_s1.zarr_file_path.endswith(".zarr") + assert survey_s1.hdf5_file_path is None + assert survey_s1.parquet_file_path is None + + +def test_survey_collection_load_legacy_unchanged(tmp_path): + """Legacy config.ini + JSON still works when config.yaml is absent (emits DeprecationWarning).""" + # Use the package test data dir which has config.ini and fake.json + tests_data = Path(openfisca_survey_manager_location) / "openfisca_survey_manager" / "tests" / "data_files" + if not (tests_data / "config.ini").exists(): + pytest.skip("config.ini not present in tests/data_files") + if not (tests_data / "fake.json").exists(): + pytest.skip("fake.json not present in tests/data_files") + with pytest.warns(DeprecationWarning, match="config.ini and JSON files is deprecated"): + col = SurveyCollection.load( + collection="fake", + config_files_directory=tests_data, + ) + assert col.config is not None + assert col.name == "fake" + assert len(col.surveys) >= 0
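The RFC-002 flow exercised by the tests above, condensed into a hedged end-to-end sketch; directory, dataset, and survey names are illustrative:

from pathlib import Path

from openfisca_survey_manager.core.dataset import SurveyCollection

base = Path("/tmp/rfc002_demo")  # illustrative config dir
dataset_dir = base / "collections" / "demo_dataset"
dataset_dir.mkdir(parents=True, exist_ok=True)

# config.yaml tells the loader where collections and outputs live
(base / "config.yaml").write_text(
    f"collections_dir: {base / 'collections'}\n"
    f"default_output_dir: {base / 'output'}\n"
    f"tmp_dir: {base / 'tmp'}\n"
)

# One dataset directory = one manifest.yaml; store_format is optional (parquet is the default)
(dataset_dir / "manifest.yaml").write_text(
    "name: demo_dataset\n"
    'label: "Demo dataset"\n'
    "store_format: parquet\n"
    "surveys:\n"
    "  s1:\n"
    '    label: "Survey 1"\n'
    "    source:\n"
    "      format: csv\n"
    "      path: /data/s1\n"
)

col = SurveyCollection.load(collection="demo_dataset", config_files_directory=base)
assert col.get_survey("s1").store_format == "parquet"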
+ + +# --- Migration script tests --- + + +@pytest.fixture +def legacy_config_dir(tmp_path): + """Create a minimal legacy config dir: config.ini + one collection JSON.""" + config_dir = tmp_path / "legacy_config" + config_dir.mkdir() + collections_dir = tmp_path / "legacy_collections" + collections_dir.mkdir() + json_path = collections_dir / "my_collection.json" + json_path.write_text( + """ +{ + "name": "my_collection", + "label": "My collection", + "surveys": { + "survey_1": { + "label": "Survey 1", + "informations": { + "csv_files": ["/data/s1/file1.csv"] + } + }, + "survey_2": { + "label": "Survey 2", + "informations": { + "sas_files": ["/data/s2/file.sas7bdat"] + } + } + } +} +""", + encoding="utf-8", + ) + config_ini = config_dir / "config.ini" + config_ini.write_text( + f"""[collections] +collections_directory = {collections_dir} +my_collection = {json_path} + +[data] +output_directory = {tmp_path / "output"} +tmp_directory = /tmp +""", + encoding="utf-8", + ) + return config_dir + + +def test_migrate_produces_config_yaml_and_manifests(legacy_config_dir): + """Migration script creates config.yaml and collection manifest.""" + from openfisca_survey_manager.scripts.migrate_config_to_rfc002 import ( + CONFIG_FILENAME, + MANIFEST_FILENAME, + migrate, + ) + + ok = migrate(legacy_config_dir, dry_run=False) + assert ok is True + config_yaml = legacy_config_dir / CONFIG_FILENAME + assert config_yaml.is_file() + cfg = load_config(legacy_config_dir) + assert cfg is not None + assert "collections_dir" in cfg + assert (Path(cfg["collections_dir"]) / "my_collection" / MANIFEST_FILENAME).is_file() + manifest = load_manifest(cfg["collections_dir"], "my_collection") + assert manifest is not None + assert manifest["name"] == "my_collection" + assert manifest.get("store_format") == "parquet" + assert manifest["surveys"]["survey_1"]["source"]["format"] == "csv" + assert manifest["surveys"]["survey_1"]["source"]["path"] == "/data/s1/file1.csv" + assert manifest["surveys"]["survey_2"]["source"]["format"] == "sas" + assert manifest["surveys"]["survey_2"]["source"]["path"] == "/data/s2/file.sas7bdat" + + +def test_migrate_infers_store_format_from_legacy(tmp_path): + """Migration infers store_format from legacy JSON (hdf5_file_path -> hdf5, etc.).""" + from openfisca_survey_manager.scripts.migrate_config_to_rfc002 import ( + _infer_store_format_from_legacy, + build_manifest_from_json, + ) + + # Legacy with parquet_file_path + json_parquet = tmp_path / "p.json" + json_parquet.write_text( + '{"name":"p","label":"P","surveys":{"s1":{"label":"S1","parquet_file_path":"/out/s1"}}}', + encoding="utf-8", + ) + manifest_parquet = build_manifest_from_json(json_parquet, None) + assert manifest_parquet["store_format"] == "parquet" + + # Legacy with hdf5_file_path only + json_hdf5 = tmp_path / "h.json" + json_hdf5.write_text( + '{"name":"h","label":"H","surveys":{"s1":{"label":"S1","hdf5_file_path":"/out/s1.h5"}}}', + encoding="utf-8", + ) + manifest_hdf5 = build_manifest_from_json(json_hdf5, None) + assert manifest_hdf5["store_format"] == "hdf5" + + # Infer function directly + assert _infer_store_format_from_legacy({}) == "parquet" + assert _infer_store_format_from_legacy({"s": {"zarr_file_path": "/z"}}) == "zarr" + + +def test_migrate_dry_run_does_not_write(legacy_config_dir): + """Migration with --dry-run does not create files.""" + from openfisca_survey_manager.scripts.migrate_config_to_rfc002 import ( + CONFIG_FILENAME, + migrate, + ) + + ok = migrate(legacy_config_dir, dry_run=True) + assert ok is True + assert not (legacy_config_dir / CONFIG_FILENAME).is_file() diff --git a/openfisca_survey_manager/tests/test_coverage_boost.py b/openfisca_survey_manager/tests/test_coverage_boost.py index 3e1e6930..845843e1 100644 --- a/openfisca_survey_manager/tests/test_coverage_boost.py +++ b/openfisca_survey_manager/tests/test_coverage_boost.py @@ -5,22 +5,21 @@ import pytest from openfisca_core import periods -from openfisca_survey_manager.aggregates import AbstractAggregates -from openfisca_survey_manager.input_dataframe_generator import ( - make_input_dataframe_by_entity, - randomly_init_variable, - set_table_in_survey, -) -from openfisca_survey_manager.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.core.dataset import SurveyCollection +from
openfisca_survey_manager.core.survey import Survey +from openfisca_survey_manager.policy import AbstractAggregates +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.policy.variables import quantile from openfisca_survey_manager.scripts.build_collection import ( check_template_config_files, create_data_file_by_format, ) -from openfisca_survey_manager.survey_collections import SurveyCollection -from openfisca_survey_manager.surveys import Survey from openfisca_survey_manager.tests import tax_benefit_system -from openfisca_survey_manager.utils import do_nothing -from openfisca_survey_manager.variables import quantile +from openfisca_survey_manager.tests.input_dataframe_generator import ( + make_input_dataframe_by_entity, + randomly_init_variable, + set_table_in_survey, +) def setup_test_config(config_files_directory: Path): @@ -135,17 +134,6 @@ def test_build_collection_helpers(tmp_path): create_data_file_by_format(str(tmp_path)) -def test_google_colab_boost(): - from openfisca_survey_manager.google_colab import create_raw_data_ini - - with suppress(Exception): - create_raw_data_ini({"test": {"opt": "val"}}) - - -def test_utils_do_nothing(): - assert do_nothing(1, a=2) is None - - def test_matching_mock_extended(monkeypatch): import sys @@ -168,7 +156,7 @@ def test_matching_mock_extended(monkeypatch): ) monkeypatch.setitem(sys.modules, "rpy2", fake_rpy2) monkeypatch.setitem(sys.modules, "rpy2.robjects", fake_rpy2.robjects) - from openfisca_survey_manager.matching import nnd_hotdeck_using_feather, nnd_hotdeck_using_rpy2 + from openfisca_survey_manager.policy.matching import nnd_hotdeck_using_feather, nnd_hotdeck_using_rpy2 receiver = pd.DataFrame({"a": [1], "c": [1]}) donor = pd.DataFrame({"a": [1], "b": [2], "c": [1]}) diff --git a/openfisca_survey_manager/tests/test_enum.py b/openfisca_survey_manager/tests/test_enum.py index 8189acb8..8d6277c4 100644 --- a/openfisca_survey_manager/tests/test_enum.py +++ b/openfisca_survey_manager/tests/test_enum.py @@ -1,7 +1,7 @@ import pandas as pd from openfisca_country_template.variables.housing import HousingOccupancyStatus -from openfisca_survey_manager.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario from openfisca_survey_manager.tests import tax_benefit_system diff --git a/openfisca_survey_manager/tests/test_legislation_inflator.py b/openfisca_survey_manager/tests/test_legislation_inflator.py index 7344a7b7..986174fb 100644 --- a/openfisca_survey_manager/tests/test_legislation_inflator.py +++ b/openfisca_survey_manager/tests/test_legislation_inflator.py @@ -1,7 +1,7 @@ from openfisca_core import periods from openfisca_country_template import CountryTaxBenefitSystem -from openfisca_survey_manager.utils import inflate_parameters, parameters_asof +from openfisca_survey_manager.policy.legislation_asof import inflate_parameters, parameters_asof def test_asof_simple_annual_parameter(): diff --git a/openfisca_survey_manager/tests/test_matching.py b/openfisca_survey_manager/tests/test_matching.py index 10d91255..ecc1b39e 100644 --- a/openfisca_survey_manager/tests/test_matching.py +++ b/openfisca_survey_manager/tests/test_matching.py @@ -2,7 +2,7 @@ import pandas as pd -from openfisca_survey_manager.matching import nnd_hotdeck_using_rpy2 +from openfisca_survey_manager.policy.matching import nnd_hotdeck_using_rpy2 try: import rpy2 diff --git 
a/openfisca_survey_manager/tests/test_parquet.py b/openfisca_survey_manager/tests/test_parquet.py index 6612874f..d1f4b51b 100644 --- a/openfisca_survey_manager/tests/test_parquet.py +++ b/openfisca_survey_manager/tests/test_parquet.py @@ -4,13 +4,13 @@ import pytest from openfisca_core import periods -from openfisca_survey_manager.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.core.dataset import SurveyCollection +from openfisca_survey_manager.core.survey import NoMoreDataError +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario from openfisca_survey_manager.scripts.build_collection import ( add_survey_to_collection, build_survey_collection, ) -from openfisca_survey_manager.survey_collections import SurveyCollection -from openfisca_survey_manager.surveys import NoMoreDataError from openfisca_survey_manager.tests import tax_benefit_system logger = logging.getLogger(__name__) diff --git a/openfisca_survey_manager/tests/test_quantile.py b/openfisca_survey_manager/tests/test_quantile.py index 6c33c69b..87b40a1c 100644 --- a/openfisca_survey_manager/tests/test_quantile.py +++ b/openfisca_survey_manager/tests/test_quantile.py @@ -4,11 +4,11 @@ from openfisca_core.model_api import YEAR, Variable from openfisca_core.taxbenefitsystems import TaxBenefitSystem -from openfisca_survey_manager.paths import default_config_files_directory -from openfisca_survey_manager.scenarios.abstract_scenario import AbstractSurveyScenario -from openfisca_survey_manager.statshelpers import mark_weighted_percentiles +from openfisca_survey_manager.configuration.paths import default_config_files_directory +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.policy.statshelpers import mark_weighted_percentiles +from openfisca_survey_manager.policy.variables import quantile from openfisca_survey_manager.tests.test_scenario import setup_test_config -from openfisca_survey_manager.variables import quantile Individu = build_entity( key="individu", diff --git a/openfisca_survey_manager/tests/test_read_sas.py b/openfisca_survey_manager/tests/test_read_sas.py index 455c87b6..27a77ef8 100644 --- a/openfisca_survey_manager/tests/test_read_sas.py +++ b/openfisca_survey_manager/tests/test_read_sas.py @@ -6,8 +6,8 @@ from pandas.testing import assert_frame_equal -from openfisca_survey_manager.paths import openfisca_survey_manager_location -from openfisca_survey_manager.read_sas import read_sas +from openfisca_survey_manager.configuration.paths import openfisca_survey_manager_location +from openfisca_survey_manager.io.readers import read_sas def test(): diff --git a/openfisca_survey_manager/tests/test_scenario.py b/openfisca_survey_manager/tests/test_scenario.py index 13fa5c0a..941e5626 100644 --- a/openfisca_survey_manager/tests/test_scenario.py +++ b/openfisca_survey_manager/tests/test_scenario.py @@ -7,17 +7,17 @@ from openfisca_core import periods from openfisca_core.tools import assert_near -from openfisca_survey_manager.input_dataframe_generator import ( +from openfisca_survey_manager.configuration.paths import ( + default_config_files_directory, +) +from openfisca_survey_manager.policy.scenarios.abstract_scenario import AbstractSurveyScenario +from openfisca_survey_manager.policy.scenarios.reform_scenario import ReformScenario +from openfisca_survey_manager.tests import tax_benefit_system +from openfisca_survey_manager.tests.input_dataframe_generator import ( 
make_input_dataframe_by_entity, random_data_generator, randomly_init_variable, ) -from openfisca_survey_manager.paths import ( - default_config_files_directory, -) -from openfisca_survey_manager.scenarios.abstract_scenario import AbstractSurveyScenario -from openfisca_survey_manager.scenarios.reform_scenario import ReformScenario -from openfisca_survey_manager.tests import tax_benefit_system log = logging.getLogger(__name__) diff --git a/openfisca_survey_manager/tests/test_store_backends.py b/openfisca_survey_manager/tests/test_store_backends.py new file mode 100644 index 00000000..6e08e56b --- /dev/null +++ b/openfisca_survey_manager/tests/test_store_backends.py @@ -0,0 +1,50 @@ +"""Tests for store backends (HDF5, Parquet, Zarr).""" + +from pathlib import Path + +import pandas as pd +import pytest + +from openfisca_survey_manager.io.backends import ( + get_available_backend_names, + get_backend, +) + + +def test_get_backend_hdf5(): + backend = get_backend("hdf5") + assert backend is not None + assert hasattr(backend, "write_table") and hasattr(backend, "read_table") + assert hasattr(backend, "table_exists") + + +def test_get_backend_parquet(): + backend = get_backend("parquet") + assert backend is not None + + +def test_get_backend_invalid_raises(): + with pytest.raises(ValueError, match="Unknown store backend"): + get_backend("invalid_format") + + +def test_parquet_backend_roundtrip(tmp_path): + backend = get_backend("parquet") + store_path = str(tmp_path / "survey") + store_path_path = Path(store_path) + store_path_path.mkdir(parents=True) + df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}) + backend.write_table(store_path, "mytable", df) + assert backend.table_exists(store_path, "mytable") + df2 = backend.read_table(store_path, "mytable") + pd.testing.assert_frame_equal(df, df2) + df3 = backend.read_table(store_path, "mytable", variables=["a"]) + assert list(df3.columns) == ["a"] + + +def test_available_backends_include_hdf5_parquet(): + names = get_available_backend_names() + assert "hdf5" in names + assert "parquet" in names + # zarr only if zarr package installed + assert len(names) >= 2 diff --git a/openfisca_survey_manager/tests/test_surveys.py b/openfisca_survey_manager/tests/test_surveys.py index de2556be..d3891ca7 100644 --- a/openfisca_survey_manager/tests/test_surveys.py +++ b/openfisca_survey_manager/tests/test_surveys.py @@ -1,9 +1,9 @@ import pandas as pd import pytest -from openfisca_survey_manager.input_dataframe_generator import set_table_in_survey -from openfisca_survey_manager.survey_collections import SurveyCollection -from openfisca_survey_manager.surveys import Survey +from openfisca_survey_manager.core.dataset import SurveyCollection +from openfisca_survey_manager.core.survey import Survey +from openfisca_survey_manager.tests.input_dataframe_generator import set_table_in_survey @pytest.fixture diff --git a/openfisca_survey_manager/tests/test_tax_benefit_system_asof.py b/openfisca_survey_manager/tests/test_tax_benefit_system_asof.py index 994f7f44..3318df92 100644 --- a/openfisca_survey_manager/tests/test_tax_benefit_system_asof.py +++ b/openfisca_survey_manager/tests/test_tax_benefit_system_asof.py @@ -2,7 +2,7 @@ from openfisca_core.parameters import ParameterNode, Scale from openfisca_country_template import CountryTaxBenefitSystem -from openfisca_survey_manager.utils import parameters_asof, variables_asof +from openfisca_survey_manager.policy.legislation_asof import parameters_asof, variables_asof def check_max_instant_leaf(sub_parameter, instant):
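The store-backend API added in this release and exercised by test_store_backends.py above is small; a minimal usage sketch, with an illustrative temporary directory (the zarr backend only appears when the optional zarr extra is installed):

import tempfile
from pathlib import Path

import pandas as pd

from openfisca_survey_manager.io.backends import get_available_backend_names, get_backend

backend = get_backend("parquet")  # or "hdf5", or "zarr" when installed
store = Path(tempfile.mkdtemp()) / "survey"  # illustrative store location
store.mkdir(parents=True)

# Round-trip one table, as the tests above do
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
backend.write_table(str(store), "mytable", df)
assert backend.table_exists(str(store), "mytable")
subset = backend.read_table(str(store), "mytable", variables=["a"])  # column projection
assert list(subset.columns) == ["a"]

print(get_available_backend_names())  # e.g. ['hdf5', 'parquet'] plus 'zarr' when available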
diff --git a/openfisca_survey_manager/tests/test_top_bottom_share.py b/openfisca_survey_manager/tests/test_top_bottom_share.py index 05497b56..acf6b362 100644 --- a/openfisca_survey_manager/tests/test_top_bottom_share.py +++ b/openfisca_survey_manager/tests/test_top_bottom_share.py @@ -1,6 +1,6 @@ import numpy as np -from openfisca_survey_manager.statshelpers import bottom_share, top_share +from openfisca_survey_manager.policy.statshelpers import bottom_share, top_share size = 1000 x = np.ones(size) + np.random.uniform(0, 0.00000001, size) diff --git a/openfisca_survey_manager/utils.py b/openfisca_survey_manager/utils.py deleted file mode 100644 index 24ba9c9e..00000000 --- a/openfisca_survey_manager/utils.py +++ /dev/null @@ -1,70 +0,0 @@ -"""Utilities: re-exports from common.misc + load_table (survey-dependent).""" - -import logging -from typing import Optional - -import pandas as pd - -from openfisca_survey_manager.common.misc import ( - asof, - do_nothing, - inflate_parameter_leaf, - inflate_parameters, - parameters_asof, - stata_files_to_data_frames, - variables_asof, -) -from openfisca_survey_manager.survey_collections import SurveyCollection - -log = logging.getLogger(__name__) - -__all__ = [ - "asof", - "do_nothing", - "inflate_parameter_leaf", - "inflate_parameters", - "load_table", - "parameters_asof", - "stata_files_to_data_frames", - "variables_asof", -] - - -def load_table( - config_files_directory, - variables: Optional[list] = None, - collection: Optional[str] = None, - survey: Optional[str] = None, - input_data_survey_prefix: Optional[str] = None, - data_year=None, - table: Optional[str] = None, - batch_size=None, - batch_index=0, - filter_by=None, -) -> pd.DataFrame: - """ - Load values from a table of a survey in a collection. - - Args: - config_files_directory: Directory containing the configuration files. - variables (List, optional): List of the variables to retrieve in the table. - Defaults to None to get all the variables. - collection (str, optional): Collection. Defaults to None. - survey (str, optional): Survey. Defaults to None. - input_data_survey_prefix (str, optional): Prefix of the survey to be combined with data year. Defaults to None. - data_year (optional): Year of the survey data. Defaults to None. - table (str, optional): Table. Defaults to None. - batch_size (int, optional): Number of rows per batch when reading by batches. Defaults to None. - batch_index (int, optional): Index of the batch to read. Defaults to 0. - filter_by (optional): Filter applied when reading the table. Defaults to None.
- - Returns: - pandas.DataFrame: A table with the retrieved variables - """ - survey_collection = SurveyCollection.load(collection=collection, config_files_directory=config_files_directory) - survey = survey if survey is not None else f"{input_data_survey_prefix}_{data_year}" - survey_ = survey_collection.get_survey(survey) - log.debug(f"Loading table {table} in survey {survey} from collection {collection}") - if batch_size: - return survey_.get_values( - table=table, variables=variables, batch_size=batch_size, batch_index=batch_index, filter_by=filter_by - ) - else: - return survey_.get_values(table=table, variables=variables, filter_by=filter_by) diff --git a/pyproject.toml b/pyproject.toml index a25a5be4..0e6c7805 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "OpenFisca-Survey-Manager" -version = "6.3.1" +version = "8.0.0" description = "A tool for managing survey/administrative data and importing them in OpenFisca" readme = "README.md" keywords = ["microsimulation", "tax", "benefit", "rac", "rules-as-code", "survey", "data"] @@ -52,6 +52,9 @@ build-backend = "setuptools.build_meta" build-collection = "openfisca_survey_manager.scripts.build_collection:main" [project.optional-dependencies] +zarr = [ + 'zarr >=2.18.0, < 3.0', +] matching = [ # 'feather', 'rpy2 >=3.5.10, < 4.0' diff --git a/uv.lock b/uv.lock index 966c17f3..a5c605b4 100644 --- a/uv.lock +++ b/uv.lock @@ -9,6 +9,12 @@ resolution-markers = [ "python_full_version < '3.10'", ] +[[package]] +name = "asciitree" +version = "0.3.3" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/2d/6a/885bc91484e1aa8f618f6f0228d76d0e67000b0fdd6090673b777e311913/asciitree-0.3.3.tar.gz", hash = "sha256:4aa4b9b649f85e3fcb343363d97564aa1fb62e249677f2e18a96765145cc0f6e", size = 3951 } + [[package]] name = "asttokens" version = "3.0.0" @@ -439,6 +445,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/4e/8c/f3147f5c4b73e7550fe5f9352eaa956ae838d5c51eb58e7a25b9f3e2643b/decorator-5.2.1-py3-none-any.whl", hash = "sha256:d316bb415a2d9e2d2b3abcc4084c6502fc09240e292cd76a76afc106a1c8e04a", size = 9190 }, ] +[[package]] +name = "deprecated" +version = "1.3.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "wrapt", marker = "python_full_version >= '3.11'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/49/85/12f0a49a7c4ffb70572b6c2ef13c90c88fd190debda93b23f026b25f9634/deprecated-1.3.1.tar.gz", hash = "sha256:b1b50e0ff0c1fddaa5708a2c6b0a6588bb09b892825ab2b214ac9ea9d92a5223", size = 2932523 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/84/d0/205d54408c08b13550c733c4b85429e7ead111c7f0014309637425520a9a/deprecated-1.3.1-py2.py3-none-any.whl", hash = "sha256:597bfef186b6f60181535a29fbe44865ce137a5079f295b479886c82729d5f3f", size = 11298 }, +] + [[package]] name = "distlib" version = "0.4.0" @@ -497,6 +515,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/30/c3/6f0e3896f193528bbd2b4d2122d4be8108a37efab0b8475855556a8c4afa/fancycompleter-0.11.1-py3-none-any.whl", hash = "sha256:44243d7fab37087208ca5acacf8f74c0aa4d733d04d593857873af7513cdf8a6", size = 11207 }, ] +[[package]] +name = "fasteners" +version = "0.20" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/2d/18/7881a99ba5244bfc82f06017316ffe93217dbbbcfa52b887caa1d4f2a6d3/fasteners-0.20.tar.gz", hash =
"sha256:55dce8792a41b56f727ba6e123fcaee77fd87e638a6863cec00007bfea84c8d8", size = 25087 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/51/ac/e5d886f892666d2d1e5cb8c1a41146e1d79ae8896477b1153a21711d3b44/fasteners-0.20-py3-none-any.whl", hash = "sha256:9422c40d1e350e4259f509fb2e608d6bc43c0136f79a00db1b49046029d0b3b7", size = 18702 }, +] + [[package]] name = "filelock" version = "3.19.1" @@ -999,6 +1026,96 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/d2/1d/1b658dbd2b9fa9c4c9f32accbfc0205d532c8c6194dc0f2a4c0428e7128a/nodeenv-1.9.1-py2.py3-none-any.whl", hash = "sha256:ba11c9782d29c27c70ffbdda2d7415098754709be8a7056d79a737cd901155c9", size = 22314 }, ] +[[package]] +name = "numcodecs" +version = "0.12.1" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version < '3.10'", +] +dependencies = [ + { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/b7/1b/1f1d880e29e719c7c6205065d1afbc91114c0d91935ac419faa43e5e08b0/numcodecs-0.12.1.tar.gz", hash = "sha256:05d91a433733e7eef268d7e80ec226a0232da244289614a8f3826901aec1098e", size = 4091415 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/9f/66/08744c9007f1d02476dd97f3c23032f3555dbb8e9a32b0f0ea4724e6b2a2/numcodecs-0.12.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:d37f628fe92b3699e65831d5733feca74d2e33b50ef29118ffd41c13c677210e", size = 1696843 }, + { url = "https://files.pythonhosted.org/packages/b8/6f/a04a33c5edb8fa9ba63783d34ff5768ba6b562ebe11078c07848e283f4ad/numcodecs-0.12.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:941b7446b68cf79f089bcfe92edaa3b154533dcbcd82474f994b28f2eedb1c60", size = 1422578 }, + { url = "https://files.pythonhosted.org/packages/1e/b8/1040f299803eacc9c522fdc69a4dafc42ad0e8722bb48aa43d2310cf195b/numcodecs-0.12.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0e79bf9d1d37199ac00a60ff3adb64757523291d19d03116832e600cac391c51", size = 7709402 }, + { url = "https://files.pythonhosted.org/packages/8c/fa/da0637e1a6db74361a2875425021957859749166c0174ddedbb629518970/numcodecs-0.12.1-cp310-cp310-win_amd64.whl", hash = "sha256:82d7107f80f9307235cb7e74719292d101c7ea1e393fe628817f0d635b7384f5", size = 790204 }, + { url = "https://files.pythonhosted.org/packages/10/63/a50f4113a2bb1decfaedeffc448c5f8b26ded1c583247c893120fcd25e3e/numcodecs-0.12.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:eeaf42768910f1c6eebf6c1bb00160728e62c9343df9e2e315dc9fe12e3f6071", size = 1696786 }, + { url = "https://files.pythonhosted.org/packages/92/77/0fde34bf3a8402d696218a565230097d904c9eebb62cd952923b1155b7f7/numcodecs-0.12.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:135b2d47563f7b9dc5ee6ce3d1b81b0f1397f69309e909f1a35bb0f7c553d45e", size = 1422330 }, + { url = "https://files.pythonhosted.org/packages/14/e6/8f9d4a498a06f11a06297f0b02af9968844d2e40ee79d372ccee33595285/numcodecs-0.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a191a8e347ecd016e5c357f2bf41fbcb026f6ffe78fff50c77ab12e96701d155", size = 7949787 }, + { url = "https://files.pythonhosted.org/packages/08/f3/44597198c2cfb0d808d68583445b60b0d0ae057f20f0caf2a1200405655e/numcodecs-0.12.1-cp311-cp311-win_amd64.whl", hash = "sha256:21d8267bd4313f4d16f5b6287731d4c8ebdab236038f29ad1b0e93c9b2ca64ee", size = 790313 }, + { url = 
"https://files.pythonhosted.org/packages/d7/b2/7842675a798e79686d14a20baa554b165aab86feac28f32695266ab42b7e/numcodecs-0.12.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:2f84df6b8693206365a5b37c005bfa9d1be486122bde683a7b6446af4b75d862", size = 1697725 }, + { url = "https://files.pythonhosted.org/packages/fc/1f/e3b033181a28ce153fd0c9acd3ed978ee9c424de7cc3d8e97fc60647eddf/numcodecs-0.12.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:760627780a8b6afdb7f942f2a0ddaf4e31d3d7eea1d8498cf0fd3204a33c4618", size = 1423927 }, + { url = "https://files.pythonhosted.org/packages/3b/88/fb3186f944b9586e9c4c54bd1d1899947b88465ad3ab1ff1111066871644/numcodecs-0.12.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c258bd1d3dfa75a9b708540d23b2da43d63607f9df76dfa0309a7597d1de3b73", size = 7944856 }, + { url = "https://files.pythonhosted.org/packages/f4/03/54e22e273d584e83100ffa60c47c29cae905015ecb1f693918072c3595b9/numcodecs-0.12.1-cp312-cp312-win_amd64.whl", hash = "sha256:e04649ea504aff858dbe294631f098fbfd671baf58bfc04fc48d746554c05d67", size = 787000 }, + { url = "https://files.pythonhosted.org/packages/dd/3c/950f816b837fc7714102b45491e2612b10757106f9a8e3785d7b3806acd4/numcodecs-0.12.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:2fbb12a6a1abe95926f25c65e283762d63a9bf9e43c0de2c6a1a798347dfcb40", size = 1700073 }, + { url = "https://files.pythonhosted.org/packages/76/2f/19f4f012f253ff33948a024e0a814c758ea137e3ba86118daac83a8d9123/numcodecs-0.12.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:f2207871868b2464dc11c513965fd99b958a9d7cde2629be7b2dc84fdaab013b", size = 1425835 }, + { url = "https://files.pythonhosted.org/packages/6d/0f/0442e80d707b5dd2e177a9490c25b89aa6a6c44579de8ec223e78a8884da/numcodecs-0.12.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:abff3554a6892a89aacf7b642a044e4535499edf07aeae2f2e6e8fc08c9ba07f", size = 7722207 }, + { url = "https://files.pythonhosted.org/packages/77/b6/345f8648874a81232bc1a87e55a771430488a832c68f873aa6ed23a1dedf/numcodecs-0.12.1-cp39-cp39-win_amd64.whl", hash = "sha256:ef964d4860d3e6b38df0633caf3e51dc850a6293fd8e93240473642681d95136", size = 792870 }, +] + +[[package]] +name = "numcodecs" +version = "0.13.1" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version == '3.10.*'", +] +dependencies = [ + { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/85/56/8895a76abe4ec94ebd01eeb6d74f587bc4cddd46569670e1402852a5da13/numcodecs-0.13.1.tar.gz", hash = "sha256:a3cf37881df0898f3a9c0d4477df88133fe85185bffe57ba31bcc2fa207709bc", size = 5955215 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/14/c0/6d72cde772bcec196b7188731d41282993b2958440f77fdf0db216f722da/numcodecs-0.13.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:96add4f783c5ce57cc7e650b6cac79dd101daf887c479a00a29bc1487ced180b", size = 1580012 }, + { url = "https://files.pythonhosted.org/packages/94/1d/f81fc1fa9210bbea97258242393a1f9feab4f6d8fb201f81f76003005e4b/numcodecs-0.13.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:237b7171609e868a20fd313748494444458ccd696062f67e198f7f8f52000c15", size = 1176919 }, + { url = "https://files.pythonhosted.org/packages/16/e4/b9ec2f4dfc34ecf724bc1beb96a9f6fa9b91801645688ffadacd485089da/numcodecs-0.13.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = 
"sha256:96e42f73c31b8c24259c5fac6adba0c3ebf95536e37749dc6c62ade2989dca28", size = 8625842 }, + { url = "https://files.pythonhosted.org/packages/fe/90/299952e1477954ec4f92813fa03e743945e3ff711bb4f6c9aace431cb3da/numcodecs-0.13.1-cp310-cp310-win_amd64.whl", hash = "sha256:eda7d7823c9282e65234731fd6bd3986b1f9e035755f7fed248d7d366bb291ab", size = 828638 }, + { url = "https://files.pythonhosted.org/packages/f0/78/34b8e869ef143e88d62e8231f4dbfcad85e5c41302a11fc5bd2228a13df5/numcodecs-0.13.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:2eda97dd2f90add98df6d295f2c6ae846043396e3d51a739ca5db6c03b5eb666", size = 1580199 }, + { url = "https://files.pythonhosted.org/packages/3b/cf/f70797d86bb585d258d1e6993dced30396f2044725b96ce8bcf87a02be9c/numcodecs-0.13.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:2a86f5367af9168e30f99727ff03b27d849c31ad4522060dde0bce2923b3a8bc", size = 1177203 }, + { url = "https://files.pythonhosted.org/packages/a8/b5/d14ad69b63fde041153dfd05d7181a49c0d4864de31a7a1093c8370da957/numcodecs-0.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:233bc7f26abce24d57e44ea8ebeb5cd17084690b4e7409dd470fdb75528d615f", size = 8868743 }, + { url = "https://files.pythonhosted.org/packages/13/d4/27a7b5af0b33f6d61e198faf177fbbf3cb83ff10d9d1a6857b7efc525ad5/numcodecs-0.13.1-cp311-cp311-win_amd64.whl", hash = "sha256:796b3e6740107e4fa624cc636248a1580138b3f1c579160f260f76ff13a4261b", size = 829603 }, + { url = "https://files.pythonhosted.org/packages/37/3a/bc09808425e7d3df41e5fc73fc7a802c429ba8c6b05e55f133654ade019d/numcodecs-0.13.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:5195bea384a6428f8afcece793860b1ab0ae28143c853f0b2b20d55a8947c917", size = 1575806 }, + { url = "https://files.pythonhosted.org/packages/3a/cc/dc74d0bfdf9ec192332a089d199f1e543e747c556b5659118db7a437dcca/numcodecs-0.13.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:3501a848adaddce98a71a262fee15cd3618312692aa419da77acd18af4a6a3f6", size = 1178233 }, + { url = "https://files.pythonhosted.org/packages/d4/ce/434e8e3970b8e92ae9ab6d9db16cb9bc7aa1cd02e17c11de6848224100a1/numcodecs-0.13.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:da2230484e6102e5fa3cc1a5dd37ca1f92dfbd183d91662074d6f7574e3e8f53", size = 8857827 }, + { url = "https://files.pythonhosted.org/packages/83/e7/1d8b1b266a92f9013c755b1c146c5ad71a2bff147ecbc67f86546a2e4d6a/numcodecs-0.13.1-cp312-cp312-win_amd64.whl", hash = "sha256:e5db4824ebd5389ea30e54bc8aeccb82d514d28b6b68da6c536b8fa4596f4bca", size = 826539 }, + { url = "https://files.pythonhosted.org/packages/83/8b/06771dead2cc4a8ae1ea9907737cf1c8d37a323392fa28f938a586373468/numcodecs-0.13.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:7a60d75179fd6692e301ddfb3b266d51eb598606dcae7b9fc57f986e8d65cb43", size = 1571660 }, + { url = "https://files.pythonhosted.org/packages/f9/ea/d925bf85f92dfe4635356018da9fe4bfecb07b1c72f62b01c1bc47f936b1/numcodecs-0.13.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:3f593c7506b0ab248961a3b13cb148cc6e8355662ff124ac591822310bc55ecf", size = 1169925 }, + { url = "https://files.pythonhosted.org/packages/0f/d6/643a3839d571d8e439a2c77dc4b0b8cab18d96ac808e4a81dbe88e959ab6/numcodecs-0.13.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:80d3071465f03522e776a31045ddf2cfee7f52df468b977ed3afdd7fe5869701", size = 8814257 }, + { url = 
"https://files.pythonhosted.org/packages/a6/c5/f3e56bc9b4e438a287fff738993d6d11abef368c0328a612ac2842ba9fca/numcodecs-0.13.1-cp313-cp313-win_amd64.whl", hash = "sha256:90d3065ae74c9342048ae0046006f99dcb1388b7288da5a19b3bddf9c30c3176", size = 821887 }, +] + +[[package]] +name = "numcodecs" +version = "0.15.1" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version >= '3.13'", + "python_full_version == '3.12.*'", + "python_full_version == '3.11.*'", +] +dependencies = [ + { name = "deprecated", marker = "python_full_version >= '3.11'" }, + { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11' and python_full_version < '3.13'" }, + { name = "numpy", version = "2.4.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/63/fc/bb532969eb8236984ba65e4f0079a7da885b8ac0ce1f0835decbb3938a62/numcodecs-0.15.1.tar.gz", hash = "sha256:eeed77e4d6636641a2cc605fbc6078c7a8f2cc40f3dfa2b3f61e52e6091b04ff", size = 6267275 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/e4/fc/410f1cacaef0931f5daf06813b1b8a2442f7418ee284ec73fe5e830dca48/numcodecs-0.15.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:698f1d59511488b8fe215fadc1e679a4c70d894de2cca6d8bf2ab770eed34dfd", size = 1649501 }, + { url = "https://files.pythonhosted.org/packages/85/29/dff62fae04323035912c419a82dc9624fad7d08541dbfcd9ab78a3a40074/numcodecs-0.15.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:bef8c8e64fab76677324a07672b10c31861775d03fc63ed5012ca384144e4bb9", size = 1187306 }, + { url = "https://files.pythonhosted.org/packages/a6/a8/908a226632ffabf19caf8c99f1b2898f2f22aac02795a6fe9d018fd6d9dd/numcodecs-0.15.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cdfaef9f5f2ed8f65858db801f1953f1007c9613ee490a1c56233cd78b505ed5", size = 8891971 }, + { url = "https://files.pythonhosted.org/packages/2b/e8/058aac43e1300d588e99b2d0d5b771c8a43fa92ce9c9517da596869fc146/numcodecs-0.15.1-cp311-cp311-win_amd64.whl", hash = "sha256:e2547fa3a7ffc9399cfd2936aecb620a3db285f2630c86c8a678e477741a4b3c", size = 840035 }, + { url = "https://files.pythonhosted.org/packages/e7/7e/f12fc32d3beedc6a8f1ec69ea0ba72e93cb99c0350feed2cff5d04679bc3/numcodecs-0.15.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:b0a9d9cd29a0088220682dda4a9898321f7813ff7802be2bbb545f6e3d2f10ff", size = 1691889 }, + { url = "https://files.pythonhosted.org/packages/81/38/88e40d40288b73c3b3a390ed5614a34b0661d00255bdd4cfb91c32101364/numcodecs-0.15.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:a34f0fe5e5f3b837bbedbeb98794a6d4a12eeeef8d4697b523905837900b5e1c", size = 1189149 }, + { url = "https://files.pythonhosted.org/packages/28/7d/7527d9180bc76011d6163c848c9cf02cd28a623c2c66cf543e1e86de7c5e/numcodecs-0.15.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c3a09e22140f2c691f7df26303ff8fa2dadcf26d7d0828398c0bc09b69e5efa3", size = 8879163 }, + { url = "https://files.pythonhosted.org/packages/ab/bc/b6c3cde91c754860a3467a8c058dcf0b1a5ca14d82b1c5397c700cf8b1eb/numcodecs-0.15.1-cp312-cp312-win_amd64.whl", hash = "sha256:daed6066ffcf40082da847d318b5ab6123d69ceb433ba603cb87c323a541a8bc", size = 836785 }, + { url = "https://files.pythonhosted.org/packages/78/57/acbc54b3419e5be65015e47177c76c0a73e037fd3ae2cde5808169194d4d/numcodecs-0.15.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = 
"sha256:e3d82b70500cf61e8d115faa0d0a76be6ecdc24a16477ee3279d711699ad85f3", size = 1688220 }, + { url = "https://files.pythonhosted.org/packages/b6/56/9863fa6dc679f40a31bea5e9713ee5507a31dcd3ee82ea4b1a9268ce52e8/numcodecs-0.15.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:1d471a1829ce52d3f365053a2bd1379e32e369517557c4027ddf5ac0d99c591e", size = 1180294 }, + { url = "https://files.pythonhosted.org/packages/fa/91/d96999b41e3146b6c0ce6bddc5ad85803cb4d743c95394562c2a4bb8cded/numcodecs-0.15.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1dfdea4a67108205edfce99c1cb6cd621343bc7abb7e16a041c966776920e7de", size = 8834323 }, + { url = "https://files.pythonhosted.org/packages/c3/32/233e5ede6568bdb044e6f99aaa9fa39827ff3109c6487fc137315f733586/numcodecs-0.15.1-cp313-cp313-win_amd64.whl", hash = "sha256:a4f7bdb26f1b34423cb56d48e75821223be38040907c9b5954eeb7463e7eb03c", size = 831955 }, +] + [[package]] name = "numexpr" version = "2.10.2" @@ -1272,7 +1389,7 @@ wheels = [ [[package]] name = "openfisca-survey-manager" -version = "6.3.0" +version = "1.0.0" source = { editable = "." } dependencies = [ { name = "chardet" }, @@ -1327,6 +1444,11 @@ sas = [ { name = "pyreadstat" }, { name = "sas7bdat" }, ] +zarr = [ + { name = "zarr", version = "2.18.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, + { name = "zarr", version = "2.18.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" }, + { name = "zarr", version = "2.18.7", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, +] [package.metadata] requires-dist = [ @@ -1364,8 +1486,9 @@ requires-dist = [ { name = "tabulate", specifier = ">=0.9.0,<0.10.0" }, { name = "weightedcalcs", specifier = ">=0.1.2,<0.2.0" }, { name = "wquantiles", specifier = ">=0.6,<0.7" }, + { name = "zarr", marker = "extra == 'zarr'", specifier = ">=2.18.0,<3.0" }, ] -provides-extras = ["matching", "dev", "casd", "sas"] +provides-extras = ["zarr", "matching", "dev", "casd", "sas"] [[package]] name = "packaging" @@ -2430,6 +2553,144 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/f7/75/3cce30508bf46121b7cabce57b9cacbf8d935fa555cb3c5fca43f8dd0414/wquantiles-0.6-py3-none-any.whl", hash = "sha256:1b90d68fa05251bb96f8806a346e8d7dec9a9bb99f381ad5094707b46ab85218", size = 3291 }, ] +[[package]] +name = "wrapt" +version = "2.1.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/f7/37/ae31f40bec90de2f88d9597d0b5281e23ffe85b893a47ca5d9c05c63a4f6/wrapt-2.1.1.tar.gz", hash = "sha256:5fdcb09bf6db023d88f312bd0767594b414655d58090fc1c46b3414415f67fac", size = 81329 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/ca/21/293b657a27accfbbbb6007ebd78af0efa2083dac83e8f523272ea09b4638/wrapt-2.1.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:7e927375e43fd5a985b27a8992327c22541b6dede1362fc79df337d26e23604f", size = 60554 }, + { url = "https://files.pythonhosted.org/packages/25/e9/96dd77728b54a899d4ce2798d7b1296989ce687ed3c0cb917d6b3154bf5d/wrapt-2.1.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:e1c99544b6a7d40ca22195563b6d8bc3986ee8bb82f272f31f0670fe9440c869", size = 61496 }, + { url = "https://files.pythonhosted.org/packages/44/79/4c755b45df6ef30c0dd628ecfaa0c808854be147ca438429da70a162833c/wrapt-2.1.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = 
"sha256:b2be3fa5f4efaf16ee7c77d0556abca35f5a18ad4ac06f0ef3904c3399010ce9", size = 113528 }, + { url = "https://files.pythonhosted.org/packages/9f/63/23ce28f7b841217d9a6337a340fbb8d4a7fbd67a89d47f377c8550fa34aa/wrapt-2.1.1-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:67c90c1ae6489a6cb1a82058902caa8006706f7b4e8ff766f943e9d2c8e608d0", size = 115536 }, + { url = "https://files.pythonhosted.org/packages/23/7b/5ca8d3b12768670d16c8329e29960eedd56212770365a02a8de8bf73dc01/wrapt-2.1.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:05c0db35ccffd7480143e62df1e829d101c7b86944ae3be7e4869a7efa621f53", size = 114716 }, + { url = "https://files.pythonhosted.org/packages/c7/3a/9789ccb14a096d30bb847bf3ee137bf682cc9750c2ce155f4c5ae1962abf/wrapt-2.1.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:0c2ec9f616755b2e1e0bf4d0961f59bb5c2e7a77407e7e2c38ef4f7d2fdde12c", size = 113200 }, + { url = "https://files.pythonhosted.org/packages/cf/e5/4ec3526ce6ce920b267c8d35d2c2f0874d3fad2744c8b7259353f1132baa/wrapt-2.1.1-cp310-cp310-win32.whl", hash = "sha256:203ba6b3f89e410e27dbd30ff7dccaf54dcf30fda0b22aa1b82d560c7f9fe9a1", size = 57876 }, + { url = "https://files.pythonhosted.org/packages/d1/4e/661c7c76ecd85375b2bc03488941a3a1078642af481db24949e2b9de01f4/wrapt-2.1.1-cp310-cp310-win_amd64.whl", hash = "sha256:6f9426d9cfc2f8732922fc96198052e55c09bb9db3ddaa4323a18e055807410e", size = 60224 }, + { url = "https://files.pythonhosted.org/packages/5f/b7/53c7252d371efada4cb119e72e774fa2c6b3011fc33e3e552cdf48fb9488/wrapt-2.1.1-cp310-cp310-win_arm64.whl", hash = "sha256:69c26f51b67076b40714cff81bdd5826c0b10c077fb6b0678393a6a2f952a5fc", size = 58645 }, + { url = "https://files.pythonhosted.org/packages/b8/a8/9254e4da74b30a105935197015b18b31b7a298bf046e67d8952ef74967bd/wrapt-2.1.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:6c366434a7fb914c7a5de508ed735ef9c133367114e1a7cb91dfb5cd806a1549", size = 60554 }, + { url = "https://files.pythonhosted.org/packages/9e/a1/378579880cc7af226354054a2c255f69615b379d8adad482bfe2f22a0dc2/wrapt-2.1.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:5d6a2068bd2e1e19e5a317c8c0b288267eec4e7347c36bc68a6e378a39f19ee7", size = 61491 }, + { url = "https://files.pythonhosted.org/packages/dc/72/957b51c56acca35701665878ad31626182199fc4afecfe67dea072210f95/wrapt-2.1.1-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:891ab4713419217b2aed7dd106c9200f64e6a82226775a0d2ebd6bef2ebd1747", size = 113949 }, + { url = "https://files.pythonhosted.org/packages/cd/74/36bbebb4a3d2ae9c3e6929639721f8606cd0710a82a777c371aa69e36504/wrapt-2.1.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c8ef36a0df38d2dc9d907f6617f89e113c5892e0a35f58f45f75901af0ce7d81", size = 115989 }, + { url = "https://files.pythonhosted.org/packages/ae/0d/f1177245a083c7be284bc90bddfe5aece32cdd5b858049cb69ce001a0e8d/wrapt-2.1.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:76e9af3ebd86f19973143d4d592cbf3e970cf3f66ddee30b16278c26ae34b8ab", size = 115242 }, + { url = "https://files.pythonhosted.org/packages/62/3e/3b7cf5da27e59df61b1eae2d07dd03ff5d6f75b5408d694873cca7a8e33c/wrapt-2.1.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:ff562067485ebdeaef2fa3fe9b1876bc4e7b73762e0a01406ad81e2076edcebf", size = 113676 }, + { url = "https://files.pythonhosted.org/packages/f7/65/8248d3912c705f2c66f81cb97c77436f37abcbedb16d633b5ab0d795d8cd/wrapt-2.1.1-cp311-cp311-win32.whl", 
hash = "sha256:9e60a30aa0909435ec4ea2a3c53e8e1b50ac9f640c0e9fe3f21fd248a22f06c5", size = 57863 }, + { url = "https://files.pythonhosted.org/packages/6b/31/d29310ab335f71f00c50466153b3dc985aaf4a9fc03263e543e136859541/wrapt-2.1.1-cp311-cp311-win_amd64.whl", hash = "sha256:7d79954f51fcf84e5ec4878ab4aea32610d70145c5bbc84b3370eabfb1e096c2", size = 60224 }, + { url = "https://files.pythonhosted.org/packages/0c/90/a6ec319affa6e2894962a0cb9d73c67f88af1a726d15314bfb5c88b8a08d/wrapt-2.1.1-cp311-cp311-win_arm64.whl", hash = "sha256:d3ffc6b0efe79e08fd947605fd598515aebefe45e50432dc3b5cd437df8b1ada", size = 58643 }, + { url = "https://files.pythonhosted.org/packages/df/cb/4d5255d19bbd12be7f8ee2c1fb4269dddec9cef777ef17174d357468efaa/wrapt-2.1.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:ab8e3793b239db021a18782a5823fcdea63b9fe75d0e340957f5828ef55fcc02", size = 61143 }, + { url = "https://files.pythonhosted.org/packages/6f/07/7ed02daa35542023464e3c8b7cb937fa61f6c61c0361ecf8f5fecf8ad8da/wrapt-2.1.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:7c0300007836373d1c2df105b40777986accb738053a92fe09b615a7a4547e9f", size = 61740 }, + { url = "https://files.pythonhosted.org/packages/c4/60/a237a4e4a36f6d966061ccc9b017627d448161b19e0a3ab80a7c7c97f859/wrapt-2.1.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:2b27c070fd1132ab23957bcd4ee3ba707a91e653a9268dc1afbd39b77b2799f7", size = 121327 }, + { url = "https://files.pythonhosted.org/packages/ae/fe/9139058a3daa8818fc67e6460a2340e8bbcf3aef8b15d0301338bbe181ca/wrapt-2.1.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8b0e36d845e8b6f50949b6b65fc6cd279f47a1944582ed4ec8258cd136d89a64", size = 122903 }, + { url = "https://files.pythonhosted.org/packages/91/10/b8479202b4164649675846a531763531f0a6608339558b5a0a718fc49a8d/wrapt-2.1.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:4aeea04a9889370fcfb1ef828c4cc583f36a875061505cd6cd9ba24d8b43cc36", size = 121333 }, + { url = "https://files.pythonhosted.org/packages/5f/75/75fc793b791d79444aca2c03ccde64e8b99eda321b003f267d570b7b0985/wrapt-2.1.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:d88b46bb0dce9f74b6817bc1758ff2125e1ca9e1377d62ea35b6896142ab6825", size = 120458 }, + { url = "https://files.pythonhosted.org/packages/d7/8f/c3f30d511082ca6d947c405f9d8f6c8eaf83cfde527c439ec2c9a30eb5ea/wrapt-2.1.1-cp312-cp312-win32.whl", hash = "sha256:63decff76ca685b5c557082dfbea865f3f5f6d45766a89bff8dc61d336348833", size = 58086 }, + { url = "https://files.pythonhosted.org/packages/0a/c8/37625b643eea2849f10c3b90f69c7462faa4134448d4443234adaf122ae5/wrapt-2.1.1-cp312-cp312-win_amd64.whl", hash = "sha256:b828235d26c1e35aca4107039802ae4b1411be0fe0367dd5b7e4d90e562fcbcd", size = 60328 }, + { url = "https://files.pythonhosted.org/packages/ce/79/56242f07572d5682ba8065a9d4d9c2218313f576e3c3471873c2a5355ffd/wrapt-2.1.1-cp312-cp312-win_arm64.whl", hash = "sha256:75128507413a9f1bcbe2db88fd18fbdbf80f264b82fa33a6996cdeaf01c52352", size = 58722 }, + { url = "https://files.pythonhosted.org/packages/f7/ca/3cf290212855b19af9fcc41b725b5620b32f470d6aad970c2593500817eb/wrapt-2.1.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:ce9646e17fa7c3e2e7a87e696c7de66512c2b4f789a8db95c613588985a2e139", size = 61150 }, + { url = "https://files.pythonhosted.org/packages/9d/33/5b8f89a82a9859ce82da4870c799ad11ce15648b6e1c820fec3e23f4a19f/wrapt-2.1.1-cp313-cp313-macosx_11_0_arm64.whl", hash = 
"sha256:428cfc801925454395aa468ba7ddb3ed63dc0d881df7b81626cdd433b4e2b11b", size = 61743 }, + { url = "https://files.pythonhosted.org/packages/1e/2f/60c51304fbdf47ce992d9eefa61fbd2c0e64feee60aaa439baf42ea6f40b/wrapt-2.1.1-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:5797f65e4d58065a49088c3b32af5410751cd485e83ba89e5a45e2aa8905af98", size = 121341 }, + { url = "https://files.pythonhosted.org/packages/ad/03/ce5256e66dd94e521ad5e753c78185c01b6eddbed3147be541f4d38c0cb7/wrapt-2.1.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5a2db44a71202c5ae4bb5f27c6d3afbc5b23053f2e7e78aa29704541b5dad789", size = 122947 }, + { url = "https://files.pythonhosted.org/packages/eb/ae/50ca8854b81b946a11a36fcd6ead32336e6db2c14b6e4a8b092b80741178/wrapt-2.1.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:8d5350c3590af09c1703dd60ec78a7370c0186e11eaafb9dda025a30eee6492d", size = 121370 }, + { url = "https://files.pythonhosted.org/packages/fb/d9/d6a7c654e0043319b4cc137a4caaf7aa16b46b51ee8df98d1060254705b7/wrapt-2.1.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:2d9b076411bed964e752c01b49fd224cc385f3a96f520c797d38412d70d08359", size = 120465 }, + { url = "https://files.pythonhosted.org/packages/55/90/65be41e40845d951f714b5a77e84f377a3787b1e8eee6555a680da6d0db5/wrapt-2.1.1-cp313-cp313-win32.whl", hash = "sha256:0bb7207130ce6486727baa85373503bf3334cc28016f6928a0fa7e19d7ecdc06", size = 58090 }, + { url = "https://files.pythonhosted.org/packages/5f/66/6a09e0294c4fc8c26028a03a15191721c9271672467cc33e6617ee0d91d2/wrapt-2.1.1-cp313-cp313-win_amd64.whl", hash = "sha256:cbfee35c711046b15147b0ae7db9b976f01c9520e6636d992cd9e69e5e2b03b1", size = 60341 }, + { url = "https://files.pythonhosted.org/packages/7a/f0/20ceb8b701e9a71555c87a5ddecbed76ec16742cf1e4b87bbaf26735f998/wrapt-2.1.1-cp313-cp313-win_arm64.whl", hash = "sha256:7d2756061022aebbf57ba14af9c16e8044e055c22d38de7bf40d92b565ecd2b0", size = 58731 }, + { url = "https://files.pythonhosted.org/packages/80/b4/fe95beb8946700b3db371f6ce25115217e7075ca063663b8cca2888ba55c/wrapt-2.1.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:4814a3e58bc6971e46baa910ecee69699110a2bf06c201e24277c65115a20c20", size = 62969 }, + { url = "https://files.pythonhosted.org/packages/b8/89/477b0bdc784e3299edf69c279697372b8bd4c31d9c6966eae405442899df/wrapt-2.1.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:106c5123232ab9b9f4903692e1fa0bdc231510098f04c13c3081f8ad71c3d612", size = 63606 }, + { url = "https://files.pythonhosted.org/packages/ed/55/9d0c1269ab76de87715b3b905df54dd25d55bbffd0b98696893eb613469f/wrapt-2.1.1-cp313-cp313t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:1a40b83ff2535e6e56f190aff123821eea89a24c589f7af33413b9c19eb2c738", size = 152536 }, + { url = "https://files.pythonhosted.org/packages/44/18/2004766030462f79ad86efaa62000b5e39b1ff001dcce86650e1625f40ae/wrapt-2.1.1-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:789cea26e740d71cf1882e3a42bb29052bc4ada15770c90072cb47bf73fb3dbf", size = 158697 }, + { url = "https://files.pythonhosted.org/packages/e1/bb/0a880fa0f35e94ee843df4ee4dd52a699c9263f36881311cfb412c09c3e5/wrapt-2.1.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:ba49c14222d5e5c0ee394495a8655e991dc06cbca5398153aefa5ac08cd6ccd7", size = 155563 }, + { url = 
"https://files.pythonhosted.org/packages/42/ff/cd1b7c4846c8678fac359a6eb975dc7ab5bd606030adb22acc8b4a9f53f1/wrapt-2.1.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:ac8cda531fe55be838a17c62c806824472bb962b3afa47ecbd59b27b78496f4e", size = 150161 }, + { url = "https://files.pythonhosted.org/packages/38/ec/67c90a7082f452964b4621e4890e9a490f1add23cdeb7483cc1706743291/wrapt-2.1.1-cp313-cp313t-win32.whl", hash = "sha256:b8af75fe20d381dd5bcc9db2e86a86d7fcfbf615383a7147b85da97c1182225b", size = 59783 }, + { url = "https://files.pythonhosted.org/packages/ec/08/466afe4855847d8febdfa2c57c87e991fc5820afbdef01a273683dfd15a0/wrapt-2.1.1-cp313-cp313t-win_amd64.whl", hash = "sha256:45c5631c9b6c792b78be2d7352129f776dd72c605be2c3a4e9be346be8376d83", size = 63082 }, + { url = "https://files.pythonhosted.org/packages/9a/62/60b629463c28b15b1eeadb3a0691e17568622b12aa5bfa7ebe9b514bfbeb/wrapt-2.1.1-cp313-cp313t-win_arm64.whl", hash = "sha256:da815b9263947ac98d088b6414ac83507809a1d385e4632d9489867228d6d81c", size = 60251 }, + { url = "https://files.pythonhosted.org/packages/95/a0/1c2396e272f91efe6b16a6a8bce7ad53856c8f9ae4f34ceaa711d63ec9e1/wrapt-2.1.1-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:9aa1765054245bb01a37f615503290d4e207e3fd59226e78341afb587e9c1236", size = 61311 }, + { url = "https://files.pythonhosted.org/packages/b0/9a/d2faba7e61072a7507b5722db63562fdb22f5a24e237d460d18755627f15/wrapt-2.1.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:feff14b63a6d86c1eee33a57f77573649f2550935981625be7ff3cb7342efe05", size = 61805 }, + { url = "https://files.pythonhosted.org/packages/db/56/073989deb4b5d7d6e7ea424476a4ae4bda02140f2dbeaafb14ba4864dd60/wrapt-2.1.1-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:81fc5f22d5fcfdbabde96bb3f5379b9f4476d05c6d524d7259dc5dfb501d3281", size = 120308 }, + { url = "https://files.pythonhosted.org/packages/d1/b6/84f37261295e38167a29eb82affaf1dc15948dc416925fe2091beee8e4ac/wrapt-2.1.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:951b228ecf66def855d22e006ab9a1fc12535111ae7db2ec576c728f8ddb39e8", size = 122688 }, + { url = "https://files.pythonhosted.org/packages/ea/80/32db2eec6671f80c65b7ff175be61bc73d7f5223f6910b0c921bbc4bd11c/wrapt-2.1.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:0ddf582a95641b9a8c8bd643e83f34ecbbfe1b68bc3850093605e469ab680ae3", size = 121115 }, + { url = "https://files.pythonhosted.org/packages/49/ef/dcd00383df0cd696614127902153bf067971a5aabcd3c9dcb2d8ef354b2a/wrapt-2.1.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:fc5c500966bf48913f795f1984704e6d452ba2414207b15e1f8c339a059d5b16", size = 119484 }, + { url = "https://files.pythonhosted.org/packages/76/29/0630280cdd2bd8f86f35cb6854abee1c9d6d1a28a0c6b6417cd15d378325/wrapt-2.1.1-cp314-cp314-win32.whl", hash = "sha256:4aa4baadb1f94b71151b8e44a0c044f6af37396c3b8bcd474b78b49e2130a23b", size = 58514 }, + { url = "https://files.pythonhosted.org/packages/db/19/5bed84f9089ed2065f6aeda5dfc4f043743f642bc871454b261c3d7d322b/wrapt-2.1.1-cp314-cp314-win_amd64.whl", hash = "sha256:860e9d3fd81816a9f4e40812f28be4439ab01f260603c749d14be3c0a1170d19", size = 60763 }, + { url = "https://files.pythonhosted.org/packages/e4/cb/b967f2f9669e4249b4fe82e630d2a01bc6b9e362b9b12ed91bbe23ae8df4/wrapt-2.1.1-cp314-cp314-win_arm64.whl", hash = "sha256:3c59e103017a2c1ea0ddf589cbefd63f91081d7ce9d491d69ff2512bb1157e23", size = 59051 }, + { url = 
"https://files.pythonhosted.org/packages/eb/19/6fed62be29f97eb8a56aff236c3f960a4b4a86e8379dc7046a8005901a97/wrapt-2.1.1-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:9fa7c7e1bee9278fc4f5dd8275bc8d25493281a8ec6c61959e37cc46acf02007", size = 63059 }, + { url = "https://files.pythonhosted.org/packages/0a/1c/b757fd0adb53d91547ed8fad76ba14a5932d83dde4c994846a2804596378/wrapt-2.1.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:39c35e12e8215628984248bd9c8897ce0a474be2a773db207eb93414219d8469", size = 63618 }, + { url = "https://files.pythonhosted.org/packages/10/fe/e5ae17b1480957c7988d991b93df9f2425fc51f128cf88144d6a18d0eb12/wrapt-2.1.1-cp314-cp314t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:94ded4540cac9125eaa8ddf5f651a7ec0da6f5b9f248fe0347b597098f8ec14c", size = 152544 }, + { url = "https://files.pythonhosted.org/packages/3e/cc/99aed210c6b547b8a6e4cb9d1425e4466727158a6aeb833aa7997e9e08dd/wrapt-2.1.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:da0af328373f97ed9bdfea24549ac1b944096a5a71b30e41c9b8b53ab3eec04a", size = 158700 }, + { url = "https://files.pythonhosted.org/packages/81/0e/d442f745f4957944d5f8ad38bc3a96620bfff3562533b87e486e979f3d99/wrapt-2.1.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:4ad839b55f0bf235f8e337ce060572d7a06592592f600f3a3029168e838469d3", size = 155561 }, + { url = "https://files.pythonhosted.org/packages/51/ac/9891816280e0018c48f8dfd61b136af7b0dcb4a088895db2531acde5631b/wrapt-2.1.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:0d89c49356e5e2a50fa86b40e0510082abcd0530f926cbd71cf25bee6b9d82d7", size = 150188 }, + { url = "https://files.pythonhosted.org/packages/24/98/e2f273b6d70d41f98d0739aa9a269d0b633684a5fb17b9229709375748d4/wrapt-2.1.1-cp314-cp314t-win32.whl", hash = "sha256:f4c7dd22cf7f36aafe772f3d88656559205c3af1b7900adfccb70edeb0d2abc4", size = 60425 }, + { url = "https://files.pythonhosted.org/packages/1e/06/b500bfc38a4f82d89f34a13069e748c82c5430d365d9e6b75afb3ab74457/wrapt-2.1.1-cp314-cp314t-win_amd64.whl", hash = "sha256:f76bc12c583ab01e73ba0ea585465a41e48d968f6d1311b4daec4f8654e356e3", size = 63855 }, + { url = "https://files.pythonhosted.org/packages/d9/cc/5f6193c32166faee1d2a613f278608e6f3b95b96589d020f0088459c46c9/wrapt-2.1.1-cp314-cp314t-win_arm64.whl", hash = "sha256:7ea74fc0bec172f1ae5f3505b6655c541786a5cabe4bbc0d9723a56ac32eb9b9", size = 60443 }, + { url = "https://files.pythonhosted.org/packages/08/3e/144e085a4a237b60a1b41f56e8a173e5e4f21f42a201e43f8d38272b4772/wrapt-2.1.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:9e03b3d486eb39f5d3f562839f59094dcee30c4039359ea15768dc2214d9e07c", size = 60552 }, + { url = "https://files.pythonhosted.org/packages/69/25/576fa5d1e8c0b2657ed411b947bb50c7cc56a0a882fbd1b04574803e668a/wrapt-2.1.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:0fdf3073f488ce4d929929b7799e3b8c52b220c9eb3f4a5a51e2dc0e8ff07881", size = 61498 }, + { url = "https://files.pythonhosted.org/packages/48/01/37def21f806dee9db8c12f99b872b3cdf15215bafe3919c982968134b804/wrapt-2.1.1-cp39-cp39-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:0cb4f59238c6625fae2eeb72278da31c9cfba0ff4d9cbe37446b73caa0e9bcf7", size = 113232 }, + { url = "https://files.pythonhosted.org/packages/bf/ee/31dfda37ae75db11cc46634aa950c3497f7a8f987d811388bf1b11fe2f80/wrapt-2.1.1-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = 
"sha256:7f794a1c148871b714cb566f5466ec8288e0148a1c417550983864b3981737cd", size = 115198 }, + { url = "https://files.pythonhosted.org/packages/93/d5/43cb27a2d7142bdbe9700099e7261fdc264f63c6b60a8025dd5f8af157da/wrapt-2.1.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:95ef3866631c6da9ce1fc0f1e17b90c4c0aa6d041fc70a11bc90733aee122e1a", size = 114400 }, + { url = "https://files.pythonhosted.org/packages/61/91/8429803605df5540b918fe6fc9ffc4f167270f4b7ca1f82eaf7d7b1204b6/wrapt-2.1.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:66bc1b2446f01cbbd3c56b79a3a8435bcd4178ac4e06b091913f7751a7f528b8", size = 112998 }, + { url = "https://files.pythonhosted.org/packages/7e/6a/25cb316f3e8262a1626da71b2c299ae2be02fb0547028eac9aa21daeedda/wrapt-2.1.1-cp39-cp39-win32.whl", hash = "sha256:1b9e08e57cabc32972f7c956d10e85093c5da9019faa24faf411e7dd258e528c", size = 57871 }, + { url = "https://files.pythonhosted.org/packages/09/69/ffd41e6149ac4bd9700552659842383f44eb96f00e03c2db433bc856bf2f/wrapt-2.1.1-cp39-cp39-win_amd64.whl", hash = "sha256:e75ad48c3cca739f580b5e14c052993eb644c7fa5b4c90aa51193280b30875ae", size = 60222 }, + { url = "https://files.pythonhosted.org/packages/59/f0/1889e68a0d389d2552b9e014ed6471addcfab98f09611bac61a8d8fab223/wrapt-2.1.1-cp39-cp39-win_arm64.whl", hash = "sha256:9ccd657873b7f964711447d004563a2bc08d1476d7a1afcad310f3713e6f50f4", size = 58647 }, + { url = "https://files.pythonhosted.org/packages/c4/da/5a086bf4c22a41995312db104ec2ffeee2cf6accca9faaee5315c790377d/wrapt-2.1.1-py3-none-any.whl", hash = "sha256:3b0f4629eb954394a3d7c7a1c8cca25f0b07cefe6aa8545e862e9778152de5b7", size = 43886 }, +] + +[[package]] +name = "zarr" +version = "2.18.2" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version < '3.10'", +] +dependencies = [ + { name = "asciitree", marker = "python_full_version < '3.10'" }, + { name = "fasteners", marker = "python_full_version < '3.10' and sys_platform != 'emscripten'" }, + { name = "numcodecs", version = "0.12.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, + { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/b3/00/ac5c518ff1c1b1cc87a62f86ad9d19c647c19d969a91faa40d3b6342ccaa/zarr-2.18.2.tar.gz", hash = "sha256:9bb393b8a0a38fb121dbb913b047d75db28de9890f6d644a217a73cf4ae74f47", size = 3603055 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/5d/bd/8d881d8ca6d80fcb8da2b2f94f8855384daf649499ddfba78ffd1ee2caa3/zarr-2.18.2-py3-none-any.whl", hash = "sha256:a638754902f97efa99b406083fdc807a0e2ccf12a949117389d2a4ba9b05df38", size = 210228 }, +] + +[[package]] +name = "zarr" +version = "2.18.3" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version == '3.10.*'", +] +dependencies = [ + { name = "asciitree", marker = "python_full_version == '3.10.*'" }, + { name = "fasteners", marker = "python_full_version == '3.10.*' and sys_platform != 'emscripten'" }, + { name = "numcodecs", version = "0.13.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" }, + { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/23/c4/187a21ce7cf7c8f00c060dd0e04c2a81139bb7b1ab178bba83f2e1134ce2/zarr-2.18.3.tar.gz", hash = 
"sha256:2580d8cb6dd84621771a10d31c4d777dca8a27706a1a89b29f42d2d37e2df5ce", size = 3603224 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/ed/c9/142095e654c2b97133ff71df60979422717b29738b08bc8a1709a5d5e0d0/zarr-2.18.3-py3-none-any.whl", hash = "sha256:b1f7dfd2496f436745cdd4c7bcf8d3b4bc1dceef5fdd0d589c87130d842496dd", size = 210723 }, +] + +[[package]] +name = "zarr" +version = "2.18.7" +source = { registry = "https://pypi.org/simple" } +resolution-markers = [ + "python_full_version >= '3.13'", + "python_full_version == '3.12.*'", + "python_full_version == '3.11.*'", +] +dependencies = [ + { name = "asciitree", marker = "python_full_version >= '3.11'" }, + { name = "fasteners", marker = "python_full_version >= '3.11' and sys_platform != 'emscripten'" }, + { name = "numcodecs", version = "0.15.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, + { name = "numpy", version = "1.26.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11' and python_full_version < '3.13'" }, + { name = "numpy", version = "2.4.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/da/1d/01cf9e3ab2d85190278efc3fca9f68563de35ae30ee59e7640e3af98abe3/zarr-2.18.7.tar.gz", hash = "sha256:b2b8f66f14dac4af66b180d2338819981b981f70e196c9a66e6bfaa9e59572f5", size = 3604558 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/5e/d8/9ffd8c237b3559945bb52103cf0eed64ea098f7b7f573f8d2962ef27b4b2/zarr-2.18.7-py3-none-any.whl", hash = "sha256:ac3dc4033e9ae4e9d7b5e27c97ea3eaf1003cc0a07f010bd83d5134bf8c4b223", size = 211273 }, +] + [[package]] name = "zipp" version = "3.23.0"