Add Cercarbono & Isometric registry data processors #138
Closed
Changes from 51 commits
71 commits
3a02bef Add Cercarbono project processing and update raw columns mapping (andersy005)
47829e2 Merge branch 'main' into add-Cercarbano (andersy005)
95d9a1a Update Cercarbono mappings in projects-raw-columns-mapping.json (andersy005)
dbbc2c6 Add method to generate project URLs for Cercarbono projects (andersy005)
8a06d33 Add processing method for Cercarbono transactions and update column m… (andersy005)
7a1dcc4 Update transaction date conversion to use ISO8601 format (andersy005)
a05fc97 Extract vintage year from vintage_of_credits in process_cercarbono_tr… (andersy005)
328d074 Add missing columns handling in process_cercarbono_transactions (andersy005)
2c2df90 Refactor process_cercarbono_projects to accept credits DataFrame and … (andersy005)
3a36872 Remove unnecessary parameter from process_vcs_projects calls in tests (andersy005)
3f6029a Add process_isometric_projects function to handle Isometric project data (andersy005)
121a275 Add isometric project mappings to projects-raw-columns-mapping.json (andersy005)
e8d93cc Add project URL handling and enhance isometric project processing (andersy005)
53b461e Rename process_cercarbono_transactions to process_cercarbono_credits … (andersy005)
a7a7540 Enhance process_isometric_credits function to include datetime conver… (andersy005)
156694b Add project ID and vintage year extraction to process_isometric_credi… (andersy005)
31b6cb4 Change integer columns to Float32 in project_schema and credit_withou… (andersy005)
cf5ca9c Uncomment methods to add retired and issued totals, and first issuanc… (andersy005)
783b1d2 Refactor process_isometric_credits function to handle transaction typ… (andersy005)
eaa2599 Add 'isometric' and 'cercarbono' to registry abbreviation mapping (andersy005)
04072d5 Update project_id mapping in cercarbono retirements and remove redund… (andersy005)
d97f43b Add project ID methods for Cercarbono and Isometric credits dataframe… (andersy005)
5efeea9 Fix project ID assignment order in process_cercarbono_projects and up… (andersy005)
0b6d441 Refactor process_cercarbono_credits to streamline data handling for i… (andersy005)
3966307 Merge branch 'main' into add-Cercarbano (andersy005)
ad1804a Enhance process_isometric_credits to support project ID mapping with … (andersy005)
d2bc9a2 Add harmonization option for beneficiary data in process functions (andersy005)
62153ae Refactor process_isometric_credits to improve flow and readability by… (andersy005)
75a1c69 Merge branch 'main' into add-Cercarbano (andersy005)
1afb6e8 Retrigger CI (andersy005)
0c1c6d9 Merge branch 'main' into add-Cercarbano (andersy005)
adcfc70 Refactor import statements for pandera to use pandas submodule (andersy005)
9816997 Add new project types and update isometric project type inference logic (andersy005)
99b03bf Add Cercarbono project type inference and update protocol mapping (andersy005)
c40bcf1 Merge branch 'main' into add-Cercarbano (andersy005)
09f7c6e Refactor protocol mapping: rename 'ccb-reforest' to 'ccb-refor' and r… (andersy005)
f772e5b Remove Cercarbono and Isometric project type inference from processin… (andersy005)
56e9bb8 Add project type inference to process_cercarbono_projects and process… (andersy005)
d5a6282 Update cercarbano config for protocol definition (badgley)
5dbec5b Fix typos in CCB methodology descriptions in all-protocol-mapping.json (andersy005)
b87fc10 Map more cercarbono protocol strings (badgley)
143606c Merge branch 'main' into add-Cercarbano (andersy005)
e9d8de0 fix formatting (andersy005)
48e7d43 remove infer_cercarbono_project_type and infer_isometric_project_type… (andersy005)
0276b2c Fix project type mapping string (badgley)
6531715 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
70bb898 Add protocol mapping to project processing functions and update add_c… (andersy005)
767444e Change project schema to use Float64 for retired, issued, and quantit… (andersy005)
c8dea2f Update project_id generation to preserve full code for Cercarbono pro… (andersy005)
146d145 Add optional projects parameter to process_cercarbono_credits for imp… (andersy005)
d364b97 Refactor project ID generation in process_cercarbono_credits to use g… (andersy005)
3599b43 Update beneficiary data to include new registries (badgley)
d3e7f52 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
d404725 Merge branch 'main' into add-Cercarbano (andersy005)
496b98f Merge main: adopt refactored test signatures from #153 (andersy005)
530fef3 add tests for cercarbono and isometric (andersy005)
773fff9 remove stale download_type='projects' from both test calls (andersy005)
3ed8931 Refactor CI workflow: separate unit-test and integration-test jobs (andersy005)
4bb16e9 Update credits mapping and normalize data processing in isometric.py (andersy005)
e911f1f Add transaction_url field to credits mapping and models (andersy005)
62e1f0b Add transaction_url field for issuances in process_isometric_credits … (andersy005)
a7e74c2 update isometric retirement data (andersy005)
e6ac8bc Update scratch date in test fixtures and adjust test cases for cercar… (andersy005)
9cb0328 Refactor tests to use subtests for improved clarity (andersy005)
377f59d Update vintage mapping for isometric issuances and adjust transaction… (andersy005)
e0e9c3f Fix precision errors in credit totals calculation by rounding based o… (andersy005)
a200308 Merge branch 'main' into add-Cercarbano (andersy005)
dd06d31 Merge branch 'main' into add-Cercarbano (andersy005)
efd8640 Upadte protocol mapping (badgley)
53a64bb [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
49fcbab Strip cercarbono-specific code; scope branch to Isometric only (andersy005)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,197 @@ | ||
| import pandas as pd | ||
| import pandas_flavor as pf | ||
| | ||
| from offsets_db_data.common import ( | ||
| BERKELEY_PROJECT_TYPE_UPATH, | ||
| CREDIT_SCHEMA_UPATH, | ||
| PROJECT_SCHEMA_UPATH, | ||
| load_column_mapping, | ||
| load_inverted_protocol_mapping, | ||
| load_protocol_mapping, | ||
| load_registry_project_column_mapping, | ||
| load_type_category_mapping, | ||
| ) | ||
| from offsets_db_data.credits import ( | ||
| aggregate_issuance_transactions, # noqa: F401 | ||
| harmonize_beneficiary_data, # noqa: F401 | ||
| merge_with_arb, # noqa: F401 | ||
| ) | ||
| from offsets_db_data.models import credit_without_id_schema, project_schema | ||
| from offsets_db_data.projects import ( | ||
| add_category, # noqa: F401 | ||
| add_first_issuance_and_retirement_dates, # noqa: F401 | ||
| add_is_compliance_flag, # noqa: F401 | ||
| add_retired_and_issued_totals, # noqa: F401 | ||
| harmonize_country_names, # noqa: F401 | ||
| harmonize_status_codes, # noqa: F401 | ||
| map_protocol, # noqa: F401 | ||
| ) | ||
| | ||
| | ||
| @pf.register_dataframe_method | ||
| def add_cercarbono_project_url(df: pd.DataFrame) -> pd.DataFrame: | ||
| """Add project URL column for Cercarbono projects. | ||
| | ||
| Parameters | ||
| ---------- | ||
| df : pd.DataFrame | ||
| Input dataframe containing Cercarbono project data. | ||
| | ||
| Returns | ||
| ------- | ||
| pd.DataFrame | ||
| Dataframe with added project URL column. | ||
| """ | ||
| base_url = 'https://www.ecoregistry.io/projects' | ||
| df['project_url'] = df['project_id'].apply(lambda x: f'{base_url}/{x}') | ||
| return df | ||
| | ||
| | ||
| @pf.register_dataframe_method | ||
| def add_cercarbono_project_id(df: pd.DataFrame, prefix: str = 'CCB') -> pd.DataFrame: | ||
| """Add project ID column for Cercarbono credits dataframe. | ||
| | ||
| Parameters | ||
| ---------- | ||
| df : pd.DataFrame | ||
| Input dataframe with a globally unique numeric `id` column. | ||
| | ||
| Returns | ||
| ------- | ||
| pd.DataFrame | ||
| Dataframe with added project ID column. | ||
| """ | ||
| df = df.copy() | ||
| # Use the globally unique numeric id (not the per-prefix code number) to avoid collisions. | ||
| # Different code prefixes (CDC, CP, CGS, CDB, CBA) share numeric suffixes (e.g. CDC-1, | ||
| # CP-1, CGS-1) but each project has a distinct id across the whole registry. | ||
| df['project_id'] = prefix + df['id'].astype(str) | ||
| return df | ||
| | ||
| | ||
| @pf.register_dataframe_method | ||
| def process_cercarbono_credits( | ||
| df: pd.DataFrame, | ||
| *, | ||
| download_type: str, | ||
| registry_name: str = 'cercarbono', | ||
| prefix: str = 'CCB', | ||
| harmonize_beneficiary_info: bool = False, | ||
| ) -> pd.DataFrame: | ||
| """Process Cercarbono transactions dataframe to conform to offsets-db schema. | ||
| | ||
| Parameters | ||
| ---------- | ||
| df : pd.DataFrame | ||
| Input dataframe containing Cercarbono credit transactions data. | ||
| download_type : str | ||
| Type of data, either 'issuances' or 'retirements'. Required keyword argument. | ||
| registry_name : str, optional | ||
| Name of the registry to be added to the dataframe, by default "cercarbono" | ||
| prefix : str, optional | ||
| Prefix to add to project IDs, by default "CCB" | ||
| | ||
| Returns | ||
| ------- | ||
| pd.DataFrame | ||
| Processed dataframe conforming to offsets-db schema. | ||
| """ | ||
| | ||
| if download_type == 'issuances': | ||
| # TODO: @badgley, please confirm this is the correct way to extract vintage year for issuances | ||
| df['vintage'] = df['vintage_of_credits'].str.split(' / ').str[-1].str[:4].astype(int) | ||
| df['transaction_type'] = 'issuance' | ||
| # Extract numeric project ID from serial — this is the globally unique id. | ||
| # Standard format: CDC_1_... → id at index 1 | ||
| # Revised format: CDC_R_16_... → id at index 2 (R indicates revision) | ||
| parts = df.serial.str.split('_') | ||
| numeric_id = parts.str[1].where(parts.str[1] != 'R', parts.str[2]) | ||
| df['project_id'] = prefix + numeric_id | ||
| | ||
| else: | ||
| df['transaction_type'] = 'retirement' | ||
| # project_id in the raw retirements data is the numeric id | ||
| df['project_id'] = prefix + df['project_id'].astype(str) | ||
| | ||
| column_mapping = load_column_mapping( | ||
| registry_name=registry_name, download_type=download_type, mapping_path=CREDIT_SCHEMA_UPATH | ||
| ) | ||
| | ||
| columns = {v: k for k, v in column_mapping.items()} | ||
| | ||
| data = ( | ||
| df.rename(columns=columns) | ||
| .set_registry(registry_name=registry_name) | ||
| .convert_to_datetime(columns=['transaction_date'], format='ISO8601') | ||
| .add_missing_columns(schema=credit_without_id_schema) | ||
| .validate(schema=credit_without_id_schema) | ||
| ) | ||
| | ||
| if harmonize_beneficiary_info: | ||
| data = data.pipe( | ||
| harmonize_beneficiary_data, registry_name=registry_name, download_type=download_type | ||
| ) | ||
| return data | ||
| | ||
| | ||
| @pf.register_dataframe_method | ||
| def process_cercarbono_projects( | ||
| df: pd.DataFrame, | ||
| *, | ||
| credits: pd.DataFrame, | ||
| registry_name: str = 'cercarbono', | ||
| ) -> pd.DataFrame: | ||
| """Process Cercarbono projects dataframe to conform to offsets-db schema. | ||
| | ||
| Parameters | ||
| ---------- | ||
| df : pd.DataFrame | ||
| Input dataframe containing Cercarbono project data. | ||
| registry_name : str, optional | ||
| Name of the registry to be added to the dataframe, by default "cercarbono" | ||
| | ||
| | ||
| Returns | ||
| ------- | ||
| pd.DataFrame | ||
| Processed dataframe conforming to offsets-db schema. | ||
| """ | ||
| | ||
| registry_project_column_mapping = load_registry_project_column_mapping( | ||
| registry_name=registry_name, file_path=PROJECT_SCHEMA_UPATH | ||
| ) | ||
| inverted_column_mapping = {value: key for key, value in registry_project_column_mapping.items()} | ||
| type_category_mapping = load_type_category_mapping() | ||
| inverted_protocol_mapping = load_inverted_protocol_mapping() | ||
| protocol_mapping = load_protocol_mapping() | ||
| df = df.copy() | ||
| df['country'] = df.locations.map( | ||
| lambda x: x[0]['country'] | ||
| ) # extract country from locations by taking first entry | ||
| | ||
| data = ( | ||
| df.rename(columns=inverted_column_mapping) | ||
| .set_registry(registry_name=registry_name) | ||
| .add_cercarbono_project_url() # this must be called before adding project id because the url function uses the original project_id value | ||
| .add_cercarbono_project_id() | ||
| .harmonize_country_names() | ||
| .harmonize_status_codes() | ||
| .map_protocol(inverted_protocol_mapping=inverted_protocol_mapping) | ||
| .infer_project_type() | ||
| .override_project_types( | ||
| override_data_path=BERKELEY_PROJECT_TYPE_UPATH, source_str='berkeley' | ||
| ) | ||
| .add_category( | ||
| type_category_mapping=type_category_mapping, | ||
| protocol_mapping=protocol_mapping, | ||
| ) # category derived from protocol; project_type is independent | ||
| .map_project_type_to_display_name(type_category_mapping=type_category_mapping) | ||
| .add_is_compliance_flag() | ||
| .add_retired_and_issued_totals(credits=credits) | ||
| .add_first_issuance_and_retirement_dates(credits=credits) | ||
| .add_missing_columns(schema=project_schema) | ||
| .convert_to_datetime(columns=['listed_at', 'first_issuance_at', 'first_retirement_at']) | ||
| .validate(schema=project_schema) | ||
| ) | ||
| | ||
| return data | ||
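
For anyone checking the issuance branch of `process_cercarbono_credits`: the project ID is built from one field of the underscore-separated serial, index 1 in the standard format and index 2 when index 1 is the revision marker `R`. The sketch below is a plain-Python, single-value counterpart of the vectorized `.where(...)` expression in the diff. The function name `project_id_from_serial` and the sample serials are made up for illustration; only the leading `CDC_1_` and `CDC_R_16_` fields come from the code comments.

```python
def project_id_from_serial(serial: str, prefix: str = 'CCB') -> str:
    """Scalar counterpart of the serial parsing in the diff (illustrative only)."""
    parts = serial.split('_')
    # Standard format: CDC_1_...    -> numeric id at index 1
    # Revised format:  CDC_R_16_... -> numeric id at index 2 ('R' marks a revision)
    numeric_id = parts[2] if parts[1] == 'R' else parts[1]
    return prefix + numeric_id


# Hypothetical serials: everything after the id field is invented.
print(project_id_from_serial('CDC_1_2020_0001'))     # CCB1
print(project_id_from_serial('CDC_R_16_2021_0001'))  # CCB16
```

A serial with no underscore would raise an `IndexError` here, whereas the pandas version would produce a missing value instead, so the two differ on malformed input.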
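
The vintage line in the issuance branch carries a TODO asking for confirmation. As an aid to that review, here is a single-value sketch of what `str.split(' / ').str[-1].str[:4].astype(int)` computes: the first four characters of the last ` / `-separated segment, cast to an integer. The `start / end` date layout in the samples is an assumption, since the PR does not show raw `vintage_of_credits` values, and `vintage_year` is a hypothetical helper name.

```python
def vintage_year(vintage_of_credits: str) -> int:
    """Scalar counterpart of the vintage extraction in the diff (illustrative only)."""
    # Keep only the last ' / '-separated segment (the whole string if there is no separator).
    last_segment = vintage_of_credits.split(' / ')[-1]
    # Assumes that segment starts with a four-digit year.
    return int(last_segment[:4])


# Hypothetical inputs: the real column format is not shown in the PR.
print(vintage_year('2019-01-01 / 2019-12-31'))  # 2019
print(vintage_year('2018-07-01 / 2019-06-30'))  # 2019
```

Under that assumed layout, a period spanning two calendar years resolves to the end year, which may or may not be the intended vintage.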
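
`process_cercarbono_projects` reads the country with `x[0]['country']`, which raises if `locations` is empty, missing, or lacks a `country` key. Whether such rows occur in the Cercarbono data is not stated in the PR, so the following is only a defensive sketch; `first_country` is a hypothetical helper, not part of `offsets_db_data`.

```python
from typing import Any, Optional


def first_country(locations: Any) -> Optional[str]:
    """Return the country of the first location entry, or None if unavailable."""
    # Guard against missing values and empty lists.
    if not isinstance(locations, list) or not locations:
        return None
    first = locations[0]
    # Guard against entries that are not mappings.
    if not isinstance(first, dict):
        return None
    return first.get('country')


print(first_country([{'country': 'Colombia'}, {'country': 'Peru'}]))  # Colombia
print(first_country([]))                                              # None
```

It could be used as `df.locations.map(first_country)`; how later steps such as `harmonize_country_names` treat a `None` country would need checking.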