
Parquet data system implementation #434

Open
mcuadera wants to merge 29 commits into dev from 264-parquet-data-system-implementation

@mcuadera mcuadera commented Mar 31, 2026

Collaborator

Closes #264

Several functions were built to implement a "raw_data" parquet system:

  1. create_raw_data_parquet(): transforms the raw_data output of get_all_polio_data() into a hierarchical parquet folder structure.
  2. upload_parquet_to_edav(): uploads the folder produced by create_raw_data_parquet() to EDAV.
  3. build_parquet_raw_data(): compiles the uploaded folder into a raw_data list of parquet connections.
  4. from_wkb_to_sf(): converts spatial datasets that were loaded through parquet connections back into regular sf objects. These are the raw_data$global.ctry/prov/dist, raw_data$cities and raw_data$roads datasets. Use this function after calling dplyr::collect().
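
As a rough illustration of the folder-per-dataset idea behind functions 1 and 3, the sketch below writes each element of a named list to its own parquet subfolder and reopens selected subfolders lazily with the arrow package. The helpers write_parquet_folder() and open_parquet_folder(), the file name part-0.parquet, and the toy data are all hypothetical; they are not the functions in this PR, and the layout actually produced by create_raw_data_parquet() may differ.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: write each data frame in a named list to its own subfolder
write_parquet_folder <- function(data_list, root) {
  for (nm in names(data_list)) {
    dir.create(file.path(root, nm), recursive = TRUE, showWarnings = FALSE)
    arrow::write_parquet(data_list[[nm]], file.path(root, nm, "part-0.parquet"))
  }
  invisible(root)
}

# Hypothetical helper: open the requested subfolders as lazy Arrow datasets
open_parquet_folder <- function(root,
                                datasets = list.dirs(root, full.names = FALSE, recursive = FALSE)) {
  conns <- lapply(datasets, function(nm) arrow::open_dataset(file.path(root, nm)))
  stats::setNames(conns, datasets)
}

# Toy stand-in for raw_data (assumed column names)
toy <- list(
  afp = data.frame(ctry = c("KENYA", "NIGERIA"), cases = c(1L, 2L)),
  pos = data.frame(ctry = "KENYA", n = 3L)
)

root <- file.path(tempdir(), "parquet_demo")
write_parquet_folder(toy, root)

# Only the "afp" dataset is opened, and rows are filtered before being read into memory
pq <- open_parquet_folder(root, "afp")
kenya_afp <- pq$afp |> dplyr::filter(ctry == "KENYA") |> dplyr::collect()
```

Nothing is read into memory until dplyr::collect() is called, which is what makes filtering before loading cheap.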

Most end users will only need build_parquet_raw_data(). My vision, however, is for these functions to be called within get_all_polio_data(). For example, when recreate.static.files = TRUE, after the full raw_data is compiled and just before the splitting step, create_raw_data_parquet() would write to a temporary directory, and that directory would then be uploaded to the analytics folder on EDAV. When pulling data the regular way (recreate.static.files = FALSE), build_parquet_raw_data() would be run instead. Filtering and dataset selection can then be added to get_all_polio_data().
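
The control flow described above could look roughly like the sketch below. The wrapper get_polio_data_sketch() and its arguments are hypothetical; the four steps are passed in as functions so the sketch can run with stand-ins, whereas in the package they would be the existing compile step and the three functions from this PR. Returning parquet connections after a rebuild is an assumption of the sketch, not something settled in this PR.

```r
# Hypothetical wrapper illustrating where the parquet helpers could be called
get_polio_data_sketch <- function(edav_folder,
                                  recreate.static.files = FALSE,
                                  datasets = NULL,
                                  compile_raw_data,
                                  create_parquet,
                                  upload_parquet,
                                  build_parquet) {
  if (recreate.static.files) {
    # Rebuild: compile everything, write parquet to a temp dir, upload to EDAV
    raw_data <- compile_raw_data()
    tmp <- file.path(tempdir(), "raw_data_parquet")
    create_parquet(raw_data, tmp)
    upload_parquet(tmp, edav_folder)
  }
  # Regular pull: open only the requested datasets as parquet connections
  build_parquet(edav_folder, datasets)
}

# Stand-ins that only record the order in which they are called
calls <- character()
res <- get_polio_data_sketch(
  "hypothetical/edav/folder",
  recreate.static.files = TRUE,
  datasets = "pos",
  compile_raw_data = function() {
    calls <<- c(calls, "compile")
    list(pos = data.frame())
  },
  create_parquet = function(raw_data, path) calls <<- c(calls, "create"),
  upload_parquet = function(path, dest) calls <<- c(calls, "upload"),
  build_parquet = function(folder, datasets) {
    calls <<- c(calls, "build")
    datasets
  }
)
```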

How to test:

library(devtools)
load_all()
raw_data <- get_all_polio_data()
create_raw_data_parquet(raw_data, "path/to/desktop/parquet_folder")
# Replace "mkc" in the sandbox folder name with your initials so you have your own folder
upload_parquet_to_edav("path/to/desktop/parquet_folder", "GID/PEB/SIR/Sandbox/parquet_sandbox_mkc")
raw_data_pq <- build_parquet_raw_data("GID/PEB/SIR/Sandbox/parquet_sandbox_mkc")
pos_pq <- build_parquet_raw_data("GID/PEB/SIR/Sandbox/parquet_sandbox_mkc", "pos")
afp_pos_pq <- build_parquet_raw_data("GID/PEB/SIR/Sandbox/parquet_sandbox_mkc", c("pos", "afp"))

# Converting binary shape columns to sf columns
# The ability to filter prior to loading will speed up analyses as you don't need to load the entire
# dataset prior to filtering
raw_data_pq$global.ctry |> dplyr::filter(ctry == "KENYA") |> dplyr::collect() |> from_wkb_to_sf()
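
For reviewers unfamiliar with WKB (well-known binary), the sketch below shows the kind of round trip involved, using the sf package on toy data. The function wkb_to_sf_sketch(), the column name wkb, and the assumed CRS of EPSG:4326 are hypothetical; from_wkb_to_sf() in this PR may be implemented differently.

```r
library(sf)

# Toy sf object standing in for a collected shape dataset
pts <- sf::st_sf(
  ctry = c("KENYA", "NIGERIA"),
  geometry = sf::st_sfc(
    sf::st_point(c(37.9, 0.02)),
    sf::st_point(c(8.7, 9.1)),
    crs = 4326
  )
)

# Encode: drop the sf geometry and keep it as a list column of raw WKB vectors,
# which is a form that can be stored in a binary parquet column
flat <- sf::st_drop_geometry(pts)
flat$wkb <- unclass(sf::st_as_binary(sf::st_geometry(pts)))

# Decode: the kind of step needed after dplyr::collect() (hypothetical helper)
wkb_to_sf_sketch <- function(df, wkb_col = "wkb", crs = 4326) {
  wkb <- structure(as.list(df[[wkb_col]]), class = "WKB")
  geom <- sf::st_as_sfc(wkb, crs = crs)
  df[[wkb_col]] <- NULL
  sf::st_sf(df, geometry = geom)
}

restored <- wkb_to_sf_sketch(flat)
```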

@mcuadera mcuadera self-assigned this Mar 31, 2026
@mcuadera mcuadera added the enhancement New feature or request label Mar 31, 2026
codecov-commenter commented Mar 31, 2026


Codecov Report

❌ Patch coverage is 0% with 1260 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (dev@5b0c777). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| R/get_all_polio_data.R | 0.00% | 650 Missing ⚠️ |
| R/get_all_polio_data_2.R | 0.00% | 365 Missing ⚠️ |
| R/dal.parquet.R | 0.00% | 245 Missing ⚠️ |

Additional details and impacted files
@@          Coverage Diff          @@
##             dev    #434   +/-   ##
=====================================
  Coverage       ?   1.93%           
=====================================
  Files          ?      41           
  Lines          ?   15614           
  Branches       ?       0           
=====================================
  Hits           ?     302           
  Misses         ?   15312           
  Partials       ?       0           


mcuadera commented Apr 1, 2026

Collaborator Author

I added an example of get_all_polio_data() that uses the parquet system, called get_all_polio_data_2(). It doesn't have as many features as the original, but features from the original function (such as local caching and archiving) can be added later.

Examples:
raw_data <- get_all_polio_data_2(data_folder = "GID/PEB/SIR/Sandbox/data", recreate.static.files = TRUE, cache = FALSE)
raw_data <- get_all_polio_data_2(data_folder = "GID/PEB/SIR/Sandbox/data")

