All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project follows Semantic Versioning.
- TUI Columns modal crash on slugs with duplicate column names.
Eleven slugs ship parquet schemas with legitimately repeated
top-level column names: the `osmi-mental-health-in-tech-*` survey
series (2016 through 2023) repeats "Why or why not?" follow-ups under
each yes/no item, and `uci-spambase`, `uci-parkinsons`, and
`uk-price-paid` each have one or more repeated headers. The new
Columns modal used the bare column name as the Textual `DataTable` row
key, so the second occurrence crashed with `DuplicateKey`. Repeated
names are now suffixed with `(2)`, `(3)`, etc. for display and lookup;
the by-name stats dict no longer silently collapses entries either.
The underlying parquet's column names are unchanged.
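
The suffixing rule can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `unique_display_names`, the space before the parenthesised counter, and the extra collision guard are all assumptions.

```python
from collections import Counter


def unique_display_names(columns: list[str]) -> list[str]:
    """Return one unique display key per column, in schema order.

    Hypothetical helper: the first occurrence keeps its bare name,
    later occurrences get " (2)", " (3)", and so on.
    """
    seen: Counter[str] = Counter()
    taken: set[str] = set()
    keys: list[str] = []
    for name in columns:
        seen[name] += 1
        n = seen[name]
        key = name if n == 1 else f"{name} ({n})"
        # Assumed guard: a real column may already be named "x (2)",
        # so keep bumping the counter until the key is unused.
        while key in taken:
            n += 1
            key = f"{name} ({n})"
        taken.add(key)
        keys.append(key)
    return keys
```

Because every returned key is distinct, it is safe to use as a table row key or as a dict key for per-column stats, while the original names stay available in schema order.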
0.1.4 - 2026-05-17
- Catalog discoverability: new ways to navigate the 249-spec
catalog without scrolling the full `docs/v1/datasets.md`.
- TUI faceted side panel (`browse.py`): filter groups for showcase, domain tags, size, shape traits, license, fetch type. View-preset bar (`encoding`, `stress`) on top, selectable from the `View` row. Counts header shows `N of 249`.
- TUI search: `/` focuses a search input above the table. Bare tokens match any field (substring, case-insensitive); qualified clauses (`slug:foo desc:bar tag:enums col:lat lic:cc0 handler:… reader:… fetch:…`) scope to one field. Clauses AND together and AND with the facet selection. Aliases: `name` / `desc[ription]` / `tag[s]` / `col[umn][s]` / `lic[ense]`.
- TUI Columns-modal rendering refresh: pessimistic per-codepoint
cell-width accounting fixes Sinhala / CJK / Arabic content overflow.
Block-glyph histograms scale with pane width; new x-axis tick labels
(`lo / mid / hi`) under each numeric histogram and horizontal bars for top-K string distributions. Modal widened (90% → 95%); the unreliable yellow border replaced by `$surface` background contrast.
- Per-column profiles: new opt-in stage
`python -m scripts.pipeline.profile [<slug>]` produces `outputs/v1/<slug>/profile.json` with per-dtype stats (numeric histograms, string NDV + top-K, bool T/F/null, date/timestamp ranges, list/map length stats). Surfaced in the TUI's detail pane and via `list_datasets --inspect <slug>`. Auto-promotes the result into `docs/v1/profiles/<slug>.json` so fresh clones can render sparklines without rebuilding; `--no-promote` opts out.
- `promote_profiles` tooling: new `python -m scripts.pipeline.promote_profiles` mirrors built profiles into the tracked `docs/v1/profiles/` directory. Idempotent (byte-identical destinations are skipped); `--check` for CI audits.
- List-element dtypes in profiles. `profile.py` now renders list, large_list, and fixed_size_list element types recursively in the dtype label (`list<float>`, `fixed_size_list<float>[100]`, `list<struct>`). Downstream consumers (e.g. `autotag`) can distinguish embedding-shaped columns from lists-of-structs without re-opening the parquet.
- Editorial metadata in `sources.json`: optional `tags` (closed vocab, 13 data-kind entries grouped by content axis: string: urls / prose / enums / identifiers / code-strings; numeric: timestamps / embeddings / counts / monetary / measurements; payload: coordinates / binary-payload / nested-json) and `showcase` (closed vocab, 2 tiers: encoding / stress) per `DatasetSpec`. `scripts.pipeline.autotag` proposes tags from each slug's profile + handler/slug-name fallbacks; hand-edit in `sources.json` after that like any other manifest field.
- Public BI workload descriptions. All 46 `bi-*` slugs in the Public BI Benchmark now carry per-workload descriptions grounded in actual column names rather than the workbook label, with a data-shape lead (N rows × M cols, dtype-family mix, notable columns) and a `Background:` note. Many workbook names mislead about contents, e.g. `bi-romance` is Instagram social posts; `bi-physicians` is CMS Medicare payment records; `bi-iglocations1` is US Census geographic codes; `bi-eixo` and `bi-uberlandia` share a schema with `bi-mulheresmil` (a Brazilian education program). Two slugs (`bi-arade`, `bi-wins`) retain a generic description because their columns are anonymised beyond recognition.
- Derived signals in
`docs/snapshot.json`: per-slug `shape_traits` (has_nested, has_timestamp, has_variant, string_heavy, wide_row, high_cardinality_present) and `size_bucket` (xs/s/m/l/xl), derived by `docs.py` from on-disk parquets.
- CLI parity: `list_datasets` gains `--tag`, `--showcase`, `--size`, `--trait` (with `!` negation), `--view`, `--inspect`, `--tags-help`, `--showcase-help`. `--inspect` falls back from the built-parquet profile to the tracked `docs/v1/profiles/<slug>.json` mirror, so a fresh clone can inspect any slug in the catalog without rebuilding.
- Curated-picks header in `docs/v1/datasets.md`: one block per showcase tier, regenerated from `sources.json`.
- README "Discover" subsection: directs newcomers at the TUI first.
- Skills: new `raincloud-profile`, new `raincloud-discover`; updated `raincloud-list-datasets`, `raincloud-build`.
- Tracked profiles for all 249 specs. `docs/v1/profiles/` ships a per-slug profile for every entry in the manifest, including the multi-hour heavyweights (`clickbench-hits`, `fineweb-sample-10bt`, `wikipedia-structured-contents`, `jsonbench-bluesky-100m`, `osm-germany-nodes`, the OpenLibrary dumps, etc.). A fresh clone can render the TUI Columns pane and use `list_datasets --inspect <slug>` on any slug without building anything locally.
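
The TUI search semantics listed above (bare tokens match any field, qualified clauses scope to one field, clauses AND together) can be sketched as below. This is only an illustration under stated assumptions: `FIELD_ALIASES`, `parse_query`, `matches`, and the flat `dict[str, str]` record shape are hypothetical, and treating `name` as an alias for `slug` is a guess from the alias list.

```python
from __future__ import annotations

# Hypothetical alias table; canonical field names follow the clause
# examples in the changelog (slug, desc, tag, col, lic, handler, ...).
FIELD_ALIASES = {
    "slug": "slug", "name": "slug",  # assumption: name aliases slug
    "desc": "desc", "description": "desc",
    "tag": "tag", "tags": "tag",
    "col": "col", "cols": "col", "column": "col", "columns": "col",
    "lic": "lic", "license": "lic",
    "handler": "handler", "reader": "reader", "fetch": "fetch",
}


def parse_query(query: str) -> list[tuple[str | None, str]]:
    """Split a query into (field, needle) clauses; field None means any field."""
    clauses: list[tuple[str | None, str]] = []
    for token in query.split():
        field, sep, value = token.partition(":")
        canonical = FIELD_ALIASES.get(field.lower()) if sep else None
        if canonical and value:
            clauses.append((canonical, value.lower()))
        else:
            # Unknown prefix or no colon: treat the whole token as bare.
            clauses.append((None, token.lower()))
    return clauses


def matches(record: dict[str, str], query: str) -> bool:
    """True when every clause matches (case-insensitive substring, AND)."""
    for field, needle in parse_query(query):
        haystacks = record.values() if field is None else [record.get(field, "")]
        if not any(needle in text.lower() for text in haystacks):
            return False
    return True
```

ANDing with the facet selection would then amount to applying `matches` only to rows that already passed the facet filters.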
- `autotag` enums classifier tightened. A string column counts as enum-shaped only when `ndv ≤ 32 AND mean_len ≤ 24`, or when `ndv ≤ 256 AND ndv/rows ≤ 0.001 AND mean_len ≤ 24` for very wide datasets. The slug-level `enums` tag additionally requires ≥2 qualifying columns, so a single class-label column no longer promotes the whole dataset to enum-shaped.
- `autotag` embeddings detection now reads the list-element dtype written by `profile.py` and recognises `list<float>` / `list<double>` / `fixed_size_list<float>` columns as embeddings without relying on slug-name heuristics. The remaining slug-name fallback uses word-boundary matching (`\b(embeddings?|word vectors?|dense vector| glove|word2vec|fasttext|encoder output)\b`) so unrelated copy like "sensors embedded in …" no longer matches.
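
The enum thresholds above translate directly into a small predicate. The sketch below is illustrative only: the function names and the per-column dict shape (`ndv`, `rows`, `mean_len` keys) are assumptions, not the actual `autotag` internals.

```python
def is_enum_shaped(ndv: int, rows: int, mean_len: float) -> bool:
    """Apply the two column-level rules from the changelog entry."""
    if mean_len > 24:
        return False  # both rules require mean_len <= 24
    if ndv <= 32:
        return True   # rule 1: few distinct values
    # Rule 2: up to 256 distinct values, but only if they are rare
    # relative to the row count (ndv/rows <= 0.001).
    return rows > 0 and ndv <= 256 and ndv / rows <= 0.001


def slug_gets_enums_tag(string_columns: list[dict]) -> bool:
    """Slug-level rule: at least two qualifying string columns.

    Each dict is a hypothetical per-column summary with the keys
    "ndv", "rows", and "mean_len".
    """
    qualifying = sum(
        1
        for col in string_columns
        if is_enum_shaped(col["ndv"], col["rows"], col["mean_len"])
    )
    return qualifying >= 2
```

For example, a column with 200 distinct values qualifies at 1,000,000 rows (ratio 0.0002) but not at 10,000 rows (ratio 0.02).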
- `DatasetSpec.family` field and `--family` CLI flag removed. The field was used to invoke batched builds (`python -m scripts.pipeline.build --family uci`); each slug is now invoked by name, and `--all` remains available for whole-catalog passes. Pass multiple slugs space-separated to `build` / `convert` for ad-hoc batches.
- Subject-matter `TAG_VOCAB` (12 entries: geospatial / nlp-text / web-analytics / e-commerce / finance / social / scientific / healthcare / sports / transportation / government / benchmark) replaced by the 13-entry data-kind vocab above.
- `curation.json` + `scripts/pipeline/curate.py` + `tests/test_curate.py` removed. Tags now sit inline in `sources.json` alongside `description` / `license` / `showcase`. The `curate apply` bridge is gone.
- `profile.py` DECIMAL overflow in histogram-bucket SQL. DuckDB was inferring DECIMAL types from the inlined `lo_f` / `hi_f` Python repr (e.g. `0.26851799179226266` → DECIMAL(18,17)); `(value - lo) * 10` then overflowed. All histogram-bucket literals are now `::DOUBLE`-cast.
- `profile.py` zero-length identifier on empty column names. Some upstream CSVs ship an unnamed pandas-index column whose Arrow field has `name == ""`; DuckDB rejects empty delimited identifiers. These columns are now skipped with a placeholder `__unnamed_column__` entry.
- `profile.py` TIME-of-day column cast. DuckDB doesn't implement `CAST(time AS TIMESTAMP)`; standalone TIME columns now route through the string profile (null_count + NDV + top-K of rendered HH:MM:SS).
- `profile.py` `fixed_size_list` columns were silently profiled as `null` because the dispatcher only checked `is_list` / `is_large_list`. They now route through the list profile and pick up the new element-type rendering, so e.g. `glove-6b-100d`'s `vector: fixed_size_list<float>[100]` is fully described.
- WDI re-enabled. The upstream redirect target `databankfiles.worldbank.org` serves an expired TLS cert, so Python's default `urllib` refused the connection. The new `fetch.verify_tls` field (boolean, default `true`) lets a slug bypass verification when its `expected_sha256` provides independent integrity. WDI ships at 395,276 rows × 70 columns (70 MB parquet).
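
The `verify_tls` escape hatch pairs a relaxed TLS context with a mandatory payload hash check. A minimal stdlib sketch of that pairing is shown below; `build_ssl_context` and `check_sha256` are hypothetical names, and the guard that refuses to disable verification without an `expected_sha256` is an assumption about sensible policy, not confirmed pipeline behaviour.

```python
from __future__ import annotations

import hashlib
import ssl


def build_ssl_context(
    verify_tls: bool = True, expected_sha256: str | None = None
) -> ssl.SSLContext:
    """Return a context for urllib; optionally skip certificate checks."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        if not expected_sha256:
            # Assumed guard: without a pinned hash nothing vouches
            # for the payload once TLS verification is off.
            raise ValueError("verify_tls=False requires expected_sha256")
        # check_hostname must be cleared before verify_mode is relaxed.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def check_sha256(payload: bytes, expected_sha256: str) -> None:
    """Raise ValueError when the payload does not match the pinned hash."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"sha256 mismatch: expected {expected_sha256}, got {actual}")
```

A fetch would then pass the context to `urllib.request.urlopen(url, context=ctx)` and call `check_sha256` on the downloaded bytes before anything else consumes them.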
- `sources.schema.json` adds three optional fields, all additive (existing manifests are accepted unchanged):
  - `DatasetSpec.tags` (array of TAG_VOCAB strings, default `[]`).
  - `DatasetSpec.showcase` (array of SHOWCASE_TIERS strings, default `[]`).
  - `DatasetSpec.fetch.verify_tls` (boolean, default `true`): escape hatch for upstreams whose TLS certs have rotted but whose payload integrity is gated by `expected_sha256`.
- New `profile.schema.json` (Draft 2020-12) for the per-slug profile output format.
0.1.3 - 2026-05-10
- Validate stage no longer hard-fails on row/schema_hash drift by default. A mismatch now emits a `[WARN]` line to stderr and the build continues. Users invoking `python -m scripts.pipeline.build <slug>` have already opted into "fetch whatever is upstream now"; an upstream Arrow-conversion bump or a slightly-grown row count shouldn't turn that into a failed build. Pass `--strict` (new flag on `scripts.pipeline.build`) to upgrade warnings to errors; recommended for CI / pre-release gates.
- The previous `--loose` flag has been removed; its behaviour (warn, don't raise) is now the default. Migrate `--loose` invocations by dropping the flag entirely; replace any "default-strict" CI invocations with `--strict`.
- `validate.py` now compares `expect.schema_hash` as a prefix when the manifest value is shorter than the full 64-char SHA-256. All 37 slugs with `schema_hash` set in `sources.json` use a 12-char short hash (matching the `[validate] schema_hash=` print convention, akin to git short SHAs); the previous full-string equality made every one of them fail validation on rebuild. Equal-length values still use strict equality, so full hashes remain enforceable for callers that prefer them.
- `sources.schema.md` updated to document the prefix-match rule and the new warn-vs-`--strict` semantics for the `expect` block.
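
The prefix-match rule and the warn-versus-strict behaviour described in this release can be sketched together. The helper names `schema_hash_matches` and `check_schema_hash` are hypothetical, and the exact warning text is an assumption; only the `[WARN]` prefix, the stderr destination, and the 64-character threshold come from the entries above.

```python
import sys

FULL_SHA256_HEX_LEN = 64


def schema_hash_matches(expected: str, actual: str) -> bool:
    """Short manifest values match as prefixes; full-length ones exactly."""
    if len(expected) < FULL_SHA256_HEX_LEN:
        return actual.startswith(expected)
    return actual == expected


def check_schema_hash(expected: str, actual: str, strict: bool = False) -> bool:
    """Warn on drift by default; raise only when strict is requested."""
    if schema_hash_matches(expected, actual):
        return True
    message = f"schema_hash mismatch: expected {expected}, got {actual}"
    if strict:
        raise ValueError(message)
    print(f"[WARN] {message}", file=sys.stderr)
    return False
```

With this shape, a 12-character manifest hash validates against the full digest of a rebuilt parquet, while a caller who pins all 64 characters still gets exact comparison.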
0.1.2 - 2026-05-10
- All `uv sync` instructions across the docs (README, AGENTS, CONTRIBUTING, SKILLS, in-code install hints, and skill files) now pass `--inexact` so installing one extra no longer uninstalls the others. Without this, the documented sequential setup (`uv sync --extra tui` → bare `uv sync` → `uv sync --extra huggingface`) silently left the user with only the last extra installed, and subsequent builds of HF/Kaggle slugs failed with `ImportError`. uv has no project-level toggle for this (`--inexact` is per-command), so the fix is documentation-wide.
- TUI build action (`python -m scripts.pipeline.browse`, then `b` on a row) now runs `uv sync --extra <kaggle|huggingface> --inexact` automatically before the build subprocess when the dataset's `fetch.type` requires an upstream-fetch backend. Sync output streams into the same RichLog as the build; sync failure aborts the build with a visible exit code. Pure-HTTP and custom-fetch slugs see the same flow as before (no extra sync).
- `BuildConfirmModal` surfaces the sync command line above the build command line so the user sees both before confirming.
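
The sync-then-build sequencing could look roughly like the sketch below. It is an assumption-laden illustration: `sync_command`, `run_build`, and the `runner` injection point are hypothetical, and it assumes the `fetch.type` values are literally `kaggle` and `huggingface`, matching the extra names.

```python
from __future__ import annotations

import subprocess
import sys

# Assumed mapping from fetch.type to the uv extra it needs.
EXTRA_BY_FETCH_TYPE = {"kaggle": "kaggle", "huggingface": "huggingface"}


def sync_command(fetch_type: str) -> list[str] | None:
    """Return the uv sync command for this fetch type, or None if not needed."""
    extra = EXTRA_BY_FETCH_TYPE.get(fetch_type)
    if extra is None:
        return None
    # --inexact keeps already-installed extras in place.
    return ["uv", "sync", "--extra", extra, "--inexact"]


def run_build(slug: str, fetch_type: str, runner=subprocess.run) -> int:
    """Run the optional sync, then the build; return the first failing exit code."""
    sync = sync_command(fetch_type)
    if sync is not None:
        result = runner(sync)
        if result.returncode != 0:
            return result.returncode  # sync failure aborts the build
    build = [sys.executable, "-m", "scripts.pipeline.build", slug]
    return runner(build).returncode
```

Taking the runner as a parameter keeps the sequencing testable without invoking `uv` or the pipeline.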
0.1.1 - 2026-05-07
- README badges (CI status, latest release, license, citation).
- Convert stage now streams parquet batches via `pf.iter_batches() → RecordBatchReader → vxio.write` instead of materialising whole tables. Resolves `ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs` from pyarrow on slugs whose nested columns (`list<struct>`, `struct<bytes,…>`) would need to be chunked across multiple Arrow arrays. Re-enables Vortex output for `osm-germany-ways`, `ultrachat-200k`, `mmmu`, `websight-v01`, `peoples-speech-clean-validation`.
- `code-contests` Vortex skip re-diagnosed: not the chunked-array path; a separate upstream FSST i32-offset overflow on `list<string>` > 2 GB.
- `open-food-facts` description aligned with shipped output (currently a single `raw_json: string` column via `jsonl_as_string_parse`; VARIANT promotion deferred).
- PR template: dropped the "Test plan" checklist (CI runs the same gates on every PR; CONTRIBUTING.md documents them once).
- Agent-tooling docs (AGENTS.md, SKILLS.md, `raincloud-docs` skill) now flag `docs/snapshot.json` as load-bearing: it backs the TUI fallback and the row-count / file-size fallback for `datasets.md` regen. Stale "six derived docs" reference in AGENTS.md cleaned up to three.
- `docs/datasets.md` regeneration now falls back to `docs/snapshot.json` (top-level scratch, then `docs/v{schema_version}/snapshot.json` on a fresh clone) for slugs whose parquet isn't built locally. Previously, partial-build regen would silently dash-out row counts and file sizes for any slug not on disk, destroying ground truth in the v1 snapshot. Snapshot regen now also captures `last_built_row_groups`. Five regression tests added in `tests/test_docs.py`.
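
The fallback order can be illustrated with a short sketch. The lookup order of the two snapshot paths comes from the entry above; everything else is assumed for illustration, including the function names, the `slug -> {"rows": ...}` JSON shape, and the `built_counts` mapping standing in for stats read from local parquets.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_snapshot(docs_dir: Path, schema_version: int) -> dict:
    """Load the scratch snapshot if present, else the versioned one."""
    candidates = [
        docs_dir / "snapshot.json",
        docs_dir / f"v{schema_version}" / "snapshot.json",
    ]
    for path in candidates:
        if path.is_file():
            return json.loads(path.read_text(encoding="utf-8"))
    return {}


def row_count_for(slug: str, built_counts: dict[str, int], snapshot: dict) -> int | None:
    """Prefer the locally built parquet's count; fall back to the snapshot."""
    if slug in built_counts:
        return built_counts[slug]
    entry = snapshot.get(slug)  # hypothetical shape: {"rows": int}
    return entry.get("rows") if entry else None
```

Returning `None` only when neither source knows the slug is what lets a partial-build regen keep previously recorded counts instead of dashing them out.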
0.1.0 - 2026-05-06
Initial public release.
Raincloud is a client-reproducible pipeline for building a curated catalog
of public datasets as analytics-ready Parquet + Vortex files. See
README.md for the user-facing overview,
AGENTS.md for the architecture, and
SKILLS.md for procedural playbooks.
This release bundles:
- The 7-stage build pipeline (fetch → extract → parse → transform → write → validate → convert) plus the optional opt-in hydrate stage.
- 249 dataset specs across 5 families (`direct`, `kaggle-upstream`, `nyc-tlc`, `public-bi`, `uci`).
- 24 named transform handlers covering CSV / Parquet / JSONL / XML / PBF / custom-format upstreams plus streaming variants for memory-constrained shapes.
- A read-only Textual TUI for browsing the catalog (`python -m scripts.pipeline.browse`, requires `--extra tui`).
- Per-dataset Vortex conversion via the `convert.vortex` flag.
- Apache License 2.0, with SPDX file headers on all Python sources.
- Governance: `SECURITY.md`, `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md` (Contributor Covenant 2.1), `DISCLAIMER.md` (AS IS posture, content and license disclaimers, dataset-removal reporting), and `HYDRATING.md` (policy for the optional hydrate stage).
- Tooling: `ruff` lint (rules `E`, `F`, `W`, `I`) + GitHub Actions CI (`.github/workflows/ci.yml`) running lint, manifest validation, and `pytest` on every push and PR to `develop`.
- Dataset-removal issue template (`.github/ISSUE_TEMPLATE/dataset-removal.yml`): structured form for the channel `DISCLAIMER.md` points readers at.
- Pull-request template (`.github/pull_request_template.md`) prompting for summary, test-plan checkbox list against the standard pre-PR gate, and change-type tags.
- `CITATION.cff`: GitHub-native citation metadata; surfaces the "Cite this repository" button in the repo sidebar with BibTeX / APA / Chicago exports.