perf: Speed up purge_table with manifest dedup, parallel deletion, and schema caching #3233
Open
damahua wants to merge 1 commit into apache:main from
Conversation
…elizing file deletion

Three changes to reduce `purge_table` wall time from ~7 s to ~0.13 s (54×) on a table with 200 snapshots:

1. Deduplicate manifests by path before iterating in `delete_data_files()`. The same manifest appears across many snapshots' manifest lists; for 200 snapshots this reduces 20,100 manifest opens to 200.
2. Parallelize file deletion using the existing `ExecutorFactory` `ThreadPoolExecutor`, matching the pattern already used for manifest reading in `plan_files()` and data file reading in `to_arrow()`. This aligns with the Java reference implementation (`CatalogUtil.dropTableData`), which also deletes files concurrently via a worker thread pool.
3. Cache Avro-to-Iceberg schema conversion and reader tree resolution. All manifests of the same type share the same Avro schema, but it was being JSON-parsed, converted, and resolved into a reader tree on every open. Uses an explicit `threading.Lock` for thread safety across all Python implementations.
Closes #
Rationale for this change
`purge_table` is significantly slower than necessary due to three compounding inefficiencies:

1. **Redundant manifest reads**: When iterating through snapshots to collect manifests, the same manifest file appears in every subsequent snapshot's manifest list. For a table with 200 snapshots, this means 20,100 manifest file opens for only 200 unique files (100× redundancy).
2. **Sequential file deletion**: Data files, manifest files, and metadata files are deleted one at a time in sequential loops. The Java reference implementation (`CatalogUtil.dropTableData`) already deletes files concurrently using a worker thread pool.
3. **Redundant Avro schema parsing**: Every manifest file open triggers JSON parsing of the embedded Avro schema, conversion to an Iceberg schema, and resolution of a reader tree, even though all manifests of the same type share the identical schema.
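The redundancy in point 1 can be sketched as follows. This is an illustrative example, not the actual patch: `SimpleNamespace` objects stand in for PyIceberg's snapshot and manifest classes, and `unique_manifest_paths` is a hypothetical helper.

```python
from types import SimpleNamespace

def unique_manifest_paths(snapshots):
    """Yield each manifest only once, keyed by its path."""
    seen = set()
    for snapshot in snapshots:
        for manifest in snapshot.manifests:
            if manifest.manifest_path not in seen:
                seen.add(manifest.manifest_path)
                yield manifest

# Simulate 200 snapshots where snapshot i re-lists manifests 0..i,
# as happens when each append adds one manifest to the list.
manifests = [SimpleNamespace(manifest_path=f"m{i}.avro") for i in range(200)]
snapshots = [SimpleNamespace(manifests=manifests[: i + 1]) for i in range(200)]

total_listed = sum(len(s.manifests) for s in snapshots)  # 20,100 opens without dedup
unique = list(unique_manifest_paths(snapshots))          # 200 opens with dedup
```

The triangular sum 1 + 2 + … + 200 = 20,100 is where the 100× redundancy figure comes from.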
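For point 3, the cache described below uses double-checked locking. This is a hedged sketch under the assumption that the cache key is the `(file_schema, read_schema, read_types, read_enums)` tuple mentioned in the change list; `build_reader` stands in for the parse/convert/resolve step and `cached_reader` is a hypothetical name.

```python
import threading

_reader_cache: dict = {}
_reader_cache_lock = threading.Lock()

def cached_reader(key, build_reader):
    """Double-checked locking: a lock-free fast path once the reader is
    cached; the lock only serializes the first build for a given key."""
    reader = _reader_cache.get(key)
    if reader is None:
        with _reader_cache_lock:
            reader = _reader_cache.get(key)  # re-check under the lock
            if reader is None:
                reader = build_reader()
                _reader_cache[key] = reader
    return reader
```

The explicit lock is what makes this safe on Python implementations without a GIL, which is the rationale the PR gives for not relying on CPython's dict atomicity alone.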
Benchmark
Table with 200 snapshots, 200 data files, 200 manifests (generated by appending 200 batches of 20K rows each):
`purge_table` mean (N=3): ~7 s before, ~0.13 s after (54× faster).

Changes
`pyiceberg/catalog/__init__.py`:

- `delete_data_files()`: Deduplicate manifests by `manifest_path` before iterating. Collect all unique data file paths, then delegate to `delete_files()`.
- `delete_files()`: Use `ExecutorFactory.get_or_create().map()` for concurrent deletion, the same `ThreadPoolExecutor` pattern already used in `plan_files()`, `to_arrow()`, and `_write_added_manifest()`.

`pyiceberg/avro/file.py`:

- Cache the resolved reader keyed by `(file_schema, read_schema, read_types, read_enums)`.
- Use `threading.Lock` with double-checked locking for thread safety across all Python implementations (not just CPython).
- Readers are stateless (`read()` takes a decoder argument, no mutation of `self`), so sharing cached readers across calls is safe.

Are these changes tested?
Yes:
Are there any user-facing changes?
No API changes.
`purge_table` produces the same result (all files deleted), just faster. The `delete_files` function now deletes concurrently instead of sequentially, with the same error handling (OSError logged as a warning, continuing with remaining files).
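The concurrent deletion with unchanged error handling can be sketched like this. A minimal illustration, not the patched `delete_files`: `FakeIO` and `delete_files_concurrently` are stand-ins for PyIceberg's FileIO and the real function.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)

def delete_files_concurrently(io, paths, max_workers=8):
    """Delete files via a worker pool; log OSError and keep going,
    matching the sequential error handling described above."""
    def _delete(path):
        try:
            io.delete(path)
        except OSError as exc:
            logger.warning("Failed to delete %s: %s", path, exc)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Drain the iterator so every deletion finishes before returning.
        list(pool.map(_delete, paths))

class FakeIO:
    """Stand-in FileIO: tracks deletions, fails on one path."""
    def __init__(self, paths):
        self.files = set(paths)
    def delete(self, path):
        if path == "bad":
            raise OSError("permission denied")
        self.files.discard(path)

io = FakeIO({"a", "b", "bad"})
delete_files_concurrently(io, ["a", "b", "bad"])
# One failure is logged as a warning; the other deletions still proceed.
```

Wrapping the per-file `try/except` inside the worker function is what preserves the original "log and continue" semantics under concurrency: one failed delete never cancels the rest of the pool's work.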