[Data] Add TPCH Q20, 21, 22 benchmark scripts to nightly tests #62333

Merged
goutamvenkat-anyscale merged 3 commits into ray-project:master from ryankert01:tpch-q20-21-22 on Apr 20, 2026

Conversation

@ryankert01
Member

Description

As title. All three queries follow the established patterns from the existing TPC-H benchmark suite.

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces Ray Data implementations for TPC-H queries 20, 21, and 22, along with their corresponding configurations in the nightly release test suite for scale factor 100. The review feedback highlights several performance optimization opportunities, specifically recommending against hardcoding low partition counts (like num_partitions=16) which can cause memory pressure at 100GB scale. Additionally, suggestions were made for Query 21 to push filters down before materialization and remove redundant filter steps to improve efficiency.
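The filter pushdown suggestion in the summary above can be illustrated with a small, library-free sketch. It is not taken from the PR's Query 21 script: the toy tables, column names, and helper functions are hypothetical and only show why filtering before a join shrinks the intermediate result while leaving the final rows unchanged.

```python
# Toy rows standing in for two tables; names and values are illustrative only.
orders = [{"order_id": i, "status": "F" if i % 4 == 0 else "O"} for i in range(1000)]
items = [{"order_id": i % 1000, "qty": i % 7} for i in range(5000)]


def join_then_filter(orders, items):
    """Join every item to its order first, then drop rows that fail the predicate."""
    by_id = {o["order_id"]: o for o in orders}
    joined = [{**it, **by_id[it["order_id"]]} for it in items if it["order_id"] in by_id]
    result = [row for row in joined if row["status"] == "F"]
    return result, len(joined)  # second value: size of the intermediate join output


def filter_then_join(orders, items):
    """Apply the predicate to orders first, so the join only sees surviving rows."""
    kept = {o["order_id"]: o for o in orders if o["status"] == "F"}
    joined = [{**it, **kept[it["order_id"]]} for it in items if it["order_id"] in kept]
    return joined, len(joined)


late_rows, late_intermediate = join_then_filter(orders, items)
early_rows, early_intermediate = filter_then_join(orders, items)

print(late_intermediate, early_intermediate)
print(late_rows == early_rows)
```

In this toy case both orderings return the same rows, but the filter-first version builds a quarter of the intermediate rows. Whether the same saving applies to the actual benchmark depends on the selectivity of its filters.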

ps_forest = partsupp.join(
    forest_parts,
    join_type="left_semi",
    num_partitions=16,
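For readers unfamiliar with the `left_semi` join type requested in the fragment above, here is a minimal plain-Python sketch of the semantics. It does not call Ray Data, and the join keys are an assumption: the fragment is truncated before any key argument, so the TPC-H column names `ps_partkey` and `p_partkey` are used purely for illustration, as is the helper `left_semi_join`.

```python
def left_semi_join(left, right, left_key, right_key):
    """Keep each left row that has at least one match on the right.

    No right-hand columns are added, and a left row is emitted at most once
    even when several right rows share its key.
    """
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in right_keys]


# Hypothetical sample rows; real partsupp and part tables have more columns.
partsupp_rows = [
    {"ps_partkey": 1, "ps_suppkey": 10},
    {"ps_partkey": 2, "ps_suppkey": 11},
    {"ps_partkey": 3, "ps_suppkey": 12},
]
forest_part_rows = [
    {"p_partkey": 1, "p_name": "forest green"},
    {"p_partkey": 1, "p_name": "forest brown"},  # duplicate key on the right side
    {"p_partkey": 3, "p_name": "forest khaki"},
]

ps_forest_rows = left_semi_join(partsupp_rows, forest_part_rows, "ps_partkey", "p_partkey")
print(ps_forest_rows)
```

The result contains only the `partsupp` rows for part keys 1 and 3, each once, with no `p_name` column attached.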
Contributor


medium

Hardcoding num_partitions=16 is likely too low for Scale Factor 100 (100GB), where tables like lineitem and partsupp contain hundreds of millions of rows. This can lead to excessively large partitions (several GBs each), causing memory pressure or underutilization of the cluster. It is generally better to let Ray Data automatically determine the number of partitions or set it to a much higher value (e.g., 200+) for this scale.
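The size concern in this comment comes down to simple division, sketched below. The helper `approx_partition_gb` is hypothetical, and the 100 GB figure is the nominal size of the whole Scale Factor 100 dataset rather than any single table, so the numbers are a rough upper bound that also ignores skew and the difference between on-disk and in-memory size.

```python
def approx_partition_gb(total_gb: float, num_partitions: int) -> float:
    """Average partition size if the data were spread evenly across partitions."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_gb / num_partitions


NOMINAL_SF100_GB = 100.0  # assumption: nominal dataset size, not a measured table size

for n in (16, 200):
    size = approx_partition_gb(NOMINAL_SF100_GB, n)
    print(f"num_partitions={n}: about {size:.2f} GB per partition")
```

Under these assumptions, 16 partitions average about 6.25 GB each, while 200 partitions average about 0.5 GB each, which is the gap the reviewer is pointing at.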

Member Author


It looks like a convention across the existing tests.

Comment thread release/nightly_tests/dataset/tpch/tpch_q21.py Outdated
Comment thread release/nightly_tests/dataset/tpch/tpch_q21.py Outdated
Comment thread release/nightly_tests/dataset/tpch/tpch_q21.py Outdated
Comment thread release/nightly_tests/dataset/tpch/tpch_q21.py Outdated
Comment thread release/nightly_tests/dataset/tpch/tpch_q22.py
@ryankert01 ryankert01 marked this pull request as ready for review April 3, 2026 17:57
@ryankert01
Member Author

Fixed in 44f1bbe.

@ryankert01
Member Author

cc @owenowenisme

@ray-gardener ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Apr 3, 2026
Signed-off-by: ryankert01 <ryankert01@gmail.com>
Signed-off-by: ryankert01 <ryankert01@gmail.com>
@owenowenisme owenowenisme added the go (add ONLY when ready to merge, run all tests) label Apr 19, 2026
Member

@owenowenisme owenowenisme left a comment


LGTM

@goutamvenkat-anyscale goutamvenkat-anyscale merged commit c33b607 into ray-project:master Apr 20, 2026
8 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…roject#62333)

## Description
As title. All three queries follow the established patterns from the
existing TPC-H benchmark suite

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: ryankert01 <ryankert01@gmail.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>

Labels

community-contribution: Contributed by the community
data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

3 participants