Performance Enhancement: Significant Optimization of `compare_dataset()` Function by shaikh-ma · Pull Request #497 · Crunch-io/scrunch

shaikh-ma · 2026-04-02T12:16:49Z

This Pull Request introduces substantial performance improvements to the compare_dataset() function, dramatically enhancing its efficiency when processing large-scale data comparisons.

Key Improvements: 🥇

The optimization reduces execution time from between 22 and 73 minutes to approximately 2 seconds when comparing large datasets. Benchmark testing demonstrates consistent performance even in edge cases, such as comparing a dataset against itself with zero modifications.

Impact: 📈

This enhancement represents a fundamental improvement in dataset comparison capabilities, enabling:

• Near-instantaneous processing of previously time-intensive operations
• Improved scalability for large-scale data workflows
• Enhanced user experience through significantly reduced wait times

The performance metrics indicate an improvement of two to three orders of magnitude, transforming compare_dataset() from a potential bottleneck into a highly efficient operation suitable for production environments at scale.

Performance Improvement ⚡

• 99.85% to 99.95% reduction in execution time
• 660x to 2,190x faster
• Roughly three orders of magnitude faster
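These figures can be checked from the timings reported under Snapshots. A quick sketch, assuming the rounded values of 73 and 22 minutes before and 2 seconds after; the helper name `speedup_stats` is made up for this illustration:

```python
import math


def speedup_stats(before_minutes, after_seconds):
    """Return (speedup factor, percent reduction, orders of magnitude)."""
    before_seconds = before_minutes * 60
    speedup = before_seconds / after_seconds
    reduction_pct = (1 - after_seconds / before_seconds) * 100
    return speedup, reduction_pct, math.log10(speedup)


# Timings taken from the two snapshot rows (rounded values).
for minutes in (73, 22):
    speedup, reduction, orders = speedup_stats(minutes, 2)
    print(
        f"{minutes} min -> 2 s: {speedup:,.0f}x faster, "
        f"{reduction:.2f}% reduction, about 10^{orders:.2f}"
    )
```

Because the inputs are rounded to whole minutes and seconds, the derived percentages are only as precise as those measurements.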

Snapshots 📷

| Master branch | This branch | Performance improvement |
| --- | --- | --- |
| (snapshot image) | (snapshot image) | From 73 minutes to 2 seconds; 2,190x faster; more than three orders of magnitude; 99.95% reduction in execution time |
| (snapshot image) | (snapshot image) | From 22 minutes to 2 seconds; 660x faster; nearly three orders of magnitude; 99.85% reduction in execution time |

Tests

Testing `compare_dataset()` on multiple datasets with 2,200+ variables
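A rough sketch of how a self-comparison at this scale could be timed. It does not reproduce the PR's actual test: the `diff_aliases` function is a stand-in for the real comparison, and the synthetic metadata structure is an assumption made for illustration.

```python
import time


def diff_aliases(metadata_a, metadata_b):
    """Stand-in comparison: aliases whose definitions differ or are missing."""
    return sorted(
        alias
        for alias in metadata_a.keys() | metadata_b.keys()
        if metadata_a.get(alias) != metadata_b.get(alias)
    )


# Synthetic metadata with 2,200 variables (made-up structure).
metadata = {
    f"var_{i}": {"type": "numeric", "name": f"Variable {i}"}
    for i in range(2200)
}

# Edge case from the description: compare a dataset against an identical copy.
start = time.perf_counter()
differences = diff_aliases(metadata, dict(metadata))
elapsed = time.perf_counter() - start

print(f"{len(metadata)} variables, {len(differences)} differences, {elapsed:.4f} s")
```

Timing an in-memory stand-in says nothing about the real function, which also has to fetch metadata; the harness only shows the shape of such a check.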

shaikh-ma · 2026-04-17T14:10:59Z

Hi @jjdelc, @aless10, 👋

I've just submitted this PR with some significant improvements to the compare_dataset() function that we think will make a real difference.

Data Processing users will now be able to handle their workflows more efficiently in one place: with the performance enhancements they can consolidate their tools and work faster, even with larger datasets, without relying on alternatives such as R for dataset comparison.

I've tested it against typical datasets, and the results are pretty promising. Would appreciate it if you could take a look when you have a moment and let me know your thoughts. 😄

Cheers!
Aamir

alexbuchhammer · 2026-04-23T05:33:12Z

Hey team - hope all is well! Could anyone take a look at the rewrite of the method? It would help our Python-based DP teams a ton, as it offers functionality and performance matching the R/rcrunch equivalent.

Cheers!
Alex

jjdelc

I am looking at the testing for this function and it's very minimal, only checking the empty case.

Would it be helpful to move this out of the dataset entity and factor out the logic as compare_datasets(metadata_a, metadata_b)? Then we could test it easily without having to mock the MutableDataset instances.
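A minimal sketch of what such a standalone function might look like. Only the name and signature come from the comment above; the metadata shape (a dict mapping variable alias to a dict of attributes) and the keys of the returned report are assumptions for illustration, not the structure scrunch actually uses:

```python
def compare_datasets(metadata_a, metadata_b):
    """Compare two plain-dict metadata snapshots (hypothetical shape).

    Each argument maps a variable alias to a dict of attributes,
    e.g. {"age": {"type": "numeric"}}.
    """
    aliases_a, aliases_b = set(metadata_a), set(metadata_b)
    shared = aliases_a & aliases_b
    return {
        "only_in_a": sorted(aliases_a - aliases_b),
        "only_in_b": sorted(aliases_b - aliases_a),
        # For shared aliases, record each attribute whose value differs
        # as a (value in a, value in b) pair.
        "changed": {
            alias: {
                key: (metadata_a[alias].get(key), metadata_b[alias].get(key))
                for key in sorted(set(metadata_a[alias]) | set(metadata_b[alias]))
                if metadata_a[alias].get(key) != metadata_b[alias].get(key)
            }
            for alias in sorted(shared)
            if metadata_a[alias] != metadata_b[alias]
        },
    }


# Plain dicts are enough to exercise the logic; no dataset mocks needed.
a = {"age": {"type": "numeric"}, "city": {"type": "text"}}
b = {"age": {"type": "categorical"}, "zip": {"type": "text"}}
report = compare_datasets(a, b)
print(report)
```

Because the function takes plain data rather than dataset objects, cases such as empty inputs, identical inputs, and one-sided variables can each be covered with a few literal dicts.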

shaikh-ma · 2026-04-27T12:19:13Z

Thanks @jjdelc ,

I've updated as per your feedback; the previous function now emits a DeprecationWarning and suggests using the new standalone function compare_datasets().
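The deprecation pattern described here can be sketched with the standard `warnings` module. The class below is a simplified stand-in, not the real scrunch `MutableDataset`, and the body of `compare_datasets` and the warning text are placeholders assumed for illustration:

```python
import warnings


def compare_datasets(metadata_a, metadata_b):
    """Stand-in for the standalone function; lists aliases that differ."""
    return sorted(
        alias
        for alias in set(metadata_a) | set(metadata_b)
        if metadata_a.get(alias) != metadata_b.get(alias)
    )


class MutableDataset:
    """Simplified stand-in, not the real scrunch class."""

    def __init__(self, metadata):
        self.metadata = metadata

    def compare_dataset(self, other):
        # stacklevel=2 attributes the warning to the caller's line,
        # which is where the deprecated call needs to be replaced.
        warnings.warn(
            "compare_dataset() is deprecated; use compare_datasets() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return compare_datasets(self.metadata, other.metadata)
```

Delegating from the old method to the new function keeps existing callers working while the comparison logic lives in one place.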

shaikh-ma added 6 commits April 2, 2026 17:45

Updating compare_dataset function

a59ef38

Update docstring

22587b8

pin cr.cube version for fixing failing tests coz of syntax errors

3d83ae2

fixing version for crcube

9f30654

Fixing for missing_rules

29d1e36

Fixing issue of variables with same name within different folders

3630a2f

shaikh-ma marked this pull request as ready for review April 15, 2026 08:23

Fix value

bed50b6

shaikh-ma changed the title ~~Optimizing compare_dataset function~~ Performance Enhancement: Significant Optimization of compare_dataset() Function Apr 16, 2026

jjdelc reviewed Apr 27, 2026

Comment thread scrunch/mutable_dataset.py Outdated

Updates as per feedback

ab39361

jjdelc approved these changes May 5, 2026

jjdelc merged commit 47fc1b8 into Crunch-io:master May 5, 2026
4 of 5 checks passed

jjdelc merged 8 commits into Crunch-io:master from shaikh-ma:task/enhance_compare_dataset_method

3 participants