Performance Enhancement: Significant Optimization of compare_dataset() Function #497

Merged
jjdelc merged 8 commits into Crunch-io:master from shaikh-ma:task/enhance_compare_dataset_method
May 5, 2026

@shaikh-ma shaikh-ma commented Apr 2, 2026

This Pull Request introduces substantial performance improvements to the compare_dataset() function, dramatically enhancing its efficiency when processing large-scale data comparisons.

Key Improvements: 🥇

The optimization reduces execution time from tens of minutes (22 and 73 minutes in the benchmarks below) to approximately 2 seconds when comparing large datasets. Benchmark testing shows consistent performance even in edge cases, such as comparing a dataset against itself with zero modifications.

Impact: 📈

This enhancement represents a fundamental improvement in dataset comparison capabilities, enabling:

  • Near-instantaneous processing of previously time-intensive operations
  • Improved scalability for large-scale data workflows
  • Enhanced user experience through significantly reduced wait times

The performance metrics indicate an improvement of two to three orders of magnitude, transforming compare_dataset() from a potential bottleneck into a highly efficient operation suitable for production environments at scale.

Performance Improvement

• 99.85% to 99.95% reduction in execution time
• 660x to 2,190x faster
• Roughly three orders of magnitude faster

Snapshots 📷

| Master branch | This branch | Performance improvement |
| --- | --- | --- |
| image | image | From 73 minutes to 2 seconds; 99.95% reduction in execution time; 2,190x faster; just over three orders of magnitude |
| image | image | From 22 minutes to 2 seconds; 99.85% reduction in execution time; 660x faster; nearly three orders of magnitude |
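
The figures above follow directly from the before and after timings. A quick check, assuming the rounded timings quoted in this comment (73 and 22 minutes before, about 2 seconds after); the helper name `speedup_stats` is made up for this illustration and is not part of scrunch:

```python
import math


def speedup_stats(before_seconds, after_seconds):
    """Return (speedup factor, percent reduction, orders of magnitude)."""
    speedup = before_seconds / after_seconds
    reduction_pct = (1 - after_seconds / before_seconds) * 100
    return speedup, reduction_pct, math.log10(speedup)


# Timings are the rounded values reported in the benchmark table.
for minutes in (73, 22):
    speedup, reduction, orders = speedup_stats(minutes * 60, 2)
    print(f"{minutes} min -> 2 s: {speedup:,.0f}x, "
          f"{reduction:.2f}% reduction, 10^{orders:.1f}")
# 73 min -> 2 s: 2,190x, 99.95% reduction, 10^3.3
# 22 min -> 2 s: 660x, 99.85% reduction, 10^2.8
```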

Tests

Testing compare_dataset() on multiple datasets with 2,200+ variables

image

@shaikh-ma shaikh-ma marked this pull request as ready for review April 15, 2026 08:23
@shaikh-ma shaikh-ma changed the title from "Optimizing compare_dataset function" to "Performance Enhancement: Significant Optimization of compare_dataset() Function" Apr 16, 2026

shaikh-ma commented Apr 17, 2026

Hi @jjdelc, @aless10, 👋

I've just submitted this PR with some significant improvements to the compare_dataset() function that we think will make a real difference.

Data Processing users will now be able to handle their workflows more efficiently: with these performance enhancements they can consolidate their tools and work faster, even with larger datasets, without relying on alternatives such as R for dataset comparison.

I've tested it against typical datasets, and the results are pretty promising. Would appreciate it if you could take a look when you have a moment and let me know your thoughts. 😄

Cheers!
Aamir

alexbuchhammer commented Apr 23, 2026

Hey team - hope all is well! Could anyone take a look at the rewrite of the method? It would help our Python-based DP teams a ton, as it offers functionality and performance matching the R/rcrunch equivalent.

Cheers!
Alex

@jjdelc jjdelc left a comment

I am looking at the tests for this function and they are very minimal, only checking the empty case.

Would it be helpful to move this out of the dataset entity and factor out the logic as compare_datasets(metadata_a, metadata_b)? Then we can test it easily without having to mock the MutableDataset instances.
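
A rough sketch of what such a factored-out function and its tests might look like. This is not the merged implementation: the metadata shape (a dict mapping variable alias to a dict with a "type" key) and the returned keys are assumptions made purely for illustration. The point is that a pure function over plain dicts can be tested without any mocking:

```python
def compare_datasets(metadata_a, metadata_b):
    """Compare two {alias: {"type": ...}} mappings (hypothetical shape)."""
    aliases_a, aliases_b = set(metadata_a), set(metadata_b)
    # Variables present on both sides whose declared types differ.
    type_mismatches = {
        alias: (metadata_a[alias]["type"], metadata_b[alias]["type"])
        for alias in aliases_a & aliases_b
        if metadata_a[alias]["type"] != metadata_b[alias]["type"]
    }
    return {
        "only_in_a": sorted(aliases_a - aliases_b),
        "only_in_b": sorted(aliases_b - aliases_a),
        "type_mismatches": type_mismatches,
    }


def test_identical_metadata_has_no_differences():
    meta = {"age": {"type": "numeric"}}
    assert compare_datasets(meta, meta) == {
        "only_in_a": [],
        "only_in_b": [],
        "type_mismatches": {},
    }


def test_reports_missing_and_mismatched_variables():
    a = {"age": {"type": "numeric"}, "city": {"type": "text"}}
    b = {"age": {"type": "categorical"}, "zip": {"type": "text"}}
    diff = compare_datasets(a, b)
    assert diff["only_in_a"] == ["city"]
    assert diff["only_in_b"] == ["zip"]
    assert diff["type_mismatches"] == {"age": ("numeric", "categorical")}
```

Set operations on the alias keys keep the comparison roughly linear in the number of variables, and the tests need nothing beyond literal dicts.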

Comment thread scrunch/mutable_dataset.py Outdated
@shaikh-ma
Contributor Author

Thanks @jjdelc,

I've updated the PR per your feedback: the previous function now emits a DeprecationWarning and suggests using the new standalone function compare_datasets().
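
For readers unfamiliar with the pattern, a deprecation shim of this kind can be sketched as follows. This is an illustration only, not the code merged in this PR: `MutableDatasetSketch`, its `metadata` attribute, the warning text, and the placeholder body of `compare_datasets` are all assumptions.

```python
import warnings


def compare_datasets(metadata_a, metadata_b):
    """Placeholder for the new standalone function (body is illustrative)."""
    return {"only_in_a": sorted(set(metadata_a) - set(metadata_b))}


class MutableDatasetSketch:
    """Hypothetical stand-in for MutableDataset holding plain metadata."""

    def __init__(self, metadata):
        self.metadata = metadata

    def compare_dataset(self, other):
        # Warn, then delegate so existing callers keep working.
        warnings.warn(
            "compare_dataset() is deprecated; use compare_datasets() instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return compare_datasets(self.metadata, other.metadata)


# Usage: the old method still returns a result but records a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = MutableDatasetSketch({"age": {}}).compare_dataset(
        MutableDatasetSketch({})
    )
```

Delegating from the old method to the new function keeps a single code path, so both entry points return the same result during the deprecation period.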

@jjdelc jjdelc merged commit 47fc1b8 into Crunch-io:master May 5, 2026
4 of 5 checks passed