Performance Enhancement: Significant Optimization of compare_dataset() Function#497
Conversation
compare_dataset functioncompare_dataset() Function
|
I've just submitted this PR with some significant improvements to the The Data Processing users will now be able to handle their workflows more efficiently directly - the performance enhancements mean they can consolidate their tools and work faster, even with larger datasets, without needing to rely on other alternatives like R, etc., for dataset comparision. I've tested it against typical datasets, and the results are pretty promising. Would appreciate it if you could take a look when you have a moment and let me know your thoughts. 😄 Cheers! |
|
Hey team - hope all is well! Could anyone take a look at the rewrite of the method? It would help our python-based DP teams a ton as it offers functionality and performance that is matching the R/ Cheers! |
jjdelc
left a comment
There was a problem hiding this comment.
I am looking at the testing for this function and it's very minimal only checking the empty case.
Would it be helpful to move this out of the dataset entity and factor out the logic as compare_datasets(metadata_a, metadata_b) then we can test it easily without having to mock the MutableDataset instances
|
Thanks @jjdelc , I've updated as per your feedback; the previous function will now give |
This Pull Request introduces substantial performance improvements to the compare_dataset() function, dramatically enhancing its efficiency when processing large-scale data comparisons.
Key Improvements: 🥇
The optimization delivers exceptional speed gains, reducing execution time from several minutes to approximately 2 seconds when comparing large datasets. Benchmark testing demonstrates consistent performance even in edge cases, such as comparing a dataset against itself with zero modifications.
Impact: 📈
This enhancement represents a fundamental improvement in dataset comparison capabilities, enabling:
compare_dataset()from a potential bottleneck into a highly efficient operation suitable for production environments at scale.Performance Improvement ⚡
• 99.95% reduction in execution time
• Performance improvement - 660X to 2,190X faster
• More than three orders of magnitude faster.
Snapshots 📷
•2,190x performance improvement
•Nearly three orders of magnitude faster
•99.95% reduction in execution time
• 99.85% reduction in execution time
• 660X performance improvement
• Two orders of magnitude faster
Tests
Testing
compare_dataset()on multiple dataset with 2200+ vairables