-
Notifications
You must be signed in to change notification settings - Fork 192
feat(csharp/src/Drivers/Databricks): Implement straggler download mitigation for CloudFetch #3637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
msrathore-db
wants to merge
14
commits into
apache:main
Choose a base branch
from
msrathore-db:straggle-download-claude-code
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 1 commit
Commits
Show all changes
14 commits
Select commit
Hold shift + click to select a range
25b4faf
Added design docs for implementation of straggle download mitigation
msrathore-db be15927
Updated doc to handle edge cases
msrathore-db 1dec164
feat(cloudfetch): Implement straggler download detection and mitigation
msrathore-db 2740b76
test(cloudfetch): Add tests for straggler download mitigation
msrathore-db 09ae6e7
feat(cloudfetch): Implement sequential download fallback using second…
msrathore-db 609059f
Fix straggler mitigation implementation issues
msrathore-db 7aa4a3f
Fix critical code review issues in straggler mitigation
msrathore-db 774412c
Add comprehensive E2E tests for straggler mitigation
msrathore-db 232f55c
test(cloudfetch): Add critical bug fix validation tests to E2E suite
msrathore-db d76482f
test(cloudfetch): Add E2E tests for straggler mitigation with mocked …
msrathore-db 982d29f
refactor(test): Move straggler tests to unit tests and reduce to 10 f…
msrathore-db 3e67347
feat(csharp): Implement straggler download mitigation for CloudFetch
msrathore-db 7ac9a7f
fix(csharp): Fix .NET Framework 4.7.2 compatibility for string.Split()
msrathore-db 8b9b4f7
Removed unnecessary stragglerDetector file
msrathore-db File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
69 changes: 69 additions & 0 deletions
69
csharp/src/Drivers/Databricks/Reader/CloudFetch/prompts.txt
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| PROMPTS FOR STRAGGLER DOWNLOAD MITIGATION LLD | ||
| ============================================== | ||
|
|
||
| Prompt 1: | ||
| --------- | ||
| I want to implement a functionality in the databricks ADBC driver. [SIMBA] Addressing the Straggling File Download Issue for Cloud Fetch | ||
|
|
||
| Overview | ||
| In Cloud Fetch mode, the driver uses a thread pool with 80 threads by default (set by MaxNumResultFileDownloadThreads) to download files using pre-signed URLs generated by the server. The server caps the amount of data in the set of URLs returned per fetch to 300 MB (set by MaxBytesPerFetchRequest, hard-capped by the server to 1GB) and each file has a maximum size of 20 MB (server side configuration). | ||
| File download | ||
| The driver receives a set of file links which are downloaded in parallel. Each such set of files is considered a batch and all files within the batch need to be successfully downloaded to move to the next one. If some of the file downloads fail, the driver re-attempts to download them after requesting new URLs from the server for a maximum of 10 times. The retry count is configurable by MaxConsecutiveResultFileDownloadRetries. | ||
|
|
||
| If one of the file downloads fails, the driver requests new URLs starting from the offset of that file. The files preceding the offset which were successfully downloaded are skipped. The files from higher offsets than the failed one that have been downloaded successfully are re-downloaded. Basically, all re-generated URLs are re-downloaded irrespective of their prior attempts. | ||
|
|
||
| The driver uses another knob to disable the parallel downloads and fall-back to sequential downloads EnableAsyncQueryResultDownload. | ||
| Pitfalls | ||
| Few customers reported issues with the parallel file download from Azure in which a single file would experience very low download speeds, roughly 10x slower than the other concurrent file downloads, i.e., in the order of KB. The file transfer would eventually complete, though the progress is very slow, leading to noticeable regressions. We've seen this issue rarely and we have not been successful in reproducing it. However, we observed the issue is isolated to a single file download and that subsequent batches typically complete without experiencing the issue again. | ||
| Proposed solution | ||
| Currently, the driver doesn't enforce a timeout nor cancels and retries file downloads that are slow. We would like to implement a strategy for re-trying the straggling file downloads. | ||
|
|
||
| Retry policy. This section explains how to identify a straggling file download. | ||
| The driver keeps track of how long each file transfer takes within a batch. Detecting a straggler is done based on a fresh calculation for the batch. To do so, the driver derives the download throughput for each of the files within a batch as the ratio between the time it takes to complete the download and its size. When at least a fraction of the file downloads within the batch have completed (e.g., 0.75), the driver identifies straggler downloads. To do so, it computes the median throughput across the completed file tasks. A straggler download is a download that takes longer than f x file_size x median_throughput + padding, where f is a straggler multiplier (e.g., 1.5) and the padding adds an extra buffer of a few seconds (e.g., 5 s). | ||
|
|
||
| Cancellation mechanism. This section explains how to cancel the file download. | ||
| The timeout cannot be set proactively, as the timeout value depends on runtime metrics such as the current progress of the file download. This is a limitation of the libcURL layer. Instead, the driver will cancel the download in between receiving chunks of the file and will re-attempt the download | ||
|
|
||
| Fallback policy. This section explains how to disable parallel downloads. | ||
| If a query experiences more than a predefined number of straggler file downloads, let the driver disable asynchronous download mode and continue to download the files within a batch sequentially. Apply only for the current query. | ||
|
|
||
| Configuration Default value Description | ||
| EnableStragglerDownloadMitigation 0 If 1, the driver timeouts and retries straggler downloads. Disabled by default. | ||
| StragglerDownloadMultiplier 1.5 How many times slower a file download needs to be to be considered a straggler. | ||
| StragglerDownloadQuantile 0.6 Fraction of downloads which must be completed before enabling straggler mitigation. | ||
| StraggleDownloadPadding 5s Extra buffer in seconds before declaring a file download is a straggler. | ||
| MaximumStragglersPerQuery 10 Maximum stragglers re-attempted per query before switching to sequential downloads. | ||
| EnableSynchronousDownloadFallback 0 If 1 & EnableStragglerDownloadMitigation, the driver falls-back automatically to sequential downloads if MaximumStragglersPerQuery is exceeded. Applies only to the current query. | ||
|
|
||
|
|
||
|
|
||
| This is a connection param of straggle download. This is implemented in ODBC and we want to implement this is ADBC databricks as well | ||
| . I want you to create a concise LLD doc for implementing this feature. Try to keep the number of classes minimal. Use DRY principles wherever possible. Keep the doc short. | ||
|
|
||
| Prompt 2: | ||
| --------- | ||
| Remove the details on testing from the design doc. Also make sure the variable and function naming is appropriate and defining enough. | ||
|
|
||
| Prompt 3: | ||
| --------- | ||
| Instead of one, create two docs. One which is sort of a summary and the other one refers to the integration. Refer to the PR. Also create a .txt that contains the prompts I give. https://github.com/apache/arrow-adbc/pull/3624 . There are a lot of comments on the PR. Learn from those comments on what they suggest and do not make those mistakes | ||
|
|
||
| Prompt 4: | ||
| --------- | ||
| For connection params, follow the general adbc repo structure. Make changes in the design doc to align with the existing implementation in the databricks ADBC C# driver | ||
|
|
||
| Prompt 5: | ||
| --------- | ||
| We're aligned. Is the logging pattern defined in the design doc aligned with the general logging pattern in cloudFetch? | ||
|
|
||
| Prompt 6: | ||
| --------- | ||
| Update the design doc accordingly | ||
|
|
||
| Prompt 7: | ||
| --------- | ||
| Why are we just using a single retry upon straggle identification. Instead we should just retry straggler and the remaining behaviour stays the same. Basically straggle retry should just be one of the retries which in a way ensures this download won't straggle the next time but there could be some other error so we'll still be following the standard retry policy just adding this one extra retry | ||
|
|
||
| Prompt 8: | ||
| --------- | ||
| Now add testing details to both the docs as well. Follow the structure from the current repo. Also remember to take care of the comments on this PR https://github.com/apache/arrow-adbc/pull/3624 and follow the right practises. I see there are two comments saying: "we don't need this level of detail in a design doc, in stead we should focus more on interface/contract between different class objects". "Focus on adding more class diagram and sequence diagram, etc, instead of putting big block of code into the design doc." Are we following these in our design docs? If not modify to follow this pattern | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.