Skip to content

Refactored checkpointing system#2595

Merged
devos50 merged 12 commits into
Tribler:develfrom
devos50:fix_resume_data
Nov 19, 2016
Merged

Refactored checkpointing system#2595
devos50 merged 12 commits into
Tribler:develfrom
devos50:fix_resume_data

Conversation

@devos50

@devos50 devos50 commented Oct 26, 2016

Copy link
Copy Markdown
Contributor

I've found out that our resume data is not written away which can be addressed to the checkpointing mechanism and the fact that we are not waiting on our shutdown deferred when closing Tribler from the twistd plugin.

In this PR, I've refactored the complete checkpointing mechanism to make it more simple and so it is using Twisted. Moreover, I've created a get_handle method for each download that returns a deferred that fires when the libtorrent handle is available and valid. This allowed me to remove the waitForHandleAndSynchronized method which was rather ugly.

Also, various Twisted issues have been fixed, i.e. when we don't wait for a deferred to fire.

Fixes #2531

@devos50 devos50 force-pushed the fix_resume_data branch 5 times, most recently from 1256f71 to 85c244b Compare October 27, 2016 19:18
@devos50

devos50 commented Oct 27, 2016

Copy link
Copy Markdown
Contributor Author

@qstokkink can you please check the first three commits of my PR to see whether it's not too disruptive for your work (since you're currently working with the download mechanism)? The work in this PR is not completely done yet since I have to fix some additional tests and add notifications for the state of the video player.

@qstokkink

Copy link
Copy Markdown
Contributor

@devos50 I don't see any issues at all 👍 .

@devos50

devos50 commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

@qstokkink great!

@devos50

devos50 commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

Now that we don't have a grace time when shutting down, it turns out that we are never waiting until the socks5 server is completely shut down (https://github.com/Tribler/tribler/blob/devel/Tribler/community/tunnel/Socks5/server.py#L319), leading to dirty reactor errors. Returning a deferred from this method requires the unload_community to become asynchronous, together with a part of Dispersy and Tribler. Doing this is already part of the work of @lfdversluis in Tribler/dispersy#481. Since this might be too much work for now, I'm thinking about implementing a temporary workaround where we wait blocking for a deferred. Very ugly solution but otherwise, I have to refactor many methods in Dispersy...

EDIT: it might not be as much work as I initially thought. The consequence is that the stop method of Dispersy will return a Deferred now.

@qstokkink

Copy link
Copy Markdown
Contributor

@devos50 I think a blocking_call_on_reactor_thread with a DeferredList would be most in line in what we have in the rest of Tribler (also the easiest to make non-blocking when the time comes).

@synctext

Copy link
Copy Markdown
Member

download.stop ()... etc.. Looking forward to testing the impact of all this on my Linux,Mac and Win boxes.

@devos50

devos50 commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

@qstokkink a DeferredList is indeed the best solution here, however, this makes the stop method asynchronous. In turn, this stop method is called by the unload_community in tunnel_community.py (https://github.com/Tribler/tribler/blob/devel/Tribler/community/tunnel/tunnel_community.py#L394). So if I place a yield before this call, it would make the unload_communities asynchronous. This would propagate through Dispersy.

Fortunately, the unload_communities appear to be only invoked by the stop method in dispersy.py so this refactoring would be manageable I think.

@devos50

devos50 commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

@synctext FYI, this PR significantly improves the speed of the test suite (on Mac, it went from 4 minutes to 3 minutes duration: https://jenkins.tribler.org/job/GH_Tribler_PR_tests_mac/3384/ vs https://jenkins.tribler.org/job/GH_Tribler_PR_tests_mac/3381/. But this comparison might a bit unfair since not all tests are passing yet...

@devos50

devos50 commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

It also improves the speed of a Tribler shutdown in general :)

@devos50

devos50 commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

Even worse: it turns out that we don't wait until Dispersy is correctly closed... (https://github.com/Tribler/tribler/blob/devel/Tribler/Core/APIImplementation/LaunchManyCore.py#L786 returns a Deferred, not a boolean)

@devos50

devos50 commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

Fixed both issues in #2601 so that PR should be merged first.

@devos50 devos50 force-pushed the fix_resume_data branch 3 times, most recently from 59ec003 to 631128e Compare November 3, 2016 09:20
@devos50

devos50 commented Nov 3, 2016

Copy link
Copy Markdown
Contributor Author

retest this please

@devos50

devos50 commented Nov 4, 2016

Copy link
Copy Markdown
Contributor Author

retest this please

@devos50

devos50 commented Nov 4, 2016

Copy link
Copy Markdown
Contributor Author

@ardhipoetra it turns out that there is a save_resume_data_failed_alert (http://www.rasterbar.com/products/libtorrent/manual.html) which is emitted when the resume data fails to write. We don't process this alert, causing the resume data deferred to not fire. Next time, please pay more attention to error handling so we can avoid these kind of bugs.

@devos50 devos50 force-pushed the fix_resume_data branch 2 times, most recently from b49dfc4 to 7a10fa6 Compare November 14, 2016 14:11
@devos50

devos50 commented Nov 14, 2016

Copy link
Copy Markdown
Contributor Author

retest this please

1 similar comment
@devos50

devos50 commented Nov 14, 2016

Copy link
Copy Markdown
Contributor Author

retest this please

@devos50

devos50 commented Nov 14, 2016

Copy link
Copy Markdown
Contributor Author

retest this please

1 similar comment
@synctext

Copy link
Copy Markdown
Member

retest this please

The resume data was not saved due to various bugs in the code. I refactored the complete checkpoint mechanism so it is more twisted-like. Moreover, we now wait until the checkpointing has been completed before continuing the shutdown procedure of Tribler.
In setUp and tearDown of tests.
@devos50

devos50 commented Nov 19, 2016

Copy link
Copy Markdown
Contributor Author

retest this please

@devos50

devos50 commented Nov 19, 2016

Copy link
Copy Markdown
Contributor Author

Seems stable!

@devos50 devos50 changed the title WIP: Refactored checkpointing system Refactored checkpointing system Nov 19, 2016
@devos50 devos50 merged commit 71b5242 into Tribler:devel Nov 19, 2016
@devos50 devos50 deleted the fix_resume_data branch November 19, 2016 13:47
@brussee

brussee commented Nov 20, 2016

Copy link
Copy Markdown
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

[devel] Errors logged to console

4 participants