Add a git based sync option #441 #722
Conversation
djmitche
left a comment
This is so cool! I would like to take the time to read and experiment with it, but a few initial reactions to the comments and the PR description:
- Optional configuration for the path to the `git` binary will probably be useful for someone, and should be easy to add.
- It looks like `remote` accepts the string "None"? Let's make that an `Option<String>` instead, if that's the case, or make the empty string the special case.
- For cleanup, two observations:
  - Git keeps all of the data anyway, so "deleting" a version is really just removing the name from the directory. So, I don't think deleting old versions has much impact.
  - When a replica syncs, it needs all versions since the last version it saw. Otherwise, it has to restore from the snapshot and lose any local data since it last sync'd.
  - So, it's beneficial to keep versions around for as long as is practical. The cloud servers keep a half-year's worth of versions (`src/server/cloud/server.rs`, `MAX_VERSION_AGE_SECS`), and I think that's reasonable here, too.
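For illustration, the `Option<String>` change suggested above might look like the following sketch. The type and method names here are hypothetical, not taken from the PR; the point is only that both "None" and the empty string map to a real `None`:

```rust
/// Hypothetical sketch of the suggested config shape: `remote` as an
/// `Option<String>` rather than the magic string "None".
#[derive(Debug, Clone)]
pub struct GitSyncConfig {
    /// Remote in the same format git accepts; `None` means local-only.
    pub remote: Option<String>,
}

impl GitSyncConfig {
    /// Translate a raw config string, treating "None" and the empty
    /// string as "no remote configured".
    pub fn from_raw_remote(raw: &str) -> Self {
        let remote = if raw.is_empty() || raw == "None" {
            None
        } else {
            Some(raw.to_string())
        };
        GitSyncConfig { remote }
    }
}
```

With this shape, downstream code matches on `Option` instead of comparing against a sentinel string.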
I'd be interested to know what @ryneeverett thinks, too!
Thanks for the review! I'll take a look at those. My reason behind the `git rm` was to make the globs when adding new versions faster, but I didn't realize how that interacted with restoration.

Hm, good point, but local globs are probably pretty fast even for hundreds of dentries.
djmitche
left a comment
A bunch of comments here, but all toward improving the implementation rather than show-stopping issues. I think most of what I've suggested is relatively straightforward, but if necessary some of it can be done in followup PRs.
One general observation: despite `Server` being an async trait, this invokes Git synchronously. I think that's fine for the expected use-cases for this sync model, and it's something that can be improved later if desired.
tl;dr: This looks great, and I look forward to merging it after some minor revisions!
```toml
# Support for sync to another SQLite database on the same machine
server-local = ["dep:rusqlite"]
# Support for sync via Git
server-git = ["git-sync"]
```
Why the two levels of features here? I think one would be sufficient. The cloud syncs have `cloud` because it enables some common functionality, but there's no such thing for git. So,

```toml
server-git = ["dep:serde_with", "dep:glob", "encryption"]
```

and update the `cfg(feature = ..)` in the code.
```rust
//!
//! - I haven't done any performance testing, but it seems reasonably quick for manual use.
//! - Currently it uses the same salt for all files. This isn't great security practice,
//!   but does seem to be what the other servers are doing.
```
This is a good point! In defense of the idea, the salt is used in the key derivation, and we use the same key for every file, so in that sense it's only used once. If there's further concern, let's open an issue about it -- I'm sure others would like to chime in too!
```rust
//!
//! Notes for Reviewers
//!
//! - I haven't done any performance testing, but it seems reasonably quick for manual use.
```
This seems fine. We have not focused on performance testing of the sync operations since most of the time is network-related.
```rust
//! create a 'task' branch and let TaskChampion manage that branch.
//! - This does support both defining a remote and having `local_only` mode set at the same
//!   time. The idea is that maybe the remote isn't ready yet, or is either temporarily or
//!   permanently down. Either way, you can use this in local mode in the mean time.
```
What are the risks here? In general, the more user-facing bits of the doc here might be better as docstrings on the configuration enum.
```rust
//!
//! - Since this shells out to git, it assumes that you have a reasonably functional git
//!   setup. I.e. 'git init', 'git add', 'git commit', etc. should just work.
//! - If you are using a remote, 'git push' and 'git pull' should work.
```
I suspect there's some room for more robustness here, such as disabling prompts. Maybe we can address that as we find issues, but it might be worth thinking about ahead of time.
I see that nothing needs to parse the output of a git command, so that simplifies things!
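One concrete way to disable prompts, sketched here as an illustration (the helper name is mine, not from the PR): git honors the `GIT_TERMINAL_PROMPT` environment variable, and setting it to `0` makes credential prompts fail fast instead of blocking a sync indefinitely.

```rust
use std::process::Command;

/// Sketch: build a git command that can never hang waiting for
/// interactive input. GIT_TERMINAL_PROMPT=0 causes git to error out
/// immediately rather than prompt for credentials on the terminal.
fn non_interactive_git(args: &[&str]) -> Command {
    let mut cmd = Command::new("git");
    cmd.args(args).env("GIT_TERMINAL_PROMPT", "0");
    cmd
}
```

For SSH remotes, a similar effect could come from setting `GIT_SSH_COMMAND` to pass `-oBatchMode=yes`, though that is a separate decision.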
```rust
}

/// Run a git command in a given directory, returning an error if it exits non-zero.
fn git_cmd(dir: &Path, args: &[&str]) -> Result<()> {
```
This ends up putting a lot of git's output in the cargo test output and probably in the taskwarrior output as well. What do you think of amending this function so that it only shows the output on an unexpected error, or logs it all to `log::debug` or something similar?
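The amendment suggested above might look like this sketch. It captures the child's stdout/stderr instead of inheriting them, and only surfaces the output on failure (error types are simplified to `String` here; the PR's `Result<()>` would differ):

```rust
use std::path::Path;
use std::process::Command;

/// Sketch: run a git command quietly, surfacing its output only on failure.
fn git_cmd_quiet(dir: &Path, args: &[&str]) -> Result<(), String> {
    let output = Command::new("git")
        .current_dir(dir)
        .args(args)
        .output() // captures stdout/stderr rather than inheriting them
        .map_err(|e| format!("failed to run git: {e}"))?;
    if output.status.success() {
        Ok(())
    } else {
        // Only on an unexpected error does the captured output reach the caller.
        Err(format!(
            "git {:?} failed: {}",
            args,
            String::from_utf8_lossy(&output.stderr)
        ))
    }
}
```

On success nothing is printed, so `cargo test` and taskwarrior output stay clean; a `log::debug!` call could additionally record the captured output in all cases.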
```rust
fs::create_dir_all(local_path)?;

// Check if path is already a git repo.
let is_repo = Command::new("git")
```
I think this doesn't use `git_cmd` because a nonzero exit status is expected, and similarly for `git checkout` below. Maybe `git_cmd` could be extended to support that situation?
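One way to extend the helper for this case, as a sketch (name and `String` error type are illustrative, not from the PR): return whether the command succeeded, treating a nonzero exit as an ordinary `Ok(false)` rather than an error.

```rust
use std::path::Path;
use std::process::Command;

/// Sketch: a git_cmd variant where a nonzero exit is an expected outcome,
/// reported as Ok(false) instead of an error.
fn git_cmd_check(dir: &Path, args: &[&str]) -> Result<bool, String> {
    let status = Command::new("git")
        .current_dir(dir)
        .args(args)
        .output() // capture output; only the exit status matters here
        .map_err(|e| format!("failed to run git: {e}"))?
        .status;
    Ok(status.success())
}
```

The `is_repo` probe then becomes something like `git_cmd_check(local_path, &["rev-parse", "--git-dir"])?`, and the `git checkout` case can use the same helper.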
```rust
git_cmd(
    local_path,
    &["clean", "-f", "--", "v-*", "snapshot", "meta"],
)?;
```
This happens at least twice and could be a helper function!
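The helper might be as small as this sketch (the function name is mine; the embedded `git_cmd` is a simplified stand-in for the PR's helper, with a `String` error type):

```rust
use std::path::Path;
use std::process::Command;

/// Minimal stand-in for the PR's git_cmd helper (error type simplified).
fn git_cmd(dir: &Path, args: &[&str]) -> Result<(), String> {
    let output = Command::new("git")
        .current_dir(dir)
        .args(args)
        .output()
        .map_err(|e| format!("failed to run git: {e}"))?;
    if output.status.success() {
        Ok(())
    } else {
        Err(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}

/// The suggested helper: one place for the repeated cleanup invocation,
/// removing untracked version/snapshot/meta files from the working tree.
fn clean_working_files(local_path: &Path) -> Result<(), String> {
    git_cmd(local_path, &["clean", "-f", "--", "v-*", "snapshot", "meta"])
}
```

Each call site then shrinks to `clean_working_files(local_path)?`, and any future change to the file patterns happens in one place.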
```rust
/// Fetch and fast-forward to the remote branch. No-op in local-only mode.
/// If the remote branch does not yet exist (e.g. fresh bare repo), this is also a no-op.
fn pull(&self) -> Result<()> {
```
This isn't really a pull, since it does a hard reset. Maybe `fn reset_to_remote`?
ryneeverett
left a comment
I don't see any mention of purging old versions or snapshots which suggests they will live on in git history indefinitely. This seems somewhat worse than other implementations which may not purge snapshots but at least truly delete versions. This doesn't necessarily need to be addressed in the initial implementation unless it informs the design. Have we thought about this?
I think there are several options for handling disk size with a git backend.
How confident are we that new replicas are slower to initialize? The assumption is that it is faster to download and index a snapshot than to traverse the additional versions it squashed, right?

Completely untested by me. I'm only basing it on the comments in the other servers' code and the TaskChampion book. But it does make sense that having to walk the entire history will be slower than starting from a reasonably recent point and only walking through updates since then. Though with also having to pull a full git repo, that may not be true.

I think we can push this particular problem into the future, rather than solving it here -- and I think keeping snapshots is the most future-compatible way to do that. Let's wrap this PR up and open an issue for bounding size.

If bounding size is actually going to happen then I agree that snapshots are essential. But if the problem is being pushed into the indefinite (possibly never) future then it might be better to reduce the impact of unbounded size.

That's fair, but I'll be optimistic (which is unusual for me) and say it's better to plan for the future being positive than to assume it will not. And, I worry that something like not including snapshots would leave us with library users that depend on the full history, meaning however well-intentioned future devs are, they can't fix this without breaking things. Realistically, this seems like something bite-sized that any of the three of us, given a bit of time, could do -- as could any motivated contributor. So if we see users suffering from excessive repo sizes, I think we'll see some pressure to fix it and that the fix will happen. Now, the WASM sync issues are another matter entirely...

If it is that easy then great. I haven't come up with any better ideas than the ones @carmiac listed, though, and it isn't clear that there is a robust option. I guess the worst-case scenario is that the client rebases and force-pushes the working tree as the initial commit? I was thinking that unbounded size might actually be tolerable for the vast majority of users who would choose git. I would see this as an alternative positive outcome.

I will say that for my use case, unbounded repo size doesn't really matter. If it ever gets too ridiculous, I could manually clean up the remote and then re-init the locals. It would be interesting to see what the average storage size per 1000 tasks is, to help get a handle on how much bike shedding is worth it. FWIW, in 2020 torvalds/linux took up about 4GB on disk with over a million commits. I think we'll be fine.

Frankly, I think your position supports my point -- we'll probably never be motivated to address this issue. So does it make sense to have the repo size grow many times faster in order to make it easier to stop the growth "someday maybe"?

I can see a path here where we don't store snapshots, but still look for them when initializing a replica. Then if we decide to do the followup work, we can start creating snapshots and ensure there is an active snapshot before deleting old data. However, that seems higher-risk than the alternative of landing this and following up (either soon or when there's user demand) with a mechanism for removing stale data. In particular, I don't think we have enough data to understand the performance of applying many, many versions when initializing a replica. Nor do we have enough data to be confident that storing versions but not snapshots is likely to provide enough space savings to make this a non-issue. Taskwarrior users have surprisingly large and frequently-changing task DBs. I'm inclined to land this PR more-or-less as-is, mark the issue for later followup, and see what happens.

Is there any more work needed for this?

Just rebase or merge to avoid the conflicts and I'm happy to merge to main.

Done!

Woo, we have a Git backend! Do you mind opening issues for the followups?
This adds a git backed sync server to TaskChampion.

Repo Layout

- `v-{parent_uuid}-{child_uuid}`, containing encrypted [HistorySegment] bytes.
- `snapshot`, containing a JSON wrapper around an encrypted full-state blob.
- A metadata file (`meta`) holds the latest version UUID and the encryption salt as JSON.

Writes

After each write (`add_version`, `add_snapshot`) the server stages the changed files, creates a commit, and pushes to the remote. If the push is rejected, the commit is rolled back and the caller receives an [AddVersionResult::ExpectedParentVersion] or an [Error] so it can retry.

After a snapshot is stored, [GitSyncServer::cleanup] automatically removes all version files whose history is now captured by the snapshot, keeping the repository compact.

Options

There are several configuration options.

- `branch` sets the branch to use for TaskChampion. This would let someone keep tasks alongside a project without dealing with history issues.
- `remote` sets the remote, in the same format that git uses.
- `local_only` makes `push` and `pull` no-ops. This could be used if the remote isn't ready yet, or is unavailable for some other reason.

General Notes for Reviewers