
Add a git based sync option #441 #722

Merged

djmitche merged 48 commits into GothenburgBitFactory:main from carmiac:git-sync on May 2, 2026

Conversation

@carmiac (Contributor) commented Apr 18, 2026

This adds a git-backed sync server to TaskChampion.

Repo Layout

  • Versions are stored as files named v-{parent_uuid}-{child_uuid}, containing encrypted [HistorySegment] bytes.
  • Snapshots are stored as a single file named snapshot, containing a JSON wrapper around an encrypted full-state blob.
  • Metadata (meta) holds the latest version UUID and the encryption salt as JSON.
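
As an illustration of that naming scheme (hypothetical helper functions, not the PR's actual code; the uuid crate's hyphenated Display form is assumed):

    use uuid::Uuid;

    /// Build the file name for a version, per the layout above.
    fn version_filename(parent: Uuid, child: Uuid) -> String {
        format!("v-{parent}-{child}")
    }

    /// Recover the parent and child UUIDs from a version file name.
    fn parse_version_filename(name: &str) -> Option<(Uuid, Uuid)> {
        let rest = name.strip_prefix("v-")?;
        // 36 hyphenated-UUID characters, a separating '-', then 36 more.
        let (parent, child) = (rest.get(..36)?, rest.get(37..)?);
        Some((parent.parse().ok()?, child.parse().ok()?))
    }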

Writes

After each write (add_version, add_snapshot), the server stages the changed files, creates a commit, and pushes to the remote. If the push is rejected, the commit is rolled back and the caller receives an [AddVersionResult::ExpectedParentVersion] or an [Error] so it can retry.
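
A minimal sketch of that flow, shelling out to the git CLI; the names here are hypothetical, and the PR's real code maps the rejected push onto its own error types:

    use std::path::Path;
    use std::process::Command;

    /// Run git in `repo`, reporting whether it exited zero.
    fn git_ok(repo: &Path, args: &[&str]) -> std::io::Result<bool> {
        Ok(Command::new("git").current_dir(repo).args(args).status()?.success())
    }

    /// Stage `files`, commit, and push; roll the commit back if the push is rejected.
    fn commit_and_push(repo: &Path, files: &[&str]) -> std::io::Result<()> {
        let mut add = vec!["add", "--"];
        add.extend_from_slice(files);
        if !(git_ok(repo, &add)? && git_ok(repo, &["commit", "-m", "taskchampion sync"])?) {
            return Err(std::io::Error::other("git add/commit failed"));
        }
        if !git_ok(repo, &["push"])? {
            // Another replica won the race: drop the local commit so the
            // caller can fetch the new state and retry.
            git_ok(repo, &["reset", "--hard", "HEAD~1"])?;
            return Err(std::io::Error::other("push rejected"));
        }
        Ok(())
    }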

After a snapshot is stored, [GitSyncServer::cleanup] automatically removes all version files whose history is now captured by the snapshot, keeping the repository compact.
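
A sketch of how those covered versions could be found, walking the parent chain backward from the snapshot's version UUID; the glob crate is assumed (it is a dependency of this feature), and the function name is hypothetical:

    use std::collections::HashMap;
    use std::path::{Path, PathBuf};

    /// Find the version files whose history the snapshot now covers.
    fn versions_covered_by_snapshot(repo: &Path, snapshot_version: &str) -> Vec<PathBuf> {
        // Index every `v-{parent}-{child}` file by its child UUID.
        let mut by_child: HashMap<String, (PathBuf, String)> = HashMap::new();
        let pattern = repo.join("v-*");
        for path in glob::glob(&pattern.to_string_lossy()).unwrap().flatten() {
            let name = path.file_name().unwrap().to_string_lossy().into_owned();
            // "v-" + 36-char parent UUID + "-" + 36-char child UUID
            if let (Some(parent), Some(child)) = (name.get(2..38), name.get(39..)) {
                by_child.insert(child.to_string(), (path.clone(), parent.to_string()));
            }
        }
        // Walk parents starting from the snapshot's version; each file found
        // along the way can be `git rm`'d and the removal committed.
        let mut covered = Vec::new();
        let mut cursor = snapshot_version.to_string();
        while let Some((path, parent)) = by_child.remove(&cursor) {
            covered.push(path);
            cursor = parent;
        }
        covered
    }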

Options

There are several configuration options; a hypothetical sketch of their shape follows the list.

  • branch sets the branch to use for TaskChampion. This lets someone keep tasks alongside an existing project without mixing histories.
  • remote sets the remote, in the same format git itself uses.
  • local_only makes push and pull no-ops. This could be used if the remote isn't ready yet, or is unavailable for some other reason.
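
Purely to make those options concrete, a hypothetical view (the PR's actual configuration type and field names may differ):

    /// Hypothetical shape of the configuration; illustrative only.
    pub struct GitSyncConfig {
        /// Branch for TaskChampion to manage, e.g. a dedicated "task" branch
        /// in an existing project repository.
        pub branch: String,
        /// Remote, in any form git itself accepts (a remote name or URL).
        pub remote: String,
        /// If true, push and pull become no-ops.
        pub local_only: bool,
    }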

General Notes for Reviewers

  • I haven't done any performance testing, but it seems reasonably quick for manual use. I didn't see a general performance testing module.
  • Currently it uses the same salt for all files. This isn't great security practice, but it does seem to be what the other servers are doing. I could add a per-file salt, at the cost of however long Cryptor::new() takes on each read and write.

@carmiac requested a review from djmitche as a code owner, April 18, 2026 05:33
@djmitche (Collaborator) left a comment


This is so cool! I would like to take the time to read and experiment with it, but a few initial reactions to the comments and the PR description:

  • Optional configuration for the path to the git binary will probably be useful for someone, and should be easy to add
  • It looks like remote accepts the string "None"? Let's make that an Option<String> instead, if that's the case, or make the empty string the special case.
  • For cleanup, a few observations:
    • Git keeps all of the data anyway, so "deleting" a version is really just removing the name from the directory. So, I don't think deleting old versions has much impact.
    • When a replica syncs, it needs all versions since the last version it saw. Otherwise, it has to restore from the snapshot and lose any local data since it last sync'd.
    • So, it's beneficial to keep versions around for as long as is practical. The cloud servers keep a half-year's worth of versions (src/server/cloud/server.rs, MAX_VERSION_AGE_SECS), and I think that's reasonable here too (sketched below).
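
For concreteness, a sketch of what that retention could look like against the working tree. The constant value is illustrative (the real one lives in src/server/cloud/server.rs), and since a fresh clone's file mtimes reflect checkout time, real code would likely use commit timestamps instead:

    use std::path::{Path, PathBuf};
    use std::time::{Duration, SystemTime};

    // Illustrative value, roughly half a year; see src/server/cloud/server.rs
    // for the cloud servers' actual constant.
    const MAX_VERSION_AGE_SECS: u64 = 180 * 24 * 3600;

    /// List version files older than the retention window.
    fn stale_version_files(repo: &Path) -> std::io::Result<Vec<PathBuf>> {
        let cutoff = SystemTime::now() - Duration::from_secs(MAX_VERSION_AGE_SECS);
        let mut stale = Vec::new();
        for entry in std::fs::read_dir(repo)? {
            let entry = entry?;
            if entry.file_name().to_string_lossy().starts_with("v-")
                && entry.metadata()?.modified()? < cutoff
            {
                stale.push(entry.path());
            }
        }
        Ok(stale)
    }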

I'd be interested to know what @ryneeverett thinks, too!

@carmiac (Contributor, Author) commented Apr 19, 2026

Thanks for the review! I'll take a look at those. My reason for the git rm was to make the globbing faster when adding new versions, but I didn't realize how that interacts with restoration.

@djmitche (Collaborator) commented:
Hm, good point, but local globs are probably pretty fast even for 100's of dentries.

@djmitche (Collaborator) left a comment

A bunch of comments here, but all toward improving the implementation rather than show-stopping issues. I think most of what I've suggested is relatively straightforward, but if necessary some of it can be done in followup PRs.

One general observation is, despite Server being an async trait, this invokes Git synchronously. I think that's fine for the expected use-cases for this sync model, and it's something that can be improved later if desired.

tl;dr: This looks great, and I look forward to merging it after some minor revisions!

Comment thread: Cargo.toml (outdated)
# Support for sync to another SQLite database on the same machine
server-local = ["dep:rusqlite"]
# Support for sync via Git
server-git = ["git-sync"]

Why the two levels of features here? I think one would be sufficient. The cloud syncs have cloud because it enables some common functionality, but there's no such thing for git. So,

server-git = ["dep:serde_with", "dep:glob", "encryption"]

and update the cfg(feature..) in the code.

Comment thread: src/server/gitsync/mod.rs (outdated)
//!
//! - I haven't done any performance testing, but it seems reasonably quick for manual use.
//! - Currently it uses the same salt for all files. This isn't great security practice,
//!   but does seem to be what the other servers are doing.

This is a good point! In defense of the idea, the salt is used in the key derivation, and we use the same key for every file, so in that sense it's only used once. If there's further concern, let's open an issue about it -- I'm sure others would like to chime in too!

Comment thread: src/server/gitsync/mod.rs (outdated)
//!
//! Notes for Reviewers
//!
//! - I haven't done any performance testing, but it seems reasonably quick for manual use.

This seems fine. We have not focused on performance testing of the sync operations since most of the time is network-related.

Comment thread: src/server/gitsync/mod.rs (outdated)
//! create a 'task' branch and let TaskChampion manage that branch.
//! - This does support both defining a remote and having `local_only` mode set at the same
//!   time. The idea is that maybe the remote isn't ready yet, or is either temporarily or
//!   permanently down. Either way, you can use this in local mode in the meantime.

What are the risks here? In general, the more user-facing bits of the doc here might be better as docstrings on the configuration enum.

Comment thread: src/server/gitsync/mod.rs (outdated)
//!
//! - Since this shells out to git, it assumes that you have a reasonably functional git
//!   setup. I.e. 'git init', 'git add', 'git commit', etc. should just work.
//! - If you are using a remote, 'git push' and 'git pull' should work.

I suspect there's some room for more robustness here, such as disabling prompts. Maybe we can address that as we find issues, but it might be worth thinking about ahead of time.

I see that nothing needs to parse the output of a git command, so that simplifies things!

Comment thread: src/server/gitsync/mod.rs (outdated)
}

/// Run a git command in a given directory, returning an error if it exits non-zero.
fn git_cmd(dir: &Path, args: &[&str]) -> Result<()> {

This ends up putting a lot of git's output in the cargo test output and probably in the taskwarrior output as well. What do you think of amending this function so that it only shows the output on an unexpected error, or logs it all to log::debug or something similar?
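
For example (a sketch only, with std::io::Error standing in for the crate's error type and the log crate assumed):

    use std::path::Path;
    use std::process::Command;

    /// Run a git command in a given directory, capturing its output; the
    /// output is logged at debug level and only surfaced on failure.
    fn git_cmd(dir: &Path, args: &[&str]) -> std::io::Result<()> {
        let output = Command::new("git").current_dir(dir).args(args).output()?;
        log::debug!(
            "git {:?}: {}{}",
            args,
            String::from_utf8_lossy(&output.stdout),
            String::from_utf8_lossy(&output.stderr)
        );
        if !output.status.success() {
            return Err(std::io::Error::other(format!(
                "git {:?} failed: {}",
                args,
                String::from_utf8_lossy(&output.stderr)
            )));
        }
        Ok(())
    }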

Comment thread: src/server/gitsync/mod.rs (outdated)
fs::create_dir_all(local_path)?;

// Check if path is already a git repo.
let is_repo = Command::new("git")

I think this doesn't use git_cmd because a nonzero exit status is expected, and similar for git checkout below. Maybe git_cmd could be extended to support that situation?
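
For example, a variant that reports the exit status rather than treating non-zero as an error (a sketch; the name is hypothetical):

    use std::path::Path;
    use std::process::Command;

    /// Like git_cmd, but a non-zero exit is not an error: the caller gets
    /// the status and decides what it means.
    fn git_cmd_status(dir: &Path, args: &[&str]) -> std::io::Result<bool> {
        Ok(Command::new("git")
            .current_dir(dir)
            .args(args)
            .output()? // capture rather than inherit the output
            .status
            .success())
    }

    // e.g.: let is_repo = git_cmd_status(local_path, &["rev-parse", "--git-dir"])?;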

Comment thread: src/server/gitsync/mod.rs (outdated)
Comment on lines +245 to +248
git_cmd(
    local_path,
    &["clean", "-f", "--", "v-*", "snapshot", "meta"],
)?;

This happens at least twice and could be a helper function!
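
Something like (a sketch, reusing the module's existing git_cmd and Result):

    /// Remove untracked copies of the sync files from the working tree.
    fn clean_sync_files(local_path: &Path) -> Result<()> {
        git_cmd(local_path, &["clean", "-f", "--", "v-*", "snapshot", "meta"])
    }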

Comment thread: src/server/gitsync/mod.rs (outdated)

/// Fetch and fast-forward to the remote branch. No-op in local-only mode.
/// If the remote branch does not yet exist (e.g. fresh bare repo), this is also a no-op.
fn pull(&self) -> Result<()> {

This isn't really a pull, since it does a hard reset. Maybe fn reset_to_remote?
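
Roughly (a sketch inside the server's impl block; field names are hypothetical, and the missing-remote-branch no-op is omitted):

    /// Fetch and hard-reset to the remote branch. No-op in local-only mode.
    fn reset_to_remote(&self) -> Result<()> {
        if self.local_only {
            return Ok(());
        }
        git_cmd(&self.local_path, &["fetch", &self.remote, &self.branch])?;
        git_cmd(&self.local_path, &["reset", "--hard", "FETCH_HEAD"])
    }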

Comment thread: src/server/gitsync/mod.rs
@ryneeverett (Collaborator) left a comment

I don't see any mention of purging old versions or snapshots which suggests they will live on in git history indefinitely. This seems somewhat worse than other implementations which may not purge snapshots but at least truly delete versions. This doesn't necessarily need to be addressed in the initial implementation unless it informs the design. Have we thought about this?

@carmiac (Contributor, Author) commented Apr 25, 2026

Sorry but I can't stop thinking about the disk size issue. Maybe snapshots should be disabled entirely for the git backend? Whereas they (eventually) save disk space on other backends, on git they only serve to increase the disk space and data transfer.

I think there are several options for handling disk size with a git backend.

  1. What it does now. Disk size increases unbounded on all replicas, but the implementation is relatively simple and doesn't cause a large slowdown for new replicas of existing servers.
  2. No snapshots. Disk size still increases unbounded, but at a slower rate. Implementation is even easier but new replicas of existing servers are slow to initialize.
  3. Shallow clones only. Local disk space is more or less bounded, but origin still grows unbounded. Will take a bit of thinking about how to handle local-only mode.
  4. Purge old files via 'git filter-repo'. All repos have a more-or-less bounded size, but push/pull starts to get dicey as the repo history is rewritten.
  5. Something tricky that combines 3 and 4, with only origin purging. It will take some thinking that I'm not capable of right now due to having a cold.
  6. Something tricky with re-initializing origin every so often. Also not capable of figuring that out right now.

@ryneeverett (Collaborator) commented Apr 25, 2026

  2. No snapshots. Disk size still increases unbounded, but at a slower rate. Implementation is even easier but new replicas of existing servers are slow to initialize.

How confident are we that new replicas are slower to initialize? The assumption is that it is faster to download and index a snapshot than to traverse the additional versions it squashed, right?

@carmiac (Contributor, Author) commented Apr 25, 2026

Completely untested by me; I'm only basing it on the comments in the other servers' code and the TaskChampion book. But it does make sense that having to walk the entire history would be slower than starting from a reasonably recent snapshot and only applying the updates since then. Though given that a new replica also has to pull the full git repo, that may not hold.

@djmitche (Collaborator) commented:

I think we can push this particular problem into the future, rather than solving it here -- and I think keeping snapshots is the most future-compatible way to do that. Let's wrap this PR up and open an issue for bounding size.

@ryneeverett (Collaborator) commented:

If bounding size is actually going to happen then I agree that snapshots are essential. But if the problem is being pushed into the indefinite (possibly never) future then it might be better to reduce the impact of unbound size.

@djmitche (Collaborator) commented:

That's fair, but I'll be optimistic (which is unusual for me) and say it's better to plan for the future being positive than to assume it will not. And, I worry that something like not including snapshots would leave us with library users that depend on the full history, meaning however well-intentioned future devs are, they can't fix this without breaking things.

Realistically, this seems like something bite-sized that any of the three of us, given a bit of time, could do -- as could any motivated contributor. So if we see users suffering from excessive repo sizes, I think we'll see some pressure to fix it and that the fix will happen.

Now, the WASM sync issues are another matter entirely...

@ryneeverett (Collaborator) commented:

If it is that easy then great. I haven't come up with any better ideas than the ones @carmiac listed, though, and it isn't clear that there is a robust option. I guess the worst-case scenario is the client rebases and force-pushes the working tree as the initial commit?

I was thinking that unbounded size might actually be tolerable for the vast majority of users who would choose git. I would see this as an alternative positive outcome.

@carmiac (Contributor, Author) commented Apr 26, 2026

I will say that for my use case, unbounded repo size doesn't really matter. If it ever gets too ridiculous, I could manually clean up the remote and then re-init the locals. It would be interesting to see what the average storage size per 1000 tasks is to help get a handle on how much bike shedding is worth it.

FWIW, in 2020 torvalds/linux took up about 4GB on disk with over a million commits. I think we'll be fine.

@ryneeverett (Collaborator) commented:

Frankly, I think your position supports my point -- we'll probably never be motivated to address this issue. So does it make sense to have the repo size grow many times faster in order to make it easier to stop the growth "someday maybe"?

@djmitche (Collaborator) commented:

I can see a path here where we don't store snapshots, but still look for them when initializing a replica. Then if we decide to do the followup work, we can start creating snapshots and ensure there is an active snapshot before deleting old data.

However, that seems higher-risk than the alternative of landing this and following up (either soon or when there's user demand) with a mechanism for removing stale data.

In particular, I don't think we have enough data to understand the performance of applying many, many versions when initializing a replica. Nor do we have enough data to be confident that storing versions but not snapshots is likely to provide enough space savings to make this a non-issue. Taskwarrior users have surprisingly large and frequently-changing task DBs.

I'm inclined to land this PR more-or-less as-is, mark the issue for later followup, and see what happens.

@carmiac (Contributor, Author) commented May 1, 2026

Is there any more work needed for this?

@djmitche (Collaborator) commented May 1, 2026

Just rebase or merge to avoid the conflicts and I'm happy to merge to main.

@carmiac (Contributor, Author) commented May 1, 2026

Done!

@djmitche merged commit d7fd376 into GothenburgBitFactory:main on May 2, 2026
20 checks passed
@djmitche (Collaborator) commented May 2, 2026

Woo, we have a Git backend!

Do you mind opening issues for the followups?
