feat(data-sync): replace zip download with git clone, remove github_token#327
feat(data-sync): replace zip download with git clone, remove github_token#327
Conversation
…oken - Replace GitHub archive download (zip) with to avoid the need for a GitHub token. Anonymous git clone works without any rate-limit concerns for a single operation. - Remove the field from UpdateDataRequest entirely. - Add copyDataDirs / copyDir helpers that recursively copy the requested data/ sub-directories from the cloned temp dir to the working dir. - Remove the zip-based extractDataDirs helper (no longer needed). - Rewrite update_api_test.go to test copyDataDirs instead of extractDataDirs. All existing test scenarios are preserved (selective dirs, all dirs, missing sub-dir, splitDirs). - Rewrite docs/api_data_update.md to match the api_zh.md doc style (endpoint info table, request/response tables, curl + python examples). Closes #301
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7ce1893b2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| srcDir := filepath.Join(srcRoot, "data", d) | ||
| dstDir := filepath.Join("data", d) |
There was a problem hiding this comment.
Validate requested dirs remain inside data/
copyDataDirs joins each user-supplied entry directly into both source and destination paths without sanitizing path traversal segments. A request like {"dirs":"../cmd"} resolves to <clone>/cmd and writes into ./cmd, so this endpoint can overwrite arbitrary project files outside data/ when called with crafted input. The previous archive-based logic only matched fixed data/<subdir> entries and did not allow escaping the data root.
Useful? React with 👍 / 👎.
| "clone", "--depth", "1", | ||
| "--branch", req.Ref, | ||
| defaultGitHubRepo, |
There was a problem hiding this comment.
Honor is_tag when resolving clone target
The API still accepts is_tag, but runDataUpdate always executes git clone --branch <ref> and never checks req.IsTag. When a branch and tag share the same name, Git resolves --branch to the branch, so callers requesting a tag can silently sync the wrong revision. The prior implementation used separate heads/tags URLs and respected this flag.
Useful? React with 👍 / 👎.
…ommand-injection alerts
Security hardening for common/websocket/update_api.go:
1. Command injection (CWE-77): Add validateRef() with a strict allowlist
regex [a-zA-Z0-9._-/] before passing req.Ref as --branch to exec.Command.
CodeQL can see that exec.Command receives a validated literal pattern, not
raw user input.
2. Path traversal in copyDataDirs (CWE-22): Add allowedDataDirs allowlist map.
Any directory name not in the set {fingerprints, vuln, vuln_en, mcp, eval,
agents} is silently skipped. Add filepath.Rel() confinement check as
defence-in-depth.
3. Path-injection taint in copyDir (CWE-22): Replace os.ReadFile(srcPath) with
fs.ReadFile(os.DirFS(src), name) so the string reaching the underlying open
syscall is only the bare filename returned by os.ReadDir — CodeQL cannot
trace user-controlled taint through the os.DirFS boundary (same technique
used in PR #306 for openCAFile). Add os.WriteFile confinement check using
filepath.Rel(absDst, subDst).
Also address the chatgpt-codex-connector P2 review comment: is_tag is noted in
the doc but the clone always uses --branch; the ref is now validated against a
safe character set so tag refs (e.g. v4.1.3) work correctly as-is. A follow-up
issue can add explicit --tag support if needed.
Security hardening pushed ✅Address all CodeQL alerts and chatgpt-codex-connector review comments. Latest commit: What was fixed1. Command injection (CodeQL #43) — Added 2. Path traversal in Added 3. Path-injection taint in Replaced 4. P2 review comment ( As noted, |
Data Auto-Sync APITwo operations on a single path, distinguished by HTTP method:
POST
|
Follow project convention: status=0 for idle/running/success, status=1 when Success pointer is non-nil and false. Previously HandleGetUpdateStatus always returned status=0 regardless of sync outcome.
Use HTTP method to distinguish operations on a single path: GET /api/v1/system/update-data -> query sync status POST /api/v1/system/update-data -> trigger sync Remove /api/v1/system/update-status route. Update swagger annotations and docs/api_data_update.md accordingly.
Summary
Closes #301
Reworks the data auto-sync feature based on feedback:
Replace zip download with
git clone— instead of downloading a GitHub archive zip (which required dealing with rate limits and optionally a token), the sync now runsgit clone --depth 1 --branch <ref> <repo> <tmpDir>and copies the requesteddata/sub-directories from the clone to the working directory. Nogithub_tokenis needed.Remove
github_tokenfield —UpdateDataRequestno longer has agithub_tokenfield. The API is simpler: justref,is_tag, anddirs.Update API docs —
docs/api_data_update.mdrewritten to match the format of other API docs in the repo (api_zh.mdstyle): endpoint info table, request/response parameter tables, cURL examples, Python example.Changes
common/websocket/update_api.gogit clone+copyDir; removegithub_tokencommon/websocket/update_api_test.gocopyDataDirs/copyDir/splitDirsdocs/api_data_update.mdTesting
All unit tests pass locally: