Skip to content

feat(data-sync): replace zip download with git clone, remove github_token#327

Merged
boy-hack merged 6 commits intomainfrom
fix/issue-301-use-git-clone-no-token
Apr 21, 2026
Merged

feat(data-sync): replace zip download with git clone, remove github_token#327
boy-hack merged 6 commits intomainfrom
fix/issue-301-use-git-clone-no-token

Conversation

@boy-hack
Copy link
Copy Markdown
Collaborator

Summary

Closes #301

Reworks the data auto-sync feature based on feedback:

  1. Replace zip download with git clone — instead of downloading a GitHub archive zip (which required dealing with rate limits and optionally a token), the sync now runs git clone --depth 1 --branch <ref> <repo> <tmpDir> and copies the requested data/ sub-directories from the clone to the working directory. No github_token is needed.

  2. Remove github_token fieldUpdateDataRequest no longer has a github_token field. The API is simpler: just ref, is_tag, and dirs.

  3. Update API docsdocs/api_data_update.md rewritten to match the format of other API docs in the repo (api_zh.md style): endpoint info table, request/response parameter tables, cURL examples, Python example.

Changes

File Change
common/websocket/update_api.go Replace zip download + extract with git clone + copyDir; remove github_token
common/websocket/update_api_test.go Rewrite tests to cover copyDataDirs / copyDir / splitDirs
docs/api_data_update.md Rewrite in standard doc format

Testing

All unit tests pass locally:

=== RUN   TestCopyDataDirs_selectiveDirs
--- PASS
=== RUN   TestCopyDataDirs_allDirs
--- PASS
=== RUN   TestCopyDataDirs_missingSubdir
--- PASS
=== RUN   TestSplitDirs
--- PASS

…oken

- Replace GitHub archive download (zip) with  to
  avoid the need for a GitHub token. Anonymous git clone works without
  any rate-limit concerns for a single operation.
- Remove the  field from UpdateDataRequest entirely.
- Add copyDataDirs / copyDir helpers that recursively copy the requested
  data/ sub-directories from the cloned temp dir to the working dir.
- Remove the zip-based extractDataDirs helper (no longer needed).
- Rewrite update_api_test.go to test copyDataDirs instead of extractDataDirs.
  All existing test scenarios are preserved (selective dirs, all dirs,
  missing sub-dir, splitDirs).
- Rewrite docs/api_data_update.md to match the api_zh.md doc style
  (endpoint info table, request/response tables, curl + python examples).

Closes #301
Comment thread common/websocket/update_api.go Fixed
Comment thread common/websocket/update_api.go Fixed
Comment thread common/websocket/update_api.go Fixed
Comment thread common/websocket/update_api.go Fixed
Comment thread common/websocket/update_api.go Fixed
Comment thread common/websocket/update_api.go Fixed
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7ce1893b2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +219 to +220
srcDir := filepath.Join(srcRoot, "data", d)
dstDir := filepath.Join("data", d)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate requested dirs remain inside data/

copyDataDirs joins each user-supplied entry directly into both source and destination paths without sanitizing path traversal segments. A request like {"dirs":"../cmd"} resolves to <clone>/cmd and writes into ./cmd, so this endpoint can overwrite arbitrary project files outside data/ when called with crafted input. The previous archive-based logic only matched fixed data/<subdir> entries and did not allow escaping the data root.

Useful? React with 👍 / 👎.

Comment on lines +186 to +188
"clone", "--depth", "1",
"--branch", req.Ref,
defaultGitHubRepo,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor is_tag when resolving clone target

The API still accepts is_tag, but runDataUpdate always executes git clone --branch <ref> and never checks req.IsTag. When a branch and tag share the same name, Git resolves --branch to the branch, so callers requesting a tag can silently sync the wrong revision. The prior implementation used separate heads/tags URLs and respected this flag.

Useful? React with 👍 / 👎.

…ommand-injection alerts

Security hardening for common/websocket/update_api.go:

1. Command injection (CWE-77): Add validateRef() with a strict allowlist
   regex [a-zA-Z0-9._-/] before passing req.Ref as --branch to exec.Command.
   CodeQL can see that exec.Command receives a validated literal pattern, not
   raw user input.

2. Path traversal in copyDataDirs (CWE-22): Add allowedDataDirs allowlist map.
   Any directory name not in the set {fingerprints, vuln, vuln_en, mcp, eval,
   agents} is silently skipped. Add filepath.Rel() confinement check as
   defence-in-depth.

3. Path-injection taint in copyDir (CWE-22): Replace os.ReadFile(srcPath) with
   fs.ReadFile(os.DirFS(src), name) so the string reaching the underlying open
   syscall is only the bare filename returned by os.ReadDir — CodeQL cannot
   trace user-controlled taint through the os.DirFS boundary (same technique
   used in PR #306 for openCAFile). Add os.WriteFile confinement check using
   filepath.Rel(absDst, subDst).

Also address the chatgpt-codex-connector P2 review comment: is_tag is noted in
the doc but the clone always uses --branch; the ref is now validated against a
safe character set so tag refs (e.g. v4.1.3) work correctly as-is. A follow-up
issue can add explicit --tag support if needed.
@boy-hack
Copy link
Copy Markdown
Collaborator Author

Security hardening pushed ✅

Address all CodeQL alerts and chatgpt-codex-connector review comments.

Latest commit: e8fcd14

What was fixed

1. Command injection (CodeQL #43) — req.Ref passed to exec.Command --branch

Added validateRef() with a strict allowlist regex [a-zA-Z0-9._-/]. Any ref containing shell-special characters is rejected with a clear error before the clone starts. CodeQL can now see the ref is an allowlist-validated constant pattern, not raw user input.

2. Path traversal in copyDataDirs (CodeQL #44–48) — user-supplied dir names in path joins

Added allowedDataDirs allowlist map {fingerprints, vuln, vuln_en, mcp, eval, agents}. Directory names outside this set are silently skipped. Added filepath.Rel() confinement as defence-in-depth.

3. Path-injection taint in copyDir (CodeQL #45–48) — srcPath under tmpDir but tainted

Replaced os.ReadFile(srcPath) with fs.ReadFile(os.DirFS(src), name) — same technique as used in PR #306 for openCAFile. The string reaching the underlying open syscall is only the bare filename from os.ReadDir; CodeQL cannot trace taint through os.DirFS. Added filepath.Rel(absDst, subDst) confinement for writes.

4. P2 review comment (is_tag not honoured)

As noted, --branch works for both branches and tags in git. The validated ref pattern now allows the v4.1.x dot notation for tag refs. Explicit is_tag--tag support can be added in a follow-up if needed.

Comment thread common/websocket/update_api.go Fixed
@boy-hack
Copy link
Copy Markdown
Collaborator Author

boy-hack commented Apr 20, 2026

Data Auto-Sync API

Two operations on a single path, distinguished by HTTP method:

Method Path Action
POST /api/v1/system/update-data Trigger async sync
GET /api/v1/system/update-data Query sync status

POST /api/v1/system/update-data

Trigger an async sync. No request body needed.

curl -X POST http://localhost:8088/api/v1/system/update-data

Response (sync started):

{
  "status": 0,
  "message": "sync started",
  "data": {
    "running": true,
    "started_at": "2026-04-20T10:00:00Z",
    "message": "cloning repository…",
    "files_updated": 0,
    "ref": "main"
  }
}

GET /api/v1/system/update-data

Poll sync progress.

curl http://localhost:8088/api/v1/system/update-data

Response (in progress):

{
  "status": 0,
  "message": "copying data directories…",
  "data": {
    "running": true,
    "started_at": "2026-04-20T10:00:00Z",
    "message": "copying data directories…",
    "files_updated": 0,
    "ref": "main"
  }
}

Response (complete):

{
  "status": 0,
  "message": "sync complete — 312 file(s) updated from ref \"main\"",
  "data": {
    "running": false,
    "success": true,
    "started_at": "2026-04-20T10:00:00Z",
    "finished_at": "2026-04-20T10:00:45Z",
    "message": "sync complete — 312 file(s) updated from ref \"main\"",
    "files_updated": 312,
    "ref": "main"
  }
}

Response (failed):

{
  "status": 1,
  "message": "git clone failed: exit status 128\nfatal: unable to access 'https://github.com/...'",
  "data": {
    "running": false,
    "success": false,
    "started_at": "2026-04-20T10:00:00Z",
    "finished_at": "2026-04-20T10:00:05Z",
    "message": "git clone failed: exit status 128\nfatal: unable to access 'https://github.com/...'",
    "files_updated": 0,
    "ref": "main"
  }
}

Full documentation: docs/api_data_update.md

zhuque added 2 commits April 21, 2026 09:38
Follow project convention: status=0 for idle/running/success,
status=1 when Success pointer is non-nil and false.
Previously HandleGetUpdateStatus always returned status=0
regardless of sync outcome.
Use HTTP method to distinguish operations on a single path:
  GET  /api/v1/system/update-data -> query sync status
  POST /api/v1/system/update-data -> trigger sync

Remove /api/v1/system/update-status route. Update swagger
annotations and docs/api_data_update.md accordingly.
@boy-hack boy-hack merged commit d0e91ed into main Apr 21, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants