Skip to content

fix(docker): supervise Java process in entrypoints instead of tail -f /dev/null#3051

Merged
imbajin merged 2 commits into
apache:masterfrom
bitflicker64:fix/docker-entrypoints-supervision
Jun 8, 2026
Merged

fix(docker): supervise Java process in entrypoints instead of tail -f /dev/null#3051
imbajin merged 2 commits into
apache:masterfrom
bitflicker64:fix/docker-entrypoints-supervision

Conversation

@bitflicker64
Copy link
Copy Markdown
Contributor

Purpose of the PR

All three docker-entrypoint.sh files used tail -f /dev/null to keep the container alive. When Java crashed, tail kept running and the container stayed up with no Java inside — Docker restart policy never fired.

Main Changes

PD and Store: replace daemon start + tail -f /dev/null with -d false, which triggers the foreground branch added in #3047. exec java replaces the shell with Java directly, so the entrypoint blocks and exits with Java's real exit code when Java exits.

Server: keep daemon start (wait-partition.sh must run after Java is already up in the background). Replace tail -f /dev/null with tail --pid=$(cat ./bin/pid) -f /dev/null. When Java exits, tail --pid returns, entrypoint exits 1, restart policy fires.

Note: server entrypoint always exits 1 — start-hugegraph.sh calls disown before returning so Java is not a child of the entrypoint shell and its real exit code is not available. This is sufficient for --restart=on-failure.

Component Before After
PD daemon start + tail -f /dev/null -d falseexec java blocks
Store daemon start + tail -f /dev/null -d falseexec java blocks
Server daemon start + tail -f /dev/null daemon start + tail --pid supervision

Verifying these changes

  • Need tests and can be verified as follows:
    • docker run --restart=on-failurekill -9 Java inside container → container exits and Docker restarts it automatically
    • Verified locally: 3 kill-9 tests on server container, PD restart loop confirmed

Does this PR potentially affect the following parts?

  • Dependencies
  • Modify configurations
  • The public API
  • Other affects
  • Nope

Documentation Status

  • Doc - TODO — README/docs update for Docker deployment lifecycle will follow in a separate PR

… /dev/null

Problem: all three docker-entrypoint.sh files used tail -f /dev/null to
keep the container alive. When Java crashed, tail kept running and the
container stayed up with no Java inside — Docker restart policy never fired.

PD and Store fix: replace daemon start + tail -f /dev/null with
-d false, which triggers the foreground branch added in chunks 2-3.
exec java replaces the shell with Java directly, so the entrypoint
blocks and exits with Java's real exit code when Java exits.

Server fix: keep daemon start (wait-partition.sh must run after Java is
up). Replace tail -f /dev/null with tail --pid=$(cat ./bin/pid) -f
/dev/null. When Java exits, tail --pid returns, entrypoint exits 1,
restart policy fires.

Note: server entrypoint always exits 1 (cannot get Java's real exit code
since start-hugegraph.sh calls disown before returning). This is
sufficient for --restart=on-failure.

Verified locally: kill -9 Java inside running container -> container
exits -> Docker restarts it automatically (tested 3 times for server,
PD restart loop confirmed for pd).

Related to: apache#3043
@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. ci-cd Build or deploy labels Jun 5, 2026
Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: no. Summary: server Docker entrypoint still needs graceful shutdown forwarding. Evidence: static review of hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh; bash -n passed for all three changed entrypoints.

tail -f /dev/null
PID=$(cat ./bin/pid 2>/dev/null || true)
if [[ -n "$PID" ]]; then
tail --pid="$PID" -f /dev/null
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Forward shutdown to HugeGraphServer

Evidence: PD and Store now execute their foreground scripts with -d false, but the server entrypoint still starts start-hugegraph.sh in daemon mode and then blocks on tail --pid="$PID" -f /dev/null at this line. Docker sends SIGTERM to PID 1, so this shell/tail path can exit without signaling the Java PID; the server then waits for Docker's forced kill instead of a graceful shutdown.

Please add a TERM/INT trap that kills and waits for $PID, or otherwise run the server foreground path after the post-start checks, so docker stop shuts HugeGraphServer down cleanly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch added a TERM/INT trap before the tail --pid block in f20b8a4:

PID=$(cat ./bin/pid 2>/dev/null || true)
if [[ -n "$PID" ]]; then
    trap 'kill -TERM "$PID" 2>/dev/null; while kill -0 "$PID" 2>/dev/null; do sleep 1; done; exit 0' TERM INT
    tail --pid="$PID" -f /dev/null
    exit 1
fi

When dumb-init forwards SIGTERM from docker stop, the trap fires: sends SIGTERM to Java, polls with kill -0 until Java exits cleanly, then exits 0. The tail --pid crash-detection path still exits 1 so restart policy fires on a crash.

tail --pid does not forward signals to the supervised process. Add a
TERM/INT trap that sends SIGTERM to Java and polls with kill -0 until
Java exits cleanly before the entrypoint exits.

Addresses review feedback from imbajin on apache#3049.
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.25%. Comparing base (1f61c48) to head (f20b8a4).

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3051       +/-   ##
=============================================
+ Coverage     36.11%   93.25%   +57.14%     
+ Complexity      338       65      -273     
=============================================
  Files           803        9      -794     
  Lines         68234      267    -67967     
  Branches       8964       22     -8942     
=============================================
- Hits          24640      249    -24391     
+ Misses        40936        8    -40928     
+ Partials       2658       10     -2648     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from my side. The previous server SIGTERM concern is addressed at the current head: PD/Store now use the foreground exec path, and Server keeps the hstore post-start check while supervising the Java PID with crash detection plus TERM/INT forwarding. CI is green and I don’t see remaining blockers.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 8, 2026
@imbajin imbajin merged commit 817887d into apache:master Jun 8, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-cd Build or deploy lgtm This PR has been approved by a maintainer size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants