[core] Pin legacy OOM killing policy in test_memory_pressure fixtures #62922
elliot-barn wants to merge 2 commits into
Conversation
PR #62643 ([Core] (Resource Isolation 12/n) Switch group killing policy to by time killing policy) flipped the default OOM killing policy from the legacy owner-group, single-worker policy to the new time-based, multi-worker policy. Two tests in test_memory_pressure.py assert on the legacy semantics:

- test_memory_pressure_kill_newest_worker asserts exactly one named actor survives after the OOM kill
- test_newer_task_not_retriable_kill_older_retriable_task_first asserts the older retriable task is the one chosen to be killed

Under the new policy these assertions aren't guaranteed, and the mempress CI job on compile-req-compiled-py313 fails all 3 bazel attempts (the raylet log itself prints the `RAY_worker_killing_policy_by_group` escape hatch).

Set `worker_killing_policy_by_group=True` in the `_system_config` of both `ray_with_memory_monitor` fixtures to pin the legacy policy these tests were written against. This is not specific to the py3.13 dep refresh: master inherits the same regression from PR #62643; fixing it here unblocks the refresh branch's CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Code Review
This pull request updates python/ray/tests/test_memory_pressure.py to explicitly set worker_killing_policy_by_group to True in test fixtures, ensuring tests maintain their original semantics after a default policy change. A review comment suggests adding a descriptive comment to one of the fixtures for consistency and maintainability.
| "task_oom_retries": 0, | ||
| "min_memory_free_bytes": -1, | ||
| "task_oom_retry_delay_base_ms": 0, | ||
| "worker_killing_policy_by_group": True, |
For consistency and maintainability, it would be beneficial to add a comment here explaining why worker_killing_policy_by_group is set to True, similar to the one in the ray_with_memory_monitor fixture. This will help future developers understand the context for this configuration.
# PR #62643 flipped the default OOM killing policy to the
# time-based multi-worker policy; these tests assert on
# single-worker, owner-group semantics.
"worker_killing_policy_by_group": True
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5049c59.
| "task_oom_retries": 0, | ||
| "min_memory_free_bytes": -1, | ||
| "task_oom_retry_delay_base_ms": 0, | ||
| "worker_killing_policy_by_group": True, |
Missed pinning legacy kill policy in inline ray.init test
Medium Severity
test_one_actor_max_lifo_kill_next_actor also asserts on legacy LIFO killing semantics (newest actor is killed, oldest survives) but its inline ray.init() _system_config was not updated with "worker_killing_policy_by_group": True. Since the default flipped to the time-based policy in PR #62643, this test is susceptible to the same regression the PR intends to fix — the assertions at lines 572–574 and 584–588 may fail under the new default policy.
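Applying the same pin there would be a one-key change. A hypothetical sketch (the inline `ray.init()` call is shown schematically, and the test's other config keys are not reproduced here):

```python
# Hypothetical sketch of Bugbot's suggested follow-up: add the same
# pin to the inline ray.init() _system_config used by
# test_one_actor_max_lifo_kill_next_actor. Only the new key is shown;
# the test's existing keys are elided.
inline_system_config = {
    # ...existing keys from the test's _system_config...
    # Pin the legacy LIFO (newest-worker-first) killing semantics
    # that the test's survival assertions depend on.
    "worker_killing_policy_by_group": True,
}
# Inside the test: ray.init(_system_config=inline_system_config)
```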
Hmm, the properties these tests are checking should be respected by the new policy as well. If the new policy fails these properties, we need to fix the new policy instead. I will take a look.
@elliot-barn The failing integration test checks policy-specific behavior that's already covered by unit tests. I've removed similar policy-specific tests in favor of unit tests, since the memory-pressure integration tests can be very environment-sensitive and flaky. Here's my PR that does this: #63004.
aslonnie left a comment
(for @ray-project/ray-core team to review)
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.

