Skip to content

[core] Pin legacy OOM killing policy in test_memory_pressure fixtures#62922

Open
elliot-barn wants to merge 2 commits into
masterfrom
elliot-barn-fix-test-memory-pressure-oom-policy
Open

[core] Pin legacy OOM killing policy in test_memory_pressure fixtures#62922
elliot-barn wants to merge 2 commits into
masterfrom
elliot-barn-fix-test-memory-pressure-oom-policy

Conversation

@elliot-barn
Copy link
Copy Markdown
Collaborator

PR #62643 ([Core] (Resource Isolation 12/n) Switch group killing policy to by time killing policy) flipped the default OOM killing policy from the legacy owner-group, single-worker policy to the new time-based, multi-worker policy. Two tests in test_memory_pressure.py assert on the legacy semantics:

  • test_memory_pressure_kill_newest_worker asserts exactly one named actor survives after the OOM kill
  • test_newer_task_not_retriable_kill_older_retriable_task_first asserts the older retriable task is the one chosen to be killed

Under the new policy these assertions aren't guaranteed, and the mempress CI job on compile-req-compiled-py313 fails all 3 bazel attempts (the raylet log itself prints the "RAY_worker_killing_policy _by_group" escape hatch).

Set worker_killing_policy_by_group=True in the _system_config of both ray_with_memory_monitor fixtures to pin the legacy policy these tests were written against. Not specific to the py3.13 dep refresh — master inherits the same regression from PR #62643; fixing here to unblock the refresh branch's CI.

PR #62643 ([Core] (Resource Isolation 12/n) Switch group killing
policy to by time killing policy) flipped the default OOM killing
policy from the legacy owner-group, single-worker policy to the new
time-based, multi-worker policy. Two tests in test_memory_pressure.py
assert on the legacy semantics:

  - test_memory_pressure_kill_newest_worker
    asserts exactly one named actor survives after the OOM kill
  - test_newer_task_not_retriable_kill_older_retriable_task_first
    asserts the older retriable task is the one chosen to be killed

Under the new policy these assertions aren't guaranteed, and the
mempress CI job on compile-req-compiled-py313 fails all 3 bazel
attempts (the raylet log itself prints the "RAY_worker_killing_policy
_by_group" escape hatch).

Set `worker_killing_policy_by_group=True` in the `_system_config`
of both `ray_with_memory_monitor` fixtures to pin the legacy policy
these tests were written against. Not specific to the py3.13
dep refresh — master inherits the same regression from PR #62643;
fixing here to unblock the refresh branch's CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@elliot-barn elliot-barn requested a review from a team as a code owner April 24, 2026 21:39
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates python/ray/tests/test_memory_pressure.py to explicitly set worker_killing_policy_by_group to True in test fixtures, ensuring tests maintain their original semantics after a default policy change. A review comment suggests adding a descriptive comment to one of the fixtures for consistency and maintainability.

"task_oom_retries": 0,
"min_memory_free_bytes": -1,
"task_oom_retry_delay_base_ms": 0,
"worker_killing_policy_by_group": True,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency and maintainability, it would be beneficial to add a comment here explaining why worker_killing_policy_by_group is set to True, similar to the one in the ray_with_memory_monitor fixture. This will help future developers understand the context for this configuration.

            # PR #62643 flipped the default OOM killing policy to the
            # time-based multi-worker policy; these tests assert on
            # single-worker, owner-group semantics.
            "worker_killing_policy_by_group": True

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Apr 25, 2026
@elliot-barn elliot-barn requested a review from aslonnie April 28, 2026 14:50
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5049c59. Configure here.

"task_oom_retries": 0,
"min_memory_free_bytes": -1,
"task_oom_retry_delay_base_ms": 0,
"worker_killing_policy_by_group": True,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed pinning legacy kill policy in inline ray.init test

Medium Severity

test_one_actor_max_lifo_kill_next_actor also asserts on legacy LIFO killing semantics (newest actor is killed, oldest survives) but its inline ray.init() _system_config was not updated with "worker_killing_policy_by_group": True. Since the default flipped to the time-based policy in PR #62643, this test is susceptible to the same regression the PR intends to fix — the assertions at lines 572–574 and 584–588 may fail under the new default policy.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5049c59. Configure here.

@elliot-barn elliot-barn added the go add ONLY when ready to merge, run all tests label Apr 28, 2026
@Kunchd
Copy link
Copy Markdown
Contributor

Kunchd commented Apr 28, 2026

Hmm, the properties the test are testing should be respected by the new policy as well. If the new policy is failing these properties, we will need to fix the new policy instead. I will take a look.

@Kunchd
Copy link
Copy Markdown
Contributor

Kunchd commented Apr 28, 2026

@elliot-barn The failing integration test tests for a policy specific behavior that's already covered in unit tests. I've removed similar policy specific tests in favor of unit test instead, as the memory pressure related integration tests can be very environment sensitive and flaky. Here's my PR that does this: #63004.

@aslonnie aslonnie changed the title [tests] Pin legacy OOM killing policy in test_memory_pressure fixtures [core] Pin legacy OOM killing policy in test_memory_pressure fixtures Apr 29, 2026
Copy link
Copy Markdown
Collaborator

@aslonnie aslonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for @ray-project/ray-core team to review)

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests stale The issue is stale. It will be closed within 7 days unless there are further conversation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants