
Remove redundant flaky integration test in favor of unit tests #63004

Merged
MengjinYan merged 4 commits into ray-project:master from Kunchd:deflake-test_mem_pressure on Apr 30, 2026

Conversation

@Kunchd (Contributor) commented Apr 28, 2026

Description

PR "[Core] (Resource Isolation 12/n) Switch group killing policy to by time killing policy" (#62643) enabled the new by-time killing policy by default, as opposed to the legacy by-group killing policy. This resulted in test_memory_pressure failures in postmerge. Our investigation found the following:

  • The integration test checks for policy-specific behaviors, whereas the memory pressure integration test suite should instead test the memory monitoring system's general ability to reduce memory pressure.
  • The failing integration test should be a unit test that exercises the killing policy's behavior directly.

In general, we prefer unit tests over integration tests for memory-threshold-sensitive tests, as the test environment can have a significant impact on the results, leading to flaky behavior.

This PR removes redundant integration tests whose policy-specific behaviors are already covered by the policies' unit tests, and introduces a new unit test for cases that were previously covered only by the integration tests. The following are the removed integration tests and their replacements:

Related issues

Additional information

test_memory_pressure run: https://buildkite.com/ray-project/postmerge/builds/17288

davik added 2 commits April 28, 2026 21:17
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from a team as a code owner April 28, 2026 21:43
@Kunchd Kunchd added the go add ONLY when ready to merge, run all tests label Apr 28, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes a significant number of Python-based memory pressure tests and associated fixtures from test_memory_pressure.py. Concurrently, it introduces a new C++ unit test in task_manager_test.cc to validate the task manager's behavior regarding finite OOM retries, ensuring it correctly handles the retry counter and eventually reports an OOM error when the limit is reached. I have no feedback to provide.
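As a rough, standalone illustration of the retry-budget behavior that new `task_manager_test.cc` case validates, here is a minimal Python sketch with made-up names; it is not the actual C++ test or Ray's task manager API:

```python
# Hedged sketch: `max_oom_retries` / `on_oom_failure` are illustrative names, not
# Ray's real API. It only shows "retry while budget remains, then report OOM".
class OomRetryTracker:
    def __init__(self, max_oom_retries: int):
        self.max_oom_retries = max_oom_retries
        self.oom_retries_used = 0

    def on_oom_failure(self) -> str:
        if self.oom_retries_used < self.max_oom_retries:
            self.oom_retries_used += 1
            return "retry"          # budget remains: resubmit the task
        return "report_oom_error"   # budget exhausted: surface the OOM error


def test_finite_oom_retries_eventually_report_error():
    tracker = OomRetryTracker(max_oom_retries=2)
    assert [tracker.on_oom_failure() for _ in range(3)] == [
        "retry",
        "retry",
        "report_oom_error",
    ]
```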

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Apr 29, 2026
@MengjinYan MengjinYan merged commit 6f95fe9 into ray-project:master Apr 30, 2026
6 checks passed
RitaKaniska added a commit to chichic21039/ray that referenced this pull request May 5, 2026
* [ci] Migrate LLM auto-select and multi-node compute configs to new schema (#62873)

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* [serve] Deflake test_haproxy_metrics against HAProxy soft-reload (#62930)

test_haproxy_metrics asserts
`haproxy_backend_http_responses_total{proxy="http-default",code="2xx"}
1` after one request. The counter is racy:
- HAProxy backend health checks can increment it above 1, and
- a HAProxyManager soft-reload (which fires on every backend config
change) can zero it in the new worker.

Also, CI failures are unreadable today because pytest truncates the
metrics body in `assert x in y` to "...Har...".

Fix: poll with wait_for_condition, send a request each iteration, accept
counter >= 1. Also dump full /metrics on timeout so the next failure is
debuggable.
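
A minimal sketch of that polling pattern (illustrative only; the endpoint URLs and timeout are assumptions, and the real test uses the `wait_for_condition` helper mentioned above rather than a hand-rolled loop):

```python
import time
import requests

APP_URL = "http://localhost:8000/"              # assumed Serve app endpoint
METRICS_URL = "http://localhost:9101/metrics"   # assumed HAProxy metrics endpoint
METRIC = 'haproxy_backend_http_responses_total{proxy="http-default",code="2xx"}'


def wait_for_2xx_counter(timeout_s: float = 30.0) -> None:
    """Poll /metrics, sending a request each iteration, until the counter is >= 1."""
    deadline = time.monotonic() + timeout_s
    body = ""
    while time.monotonic() < deadline:
        requests.get(APP_URL)                    # drive at least one fresh 2xx response
        body = requests.get(METRICS_URL).text
        for line in body.splitlines():
            if line.startswith(METRIC) and float(line.rsplit(" ", 1)[-1]) >= 1:
                return
        time.sleep(1)
    # Dump the full metrics body so a timeout failure is debuggable.
    raise AssertionError(f"2xx counter never reached 1; full /metrics:\n{body}")
```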

Passes 5/5 locally

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

* [data][1/n] DataSourceV2: refactor V2 listing/scanner/reader infrastructure (#62975)

Internal refactor of V2 listing/scanner/reader infrastructure to prep
for the upcoming ListFiles/ReadFiles op split. No public API change.

- Listing: partition-column helpers on FileManifest, sample_files +
  _build_pruners helpers in listing_utils.
- FileReader.read(manifest): cached_property file_dataset_schema,
  _broadcast_partition_value helper, derived_items synthesis loop,
  early-return on empty manifest. Caller-supplied schema overrides
  pyarrow's per-fragment inference for the all-null first-file case.
- FileScanner: drop bucketing helper plan() (moved upstream to
  plan_list_files_op in PR-A2), add prune_manifest hook, keep
  compute_local_scheduling (used by V1 dispatch until PR-D).
- ArrowFileScanner / ParquetFileReader / Scanner: simplifications
  aligned with the new manifest-driven read path.
- arrow_block.py + dataset.py: Schema.names hides _bsp_stub stub
  column produced when the scanner emits zero-column batches.

This is part of breaking up PR #62880.

Co-authored-by: Goutam V. <>

* [Docs] Replace deprecated busyboxplus curl image in Kubernetes examples (fixes #61538) (#63019)

## Summary

Fixes broken Kubernetes example in RayService quickstart docs.

The image `radial/busyboxplus:curl` is no longer usable due to its
deprecated Docker manifest format, causing ImagePullBackOff errors.

## Changes

- Replaced `radial/busyboxplus:curl` with `curlimages/curl:latest`

## Testing

- Verified the new image works with `kubectl run`
- Confirmed curl commands execute successfully inside the pod

## Issue

Closes #61538

---------

Signed-off-by: Chaitanya Bharadwaj <venkatachaitanyametta@gmail.com>
Signed-off-by: Chaitanya Bharadwaj <74806126+mvcb@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [serve] Evict per-deployment LongPollHost state on deployment delete (#62820)

## Problem

`LongPollHost` had no eviction API. The deletion path in
`DeploymentStateManager.update()` cleaned scheduler, autoscaler, and
`_deployment_states` but never told the long-poll host. Three
per-deployment keys — `(DEPLOYMENT_TARGETS, id)`, the Java-compat
`(DEPLOYMENT_TARGETS, name)`, and `(DEPLOYMENT_CONFIG, id)` — survived
for the life of the controller, bounded by unique `(name, app_name)`
pairs.

It also meant **handle routers** (the routers embedded in
`serve.get_deployment_handle(...)`, in replicas or user driver code)
never received `is_available=False` on delete. `is_available` is derived
from `not _terminally_failed()` at `deployment_state.py:3198-3216`, not
from "deleting"; healthy deletes emit `is_available=True`, and
`broadcast_running_replicas_if_changed` can even early-return and emit
nothing at all. Requests through the handle then queue or hang instead
of failing fast with `DeploymentUnavailableError`. HTTP/gRPC proxies are
unaffected — they subscribe to `ROUTE_TABLE`, which
`EndpointState.delete_endpoint()` handles correctly.

## Fix

- **`LongPollHost.remove_keys(keys)`** — pops the four per-key maps,
decrements the pending-clients gauge by the number of woken waiters,
fires each waiter's event.
- **`listen_for_change` hardening** — done branch skips evicted keys
(was `KeyError`); `not_done` cleanup uses `.get()` instead of indexing
to avoid resurrecting `defaultdict` entries; empty sets are popped.
- **Delete path** — tombstones `DEPLOYMENT_TARGETS` via `notify_changed`
and evicts only `DEPLOYMENT_CONFIG`. The tombstoned key is intentionally
*not* evicted in the same sync tick: parked waiters run only after
`update()` returns, by which point the done-branch guard would drop the
tombstone.
- **Batched gauge writes** (per Gemini review) — collect affected
namespace tags, flush one `pending_clients_gauge.set(...)` per unique
tag after each loop.

After this, handle routers flip to `is_available=False` within ms of
delete and raise `DeploymentUnavailableError` immediately, rather than
relying on side channels (handle lifetime, driver GC, caller timeouts)
to eventually notice.
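
A simplified sketch of the eviction pattern described above (this is not the actual `LongPollHost` implementation; the attribute names and single-map layout are assumptions for illustration):

```python
from collections import defaultdict


class TinyLongPollHost:
    """Toy model: per-key snapshots plus parked waiter events, with an eviction method."""

    def __init__(self):
        self.snapshots = {}                 # key -> latest broadcast value
        self.waiters = defaultdict(set)     # key -> parked waiter events (e.g. asyncio.Event)

    def notify_changed(self, key, value):
        self.snapshots[key] = value
        for event in self.waiters.pop(key, set()):
            event.set()                     # wake listeners for this key

    def remove_keys(self, keys) -> int:
        """Evict per-key state and wake (then drop) any parked waiters."""
        woken = 0
        for key in keys:
            self.snapshots.pop(key, None)
            for event in self.waiters.pop(key, set()):
                event.set()
                woken += 1
        return woken                        # caller can decrement a pending-clients gauge
```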

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [Core] remove pydantic v1 support (#62716)

## Description
Drop Pydantic v1 support in Ray and require Pydantic v2 for the Ray
extras that depend on it. We remove Pydantic v1 support instead of
keeping an additional compatibility fix for Python 3.14. This makes the
dependency behavior clearer and lets us delete v1-specific
compatibility code.

## Related issues
https://github.com/ray-project/ray/issues/62664

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

* [Data] Reduce verbosity of arrow conversion warning logs (#61486)

## Description

When Arrow conversion fails and Ray Data falls back to pickle
serialization, the warning log includes the full exception traceback
(`exc_info=ace`), which can be extremely noisy — especially for nested
datatypes like image arrays where the data representation alone spans
many lines.

This PR moves the detailed error message and traceback to `DEBUG` level,
keeping the `WARNING` concise and actionable:

**Before:**
```
WARNING arrow.py:290 -- Failed to convert column 'flat_images' into pyarrow array due to: Error converting data to Arrow: [[array([[[130, 118, 255], [132, 117, 255], ...]]]...; falling back to serialize as pickled python objects
Traceback (most recent call last):
  File ".../arrow.py", line 258, in _convert_to_pyarrow_native_array
    ...
  (10+ lines of traceback)
```

**After:**
```
WARNING arrow.py:290 -- Failed to convert column 'flat_images' into pyarrow array; falling back to serialize as pickled python objects. To see the full error, set logging level to DEBUG.
```
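
A self-contained sketch of the pattern (paraphrased, not the exact `arrow.py` code):

```python
import logging

logger = logging.getLogger(__name__)


def log_arrow_fallback(column_name: str, ace: Exception) -> None:
    # Keep the WARNING short and actionable.
    logger.warning(
        "Failed to convert column %r into pyarrow array; falling back to "
        "serializing as pickled python objects. Set the logging level to "
        "DEBUG to see the full error.",
        column_name,
    )
    # Move the verbose details and the traceback to DEBUG.
    logger.debug("Arrow conversion error for column %r", column_name, exc_info=ace)
```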

## Related issues

Fixes #57840

## Additional information

- The full error details + traceback are still available at `DEBUG`
level for anyone who needs to investigate
- All existing unit tests pass (`test_transform_pyarrow.py`,
`test_arrow_type_conversion.py`)
- The `ArrowConversionError` already truncates data to 200 chars, but
even that plus the traceback was excessively verbose for a warning

---------

Signed-off-by: slxswaa1993 <470093691@qq.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [serve] Increase controller benchmark frequency (#63029)

## Description
We need denser benchmark results to identify regressions.


Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

* [ci][deps][1/3] PY313 DEP UNIFICATION: compiling requirements_compiled_py3.13.txt and depsets (#62864)

Refreshes `requirements_compiled_py3.13.txt` and the full set of
raydepsets locks against current source pins, and adds the supporting CI
plumbing and source-file changes needed to make the py3.13 lock
resolvable as a constraint across all py3.10/3.11/3.12/3.13 depsets.

## CI infrastructure

- **`.buildkite/dependencies.rayci.yml`** — new
`pip_compile_313_dependencies` Buildkite step (mirror of the existing
3.11 compile job). Runs `compile_313_pip_dependencies`, uploads the
artifact, and fails the build if `requirements_compiled_py3.13.txt`
drifts from source.
- **`ci/ci.sh`** — new `compile_313_pip_dependencies()` function that
points pip-compile at the `python/requirements/py313/` and
`python/requirements/ml/py313/` overrides and emits
`requirements_compiled_py3.13.txt`.

## Source-file pins

These drive the lock changes — no manual edits to the generated lock
files.

### `python/requirements/py313/test-requirements.txt`
- `fastapi==0.121.0` — FastAPI 0.125+ removed `pydantic.v1` route
support; `test_pydantic_serialization` still uses v1 BaseModel.
- `asgiref==3.9.2` — 3.10+ regresses Serve direct-ingress timeout /
disconnect handling.
- `redis==4.5.4` — TLS test compatibility.
- `opentelemetry-proto==1.39.0` and
`opentelemetry-exporter-otlp-proto-grpc==1.39.0` — co-pinned with
`opentelemetry-sdk` so vllm (rayllm depset) can satisfy the in-family
pins.
- `grpcio==1.76.0` + matching `grpcio-tools` / `grpcio-status` —
bisecting `test_raylet_and_agent_share_fate` against grpcio 1.80 startup
cost on the runtime-env agent.
- `jsonschema>=4.23.0,<4.25.0` — 4.25 introduced `rfc3987-syntax` which
pins `lark==1.3.1`, conflicting with vllm's `lark==1.2.2`.
- Dual `python_version`-marker pins for `protobuf`, `scipy`,
`contourpy`, `networkx` — these packages dropped py3.10 wheels at the
same time the py3.13 lock needed newer floors. Dual pinning preserves
the cross-py-version compat path when the py3.13 lock is consumed as a
constraint by py3.10 depsets.

### `python/requirements/ml/py313/`
- `data-requirements.txt` — `lance-namespace==0.6.1`.
- `dl-cpu-requirements.txt` / `dl-gpu-requirements.txt` —
`nvidia-nccl-cu12` aligned across CPU/GPU so the CPU-built lock doesn't
pin a version that conflicts with cu128 torch in GPU depsets.
- `ml-requirements.txt` — dual `keras` pin (3.12.1 for py<3.11, 3.14.0
for py>=3.11); keras 3.13 dropped py3.10.
- `rllib-requirements.txt` — dual `onnxruntime` pin (1.20.0 / 1.24.4)
keyed on python version.
- `train-requirements.txt` — `datasets==3.6.0`.

### `python/requirements/data/`
- `pyarrow-latest.txt` — added `delta-sharing`.
- `pyarrow-v9.txt` — pinned `datasets==2.14.4`, added `delta-sharing`.

## Depsets config

**`ci/raydepsets/configs/ci_data.depsets.yaml`** — added relax entries
so v9 / tfxbsl resolves can downgrade chains together:

- `relaxed_data`: relaxed `delta-sharing`, `dill`, `multiprocess`
(datasets 2.14.4 caps `dill<0.3.8` but py313 lock has `dill==0.4.1`).
- `relaxed_data_tfxbsl`: relaxed `absl-py`, `grpcio-status`,
`contourpy`, `scipy`, `delta-sharing` (tfx-bsl 1.16.x caps
`absl-py<2.0.0` and `protobuf<6`; contourpy 1.3.3 + apache-beam 2.53.0
numpy clash).

## Lock files

Regenerated `requirements_compiled_py3.13.txt` and ~70 depset locks
under `python/deplocks/` (base / ci / llm / ray_img / docs).

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [ci] Fix mismatch between bisect instance-type and runner-queue name (#62742)

## Description
A mismatch in the `instance_type` and `runner_queues` fields of bisect
pipeline rayci configs causes all `bisect` pipeline builds to fail.

## Related issues
None

## Additional information

https://buildkite.com/ray-project/bisect/builds/3673/steps/canvas?sid=019d9d9d-05de-4326-b5dc-d818fbcdc71f&tab=output

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

* [ci] Migrate dataset GPU core compute configs to new schema (#62832)

## Summary

Migrates 2 Anyscale compute config files from the legacy schema to the
new SDK 2026 schema, and adds `anyscale_sdk_2026: true` to all
corresponding test entries in `release_data_tests.yaml`.

### Compute configs migrated (2 files)

**Dataset tests** (`release/nightly_tests/dataset/`):
- `fixed_size_gpu_compute.yaml`
- `autoscaling_gpu_compute.yaml`

### Tests updated in release_data_tests.yaml (3 tests)

Via `{{scaling}}_gpu_compute.yaml` template:
1. `image_classification_{{scaling}}`
2. `image_classification_from_parquet_{{scaling}}`

Hardcoded `dataset/autoscaling_gpu_compute.yaml` (chaos test overrides
`working_dir: nightly_tests`):
3. `image_classification_chaos`

### Schema changes applied
- `cloud_id` → `cloud`, `ANYSCALE_CLOUD_ID` → `ANYSCALE_CLOUD_NAME`
- Removed `region: us-west-2`
- `head_node_type` → `head_node`, `worker_node_types` → `worker_nodes`
- `min_workers` → `min_nodes`, `max_workers` → `max_nodes`
- `use_spot: false` → `market_type: ON_DEMAND`
- `advanced_configurations_json` → `advanced_instance_config`
- Dropped head/worker `name:` fields (single worker group per config)
- Dropped head-node `resources: {cpu: 0}` — new SDK defaults head CPU to
0 when `worker_nodes` is present (head is CPU-only coordinator; GPU
workloads run on `g4dn.2xlarge` workers)
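
As an illustrative sketch only (not a script shipped with this PR), the renames listed above applied to a legacy config loaded as a dict would look roughly like the following; field handling is simplified and the names mirror the list, not the full SDK schema:

```python
def migrate_compute_config(legacy: dict) -> dict:
    """Apply the legacy -> SDK 2026 renames described above to a config dict."""
    cfg = dict(legacy)
    if "cloud_id" in cfg:
        cfg["cloud"] = cfg.pop("cloud_id")
    cfg.pop("region", None)                                  # dropped in the new schema
    if "head_node_type" in cfg:
        cfg["head_node"] = cfg.pop("head_node_type")
    if "advanced_configurations_json" in cfg:
        cfg["advanced_instance_config"] = cfg.pop("advanced_configurations_json")
    for worker in cfg.pop("worker_node_types", []):
        worker["min_nodes"] = worker.pop("min_workers", 0)
        worker["max_nodes"] = worker.pop("max_workers", 0)
        if worker.pop("use_spot", None) is False:
            worker["market_type"] = "ON_DEMAND"
        cfg.setdefault("worker_nodes", []).append(worker)
    return cfg
```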

## Test plan
- [x] Both config files validated against `ComputeConfig.from_yaml()`
- [x] CI passes with `anyscale_sdk_2026: true` flag on all 3 test
entries: https://buildkite.com/ray-project/release/builds/89918

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* [ci] Migrate scheduling and single node benchmark compute configs to new schema (#62489)

## Summary
- Migrated 4 compute config files to the new Anyscale SDK schema:
`scheduling.yaml`, `scheduling_gce.yaml`, `single_node.yaml`,
`single_node_gce.yaml`
- Updated 2 test entries (`single_node`,
`scheduling_test_many_0s_tasks_many_nodes`) in `release_tests.yaml` with
`anyscale_sdk_2026: true` flag
- Key transformations: `cloud_id` -> `cloud`, `head_node_type` ->
`head_node`, `worker_node_types` -> `worker_nodes`, flattened
`custom_resources`, renamed
`advanced_configurations_json`/`gcp_advanced_configurations_json` ->
`advanced_instance_config`, `use_spot: false` -> `market_type:
ON_DEMAND`, `min/max_workers` -> `min/max_nodes`

## Test plan
- [x] All 4 configs validated against `ComputeConfig.from_yaml()`
- [x] Verify `single_node` nightly tests pass on Buildkite
- [x] Verify `scheduling_test_many_0s_tasks_many_nodes` nightly tests
pass on Buildkite

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [Serve][LLM] Add rate-limiter logic for per request traceback spam (#62440)

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [ci] Migrate dask-on-ray and shuffle compute configs to new schema (#62605)

## Summary
- Migrate 9 compute config files (dask-on-ray and shuffle) from legacy
Anyscale schema to new SDK schema
- Add `anyscale_sdk_2026: true` to 5 test entries in
`release_tests.yaml`

## Config files migrated
-
`release/nightly_tests/dask_on_ray/dask_on_ray_sort_compute_template.yaml`
(AWS, head-only)
-
`release/nightly_tests/dask_on_ray/dask_on_ray_sort_compute_template_gce.yaml`
(GCE, head-only)
- `release/nightly_tests/dask_on_ray/1tb_sort_compute.yaml` (AWS, head +
32 workers)
- `release/nightly_tests/shuffle/shuffle_compute_multi.yaml` (AWS, head
+ 3 workers)
- `release/nightly_tests/shuffle/shuffle_compute_multi_gce.yaml` (GCE,
head + 3 workers)
- `release/nightly_tests/shuffle/shuffle_compute_single.yaml` (AWS,
head-only)
- `release/nightly_tests/shuffle/shuffle_compute_single_gce.yaml` (GCE,
head-only)
- `release/nightly_tests/shuffle/shuffle_compute_autoscaling.yaml` (AWS,
head + 0-19 workers)
- `release/nightly_tests/shuffle/shuffle_compute_autoscaling_gce.yaml`
(GCE, head + 0-19 workers)

## Test entries updated (anyscale_sdk_2026: true)
- `dask_on_ray_100gb_sort`
- `dask_on_ray_1tb_sort`
- `shuffle_20gb_with_state_api`
- `shuffle_100gb`
- `autoscaling_shuffle_1tb_1000_partitions`

## Schema changes applied
- `cloud_id` → `cloud` (env var name updated)
- `head_node_type` → `head_node` (removed `name:` field)
- `worker_node_types` → `worker_nodes` (omitted for head-only configs)
- `min_workers`/`max_workers` → `min_nodes`/`max_nodes`
- `use_spot: false` → `market_type: ON_DEMAND`
- `advanced_configurations_json` / `gcp_advanced_configurations_json` →
`advanced_instance_config`
- GCE: `region` + `allowed_azs` → `zones`
- Removed: `region`, `max_workers`, commented-out blocks
- Capitalized `cpu` → `CPU` in resources

## Test plan
- [x] All 9 configs validated against `ComputeConfig.from_yaml()`
- [x] Verify CI passes with new configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [ci] Migrate stress test and placement group compute configs to new schema (#62607)

## Summary

Migrates 15 Anyscale compute config files from the legacy schema to the
new SDK 2026 schema, and adds `anyscale_sdk_2026: true` to all
corresponding test entries in `release_tests.yaml`.

### Compute configs migrated (15 files)

**Stress tests** (`release/nightly_tests/stress_tests/`):
- `stress_tests_compute.yaml` / `stress_tests_compute_gce.yaml`
- `stress_tests_compute_large.yaml` /
`stress_tests_compute_large_gce.yaml`
- `smoke_test_compute.yaml` / `smoke_test_compute_gce.yaml`
- `stress_test_threaded_actor_compute.yaml`
- `placement_group_tests_compute.yaml` /
`placement_group_tests_compute_gce.yaml`
- `stress_tests_single_node_oom_compute.yaml` /
`stress_tests_single_node_oom_compute_gce.yaml`

**Placement group tests**
(`release/nightly_tests/placement_group_tests/`):
- `compute.yaml` / `compute_gce.yaml`
- `pg_perf_test_compute.yaml` / `pg_perf_test_compute_gce.yaml`

### Tests updated in release_tests.yaml (9 tests)

1. `stress_test_placement_group`
2. `stress_test_state_api_scale`
3. `stress_test_many_tasks`
4. `stress_test_dead_actors`
5. `threaded_actors_stress_test`
6. `stress_test_many_runtime_envs`
7. `single_node_oom`
8. `pg_autoscaling_regression_test`
9. `placement_group_performance_test`

### Schema changes applied
- `cloud_id` → `cloud`, `ANYSCALE_CLOUD_ID` → `ANYSCALE_CLOUD_NAME`
- `head_node_type` → `head_node`, `worker_node_types` → `worker_nodes`
- `min_workers` → `min_nodes`, `max_workers` → `max_nodes`
- `use_spot: false` → `market_type: ON_DEMAND`
- `advanced_configurations_json` / `gcp_advanced_configurations_json` →
`advanced_instance_config`
- GCE: `region` + `allowed_azs` → `zones`
- Resources: `cpu` → `CPU`, `gpu` → `GPU`, flattened `custom_resources`
- Removed: `region`, `max_workers`, head/worker `name` fields (kept
where multiple workers share instance type)
- Removed commented-out blocks
- Added `CPU` resources to head nodes where `wait_for_nodes` > worker
count

## Test plan
- [x] All 15 config files validated against `ComputeConfig.from_yaml()`
- [x] CI passes with `anyscale_sdk_2026: true` flag on all test entries:
https://buildkite.com/ray-project/release/builds/89908

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [ci] Migrate chaos test compute configs to new schema (#62606)

## Summary
- Migrated 2 chaos test compute config files (`compute_template.yaml`
and `compute_template_gce.yaml`) from legacy Anyscale compute config
schema to the new SDK schema
- Added `anyscale_sdk_2026: true` flag to all 16 chaos test entries in
`release_tests.yaml`

### Config changes
- `cloud_id` -> `cloud`, `ANYSCALE_CLOUD_ID` -> `ANYSCALE_CLOUD_NAME`
- `head_node_type` -> `head_node`, `worker_node_types` -> `worker_nodes`
- `min_workers`/`max_workers` -> `min_nodes`/`max_nodes`
- `use_spot: false` -> `market_type: ON_DEMAND`
- `advanced_configurations_json` -> `advanced_instance_config`
- Flattened `resources` (removed `custom_resources` nesting, capitalized
`cpu` -> `CPU`)
- GCE: replaced `region` + `allowed_azs` with `zones`
- Removed `region`, `max_workers`, and node `name` fields

### Tests updated (16)
-
`chaos_many_tasks_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`
-
`chaos_many_actors_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`
-
`chaos_streaming_generator_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`
-
`chaos_object_ref_borrowing_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`

## Test plan
- [x] Both config files validated against `ComputeConfig.from_yaml()`
- [x] Verify chaos tests pass on nightly run after merge

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [ci] Migrate microbenchmark, benchmark-worker-startup, and rllib compute configs to new schema (#62604)

## Summary
- Migrate 10 compute config files to the new Anyscale SDK schema
(`cloud_id` -> `cloud`, `head_node_type` -> `head_node`,
`worker_node_types` -> `worker_nodes`, etc.)
- Add `anyscale_sdk_2026: true` flag to 12 test cluster blocks in
`release_tests.yaml`

## Config files migrated
- `release/microbenchmark/tpl_64.yaml` (AWS, head-only)
- `release/microbenchmark/tpl_64_gce.yaml` (GCE, head-only)
- `release/microbenchmark/experimental/compute_t4_gpu.yaml` (AWS,
head-only GPU)
- `release/microbenchmark/experimental/compute_gpu_2x1_aws.yaml` (AWS,
head+worker GPU)
- `release/microbenchmark/experimental/compute_a100_gpu.yaml` (AWS,
head-only GPU)
- `release/microbenchmark/experimental/compute_l4_gpu.yaml` (AWS,
head-only GPU)
- `release/microbenchmark/experimental/compute_l4_gpu_2x1_aws.yaml`
(AWS, head+worker GPU)
- `release/benchmark-worker-startup/only_head_node_1gpu_64cpu.yaml`
(AWS, head-only GPU)
- `release/benchmark-worker-startup/only_head_node_1gpu_64cpu_gce.yaml`
(GCE, head-only)
- `release/rllib_tests/1gpu_16cpus.yaml` (AWS, head-only GPU)

## Tests updated with `anyscale_sdk_2026: true`
- `microbenchmark` (base + GCE variation)
- `compiled_graphs`
- `compiled_graphs_GPU`
- `compiled_graphs_GPU_multinode`
- `compiled_graphs_GPU_cu130`
- `compiled_graphs_GPU_multinode_cu130`
- `rdt_single_node_T4_microbenchmark`
- `rdt_single_node_A100_microbenchmark`
- `benchmark_worker_startup` (base + GCE variation)
- `rllib_learning_tests_pong_appo_torch`

## Test plan
- [x] All 10 config files validated against `ComputeConfig.from_yaml()`
- [x] CI passes with the new configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* recompiling requirements_compiled_py313.txt

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

* [deps] updating tag on py313 deps (#63033)

updating tag on py313 deps to prevent unnecessary compilation in
premerge

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

* [core] fix the mypy type check on BaseContext.__exit__ (#62999)

## Description

Fix the type error on `BaseContext.__exit__`. Also added the
reported use case to our mypy test cases.
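
For reference, a typical mypy-clean `__exit__` annotation looks like the sketch below (illustrative; not necessarily the exact annotation adopted in this PR):

```python
from types import TracebackType
from typing import Optional, Type


class ExampleContext:
    def __enter__(self) -> "ExampleContext":
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_value: Optional[BaseException],
        traceback: Optional[TracebackType],
    ) -> Optional[bool]:
        # Returning None (falsy) means exceptions are not suppressed.
        return None
```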

## Related issues
Fixes https://github.com/ray-project/ray/issues/62971

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

* [core] increase the cleanup timeout in the chaos iptable test (#62992)

## Description

Increase the waiting time for the cleanup according to the cluster logs
from the
[failure](https://buildkite.com/ray-project/release/builds/90709#019dd2b5-f78b-47eb-aa8f-331c5c68cad3):

### Timeline
- **23:14:00**: Actor workload starts with network failure injection
every 60s.
- **23:15:00**: First 5s network fault affects head + 4 workers.
- **23:16:02**: Raylet reports worker process `10563` did not register
within timeout.
- **23:16:02-23:17:00**: `ReportActor.add` retries pile up after
connection resets; progress stalls near 47%.
- **23:18:34-23:20:34**: Head state dumps show `128` total worker CPUs
and `0` available while actor work is still running.
- **23:21:34**: Head sees `128` total CPUs, `112` available. Missing
`16` CPUs are all on `10.0.45.36`.
- **23:21:43**: Worker `10.0.45.36` reports 16 `ReportActor.__init__`
workers, each holding `1 CPU`.
- **23:21:47**: Those 16 `ReportActor` workers disconnect gracefully.
- **23:21:49**: `wait_for_condition` times out before observing all CPUs
released; another network fault triggers at the same time.

So, increasing the cleanup consistency timeout should likely fix this
specific failure.


## Related issues
Fixes: https://github.com/anyscale/ray/issues/1534


---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [core] move observability pubsub to ObservabilityPubSubService (#62806)

This PR is a follow-up to https://github.com/ray-project/ray/pull/62461.
It isolates 3 pubsub channels that have lower priorities and are not
part of the critical control plane, moving them from the
InternalPubSubGcsService to their own io_context and the new
ObservabilityPubSubService:

    pubsub_pb2.RAY_ERROR_INFO_CHANNEL
    pubsub_pb2.RAY_LOG_CHANNEL
    pubsub_pb2.RAY_NODE_RESOURCE_USAGE_CHANNEL

This will ensure that they won't block the critical control plane. The
new ObservabilityPubSubRpcClient also allows us to move the service out
of the GCS if needed in the future.

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>

* [ci] Fix doc build failing on broken pytorch intersphinx inventory (#63038)

- The doc build (`make -C doc html`, which runs `sphinx-build -W
--keep-going`) is failing with `build finished with problems, 1
warning`. The single Sphinx warning is an intersphinx fetch failure:
`https://pytorch.org/docs/stable/objects.inv` 301s to
`https://docs.pytorch.org/docs/stable/objects.inv`, which currently 404s
upstream. With `-W`, that one warning fails CI.
- Repoint the `torch` intersphinx mapping in `doc/source/conf.py` to
bypass the broken `/stable/objects.inv`. The base URL stays at the
canonical `https://docs.pytorch.org/docs/stable/` so generated
cross-reference links still target /stable/, but the inventory is
fetched from a working pinned version:
`https://docs.pytorch.org/docs/2.7/objects.inv`.
- Pin matches Ray's runtime torch version (`torch==2.7.0` in
`python/requirements/ml/dl-{cpu,gpu}-requirements.txt`), so cross-refs
only resolve to symbols that actually exist in the torch version users get.
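
The resulting `conf.py` shape is roughly the sketch below (the exact dict contents in the PR may differ):

```python
# doc/source/conf.py (sketch): keep the canonical /stable/ base URL for generated
# links, but fetch the inventory from the pinned, working 2.7 location.
intersphinx_mapping = {
    "torch": (
        "https://docs.pytorch.org/docs/stable/",          # link-target base URL
        "https://docs.pytorch.org/docs/2.7/objects.inv",  # pinned inventory source
    ),
}
```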

## Why pin to 2.7 and not /stable/ or /main/

- `/stable/objects.inv` is the upstream-broken URL we're routing around,
so it can't be the source.
- `/main/objects.inv` works but tracks the development branch, which can
index APIs that don't exist in 2.7 — leading to cross-refs resolving to
symbols Ray users can't actually call.
- `/2.7/objects.inv` matches the runtime exactly. Tradeoff: when Ray
bumps torch, this URL needs to bump alongside the requirements pin.

Post merge run: https://buildkite.com/ray-project/postmerge/builds/17329

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

* [observability] add instance filter to gpu usage metric query (#62214)

## Description
Adds instance filter to the node gpu usage metric panel

Signed-off-by: carolynwang <carolyn@anyscale.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>

* Remove redundant flaky integration test in favor of unit tests (#63004)

## Description
PR [[Core] (Resource Isolation 12/n) Switch group killing policy to by
time killing policy](https://github.com/ray-project/ray/pull/62643)
enabled the new by-time killing policy by default, as opposed to the
legacy by-group killing policy. This resulted in `test_memory_pressure`
failures in postmerge. Our investigation found the following:
* The integration test checks for policy-specific behaviors, whereas the
memory pressure integration test suite should instead test the memory
monitoring system's general ability to reduce memory pressure.
* The failing integration test should be a unit test that exercises the
killing policy's behavior directly.

In general, we prefer unit tests over integration tests for
memory-threshold-sensitive tests, as the test environment can have a
significant impact on the results, leading to flaky behavior.

This PR removes redundant integration tests whose policy-specific
behaviors are already covered by the policies' unit tests, and
introduces a new unit test for cases that were previously covered only
by the integration tests. The following are the removed integration
tests and their replacements:
* `test_restartable_actor_oom_retry_off_throws_oom_error` -> redundant
to `test_restartable_actor_throws_oom_error`
* `test_memory_pressure_kill_newest_worker` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_memory_pressure_kill_task_if_actor_submitted_task_first` ->
replaced by `TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_task_oom_no_oom_retry_fails_immediately` -> replaced by
`TestTaskOomKillNoOomRetryFailsImmediately` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_task_oom_only_uses_oom_retry` -> replaced by
`TestTaskOomInfiniteRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_newer_task_not_retriable_kill_older_retriable_task_first` ->
replaced by `TestPolicyPrioritizesRetriableOverNonRetriable` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_put_object_task_usage_slightly_below_limit_does_not_crash` ->
replaced by `TestMonitorDetectsMemoryBelowThresholdCallbackNotExecuted`
in
https://github.com/ray-project/ray/blob/master/src/ray/common/tests/threshold_memory_monitor_test.cc
* `test_last_task_of_the_group_fail_immediately` -> replaced by
`TestLastWorkerInGroupShouldNotRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_group_by_owner_test.cc
* `test_one_actor_max_lifo_kill_next_actor` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc

## Additional information
`test_memory_pressure` run:
https://buildkite.com/ray-project/postmerge/builds/17288

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>

* [Train] Add missing %s to logger.debug (#63039)

`logger.debug` was missing the `%s` placeholder and as a result was clogging up the logs.
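
Illustrative pattern only (hypothetical message and variable, not the PR's exact line): without a `%s` placeholder the extra argument is never interpolated, and with debug logging enabled each call emits a formatting error instead of the value; with the placeholder, the value is interpolated lazily.

```python
import logging

logger = logging.getLogger(__name__)
batch_size = 128  # hypothetical value for illustration

# Broken: extra argument but no %s placeholder -> "--- Logging error ---" spam.
logger.debug("Reported training batch of size", batch_size)

# Fixed: the placeholder interpolates the value only when DEBUG is enabled.
logger.debug("Reported training batch of size %s", batch_size)
```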

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>

* Add perf metrics for 2.55.0 (#63060)

```
REGRESSION 52.54%: tasks_per_second (THROUGHPUT) regresses from 386.6133448073775 to 183.49078025658062 in benchmarks/many_nodes.json
REGRESSION 37.10%: tasks_per_second (THROUGHPUT) regresses from 594.0367087794571 to 373.6653345877981 in benchmarks/many_tasks.json
REGRESSION 4.22%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 5.723101265712336 to 5.481786077048712 in microbenchmark.json
REGRESSION 4.09%: multi_client_put_gigabytes (THROUGHPUT) regresses from 42.60577675231464 to 40.8627833341568 in microbenchmark.json
REGRESSION 1.86%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.982001139120161 to 0.9637211637507427 in microbenchmark.json
REGRESSION 0.84%: client__get_calls (THROUGHPUT) regresses from 1119.7606509262687 to 1110.3815800718512 in microbenchmark.json
REGRESSION 0.63%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 2985.2594797119345 to 2966.3149904468737 in microbenchmark.json
REGRESSION 0.48%: client__put_calls (THROUGHPUT) regresses from 851.7996054229982 to 847.7132252307356 in microbenchmark.json
REGRESSION 289.14%: dashboard_p95_latency_ms (LATENCY) regresses from 37.856 to 147.311 in benchmarks/many_pgs.json
REGRESSION 135.33%: dashboard_p99_latency_ms (LATENCY) regresses from 798.453 to 1879.035 in benchmarks/many_pgs.json
REGRESSION 110.53%: stage_4_spread (LATENCY) regresses from 0.3184540688712737 to 0.6704279092079272 in stress_tests/stress_test_many_tasks.json
REGRESSION 48.31%: avg_pg_remove_time_ms (LATENCY) regresses from 1.154493106606675 to 1.7122211741741544 in stress_tests/stress_test_placement_group.json
REGRESSION 34.75%: dashboard_p50_latency_ms (LATENCY) regresses from 5.002 to 6.74 in benchmarks/many_pgs.json
REGRESSION 21.20%: stage_0_time (LATENCY) regresses from 7.112839698791504 to 8.620674133300781 in stress_tests/stress_test_many_tasks.json
REGRESSION 19.38%: stage_3_creation_time (LATENCY) regresses from 2.621494770050049 to 3.1294972896575928 in stress_tests/stress_test_many_tasks.json
REGRESSION 8.31%: dashboard_p95_latency_ms (LATENCY) regresses from 42.959 to 46.531 in benchmarks/many_nodes.json
REGRESSION 8.03%: 107374182400_large_object_time (LATENCY) regresses from 22.459637914999973 to 24.263247010999976 in scalability/single_node.json
REGRESSION 8.00%: 10000_args_time (LATENCY) regresses from 11.357349357000004 to 12.265755501000008 in scalability/single_node.json
REGRESSION 7.69%: avg_pg_create_time_ms (LATENCY) regresses from 1.5098637252248464 to 1.6259311876874045 in stress_tests/stress_test_placement_group.json
REGRESSION 3.67%: 3000_returns_time (LATENCY) regresses from 3.577688757000004 to 3.7088375179999957 in scalability/single_node.json
```

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Co-authored-by: Lonnie Liu <lonnie@anyscale.com>

* [core] rename InternalPubSub* to ControlPlanePubSub* (#63044)

## Description
Renaming `InternalPubSub*` to `ControlPlanePubSub*` for clarity.
Following up to
https://github.com/ray-project/ray/pull/62806#pullrequestreview-4207199543


Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

* [Serve][2/5] Add custom ingress request router app interfaces (#62680)

Direct ingress needs an app-scoped ingress request router deployment
that HAProxy can call to map each request to a target replica ID before
forwarding the request to the selected replica.

This change attaches that router to the Serve application object itself, so
both imperative and declarative deployment paths consume the same
composed application graph.

## API shape

Imperative usage:

```python
llm_server = LLMServer.bind(...)

ingress_request_router = IngressRequestRouter.bind(
    llm_deployment=llm_server,
)

app = llm_server._with_ingress_request_router(ingress_request_router)

serve.run(app, route_prefix="/v1")
```

Declarative usage:

```python
# my_module.py
llm_server = LLMServer.bind(...)

ingress_request_router = IngressRequestRouter.bind(
    llm_deployment=llm_server,
)

app = llm_server._with_ingress_request_router(ingress_request_router)
```

```yaml
applications:
- name: llm
  route_prefix: /v1
  import_path: my_module:app
```

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>

* [Core] Match expected resource isolation integration test constraint to new cgroup constraint (#63054)

## Description
The resource isolation Python integration tests are currently failing
because the resource isolation upper-bound constraint has been adjusted
from `memory.max` to `memory.high` in the latest resource isolation
changes without updating the integration test. This PR adjusts the
resource isolation integration test to match the latest changes and use
the `memory.high` upper-bound constraint.

The resource isolation PR that updated the memory constraint without
updating the test:
https://github.com/ray-project/ray/pull/62705/changes#diff-60b34dab728b2e51426a465dd712767a8735682e137e52ebfe030123aeeb56d5L69-R77

## Related issues
Fixes failing core: cgroup tests

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>

* [serve] Enable logs in `LongPollHost` when `LongPollClient` stops its attached event loop (#63028)

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

* [Train] Reduce `test_result_restore` flakiness (#63045)

Reviewing the logs for a flaky run of `test_result_restore` shows that
rank 1 has a training report but rank 0 doesn't (the RuntimeError in
rank 1 runs before the checkpoint in rank 0 is saved); therefore, when
computing `get_best_checkpoints`, checkpoints are missing and
occasionally the wrong results are returned.

We can easily resolve this by adding a sync barrier between workers
before raising the error, ensuring that the checkpoints are all saved.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>

* [Data] Fix HashAggregate duplicate group rows for AggregateFnV2 (#63066)

## Summary

`TableBlockBuilder.build()` reordered rows across an internal compaction
boundary, so `_aggregate`'s per-block partial-aggregate output could be
unsorted by the group key. That violates the "inputs are sorted by key"
precondition that `_combine_aggregated_blocks`' `heapq.merge` relies on,
and surfaced as duplicate group rows in HashAggregate output whose count
varied with the parallelism arg.

Two issues, both fixed:

1. **`TableBlockBuilder.build()`** put the still-uncompacted
dict-of-lists (newest rows) in front of the previously-compacted tables.
Now appends the uncompacted tail after the compacted tables — preserving
insertion order.
2. **`ArrowBlockBuilder._combine_tables`** called
`transform_pyarrow.concat` without `preserve_order=True`. When block
schemas didn't unify exactly (common for V2 aggregators whose
accumulator varies in shape between rows — e.g. an empty list vs. a
non-empty list, inferring `list<null>` vs `list<string>`), `concat` took
a fast path that groups schema-matching blocks together and prepends
mismatched ones. Now passes `preserve_order=True` since the builder's
contract is to preserve insertion order regardless of internal
compaction or schema unification.

## Where `_combine_tables` sits in the hash-shuffle lifecycle

```mermaid
sequenceDiagram
    autonumber
    participant ShuffleTask as _shuffle_block (Ray task)
    participant Closure as input_block_transformer<br/>(_aggregate closure)
    participant TableAcc as TableBlockAccessor._aggregate
    participant Builder as TableBlockBuilder
    participant Combine as ArrowBlockBuilder._combine_tables
    participant Aggregator as HashShuffleAggregator
    participant Reducer as ReducingAggregation

    ShuffleTask->>Closure: block_transformer(block)
    Closure->>Closure: pruned.sort(sort_key)
    Closure->>TableAcc: target._aggregate(sort_key, aggs)
    loop for each group (sorted)
        TableAcc->>Builder: builder.add(row)
        Note over Builder: _compact_if_needed may flush<br/>_columns into _tables mid-loop
    end
    TableAcc->>Builder: builder.build()
    Builder->>Combine: _combine_tables(_tables + [_columns_partial])
    Note over Combine: ★ FIXES LIVE HERE<br/>build(): append uncompacted tail (was: prepend)<br/>_combine_tables: preserve_order=True
    Combine-->>Builder: sorted partial-aggregate block
    Builder-->>TableAcc: sorted partial-aggregate block
    TableAcc-->>Closure: sorted partial-aggregate block
    Closure-->>ShuffleTask: sorted partial-aggregate block
    ShuffleTask->>ShuffleTask: hash_partition (np.where + take, preserves order)
    ShuffleTask->>Aggregator: aggregator.submit.remote(shard)
    Aggregator->>Reducer: compact / finalize (List[Block])
    Reducer->>Reducer: _combine_aggregated_blocks<br/>(heapq.merge — now sees sorted inputs)
```

The bug was in step 8: `_combine_tables` and `build()` could permute
rows across compactions, propagating unsorted blocks through steps 9–14
to the `heapq.merge` in step 15, which silently produced duplicate group
rows because its consecutive-equal-key grouping only collapses adjacent
rows.
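
A standalone illustration of that precondition (plain Python, not Ray code): `heapq.merge` assumes each input iterable is already sorted, and adjacent-key grouping only collapses neighbouring rows, so one unsorted input yields a duplicated group.

```python
import heapq
from itertools import groupby

sorted_block = [("a", 1), ("b", 2)]
unsorted_block = [("b", 3), ("a", 4)]   # violates the sorted-by-key precondition

merged = heapq.merge(sorted_block, unsorted_block, key=lambda kv: kv[0])
combined = [(k, sum(v for _, v in rows)) for k, rows in groupby(merged, key=lambda kv: kv[0])]
print(combined)  # [('a', 1), ('b', 5), ('a', 4)] -- group 'a' silently appears twice
```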

## Test plan

- [x] New regression test
`test_partial_aggregate_preserves_sort_after_builder_compaction` in
`python/ray/data/tests/test_hash_shuffle.py` forces compaction on every
row via `MAX_UNCOMPACTED_SIZE_BYTES=1` and asserts partial-aggregate
output is sorted by the group key. Fails on master, passes after this
change.
- [x] Full `test_hash_shuffle.py` suite (19 tests) passes.
- [x] `test_hash_shuffle_aggregator.py` suite passes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [Data][LLM] Fix wrong documented default for max_tasks_in_flight_per_actor (#62917)

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [train] Export default data execution options (#62784)

Follow-up to #59186, which only captured `execution_options` when the
user provided them per-dataset in the form of a dict, dropping the
default or user-provided global `ExecutionOptions`. This PR captures the
default and user-provided global options alongside the per-dataset
execution options, exposed via a typed `DataExecutionOptions` model
split into `default` and `per_dataset_execution_options`.

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

* [Data] Convert abstract logical operator classes to frozen dataclasses (#62593)

## Description

#### Why this is needed:

This is the next PR in the `#60312` logical plan migration stack.

After finishing the remaining concrete operator coverage, the next step
in the split plan is to convert `LogicalOperator` and the abstract
logical operator base classes into frozen dataclasses.

#### What this PR changes:

Makes the following abstract logical operator classes frozen
dataclasses:
- `LogicalOperator`
- `NAry`
- `AbstractOneToOne`
- `AbstractMap`
- `AbstractUDFMap`
- `AbstractAllToAll`

This PR also makes `_name`, `_input_dependencies`, and `_num_outputs`
proper dataclass fields on `LogicalOperator`, removing the manual
`LogicalOperator.__init__`.

Extending the same step to additional abstract-base state runs into
concrete dataclass constructor-generation errors (for example,
`TypeError: non-default argument 'input_op' follows default argument`),
so the broader field-model cleanup remains in the later follow-up PRs.
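
A standalone reproduction of that constructor-generation error, with hypothetical field names rather than Ray's operator classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    num_outputs: int = 1      # field with a default in the abstract base


@dataclass(frozen=True)
class Child(Base):            # running this raises:
    input_op: Base            # TypeError: non-default argument 'input_op' follows default argument
```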

This PR does not include the later `_name` derived-field work,
`_apply_transform` deduplication, `input_op: InitVar` replacement, or
broader logical-rule cleanup follow-ups.

## Related issues

Closes #60312

## Additional information

This PR corresponds to the current split-plan step for making
`LogicalOperator` and the abstract logical operator base classes frozen
dataclasses.

### Tests

- `python -m pre_commit run --files
python/ray/data/_internal/logical/interfaces/logical_operator.py
python/ray/data/_internal/logical/operators/n_ary_operator.py
python/ray/data/_internal/logical/operators/one_to_one_operator.py
python/ray/data/_internal/logical/operators/map_operator.py
python/ray/data/_internal/logical/operators/all_to_all_operator.py`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_execution_optimizer_basic.py -k 'map or
repartition or sort or union or zip'`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py
-k 'union or split'`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_join.py -k 'inner or outer or semi or anti'`

### Stack Plan

Done:
- PR-A: Add a default property implementation for `LogicalOperator.name`
- PR-B: Move logical `output_dependencies` handling out of logical
operators
- PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs`
- PR-D1: Convert one-to-one logical operators to frozen dataclasses
- PR-D2: Convert map logical operators to frozen dataclasses
- PR-D3: Convert all-to-all, join, read, and write logical operators to
frozen dataclasses
- PR-D4: Convert remaining source logical operators to frozen
dataclasses
- PR-Next-0: Convert remaining concrete logical operators to frozen
dataclasses
- This PR: make `LogicalOperator` and the abstract logical operator base
classes frozen dataclasses

Next:
- make `_name` a derived field
- deduplicate `_apply_transform`
- replace `input_op: InitVar` with a real `input_dependencies` field
- remove `input_dependency` on `AbstractOneToOne`
- clean up `_get_args`
- remove redundant `__repr__` / `__str__`
- clean up special-casing in logical rules
- finalize equality / comparability work for `#60312`

---------

Signed-off-by: yaommen <myanstu@163.com>

* [ci] convert core.rayci.yml test steps to array and narrow subsets (#62799)

Convert the two remaining matrix test steps in core.rayci.yml — "core:
python {{matrix.python}} tests" (matrix setup with python + worker_id)
and "core: minimal tests" — to array syntax; their corebuild-multipy and
minbuild-core depends_on refine from (*) to ($). Narrow three (*)
fan-ins in core.rayci.yml down to (python=3.10) subsets for the wheel
tests, HA integration, and runtime env container steps that only
exercise python 3.10. Across cicd, data, dependencies, doc, kuberay,
llm, ml, others, rllib, and serve, narrow each oss-ci-base_* (*)
dependency to (python=X.Y) where the consuming step pins a single python
version; leave (*) in place where the step truly spans multiple versions
(data top-level ml base, ml mlbuild-multipy / mlgpubuild-multipy, serve
top-level build base).

Signed-off-by: andrew <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

* [Data] Make logical operator names derived by default (#63084)

## Description

#### Why this is needed:

This is the next PR in the `#60312` logical plan migration stack.

After moving the shared logical-operator backing fields to the
abstract-class layer, `_name` is still wired manually in many concrete
operators. Most of those assignments are just the operator class name,
so the next step is to make that default behavior come from the base
logical-operator layer.

#### What this PR changes:

Makes logical-operator names derived by default from the base
logical-operator layer. For operators without a special naming rule,
`name` now defaults to `self.__class__.__name__`.

This PR removes concrete `_name` wiring where the assigned value was
only the class name, while preserving the special naming cases that
still need explicit values, such as `Read`, `Limit`, `RandomShuffle`,
`RandomizeBlocks`, and UDF-based map operators.

This PR does not include the later `_apply_transform` deduplication,
`input_op: InitVar` replacement, `_get_args` cleanup, or broader
logical-rule cleanup follow-ups.

## Related issues

Part of #60312

## Additional information

This PR corresponds to the `_name` derived-field step in the current
split plan.

### Tests

- `python -m pre_commit run --files
python/ray/data/_internal/logical/interfaces/logical_operator.py
python/ray/data/_internal/logical/operators/one_to_one_operator.py
python/ray/data/_internal/logical/operators/n_ary_operator.py
python/ray/data/_internal/logical/operators/all_to_all_operator.py
python/ray/data/_internal/logical/operators/count_operator.py
python/ray/data/_internal/logical/operators/input_data_operator.py
python/ray/data/_internal/logical/operators/from_operators.py
python/ray/data/_internal/logical/operators/streaming_split_operator.py
python/ray/data/_internal/logical/operators/join_operator.py
python/ray/data/_internal/logical/operators/write_operator.py
python/ray/data/_internal/logical/operators/map_operator.py`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_state_export.py
python/ray/data/tests/unit/test_logical_plan.py
python/ray/data/tests/test_execution_optimizer_basic.py -k 'Project or
Count or InputData or Union or Zip or split or join or write or read or
map'`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_execution_optimizer_advanced.py
python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py
-k 'zip or union or split or project or read or write or join'`

### Stack Plan

Done:
- PR-A: Add a default property implementation for `LogicalOperator.name`
- PR-B: Move logical `output_dependencies` handling out of logical
operators
- PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs`
- PR-D1: Convert one-to-one logical operators to frozen dataclasses
- PR-D2: Convert map logical operators to frozen dataclasses
- PR-D3: Convert all-to-all, join, read, and write logical operators to
frozen dataclasses
- PR-D4: Convert remaining source logical operators to frozen
dataclasses
- PR-Next-0: Convert the remaining concrete logical operators to frozen
dataclasses
- PR-Next-1: Convert abstract logical operator classes to frozen
dataclasses
- This PR: make logical-operator names derived by default

Next:
- deduplicate `_apply_transform`
- replace `input_op: InitVar` with a real `input_dependencies` field
- remove `input_dependency` on `AbstractOneToOne`
- clean up `_get_args`
- remove redundant `__repr__` / `__str__`
- clean up special-casing in logical rules
- finalize equality / comparability work for `#60312`

Signed-off-by: yaommen <myanstu@163.com>

* [Data] Deduplicate logical operator apply transform (#63089)

## Description

#### Why this is needed:

This is the next PR in the `#60312` logical plan migration stack.

After making logical operators frozen dataclasses and moving logical
operator names to the base layer, most concrete operators still carry
near-identical `_apply_transform` implementations. Each implementation
recursively transforms its input operator, keeps `self` when the input
is unchanged, and rebuilds the operator when the input changes.

#### What this PR changes:

Adds a frozen-safe default `_apply_transform` implementation to
`LogicalOperator` and moves operator-specific rebuild details into small
`_with_new_input` / `_with_new_input_dependencies` hooks.

For single-input operators, concrete dataclass operators with `input_op`
still use `dataclasses.replace(self, input_op=...)`, while generic
custom subclasses keep the previous shallow-copy child rewiring
behavior. `RandomShuffle` and `Repartition` keep small hooks because
they still have InitVar-only constructor values that must be passed
during replacement. `NAry` owns the common n-ary rebuild path for `Zip`
and `Union`, and `Join` keeps the multi-input rebuild hook for its left
and right inputs.
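
A minimal, hypothetical sketch of that frozen-safe rewiring pattern (simplified names, not the actual operator classes): keep `self` when the transformed input is unchanged, otherwise rebuild with `dataclasses.replace` instead of mutating frozen fields.

```python
import dataclasses
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Op:
    input_op: Optional["Op"] = None

    def apply_transform(self, fn: Callable[["Op"], "Op"]) -> "Op":
        if self.input_op is None:
            return fn(self)
        new_input = self.input_op.apply_transform(fn)
        if new_input is self.input_op:
            return fn(self)                                        # input unchanged: keep self
        return fn(dataclasses.replace(self, input_op=new_input))   # rebuild the frozen node
```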

This PR does not replace `input_op: InitVar` with a real
`input_dependencies` field, remove `input_dependency` from
`AbstractOneToOne`, clean up `_get_args`, or clean up logical-rule
special-casing. Those remain separate follow-ups in the current split
plan.

## Related issues

Part of #60312

## Additional information

This PR corresponds to the `_apply_transform` deduplication step in the
current split plan. It reopens the same change from #63086 against
`master`.

### Tests

- `python -m pre_commit run --files
python/ray/data/_internal/logical/interfaces/logical_operator.py
python/ray/data/_internal/logical/operators/all_to_all_operator.py
python/ray/data/_internal/logical/operators/count_operator.py
python/ray/data/_internal/logical/operators/join_operator.py
python/ray/data/_internal/logical/operators/map_operator.py
python/ray/data/_internal/logical/operators/n_ary_operator.py
python/ray/data/_internal/logical/operators/one_to_one_operator.py
python/ray/data/_internal/logical/operators/streaming_split_operator.py
python/ray/data/_internal/logical/operators/write_operator.py
python/ray/data/tests/unit/test_logical_plan.py`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/unit/test_logical_plan.py
python/ray/data/tests/test_execution_optimizer_limit_pushdown.py::test_limit_pushdown_recreates_frozen_download`
- In #63086: `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_execution_optimizer_limit_pushdown.py
python/ray/data/tests/test_execution_optimizer_advanced.py
python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py
-k 'limit_pushdown_recreates_frozen_download or zip_e2e or union or
split or project or join'`

### Stack Plan

Done:
- PR-A: Add a default property implementation for `LogicalOperator.name`
- PR-B: Move logical `output_dependencies` handling out of logical
operators
- PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs`
- PR-D1: Convert one-to-one logical operators to frozen dataclasses
- PR-D2: Convert map logical operators to frozen dataclasses
- PR-D3: Convert all-to-all, join, read, and write logical operators to
frozen dataclasses
- PR-D4: Convert remaining source logical operators to frozen
dataclasses
- PR-Next-0: Convert the remaining concrete logical operators to frozen
dataclasses
- PR-Next-1: Convert abstract logical operator classes to frozen
dataclasses
- PR-Next-2: make logical-operator names derived by default
- This PR: deduplicate `_apply_transform`

Next:
- replace `input_op: InitVar` with a real `input_dependencies` field
- remove `input_dependency` on `AbstractOneToOne`
- clean up `_get_args`
- remove redundant `__repr__` / `__str__`
- clean up special-casing in logical rules
- finalize equality / comparability work for `#60312`

---------

Signed-off-by: yaommen <myanstu@163.com>

* [RLlib] Fix ValueError in MultiAgentEpisode.get_rewards() when agent inactive for all requested env steps (#62907)

## Summary

Fixes #62903 - `MultiAgentEpisode.get_rewards()` (and other `get_*`
methods) no longer crashes with `ValueError` when called on a finalized
multi-agent episode where an agent was inactive during the requested env
steps.

When retrieving per-agent data by env step indices,
`_get_single_agent_data_by_env_step_indices` filters out
`SKIP_ENV_TS_TAG` entries for agents that didn't participate in certain
env steps. If an agent was inactive for **all** requested env steps, the
filtered indices list became empty, causing
`InfiniteLookbackBuffer.get(indices=[])` → `batch([])` → `ValueError:
Input list_of_structs does not contain any items`.

This PR adds an early return of an empty list when all indices are
filtered out, allowing the caller's existing `if len(agent_values) > 0`
guard to correctly exclude the inactive agent from the result dict.
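
A minimal sketch of the guard, with hypothetical names (`_get`,
`skip_tag`); the actual change lives in
`MultiAgentEpisode._get_single_agent_data_by_env_step_indices`:

```python
def _get(inf_lookback_buffer, indices, skip_tag):
    # Drop entries tagged as skipped env steps for this agent.
    filtered = [i for i in indices if i != skip_tag]
    # Early return: if the agent was inactive for all requested env
    # steps, `InfiniteLookbackBuffer.get(indices=[])` would hit
    # `batch([])` and raise, so return an empty list instead. The
    # caller's `if len(agent_values) > 0` guard then excludes the agent.
    if not filtered:
        return []
    return inf_lookback_buffer.get(indices=filtered)
```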

<details>
<summary>Before</summary>

```
episode.get_rewards(indices=slice(1, 3))

ValueError: Input `list_of_structs` does not contain any items.

  File "ray/rllib/env/multi_agent_episode.py", line 2554, in _get_data_by_env_steps
    agent_values = self._get_single_agent_data_by_env_step_indices(
  File "ray/rllib/env/multi_agent_episode.py", line 2753, in _get_single_agent_data_by_env_step_indices
    ret = inf_lookback_buffer.get(
  File "ray/rllib/env/utils/infinite_lookback_buffer.py", line 243, in get
    data = batch(data)
  File "ray/rllib/utils/spaces/space_utils.py", line 315, in batch
    raise ValueError("Input `list_of_structs` does not contain any items.")
```

</details>

<details>
<summary>After</summary>

```python
episode.get_rewards(indices=slice(1, 3))
# {'a0': array([0.2, 0.3])}
# a1 correctly excluded — it was inactive during env steps 1 and 2
```

</details>

<details>
<summary>Test results</summary>

```
$ python -m pytest test_multi_agent_episode.py -v -x

test_multi_agent_episode.py::TestMultiAgentEpisode::test_add_env_reset PASSED [  5%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_add_env_step PASSED [ 11%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_cut PASSED [ 17%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_actions PASSED [ 23%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_infos PASSED [ 29%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_observations PASSED [ 35%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_return PASSED [ 41%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_rewards PASSED [ 47%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_sample_batch PASSED [ 52%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_state_and_from_state PASSED [ 58%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_init PASSED [ 64%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_len PASSED [ 70%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_other_getters PASSED [ 76%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_setters PASSED [ 82%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_slice PASSED [ 88%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_slice_with_lookback PASSED [ 94%]
test_multi_agent_episode.py::test_multi_agent_episode_functionality PASSED [100%]

======================== 17 passed, 3 warnings in 5.03s ========================
```

</details>

## Test plan
- [x] Reproduced the bug: `get_rewards(indices=slice(1,3))` raises
`ValueError` on a finalized episode with an inactive agent
- [x] Verified the fix: same call now returns `{'a0': array([0.2,
0.3])}` with the inactive agent correctly excluded
- [x] Verified `get_rewards()` without indices still works as before
- [x] Added regression test to `test_get_rewards` in
`test_multi_agent_episode.py` covering:
- Finalized (numpy) episode with `get_rewards(indices=slice(1,3))` — the
exact crash scenario
- `get_actions(indices=slice(1,3))` — proves the fix covers all `get_*`
methods (shared code path)
- Non-finalized episode with same scenario — proves
finalized/non-finalized behavior is consistent
- [x] All 17 tests in `test_multi_agent_episode.py` pass (regression
test is inside existing `test_get_rewards`)

---------

Signed-off-by: Cursx <33718736+Cursx@users.noreply.github.com>

* Add shared Claude Code configuration for Ray development (#62554)

## Description:

Sets up hierarchical Claude Code instructions for the Ray repo so that
each team (Data, Serve, Train, Tune, RLlib, C++ Core) can maintain their
own scoped rules and skills.

## Primary changes:

- Root `.claude/CLAUDE.md` with shared instructions, plus per-library
templates that teams can fill in
- Path-scoped `.claude/rules/` for Python guidelines, C++ style,
security, and debugging
- Shared skills: `/rebuild`, `/lint`, `/fetch-buildkite-logs` for common
workflows
- `.claude/settings.json` with common permissions
- Developer docs at `doc/source/ray-contribute/agent-development.rst`
covering personal setup, worktree support, and how to add team-specific
rules/skills
- `.gitignore` updated to version-control shared config while keeping
personal files local

reference: https://code.claude.com/docs/en/best-practices

## Future work:
Support other coding agents like Codex; the instructions can be written
in common markdown files and imported into coding-agent-specific
instruction files.

We can also integrate with Anyscale-managed skills to help debug release
tests running on Anyscale workspaces.

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* refactor(raylet): split task option helpers into task_options_utils.pxi

Signed-off-by: hieuvous <trunghieu101112a1@gmail.com>

* refactor(raylet): move resource task option helpers

Signed-off-by: chichic21039 <Linhpham.1508055@gmail.com>

* Refactor/raylet task options utils resources fallback (#7)

* Move function and actor helpers to task options utils

* Update task options utils and raylet

* Resolve add/add merge conflict in task_options_utils.pxi

* Resolve add/add merge conflict in task_options_utils.pxi

* refactor(raylet): extract resources and fallback helpers to task_options_utils.pxi

Signed-off-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>

---------

Signed-off-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>
Signed-off-by: Duy Hưng <163378800+RitaKaniska@users.noreply.github.com>
Co-authored-by: hieuvous <trunghieu101112a1@gmail.com>
Co-authored-by: hieuvous <164620181+hieuvous@users.noreply.github.com>
Co-authored-by: HLDKNotFound <huynhleduykhanh25022005@gmail.com>
Co-authored-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Chaitanya Bharadwaj <venkatachaitanyametta@gmail.com>
Signed-off-by: Chaitanya Bharadwaj <74806126+mvcb@users.noreply.github.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: slxswaa1993 <470093691@qq.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: carolynwang <carolyn@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: yaommen <myanstu@163.com>
Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: Cursx <33718736+Cursx@users.noreply.github.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Signed-off-by: hieuvous <trunghieu101112a1@gmail.com>
Signed-off-by: chichic21039 <Linhpham.1508055@gmail.com>
Signed-off-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>
Signed-off-by: Duy Hưng <163378800+RitaKaniska@users.noreply.github.com>
Co-authored-by: Sai Miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Co-authored-by: Chaitanya Bharadwaj <74806126+mvcb@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: harshit-anyscale <harshit@anyscale.com>
Co-authored-by: zhilong <121425509+Bye-legumes@users.noreply.github.com>
Co-authored-by: slxswaa <slxswaa1993@hotmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Elliot Barnwell <elliot.barnwell@anyscale.com>
Co-authored-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Carolyn Wang <carolyn@anyscale.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: Mark Towers <mark.m.towers@gmail.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Lonnie Liu <lonnie@anyscale.com>
Co-authored-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: yaommen <myanstu@163.com>
Co-authored-by: Andrew Pollack-Gray <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Cursx <33718736+Cursx@users.noreply.github.com>
Co-authored-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: hieuvous <trunghieu101112a1@gmail.com>
Co-authored-by: chichic21039 <Linhpham.1508055@gmail.com>
Co-authored-by: hieuvous <164620181+hieuvous@users.noreply.github.com>
Co-authored-by: HLDKNotFound <huynhleduykhanh25022005@gmail.com>
Co-authored-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>
chillCode404 pushed a commit to chillCode404/ray-contrib that referenced this pull request May 9, 2026
…roject#63004)

## Description
PR [[Core] (Resource Isolation 12/n) Switch group killing policy to by
time killing policy](ray-project#62643)
enabled the new by-time killing policy by default, as opposed to the
legacy by-group killing policy. This resulted in `test_memory_pressure`
failures in post-merge CI. We found the following in our investigation:
* The integration test checks policy-specific behaviors, whereas the
memory-pressure integration test suite should instead test the memory
monitoring system's general ability to reduce memory pressure.
* The failing integration test should be a unit test that tests the
killing policy's behavior directly.

In general, we prefer unit tests over integration tests for
memory-threshold-sensitive tests, as the test environment can have a
significant impact on the result, leading to flaky test behavior.

This PR removes redundant integration tests whose policy-specific
behaviors are already covered by the policies' unit tests, and
introduces a new unit test for cases previously covered only by the
integration test. The following are the removed integration tests and
their replacements; a sketch of the by-time policy's ordering follows
the list:
* `test_restartable_actor_oom_retry_off_throws_oom_error` -> redundant
to `test_restartable_actor_throws_oom_error`
* `test_memory_pressure_kill_newest_worker` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_memory_pressure_kill_task_if_actor_submitted_task_first` ->
replaced by `TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_task_oom_no_oom_retry_fails_immediately` -> replaced by
`TestTaskOomKillNoOomRetryFailsImmediately` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_task_oom_only_uses_oom_retry` -> replaced by
`TestTaskOomInfiniteRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_newer_task_not_retriable_kill_older_retriable_task_first` ->
replaced by `TestPolicyPrioritizesRetriableOverNonRetriable` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_put_object_task_usage_slightly_below_limit_does_not_crash` ->
replaced by `TestMonitorDetectsMemoryBelowThresholdCallbackNotExecuted`
in
https://github.com/ray-project/ray/blob/master/src/ray/common/tests/threshold_memory_monitor_test.cc
* `test_last_task_of_the_group_fail_immediately` -> replaced by
`TestLastWorkerInGroupShouldNotRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_group_by_owner_test.cc
* `test_one_actor_max_lifo_kill_next_actor` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
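
A hypothetical sketch of the by-time policy's ordering, inferred from
the unit-test names above (the real policy is the C++ implementation
exercised by `worker_killing_policy_by_time_test.cc` and differs in
detail):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerInfo:
    start_time: float  # a larger start time means a newer worker
    retriable: bool


def pick_worker_to_kill(workers: List[WorkerInfo]) -> WorkerInfo:
    # Prefer killing retriable workers over non-retriable ones; within
    # the same retriability, kill the newest worker first.
    return max(workers, key=lambda w: (w.retriable, w.start_time))


# An older retriable worker is chosen over a newer non-retriable one
# (TestPolicyPrioritizesRetriableOverNonRetriable).
assert pick_worker_to_kill(
    [WorkerInfo(1.0, True), WorkerInfo(2.0, False)]
).retriable
```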

## Additional information
`test_memory_pressure` run:
https://buildkite.com/ray-project/postmerge/builds/17288

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>