
Remove redundant flaky integration test in favor of unit tests #63004

Merged
MengjinYan merged 4 commits into ray-project:master from Kunchd:deflake-test_mem_pressure on Apr 30, 2026

Conversation

@Kunchd (Contributor) commented Apr 28, 2026

Description

PR "[Core] (Resource Isolation 12/n) Switch group killing policy to by time killing policy" (#62643) enabled the new by-time killing policy by default, as opposed to the legacy by-group killing policy. This resulted in test_memory_pressure failures in postmerge. Our investigation found the following:

  • The integration test checks for policy-specific behaviors, whereas the memory pressure integration test suite should instead test the memory monitoring system's general ability to reduce memory pressure.
  • The failing integration test should be a unit test that exercises the killing policy's behavior directly.

In general, we prefer unit tests over integration tests for memory-threshold-sensitive tests, as the test environment can have a significant impact on the results, leading to flaky behavior.

This PR removes redundant integration tests whose policy-specific behaviors are already covered by the policies' unit tests, and introduces a new unit test for cases that were previously covered only by the integration tests. The following are the removed integration tests and their replacements:

Related issues

Additional information

test_memory_pressure run: https://buildkite.com/ray-project/postmerge/builds/17288

davik added 2 commits April 28, 2026 21:17
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from a team as a code owner April 28, 2026 21:43
@Kunchd Kunchd added the go add ONLY when ready to merge, run all tests label Apr 28, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes a significant number of Python-based memory pressure tests and associated fixtures from test_memory_pressure.py. Concurrently, it introduces a new C++ unit test in task_manager_test.cc to validate the task manager's behavior regarding finite OOM retries, ensuring it correctly handles the retry counter and eventually reports an OOM error when the limit is reached. I have no feedback to provide.
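As a rough, standalone illustration of the retry-budget behavior that new `task_manager_test.cc` case validates, here is a minimal Python sketch with made-up names; it is not the actual C++ test or Ray's task manager API:

```python
# Hedged sketch: `max_oom_retries` / `on_oom_failure` are illustrative names, not
# Ray's real API. It only shows "retry while budget remains, then report OOM".
class OomRetryTracker:
    def __init__(self, max_oom_retries: int):
        self.max_oom_retries = max_oom_retries
        self.oom_retries_used = 0

    def on_oom_failure(self) -> str:
        if self.oom_retries_used < self.max_oom_retries:
            self.oom_retries_used += 1
            return "retry"          # budget remains: resubmit the task
        return "report_oom_error"   # budget exhausted: surface the OOM error


def test_finite_oom_retries_eventually_report_error():
    tracker = OomRetryTracker(max_oom_retries=2)
    assert [tracker.on_oom_failure() for _ in range(3)] == [
        "retry",
        "retry",
        "report_oom_error",
    ]
```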

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Apr 29, 2026
@MengjinYan MengjinYan merged commit 6f95fe9 into ray-project:master Apr 30, 2026
6 checks passed
RitaKaniska added a commit to chichic21039/ray that referenced this pull request May 5, 2026
* [ci] Migrate LLM auto-select and multi-node compute configs to new schema (#62873)

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* [serve] Deflake test_haproxy_metrics against HAProxy soft-reload (#62930)

test_haproxy_metrics asserts
`haproxy_backend_http_responses_total{proxy="http-default",code="2xx"}
1` after one request. The counter is racy:
- HAProxy backend health checks can increment it above 1, and
- a HAProxyManager soft-reload (which fires on every backend config
change) can zero it in the new worker.

Also, CI failures are unreadable today because pytest truncates the
metrics body in `assert x in y` to "...Har...".

Fix: poll with wait_for_condition, send a request each iteration, accept
counter >= 1. Also dump full /metrics on timeout so the next failure is
debuggable.
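
A minimal sketch of that polling pattern (illustrative only; the endpoint URLs and timeout are assumptions, and the real test uses the `wait_for_condition` helper mentioned above rather than a hand-rolled loop):

```python
import time
import requests

APP_URL = "http://localhost:8000/"              # assumed Serve app endpoint
METRICS_URL = "http://localhost:9101/metrics"   # assumed HAProxy metrics endpoint
METRIC = 'haproxy_backend_http_responses_total{proxy="http-default",code="2xx"}'


def wait_for_2xx_counter(timeout_s: float = 30.0) -> None:
    """Poll /metrics, sending a request each iteration, until the counter is >= 1."""
    deadline = time.monotonic() + timeout_s
    body = ""
    while time.monotonic() < deadline:
        requests.get(APP_URL)                    # drive at least one fresh 2xx response
        body = requests.get(METRICS_URL).text
        for line in body.splitlines():
            if line.startswith(METRIC) and float(line.rsplit(" ", 1)[-1]) >= 1:
                return
        time.sleep(1)
    # Dump the full metrics body so a timeout failure is debuggable.
    raise AssertionError(f"2xx counter never reached 1; full /metrics:\n{body}")
```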

Passes 5/5 locally

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

* [data][1/n] DataSourceV2: refactor V2 listing/scanner/reader infrastructure (#62975)

Internal refactor of V2 listing/scanner/reader infrastructure to prep
for the upcoming ListFiles/ReadFiles op split. No public API change.

- Listing: partition-column helpers on FileManifest, sample_files +
  _build_pruners helpers in listing_utils.
- FileReader.read(manifest): cached_property file_dataset_schema,
  _broadcast_partition_value helper, derived_items synthesis loop,
  early-return on empty manifest. Caller-supplied schema overrides
  pyarrow's per-fragment inference for the all-null first-file case.
- FileScanner: drop bucketing helper plan() (moved upstream to
  plan_list_files_op in PR-A2), add prune_manifest hook, keep
  compute_local_scheduling (used by V1 dispatch until PR-D).
- ArrowFileScanner / ParquetFileReader / Scanner: simplifications
  aligned with the new manifest-driven read path.
- arrow_block.py + dataset.py: Schema.names hides _bsp_stub stub
  column produced when the scanner emits zero-column batches.

This is part of breaking up PR #62880.

Co-authored-by: Goutam V. <>

* [Docs] Replace deprecated busyboxplus curl image in Kubernetes examples (fixes #61538) (#63019)

## Summary

Fixes broken Kubernetes example in RayService quickstart docs.

The image `radial/busyboxplus:curl` is no longer usable due to its
deprecated Docker manifest format, causing ImagePullBackOff errors.

## Changes

- Replaced `radial/busyboxplus:curl` with `curlimages/curl:latest`

## Testing

- Verified the new image works with `kubectl run`
- Confirmed curl commands execute successfully inside the pod

## Issue

Closes #61538

---------

Signed-off-by: Chaitanya Bharadwaj <venkatachaitanyametta@gmail.com>
Signed-off-by: Chaitanya Bharadwaj <74806126+mvcb@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [serve] Evict per-deployment LongPollHost state on deployment delete (#62820)

## Problem

`LongPollHost` had no eviction API. The deletion path in
`DeploymentStateManager.update()` cleaned scheduler, autoscaler, and
`_deployment_states` but never told the long-poll host. Three
per-deployment keys — `(DEPLOYMENT_TARGETS, id)`, the Java-compat
`(DEPLOYMENT_TARGETS, name)`, and `(DEPLOYMENT_CONFIG, id)` — survived
for the life of the controller, bounded by unique `(name, app_name)`
pairs.

It also meant **handle routers** (the routers embedded in
`serve.get_deployment_handle(...)`, in replicas or user driver code)
never received `is_available=False` on delete. `is_available` is derived
from `not _terminally_failed()` at `deployment_state.py:3198-3216`, not
from "deleting"; healthy deletes emit `is_available=True`, and
`broadcast_running_replicas_if_changed` can even early-return and emit
nothing at all. Requests through the handle then queue or hang instead
of failing fast with `DeploymentUnavailableError`. HTTP/gRPC proxies are
unaffected — they subscribe to `ROUTE_TABLE`, which
`EndpointState.delete_endpoint()` handles correctly.

## Fix

- **`LongPollHost.remove_keys(keys)`** — pops the four per-key maps,
decrements the pending-clients gauge by the number of woken waiters,
fires each waiter's event.
- **`listen_for_change` hardening** — done branch skips evicted keys
(was `KeyError`); `not_done` cleanup uses `.get()` instead of indexing
to avoid resurrecting `defaultdict` entries; empty sets are popped.
- **Delete path** — tombstones `DEPLOYMENT_TARGETS` via `notify_changed`
and evicts only `DEPLOYMENT_CONFIG`. The tombstoned key is intentionally
*not* evicted in the same sync tick: parked waiters run only after
`update()` returns, by which point the done-branch guard would drop the
tombstone.
- **Batched gauge writes** (per Gemini review) — collect affected
namespace tags, flush one `pending_clients_gauge.set(...)` per unique
tag after each loop.

After this, handle routers flip to `is_available=False` within ms of
delete and raise `DeploymentUnavailableError` immediately, rather than
relying on side channels (handle lifetime, driver GC, caller timeouts)
to eventually notice.
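
A simplified sketch of the eviction pattern described above (this is not the actual `LongPollHost` implementation; the attribute names and single-map layout are assumptions for illustration):

```python
from collections import defaultdict


class TinyLongPollHost:
    """Toy model: per-key snapshots plus parked waiter events, with an eviction method."""

    def __init__(self):
        self.snapshots = {}                 # key -> latest broadcast value
        self.waiters = defaultdict(set)     # key -> parked waiter events (e.g. asyncio.Event)

    def notify_changed(self, key, value):
        self.snapshots[key] = value
        for event in self.waiters.pop(key, set()):
            event.set()                     # wake listeners for this key

    def remove_keys(self, keys) -> int:
        """Evict per-key state and wake (then drop) any parked waiters."""
        woken = 0
        for key in keys:
            self.snapshots.pop(key, None)
            for event in self.waiters.pop(key, set()):
                event.set()
                woken += 1
        return woken                        # caller can decrement a pending-clients gauge
```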

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [Core] remove pydantic v1 support (#62716)

## Description
Drop Pydantic v1 support in Ray and require Pydantic v2 for the Ray
extras that depend on it. We remove Pydantic v1 support instead of
keeping an additional compatibility fix for Python 3.14. This makes the
dependency behavior clearer and lets us delete v1-specific
compatibility code.

## Related issues
https://github.com/ray-project/ray/issues/62664

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

* [Data] Reduce verbosity of arrow conversion warning logs (#61486)

## Description

When Arrow conversion fails and Ray Data falls back to pickle
serialization, the warning log includes the full exception traceback
(`exc_info=ace`), which can be extremely noisy — especially for nested
datatypes like image arrays where the data representation alone spans
many lines.

This PR moves the detailed error message and traceback to `DEBUG` level,
keeping the `WARNING` concise and actionable:

**Before:**
```
WARNING arrow.py:290 -- Failed to convert column 'flat_images' into pyarrow array due to: Error converting data to Arrow: [[array([[[130, 118, 255], [132, 117, 255], ...]]]...; falling back to serialize as pickled python objects
Traceback (most recent call last):
  File ".../arrow.py", line 258, in _convert_to_pyarrow_native_array
    ...
  (10+ lines of traceback)
```

**After:**
```
WARNING arrow.py:290 -- Failed to convert column 'flat_images' into pyarrow array; falling back to serialize as pickled python objects. To see the full error, set logging level to DEBUG.
```
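
A self-contained sketch of the pattern (paraphrased, not the exact `arrow.py` code):

```python
import logging

logger = logging.getLogger(__name__)


def log_arrow_fallback(column_name: str, ace: Exception) -> None:
    # Keep the WARNING short and actionable.
    logger.warning(
        "Failed to convert column %r into pyarrow array; falling back to "
        "serializing as pickled python objects. Set the logging level to "
        "DEBUG to see the full error.",
        column_name,
    )
    # Move the verbose details and the traceback to DEBUG.
    logger.debug("Arrow conversion error for column %r", column_name, exc_info=ace)
```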

## Related issues

Fixes #57840

## Additional information

- The full error details + traceback are still available at `DEBUG`
level for anyone who needs to investigate
- All existing unit tests pass (`test_transform_pyarrow.py`,
`test_arrow_type_conversion.py`)
- The `ArrowConversionError` already truncates data to 200 chars, but
even that plus the traceback was excessively verbose for a warning

---------

Signed-off-by: slxswaa1993 <470093691@qq.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [serve] Increase controller benchmark frequency (#63029)

## Description
We need denser benchmark results to identify regressions.


Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

* [ci][deps][1/3] PY313 DEP UNIFICATION: compiling requirements_compiled_py3.13.txt and depsets (#62864)

Refreshes `requirements_compiled_py3.13.txt` and the full set of
raydepsets locks against current source pins, and adds the supporting CI
plumbing and source-file changes needed to make the py3.13 lock
resolvable as a constraint across all py3.10/3.11/3.12/3.13 depsets.

## CI infrastructure

- **`.buildkite/dependencies.rayci.yml`** — new
`pip_compile_313_dependencies` Buildkite step (mirror of the existing
3.11 compile job). Runs `compile_313_pip_dependencies`, uploads the
artifact, and fails the build if `requirements_compiled_py3.13.txt`
drifts from source.
- **`ci/ci.sh`** — new `compile_313_pip_dependencies()` function that
points pip-compile at the `python/requirements/py313/` and
`python/requirements/ml/py313/` overrides and emits
`requirements_compiled_py3.13.txt`.

## Source-file pins

These drive the lock changes — no manual edits to the generated lock
files.

### `python/requirements/py313/test-requirements.txt`
- `fastapi==0.121.0` — FastAPI 0.125+ removed `pydantic.v1` route
support; `test_pydantic_serialization` still uses v1 BaseModel.
- `asgiref==3.9.2` — 3.10+ regresses Serve direct-ingress timeout /
disconnect handling.
- `redis==4.5.4` — TLS test compatibility.
- `opentelemetry-proto==1.39.0` and
`opentelemetry-exporter-otlp-proto-grpc==1.39.0` — co-pinned with
`opentelemetry-sdk` so vllm (rayllm depset) can satisfy the in-family
pins.
- `grpcio==1.76.0` + matching `grpcio-tools` / `grpcio-status` —
bisecting `test_raylet_and_agent_share_fate` against grpcio 1.80 startup
cost on the runtime-env agent.
- `jsonschema>=4.23.0,<4.25.0` — 4.25 introduced `rfc3987-syntax` which
pins `lark==1.3.1`, conflicting with vllm's `lark==1.2.2`.
- Dual `python_version`-marker pins for `protobuf`, `scipy`,
`contourpy`, `networkx` — these packages dropped py3.10 wheels at the
same time the py3.13 lock needed newer floors. Dual pinning preserves
the cross-py-version compat path when the py3.13 lock is consumed as a
constraint by py3.10 depsets.

### `python/requirements/ml/py313/`
- `data-requirements.txt` — `lance-namespace==0.6.1`.
- `dl-cpu-requirements.txt` / `dl-gpu-requirements.txt` —
`nvidia-nccl-cu12` aligned across CPU/GPU so the CPU-built lock doesn't
pin a version that conflicts with cu128 torch in GPU depsets.
- `ml-requirements.txt` — dual `keras` pin (3.12.1 for py<3.11, 3.14.0
for py>=3.11); keras 3.13 dropped py3.10.
- `rllib-requirements.txt` — dual `onnxruntime` pin (1.20.0 / 1.24.4)
keyed on python version.
- `train-requirements.txt` — `datasets==3.6.0`.

### `python/requirements/data/`
- `pyarrow-latest.txt` — added `delta-sharing`.
- `pyarrow-v9.txt` — pinned `datasets==2.14.4`, added `delta-sharing`.

## Depsets config

**`ci/raydepsets/configs/ci_data.depsets.yaml`** — added relax entries
so v9 / tfxbsl resolves can downgrade chains together:

- `relaxed_data`: relaxed `delta-sharing`, `dill`, `multiprocess`
(datasets 2.14.4 caps `dill<0.3.8` but py313 lock has `dill==0.4.1`).
- `relaxed_data_tfxbsl`: relaxed `absl-py`, `grpcio-status`,
`contourpy`, `scipy`, `delta-sharing` (tfx-bsl 1.16.x caps
`absl-py<2.0.0` and `protobuf<6`; contourpy 1.3.3 + apache-beam 2.53.0
numpy clash).

## Lock files

Regenerated `requirements_compiled_py3.13.txt` and ~70 depset locks
under `python/deplocks/` (base / ci / llm / ray_img / docs).

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [ci] Fix mismatch between bisect instance-type and runner-queue name (#62742)

## Description
A mismatch in the `instance_type` and `runner_queues` fields of bisect
pipeline rayci configs causes all `bisect` pipeline builds to fail.

## Related issues
None

## Additional information

https://buildkite.com/ray-project/bisect/builds/3673/steps/canvas?sid=019d9d9d-05de-4326-b5dc-d818fbcdc71f&tab=output

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>

* [ci] Migrate dataset GPU core compute configs to new schema (#62832)

## Summary

Migrates 2 Anyscale compute config files from the legacy schema to the
new SDK 2026 schema, and adds `anyscale_sdk_2026: true` to all
corresponding test entries in `release_data_tests.yaml`.

### Compute configs migrated (2 files)

**Dataset tests** (`release/nightly_tests/dataset/`):
- `fixed_size_gpu_compute.yaml`
- `autoscaling_gpu_compute.yaml`

### Tests updated in release_data_tests.yaml (3 tests)

Via `{{scaling}}_gpu_compute.yaml` template:
1. `image_classification_{{scaling}}`
2. `image_classification_from_parquet_{{scaling}}`

Hardcoded `dataset/autoscaling_gpu_compute.yaml` (chaos test overrides
`working_dir: nightly_tests`):
3. `image_classification_chaos`

### Schema changes applied
- `cloud_id` → `cloud`, `ANYSCALE_CLOUD_ID` → `ANYSCALE_CLOUD_NAME`
- Removed `region: us-west-2`
- `head_node_type` → `head_node`, `worker_node_types` → `worker_nodes`
- `min_workers` → `min_nodes`, `max_workers` → `max_nodes`
- `use_spot: false` → `market_type: ON_DEMAND`
- `advanced_configurations_json` → `advanced_instance_config`
- Dropped head/worker `name:` fields (single worker group per config)
- Dropped head-node `resources: {cpu: 0}` — new SDK defaults head CPU to
0 when `worker_nodes` is present (head is CPU-only coordinator; GPU
workloads run on `g4dn.2xlarge` workers)
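
As an illustrative sketch only (not a script shipped with this PR), the renames listed above applied to a legacy config loaded as a dict would look roughly like the following; field handling is simplified and the names mirror the list, not the full SDK schema:

```python
def migrate_compute_config(legacy: dict) -> dict:
    """Apply the legacy -> SDK 2026 renames described above to a config dict."""
    cfg = dict(legacy)
    if "cloud_id" in cfg:
        cfg["cloud"] = cfg.pop("cloud_id")
    cfg.pop("region", None)                                  # dropped in the new schema
    if "head_node_type" in cfg:
        cfg["head_node"] = cfg.pop("head_node_type")
    if "advanced_configurations_json" in cfg:
        cfg["advanced_instance_config"] = cfg.pop("advanced_configurations_json")
    for worker in cfg.pop("worker_node_types", []):
        worker["min_nodes"] = worker.pop("min_workers", 0)
        worker["max_nodes"] = worker.pop("max_workers", 0)
        if worker.pop("use_spot", None) is False:
            worker["market_type"] = "ON_DEMAND"
        cfg.setdefault("worker_nodes", []).append(worker)
    return cfg
```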

## Test plan
- [x] Both config files validated against `ComputeConfig.from_yaml()`
- [x] CI passes with `anyscale_sdk_2026: true` flag on all 3 test
entries: https://buildkite.com/ray-project/release/builds/89918

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* [ci] Migrate scheduling and single node benchmark compute configs to new schema (#62489)

## Summary
- Migrated 4 compute config files to the new Anyscale SDK schema:
`scheduling.yaml`, `scheduling_gce.yaml`, `single_node.yaml`,
`single_node_gce.yaml`
- Updated 2 test entries (`single_node`,
`scheduling_test_many_0s_tasks_many_nodes`) in `release_tests.yaml` with
`anyscale_sdk_2026: true` flag
- Key transformations: `cloud_id` -> `cloud`, `head_node_type` ->
`head_node`, `worker_node_types` -> `worker_nodes`, flattened
`custom_resources`, renamed
`advanced_configurations_json`/`gcp_advanced_configurations_json` ->
`advanced_instance_config`, `use_spot: false` -> `market_type:
ON_DEMAND`, `min/max_workers` -> `min/max_nodes`

## Test plan
- [x] All 4 configs validated against `ComputeConfig.from_yaml()`
- [x] Verify `single_node` nightly tests pass on Buildkite
- [x] Verify `scheduling_test_many_0s_tasks_many_nodes` nightly tests
pass on Buildkite

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [Serve][LLM] Add rate-limiter logic for per request traceback spam (#62440)

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [ci] Migrate dask-on-ray and shuffle compute configs to new schema (#62605)

## Summary
- Migrate 9 compute config files (dask-on-ray and shuffle) from legacy
Anyscale schema to new SDK schema
- Add `anyscale_sdk_2026: true` to 5 test entries in
`release_tests.yaml`

## Config files migrated
-
`release/nightly_tests/dask_on_ray/dask_on_ray_sort_compute_template.yaml`
(AWS, head-only)
-
`release/nightly_tests/dask_on_ray/dask_on_ray_sort_compute_template_gce.yaml`
(GCE, head-only)
- `release/nightly_tests/dask_on_ray/1tb_sort_compute.yaml` (AWS, head +
32 workers)
- `release/nightly_tests/shuffle/shuffle_compute_multi.yaml` (AWS, head
+ 3 workers)
- `release/nightly_tests/shuffle/shuffle_compute_multi_gce.yaml` (GCE,
head + 3 workers)
- `release/nightly_tests/shuffle/shuffle_compute_single.yaml` (AWS,
head-only)
- `release/nightly_tests/shuffle/shuffle_compute_single_gce.yaml` (GCE,
head-only)
- `release/nightly_tests/shuffle/shuffle_compute_autoscaling.yaml` (AWS,
head + 0-19 workers)
- `release/nightly_tests/shuffle/shuffle_compute_autoscaling_gce.yaml`
(GCE, head + 0-19 workers)

## Test entries updated (anyscale_sdk_2026: true)
- `dask_on_ray_100gb_sort`
- `dask_on_ray_1tb_sort`
- `shuffle_20gb_with_state_api`
- `shuffle_100gb`
- `autoscaling_shuffle_1tb_1000_partitions`

## Schema changes applied
- `cloud_id` → `cloud` (env var name updated)
- `head_node_type` → `head_node` (removed `name:` field)
- `worker_node_types` → `worker_nodes` (omitted for head-only configs)
- `min_workers`/`max_workers` → `min_nodes`/`max_nodes`
- `use_spot: false` → `market_type: ON_DEMAND`
- `advanced_configurations_json` / `gcp_advanced_configurations_json` →
`advanced_instance_config`
- GCE: `region` + `allowed_azs` → `zones`
- Removed: `region`, `max_workers`, commented-out blocks
- Capitalized `cpu` → `CPU` in resources

## Test plan
- [x] All 9 configs validated against `ComputeConfig.from_yaml()`
- [x] Verify CI passes with new configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [ci] Migrate stress test and placement group compute configs to new schema (#62607)

## Summary

Migrates 15 Anyscale compute config files from the legacy schema to the
new SDK 2026 schema, and adds `anyscale_sdk_2026: true` to all
corresponding test entries in `release_tests.yaml`.

### Compute configs migrated (15 files)

**Stress tests** (`release/nightly_tests/stress_tests/`):
- `stress_tests_compute.yaml` / `stress_tests_compute_gce.yaml`
- `stress_tests_compute_large.yaml` /
`stress_tests_compute_large_gce.yaml`
- `smoke_test_compute.yaml` / `smoke_test_compute_gce.yaml`
- `stress_test_threaded_actor_compute.yaml`
- `placement_group_tests_compute.yaml` /
`placement_group_tests_compute_gce.yaml`
- `stress_tests_single_node_oom_compute.yaml` /
`stress_tests_single_node_oom_compute_gce.yaml`

**Placement group tests**
(`release/nightly_tests/placement_group_tests/`):
- `compute.yaml` / `compute_gce.yaml`
- `pg_perf_test_compute.yaml` / `pg_perf_test_compute_gce.yaml`

### Tests updated in release_tests.yaml (9 tests)

1. `stress_test_placement_group`
2. `stress_test_state_api_scale`
3. `stress_test_many_tasks`
4. `stress_test_dead_actors`
5. `threaded_actors_stress_test`
6. `stress_test_many_runtime_envs`
7. `single_node_oom`
8. `pg_autoscaling_regression_test`
9. `placement_group_performance_test`

### Schema changes applied
- `cloud_id` → `cloud`, `ANYSCALE_CLOUD_ID` → `ANYSCALE_CLOUD_NAME`
- `head_node_type` → `head_node`, `worker_node_types` → `worker_nodes`
- `min_workers` → `min_nodes`, `max_workers` → `max_nodes`
- `use_spot: false` → `market_type: ON_DEMAND`
- `advanced_configurations_json` / `gcp_advanced_configurations_json` →
`advanced_instance_config`
- GCE: `region` + `allowed_azs` → `zones`
- Resources: `cpu` → `CPU`, `gpu` → `GPU`, flattened `custom_resources`
- Removed: `region`, `max_workers`, head/worker `name` fields (kept
where multiple workers share instance type)
- Removed commented-out blocks
- Added `CPU` resources to head nodes where `wait_for_nodes` > worker
count

## Test plan
- [x] All 15 config files validated against `ComputeConfig.from_yaml()`
- [x] CI passes with `anyscale_sdk_2026: true` flag on all test entries:
https://buildkite.com/ray-project/release/builds/89908

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [ci] Migrate chaos test compute configs to new schema (#62606)

## Summary
- Migrated 2 chaos test compute config files (`compute_template.yaml`
and `compute_template_gce.yaml`) from legacy Anyscale compute config
schema to the new SDK schema
- Added `anyscale_sdk_2026: true` flag to all 16 chaos test entries in
`release_tests.yaml`

### Config changes
- `cloud_id` -> `cloud`, `ANYSCALE_CLOUD_ID` -> `ANYSCALE_CLOUD_NAME`
- `head_node_type` -> `head_node`, `worker_node_types` -> `worker_nodes`
- `min_workers`/`max_workers` -> `min_nodes`/`max_nodes`
- `use_spot: false` -> `market_type: ON_DEMAND`
- `advanced_configurations_json` -> `advanced_instance_config`
- Flattened `resources` (removed `custom_resources` nesting, capitalized
`cpu` -> `CPU`)
- GCE: replaced `region` + `allowed_azs` with `zones`
- Removed `region`, `max_workers`, and node `name` fields

### Tests updated (16)
-
`chaos_many_tasks_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`
-
`chaos_many_actors_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`
-
`chaos_streaming_generator_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`
-
`chaos_object_ref_borrowing_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}`

## Test plan
- [x] Both config files validated against `ComputeConfig.from_yaml()`
- [x] Verify chaos tests pass on nightly run after merge

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [ci] Migrate microbenchmark, benchmark-worker-startup, and rllib compute configs to new schema (#62604)

## Summary
- Migrate 10 compute config files to the new Anyscale SDK schema
(`cloud_id` -> `cloud`, `head_node_type` -> `head_node`,
`worker_node_types` -> `worker_nodes`, etc.)
- Add `anyscale_sdk_2026: true` flag to 12 test cluster blocks in
`release_tests.yaml`

## Config files migrated
- `release/microbenchmark/tpl_64.yaml` (AWS, head-only)
- `release/microbenchmark/tpl_64_gce.yaml` (GCE, head-only)
- `release/microbenchmark/experimental/compute_t4_gpu.yaml` (AWS,
head-only GPU)
- `release/microbenchmark/experimental/compute_gpu_2x1_aws.yaml` (AWS,
head+worker GPU)
- `release/microbenchmark/experimental/compute_a100_gpu.yaml` (AWS,
head-only GPU)
- `release/microbenchmark/experimental/compute_l4_gpu.yaml` (AWS,
head-only GPU)
- `release/microbenchmark/experimental/compute_l4_gpu_2x1_aws.yaml`
(AWS, head+worker GPU)
- `release/benchmark-worker-startup/only_head_node_1gpu_64cpu.yaml`
(AWS, head-only GPU)
- `release/benchmark-worker-startup/only_head_node_1gpu_64cpu_gce.yaml`
(GCE, head-only)
- `release/rllib_tests/1gpu_16cpus.yaml` (AWS, head-only GPU)

## Tests updated with `anyscale_sdk_2026: true`
- `microbenchmark` (base + GCE variation)
- `compiled_graphs`
- `compiled_graphs_GPU`
- `compiled_graphs_GPU_multinode`
- `compiled_graphs_GPU_cu130`
- `compiled_graphs_GPU_multinode_cu130`
- `rdt_single_node_T4_microbenchmark`
- `rdt_single_node_A100_microbenchmark`
- `benchmark_worker_startup` (base + GCE variation)
- `rllib_learning_tests_pong_appo_torch`

## Test plan
- [x] All 10 config files validated against `ComputeConfig.from_yaml()`
- [x] CI passes with the new configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* recompiling requirements_compiled_py313.txt

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

* [deps] updating tag on py313 deps (#63033)

updating tag on py313 deps to prevent unnecessary compilation in
premerge

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

* [core] fix the mypy type check on BaseContext.__exit__ (#62999)

## Description

Fix the type error on `BaseContext.__exit__`. Also added the
reported use case to our mypy test cases.
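
For reference, a typical mypy-clean `__exit__` annotation looks like the sketch below (illustrative; not necessarily the exact annotation adopted in this PR):

```python
from types import TracebackType
from typing import Optional, Type


class ExampleContext:
    def __enter__(self) -> "ExampleContext":
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_value: Optional[BaseException],
        traceback: Optional[TracebackType],
    ) -> Optional[bool]:
        # Returning None (falsy) means exceptions are not suppressed.
        return None
```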

## Related issues
Fixes https://github.com/ray-project/ray/issues/62971

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

* [core] increase the cleanup timeout in the chaos iptable test (#62992)

## Description

Increase the waiting time for the cleanup according to the cluster logs
from the
[failure](https://buildkite.com/ray-project/release/builds/90709#019dd2b5-f78b-47eb-aa8f-331c5c68cad3):

### Timeline
- **23:14:00**: Actor workload starts with network failure injection
every 60s.
- **23:15:00**: First 5s network fault affects head + 4 workers.
- **23:16:02**: Raylet reports worker process `10563` did not register
within timeout.
- **23:16:02-23:17:00**: `ReportActor.add` retries pile up after
connection resets; progress stalls near 47%.
- **23:18:34-23:20:34**: Head state dumps show `128` total worker CPUs
and `0` available while actor work is still running.
- **23:21:34**: Head sees `128` total CPUs, `112` available. Missing
`16` CPUs are all on `10.0.45.36`.
- **23:21:43**: Worker `10.0.45.36` reports 16 `ReportActor.__init__`
workers, each holding `1 CPU`.
- **23:21:47**: Those 16 `ReportActor` workers disconnect gracefully.
- **23:21:49**: `wait_for_condition` times out before observing all CPUs
released; another network fault triggers at the same time.

So, increasing the cleanup consistency timeout should likely fix this
specific failure.


## Related issues
Fixes: https://github.com/anyscale/ray/issues/1534


---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [core] move observability pubsub to ObservabilityPubSubService (#62806)

This PR is a follow-up to https://github.com/ray-project/ray/pull/62461.
It isolates 3 pubsub channels that have lower priorities and are not
part of the critical control plane, moving them from the
InternalPubSubGcsService to their own io_context and the new
ObservabilityPubSubService:

    pubsub_pb2.RAY_ERROR_INFO_CHANNEL
    pubsub_pb2.RAY_LOG_CHANNEL
    pubsub_pb2.RAY_NODE_RESOURCE_USAGE_CHANNEL

This will ensure that they won't block the critical control plane. The
new ObservabilityPubSubRpcClient also allows us to move the service out
of the GCS if needed in the future.

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>

* [ci] Fix doc build failing on broken pytorch intersphinx inventory (#63038)

- The doc build (`make -C doc html`, which runs `sphinx-build -W
--keep-going`) is failing with `build finished with problems, 1
warning`. The single Sphinx warning is an intersphinx fetch failure:
`https://pytorch.org/docs/stable/objects.inv` 301s to
`https://docs.pytorch.org/docs/stable/objects.inv`, which currently 404s
upstream. With `-W`, that one warning fails CI.
- Repoint the `torch` intersphinx mapping in `doc/source/conf.py` to
bypass the broken `/stable/objects.inv`. The base URL stays at the
canonical `https://docs.pytorch.org/docs/stable/` so generated
cross-reference links still target /stable/, but the inventory is
fetched from a working pinned version:
`https://docs.pytorch.org/docs/2.7/objects.inv`.
- Pin matches Ray's runtime torch version (`torch==2.7.0` in
`python/requirements/ml/dl-{cpu,gpu}-requirements.txt`), so cross-refs
only resolve to symbols that actually exist in the torch version users get.
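
The resulting `conf.py` shape is roughly the sketch below (the exact dict contents in the PR may differ):

```python
# doc/source/conf.py (sketch): keep the canonical /stable/ base URL for generated
# links, but fetch the inventory from the pinned, working 2.7 location.
intersphinx_mapping = {
    "torch": (
        "https://docs.pytorch.org/docs/stable/",          # link-target base URL
        "https://docs.pytorch.org/docs/2.7/objects.inv",  # pinned inventory source
    ),
}
```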

## Why pin to 2.7 and not /stable/ or /main/

- `/stable/objects.inv` is the upstream-broken URL we're routing around,
so it can't be the source.
- `/main/objects.inv` works but tracks the development branch, which can
index APIs that don't exist in 2.7 — leading to cross-refs resolving to
symbols Ray users can't actually call.
- `/2.7/objects.inv` matches the runtime exactly. Tradeoff: when Ray
bumps torch, this URL needs to bump alongside the requirements pin.

Post merge run: https://buildkite.com/ray-project/postmerge/builds/17329

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

* [observability] add instance filter to gpu usage metric query (#62214)

## Description
Adds instance filter to the node gpu usage metric panel

Signed-off-by: carolynwang <carolyn@anyscale.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>

* Remove redundant flaky integration test in favor of unit tests (#63004)

## Description
PR [[Core] (Resource Isolation 12/n) Switch group killing policy to by
time killing policy](https://github.com/ray-project/ray/pull/62643)
enabled the new by-time killing policy by default, as opposed to the
legacy by-group killing policy. This resulted in `test_memory_pressure`
failures in postmerge. Our investigation found the following:
* The integration test checks for policy-specific behaviors, whereas the
memory pressure integration test suite should instead test the memory
monitoring system's general ability to reduce memory pressure.
* The failing integration test should be a unit test that exercises the
killing policy's behavior directly.

In general, we prefer unit tests over integration tests for
memory-threshold-sensitive tests, as the test environment can have a
significant impact on the results, leading to flaky behavior.

This PR removes redundant integration tests whose policy-specific
behaviors are already covered by the policies' unit tests, and
introduces a new unit test for cases that were previously covered only
by the integration tests. The following are the removed integration
tests and their replacements:
* `test_restartable_actor_oom_retry_off_throws_oom_error` -> redundant
to `test_restartable_actor_throws_oom_error`
* `test_memory_pressure_kill_newest_worker` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_memory_pressure_kill_task_if_actor_submitted_task_first` ->
replaced by `TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_task_oom_no_oom_retry_fails_immediately` -> replaced by
`TestTaskOomKillNoOomRetryFailsImmediately` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_task_oom_only_uses_oom_retry` -> replaced by
`TestTaskOomInfiniteRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_newer_task_not_retriable_kill_older_retriable_task_first` ->
replaced by `TestPolicyPrioritizesRetriableOverNonRetriable` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_put_object_task_usage_slightly_below_limit_does_not_crash` ->
replaced by `TestMonitorDetectsMemoryBelowThresholdCallbackNotExecuted`
in
https://github.com/ray-project/ray/blob/master/src/ray/common/tests/threshold_memory_monitor_test.cc
* `test_last_task_of_the_group_fail_immediately` -> replaced by
`TestLastWorkerInGroupShouldNotRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_group_by_owner_test.cc
* `test_one_actor_max_lifo_kill_next_actor` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc

## Additional information
`test_memory_pressure` run:
https://buildkite.com/ray-project/postmerge/builds/17288

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>

* [Train] Add missing %s to logger.debug (#63039)

`logger.debug` was missing the `%s` placeholder and as a result was clogging up the logs.
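
Illustrative pattern only (hypothetical message and variable, not the PR's exact line): without a `%s` placeholder the extra argument is never interpolated, and with debug logging enabled each call emits a formatting error instead of the value; with the placeholder, the value is interpolated lazily.

```python
import logging

logger = logging.getLogger(__name__)
batch_size = 128  # hypothetical value for illustration

# Broken: extra argument but no %s placeholder -> "--- Logging error ---" spam.
logger.debug("Reported training batch of size", batch_size)

# Fixed: the placeholder interpolates the value only when DEBUG is enabled.
logger.debug("Reported training batch of size %s", batch_size)
```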

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>

* Add perf metrics for 2.55.0 (#63060)

```
REGRESSION 52.54%: tasks_per_second (THROUGHPUT) regresses from 386.6133448073775 to 183.49078025658062 in benchmarks/many_nodes.json
REGRESSION 37.10%: tasks_per_second (THROUGHPUT) regresses from 594.0367087794571 to 373.6653345877981 in benchmarks/many_tasks.json
REGRESSION 4.22%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 5.723101265712336 to 5.481786077048712 in microbenchmark.json
REGRESSION 4.09%: multi_client_put_gigabytes (THROUGHPUT) regresses from 42.60577675231464 to 40.8627833341568 in microbenchmark.json
REGRESSION 1.86%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.982001139120161 to 0.9637211637507427 in microbenchmark.json
REGRESSION 0.84%: client__get_calls (THROUGHPUT) regresses from 1119.7606509262687 to 1110.3815800718512 in microbenchmark.json
REGRESSION 0.63%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 2985.2594797119345 to 2966.3149904468737 in microbenchmark.json
REGRESSION 0.48%: client__put_calls (THROUGHPUT) regresses from 851.7996054229982 to 847.7132252307356 in microbenchmark.json
REGRESSION 289.14%: dashboard_p95_latency_ms (LATENCY) regresses from 37.856 to 147.311 in benchmarks/many_pgs.json
REGRESSION 135.33%: dashboard_p99_latency_ms (LATENCY) regresses from 798.453 to 1879.035 in benchmarks/many_pgs.json
REGRESSION 110.53%: stage_4_spread (LATENCY) regresses from 0.3184540688712737 to 0.6704279092079272 in stress_tests/stress_test_many_tasks.json
REGRESSION 48.31%: avg_pg_remove_time_ms (LATENCY) regresses from 1.154493106606675 to 1.7122211741741544 in stress_tests/stress_test_placement_group.json
REGRESSION 34.75%: dashboard_p50_latency_ms (LATENCY) regresses from 5.002 to 6.74 in benchmarks/many_pgs.json
REGRESSION 21.20%: stage_0_time (LATENCY) regresses from 7.112839698791504 to 8.620674133300781 in stress_tests/stress_test_many_tasks.json
REGRESSION 19.38%: stage_3_creation_time (LATENCY) regresses from 2.621494770050049 to 3.1294972896575928 in stress_tests/stress_test_many_tasks.json
REGRESSION 8.31%: dashboard_p95_latency_ms (LATENCY) regresses from 42.959 to 46.531 in benchmarks/many_nodes.json
REGRESSION 8.03%: 107374182400_large_object_time (LATENCY) regresses from 22.459637914999973 to 24.263247010999976 in scalability/single_node.json
REGRESSION 8.00%: 10000_args_time (LATENCY) regresses from 11.357349357000004 to 12.265755501000008 in scalability/single_node.json
REGRESSION 7.69%: avg_pg_create_time_ms (LATENCY) regresses from 1.5098637252248464 to 1.6259311876874045 in stress_tests/stress_test_placement_group.json
REGRESSION 3.67%: 3000_returns_time (LATENCY) regresses from 3.577688757000004 to 3.7088375179999957 in scalability/single_node.json
```

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Co-authored-by: Lonnie Liu <lonnie@anyscale.com>

* [core] rename InternalPubSub* to ControlPlanePubSub* (#63044)

## Description
Renaming `InternalPubSub*` to `ControlPlanePubSub*` for clarity.
Following up to
https://github.com/ray-project/ray/pull/62806#pullrequestreview-4207199543


Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

* [Serve][2/5] Add custom ingress request router app interfaces (#62680)

Direct ingress needs an app-scoped ingress request router deployment
that HAProxy can call to map each request to a target replica ID before
forwarding the request to the selected replica.

This change attaches that router to the Serve application object itself, so
both imperative and declarative deployment paths consume the same
composed application graph.

## API shape

Imperative usage:

```python
llm_server = LLMServer.bind(...)

ingress_request_router = IngressRequestRouter.bind(
    llm_deployment=llm_server,
)

app = llm_server._with_ingress_request_router(ingress_request_router)

serve.run(app, route_prefix="/v1")
```

Declarative usage:

```python
# my_module.py
llm_server = LLMServer.bind(...)

ingress_request_router = IngressRequestRouter.bind(
    llm_deployment=llm_server,
)

app = llm_server._with_ingress_request_router(ingress_request_router)
```

```yaml
applications:
- name: llm
  route_prefix: /v1
  import_path: my_module:app
```

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>

* [Core] Match expected resource isolation integration test constraint to new cgroup constraint (#63054)

## Description
The resource isolation Python integration tests are currently failing
because the resource isolation upper-bound constraint has been adjusted
from `memory.max` to `memory.high` in the latest resource isolation
changes without updating the integration test. This PR adjusts the
resource isolation integration test to match the latest changes and use
the `memory.high` upper-bound constraint.

The resource isolation PR that updated the memory constraint without
updating the test:
https://github.com/ray-project/ray/pull/62705/changes#diff-60b34dab728b2e51426a465dd712767a8735682e137e52ebfe030123aeeb56d5L69-R77

## Related issues
Fixes failing core: cgroup tests

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>

* [serve] Enable logs in `LongPollHost` when `LongPollClient` stops its attached event loop (#63028)

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

* [Train] Reduce `test_result_restore` flakiness (#63045)

Reviewing the logs for a flaky run of `test_result_restore` shows that
rank 1 has a training report but rank 0 doesn't (the RuntimeError in
rank 1 runs before the checkpoint in rank 0 is saved); therefore, when
computing `get_best_checkpoints`, checkpoints are missing and
occasionally the wrong results are returned.

We can easily resolve this by adding a sync barrier between workers
before raising the error, ensuring that the checkpoints are all saved.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>

* [Data] Fix HashAggregate duplicate group rows for AggregateFnV2 (#63066)

## Summary

`TableBlockBuilder.build()` reordered rows across an internal compaction
boundary, so `_aggregate`'s per-block partial-aggregate output could be
unsorted by the group key. That violates the "inputs are sorted by key"
precondition that `_combine_aggregated_blocks`' `heapq.merge` relies on,
and surfaced as duplicate group rows in HashAggregate output whose count
varied with the parallelism arg.

Two issues, both fixed:

1. **`TableBlockBuilder.build()`** put the still-uncompacted
dict-of-lists (newest rows) in front of the previously-compacted tables.
Now appends the uncompacted tail after the compacted tables — preserving
insertion order.
2. **`ArrowBlockBuilder._combine_tables`** called
`transform_pyarrow.concat` without `preserve_order=True`. When block
schemas didn't unify exactly (common for V2 aggregators whose
accumulator varies in shape between rows — e.g. an empty list vs. a
non-empty list, inferring `list<null>` vs `list<string>`), `concat` took
a fast path that groups schema-matching blocks together and prepends
mismatched ones. Now passes `preserve_order=True` since the builder's
contract is to preserve insertion order regardless of internal
compaction or schema unification.

## Where `_combine_tables` sits in the hash-shuffle lifecycle

```mermaid
sequenceDiagram
    autonumber
    participant ShuffleTask as _shuffle_block (Ray task)
    participant Closure as input_block_transformer<br/>(_aggregate closure)
    participant TableAcc as TableBlockAccessor._aggregate
    participant Builder as TableBlockBuilder
    participant Combine as ArrowBlockBuilder._combine_tables
    participant Aggregator as HashShuffleAggregator
    participant Reducer as ReducingAggregation

    ShuffleTask->>Closure: block_transformer(block)
    Closure->>Closure: pruned.sort(sort_key)
    Closure->>TableAcc: target._aggregate(sort_key, aggs)
    loop for each group (sorted)
        TableAcc->>Builder: builder.add(row)
        Note over Builder: _compact_if_needed may flush<br/>_columns into _tables mid-loop
    end
    TableAcc->>Builder: builder.build()
    Builder->>Combine: _combine_tables(_tables + [_columns_partial])
    Note over Combine: ★ FIXES LIVE HERE<br/>build(): append uncompacted tail (was: prepend)<br/>_combine_tables: preserve_order=True
    Combine-->>Builder: sorted partial-aggregate block
    Builder-->>TableAcc: sorted partial-aggregate block
    TableAcc-->>Closure: sorted partial-aggregate block
    Closure-->>ShuffleTask: sorted partial-aggregate block
    ShuffleTask->>ShuffleTask: hash_partition (np.where + take, preserves order)
    ShuffleTask->>Aggregator: aggregator.submit.remote(shard)
    Aggregator->>Reducer: compact / finalize (List[Block])
    Reducer->>Reducer: _combine_aggregated_blocks<br/>(heapq.merge — now sees sorted inputs)
```

The bug was in step 8: `_combine_tables` and `build()` could permute
rows across compactions, propagating unsorted blocks through steps 9–14
to the `heapq.merge` in step 15, which silently produced duplicate group
rows because its consecutive-equal-key grouping only collapses adjacent
rows.
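
A standalone illustration of that precondition (plain Python, not Ray code): `heapq.merge` assumes each input iterable is already sorted, and adjacent-key grouping only collapses neighbouring rows, so one unsorted input yields a duplicated group.

```python
import heapq
from itertools import groupby

sorted_block = [("a", 1), ("b", 2)]
unsorted_block = [("b", 3), ("a", 4)]   # violates the sorted-by-key precondition

merged = heapq.merge(sorted_block, unsorted_block, key=lambda kv: kv[0])
combined = [(k, sum(v for _, v in rows)) for k, rows in groupby(merged, key=lambda kv: kv[0])]
print(combined)  # [('a', 1), ('b', 5), ('a', 4)] -- group 'a' silently appears twice
```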

## Test plan

- [x] New regression test
`test_partial_aggregate_preserves_sort_after_builder_compaction` in
`python/ray/data/tests/test_hash_shuffle.py` forces compaction on every
row via `MAX_UNCOMPACTED_SIZE_BYTES=1` and asserts partial-aggregate
output is sorted by the group key. Fails on master, passes after this
change.
- [x] Full `test_hash_shuffle.py` suite (19 tests) passes.
- [x] `test_hash_shuffle_aggregator.py` suite passes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [Data][LLM] Fix wrong documented default for max_tasks_in_flight_per_actor (#62917)

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [train] Export default data execution options (#62784)

Follow-up to #59186, which only captured `execution_options` when the
user provided them per-dataset in the form of a dict, dropping the
default or user-provided global `ExecutionOptions`. This PR captures the
default and user-provided global options alongside the per-dataset
execution options, exposed via a typed `DataExecutionOptions` model
split into `default` and `per_dataset_execution_options`.

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

* [Data] Convert abstract logical operator classes to frozen dataclasses (#62593)

## Description

#### Why this is needed:

This is the next PR in the `#60312` logical plan migration stack.

After finishing the remaining concrete operator coverage, the next step
in the split plan is to convert `LogicalOperator` and the abstract
logical operator base classes into frozen dataclasses.

#### What this PR changes:

Makes the following abstract logical operator classes frozen
dataclasses:
- `LogicalOperator`
- `NAry`
- `AbstractOneToOne`
- `AbstractMap`
- `AbstractUDFMap`
- `AbstractAllToAll`

This PR also makes `_name`, `_input_dependencies`, and `_num_outputs`
proper dataclass fields on `LogicalOperator`, removing the manual
`LogicalOperator.__init__`.

Extending the same step to additional abstract-base state runs into
concrete dataclass constructor-generation errors (for example,
`TypeError: non-default argument 'input_op' follows default argument`),
so the broader field-model cleanup remains in the later follow-up PRs.
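
A standalone reproduction of that constructor-generation error, with hypothetical field names rather than Ray's operator classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    num_outputs: int = 1      # field with a default in the abstract base


@dataclass(frozen=True)
class Child(Base):            # running this raises:
    input_op: Base            # TypeError: non-default argument 'input_op' follows default argument
```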

This PR does not include the later `_name` derived-field work,
`_apply_transform` deduplication, `input_op: InitVar` replacement, or
broader logical-rule cleanup follow-ups.

## Related issues

Closes #60312

## Additional information

This PR corresponds to the current split-plan step for making
`LogicalOperator` and the abstract logical operator base classes frozen
dataclasses.

### Tests

- `python -m pre_commit run --files
python/ray/data/_internal/logical/interfaces/logical_operator.py
python/ray/data/_internal/logical/operators/n_ary_operator.py
python/ray/data/_internal/logical/operators/one_to_one_operator.py
python/ray/data/_internal/logical/operators/map_operator.py
python/ray/data/_internal/logical/operators/all_to_all_operator.py`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_execution_optimizer_basic.py -k 'map or
repartition or sort or union or zip'`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py
-k 'union or split'`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_join.py -k 'inner or outer or semi or anti'`

### Stack Plan

Done:
- PR-A: Add a default property implementation for `LogicalOperator.name`
- PR-B: Move logical `output_dependencies` handling out of logical
operators
- PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs`
- PR-D1: Convert one-to-one logical operators to frozen dataclasses
- PR-D2: Convert map logical operators to frozen dataclasses
- PR-D3: Convert all-to-all, join, read, and write logical operators to
frozen dataclasses
- PR-D4: Convert remaining source logical operators to frozen
dataclasses
- PR-Next-0: Convert remaining concrete logical operators to frozen
dataclasses
- This PR: make `LogicalOperator` and the abstract logical operator base
classes frozen dataclasses

Next:
- make `_name` a derived field
- deduplicate `_apply_transform`
- replace `input_op: InitVar` with a real `input_dependencies` field
- remove `input_dependency` on `AbstractOneToOne`
- clean up `_get_args`
- remove redundant `__repr__` / `__str__`
- clean up special-casing in logical rules
- finalize equality / comparability work for `#60312`

---------

Signed-off-by: yaommen <myanstu@163.com>

* [ci] convert core.rayci.yml test steps to array and narrow subsets (#62799)

Convert the two remaining matrix test steps in core.rayci.yml — "core:
python {{matrix.python}} tests" (matrix setup with python + worker_id)
and "core: minimal tests" — to array syntax; their corebuild-multipy and
minbuild-core depends_on refine from (*) to ($). Narrow three (*)
fan-ins in core.rayci.yml down to (python=3.10) subsets for the wheel
tests, HA integration, and runtime env container steps that only
exercise python 3.10. Across cicd, data, dependencies, doc, kuberay,
llm, ml, others, rllib, and serve, narrow each oss-ci-base_* (*)
dependency to (python=X.Y) where the consuming step pins a single python
version; leave (*) in place where the step truly spans multiple versions
(data top-level ml base, ml mlbuild-multipy / mlgpubuild-multipy, serve
top-level build base).

Signed-off-by: andrew <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

* [Data] Make logical operator names derived by default (#63084)

## Description

#### Why this is needed:

This is the next PR in the `#60312` logical plan migration stack.

After moving the shared logical-operator backing fields to the
abstract-class layer, `_name` is still wired manually in many concrete
operators. Most of those assignments are just the operator class name,
so the next step is to make that default behavior come from the base
logical-operator layer.

#### What this PR changes:

Makes logical-operator names derived by default from the base
logical-operator layer. For operators without a special naming rule,
`name` now defaults to `self.__class__.__name__`.

This PR removes concrete `_name` wiring where the assigned value was
only the class name, while preserving the special naming cases that
still need explicit values, such as `Read`, `Limit`, `RandomShuffle`,
`RandomizeBlocks`, and UDF-based map operators.

This PR does not include the later `_apply_transform` deduplication,
`input_op: InitVar` replacement, `_get_args` cleanup, or broader
logical-rule cleanup follow-ups.

## Related issues

Part of #60312

## Additional information

This PR corresponds to the `_name` derived-field step in the current
split plan.

### Tests

- `python -m pre_commit run --files
python/ray/data/_internal/logical/interfaces/logical_operator.py
python/ray/data/_internal/logical/operators/one_to_one_operator.py
python/ray/data/_internal/logical/operators/n_ary_operator.py
python/ray/data/_internal/logical/operators/all_to_all_operator.py
python/ray/data/_internal/logical/operators/count_operator.py
python/ray/data/_internal/logical/operators/input_data_operator.py
python/ray/data/_internal/logical/operators/from_operators.py
python/ray/data/_internal/logical/operators/streaming_split_operator.py
python/ray/data/_internal/logical/operators/join_operator.py
python/ray/data/_internal/logical/operators/write_operator.py
python/ray/data/_internal/logical/operators/map_operator.py`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_state_export.py
python/ray/data/tests/unit/test_logical_plan.py
python/ray/data/tests/test_execution_optimizer_basic.py -k 'Project or
Count or InputData or Union or Zip or split or join or write or read or
map'`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_execution_optimizer_advanced.py
python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py
-k 'zip or union or split or project or read or write or join'`

### Stack Plan

Done:
- PR-A: Add a default property implementation for `LogicalOperator.name`
- PR-B: Move logical `output_dependencies` handling out of logical
operators
- PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs`
- PR-D1: Convert one-to-one logical operators to frozen dataclasses
- PR-D2: Convert map logical operators to frozen dataclasses
- PR-D3: Convert all-to-all, join, read, and write logical operators to
frozen dataclasses
- PR-D4: Convert remaining source logical operators to frozen
dataclasses
- PR-Next-0: Convert the remaining concrete logical operators to frozen
dataclasses
- PR-Next-1: Convert abstract logical operator classes to frozen
dataclasses
- This PR: make logical-operator names derived by default

Next:
- deduplicate `_apply_transform`
- replace `input_op: InitVar` with a real `input_dependencies` field
- remove `input_dependency` on `AbstractOneToOne`
- clean up `_get_args`
- remove redundant `__repr__` / `__str__`
- clean up special-casing in logical rules
- finalize equality / comparability work for `#60312`

Signed-off-by: yaommen <myanstu@163.com>

* [Data] Deduplicate logical operator apply transform (#63089)

## Description

#### Why this is needed:

This is the next PR in the `#60312` logical plan migration stack.

After making logical operators frozen dataclasses and moving logical
operator names to the base layer, most concrete operators still carry
near-identical `_apply_transform` implementations. Each implementation
recursively transforms its input operator, keeps `self` when the input
is unchanged, and rebuilds the operator when the input changes.

#### What this PR changes:

Adds a frozen-safe default `_apply_transform` implementation to
`LogicalOperator` and moves operator-specific rebuild details into small
`_with_new_input` / `_with_new_input_dependencies` hooks.

For single-input operators, concrete dataclass operators with `input_op`
still use `dataclasses.replace(self, input_op=...)`, while generic
custom subclasses keep the previous shallow-copy child rewiring
behavior. `RandomShuffle` and `Repartition` keep small hooks because
they still have InitVar-only constructor values that must be passed
during replacement. `NAry` owns the common n-ary rebuild path for `Zip`
and `Union`, and `Join` keeps the multi-input rebuild hook for its left
and right inputs.
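
A minimal, hypothetical sketch of that frozen-safe rewiring pattern (simplified names, not the actual operator classes): keep `self` when the transformed input is unchanged, otherwise rebuild with `dataclasses.replace` instead of mutating frozen fields.

```python
import dataclasses
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Op:
    input_op: Optional["Op"] = None

    def apply_transform(self, fn: Callable[["Op"], "Op"]) -> "Op":
        if self.input_op is None:
            return fn(self)
        new_input = self.input_op.apply_transform(fn)
        if new_input is self.input_op:
            return fn(self)                                        # input unchanged: keep self
        return fn(dataclasses.replace(self, input_op=new_input))   # rebuild the frozen node
```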

This PR does not replace `input_op: InitVar` with a real
`input_dependencies` field, remove `input_dependency` from
`AbstractOneToOne`, clean up `_get_args`, or clean up logical-rule
special-casing. Those remain separate follow-ups in the current split
plan.

## Related issues

Part of #60312

## Additional information

This PR corresponds to the `_apply_transform` deduplication step in the
current split plan. It reopens the same change from #63086 against
`master`.

### Tests

- `python -m pre_commit run --files
python/ray/data/_internal/logical/interfaces/logical_operator.py
python/ray/data/_internal/logical/operators/all_to_all_operator.py
python/ray/data/_internal/logical/operators/count_operator.py
python/ray/data/_internal/logical/operators/join_operator.py
python/ray/data/_internal/logical/operators/map_operator.py
python/ray/data/_internal/logical/operators/n_ary_operator.py
python/ray/data/_internal/logical/operators/one_to_one_operator.py
python/ray/data/_internal/logical/operators/streaming_split_operator.py
python/ray/data/_internal/logical/operators/write_operator.py
python/ray/data/tests/unit/test_logical_plan.py`
- `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/unit/test_logical_plan.py
python/ray/data/tests/test_execution_optimizer_limit_pushdown.py::test_limit_pushdown_recreates_frozen_download`
- In #63086: `PYTHONPATH=python python -m pytest -q
python/ray/data/tests/test_execution_optimizer_limit_pushdown.py
python/ray/data/tests/test_execution_optimizer_advanced.py
python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py
-k 'limit_pushdown_recreates_frozen_download or zip_e2e or union or
split or project or join'`

### Stack Plan

Done:
- PR-A: Add a default property implementation for `LogicalOperator.name`
- PR-B: Move logical `output_dependencies` handling out of logical
operators
- PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs`
- PR-D1: Convert one-to-one logical operators to frozen dataclasses
- PR-D2: Convert map logical operators to frozen dataclasses
- PR-D3: Convert all-to-all, join, read, and write logical operators to
frozen dataclasses
- PR-D4: Convert remaining source logical operators to frozen
dataclasses
- PR-Next-0: Convert the remaining concrete logical operators to frozen
dataclasses
- PR-Next-1: Convert abstract logical operator classes to frozen
dataclasses
- PR-Next-2: make logical-operator names derived by default
- This PR: deduplicate `_apply_transform`

Next:
- replace `input_op: InitVar` with a real `input_dependencies` field
- remove `input_dependency` on `AbstractOneToOne`
- clean up `_get_args`
- remove redundant `__repr__` / `__str__`
- clean up special-casing in logical rules
- finalize equality / comparability work for `#60312`

---------

Signed-off-by: yaommen <myanstu@163.com>

* [RLlib] Fix ValueError in MultiAgentEpisode.get_rewards() when agent inactive for all requested env steps (#62907)

## Summary

Fixes #62903 - `MultiAgentEpisode.get_rewards()` (and other `get_*`
methods) no longer crashes with `ValueError` when called on a finalized
multi-agent episode where an agent was inactive during the requested env
steps.

When retrieving per-agent data by env step indices,
`_get_single_agent_data_by_env_step_indices` filters out
`SKIP_ENV_TS_TAG` entries for agents that didn't participate in certain
env steps. If an agent was inactive for **all** requested env steps, the
filtered indices list became empty, causing
`InfiniteLookbackBuffer.get(indices=[])` → `batch([])` → `ValueError:
Input list_of_structs does not contain any items`.

This PR adds an early return of an empty list when all indices are
filtered out, allowing the caller's existing `if len(agent_values) > 0`
guard to correctly exclude the inactive agent from the result dict.
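
A minimal sketch of the guard, with hypothetical names (`_get`,
`skip_tag`); the actual change lives in
`MultiAgentEpisode._get_single_agent_data_by_env_step_indices`:

```python
def _get(inf_lookback_buffer, indices, skip_tag):
    # Drop entries tagged as skipped env steps for this agent.
    filtered = [i for i in indices if i != skip_tag]
    # Early return: if the agent was inactive for all requested env
    # steps, `InfiniteLookbackBuffer.get(indices=[])` would hit
    # `batch([])` and raise, so return an empty list instead. The
    # caller's `if len(agent_values) > 0` guard then excludes the agent.
    if not filtered:
        return []
    return inf_lookback_buffer.get(indices=filtered)
```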

<details>
<summary>Before</summary>

```
episode.get_rewards(indices=slice(1, 3))

ValueError: Input `list_of_structs` does not contain any items.

  File "ray/rllib/env/multi_agent_episode.py", line 2554, in _get_data_by_env_steps
    agent_values = self._get_single_agent_data_by_env_step_indices(
  File "ray/rllib/env/multi_agent_episode.py", line 2753, in _get_single_agent_data_by_env_step_indices
    ret = inf_lookback_buffer.get(
  File "ray/rllib/env/utils/infinite_lookback_buffer.py", line 243, in get
    data = batch(data)
  File "ray/rllib/utils/spaces/space_utils.py", line 315, in batch
    raise ValueError("Input `list_of_structs` does not contain any items.")
```

</details>

<details>
<summary>After</summary>

```python
episode.get_rewards(indices=slice(1, 3))
# {'a0': array([0.2, 0.3])}
# a1 correctly excluded — it was inactive during env steps 1 and 2
```

</details>

<details>
<summary>Test results</summary>

```
$ python -m pytest test_multi_agent_episode.py -v -x

test_multi_agent_episode.py::TestMultiAgentEpisode::test_add_env_reset PASSED [  5%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_add_env_step PASSED [ 11%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_cut PASSED [ 17%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_actions PASSED [ 23%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_infos PASSED [ 29%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_observations PASSED [ 35%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_return PASSED [ 41%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_rewards PASSED [ 47%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_sample_batch PASSED [ 52%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_state_and_from_state PASSED [ 58%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_init PASSED [ 64%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_len PASSED [ 70%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_other_getters PASSED [ 76%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_setters PASSED [ 82%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_slice PASSED [ 88%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_slice_with_lookback PASSED [ 94%]
test_multi_agent_episode.py::test_multi_agent_episode_functionality PASSED [100%]

======================== 17 passed, 3 warnings in 5.03s ========================
```

</details>

## Test plan
- [x] Reproduced the bug: `get_rewards(indices=slice(1,3))` raises
`ValueError` on a finalized episode with an inactive agent
- [x] Verified the fix: same call now returns `{'a0': array([0.2,
0.3])}` with the inactive agent correctly excluded
- [x] Verified `get_rewards()` without indices still works as before
- [x] Added regression test to `test_get_rewards` in
`test_multi_agent_episode.py` covering:
- Finalized (numpy) episode with `get_rewards(indices=slice(1,3))` — the
exact crash scenario
- `get_actions(indices=slice(1,3))` — proves the fix covers all `get_*`
methods (shared code path)
- Non-finalized episode with same scenario — proves
finalized/non-finalized behavior is consistent
- [x] All 17 tests in `test_multi_agent_episode.py` pass (regression
test is inside existing `test_get_rewards`)

---------

Signed-off-by: Cursx <33718736+Cursx@users.noreply.github.com>

* Add shared Claude Code configuration for Ray development (#62554)

## Description:

Sets up hierarchical Claude Code instructions for the Ray repo so that
each team (Data, Serve, Train, Tune, RLlib, C++ Core) can maintain their
own scoped rules and skills.

## Primary changes:

- Root `.claude/CLAUDE.md` with shared instructions, plus per-library
templates that teams can fill in
- Path-scoped `.claude/rules/` for Python guidelines, C++ style,
security, and debugging
- Shared skills: `/rebuild`, `/lint`, `/fetch-buildkite-logs` for common
workflows
- `.claude/settings.json` with common permissions
- Developer docs at `doc/source/ray-contribute/agent-development.rst`
covering personal setup, worktree support, and how to add team-specific
rules/skills
- `.gitignore` updated to version-control shared config while keeping
personal files local

reference: https://code.claude.com/docs/en/best-practices

## Future work:
Support other coding agents like Codex; the instructions can be written
in common markdown files and imported into coding-agent-specific
instruction files.

We can also integrate with Anyscale-managed skills to help debug release
tests running on Anyscale workspaces.

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* refactor(raylet): split task option helpers into task_options_utils.pxi

Signed-off-by: hieuvous <trunghieu101112a1@gmail.com>

* refactor(raylet): move resource task option helpers

Signed-off-by: chichic21039 <Linhpham.1508055@gmail.com>

* Refactor/raylet task options utils resources fallback (#7)

* Move function and actor helpers to task options utils

* Update task options utils and raylet

* Resolve add/add merge conflict in task_options_utils.pxi

* Resolve add/add merge conflict in task_options_utils.pxi

* refactor(raylet): extract resources and fallback helpers to task_options_utils.pxi

Signed-off-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>

---------

Signed-off-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>
Signed-off-by: Duy Hưng <163378800+RitaKaniska@users.noreply.github.com>
Co-authored-by: hieuvous <trunghieu101112a1@gmail.com>
Co-authored-by: hieuvous <164620181+hieuvous@users.noreply.github.com>
Co-authored-by: HLDKNotFound <huynhleduykhanh25022005@gmail.com>
Co-authored-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Chaitanya Bharadwaj <venkatachaitanyametta@gmail.com>
Signed-off-by: Chaitanya Bharadwaj <74806126+mvcb@users.noreply.github.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: slxswaa1993 <470093691@qq.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: carolynwang <carolyn@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: yaommen <myanstu@163.com>
Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: Cursx <33718736+Cursx@users.noreply.github.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Signed-off-by: hieuvous <trunghieu101112a1@gmail.com>
Signed-off-by: chichic21039 <Linhpham.1508055@gmail.com>
Signed-off-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>
Signed-off-by: Duy Hưng <163378800+RitaKaniska@users.noreply.github.com>
Co-authored-by: Sai Miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Co-authored-by: Chaitanya Bharadwaj <74806126+mvcb@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: harshit-anyscale <harshit@anyscale.com>
Co-authored-by: zhilong <121425509+Bye-legumes@users.noreply.github.com>
Co-authored-by: slxswaa <slxswaa1993@hotmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Elliot Barnwell <elliot.barnwell@anyscale.com>
Co-authored-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Carolyn Wang <carolyn@anyscale.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: Mark Towers <mark.m.towers@gmail.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Lonnie Liu <lonnie@anyscale.com>
Co-authored-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: yaommen <myanstu@163.com>
Co-authored-by: Andrew Pollack-Gray <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Cursx <33718736+Cursx@users.noreply.github.com>
Co-authored-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: hieuvous <trunghieu101112a1@gmail.com>
Co-authored-by: chichic21039 <Linhpham.1508055@gmail.com>
Co-authored-by: hieuvous <164620181+hieuvous@users.noreply.github.com>
Co-authored-by: HLDKNotFound <huynhleduykhanh25022005@gmail.com>
Co-authored-by: Duyhung080205 <dhdhung23@clc.fitus.edu.vn>
chillCode404 pushed a commit to chillCode404/ray-contrib that referenced this pull request May 9, 2026
…roject#63004)

## Description
PR [[Core] (Resource Isolation 12/n) Switch group killing policy to by
time killing policy](ray-project#62643)
enabled the new by-time killing policy by default, as opposed to the
legacy by-group killing policy. This resulted in `test_memory_pressure`
failures in post-merge CI. We found the following in our investigation:
* The integration test checks policy-specific behaviors, whereas the
memory-pressure integration test suite should instead test the memory
monitoring system's general ability to reduce memory pressure.
* The failing integration test should be a unit test that tests the
killing policy's behavior directly.

In general, we prefer unit tests over integration tests for
memory-threshold-sensitive tests, as the test environment can have a
significant impact on the result, leading to flaky test behavior.

This PR removes redundant integration tests whose policy-specific
behaviors are already covered by the policies' unit tests, and
introduces a new unit test for cases previously covered only by the
integration test. The following are the removed integration tests and
their replacements; a sketch of the by-time policy's ordering follows
the list:
* `test_restartable_actor_oom_retry_off_throws_oom_error` -> redundant
to `test_restartable_actor_throws_oom_error`
* `test_memory_pressure_kill_newest_worker` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_memory_pressure_kill_task_if_actor_submitted_task_first` ->
replaced by `TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_task_oom_no_oom_retry_fails_immediately` -> replaced by
`TestTaskOomKillNoOomRetryFailsImmediately` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_task_oom_only_uses_oom_retry` -> replaced by
`TestTaskOomInfiniteRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc
* `test_newer_task_not_retriable_kill_older_retriable_task_first` ->
replaced by `TestPolicyPrioritizesRetriableOverNonRetriable` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
* `test_put_object_task_usage_slightly_below_limit_does_not_crash` ->
replaced by `TestMonitorDetectsMemoryBelowThresholdCallbackNotExecuted`
in
https://github.com/ray-project/ray/blob/master/src/ray/common/tests/threshold_memory_monitor_test.cc
* `test_last_task_of_the_group_fail_immediately` -> replaced by
`TestLastWorkerInGroupShouldNotRetry` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_group_by_owner_test.cc
* `test_one_actor_max_lifo_kill_next_actor` -> replaced by
`TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in
https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc
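
A hypothetical sketch of the by-time policy's ordering, inferred from
the unit-test names above (the real policy is the C++ implementation
exercised by `worker_killing_policy_by_time_test.cc` and differs in
detail):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerInfo:
    start_time: float  # a larger start time means a newer worker
    retriable: bool


def pick_worker_to_kill(workers: List[WorkerInfo]) -> WorkerInfo:
    # Prefer killing retriable workers over non-retriable ones; within
    # the same retriability, kill the newest worker first.
    return max(workers, key=lambda w: (w.retriable, w.start_time))


# An older retriable worker is chosen over a newer non-retriable one
# (TestPolicyPrioritizesRetriableOverNonRetriable).
assert pick_worker_to_kill(
    [WorkerInfo(1.0, True), WorkerInfo(2.0, False)]
).retriable
```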

## Additional information
`test_memory_pressure` run:
https://buildkite.com/ray-project/postmerge/builds/17288

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>