Skip to content

Rl sandbox intro#1387

Merged
arbrown merged 2 commits into
mainfrom
rl-sandbox-intro
Jun 24, 2026
Merged

Rl sandbox intro#1387
arbrown merged 2 commits into
mainfrom
rl-sandbox-intro

Conversation

@arbrown

@arbrown arbrown commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

No description provided.

@arbrown arbrown merged commit a5e8634 into main Jun 24, 2026
6 checks passed
@arbrown arbrown deleted the rl-sandbox-intro branch June 24, 2026 20:11

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a High-Performance Distributed RL Sandbox environment on GKE, adding Dockerfiles, Kubernetes configurations, and a training script using TRL and Ray. Key feedback includes adding a timeout to the HTTP request, using --no-install-recommends to reduce Docker image size, enabling vLLM in the GRPO configuration for faster generation, making the bash block regex more robust, and moving ray.init() to the main function to prevent initialization issues on remote workers.

try:
github_repo = example["repo"]
url = f"https://raw.githubusercontent.com/{github_repo}/{example['base_commit']}/{target_file}"
with urllib.request.urlopen(url) as response:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The urllib.request.urlopen call does not specify a timeout. If the GitHub raw server is slow or unresponsive, this call can block indefinitely, hanging the dataset mapping process. It is recommended to set a reasonable timeout.

Suggested change
with urllib.request.urlopen(url) as response:
with urllib.request.urlopen(url, timeout=10) as response:

Comment on lines +5 to +9
RUN apt-get update && apt-get install -y \
git \
build-essential \
libsqlite3-dev \
&& rm -rf /var/lib/apt/lists/*

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To keep the Docker image size as small as possible, it is recommended to use the --no-install-recommends flag with apt-get install.

RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    build-essential \
    libsqlite3-dev \
    && rm -rf /var/lib/apt/lists/*

Comment on lines +116 to +124
training_args = GRPOConfig(
output_dir="outputs",
learning_rate=5e-6,
max_steps=10,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
num_generations=8,
generation_batch_size=8,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since the GPU worker base image is vllm/vllm-openai, you can significantly accelerate the generation phase of GRPO by enabling vLLM integration in GRPOConfig using use_vllm=True.

Suggested change
training_args = GRPOConfig(
output_dir="outputs",
learning_rate=5e-6,
max_steps=10,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
num_generations=8,
generation_batch_size=8,
)
training_args = GRPOConfig(
output_dir="outputs",
learning_rate=5e-6,
max_steps=10,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
num_generations=8,
generation_batch_size=8,
use_vllm=True,
)


try:
# Check if the code is correctly formatted
bash_match = re.search(r"```bash\n(.*?)\n```", code, re.DOTALL)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

LLMs frequently output sh instead of bash, or include trailing whitespace after the language identifier. The current regex is strict and will fail to match these variations, resulting in a 0.0 reward for otherwise valid completions. Using a more permissive regex like r"```(?:bash|sh)\\s*\\n(.*?)\\n```" is much more robust.

Suggested change
bash_match = re.search(r"```bash\n(.*?)\n```", code, re.DOTALL)
bash_match = re.search(r"```(?:bash|sh)\\s*\\n(.*?)\\n```", code, re.DOTALL)

import urllib.request
import re

ray.init(ignore_reinit_error=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling ray.init() at the module level is a Ray anti-pattern. When Ray workers import this module to execute tasks, they will run ray.init() again. Although ignore_reinit_error=True suppresses the error, it can still cause unexpected behavior or warnings. It is best practice to initialize Ray inside the main() function or under the if __name__ == "__main__": block.

Comment on lines +137 to +139
def main():
print("Submitting training job to GPU worker...")
ray.get(train.remote())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Initialize Ray inside the main() function to ensure it only runs on the driver process and not on the Ray workers when they import this module.

Suggested change
def main():
print("Submitting training job to GPU worker...")
ray.get(train.remote())
def main():
ray.init(ignore_reinit_error=True)
print("Submitting training job to GPU worker...")
ray.get(train.remote())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant