RL sandbox intro #1387
Code Review
This pull request introduces a High-Performance Distributed RL Sandbox environment on GKE, adding Dockerfiles, Kubernetes configurations, and a training script using TRL and Ray. Key feedback includes adding a timeout to the HTTP request, using --no-install-recommends to reduce Docker image size, enabling vLLM in the GRPO configuration for faster generation, making the bash block regex more robust, and moving ray.init() to the main function to prevent initialization issues on remote workers.
    try:
        github_repo = example["repo"]
        url = f"https://raw.githubusercontent.com/{github_repo}/{example['base_commit']}/{target_file}"
        with urllib.request.urlopen(url) as response:
The urllib.request.urlopen call does not specify a timeout. If the GitHub raw server is slow or unresponsive, this call can block indefinitely, hanging the dataset mapping process. It is recommended to set a reasonable timeout.
    - with urllib.request.urlopen(url) as response:
    + with urllib.request.urlopen(url, timeout=10) as response:
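A minimal sketch of the suggested timeout with error handling around it. The helper name `fetch_text` and the choice to return `None` on failure are assumptions for illustration and are not part of the PR; the 10-second default mirrors the suggested change.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Return the decoded body at `url`, or None if the request fails.

    `fetch_text` is a hypothetical helper; the PR's script may handle
    failures differently.
    """
    try:
        # `timeout` bounds blocking socket operations such as connecting,
        # so an unresponsive server raises instead of hanging forever.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return None
```

Returning `None` lets a dataset mapping step skip an unreachable file rather than abort the whole run; whether skipping is acceptable depends on how the training script treats missing context.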
    RUN apt-get update && apt-get install -y \
        git \
        build-essential \
        libsqlite3-dev \
        && rm -rf /var/lib/apt/lists/*
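Add --no-install-recommends to the apt-get install command so that recommended but unneeded packages are skipped, which reduces the Docker image size.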
    training_args = GRPOConfig(
        output_dir="outputs",
        learning_rate=5e-6,
        max_steps=10,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=8,
        generation_batch_size=8,
    )
Since the GPU worker base image is vllm/vllm-openai, you can significantly accelerate the generation phase of GRPO by enabling vLLM integration in GRPOConfig using use_vllm=True.
      training_args = GRPOConfig(
          output_dir="outputs",
          learning_rate=5e-6,
          max_steps=10,
          per_device_train_batch_size=1,
          gradient_accumulation_steps=4,
          num_generations=8,
          generation_batch_size=8,
    +     use_vllm=True,
      )
    try:
        # Check if the code is correctly formatted
        bash_match = re.search(r"```bash\n(.*?)\n```", code, re.DOTALL)
LLMs frequently output sh instead of bash, or include trailing whitespace after the language identifier. The current regex is strict and fails to match these variations, so otherwise valid completions receive a reward of 0.0. A more permissive regex such as r"```(?:bash|sh)\s*\n(.*?)\n```" is much more robust.
    - bash_match = re.search(r"```bash\n(.*?)\n```", code, re.DOTALL)
    + bash_match = re.search(r"```(?:bash|sh)\s*\n(.*?)\n```", code, re.DOTALL)
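The permissive pattern can be checked in isolation with a short sketch. The helper name `extract_bash_block` and the sample completions are assumptions for illustration; the fence marker is built from single backticks only so that the example nests cleanly inside a code block.

```python
import re

FENCE = "`" * 3  # three backticks, built here so this example nests cleanly

# Accepts a bash or sh language tag and tolerates trailing whitespace
# after the tag, as the review comment proposes.
BASH_BLOCK = re.compile(FENCE + r"(?:bash|sh)\s*\n(.*?)\n" + FENCE, re.DOTALL)


def extract_bash_block(completion):
    """Return the contents of the first fenced bash/sh block, or None."""
    match = BASH_BLOCK.search(completion)
    return match.group(1) if match else None


samples = [
    FENCE + "bash\necho hi\n" + FENCE,     # the form the strict regex expects
    FENCE + "sh  \nls -la\n" + FENCE,      # sh tag with trailing spaces
    FENCE + "python\nprint(1)\n" + FENCE,  # not a shell block
]
for sample in samples:
    print(repr(extract_bash_block(sample)))
# prints 'echo hi', then 'ls -la', then None
```

Note that in a raw string the escapes must be single backslashes: a doubled backslash before s or n would match a literal backslash character rather than whitespace or a newline.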
    import urllib.request
    import re

    ray.init(ignore_reinit_error=True)
Calling ray.init() at the module level is a Ray anti-pattern. When Ray workers import this module to execute tasks, they will run ray.init() again. Although ignore_reinit_error=True suppresses the error, it can still cause unexpected behavior or warnings. It is best practice to initialize Ray inside the main() function or under the if __name__ == "__main__": block.
    def main():
        print("Submitting training job to GPU worker...")
        ray.get(train.remote())
Initialize Ray inside the main() function to ensure it only runs on the driver process and not on the Ray workers when they import this module.
      def main():
    +     ray.init(ignore_reinit_error=True)
          print("Submitting training job to GPU worker...")
          ray.get(train.remote())
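The import-time behavior behind this comment can be illustrated without Ray. In the sketch below, `init()` is a hypothetical stand-in for `ray.init()` that only flips a flag, and the two module sources and the `import_from_source` helper are assumptions made for the demonstration; it shows the general Python behavior, not Ray's actual worker startup.

```python
import importlib.util
import tempfile
from pathlib import Path

# Two hypothetical modules. init() stands in for ray.init(); it only
# flips a flag so the effect of importing each module can be inspected.
MODULE_LEVEL_INIT = '''
INITIALIZED = False

def init():
    global INITIALIZED
    INITIALIZED = True

init()  # module scope: runs on every import

def main():
    pass
'''

INIT_IN_MAIN = '''
INITIALIZED = False

def init():
    global INITIALIZED
    INITIALIZED = True

def main():
    init()  # runs only when main() is called

if __name__ == "__main__":
    main()
'''


def import_from_source(name, source):
    """Write `source` to a temporary file and import it as module `name`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"{name}.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module


eager = import_from_source("module_level_init", MODULE_LEVEL_INIT)
lazy = import_from_source("init_in_main", INIT_IN_MAIN)

print(eager.INITIALIZED)  # True: init() ran as a side effect of the import
print(lazy.INITIALIZED)   # False: importing alone did not call init()
lazy.main()
print(lazy.INITIALIZED)   # True: init() ran because main() was called
```

Any process that imports the first module triggers the initialization, which is the situation the suggested change avoids by moving the call into main().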
No description provided.