Problem Description
When RD-Agent generates factors that contain machine learning models (such as LSTM, RandomForest, or XGBoost), execution hangs or takes an extremely long time on datasets with 5M+ rows.
Example Scenario
- Data: 5,160,401 rows (A-share stock data, 5194 instruments × 1046 trading days)
- Factor: LSTM-based 5D return predictor
- Expected behavior: Complete within minutes
- Actual behavior: Process runs for hours at 100% single-core CPU, never completes
Root Cause Analysis
The LSTM implementation generated by the LLM uses a per-stock per-day training pattern:
```python
# Problematic pattern generated by the LLM (schematic, not runnable as-is):
for instrument in instruments:
    for day in trading_days:
        # Train a NEW LSTM model for each instrument on each day
        model = train_new_lstm(data[:day])
        prediction = model.predict(next_day)
```
This requires O(instruments × days) model fits, each trained from scratch:
- 5,194 instruments × 970 prediction days ≈ 5.04 million model fits; at 30 epochs each, that is ≈ 151 million training epochs
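As a quick check of the scale, the product of the figures quoted above can be computed directly. The variable names are illustrative only and do not come from RD-Agent:

```python
# Back-of-the-envelope cost of the per-stock, per-day retraining loop,
# using the figures quoted in this issue.
instruments = 5_194
prediction_days = 970
epochs_per_fit = 30

model_fits = instruments * prediction_days      # one fresh model per (instrument, day)
total_epochs = model_fits * epochs_per_fit      # every fit runs its own epochs

print(f"{model_fits:,} model fits")
print(f"{total_epochs:,} training epochs")
```

Even at a fraction of a second per fit, several million sequential fits on a single core would take many hours, which matches the observed behavior.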
Similar Issues
- RandomForest: Same pattern, per-stock per-day retraining
- XGBoost: Same anti-pattern
- Critic feedback: Consistently rejects these as "not aligned with factor calculation paradigms" but the LLM keeps regenerating similar code
Environment
- RD-Agent with CoSTEER framework
- Data: MultiIndex DataFrame with datetime × instrument
- Python: 3.10, conda env: rdagent4qlib
Logs
Key log entries showing the pattern:
```
critic 1: The factor requires training a new Random Forest model for each instrument and each day, which is computationally prohibitive...
critic 2: The implementation is a heuristic approximation, not the actual ML factor...
final_decision: false
```
Workaround Found
We manually rewrote the LSTM factor to train a single model up front and predict in vectorized batches (substituting Ridge regression for the LSTM):
- Train Ridge regression once on all data
- Batch predict for all stocks/days
- Reduced time from hours to 28 seconds
But this requires manual intervention each time ML-based factors are generated.
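The workaround above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the synthetic panel, the lagged-return features, the `train_end` cutoff, and all function names are assumptions made for the example, and the Ridge fit is written in closed form with NumPy to keep the sketch dependency-light (a library implementation such as scikit-learn's `Ridge` could be substituted).

```python
import numpy as np
import pandas as pd

def make_panel(n_days=60, n_instruments=8, seed=0):
    """Synthetic stand-in for the (datetime, instrument) MultiIndex DataFrame."""
    rng = np.random.default_rng(seed)
    index = pd.MultiIndex.from_product(
        [pd.bdate_range("2020-01-01", periods=n_days),
         [f"S{i:03d}" for i in range(n_instruments)]],
        names=["datetime", "instrument"],
    )
    log_ret = pd.Series(rng.normal(0, 0.02, len(index)), index=index)
    close = 100 * np.exp(log_ret.groupby(level="instrument").cumsum())
    return pd.DataFrame({"close": close})

def build_features(df, lags=(1, 5, 10), horizon=5):
    """Lagged returns as features and the forward `horizon`-day return as label."""
    close = df["close"]
    by_inst = close.groupby(level="instrument")
    feats = pd.DataFrame({f"ret_{k}d": close / by_inst.shift(k) - 1 for k in lags})
    target = by_inst.shift(-horizon) / close - 1
    return feats, target

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def ridge_factor(df, train_end, alpha=1.0):
    """Fit ONE model on rows up to `train_end`, then predict every row in one batch."""
    feats, target = build_features(df)
    dates = feats.index.get_level_values("datetime")
    complete = feats.notna().all(axis=1).to_numpy()
    labeled = target.notna().to_numpy()
    train = np.asarray(dates <= train_end) & complete & labeled

    X_all = feats.to_numpy()
    w = fit_ridge(X_all[train], target.to_numpy()[train], alpha)

    # Single vectorized prediction over all instruments and days.
    pred = w[0] + X_all[complete] @ w[1:]
    out = pd.Series(pred, index=feats.index[complete], name="ridge_5d_pred")
    return out.reindex(df.index)

panel = make_panel()
factor = ridge_factor(panel, train_end=pd.Timestamp("2020-02-28"))
print(factor.dropna().head())
```

The key difference from the generated code is that the model is fitted once and the prediction is a single matrix product, so cost grows with the number of rows rather than with the number of (instrument, day) model fits. The `train_end` cutoff is our addition here to keep the fit from seeing the later prediction period; the original workaround trained on all data.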
Questions
- Is there a way to configure the Critic or the prompts to accept a "pre-trained model + batch prediction" approach?
- Should the factor selection phase filter out ML-based factors by default?
- Any suggestions for handling ML-based factors in the RAG knowledge base?
Thank you for this excellent framework!