ML-based factors (LSTM/RandomForest) hang on large datasets (5M+ rows) #1407

Description

@sawchen

Problem Description

When RD-Agent generates factors containing machine learning models (such as LSTM, RandomForest, XGBoost), execution hangs or takes an extremely long time on datasets with 5M+ rows.

Example Scenario

  • Data: 5,160,401 rows (A-share stock data, 5194 instruments × 1046 trading days)
  • Factor: LSTM-based 5D return predictor
  • Expected behavior: Complete within minutes
  • Actual behavior: Process runs for hours at 100% single-core CPU, never completes

Root Cause Analysis

The LSTM implementation generated by the LLM uses a per-stock per-day training pattern:

```python
# Problematic pattern generated by LLM (pseudocode):
for instrument in instruments:
    for day in trading_days:
        # Train a NEW LSTM model for each instrument on each day
        model = train_new_lstm(data[:day])
        prediction = model.predict(next_day)
```

This creates O(instruments × days) model fits:

  • 5,194 instruments × 970 prediction days ≈ 5.04 million model fits; at 30 epochs each, that is ≈ 151 million training epochs
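The scale of the nested loop can be checked with quick arithmetic. This snippet only multiplies the figures quoted in the example scenario; the variable names are illustrative and not taken from RD-Agent. The product comes out slightly above 150 million epochs.

```python
# Figures quoted in the example scenario of this issue.
instruments = 5_194
prediction_days = 970
epochs_per_fit = 30

# One model is trained per (instrument, day) pair.
model_fits = instruments * prediction_days
# Each fit runs the full number of epochs.
total_epochs = model_fits * epochs_per_fit

print(f"{model_fits:,} model fits")
print(f"{total_epochs:,} training epochs")
```

Even at a fraction of a second per fit, several million sequential fits on a single core is consistent with the "runs for hours" behavior described above.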

Similar Issues

  • RandomForest: Same pattern, per-stock per-day retraining
  • XGBoost: Same anti-pattern
  • Critic feedback: Consistently rejects these as "not aligned with factor calculation paradigms" but the LLM keeps regenerating similar code

Environment

  • RD-Agent with CoSTEER framework
  • Data: MultiIndex DataFrame with datetime × instrument
  • Python: 3.10, conda env: rdagent4qlib

Logs

Key log entries showing the pattern:
```
critic 1: The factor requires training a new Random Forest model for each instrument and each day, which is computationally prohibitive...
critic 2: The implementation is a heuristic approximation, not the actual ML factor...
final_decision: false
```

Workaround Found

We manually rewrote the LSTM factor to train a single model up front and predict in batch:

  • Train a Ridge regression once on all data
  • Batch predict for all stocks and days
  • Runtime dropped from hours to 28 seconds

But this requires manual intervention each time ML-based factors are generated.
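For reference, the "train once, predict in batch" workaround can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the synthetic panel, the lagged-return features, the column name `close`, and every function name here are assumptions made for the example. Ridge is solved in closed form with NumPy to keep the sketch dependency-light; a library estimator such as scikit-learn's `Ridge` could be swapped in.

```python
import numpy as np
import pandas as pd


def make_panel(n_days=60, n_instruments=20, seed=0):
    """Synthetic close prices in a (datetime, instrument) MultiIndex frame."""
    rng = np.random.default_rng(seed)
    dates = pd.bdate_range("2024-01-01", periods=n_days)
    names = [f"S{i:03d}" for i in range(n_instruments)]
    index = pd.MultiIndex.from_product(
        [dates, names], names=["datetime", "instrument"]
    )
    log_ret = rng.normal(0.0, 0.02, size=(n_days, n_instruments))
    close = 100.0 * np.exp(np.cumsum(log_ret, axis=0))
    return pd.DataFrame({"close": close.ravel()}, index=index)


def build_features(df, lags=(1, 5, 10), horizon=5):
    """Lagged returns as features, forward `horizon`-day return as target."""
    close = df["close"]
    by_inst = close.groupby(level="instrument")
    feats = pd.DataFrame(index=df.index)
    for lag in lags:
        feats[f"ret_{lag}"] = close / by_inst.shift(lag) - 1.0
    target = by_inst.shift(-horizon) / close - 1.0
    return feats, target


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef


def ridge_factor(df, horizon=5, alpha=1.0):
    """Fit ONE model on all labelled rows, then predict every row at once."""
    feats, target = build_features(df, horizon=horizon)
    has_x = feats.notna().all(axis=1)
    train = has_x & target.notna()
    coef, intercept = fit_ridge(
        feats[train].to_numpy(), target[train].to_numpy(), alpha
    )
    factor = pd.Series(np.nan, index=df.index, name="ridge_5d_pred")
    # Single vectorized prediction: no per-instrument or per-day loop.
    factor.loc[has_x] = feats[has_x].to_numpy() @ coef + intercept
    return factor


panel = make_panel()
factor = ridge_factor(panel)
print(factor.dropna().head())
```

The key design point is that the cost is one model fit plus one matrix product, independent of the number of instruments and days, instead of one fit per (instrument, day) pair. The result keeps the same datetime × instrument index as the input.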

Questions

  1. Is there a way to configure the Critic or the prompts to accept a "pre-trained model + batch prediction" approach?
  2. Should the factor selection phase filter out ML-based factors by default?
  3. Any suggestions for handling ML-based factors in the RAG knowledge base?

Thank you for this excellent framework!
