ML-based factors (LSTM/RandomForest) hang on large datasets (5M+ rows)

## Problem Description

When RD-Agent generates factors that contain machine learning models (such as LSTM, RandomForest, or XGBoost), execution hangs or takes an extremely long time on datasets with 5M+ rows.

### Example Scenario
- Data: 5,160,401 rows (A-share stock data, 5194 instruments × 1046 trading days)
- Factor: LSTM-based 5D return predictor
- Expected behavior: Complete within minutes
- Actual behavior: Process runs for hours at 100% single-core CPU, never completes

### Root Cause Analysis
The LSTM implementation generated by the LLM uses a **per-stock per-day training** pattern:

```python
# Problematic pattern generated by the LLM (pseudocode; helper names are placeholders):
for instrument in instruments:
    for day in trading_days:
        # A brand-new LSTM is trained for every (instrument, day) pair
        model = train_new_lstm(data[:day])
        prediction = model.predict(next_day)
```

This requires O(instruments × days) model fits:
- 5,194 instruments × 970 prediction days ≈ 5.04 million model fits; at 30 epochs each, that is ≈ **151 million training epochs**
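The count can be reproduced with a few lines of arithmetic. The figures are the ones quoted in the example scenario; the variable names are only illustrative.

```python
# Figures taken from the example scenario in this issue.
instruments = 5_194
prediction_days = 970
epochs_per_fit = 30

# One model is trained from scratch for every (instrument, day) pair.
model_fits = instruments * prediction_days

# Each of those fits runs the full number of epochs.
epoch_passes = model_fits * epochs_per_fit

print(f"model fits:   {model_fits:,}")
print(f"epoch passes: {epoch_passes:,}")
```

Even at an optimistic one millisecond per epoch, that many passes on a single core would take well over a day.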

### Similar Issues
- **RandomForest**: Same pattern, per-stock per-day retraining
- **XGBoost**: Same anti-pattern
- **Critic feedback**: Consistently rejects these as "not aligned with factor calculation paradigms" but the LLM keeps regenerating similar code

## Environment
- RD-Agent with CoSTEER framework
- Data: MultiIndex DataFrame with datetime × instrument
- Python: 3.10, conda env: rdagent4qlib

## Logs
Key log entries showing the pattern:
```
critic 1: The factor requires training a new Random Forest model for each instrument and each day, which is computationally prohibitive...
critic 2: The implementation is a heuristic approximation, not the actual ML factor...
final_decision: false
```

## Workaround Found
We manually rewrote the LSTM factor code to train once and predict in batch:
- Train a Ridge regression once on all data
- Batch-predict for all stocks and days
- Runtime dropped from hours to 28 seconds

But this requires manual intervention each time ML-based factors are generated.
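For reference, here is a minimal sketch of the "train once, predict in batch" shape. It is not our exact workaround code. The `close` column, the lagged-return features, the date-based train split, and the helper names `make_panel` and `ridge_factor` are all assumptions made for illustration, and the closed-form Ridge fit in NumPy stands in for whatever library is actually used.

```python
import numpy as np
import pandas as pd


def make_panel(n_days=60, n_instruments=5, seed=0):
    """Synthetic (datetime, instrument) panel with a 'close' column (illustrative only)."""
    rng = np.random.default_rng(seed)
    dates = pd.bdate_range("2020-01-01", periods=n_days)
    names = [f"S{i:03d}" for i in range(n_instruments)]
    index = pd.MultiIndex.from_product([dates, names], names=["datetime", "instrument"])
    steps = rng.normal(0.0, 0.01, size=(n_days, n_instruments))
    close = 100.0 * np.exp(np.cumsum(steps, axis=0))
    return pd.DataFrame({"close": close.ravel()}, index=index)


def ridge_factor(df, n_lags=5, horizon=5, alpha=1.0, train_frac=0.7):
    """Fit ONE ridge model on pooled data, then score every row in a single pass."""
    df = df.sort_index()
    close = df["close"]
    by_inst = close.groupby(level="instrument")

    # Daily return per instrument, then lagged copies of it as features.
    ret = close / by_inst.shift(1) - 1.0
    ret_by_inst = ret.groupby(level="instrument")
    X = pd.DataFrame({f"ret_lag{k}": ret_by_inst.shift(k) for k in range(1, n_lags + 1)})

    # Target: forward `horizon`-day return (NaN for the last `horizon` days).
    y = by_inst.shift(-horizon) / close - 1.0

    # Train only on early dates whose labels are fully known before the cut-off.
    dates = df.index.get_level_values("datetime").unique()
    cut = int(len(dates) * train_frac)
    train_end = dates[cut - horizon]
    dt = X.index.get_level_values("datetime")
    train = X.notna().all(axis=1) & y.notna() & (dt <= train_end)

    # Closed-form ridge regression on centered data: a single fit for all instruments.
    Xt, yt = X[train].to_numpy(), y[train].to_numpy()
    x_mean, y_mean = Xt.mean(axis=0), yt.mean()
    Xc = Xt - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ (yt - y_mean))

    # Batch prediction for every (datetime, instrument) row; NaN features yield NaN.
    scores = (X.to_numpy() - x_mean) @ w + y_mean
    return pd.Series(scores, index=df.index, name=f"ridge_{horizon}d_pred")


panel = make_panel()
factor = ridge_factor(panel)
print(factor.dropna().head())
```

The point is that the model is fit exactly once and the factor column comes out of a single matrix product, so the cost grows with the number of rows rather than with instruments × days × epochs. Rows up to the training cut-off are in-sample in this sketch and would need masking in a real backtest.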

## Questions
1. Is there a way to configure the Critic or the prompts to accept a "pre-trained model + batch prediction" approach?
2. Should the factor selection phase filter out ML-based factors by default?
3. Any suggestions for handling ML-based factors in the RAG knowledge base?
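On question 2, one possible direction is a cheap static check on the generated code before it is executed, rather than filtering out ML-based factors entirely. The sketch below is a heuristic built on Python's standard `ast` module and is not part of RD-Agent. The function name `fits_inside_nested_loops` and the set of method names treated as "training" are assumptions. It only catches method calls such as `model.fit(...)`, so a plain helper call like `train_new_lstm(...)` would slip through.

```python
import ast

# Method names treated as "training" calls (an assumption; extend as needed).
FIT_METHODS = {"fit", "train"}


def fits_inside_nested_loops(source: str, min_depth: int = 2) -> list[int]:
    """Return line numbers of .fit()/.train() calls nested inside >= min_depth loops."""
    hits = []

    def visit(node, depth):
        # Entering a for/while loop increases the nesting depth for its children.
        if isinstance(node, (ast.For, ast.While)):
            depth += 1
        is_fit_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in FIT_METHODS
        )
        if is_fit_call and depth >= min_depth:
            hits.append(node.lineno)
        for child in ast.iter_child_nodes(node):
            visit(child, depth)

    visit(ast.parse(source), 0)
    return sorted(hits)


slow = '''
for instrument in instruments:
    for day in trading_days:
        model = make_model()
        model.fit(X[:day], y[:day])
'''

fast = '''
model = make_model()
model.fit(X_train, y_train)
for day in trading_days:
    preds = model.predict(X[day])
'''

print(fits_inside_nested_loops(slow))
print(fits_inside_nested_loops(fast))
```

A non-empty result could be fed back to the generator as a concrete rejection reason ("model is retrained inside nested loops at line N"), which may steer the next attempt better than a generic rejection.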

---
Thank you for this excellent framework!


ML-based factors (LSTM/RandomForest) hang on large datasets (5M+ rows) #1407
