Research-grade data collection and dataset-building pipeline for Polymarket BTC 5-minute up/down markets.
This repository is a data pipeline, not an automated trading bot. Its job is to collect market data, measure data quality, build research datasets, and support backtesting.
- collects live BTC 5-minute Polymarket market snapshots
- stores top-of-book and depth summaries for YES/NO order books
- records BTC spot reference ticks
- tracks official market resolutions
- runs quality audits, health checks, and backups
- builds leak-safe features, labels, and decision datasets
- runs execution-aware backtests on the resulting dataset
Supported now:
- BTC 5-minute Polymarket up/down markets
- BTC spot reference feed
- official resolution collection
- quality audits
- feature and label ETL
- execution-aware backtesting
Not the focus of this repo:
- generic all-market support
- cloud deployment automation
- cross-platform packaging
- direct order execution or live trading automation
Prediction market research gets noisy fast if the collection layer is weak. This repo is built to solve the data problem first:
- reproducible ETL outputs
- no future leakage in feature generation
- explicit slot-level quality gating
- operational monitoring for unattended collection
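The no-future-leakage rule amounts to a strict timestamp cutoff when building features. A minimal illustrative sketch (the function name `leak_safe_window` and the `(timestamp, payload)` tuple layout are assumptions, not the repo's actual feature code):

```python
from datetime import datetime, timedelta

def leak_safe_window(snapshots, decision_ts, lookback=timedelta(minutes=5)):
    """Keep only snapshots observable strictly before decision_ts.

    snapshots: iterable of (timestamp, payload) pairs.
    Rows at or after the decision timestamp are excluded, so a feature
    built for that decision can never see the future; the bounded
    lookback keeps the window reproducible across rebuilds.
    """
    start = decision_ts - lookback
    return [(ts, p) for ts, p in snapshots if start <= ts < decision_ts]
```

Anything that arrives at or after the decision timestamp is dropped, even if it is only milliseconds late.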
```mermaid
flowchart LR
    A["Polymarket CLOB / Gamma"] --> B["btc5m-scanner.exe"]
    C["BTC spot reference"] --> D["btc5m-reference.exe"]
    E["Official resolution"] --> F["btc5m-resolution.exe"]
    B --> G["btc5m_dataset.db"]
    D --> G
    F --> G
    G --> H["btc5m_audit_dataset.py"]
    G --> I["btc5m_build_features.py"]
    G --> J["btc5m_build_labels.py"]
    I --> K["btc5m_decision_dataset"]
    J --> K
    K --> L["btc5m_run_backtest.py"]
```
- `common`: Shared database helpers, lock handling, operational status, feeds, and the backtest engine.
- `polymarket_scanner`: Live BTC5M market scanner and snapshot publisher.
- `scripts`: Audit, backup, setup verification, feature build, label build, decision dataset build, summaries, and the backtest runner.
- `control`: Collector control scripts, monitor console, and scheduler registration.
- `docs`: Specs, architecture notes, runbooks, and planning documents.
The live SQLite dataset is stored at `runtime/data/btc5m_dataset.db`.
Core tables:

- `btc5m_markets`
- `btc5m_snapshots`
- `btc5m_orderbook_depth`
- `btc5m_reference_ticks`
- `btc5m_reference_1m_ohlcv`
- `btc5m_lifecycle_events`
- `collector_runs`
- `quality_audits`
- `btc5m_features`
- `btc5m_labels`
- `btc5m_decision_dataset`
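For a quick look at how much each table holds, a small `sqlite3` helper can count rows per table. This is an illustrative sketch, not part of the repo's tooling; `table_row_counts` is a name invented here:

```python
import sqlite3

def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    with sqlite3.connect(db_path) as conn:
        # sqlite_master lists all schema objects; skip SQLite's internal tables.
        tables = [name for (name,) in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
        return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}

# e.g. table_row_counts("runtime/data/btc5m_dataset.db")
```

Pointing it at the live `runtime/data/btc5m_dataset.db` while collectors run is a cheap way to confirm the dataset is growing.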
This setup path is written for a fresh clone on a Windows machine.
Required:
- Windows 10 or 11
- Python 3.11+
- Git
- PowerShell
Optional but useful:
- DB Browser for SQLite
- NordVPN or another VPN if you need region-specific routing for Polymarket access
```
git clone git@github.com:Chelebii/prediction-market-data-pipeline.git
cd prediction-market-data-pipeline
```

or

```
git clone https://github.com/Chelebii/prediction-market-data-pipeline.git
cd prediction-market-data-pipeline
```

Create and activate a virtual environment, then install dependencies:

```
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Create your local env file from the example:

```
Copy-Item polymarket_scanner\.env.example polymarket_scanner\.env
```

Then edit `polymarket_scanner/.env` as needed.
Notes:
- the example file is safe to keep as-is for basic local setup
- Telegram credentials are optional and only needed for alerts
- runtime paths in the example file are repo-relative by default
- real `.env` files are ignored by Git
This check does not start collectors or write dataset rows. It verifies that the clone has the expected files, Python version, dependencies, and local env shape.
```
python scripts\btc5m_verify_setup.py
```

If you prefer structured output:

```
python scripts\btc5m_verify_setup.py --json
```

Recommended on Windows, and effectively required if you want VPN split tunneling by process name.
```
powershell -ExecutionPolicy Bypass -File control\scripts\ensure_btc5m_process_exes.ps1
```

Expected long-running collector process names:

- `btc5m-scanner.exe`
- `btc5m-reference.exe`
- `btc5m-resolution.exe`

Expected periodic maintenance process names:

- `btc5m-healthcheck.exe`
- `btc5m-dataset-audit.exe`
- `btc5m-backup-dataset.exe`

The script copies the resolved real CPython executable into BTC5M-specific image names next to that Python installation. If needed, override paths with:

- `BTC5M_SCANNER_EXE_PATH`
- `BTC5M_REFERENCE_EXE_PATH`
- `BTC5M_RESOLUTION_EXE_PATH`
- `BTC5M_HEALTHCHECK_EXE_PATH`
- `BTC5M_AUDIT_EXE_PATH`
- `BTC5M_BACKUP_EXE_PATH`
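The override mechanism is plain environment-variable lookup with a default fallback. A hedged Python sketch of that resolution order (the real logic lives in the PowerShell script and may differ; the default names below are assumptions):

```python
import os

# Assumed defaults; the real script derives full paths from the
# resolved CPython installation, not bare file names.
DEFAULT_EXE_NAMES = {
    "BTC5M_SCANNER_EXE_PATH": "btc5m-scanner.exe",
    "BTC5M_REFERENCE_EXE_PATH": "btc5m-reference.exe",
    "BTC5M_RESOLUTION_EXE_PATH": "btc5m-resolution.exe",
}

def resolve_exe_paths(env=os.environ):
    """Prefer an explicit env override; otherwise fall back to the default."""
    return {var: env.get(var, default)
            for var, default in DEFAULT_EXE_NAMES.items()}
```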
```
powershell -ExecutionPolicy Bypass -File control\scripts\btc5m_collection_control.ps1 -Action start
```

Or start the full monitor flow:

```
control\scripts\start_btc5m_collectors.cmd
```

The control script is intended to be safe to re-run and should avoid duplicate long-running collectors.
```
powershell -ExecutionPolicy Bypass -File control\scripts\register_btc5m_collection_tasks.ps1 -Action register
```

This registers:
- a Startup folder entry that starts the collector monitor on user logon
- health check every 5 minutes
- dataset audit every 15 minutes
- derived ETL every 15 minutes
- backup every 6 hours
```
python scripts\btc5m_collection_summary.py
```

What you want to see:

- collectors are `RUNNING`
- snapshot freshness is low
- reference freshness is low
- no urgent warnings
Open `runtime/data/btc5m_dataset.db` in DB Browser for SQLite and look at:

- `btc5m_snapshots`
- `btc5m_reference_ticks`
- `btc5m_markets`
- `quality_audits`
A healthy first run usually looks like this:

- the three collector processes appear as `btc5m-scanner.exe`, `btc5m-reference.exe`, and `btc5m-resolution.exe`
- `python scripts\btc5m_collection_summary.py` reports no urgent warnings
- `runtime/data/btc5m_dataset.db` starts growing
- `runtime/logs/` contains fresh collector log files
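"Freshness is low" in the summary simply means the newest collected row is recent. A sketch of such a check, assuming a threshold and function name that are not the summary script's actual implementation:

```python
import time

SNAPSHOT_STALE_AFTER_S = 60  # assumed threshold; the real script may differ

def snapshot_freshness(last_row_ts, now=None, stale_after=SNAPSHOT_STALE_AFTER_S):
    """Return (age_seconds, is_fresh) for the newest collected row.

    last_row_ts and now are Unix timestamps; a small age means the
    collector is keeping up, a large age means it has stalled.
    """
    now = time.time() if now is None else now
    age = max(0.0, now - last_row_ts)
    return age, age <= stale_after
```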
Start collectors:
```
powershell -ExecutionPolicy Bypass -File control\scripts\btc5m_collection_control.ps1 -Action start
```

Stop collectors:

```
powershell -ExecutionPolicy Bypass -File control\scripts\btc5m_collection_control.ps1 -Action stop
```

Restart collectors:

```
powershell -ExecutionPolicy Bypass -File control\scripts\btc5m_collection_control.ps1 -Action restart
```

Status:

```
powershell -ExecutionPolicy Bypass -File control\scripts\btc5m_collection_control.ps1 -Action status
```

Task Scheduler status:

```
powershell -ExecutionPolicy Bypass -File control\scripts\register_btc5m_collection_tasks.ps1 -Action status
```

Unregister Startup and scheduled tasks:

```
powershell -ExecutionPolicy Bypass -File control\scripts\register_btc5m_collection_tasks.ps1 -Action unregister
```

Operational summary:

```
python scripts\btc5m_collection_summary.py
```

JSON summary:

```
python scripts\btc5m_collection_summary.py --json
```

Manual audit:

```
python scripts\btc5m_audit_dataset.py --lookback-hours 48 --max-markets 250 --include-active
```

Manual backup:

```
python scripts\btc5m_backup_dataset.py
```

Once enough live data has been collected, run the ETL stages.
Build features:
```
python scripts\btc5m_build_features.py --feature-version v1
```

Build labels:

```
python scripts\btc5m_build_labels.py --label-version v1
```

Build the final decision dataset:

```
python scripts\btc5m_build_decision_dataset.py --dataset-version v1 --feature-version v1 --label-version v1
```

Run the same derived ETL flow used by Task Scheduler:

```
control\scripts\run_btc5m_derived_etl.cmd
```

Run a baseline backtest:

```
python scripts\btc5m_run_backtest.py --dataset-version v1 --feature-version v1 --split-bucket train --strategy momentum
```

Known limitations:

- Windows-first operational tooling; Linux/macOS are not the primary target yet
- setup is optimized for one always-on collection machine
- VPN routing requirements depend on your jurisdiction and network setup
- no packaged installer or container workflow is provided yet
- no automated test suite is shipped yet; use `python scripts\btc5m_verify_setup.py` and the live operational summary for verification
Do not commit:
- real `.env` files
- `runtime/`
- live `.db` files
- private keys or API credentials
Ignored by default:
- `.env`
- `.env.*`
- `runtime/state/`
- `*.db`
- `*.db-shm`
- `*.db-wal`
- `*.log`
- `*.log.*`
- `*.lock`
- `*.pem`
- `*.key`
- `*.p12`
- `*.pfx`