I collect every price movement across 7,500+ Polymarket prediction markets. Every 4 minutes. That's 6.3 million price points and counting.
Here's the exact architecture, the mistakes I made, and the code patterns that actually work at this scale.
Why Prediction Market Data Is Hard
Polymarket runs on the Polygon blockchain with an off-chain CLOB (Central Limit Order Book). There's no single endpoint that gives you "all prices." You need to:
- Hit the Gamma API to discover markets and conditions
- Poll the CLOB API for real-time prices per token
- Store everything with timestamps for historical analysis
- Handle markets that resolve, expire, or get delisted
Most people scrape once and call it a day. I needed continuous collection because prediction market alpha lives in price velocity — how fast prices move after news breaks.
The Schema That Survived 6.3M Rows
I started with a normalized PostgreSQL schema. Markets table, conditions table, tokens table, prices table with foreign keys everywhere. It was elegant. It was also painfully slow for the queries I actually needed.
Here's what I landed on — a denormalized SQLite schema optimized for time-series reads:
CREATE TABLE markets (
    condition_id TEXT PRIMARY KEY,
    question TEXT,
    slug TEXT,
    category TEXT,
    end_date TEXT,
    active INTEGER DEFAULT 1,
    volume REAL DEFAULT 0,
    liquidity REAL DEFAULT 0,
    updated_at TEXT
);

CREATE TABLE prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    condition_id TEXT,
    token_id TEXT,
    outcome TEXT,
    price REAL,
    timestamp TEXT DEFAULT (datetime('now')),
    FOREIGN KEY (condition_id) REFERENCES markets(condition_id)
);

CREATE INDEX idx_prices_condition_time
    ON prices(condition_id, timestamp);
CREATE INDEX idx_prices_timestamp
    ON prices(timestamp);
Why SQLite over Postgres? Three reasons:
- Single-file deployment. The entire database is one file I can copy, backup, and ship.
- Read performance. SQLite handles 6M+ rows with proper indexes without breaking a sweat. My dashboard queries return in <50ms.
- Zero ops. No connection pooling, no pg_hba.conf, no pg_dump cron jobs. cp market_universe.db backup/ is my backup strategy.
The tradeoff: no concurrent writes. But I only have one writer (the collector), so it doesn't matter.
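That single-writer setup works because of WAL mode (listed in the stack at the end), which lets the dashboard read while the collector writes. A minimal sketch of how the collector might open the database — open_collector_db is my naming, not from the original code:

```python
import sqlite3

def open_collector_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # WAL: readers never block the (single) writer, and vice versa
    conn.execute("PRAGMA journal_mode=WAL")
    # If a reader briefly holds the lock, wait up to 5s instead of erroring
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```

Both pragmas are stock SQLite; WAL persists in the database file once set, so this is effectively a one-time switch.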
The Collection Loop
The core collector runs as a simple Python loop. No Celery, no message queues, no Kubernetes. Just a process that wakes up every 4 minutes:
import time
import sqlite3
import requests
from datetime import datetime

GAMMA_API = "https://gamma-api.polymarket.com"
CLOB_API = "https://clob.polymarket.com"

def collect_cycle(db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    try:
        # 1. Discover active markets
        markets = fetch_all_markets()
        upsert_markets(conn, markets)
        # 2. Collect prices for each token
        tokens = get_active_tokens(conn)
        prices = fetch_prices_batch(tokens)
        insert_prices(conn, prices)
        conn.commit()
    finally:
        conn.close()
    return len(prices)

def fetch_all_markets():
    """Paginate through the Gamma API to get all markets."""
    markets = []
    offset = 0
    while True:
        resp = requests.get(
            f"{GAMMA_API}/markets",
            # "true" as a string: requests would serialize Python's True
            # as "True", which the API doesn't recognize
            params={"limit": 100, "offset": offset, "active": "true"},
            timeout=(5, 15),
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        markets.extend(batch)
        offset += 100
    return markets

# Run forever
while True:
    try:
        count = collect_cycle("market_universe.db")
        print(f"[{datetime.now()}] Collected {count} prices")
    except Exception as e:
        print(f"[{datetime.now()}] Error: {e}")
    time.sleep(240)  # 4 minutes
Mistakes I Made (So You Don't)
Mistake 1: Fetching prices one-by-one. My first version made one API call per token. With 7,500 markets and 2 tokens each, that's 15,000 requests per cycle. I got rate-limited within minutes.
Fix: Batch price fetches. The CLOB API accepts arrays of token IDs. I batch 50 tokens per request, bringing 15,000 calls down to 300.
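A sketch of that batching, matching the fetch_prices_batch helper named in the loop above. The exact request/response shape of the CLOB batch endpoint is an assumption here — verify it against the current API docs before relying on it:

```python
import requests

CLOB_API = "https://clob.polymarket.com"

def chunked(items, size=50):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_prices_batch(token_ids, session=None):
    """Fetch prices ~50 tokens per request instead of one request per token."""
    session = session or requests.Session()
    results = []
    for batch in chunked(token_ids, 50):
        resp = session.post(
            f"{CLOB_API}/prices",
            # Body shape is illustrative -- check the CLOB docs
            json=[{"token_id": t, "side": "BUY"} for t in batch],
            timeout=(5, 15),
        )
        resp.raise_for_status()
        results.append(resp.json())
    return results
```

With 15,000 tokens, chunked yields 300 batches — the 50x reduction described above falls straight out of the chunk size.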
Mistake 2: Storing every price even when it hasn't changed. If a market is at 0.65 and stays at 0.65 for 6 hours, I was storing 90 identical rows.
Fix: Only insert when price changes from the last recorded value:
def should_store(conn, condition_id, token_id, new_price):
    cursor = conn.execute("""
        SELECT price FROM prices
        WHERE condition_id = ? AND token_id = ?
        ORDER BY timestamp DESC LIMIT 1
    """, (condition_id, token_id))
    row = cursor.fetchone()
    if row is None:
        return True
    # ignore moves under 0.001 (0.1 percentage points on a 0-1 price)
    return abs(row[0] - new_price) > 0.001
This cut storage by ~60% without losing any meaningful signal.
Mistake 3: No connection timeout. The Gamma API occasionally hangs for 30+ seconds. My collector would stall, miss cycles, and create gaps in the data.
Fix: Aggressive timeouts on every request:
resp = requests.get(url, params=params, timeout=(5, 15))
# 5s connect timeout, 15s read timeout
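Timeouts stop a hung request; for transient 429s and 5xxs I'd pair them with urllib3's Retry mounted on a shared requests Session. This is a suggested hardening rather than something from the original collector (make_session is my naming):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    """Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=3,
        backoff_factor=1,  # roughly 1s, 2s, 4s between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET", "POST"),
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```

Every requests.get through this session still takes the timeout=(5, 15) argument; retries and timeouts solve different failure modes and you want both.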
Querying 6.3M Rows Fast
The dashboard needs to answer questions like:
- "What moved the most in the last hour?"
- "Show me the price history for this market"
- "Which markets have the highest volume?"
Here's the query that powers the "biggest movers" widget:
WITH recent AS (
    SELECT condition_id, token_id, outcome, price, timestamp,
           ROW_NUMBER() OVER (
               PARTITION BY condition_id, token_id
               ORDER BY timestamp DESC
           ) AS rn
    FROM prices
    WHERE timestamp > datetime('now', '-1 hour')
),
current_price AS (
    SELECT * FROM recent WHERE rn = 1
),
hour_ago AS (
    SELECT condition_id, token_id, price
    FROM recent
    WHERE rn = (SELECT MAX(rn) FROM recent r2
                WHERE r2.condition_id = recent.condition_id
                  AND r2.token_id = recent.token_id)
)
SELECT m.question, cp.outcome, cp.price AS current,
       ha.price AS previous,
       (cp.price - ha.price) AS delta
FROM current_price cp
JOIN hour_ago ha USING (condition_id, token_id)
JOIN markets m USING (condition_id)
WHERE ABS(cp.price - ha.price) > 0.02
ORDER BY ABS(cp.price - ha.price) DESC
LIMIT 20;
This runs in ~80ms on 6.3M rows. The idx_prices_timestamp index does the heavy lifting by pruning to just the last hour before the window function kicks in.
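You can confirm the index is doing that work with SQLite's EXPLAIN QUERY PLAN. A small helper for eyeballing plans (index_plan is my naming):

```python
import sqlite3

def index_plan(conn: sqlite3.Connection, sql: str) -> str:
    """Return SQLite's query plan as one string for quick index checks."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is the human-readable detail
    return "\n".join(r[-1] for r in rows)
```

For the movers query, the plan should show something like "SEARCH prices USING INDEX idx_prices_timestamp" rather than "SCAN prices" — if you ever see a full scan after a schema change, the index is no longer being used.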
What the Data Actually Reveals
After collecting for 35 days straight, some patterns:
Price gaps at market open. Many markets have a "morning gap" where price jumps 3-5% in the first 30 minutes of US trading hours. This is arbitrageable.
Volume precedes price. In 78% of major moves (>10% swing), volume spikes 15-30 minutes before the price move. The liquidity providers know something.
Resolution convergence. Markets approaching expiry don't converge smoothly to 0 or 1. They stair-step, with 80% of the convergence happening in the final 12 hours.
Category clusters. Political markets move together. When one election market swings, correlated markets follow within minutes. Crypto markets are more independent.
These patterns are why I built PolyScope — a free dashboard that visualizes all of this data in real-time. You can explore the movers, track specific markets, and see the volume patterns yourself.
The Stack
- Collection: Python + requests + SQLite (runs on a $5/mo VPS)
- Database: SQLite with WAL mode enabled for concurrent reads during writes
- Dashboard: React + Vite, deployed on Vercel
- API: The dashboard reads from a replicated copy of the DB
Total infrastructure cost: $5/month for the VPS. Everything else is free tier.
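One upgrade worth knowing over a raw cp: Python's sqlite3 module exposes SQLite's online backup API, which snapshots a live WAL database consistently even mid-write. A sketch (snapshot_db is my naming):

```python
import sqlite3

def snapshot_db(src_path: str, dest_path: str) -> None:
    """Consistent copy of a live SQLite database via the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Safe even while the collector is writing; cp of a WAL db can
        # miss changes still sitting in the -wal file
        src.backup(dest)
    finally:
        src.close()
        dest.close()
```

This is what I'd use to produce the replicated copy the dashboard API reads from.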
What I'd Do Differently
If I were starting over:
- Use DuckDB instead of SQLite for the analytical queries. DuckDB's columnar storage is purpose-built for the aggregations I run most.
- Add WebSocket collection alongside polling. Polymarket's CLOB has a WebSocket feed that would give sub-second price updates.
- Store order book snapshots, not just mid-prices. The bid-ask spread tells you more about market confidence than the price alone.
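On the DuckDB point: nothing needs migrating up front, because DuckDB can attach an existing SQLite file through its sqlite extension and run columnar aggregations directly against it. The sketch below follows DuckDB's documented ATTACH syntax, but verify it against the current docs for your DuckDB version:

```sql
-- In a DuckDB session
INSTALL sqlite;
LOAD sqlite;
ATTACH 'market_universe.db' AS mkt (TYPE sqlite);

-- Analytical queries now run against the same data
SELECT condition_id, avg(price) AS mean_price
FROM mkt.prices
GROUP BY condition_id;
```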
Try It Yourself
The full dashboard is live at PolyScope — free, no signup. It refreshes every 4 minutes with the latest data from all 7,500+ markets.
If you want to build your own prediction market tools, the API Connector skill for Claude Code handles the Polymarket API integration patterns I described above — pagination, batching, error handling, and rate limit management. It saves hours of boilerplate when connecting to any REST API.
For building monitoring dashboards on top of your collected data, the Dashboard Builder skill generates full dashboard layouts from a metrics spec — I used it to prototype PolyScope's panel layout before writing custom components.
I publish data engineering and trading content weekly. Follow for more prediction market analysis.