An AI-driven research lab
for /ES intraday strategies.
This notebook documents systematic strategy discovery on the E-mini S&P 500 futures contract. Hypotheses are codified as Python signal generators, evaluated under a fixed walk-forward + Monte Carlo protocol, and graded against an in-sample/out-of-sample split with a hard 10pt mean-edge floor. Survivors are tracked against live tape; failures are logged and never re-tried under the same parameters. Everything below is auto-generated from the underlying results table — no manual curation.
// Abstract
The hypothesis-testing pipeline is built around a single thesis: most apparent intraday edges in /ES are payoff-structure artefacts — outcomes of asymmetric stop/target geometry rather than information edges in the order flow. To control for this, every strategy is benchmarked against a randomized-entry control that holds the same trading days, the same risk geometry, and the same time-window constraints. A strategy that does not statistically dominate the control is not a strategy — it is a stop-target ratio.
The grading protocol applies a fixed minimum-edge floor of µ ≥ 10 ES points per trade across both in-sample and out-of-sample windows. This is calibrated to one round-trip cost on a one-lot at typical /ES bid-ask plus realistic slippage; anything below is dominated by execution noise on real fills. As of 2026-04-30, the active table contains 59 hypotheses surviving this filter, of which 0 hold a Grade A or B classification. The lab is currently in a cold-restart phase following a full purge of pre-floor results.
[Figures: Grade Distribution; Profit Factor × Sample Size; Mean Edge per Trade (µ, pts), Distribution Across Survivors]
// Methodology
- Universe: /ESM26 front-month, 1-minute OHLCV, RTH 09:30–16:00 ET
- Sample: Continuous ~12 months, ~370,000 bars after roll-merge cleansing
- Split: Walk-forward, train ≈67% / test ≈33% strict OOS
- Entry: Next-bar open, 0.25pt slippage applied to entry & exit
- Stops: Intrabar trigger on bar H/L (conservative fill assumption)
- Cost: $4.50 round-trip commission ≈ 0.09pts on /ES
- Sample bar: n ≥ 30 in-sample, n ≥ 20 OOS for any non-F grade
- Random control: 200 iterations, identical risk geometry, fixed seed for determinism
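The entry, stop, and cost conventions above can be sketched as a single long-trade simulator. This is a minimal illustration, not the lab's actual engine: `Bar`, `simulate_long`, and `commission_points` are hypothetical names, the $50-per-point multiplier is the standard /ES contract value, and "conservative fill" is assumed to mean that the stop wins whenever a bar touches both stop and target. Gap-through fills are not modelled.

```python
from dataclasses import dataclass

ES_POINT_VALUE = 50.0   # USD per index point on /ES
COMMISSION_USD = 4.50   # round-trip commission from the methodology list
SLIPPAGE_PTS = 0.25     # charged once on entry and once on exit


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float


def commission_points(usd=COMMISSION_USD, point_value=ES_POINT_VALUE):
    """Convert the round-trip commission into index points (4.50 / 50 = 0.09)."""
    return usd / point_value


def simulate_long(bars, signal_idx, stop_pts, target_pts):
    """Net P&L in points for a long entered at the open of the bar after the signal.

    Returns None when there is no next bar to enter on.
    """
    entry_idx = signal_idx + 1
    if entry_idx >= len(bars):
        return None
    raw_entry = bars[entry_idx].open
    stop_level = raw_entry - stop_pts
    target_level = raw_entry + target_pts
    exit_level = bars[-1].close  # fallback: exit at the last available close
    for bar in bars[entry_idx:]:
        # Assumption: check the stop first, so a bar that touches both
        # levels is booked as a loss (the conservative reading).
        if bar.low <= stop_level:
            exit_level = stop_level
            break
        if bar.high >= target_level:
            exit_level = target_level
            break
    entry_fill = raw_entry + SLIPPAGE_PTS   # pay up on entry
    exit_fill = exit_level - SLIPPAGE_PTS   # give up on exit
    return exit_fill - entry_fill - commission_points()
```

Under these assumptions a full round trip carries 0.59pts of friction (2 × 0.25 slippage plus 0.09 commission), so a 10pt target nets 9.41pts when hit.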
// Grading Function
- Edge floor. Reject if µ < 10pts on FULL, or on OOS when n_oos ≥ 5. Mean-edge below the round-trip cost band is statistically indistinguishable from noise on a one-lot. Pass condition: µ_full ≥ 10 ∧ (n_oos < 5 ∨ µ_oos ≥ 10)
- Profit factor. A: PF_full > 1.5, PF_oos > 1.3. B: PF_full > 1.3, PF_oos > 1.0. C: PF_full > 1.0 with marginal OOS. Definition: PF = Σ wins / |Σ losses|
- Out-of-sample win rate. A: WR_oos > 50% under Wilson 95% CI. B: WR_oos > 45%.
- Edge-vs-random ratio. A: ρ ≥ 2.0×. B: ρ ≥ 1.0×. Below 1× ⇒ no information edge, payoff geometry only. Definition: ρ = µ_signal / µ_random_control
- Sample size. A: n_full ≥ 30 ∧ n_oos ≥ 20. B: n_full ≥ 25 ∧ n_oos ≥ 15. Otherwise capped at C/F regardless of stats.
- Failure modes (auto-F). µ_full ≤ 0, PF_full ≤ 1.0, µ_oos < 0 with n_oos ≥ 5, or ρ < 1.0.
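The rules above can be collected into one function. The sketch below is an illustration under stated assumptions rather than the lab's grader: `Stats`, `profit_factor`, `wilson_lower`, and `grade` are hypothetical names; "WR_oos > 50% under Wilson 95% CI" is read as "the lower bound of the Wilson 95% interval exceeds 50%"; a floor failure is reported with a separate `"REJECT"` label; and anything that clears the floor and the auto-F checks without qualifying for A or B is treated as C.

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    mu: float   # mean edge per trade, in points
    pf: float   # profit factor
    wins: int   # number of winning trades
    n: int      # number of trades


def profit_factor(pnls):
    """PF = sum of wins / |sum of losses|; infinite when there are no losses."""
    wins = sum(p for p in pnls if p > 0)
    losses = abs(sum(p for p in pnls if p < 0))
    return math.inf if losses == 0 else wins / losses


def wilson_lower(wins, n, z=1.96):
    """Lower bound of the Wilson score interval for a win rate."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def grade(full, oos, rho):
    """Return 'A', 'B', 'C', 'F', or 'REJECT' (floor failure; label is an assumption)."""
    # Auto-F failure modes.
    if full.mu <= 0 or full.pf <= 1.0 or rho < 1.0:
        return "F"
    if oos.n >= 5 and oos.mu < 0:
        return "F"
    # Edge floor: mu_full >= 10 and (n_oos < 5 or mu_oos >= 10).
    if full.mu < 10 or (oos.n >= 5 and oos.mu < 10):
        return "REJECT"
    wr_oos = oos.wins / oos.n if oos.n else 0.0
    if (full.pf > 1.5 and oos.pf > 1.3
            and wilson_lower(oos.wins, oos.n) > 0.5   # assumed reading of the Wilson rule
            and rho >= 2.0 and full.n >= 30 and oos.n >= 20):
        return "A"
    if (full.pf > 1.3 and oos.pf > 1.0 and wr_oos > 0.45
            and rho >= 1.0 and full.n >= 25 and oos.n >= 15):
        return "B"
    return "C"  # assumption: everything else that survives is capped at C
```

With n_oos = 20, the assumed Wilson reading needs at least 15 OOS wins for Grade A: 15/20 gives a lower bound of about 0.53, while 14/20 gives about 0.48.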
// Active Cohort
| Grade | n | Mean µ (pts) | σ(µ) | Best PF | Best PF_oos | µ_max |
|---|---|---|---|---|---|---|
// Open Research Questions
Q1 / Mean-reversion family rehab. The volume-cluster compression cohort dominated pre-purge keepers (PF 1.1–14.6) but every member failed the µ ≥ 10pt floor. The structural pattern does reliably mark short-window snapbacks of 1–3pts. Open question: does widening targets to 12–15pts and accepting longer time-in-trade rehabilitate the family without inducing new selection bias? Pending walk-forward sweep on the original detection logic with a target-distance grid {12, 15, 18, 22pts}.
Q2 / Lookahead-bias static audit. Several killed candidates referenced group.iloc[i+1:i+3] inside detect_signals(), a forward-window read used to confirm the reversal. Because those bars have not yet printed at decision time, the read inflates WR. A static AST audit of every detect_signals implementation is in progress to flag any forward-bar references; flagged strategies will be re-graded under a strict next-bar-only constraint.
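One way such an audit could look, using only the standard-library `ast` module (Python 3.9+ for `ast.unparse`). This is a heuristic sketch, not the lab's auditor: `find_forward_reads` is a hypothetical name, and it flags only subscripts whose index contains a "name plus positive integer literal" expression such as `i+1`. Forward reads expressed differently, for example `shift(-1)` or an offset held in a variable, would need additional rules.

```python
import ast


def _is_forward_offset(node):
    """True for expressions like `i + 1`: a name plus a positive int literal."""
    if not (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)):
        return False
    for a, b in ((node.left, node.right), (node.right, node.left)):
        if (isinstance(a, ast.Name)
                and isinstance(b, ast.Constant)
                and isinstance(b.value, int)
                and not isinstance(b.value, bool)
                and b.value > 0):
            return True
    return False


def find_forward_reads(source, func_name="detect_signals"):
    """Return (line number, source text) for each suspicious subscript
    found inside functions named `func_name`."""
    tree = ast.parse(source)
    hits = []
    for fn in ast.walk(tree):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)) and fn.name == func_name:
            for node in ast.walk(fn):
                if isinstance(node, ast.Subscript):
                    # Walk the index expression, which covers plain indices and slices.
                    if any(_is_forward_offset(n) for n in ast.walk(node.slice)):
                        hits.append((node.lineno, ast.unparse(node)))
    return hits
```

Backward references such as `group.iloc[i-1]` are subtraction nodes and are deliberately left unflagged.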
Q3 / Random-control variance. The current _random_control seeds the RNG with 42 and runs 200 iterations on the union of signal days. Output is deterministic, but the variance estimator may understate the true distribution width because the trading-day set is held fixed. Open question: should the trading-day set be bootstrapped in addition to entry timing, at the cost of cross-strategy comparability under a shared control draw?
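The two variants under discussion can be contrasted in a few lines. The sketch below is not the lab's `_random_control`: `random_control`, the `days` mapping, the `simulate` callback, and the `resample_days` flag are all hypothetical, introduced only to show the difference between randomizing entry timing alone and additionally bootstrapping the day set.

```python
import random
from statistics import mean, stdev


def random_control(days, simulate, n_iter=200, seed=42, resample_days=False):
    """Mean and spread of randomized-entry P&L across iterations.

    days:      mapping of trading day -> list of eligible entry bar indices
    simulate:  callable (day, bar_index) -> P&L in points, applying the same
               stop/target geometry as the strategy under test
    resample_days: when True, also bootstrap the set of days with replacement
    """
    rng = random.Random(seed)          # fixed seed keeps the output deterministic
    day_list = list(days)
    iter_means = []
    for _ in range(n_iter):
        if resample_days:
            sampled = rng.choices(day_list, k=len(day_list))
        else:
            sampled = day_list         # current behaviour: fixed union of signal days
        pnls = [simulate(day, rng.choice(days[day])) for day in sampled]
        iter_means.append(mean(pnls))
    return mean(iter_means), stdev(iter_means)
```

Comparing the second return value with and without `resample_days=True` on the same strategy would indicate how much distribution width the fixed-day control leaves out.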
Q4 / Regime conditioning at promotion. Current pipeline reports a VIX low/normal/high decomposition but does not gate live deployment on regime. A strategy graded B in normal-vol may be Grade F in high-vol; promoting unconditionally risks regime-induced kills. Pending design: minimum n per regime bucket before any production gating decision.
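A minimum-n gate per regime bucket could be as small as the sketch below. Everything here is illustrative: the function names are hypothetical, and the VIX cut-offs (15 and 25) and the default `min_n` of 20 are placeholder assumptions, not values taken from the pipeline.

```python
from collections import Counter


def regime_of(vix, low=15.0, high=25.0):
    """Bucket a VIX reading; the cut-offs are placeholder assumptions."""
    if vix < low:
        return "low"
    if vix >= high:
        return "high"
    return "normal"


def regime_ready(trade_vix, min_n=20):
    """For each regime, report whether enough trades exist to judge it.

    trade_vix: iterable of the VIX level observed at each trade's entry
    """
    counts = Counter(regime_of(v) for v in trade_vix)
    return {r: counts.get(r, 0) >= min_n for r in ("low", "normal", "high")}
```

A promotion rule could then require `True` for every regime in which the strategy would be allowed to trade live.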
// Recent Kills
lab_results.csv. Every dropped row had µ < 10pts; these rows were graded under a permissive earlier rule and would not have qualified under the current floor. Backed up to lab_results_pre_10pt_purge_*.csv for the audit trail.
regrader_history.csv; verdict DEGRADED for all 36.
// Notes on Reproducibility
data/quant/lab_results.csv with deduplication by (name, date). Strategy modules live in quant/strategies/ and are immutable after publication — tweaks generate a new _v2 module with its own row. The kill log is append-only. Methodology revisions are versioned (protocol_v3 as of the 10pt-floor enforcement); historical rows retain their original protocol version.