RESEARCH NOTEBOOK / PUBLIC LOG

An AI-driven research lab for /ES intraday strategies.

This notebook documents systematic strategy discovery on the E-mini S&P 500 futures contract. Hypotheses are codified as Python signal generators, evaluated under a fixed walk-forward + Monte Carlo protocol, and graded against an in-sample/out-of-sample split with a hard 10pt mean-edge floor. Survivors are tracked against live tape; failures are logged and never re-tried under the same parameters. Everything below is auto-generated from the underlying results table — no manual curation.

RUNTIME · -- | SAMPLE · 370k+ bars | HYPOTHESES · -- | SURVIVAL · --

Tested (n): -- (cumulative hypotheses)
Killed: -- (rejected at grading)
Grade A/B: -- (production promoted)
Best PF (full): -- (profit factor)
Best µ (pts): -- (mean P&L / trade)
Kill rate: -- (false-positive control)

// Abstract

research_log_v3

The hypothesis-testing pipeline is built around a single thesis: most apparent intraday edges in /ES are payoff-structure artefacts — outcomes of asymmetric stop/target geometry rather than information edges in the order flow. To control for this, every strategy is benchmarked against a randomized-entry control that holds the same trading days, the same risk geometry, and the same time-window constraints. A strategy that does not statistically dominate the control is not a strategy — it is a stop-target ratio.
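A minimal sketch of the randomized-entry control described above. It assumes long-only trades, bars stored as (open, high, low, close) tuples, entries given as bar indices, and a stop-first tie-break when one bar touches both levels. The names `resolve_long` and `random_control_mean` are illustrative and not taken from the lab's code; the defaults of 200 iterations and seed 42 mirror the values quoted elsewhere in this log.

```python
import random
from statistics import mean


def resolve_long(bars, entry_idx, stop_pts, target_pts):
    """Return the P&L in points of a long trade entered at the open of bars[entry_idx].

    bars is one session as a list of (open, high, low, close) tuples.
    If a single bar touches both stop and target, the stop is assumed to
    fill first (a conservative assumption for this sketch).
    """
    entry = bars[entry_idx][0]
    for _open, high, low, _close in bars[entry_idx:]:
        if low <= entry - stop_pts:
            return -stop_pts
        if high >= entry + target_pts:
            return target_pts
    # Neither level was touched: exit at the session close.
    return bars[-1][3] - entry


def random_control_mean(days, window, stop_pts, target_pts, iterations=200, seed=42):
    """Mean P&L of random entries on the same days, time window, and risk geometry.

    days is a list of sessions; window is an inclusive (start, end) range
    of bar indices in which entries are allowed.
    """
    rng = random.Random(seed)  # fixed seed keeps the control deterministic
    start, end = window
    iteration_means = []
    for _ in range(iterations):
        pnls = []
        for bars in days:
            entry_idx = rng.randint(start, min(end, len(bars) - 1))
            pnls.append(resolve_long(bars, entry_idx, stop_pts, target_pts))
        iteration_means.append(mean(pnls))
    return mean(iteration_means)
```

Because only the entry time is randomized, any difference between a signal's mean P&L and this control can be attributed to entry selection rather than to the stop/target ratio.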

The grading protocol applies a fixed minimum-edge floor of µ ≥ 10 ES points per trade across both in-sample and out-of-sample windows. This is calibrated to one round-trip cost on a one-lot at typical /ES bid-ask plus realistic slippage; anything below is dominated by execution noise on real fills. As of 2026-04-30, the active table contains 59 hypotheses surviving this filter, of which 0 hold a Grade A or B classification. The lab is currently in a cold-restart phase following a full purge of pre-floor results.

Grade Distribution

Profit Factor × Sample Size

Mean Edge per Trade (µ, pts) — Distribution Across Survivors

// Methodology

protocol_v3 · locked
Universe: /ESM26 front-month, 1-minute OHLCV, RTH 09:30–16:00 ET
Sample: Continuous ~12 months, ~370,000 bars after roll-merge cleansing
Split: Walk-forward, train ≈67% / test ≈33% strict OOS
Entry: Next-bar open, 0.25pt slippage applied to entry & exit
Stops: Intrabar trigger on bar H/L (conservative fill assumption)
Cost: $4.50 round-trip commission ≈ 0.09pts on /ES
Sample bar: n ≥ 30 in-sample, n ≥ 20 OOS for any non-F grade
Random control: 200 iterations, identical risk geometry, fixed seed for determinism
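The cost and slippage entries above can be checked with a few lines of arithmetic. This sketch assumes the standard /ES multiplier of $50 per index point and reads "0.25pt slippage applied to entry & exit" as 0.25 points on each side; the function names are illustrative.

```python
ES_POINT_VALUE_USD = 50.0  # E-mini S&P 500 multiplier: $50 per index point
COMMISSION_RT_USD = 4.50   # round-trip commission from the protocol table
SLIPPAGE_PTS = 0.25        # assumed to apply once at entry and once at exit


def round_trip_cost_pts():
    """Total friction per trade in points: commission plus two-sided slippage."""
    commission_pts = COMMISSION_RT_USD / ES_POINT_VALUE_USD  # 4.50 / 50 = 0.09
    return commission_pts + 2 * SLIPPAGE_PTS


def net_pnl_pts(entry_open, exit_price, side):
    """Net P&L in points for one trade; side is +1 for long, -1 for short."""
    gross = side * (exit_price - entry_open)
    return gross - round_trip_cost_pts()


print(round(round_trip_cost_pts(), 2))            # 0.59
print(round(net_pnl_pts(5000.0, 5012.0, +1), 2))  # 11.41
```

Under these assumptions a 12-point gross move nets about 11.41 points after commission and slippage.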

// Grading Function

G(s) → {A, B, C, F}
  1. Edge floor. Reject if µ < 10 pts on FULL, or on OOS when n_oos ≥ 5. Mean edge below the round-trip cost band is statistically indistinguishable from noise on a one-lot.
     Pass condition: µ_full ≥ 10 ∧ (n_oos < 5 ∨ µ_oos ≥ 10)
  2. Profit factor. A: PF_full > 1.5, PF_oos > 1.3. B: PF_full > 1.3, PF_oos > 1.0. C: PF_full > 1.0 with marginal OOS.
     PF = Σ wins / |Σ losses|
  3. Out-of-sample win rate. A: WR_oos > 50% under Wilson 95% CI. B: WR_oos > 45%.
  4. Edge-vs-random ratio. A: ρ ≥ 2.0×. B: ρ ≥ 1.0×. Below 1× ⇒ no information edge, payoff geometry only.
     ρ = µ_signal / µ_random_control
  5. Sample size. A: n_full ≥ 30 ∧ n_oos ≥ 20. B: n_full ≥ 25 ∧ n_oos ≥ 15. Otherwise capped at C/F regardless of stats.
  6. Failure modes (auto-F). µ_full ≤ 0, PF_full ≤ 1.0, µ_oos < 0 with n_oos ≥ 5, or ρ < 1.0.
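The six rules above can be sketched as a single function. This is an illustration under stated assumptions, not the lab's grading code: it reads "WR_oos > 50% under Wilson 95% CI" as the Wilson lower bound exceeding 0.5, treats an edge-floor rejection as grade F, and assigns C to anything that clears the floor and the auto-F checks without meeting the A or B criteria. The `Stats` container and all function names are hypothetical.

```python
import math
from dataclasses import dataclass


def profit_factor(pnls):
    """PF = sum of winning P&L / |sum of losing P&L|."""
    wins = sum(p for p in pnls if p > 0)
    losses = abs(sum(p for p in pnls if p < 0))
    return math.inf if losses == 0 else wins / losses


def wilson_lower(wins, n, z=1.96):
    """Lower bound of the Wilson score interval for a win rate (z=1.96 for 95%)."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


@dataclass
class Stats:
    mu_full: float
    mu_oos: float
    pf_full: float
    pf_oos: float
    wins_oos: int
    n_full: int
    n_oos: int
    rho: float  # mu_signal / mu_random_control


def grade(s):
    # Rule 6: automatic failures.
    if s.mu_full <= 0 or s.pf_full <= 1.0 or s.rho < 1.0:
        return "F"
    if s.n_oos >= 5 and s.mu_oos < 0:
        return "F"
    # Rule 1: edge floor (OOS check waived when n_oos < 5).
    if s.mu_full < 10 or (s.n_oos >= 5 and s.mu_oos < 10):
        return "F"
    wr_oos = s.wins_oos / s.n_oos if s.n_oos else 0.0
    # Rules 2-5 for grade A (Wilson lower bound is an assumed reading).
    if (s.pf_full > 1.5 and s.pf_oos > 1.3
            and wilson_lower(s.wins_oos, s.n_oos) > 0.5
            and s.rho >= 2.0 and s.n_full >= 30 and s.n_oos >= 20):
        return "A"
    # Rules 2-5 for grade B.
    if (s.pf_full > 1.3 and s.pf_oos > 1.0 and wr_oos > 0.45
            and s.rho >= 1.0 and s.n_full >= 25 and s.n_oos >= 15):
        return "B"
    return "C"
```

Note how strict the Wilson reading is at small n: with 20 OOS trades, 12 wins (60%) gives a lower bound of roughly 0.39 and fails the A test, while 15 wins (75%) gives roughly 0.53 and passes.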

// Active Cohort

aggregated by grade
Grade | n | µ̄ (pts) | σ(µ) | Best PF | Best PF_oos | µ_max
loading manifest...


// Open Research Questions

tracked

Q1 / Mean-reversion family rehab. The volume-cluster compression cohort dominated pre-purge keepers (PF 1.1–14.6) but every member failed the µ ≥ 10pt floor. The structural pattern does reliably mark short-window snapbacks of 1–3pts. Open question: does widening targets to 12–15pts and accepting longer time-in-trade rehabilitate the family without inducing new selection bias? Pending walk-forward sweep on the original detection logic with a target-distance grid {12, 15, 18, 22pts}.

Q2 / Lookahead-bias static audit. Several killed candidates referenced group.iloc[i+1:i+3] inside detect_signals() — a forward-window read used to confirm the reversal. This guarantees inflated WR. A static AST audit of every detect_signals implementation is in progress to flag any forward-bar references; flagged strategies will be re-graded under a strict next-bar-only constraint.

Q3 / Random-control variance. The current _random_control seeds the RNG with 42 and runs 200 iterations on the union of signal days. Output is deterministic, but because the day set is held fixed the variance estimator may understate the true distribution width. Open question: should the control bootstrap the trading-day set in addition to entry timing, at the cost of cross-strategy comparability on a shared control draw?
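A small sketch of the two variants being compared, using only the standard library. `simulate_day(day, rng)` is a hypothetical callable that returns the P&L of one random entry on a given day; `control_distribution` and its `resample_days` flag are likewise illustrative names, not part of the existing _random_control.

```python
import random
from statistics import mean, stdev


def control_distribution(days, simulate_day, iterations=200, seed=42, resample_days=False):
    """Return (mean, standard deviation) of the per-iteration control means.

    With resample_days=False only entry timing varies between iterations.
    With resample_days=True each iteration also draws the day set with
    replacement (a bootstrap over trading days).
    """
    rng = random.Random(seed)
    iteration_means = []
    for _ in range(iterations):
        sample = rng.choices(days, k=len(days)) if resample_days else days
        iteration_means.append(mean([simulate_day(d, rng) for d in sample]))
    return mean(iteration_means), stdev(iteration_means)


# Toy illustration: each day has a fixed outcome, so entry timing adds no noise.
days = list(range(1, 11))
fixed_pnl = lambda day, rng: float(day)

_, sd_fixed = control_distribution(days, fixed_pnl)
_, sd_boot = control_distribution(days, fixed_pnl, resample_days=True)
print(sd_fixed)      # 0.0: day-to-day dispersion is invisible to the fixed-day control
print(sd_boot > 0)   # True: the bootstrap exposes it
```

The toy case is deliberately extreme, but it shows the mechanism: any dispersion that comes from which days were traded never enters a fixed-day estimator.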

Q4 / Regime conditioning at promotion. Current pipeline reports a VIX low/normal/high decomposition but does not gate live deployment on regime. A strategy graded B in normal-vol may be Grade F in high-vol; promoting unconditionally risks regime-induced kills. Pending design: minimum n per regime bucket before any production gating decision.
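The pending design could start from a per-regime report like the sketch below. The threshold `min_n=20` is a placeholder chosen for illustration, since the log does not yet specify one, and `regime_report` is a hypothetical name. Regime labels are taken as given; how VIX levels map to low/normal/high is outside this sketch.

```python
from statistics import mean


def regime_report(trades, min_n=20, buckets=("low", "normal", "high")):
    """Summarise trades per volatility regime.

    trades is a list of (regime_label, pnl_pts) pairs. A bucket is marked
    gate_ready only when it holds at least min_n trades (placeholder value).
    """
    report = {}
    for bucket in buckets:
        pnls = [pnl for regime, pnl in trades if regime == bucket]
        report[bucket] = {
            "n": len(pnls),
            "mu": mean(pnls) if pnls else None,
            "gate_ready": len(pnls) >= min_n,
        }
    return report


trades = [("normal", 12.0)] * 25 + [("high", -4.0)] * 3
for bucket, row in regime_report(trades).items():
    print(bucket, row)
```

In this toy input the normal-vol bucket has enough trades to support a gating decision, while the high-vol bucket shows a negative mean on only three trades and should not drive one either way.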

// Recent Kills

last 7d
2026-04-29 · purge
1,945 historical rows removed from lab_results.csv. Every dropped row had µ < 10pts — graded under a permissive earlier rule and would not have qualified under the current floor. Backed up to lab_results_pre_10pt_purge_*.csv for the audit trail.
2026-04-28 · re-validation cohort
36 prior Grade C survivors re-tested against the latest 12-month window. None survived. All fell below the µ floor on fresh out-of-sample. Logged to regrader_history.csv; verdict DEGRADED for all 36.
2026-04-30 · data source migration
Pipeline migrated off Schwab API to Databento exclusively. Schwab refresh-token expiries had silently broken the EOD signal resolver, producing a stale +15 scorecard on 2026-04-29 that should have read +12 (one open signal stayed PENDING because the resolver could not authenticate). Schwab dependency now removed from runtime hot paths.

// Notes on Reproducibility

audit trail
All grading is computed from a single fixed function with no per-strategy tuning. The same scoring pipeline runs on every backtest; results are appended to data/quant/lab_results.csv with deduplication by (name, date). Strategy modules live in quant/strategies/ and are immutable after publication — tweaks generate a new _v2 module with its own row. The kill log is append-only. Methodology revisions are versioned (protocol_v3 as of the 10pt-floor enforcement); historical rows retain their original protocol version.
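Deduplicated appends keyed on (name, date) can be sketched with the standard `csv` module. The function name `append_result` and any columns other than `name` and `date` are assumptions for illustration; the actual layout of lab_results.csv is not shown in this log.

```python
import csv
import os


def append_result(path, row, key_fields=("name", "date")):
    """Append row to the CSV at path unless a row with the same key exists.

    Returns True if the row was written, False if it was a duplicate.
    """
    key = tuple(str(row[f]) for f in key_fields)
    has_data = os.path.exists(path) and os.path.getsize(path) > 0
    if has_data:
        with open(path, newline="") as fh:
            reader = csv.DictReader(fh)
            fieldnames = reader.fieldnames
            for existing in reader:
                if tuple(existing[f] for f in key_fields) == key:
                    return False  # same (name, date) already logged
    else:
        fieldnames = list(row)
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if not has_data:
            writer.writeheader()
        writer.writerow(row)
    return True
```

Opening the file in append mode keeps existing rows untouched, and a tweaked strategy published under a new module name gets its own key and therefore its own row.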