A follow-on brief because someone cared about wishlist item 5.3. The honest data is smaller than the parent doc claimed, the underlying problem is more interesting because of it, and there are five real research questions hiding inside what looked like one engineering task. Below: the real numbers, the schema, the proposed pipeline, and what a researcher could actually pick up.
The parent brief said "~2,500 trade-decision rows." That was generous. The real number is exactly 211 rich-feature rows across six bots, plus ~1,800 thin rows from a market-maker bot that logs only {timestamp, result, merge_profit}. Most of the thin set is unusable for a real outcome model. I'm flagging this here so you can calibrate expectations before deciding whether it's still interesting.
Update 2026-05-03 ~23:30 UTC: what's actually shipped. After this conversation, three implementation passes and two audit passes ran. 28 ML modules are live across the brain (13), the office (10), and a cross-cutting layer (5); ~73 deferred items are in ~/vault/research/ML_BACKLOG.md with effort estimates, not pretending to be done. The audit found 2 BLOCKERS and 4 HIGH issues, all fixed before this update, including a localStorage-persistence bug on 4 office features and a too-lax drift detector that was failing to fire on a real 31-pp WR shift. Brochure numbers below are now from the audited pipeline, not implementer self-reports.
Real findings already produced by the audited modules:
- directional_v1: 14/90 wins, P(WR<0.5)=100%, total PnL −$1,538.43. Decision-curve confirms no profitable threshold (best PnL = −$277, CI [−$298, −$252]).
- delta_sniper_v2: 74.4% WR but PnL −$7.46 — high WR with bigger losers than winners. The only bot whose Wilson CI sits entirely above 50%, yet still loses money.
- Meta-LR finding: directional_v1 is using sub-optimal η=0.10. True optimum is η=0.50 → +30.24% Brier improvement. lil_tail_v2 has +74.67% headroom.
- Thompson kill rule over 8 candidate policies confirms only delta_sniper_v2 is above break-even; cum-loss spread $3,068 between best and worst arms.
- Drift detector (post-fix) fires combined alert when any of {PSI>0.25, |ΔWR|>0.15, sKL>1.0} — currently flagging directional_v1 (psi_textbook + wr_shift) and delta_sniper_v2 (wr_shift).
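The combined drift rule in the last bullet can be sketched in a few lines. This is a minimal illustration, not the shipped detector: the thresholds are the ones quoted above, but the function names, the flag names, and the choice to pass sKL in as a precomputed number are my assumptions, since the brief does not say which distributions each statistic is computed over.

```python
import math

# Thresholds quoted in the drift-detector bullet above.
PSI_MAX = 0.25
WR_SHIFT_MAX = 0.15
SKL_MAX = 1.0

def psi(reference, current, eps=1e-6):
    """Population Stability Index over matching bins of proportions."""
    if len(reference) != len(current):
        raise ValueError("reference and current must have the same bins")
    total = 0.0
    for r, c in zip(reference, current):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) on empty bins
        total += (c - r) * math.log(c / r)
    return total

def drift_flags(psi_value, wr_reference, wr_current, skl_value):
    """Return the rules that fired; any non-empty list is a combined alert."""
    flags = []
    if psi_value > PSI_MAX:
        flags.append("psi")
    if abs(wr_current - wr_reference) > WR_SHIFT_MAX:
        flags.append("wr_shift")
    if skl_value > SKL_MAX:
        flags.append("skl")
    return flags
```

With this shape, a 31-pp win-rate shift fires `wr_shift` on its own even when PSI and sKL stay quiet, which is the behaviour the pre-fix detector was missing.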
The smaller number sharpens the problem. At N=2,500 you'd reach for an XGBoost and call it done. At N=200 with multiple sources of feature heterogeneity, the model choice is downstream of an actual research question about how to combine evidence across non-identically-distributed strategies. That's the more useful framing anyway.
| Bot | Rows | Feature richness | Outcome label |
|---|---|---|---|
| directional_v1 (Polymarket BTC/ETH/SOL 5-min Up/Down) | 90 | rich: entry price, shares, signal mode (vol-flow / OFI / Hawkes), volume ratio, condition_id, window_end | payout, trade_pnl, result ∈ {WIN, EXPIRED, REDEEMED} |
| delta-sniper-v2 (Polymarket late-window delta) | 39 | medium: entry side, delta band, time-in-window | pnl, settle status |
| lil-tail v1 (BTC 5m tail-buy at $0.01–$0.02) | 4 | rich: entry price, side, vol state, magnitude gate, window_start | result (4 L so far) |
| lil-tail v2 (mint-and-sell SPLIT-arb) | 50 | rich: mint cost, up/down filled, redemption value, sells collected | cycle_pnl, result ∈ {P, L, F} |
| meme-bot (Solana copy-trade, paper) | 12 | medium: source wallet id, entry price, hold duration | pnl |
| sports-mirror-bot | 16 | medium: match, side, line | binary outcome |
| SMM merge logs (polymarket-bot/ml_data/trades.jsonl) | 1,852 | thin: just timestamp, result, merge_profit | three-class result |
| resolved_sniper (brain-system/state, older copy-trade) | 90 | likely overlaps with directional_v1 | shared |
Five reasons we think this isn't a "wait until you have more data" problem.
Here's what one row of directional_v1/history.jsonl actually looks like — verified, not paraphrased:
    // /opt/polymarket-bot/history.jsonl · sample row
    {
      "ts": "1776404522",
      "condition_id": "0x77e37e4aa27726ec…",
      "outcome": "DOWN",
      "token_id": "65047327802547722647…",
      "our_price": 0.31,
      "shares": 14.0,
      "cost": 4.34,
      "order_id": "0xa4c0e35fed3e7578…",
      "success": true,
      "result": "EXPIRED",
      "signal_mode": "volume_flow",
      "volume_ratio": 0.6718,
      "title": "Bitcoin Up or Down",
      "window_end": 1776404700,
      "payout": 0.0,
      "trade_pnl": -4.34,
      "_logged_at": 1776487920
    }
The rest of the bots have the same flavour but different keys. Step one of any real pipeline is a unified-feature contract that maps each bot's columns to a shared schema, with a bot_id column for partial pooling.
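To make the unified-feature contract concrete, here is a rough sketch of one adapter per bot feeding a shared column list. The column names in `UNIFIED_COLUMNS`, the `ADAPTERS` registry, and the `unify` helper are all hypothetical; the only keys taken from the source are the ones in the directional_v1 sample row. I have not seen what decision_log_pipeline.py actually emits, so treat this as a shape, not the shipped schema.

```python
# Hypothetical shared contract: illustrative column names, not the
# columns the shipped pipeline writes.
UNIFIED_COLUMNS = (
    "bot_id", "ts", "side", "entry_price", "size", "cost", "pnl", "result",
)

def from_directional_v1(row):
    """Map one directional_v1 history row onto the shared columns."""
    return {
        "bot_id": "directional_v1",
        "ts": int(row["ts"]),          # ts is logged as a string
        "side": row["outcome"],
        "entry_price": row["our_price"],
        "size": row["shares"],
        "cost": row["cost"],
        "pnl": row["trade_pnl"],
        "result": row["result"],
    }

# One adapter per bot; other bots would register their own mapping here.
ADAPTERS = {"directional_v1": from_directional_v1}

def unify(bot_id, row):
    """Apply the bot's adapter; columns it cannot fill become explicit None."""
    mapped = ADAPTERS[bot_id](row)
    return {col: mapped.get(col) for col in UNIFIED_COLUMNS}

# Non-truncated fields from the sample row above.
sample = {
    "ts": "1776404522", "outcome": "DOWN", "our_price": 0.31,
    "shares": 14.0, "cost": 4.34, "result": "EXPIRED", "trade_pnl": -4.34,
}
unified = unify("directional_v1", sample)
```

Keeping missing columns as explicit `None` rather than dropping them makes the null-handling visible per bot, which matters once thinner bots are pooled with richer ones.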
Concrete enough that a researcher can decide whether they want it.
| Week | Deliverable | Definition of done | Status |
|---|---|---|---|
| 1 | Unified schema across the bots; data-quality audit; baseline summary stats per bot | One file with 211 rows and a clean column contract; documented null-handling; per-bot mean / Wilson CI on outcome | Shipped 2026-05-03. decision_log_pipeline.py writes ~/vault/research/decision-log/unified.jsonl + summary + calibration. Wilson CIs for all 6 bots are in summary.json. |
| 1.5 | Bayesian Beta-posterior kill rule replacing fixed thresholds | Reusable module that takes (wins, losses) and emits kill-or-keep with min-N gate. Backfit against existing bots. | Shipped 2026-05-03 as bayesian_kill.py. Smoke-tested against all 6 bots' current state. Would correctly kill directional_v1 and lil_tail_v2; would correctly keep delta_sniper_v2 and the small-N cases. |
| 2 | First model + walk-forward CV; calibration plots; decision-curve analysis | GBT or hierarchical GLMM; 5-fold time-series CV with embargo; reliability diagram per bot; PnL-vs-threshold curve with optimal threshold marked | Open. Needs the data scientist. |
| 3 | Handoff: live-mode integration, write-up, and future-bot logging contract | Threshold wired into one bot's kill rule; one-pager summarising what works / what doesn't; FUTURE_BOTS.md doc spec'ing the columns a new bot must log to slot in | Open, partial. The Bayesian kill rule could be wired into v1/v2 today, but is waiting on Week 2 to define the right confidence threshold from real data. |
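For reference, the Wilson interval named in the Week 1 row is the standard score interval for a binomial proportion. The sketch below is a plain textbook version, not the code in decision_log_pipeline.py; the function name and the 95% default (z = 1.96) are my choices.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)

# directional_v1: 14 wins in 90 rows (from the findings above).
dir_lo, dir_hi = wilson_interval(14, 90)

# delta_sniper_v2: 29/39 is inferred from the quoted 74.4% WR over 39 rows.
snp_lo, snp_hi = wilson_interval(29, 39)
```

On those counts the directional_v1 interval sits entirely below 0.5 and the delta_sniper_v2 interval sits entirely above it, which matches the findings listed earlier.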
What's left for the data scientist after tonight: weeks 2–3 above. Week 1 is done. The unified jsonl + summary + calibration files are at ~/vault/research/decision-log/ and the directional_v1 raw history (90 rows, rich features) is the cleanest single-bot starting point. The Bayesian kill module is a 30-line drop-in that can be wired into any bot's main loop with one import.
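A Beta-posterior kill rule of the kind described can be written without SciPy. The sketch below is my reconstruction under stated assumptions, not bayesian_kill.py itself: it assumes a uniform Beta(1, 1) prior, and the `min_n=30` gate and `kill_prob=0.95` cut-off are placeholder values (the real confidence threshold is exactly what Week 2 is meant to decide).

```python
from math import comb

def prob_wr_below(wins, losses, threshold=0.5):
    """P(win rate < threshold) under a uniform Beta(1, 1) prior.

    The posterior is Beta(wins + 1, losses + 1). For integer shape
    parameters the Beta CDF equals a binomial tail sum, so this is exact.
    Intended for small N; very large counts would overflow float.
    """
    a, b = wins + 1, losses + 1
    m = a + b - 1
    return sum(
        comb(m, j) * threshold**j * (1 - threshold) ** (m - j)
        for j in range(a, m + 1)
    )

def kill_or_keep(wins, losses, min_n=30, kill_prob=0.95):
    """Hypothetical gate: never kill below min_n trades, else kill when the
    posterior says the win rate is below 0.5 with probability >= kill_prob."""
    if wins + losses < min_n:
        return "keep"
    return "kill" if prob_wr_below(wins, losses) >= kill_prob else "keep"
```

Wiring it in would be one call per evaluation cycle, for example `kill_or_keep(14, 76)` for directional_v1's current record. A win-rate rule alone would keep delta_sniper_v2 despite its negative PnL, so a PnL-aware variant is probably what the decision-curve work should feed.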
Outcome of the work: "strategy X has measurable edge separable from noise / strategy X is statistically indistinguishable from a coin flip" — said with a number that we can defend, plus a kill threshold that isn't 5_tickets_zero_wins.
If you wanted to push back on any of this, these are where the back-and-forth would actually be useful. Numbered so you can answer "Q3a — disagree, here's why" without retyping context.
Q3a. Hierarchical Bayesian regression with partial pooling vs. multi-task GBT with bot-id as a grouping feature vs. a tiny calibrated MLP. At N=211 with 4–6 bots, which would you reach for first, and why specifically? My weak prior is hierarchical Bayes for interpretability, but I haven't lived in this regime.
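To show what partial pooling buys at this N, here is the crudest possible version: shrink each bot's raw win rate toward the pooled rate, with a fixed prior strength. This is a stand-in for a real hierarchical model, not a proposal for one. The `prior_strength=10` pseudo-count is arbitrary, the 29/39 count for delta_sniper_v2 is inferred from the quoted 74.4% WR, and 0/4 for lil-tail v1 comes from "4 L so far".

```python
def shrunk_win_rates(counts, prior_strength=10.0):
    """counts: {bot_id: (wins, n)}. Returns {bot_id: shrunk win rate}.

    Each bot is pulled toward the pooled rate as if it had seen
    prior_strength extra trades at that rate.
    """
    total_wins = sum(w for w, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    pooled = total_wins / total_n
    return {
        bot: (w + prior_strength * pooled) / (n + prior_strength)
        for bot, (w, n) in counts.items()
    }

counts = {
    "directional_v1": (14, 90),
    "delta_sniper_v2": (29, 39),   # inferred from 74.4% of 39
    "lil_tail_v1": (0, 4),
}
rates = shrunk_win_rates(counts)
```

The 4-row bot moves a long way toward the pooled rate while the 90-row bot barely moves, which is the behaviour a hierarchical model would learn rather than hard-code. Whether win rates from strategies with such different payoff shapes should share a prior at all is part of the question.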
Q3b. López de Prado's purged k-fold with embargo seems mandatory at this scale, but the embargo length is a free parameter. For a 5-min Polymarket settlement, is a 1-day embargo sufficient or do you want a longer "regime" buffer (1 week)? What signal would tell us we picked it wrong after the fact?
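For concreteness, this is roughly what purged k-fold with an embargo does to the index sets. It is a minimal sketch of the general technique, not a port of any library: each row is a hypothetical `(entry_ts, label_end_ts)` pair (for directional_v1 that would be `ts` and `window_end`), and the one-day default embargo is just the value under discussion.

```python
def purged_kfold(spans, n_folds=5, embargo_s=86_400):
    """Yield (train_idx, test_idx) for rows sorted by entry time.

    spans: list of (entry_ts, label_end_ts) in seconds.
    Purge: drop training rows whose label window overlaps the test span.
    Embargo: drop training rows that start within embargo_s after it.
    """
    n = len(spans)
    bounds = [i * n // n_folds for i in range(n_folds + 1)]
    for k in range(n_folds):
        test = list(range(bounds[k], bounds[k + 1]))
        if not test:
            continue
        t0 = spans[test[0]][0]                   # test span start
        t1 = max(spans[i][1] for i in test)      # test span end
        train = []
        for i in range(n):
            if bounds[k] <= i < bounds[k + 1]:
                continue
            start, end = spans[i]
            overlaps = start <= t1 and end >= t0
            embargoed = t1 < start <= t1 + embargo_s
            if not overlaps and not embargoed:
                train.append(i)
        yield train, test

# Toy data: one trade per hour, each label resolving 300 s after entry.
toy = [(h * 3600, h * 3600 + 300) for h in range(10)]
folds = list(purged_kfold(toy, n_folds=5, embargo_s=2 * 3600))
```

With 5-minute label windows the purge removes almost nothing, so the embargo length does nearly all the work. At 90 rows, a week-long embargo could also delete a large share of the training set in every fold, which is one concrete cost to weigh against the regime argument.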
Q3c. Platt scaling is cheap but assumes a sigmoid relationship. Isotonic is more flexible but needs more data. At N=90 per bot post-CV, which holds up? Or do we cross-bot-pool the calibration set itself?
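The data-hunger of isotonic calibration is easy to see from the algorithm itself. Below is a small pool-adjacent-violators sketch in plain Python, written for illustration; in practice one would more likely use a library implementation. The function names and the `(upper_score, calibrated_value)` step representation are my own.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns a list of (upper_score, calibrated_value) steps.
    """
    blocks = []  # each block: [label_sum, count, max_score]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Tied scores always share one block.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while the means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, hi = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = hi
    return [(b[2], b[0] / b[1]) for b in blocks]

def isotonic_predict(steps, score):
    """Look up the calibrated value for a raw score."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]

steps = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
```

Every calibrated value is the mean of the labels in one pooled block, so with a few dozen held-out points per bot the output collapses to a handful of coarse steps. That is the practical argument for either Platt scaling or a cross-bot calibration set at this N.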
Q3d. Binary profitable_settle ∈ {0,1} is the simple choice. A continuous pnl_normalised_by_size target preserves more signal but the loss function gets weirder (skewed, fat tails). Do you target both and pick at deploy time, or commit early?
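Both targets are cheap to derive from the same row, which is part of why deferring the choice is tempting. In this sketch I assume "size" means the dollar cost of the position and that "profitable" means strictly positive PnL; the brief does not pin down either, and normalising by shares instead would give a different scale.

```python
def targets(pnl, cost):
    """Return (profitable_settle, pnl_normalised_by_size) for one trade.

    Assumptions: size is the dollar cost, profitable means pnl > 0.
    """
    if cost <= 0:
        raise ValueError("cost must be positive to normalise PnL")
    profitable_settle = 1 if pnl > 0 else 0
    return profitable_settle, pnl / cost

# The directional_v1 sample row: trade_pnl -4.34 on cost 4.34.
binary_target, continuous_target = targets(-4.34, 4.34)
```

For the sample row the continuous target is a total loss of the stake, so the binary label throws away nothing there. The information gap only shows up on winners and partial fills, where the normalised PnL varies and the binary label does not.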
Q3e. Parent brief's Q7 is "how do you detect regime change at small N." This work fits in as the likelihood; the regime detector is the prior shift on whether the likelihood still applies. Is that the right factorisation, or are you reaching for something joint (a switching-state-space model, say) where the two questions are one?
The cleanest 90 rows are at /opt/polymarket-bot/history.jsonl (rich features, single bot, single market type). I'd start there — get the calibration plot for that bot alone, decide if the protocol survives walk-forward, then add bots once the per-bot version works.
Email Neo at theprofessor.alexander@gmail.com with thoughts on Q3a–Q3e, or any reply at all. We're treating this as a real conversation, not an interview.