Appendix to /research · v0.1 · 2026-05-03

Decision-log outcome model — the longer answer.

A follow-on brief because someone cared about wishlist item 5.3. The honest data is smaller than the parent doc claimed, the underlying problem is more interesting because of it, and there are five real research questions hiding inside what looked like one engineering task. Below: the real numbers, the schema, the proposed pipeline, and what a researcher could actually pick up.

numbers verified 2026-05-03 21:00 UTC · ~7 minute read · follow-on to /research

Contents

01 Correction up front
02 The real data
03 Why N=200 is still interesting
04 Schema & pipeline sketch
05 What a 3-week piece looks like
06 Five sub-questions for you

01 Correction up front

The parent brief said "~2,500 trade-decision rows." That was generous. The real number is exactly 211 rich-feature rows across six bots, plus 1,852 thin rows from a market-maker bot that logs only {timestamp, result, merge_profit}. Most of the thin set is unusable for a real outcome model. I'm flagging this here so you can calibrate expectations before deciding whether it's still interesting.

Update 2026-05-03 ~23:30 UTC: what's actually shipped. After this conversation, three implementation passes and two audit passes ran. 28 ML modules are live across the brain (13), the office (10), and a cross-cutting layer (5); ~73 deferred items are listed in ~/vault/research/ML_BACKLOG.md with effort estimates, not pretending to be done. The audit found 2 BLOCKER and 4 HIGH issues, all fixed before this update, including a localStorage-persistence bug on 4 office features and a too-lax drift detector that failed to fire on a real 31-percentage-point win-rate (WR) shift. Brochure numbers below now come from the audited pipeline, not implementer self-reports.

Real findings already produced by the audited modules:

directional_v1: 14/90 wins, P(WR<0.5)≈100%, total PnL −$1,538.43. The decision curve shows no profitable threshold (best PnL = −$277, CI [−$298, −$252]).

delta_sniper_v2: 74.4% WR but PnL −$7.46 — high WR with bigger losers than winners. The only bot whose Wilson CI sits entirely above 50%, yet still loses money.

Meta-LR finding: directional_v1 is using sub-optimal η=0.10. True optimum is η=0.50 → +30.24% Brier improvement. lil_tail_v2 has +74.67% headroom.

Thompson kill rule over 8 candidate policies confirms only delta_sniper_v2 is above break-even; cum-loss spread $3,068 between best and worst arms.

Drift detector (post-fix) fires combined alert when any of {PSI>0.25, |ΔWR|>0.15, sKL>1.0} — currently flagging directional_v1 (psi_textbook + wr_shift) and delta_sniper_v2 (wr_shift).
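A minimal sketch of how the combined alert above could be wired, assuming the three statistics are available per bot. Only the three thresholds and the rule names psi_textbook and wr_shift come from this brief. The function names, the `skl` label, the bin counts, and the 0.45 and 0.14 win rates are made up for illustration (chosen to give a 31-pp shift). sKL is passed in as a plain number because the brief does not say how the audited module computes it.

```python
import math

# Thresholds quoted in the brief; everything else is an illustrative assumption.
PSI_MAX, WR_SHIFT_MAX, SKL_MAX = 0.25, 0.15, 1.0
EPS = 1e-6  # floor so an empty bin never produces log(0)


def psi(ref_counts, cur_counts):
    """Textbook population stability index over matching histogram bins."""
    ref_total, cur_total = sum(ref_counts), sum(cur_counts)
    score = 0.0
    for r, c in zip(ref_counts, cur_counts):
        p = max(r / ref_total, EPS)  # reference-window share of this bin
        q = max(c / cur_total, EPS)  # current-window share of this bin
        score += (q - p) * math.log(q / p)
    return score


def drift_alert(psi_value, wr_ref, wr_cur, skl_value):
    """Return the names of every rule that fired; an empty list means no alert."""
    fired = []
    if psi_value > PSI_MAX:
        fired.append("psi_textbook")
    if abs(wr_cur - wr_ref) > WR_SHIFT_MAX:
        fired.append("wr_shift")
    if skl_value > SKL_MAX:
        fired.append("skl")  # hypothetical label for the sKL rule
    return fired


# Hypothetical inputs: a feature histogram that flipped, plus a 31-pp win-rate drop.
alerts = drift_alert(psi([80, 20], [20, 80]), wr_ref=0.45, wr_cur=0.14, skl_value=0.2)
```

Returning the list of fired rules rather than a bare boolean keeps the "which rule tripped" detail that the findings above report per bot.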

The smaller number sharpens the problem. At N=2,500 you'd reach for XGBoost and call it done. At N=200 with multiple sources of feature heterogeneity, the model choice is downstream of an actual research question about how to combine evidence across non-identically-distributed strategies. That's the more useful framing anyway.

02 The real data: what's labeled, what isn't

Bot	Rows	Feature richness	Outcome label
directional_v1 (Polymarket BTC/ETH/SOL 5-min Up/Down)	90	rich: entry price, shares, signal mode (vol-flow / OFI / Hawkes), volume ratio, condition_id, window_end	`payout`, `trade_pnl`, `result` ∈ {WIN, EXPIRED, REDEEMED}
delta-sniper-v2 (Polymarket late-window delta)	39	medium: entry side, delta band, time-in-window	`pnl`, settle status
lil-tail v1 (BTC 5m tail-buy at $0.01–$0.02)	4	rich: entry price, side, vol state, magnitude gate, window_start	`result` (4 losses so far)
lil-tail v2 (mint-and-sell SPLIT-arb)	50	rich: mint cost, up/down filled, redemption value, sells collected	`cycle_pnl`, `result` ∈ {P, L, F}
meme-bot (Solana copy-trade, paper)	12	medium: source wallet id, entry price, hold duration	`pnl`
sports-mirror-bot	16	medium: match, side, line	binary outcome
SMM merge logs (polymarket-bot/ml_data/trades.jsonl)	1,852	thin: only `timestamp`, `result`, `merge_profit`	three-class result
resolved_sniper (brain-system/state, older copy-trade)	90	likely overlaps with directional_v1; not counted in the 211	shared

Fig 1. Rich-feature rows sum to 211 (90 + 39 + 4 + 50 + 12 + 16). Each bar is a separate strategy with its own feature set; that's the heterogeneity you'd be working with.

03 Why N=200 is still interesting

Five reasons we think this isn't a "wait until you have more data" problem.

Bot-feature heterogeneity is the real research question.
Each bot has different features. Concat-and-train throws away which features come from which strategy. Honest options: fixed effects per bot, hierarchical Bayesian regression with bot-level random effects on coefficients, or multi-task gradient-boosted trees (GBT) with bot as a grouping variable. There's a real partial-pooling tradeoff hiding in N=39 rows for delta-sniper vs N=90 for directional. James-Stein-flavoured shrinkage is the kind of thing that pays for itself at this scale specifically.
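To make the partial-pooling tradeoff concrete, here is a rough sketch that shrinks each bot's raw win rate toward the pooled rate, with the pull weakening as N grows. It is a stand-in for the idea, not the hierarchical model itself. The 14/90 and 0/4 counts come from this brief; 29/39 for delta_sniper_v2 is inferred from the reported 74.4% WR; `prior_strength` is a hypothetical pseudo-count that a real model would estimate rather than fix.

```python
def shrink_win_rates(counts, prior_strength=10.0):
    """counts: {bot: (wins, n)}. Returns (pooled_rate, {bot: shrunk_rate}).

    Each bot's estimate is a weighted average of its own rate and the pooled
    rate, with the pooled rate counting as `prior_strength` pseudo-trades.
    """
    total_wins = sum(w for w, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    pooled = total_wins / total_n
    shrunk = {
        bot: (w + prior_strength * pooled) / (n + prior_strength)
        for bot, (w, n) in counts.items()
    }
    return pooled, shrunk


COUNTS = {
    "directional_v1": (14, 90),   # from the findings list
    "delta_sniper_v2": (29, 39),  # inferred from 74.4% WR over 39 rows
    "lil_tail_v1": (0, 4),        # "4 losses so far"
}

pooled, shrunk = shrink_win_rates(COUNTS)
for bot, (wins, n) in COUNTS.items():
    print(f"{bot}: raw={wins / n:.3f} shrunk={shrunk[bot]:.3f} (N={n})")
```

The 4-row bot moves furthest toward the pooled rate and the 90-row bot barely moves, which is the behaviour you want when one strategy has almost no evidence of its own.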
Temporal leakage is everywhere and N is small enough that any leak matters.
The 90 directional_v1 rows span ~3 weeks. A naive k-fold lets the model see the future. Embargo-windowed time-series CV is mandatory, and at N=90 a 3-day embargo eats real chunks. Designing an honest CV protocol is more important than picking a model. Walk-forward backtest with purged cross-validation is the de Prado playbook; the question is what embargo length is right for our specific market.
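A small sketch of walk-forward splitting with an embargo, to show how much training data the embargo removes at this N. It assumes one resolution timestamp per row and uses synthetic, evenly spaced timestamps (90 rows over 21 days); the function name and parameters are hypothetical, and a real protocol would purge on each trade's actual label window.

```python
DAY = 86_400  # seconds


def walk_forward_splits(times, n_splits=5, embargo=3 * DAY):
    """Yield (train_idx, test_idx) pairs over time-sorted rows.

    Test folds are consecutive blocks after an initial training block.
    A row may train a fold only if it resolved more than `embargo`
    seconds before that fold's first test row.
    """
    n = len(times)
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        start = k * fold
        stop = n if k == n_splits else start + fold
        cutoff = times[start] - embargo
        train_idx = [i for i in range(start) if times[i] < cutoff]
        yield train_idx, list(range(start, stop))


# Synthetic stand-in for the ~3 weeks of directional_v1 rows: one every 5.6 hours.
times = [i * (21 * DAY // 90) for i in range(90)]
splits = list(walk_forward_splits(times))
for train_idx, test_idx in splits:
    print(f"train={len(train_idx):2d} rows, test={len(test_idx)} rows")
```

With these synthetic timestamps the first fold keeps only 3 of its 15 candidate training rows under a 3-day embargo, which is the "eats real chunks" problem in miniature.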
Calibration matters more than AUC at this scale.
The bot uses the predicted probability as input to a kill rule. Brier score and reliability diagrams are the right metrics. We have no calibration today — every "this strategy is at 0.6 confidence" is a guess. A reliability curve on existing trades would be the first thing I'd build, before any model. Platt scaling or isotonic on top of whatever model gets picked.
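As a sketch of the two metrics named above, the snippet below scores a flat "0.6 confidence" forecast against directional_v1's 14 wins in 90 trades. The outcome counts are from this brief; the flat forecast is a hypothetical stand-in, since no bot currently emits probabilities, and the function names are illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def reliability_bins(probs, outcomes, n_bins=5):
    """Return (mean predicted prob, observed win rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            win_rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, win_rate, len(members)))
    return rows


outcomes = [1] * 14 + [0] * 76          # directional_v1: 14 wins out of 90
flat_guess = [0.6] * 90                 # hypothetical "0.6 confidence" forecast
base_rate = [14 / 90] * 90              # forecast equal to the empirical rate

brier_guess = brier_score(flat_guess, outcomes)
brier_base = brier_score(base_rate, outcomes)
rows = reliability_bins(flat_guess, outcomes)
```

A forecast that simply repeats the empirical rate scores better than the overconfident 0.6, and the single reliability bin shows the gap directly: predicted 0.6, observed 14/90.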
This is the same problem as the regime-change detector (parent brief Q7), decomposed.
The outcome model is the likelihood. The regime-change detector is the prior shift on whether the model still applies. They factor cleanly. A complete answer ships both: a calibrated likelihood, plus a posterior over "is the world I trained on still the world I'm in?"
Scale dictates the model class — pick honestly.
211 rich rows is too few for deep nets, perfect for gradient-boosted trees with hand-crafted features, and most interesting for hierarchical Bayesian regression where partial pooling does the work. We would be reaching for one of those three, not transformers. Worth saying out loud because the temptation to over-model at N=200 is real.

04 Schema & pipeline sketch

Here's what one row of directional_v1/history.jsonl actually looks like — verified, not paraphrased:

// /opt/polymarket-bot/history.jsonl · sample row
{
  "ts": "1776404522",
  "condition_id": "0x77e37e4aa27726ec…",
  "outcome": "DOWN",
  "token_id": "65047327802547722647…",
  "our_price": 0.31,
  "shares": 14.0,
  "cost": 4.34,
  "order_id": "0xa4c0e35fed3e7578…",
  "success": true,
  "result": "EXPIRED",
  "signal_mode": "volume_flow",
  "volume_ratio": 0.6718,
  "title": "Bitcoin Up or Down",
  "window_end": 1776404700,
  "payout": 0.0,
  "trade_pnl": -4.34,
  "_logged_at": 1776487920
}

The rest of the bots have the same flavour but different keys. Step one of any real pipeline is a unified-feature contract that maps each bot's columns to a shared schema, with a bot_id column for partial pooling.
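One possible shape for that contract, sketched for directional_v1 only because it is the one bot whose keys appear in this brief. The `UnifiedRow` field names, the `extras` bucket, and the rule "profitable means trade_pnl > 0" are all assumptions for illustration; the shipped pipeline may use a different schema.

```python
from dataclasses import dataclass


@dataclass
class UnifiedRow:
    bot_id: str              # grouping key for partial pooling
    decided_at: int          # unix seconds when the trade was placed
    resolves_at: int         # unix seconds when the market window ends
    entry_price: float
    stake: float             # dollars at risk
    pnl: float
    profitable_settle: bool  # assumed definition: pnl > 0
    extras: dict             # bot-specific features kept under their own names


def from_directional_v1(raw):
    """Map one directional_v1 history row onto the shared schema."""
    return UnifiedRow(
        bot_id="directional_v1",
        decided_at=int(raw["ts"]),  # "ts" is logged as a string
        resolves_at=int(raw["window_end"]),
        entry_price=float(raw["our_price"]),
        stake=float(raw["cost"]),
        pnl=float(raw["trade_pnl"]),
        profitable_settle=raw["trade_pnl"] > 0,
        extras={
            "signal_mode": raw["signal_mode"],
            "volume_ratio": raw["volume_ratio"],
        },
    )


# Fields copied from the sample row above (ids omitted).
sample = {
    "ts": "1776404522", "our_price": 0.31, "cost": 4.34,
    "signal_mode": "volume_flow", "volume_ratio": 0.6718,
    "window_end": 1776404700, "trade_pnl": -4.34,
}
row = from_directional_v1(sample)
seconds_to_settle = row.resolves_at - row.decided_at
```

Keeping bot-specific features in `extras` avoids forcing a column that only one strategy logs onto every other bot; each additional bot would get its own small mapper.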

Fig 2. Top row is the model. Bottom row is what makes it usable for a kill-rule decision. The handoff doc at the end is the artifact future bots check against.

Fig 3. Illustrative — bots emit binary entry/no-entry decisions, not probabilities. We can construct a real reliability curve only after a model produces predicted probabilities and we bin them. The Wilson-CI table below is the empirical-rate baseline against which any future model would be calibrated.

Fig 4. Wilson 95% CIs from the unified pipeline. Whiskers are CI bounds; dot is the empirical rate; red diamond marks where the rate sits relative to the 50% break-even. delta_sniper_v2 is the only bot whose CI sits entirely above 50% — yet its PnL is still negative, evidence that win-rate alone is not enough. (See Fig 5 caveat.)
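For reference, the Wilson interval behind that figure can be reproduced in a few lines. This is a sketch of the standard formula, not the pipeline's own code. The 14/90 count is from this brief; 29/39 for delta_sniper_v2 is inferred from its 74.4% WR over 39 rows.

```python
import math


def wilson_ci(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


directional_lo, directional_hi = wilson_ci(14, 90)
sniper_lo, sniper_hi = wilson_ci(29, 39)  # win count inferred, see above

print(f"directional_v1:  [{directional_lo:.3f}, {directional_hi:.3f}]")
print(f"delta_sniper_v2: [{sniper_lo:.3f}, {sniper_hi:.3f}]")
```

Under these counts the directional_v1 interval sits entirely below 0.5 and the delta_sniper_v2 interval entirely above it, matching the caption.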

05 What a 3-week piece of work would look like

Concrete enough that a researcher can decide whether they want it.

Week	Deliverable	Definition of done	Status
1	Unified schema across the bots; data-quality audit; baseline summary stats per bot	One file with 211 rows and a clean column contract; documented null-handling; per-bot mean / Wilson CI on outcome	shipped 2026-05-03: `decision_log_pipeline.py` writes `~/vault/research/decision-log/unified.jsonl` plus summary and calibration files. Wilson CIs for all 6 bots are in summary.json.
1.5	Bayesian Beta-posterior kill rule replacing fixed thresholds	Reusable module that takes (wins, losses) and emits kill-or-keep with min-N gate. Backfit against existing bots.	shipped 2026-05-03: `bayesian_kill.py`, smoke-tested against all 6 bots' current state. It would correctly kill directional_v1 and lil_tail_v2, and correctly keep delta_sniper_v2 and the small-N cases.
2	First model + walk-forward CV; calibration plots; decision-curve analysis	GBT or hierarchical GLMM; 5-fold time-series CV with embargo; reliability diagram per bot; PnL-vs-threshold curve with optimal threshold marked	open: needs the data scientist
3	Handoff: live-mode integration, write-up, & future-bot logging contract	Threshold wired into one bot's kill rule; one-pager summarising what works / what doesn't; `FUTURE_BOTS.md` doc spec'ing the columns a new bot must log to slot in	open (partial): the Bayesian kill rule could be wired into v1/v2 today, but is waiting on Week 2 to define the right confidence threshold from real data

What's left for the data scientist after tonight: weeks 2–3 above. Week 1 is done. The unified jsonl + summary + calibration files are at ~/vault/research/decision-log/ and the directional_v1 raw history (90 rows, rich features) is the cleanest single-bot starting point. The Bayesian kill module is a 30-line drop-in that can be wired into any bot's main loop with one import.
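For readers who have not opened `bayesian_kill.py`, here is a rough sketch of what a Beta-posterior kill rule with a min-N gate can look like. It is not the shipped module. The uniform Beta(1, 1) prior, `min_n=20`, and `confidence=0.95` are placeholder assumptions (the brief says the real threshold waits on Week 2), and 29 wins / 10 losses for delta_sniper_v2 is inferred from its 74.4% WR.

```python
from math import comb


def prob_wr_below(wins, losses, threshold=0.5):
    """P(true win rate < threshold) under a uniform Beta(1, 1) prior.

    The posterior is Beta(wins + 1, losses + 1). For integer parameters the
    Beta CDF equals a binomial upper-tail sum, so no SciPy is needed.
    """
    a, b = wins + 1, losses + 1
    n = a + b - 1
    return sum(
        comb(n, j) * threshold**j * (1 - threshold) ** (n - j)
        for j in range(a, n + 1)
    )


def kill_or_keep(wins, losses, min_n=20, break_even=0.5, confidence=0.95):
    """Kill only with enough trades AND a confident sub-break-even posterior."""
    if wins + losses < min_n:
        return "keep"  # min-N gate: too little evidence either way
    below = prob_wr_below(wins, losses, break_even)
    return "kill" if below > confidence else "keep"


decisions = {
    "directional_v1": kill_or_keep(14, 76),   # 14/90 wins
    "delta_sniper_v2": kill_or_keep(29, 10),  # inferred from 74.4% of 39
    "lil_tail_v1": kill_or_keep(0, 4),        # 4 losses, below the min-N gate
}
```

Under these placeholder settings the rule kills directional_v1 and keeps delta_sniper_v2 and the 4-row lil-tail v1. Note that it reasons about win rate only, so it would not catch a bot like delta_sniper_v2 that wins often but still loses money.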

Outcome of the work: "strategy X has measurable edge separable from noise / strategy X is statistically indistinguishable from a coin flip" — said with a number that we can defend, plus a kill threshold that isn't 5_tickets_zero_wins.

06 Five sub-questions for you

If you wanted to push back on any of this, these are where the back-and-forth would actually be useful. Numbered so you can answer "Q3a — disagree, here's why" without retyping context.

Q3a — model class

Hierarchical Bayesian regression with partial pooling vs. multi-task GBT with bot-id as a grouping vs. a tiny calibrated MLP. At N=211 with 4–6 bots, which would you reach for first, and why specifically? My weak prior is hierarchical Bayes for interpretability, but I haven't lived in this regime.

Q3b — leakage protocol

de Prado's purged k-fold with embargo seems mandatory at this scale, but the embargo length is a free parameter. For a 5-min Polymarket settlement, is a 1-day embargo sufficient or do you want a longer "regime" buffer (1 week)? What signal would tell us we picked it wrong after the fact?

Q3c — calibration choice

Platt scaling is cheap but assumes a sigmoid relationship. Isotonic is more flexible but needs more data. At N≤90 per bot post-CV, which holds up? Or do we cross-bot-pool the calibration set itself?

Q3d — outcome target

Binary profitable_settle ∈ {0,1} is the simple choice. A continuous pnl_normalised_by_size target preserves more signal but the loss function gets weirder (skewed, fat tails). Do you target both and pick at deploy time, or commit early?
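To pin down what the two candidate targets would be for a single trade, here is a tiny sketch. The target names are from the question above; dividing by dollar cost is an assumption about what "size" means (it could equally be shares), and the function name is hypothetical.

```python
def outcome_targets(pnl, cost):
    """Derive both candidate targets from one settled trade."""
    if cost <= 0:
        raise ValueError("cost must be positive to normalise PnL")
    return {
        "profitable_settle": int(pnl > 0),       # binary target
        "pnl_normalised_by_size": pnl / cost,    # continuous target, -1.0 = full loss
    }


# Values from the sample directional_v1 row: trade_pnl = -4.34, cost = 4.34.
targets = outcome_targets(pnl=-4.34, cost=4.34)
```

For that row the binary target is 0 and the normalised target is -1.0, a total loss of the stake. Computing both at logging time is cheap, so the choice between them can be deferred without re-deriving anything later.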

Q3e — connection to regime-change

Parent brief's Q7 is "how do you detect regime change at small N." This work fits in as the likelihood; the regime detector is the prior shift on whether the likelihood still applies. Is that the right factorisation, or are you reaching for something joint (a switching-state-space model, say) where the two questions are one?

If you want to start somewhere

The cleanest 90 rows are at /opt/polymarket-bot/history.jsonl (rich features, single bot, single market type). I'd start there — get the calibration plot for that bot alone, decide if the protocol survives walk-forward, then add bots once the per-bot version works.

Email Neo at theprofessor.alexander@gmail.com with thoughts on Q3a–Q3e, or any reply at all. We're treating this as a real conversation, not an interview.