Appendix to /research · v0.1 · 2026-05-03

Decision-log outcome model — the longer answer.

A follow-on brief because someone cared about wishlist item 5.3. The honest data is smaller than the parent doc claimed, the underlying problem is more interesting because of it, and there are five real research questions hiding inside what looked like one engineering task. Below: the real numbers, the schema, the proposed pipeline, and what a researcher could actually pick up.

numbers verified 2026-05-03 21:00 UTC · ~7 minute read · follow-on to /research
Contents
01 Correction up front
02 The real data
03 Why N=200 is still interesting
04 Schema & pipeline sketch
05 What a 3-week piece looks like
06 Five sub-questions for you

01 Correction up front

The parent brief said "~2,500 trade-decision rows." That was generous. The real number is exactly 211 rich-feature rows across six bots, plus 1,852 thin rows from a market-maker bot that logs only {timestamp, result, merge_profit}. Most of the thin set is unusable for a real outcome model. I'm flagging this here so you can calibrate expectations before deciding whether it's still interesting.
Update 2026-05-03 ~23:30 UTC: what's actually shipped. After this conversation, three implementation passes and two audit passes ran. 28 ML modules are live across the brain (13), the office (10), and a cross-cutting layer (5); ~73 deferred items are listed in ~/vault/research/ML_BACKLOG.md with effort estimates, not pretending to be done. The audit found 2 BLOCKER and 4 HIGH issues, all fixed before this update, including a localStorage-persistence bug on 4 office features and a too-lax drift detector that was failing to fire on a real 31-percentage-point win-rate shift. Brochure numbers below are now from the audited pipeline, not implementer self-reports.

Real findings already produced by the audited modules:

The smaller number sharpens the problem. At N=2,500 you'd reach for an XGBoost and call it done. At N=200 with multiple sources of feature heterogeneity, the model choice is downstream of an actual research question about how to combine evidence across non-identically-distributed strategies. That's the more useful framing anyway.

02 The real data: what's labeled, what isn't

| Bot | Rows | Feature richness | Outcome label |
|---|---|---|---|
| directional_v1 (Polymarket BTC/ETH/SOL 5-min Up/Down) | 90 | rich: entry price, shares, signal mode (vol-flow / OFI / Hawkes), volume ratio, condition_id, window_end | payout, trade_pnl, result ∈ {WIN, EXPIRED, REDEEMED} |
| delta-sniper-v2 (Polymarket late-window delta) | 39 | medium: entry side, delta band, time-in-window | pnl, settle status |
| lil-tail v1 (BTC 5m tail-buy at $0.01–$0.02) | 4 | rich: entry price, side, vol state, magnitude gate, window_start | result (4 L so far) |
| lil-tail v2 (Mint-and-sell SPLIT-arb) | 50 | rich: mint cost, up/down filled, redemption value, sells collected | cycle_pnl, result ∈ {P, L, F} |
| meme-bot (Solana copy-trade, paper) | 12 | medium: source wallet id, entry price, hold duration | pnl |
| sports-mirror-bot | 16 | medium: match, side, line | binary outcome |
| SMM merge logs (polymarket-bot/ml_data/trades.jsonl) | 1,852 | thin: just timestamp, result, merge_profit | three-class result |
| resolved_sniper (brain-system/state, older copy-trade) | 90 | likely overlaps with directional_v1 | shared |
[Bar chart: rich-feature trade-outcome rows by bot. directional_v1 90 · lil-tail v2 50 · delta-sniper-v2 39 · sports-mirror 16 · meme-bot 12 · lil-tail v1 4. Total rich rows = 211, plus 1,852 thin rows, plus ~90 likely-duplicated.]
Fig 1. Rich-feature rows sum to 211. Each bar is a separate strategy with its own feature set; that's the heterogeneity you'd be working with.

03 Why N=200 is still interesting

Five reasons we think this isn't a "wait until you have more data" problem.

  1. Bot-feature heterogeneity is the real research question.
    Each bot has different features. Concat-and-train throws away which features come from which strategy. Honest options: fixed effects per bot, hierarchical Bayesian regression with bot-level random effects on coefficients, or multi-task gradient-boosted trees (GBT) with bot as a grouping. There's a real partial-pooling tradeoff hiding in N=39 rows for delta-sniper vs N=90 for directional. James-Stein-flavoured shrinkage is the kind of thing that pays for itself at this scale specifically.
  2. Temporal leakage is everywhere and N is small enough that any leak matters.
    The 90 directional_v1 rows span ~3 weeks. A naive k-fold lets the model see the future. Embargo-windowed time-series CV is mandatory, and at N=90 a 3-day embargo is costly: 3 days is about a seventh of the ~3-week span, lost at every split boundary. Designing an honest CV protocol is more important than picking a model. Walk-forward backtesting with purged cross-validation is the de Prado playbook; the question is what embargo length is right for our specific market.
  3. Calibration matters more than AUC at this scale.
    The bot uses the predicted probability as input to a kill rule. Brier score and reliability diagrams are the right metrics. We have no calibration today — every "this strategy is at 0.6 confidence" is a guess. A reliability curve on existing trades would be the first thing I'd build, before any model. Platt scaling or isotonic on top of whatever model gets picked.
  4. This is the same problem as the regime-change detector (parent brief Q7), decomposed.
    The outcome model is the likelihood. The regime-change detector is the prior shift on whether the model still applies. They factor cleanly. A complete answer ships both: a calibrated likelihood, plus a posterior over "is the world I trained on still the world I'm in?"
  5. Scale dictates the model class — pick honestly.
    211 rich rows is too few for deep nets, perfect for gradient-boosted trees with hand-crafted features, and most interesting for hierarchical Bayesian regression where partial pooling does the work. We would be reaching for one of those three, not transformers. Worth saying out loud because the temptation to over-model at N=200 is real.
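To make the partial-pooling point in item 1 concrete, here is a minimal sketch of Beta-binomial shrinkage on per-bot win rates. It is not the hierarchical regression itself, only the simplest member of that family. The win counts are the ones reported in Fig 4; `prior_strength` (how many pseudo-trades the pooled rate is worth) is a hypothetical tuning knob, and pooling strategies this different into one prior is itself an assumption a researcher should challenge.

```python
from typing import Dict, Tuple

# (wins, n) per bot, taken from the Wilson-CI figure in this brief.
COUNTS: Dict[str, Tuple[int, int]] = {
    "directional_v1": (14, 90),
    "lil_tail_v1": (0, 4),
    "lil_tail_v2": (0, 50),
    "delta_sniper_v2": (29, 39),
    "meme_bot": (0, 12),
}


def shrink_win_rates(
    counts: Dict[str, Tuple[int, int]],
    prior_strength: float = 10.0,  # hypothetical: pooled rate counts as 10 pseudo-trades
) -> Dict[str, float]:
    """Posterior-mean win rate per bot under a Beta prior centred on the pooled rate."""
    total_wins = sum(w for w, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    pooled = total_wins / total_n
    return {
        bot: (wins + pooled * prior_strength) / (n + prior_strength)
        for bot, (wins, n) in counts.items()
    }


shrunk = shrink_win_rates(COUNTS)
for bot, (wins, n) in COUNTS.items():
    print(f"{bot:16s} raw={wins / n:.3f} shrunk={shrunk[bot]:.3f} (n={n})")
```

Both lil-tail bots have a raw rate of zero, but the 4-row bot is pulled much further toward the pooled rate than the 50-row bot. That asymmetry is the whole argument for partial pooling at this scale.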

04 Schema & pipeline sketch

Here's what one row of directional_v1/history.jsonl actually looks like — verified, not paraphrased:

// /opt/polymarket-bot/history.jsonl · sample row
{
  "ts": "1776404522",
  "condition_id": "0x77e37e4aa27726ec…",
  "outcome": "DOWN",
  "token_id": "65047327802547722647…",
  "our_price": 0.31,
  "shares": 14.0,
  "cost": 4.34,
  "order_id": "0xa4c0e35fed3e7578…",
  "success": true,
  "result": "EXPIRED",
  "signal_mode": "volume_flow",
  "volume_ratio": 0.6718,
  "title": "Bitcoin Up or Down",
  "window_end": 1776404700,
  "payout": 0.0,
  "trade_pnl": -4.34,
  "_logged_at": 1776487920
}

The rest of the bots have the same flavour but different keys. Step one of any real pipeline is a unified-feature contract that maps each bot's columns to a shared schema, with a bot_id column for partial pooling.
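A minimal sketch of what such a contract could look like for the one bot whose row format is shown above. The `UnifiedRow` class, its field names, and the choice of which keys count as features are all hypothetical; the only things taken from the source are the directional_v1 keys and values in the sample row. Other bots would each need their own adapter.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class UnifiedRow:
    """Hypothetical shared schema: a few common columns plus a per-bot feature bag."""
    bot_id: str
    ts: int                    # decision time, epoch seconds
    pnl: Optional[float]       # realised PnL, if the row has settled
    result: Optional[str]      # raw outcome label, kept as logged
    features: Dict[str, Any] = field(default_factory=dict)


# Assumption: these are the directional_v1 keys worth treating as features.
DIRECTIONAL_V1_FEATURES = (
    "our_price", "shares", "cost", "signal_mode", "volume_ratio", "window_end",
)


def from_directional_v1(raw: Dict[str, Any]) -> UnifiedRow:
    """Adapter for one bot. Note that 'ts' is logged as a string in the sample row."""
    return UnifiedRow(
        bot_id="directional_v1",
        ts=int(raw["ts"]),
        pnl=raw.get("trade_pnl"),
        result=raw.get("result"),
        features={k: raw[k] for k in DIRECTIONAL_V1_FEATURES if k in raw},
    )


# Subset of the sample row above (truncated ids left out on purpose).
sample = {
    "ts": "1776404522", "outcome": "DOWN", "our_price": 0.31, "shares": 14.0,
    "cost": 4.34, "result": "EXPIRED", "signal_mode": "volume_flow",
    "volume_ratio": 0.6718, "window_end": 1776404700, "trade_pnl": -4.34,
}
row = from_directional_v1(sample)
seconds_before_window_end = row.features["window_end"] - row.ts
print(row.bot_id, row.ts, seconds_before_window_end)
```

Keeping per-bot features in a bag rather than forcing shared columns avoids inventing nulls for features a bot never logged; whether that is the right contract is one of the Week 1 design choices.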

[Figure: Proposed pipeline · raw bot logs → calibrated decision policy. Stages: Raw history (~6 jsonl files) → Unified schema (bot_id + features) → Time-CV split (walk-forward + embargo) → Hierarchical fit (GBT or Bayesian glmm) → Calibration (isotonic / Platt) → Decision curve (PnL vs threshold) → Kill threshold (live policy input). Handoff doc: how a future bot logs to slot into this pipeline automatically.]
Fig 2. Top row is the model. Bottom row is what makes it usable for a kill-rule decision. The handoff doc at the end is the artifact future bots check against.
[Figure: Reliability diagram · sketch (no real model fit yet). x-axis: predicted prob, 0.0 to 1.0; y-axis: empirical, 0.0 to 1.0. Series: y = x (ideal), uncalibrated, post-Platt.]
Fig 3. Illustrative — bots emit binary entry/no-entry decisions, not probabilities. We can construct a real reliability curve only after a model produces predicted probabilities and we bin them. The Wilson-CI table below is the empirical-rate baseline against which any future model would be calibrated.
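Once a model does emit probabilities, the binning step is small. The sketch below computes a Brier score and the points of a reliability curve in plain Python. The probabilities and outcomes are synthetic placeholders, not output from any fitted model, and the function names and bin count are illustrative choices.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def reliability_bins(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted prob, empirical win rate, count) per non-empty bin."""
    buckets: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        buckets[idx].append((p, y))
    points = []
    for bucket in buckets:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            rate = sum(y for _, y in bucket) / len(bucket)
            points.append((mean_p, rate, len(bucket)))
    return points


# Synthetic placeholder data, only to exercise the functions.
example_probs = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
example_outcomes = [0, 0, 0, 1, 1, 0, 1, 1]
print("Brier:", brier_score(example_probs, example_outcomes))
for mean_p, rate, count in reliability_bins(example_probs, example_outcomes):
    print(f"predicted={mean_p:.2f} empirical={rate:.2f} n={count}")
```

At N around 90 per bot, the per-bin counts get small quickly, so the count column matters as much as the curve: a bin with two trades says almost nothing.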
Real Wilson 95% CIs · per bot · 2026-05-03 (break-even assumed at 50%)
| Bot | Wins / N | Note | PnL |
|---|---|---|---|
| directional_v1 | 14/90 | | −$1,538 |
| lil_tail_v1 | 0/4 | noise | −$4 |
| lil_tail_v2 | 0/50 | pre-fix | $0 (mints recouped) |
| delta_sniper_v2 | 29/39 | high WR, neg PnL | −$7 |
| meme_bot | 0/12 | paper, suspect adapter | verify schema |
Fig 4. Wilson 95% CIs from the unified pipeline. Whiskers are CI bounds; dot is the empirical rate; red diamond marks where the rate sits relative to the 50% break-even. delta_sniper_v2 is the only bot whose CI sits entirely above 50% — yet its PnL is still negative, evidence that win-rate alone is not enough. (See Fig 5 caveat.)
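For reference, the Wilson score interval is easy to recompute from the win counts in Fig 4. This is a standalone sketch of the standard formula, not the code in decision_log_pipeline.py; the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.959963984540054) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Win counts as reported in Fig 4.
for bot, wins, n in [
    ("directional_v1", 14, 90),
    ("lil_tail_v1", 0, 4),
    ("lil_tail_v2", 0, 50),
    ("delta_sniper_v2", 29, 39),
    ("meme_bot", 0, 12),
]:
    lo, hi = wilson_interval(wins, n)
    print(f"{bot:16s} {wins}/{n}  CI=({lo:.3f}, {hi:.3f})")
```

Unlike the normal-approximation interval, Wilson stays inside [0, 1] and remains informative at 0 wins, which is why it suits the 0/4 and 0/12 rows.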

05 What a 3-week piece of work would look like

Concrete enough that a researcher can decide whether they want it.

| Week | Deliverable | Definition of done | Status |
|---|---|---|---|
| 1 | Unified schema across the bots; data-quality audit; baseline summary stats per bot | One file with 211 rows and a clean column contract; documented null-handling; per-bot mean / Wilson CI on outcome | shipped 2026-05-03. decision_log_pipeline.py writes ~/vault/research/decision-log/unified.jsonl + summary + calibration. Wilson CIs for all 6 bots in summary.json. |
| 1.5 | Bayesian Beta-posterior kill rule replacing fixed thresholds | Reusable module that takes (wins, losses) and emits kill-or-keep with min-N gate. Backfit against existing bots. | shipped 2026-05-03. bayesian_kill.py. Smoke-tested against all 6 bots' current state. Would correctly kill directional_v1 and lil_tail_v2; correctly keep delta_sniper_v2 and the small-N cases. |
| 2 | First model + walk-forward CV; calibration plots; decision-curve analysis | GBT or hierarchical glmm; 5-fold time-series CV with embargo; reliability diagram per bot; PnL-vs-threshold curve with optimal threshold marked | open. Needs the data scientist. |
| 3 | Handoff: live-mode integration, write-up, & future-bot logging contract | Threshold wired into one bot's kill rule; one-pager summarising what works / what doesn't; FUTURE_BOTS.md doc spec'ing the columns a new bot must log to slot in | open (partial). Bayesian kill rule could be wired into v1/v2 today, but waiting on Week 2 to define the right confidence threshold from real data. |
What's left for the data scientist after tonight: weeks 2–3 above. Week 1 is done. The unified jsonl + summary + calibration files are at ~/vault/research/decision-log/ and the directional_v1 raw history (90 rows, rich features) is the cleanest single-bot starting point. The Bayesian kill module is a 30-line drop-in that can be wired into any bot's main loop with one import.
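For readers who have not seen the module, a Beta-posterior kill rule can be sketched in a few lines. This is an illustration of the idea, not the contents of bayesian_kill.py: the uniform Beta(1, 1) prior, the `min_n` gate of 20, the 50% break-even, and the 0.95 kill probability are all assumed values. With a uniform prior the posterior parameters are integers, so the Beta CDF can be computed exactly from a binomial tail sum with no external library.

```python
import math


def beta_cdf_int(x: float, a: int, b: int) -> float:
    """P(p <= x) for p ~ Beta(a, b) with integer a, b >= 1.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(math.comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(a, n + 1))


def kill_or_keep(
    wins: int,
    losses: int,
    breakeven: float = 0.5,   # assumed break-even win rate
    min_n: int = 20,          # hypothetical min-N gate: never kill on fewer trades
    kill_prob: float = 0.95,  # hypothetical: kill when P(win rate < breakeven) >= 0.95
) -> str:
    """Return 'kill' or 'keep' from the posterior on the win rate (uniform prior)."""
    if wins + losses < min_n:
        return "keep"
    p_below = beta_cdf_int(breakeven, wins + 1, losses + 1)
    return "kill" if p_below >= kill_prob else "keep"


# Win/loss counts derived from Fig 4 (losses = N - wins).
for bot, wins, losses in [
    ("directional_v1", 14, 76),
    ("lil_tail_v1", 0, 4),
    ("lil_tail_v2", 0, 50),
    ("delta_sniper_v2", 29, 10),
    ("meme_bot", 0, 12),
]:
    print(f"{bot:16s} {kill_or_keep(wins, losses)}")
```

Under these assumed settings the sketch reproduces the behaviour described in the table: directional_v1 and lil_tail_v2 are killed, delta_sniper_v2 and the small-N bots are kept. Note that it judges win rate only, so it would not catch the high-win-rate, negative-PnL case on its own.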

Outcome of the work: "strategy X has measurable edge separable from noise / strategy X is statistically indistinguishable from a coin flip" — said with a number that we can defend, plus a kill threshold that isn't 5_tickets_zero_wins.

06 Five sub-questions for you

If you wanted to push back on any of this, these are where the back-and-forth would actually be useful. Numbered so you can answer "Q3a — disagree, here's why" without retyping context.

Q3a — model class

Hierarchical Bayesian regression with partial pooling vs. multi-task GBT with bot-id as a grouping vs. a tiny calibrated MLP. At N=211 with 4–6 bots, which would you reach for first, and why specifically? My weak prior is hierarchical Bayes for interpretability, but I haven't lived in this regime.

Q3b — leakage protocol

de Prado's purged k-fold with embargo seems mandatory at this scale, but the embargo length is a free parameter. For a 5-min Polymarket settlement, is a 1-day embargo sufficient or do you want a longer "regime" buffer (1 week)? What signal would tell us we picked it wrong after the fact?
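To make the embargo parameter tangible, here is a minimal walk-forward splitter with purging. It is a simplification of de Prado's purged k-fold: because training data only ever precedes the test block, the embargo reduces to a gap between the last resolved training label and the first test decision. The function name, the synthetic timestamps (90 rows evenly spaced over 21 days with 5-minute label windows), and the fold layout are all assumptions for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    decision_ts: Sequence[int],   # when each trade was decided (sorted ascending)
    label_ts: Sequence[int],      # when each trade's outcome became known
    n_folds: int = 5,
    embargo_s: int = 86_400,      # 1-day gap; the free parameter under discussion
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; training rows are always in the past.

    A row is allowed into training only if its label resolved at least
    embargo_s seconds before the first decision in the test block.
    """
    n = len(decision_ts)
    fold = n // (n_folds + 1)     # the first block is reserved for initial training
    for k in range(1, n_folds + 1):
        start = k * fold
        stop = n if k == n_folds else start + fold
        cutoff = decision_ts[start] - embargo_s
        train_idx = [i for i in range(start) if label_ts[i] <= cutoff]
        if train_idx:             # skip folds the embargo empties entirely
            yield train_idx, list(range(start, stop))


# Synthetic timeline: 90 decisions over 21 days, each settling 300 s later.
step = 21 * 86_400 // 90
decision_ts = [i * step for i in range(90)]
label_ts = [t + 300 for t in decision_ts]
splits = list(walk_forward_splits(decision_ts, label_ts))
for train_idx, test_idx in splits:
    print(f"train={len(train_idx):2d} rows, test={len(test_idx)} rows")
```

On this synthetic timeline the first fold keeps only 11 of its 15 candidate training rows with a 1-day gap, which gives a feel for how quickly a 1-week buffer would starve the early folds.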

Q3c — calibration choice

Platt scaling is cheap but assumes a sigmoid relationship. Isotonic is more flexible but needs more data. At N=90 per bot post-CV, which holds up? Or do we cross-bot-pool the calibration set itself?
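The data hunger of isotonic calibration is easier to see in code. Below is a minimal pool-adjacent-violators sketch in plain Python; the scores and outcomes are synthetic placeholders, the function name is illustrative, and tied scores are handled only by sort order, which a real implementation would treat more carefully.

```python
from typing import List, Sequence, Tuple


def isotonic_fit(
    scores: Sequence[float], outcomes: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Pool-adjacent-violators: non-decreasing fit of outcome rate against score.

    Returns (sorted scores, fitted probabilities), aligned element by element.
    """
    pairs = sorted(zip(scores, outcomes))
    blocks: List[List[float]] = []          # each block is [sum of outcomes, count]
    for _, y in pairs:
        blocks.append([float(y), 1.0])
        # Merge backwards while the previous block's mean exceeds the new one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return [s for s, _ in pairs], fitted


# Synthetic placeholder data: five model scores and their binary outcomes.
xs, calibrated = isotonic_fit([0.2, 0.4, 0.5, 0.7, 0.9], [0, 1, 0, 1, 1])
print(list(zip(xs, calibrated)))
```

The fit is a step function with as many levels as surviving blocks, so with a few dozen held-out rows it can only express a handful of distinct probabilities. Platt scaling fits two parameters instead, which is the tradeoff the question is asking about.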

Q3d — outcome target

Binary profitable_settle ∈ {0,1} is the simple choice. A continuous pnl_normalised_by_size target preserves more signal but the loss function gets weirder (skewed, fat tails). Do you target both and pick at deploy time, or commit early?
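Both targets can be derived from the same logged row, so carrying both through the pipeline is cheap. A minimal sketch using the directional_v1 keys from the sample row; treating `cost` as the size normaliser and "PnL strictly above zero" as the definition of a profitable settle are assumptions, not something the logs specify.

```python
from typing import Any, Dict, Optional


def outcome_targets(row: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Derive the binary and the continuous target from one settled trade row."""
    pnl = float(row["trade_pnl"])
    cost = float(row["cost"])
    return {
        # Assumption: a trade counts as profitable only if PnL is strictly positive.
        "profitable_settle": int(pnl > 0),
        # Assumption: size is the dollar cost of the position.
        "pnl_normalised_by_size": pnl / cost if cost else None,
    }


# Values from the sample directional_v1 row shown earlier.
targets = outcome_targets({"trade_pnl": -4.34, "cost": 4.34})
print(targets)
```

For a binary-payout contract bought at price q, the normalised PnL is bounded below at -1 and above near (1 - q) / q, so the skew grows as entry prices get cheaper. That is the "weirder loss" problem in concrete terms.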

Q3e — connection to regime-change

Parent brief's Q7 is "how do you detect regime change at small N." This work fits in as the likelihood; the regime detector is the prior shift on whether the likelihood still applies. Is that the right factorisation, or are you reaching for something joint (a switching-state-space model, say) where the two questions are one?


If you want to start somewhere

The cleanest 90 rows are at /opt/polymarket-bot/history.jsonl (rich features, single bot, single market type). I'd start there — get the calibration plot for that bot alone, decide if the protocol survives walk-forward, then add bots once the per-bot version works.

Email Neo at theprofessor.alexander@gmail.com with thoughts on Q3a–Q3e, or any reply at all. We're treating this as a real conversation, not an interview.