Two systems we run today. A personal retrieval-and-decision engine called the brain. A 19-agent canvas simulation that doubles as a testbed for proto-cognitive mechanisms. Both work. Both have honest places where machine learning, classical statistics, or quantum-inspired methods would do real work, and a handful of places where they wouldn't. This document is what we'd hand to a researcher before a first conversation, ranked by how far we are from already knowing the answer.
The brain is a local-first retrieval system over ~990 distilled Markdown notes from 730 source files. A small CLI wraps it; a Bayesian confidence ledger weights each note by counterfactual track record; daily crons rebuild the index, surface contradictions, and decay stale notes. ~2,650 decisions logged in the past month feed back into that ledger.
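A minimal sketch of how a Beta-style confidence ledger with decay could look, for readers who want something concrete. The class name, the uniform Beta(1, 1) prior, and the multiplicative decay toward that prior are all assumptions made for illustration; the actual ledger may weight and decay notes differently.

```python
from dataclasses import dataclass


@dataclass
class NoteConfidence:
    """Beta(alpha, beta) confidence for one note (hypothetical structure).

    The Beta(1, 1) starting point is an assumed prior, not taken from the brain's code.
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, helped: bool) -> None:
        # One resolved decision that relied on this note: count it as a success or a failure.
        if helped:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def decay(self, factor: float) -> None:
        # Shrink accumulated evidence toward the prior; factor=1.0 means no decay.
        self.alpha = 1.0 + (self.alpha - 1.0) * factor
        self.beta = 1.0 + (self.beta - 1.0) * factor

    @property
    def mean(self) -> float:
        # Posterior mean, usable as a retrieval weight.
        return self.alpha / (self.alpha + self.beta)


note = NoteConfidence()
for outcome in (True, True, True, False):
    note.update(outcome)
fresh_mean = note.mean      # 4 / 6
note.decay(0.5)             # a stale note drifts back toward 0.5
stale_mean = note.mean      # 2.5 / 4.0
```

Decaying toward the prior, rather than toward zero, keeps a stale note at "unknown" instead of "wrong", which is one plausible reading of what decay should mean here.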
The office is a canvas-2D pixel-art simulation of a 19-agent hedge fund running in the browser. Beneath the visualisation sits a four-block cognitive substrate (cog-A/B/C/D) implementing per-agent persistent memory, an inter-agent message bus, predictive-coding ticks with prediction-error-driven attention reallocation, Bayesian Beta posteriors over preferences, recursive meta-thought, mood/affect vectors with a synchroniser pulling four sub-systems toward a shared mean, and per-agent phenomenological vocabulary tracking. Roughly 100 ideas from a recent design pass have been implemented; more are scaffolded.
Both are honest engineering. Neither pretends to be intelligent in any strong sense. We think there are about a dozen places where genuinely interesting mathematics or ML would change what these systems can do, and ten of those places are open enough that we'd want a real researcher's view rather than a senior engineer's guess.
Files arrive through three paths — a desktop drop-zone watched by a launchd daemon, a browser bookmarklet that POSTs highlighted text to a local capture server on port 11455, and a CLI ingest. Each ingest produces a distillation pass through a local LLM (Llama-3 8B-Instruct on Ollama), a frontmatter block with confidence and decay metadata, and an embedding stored in sqlite-vec. The decisions log is separate — every code edit, every trade gate, every "I'm changing X because Y" call writes a row.
Live at voliaventures.com/office. Press I for the debug HUD; the lamps go green when each cog sub-system fires, and the ticker streams real bus events. The spec was a whitepaper I wrote two weeks ago about "what would be needed to push a 2D agent sim toward a meaningful point on the consciousness spectrum." Roughly 100 ideas from that whitepaper are now in the codebase. The fundamental data structure is a per-agent agent.cog JSON object that survives across sessions in localStorage and is the substrate every other system reads from.
This is the section most relevant to deciding where we're being honest and where we're being optimistic.
| Quantity | How we measure it today | Honest assessment |
|---|---|---|
| Brain retrieval quality | Top-k cosine over sqlite-vec; offline checks against held-out notes I tagged | Works at this scale; we have no end-to-end ranking metric |
| Decision-ledger calibration | Brier score on resolved trade decisions (wins/losses with predicted probability) | Honest. Weekly Brier is logged. Bayesian update rule is α/β with decay |
| Trade strategy edge | Wilson 95% CI on win rate, conditional on entry-price band | Honest. We disable kill switches at small N because the CIs are wide |
| Agent integration (Φ-flavoured) | Heuristic: count of components an action touches per tick | Soft. No real Φ measurement; this is a placeholder |
| Phenomenal vocabulary stability | Per-agent descriptor co-occurrence count across sessions | Soft. Useful signal; no theory of why we'd expect it to converge |
| Predictive surprise | L1 distance between agent's last-tick prediction and observed | Honest. Computable. Drives attention reallocation in cog-C |
| Self-model accuracy | None today — the self_model is a hand-written JSON | missing |
| Cross-bot trade-decision schema | Unified pipeline at scripts/decision_log_pipeline.py ingests all 6 bot histories into ~/vault/research/decision-log/unified.jsonl with Wilson CIs + Beta posteriors | shipped 2026-05-03 |
| Bayesian kill rule | scripts/bayesian_kill.py: Beta posterior over WR with min-N gate. Replaces fixed-threshold rules. | shipped 2026-05-03 |
| Brain ML stack (13 modules) | ~/.openclaw/workspace/brain-system/ml/brainml_*: walk-forward GBT (+19.6% Brier), Platt calibration (ECE 0.21→0.10), decision curves, BM25+RRF, regime-aware rerank, per-domain Brier, Poisson-tail anomaly on decision flow, KL composition drift, model versioning with sha256, retrieval eval harness. Wired via brain ml CLI. | shipped 2026-05-03 |
| Trading-bot ML stack (5 modules) | scripts/ml/xml_*: split-conformal + Mondrian quantiles, Thompson-sampling 8-arm kill bandit, combined drift detector (PSI textbook + ΔWR + sKL on signal_mode), Brier+ECE regression test, meta-gradient over learning rate. Each runs against the unified ~/vault/research/decision-log/unified.jsonl. | shipped 2026-05-03 |
| Office ML stack (10 features) | office.html: Bayesian self-model preferences (5-axis Beta posteriors writing back to self_model.preferences), TD(0) on persuasion via music-vote bus, valence-shaped reservoir, soft-attention with EMA-smoothed temperature, dynamic capacity from affect, phenom-vocab drift detector, latency budget gating per-tick. All persistent via __cog.markDirty() wired through cog-A flush. | shipped 2026-05-03 |
| Audit infrastructure | 2-phase audit pattern (wiring + feature-completeness) ran on the 28 modules above; flagged 2 BLOCKERS + 4 HIGH; all fixed pre-deploy. Reports at ~/Desktop/ml-shipped-audit-{WIRING,FEATURES}-2026-05-03.md. | shipped 2026-05-03 |
The fact that two of these say "soft" and one says "missing" is the honest answer to "what could a researcher actually contribute." Everything below this point is downstream of fixing one of those three. (The "shipped" rows reflect work done after the first conversation; see /research-decision-log for what those artifacts revealed.)
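For reference, the three closed-form quantities named in the table (Brier score, Wilson 95% interval, L1 predictive surprise) can be sketched in a few lines. The function names are illustrative and not taken from the pipeline scripts; the formulas are the standard textbook ones.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is the whole unit range
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


def l1_surprise(predicted, observed):
    """L1 distance between a last-tick prediction vector and what was observed."""
    return sum(abs(p - o) for p, o in zip(predicted, observed))


low, high = wilson_interval(wins=3, n=5)
# At N=5 the interval spans well over half the unit range,
# which is the "CIs are wide" problem described in the table.
```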
We've put these in rough order of what we'd actually start on, not what's most ambitious. Quantum-inspired classical methods (tensor networks, quantum-walk samplers, quantum-cognition probability) are listed where we think they're more than theatre.
Today: hand-rolled templates over a seeded vocabulary. Pattern-repeat is visible after ~30 minutes of watching.
Want: a small model that takes {room, mood-vector, last-3-journal-entries} and emits a single first-person sentence. ONNX runtime in the browser, WebGPU when available. Phi-3-mini, TinyLlama-1B, and Qwen-0.5B-Distill are all candidates. Latency budget: 19 agents each firing roughly once every 90 s works out to about 13 inferences per minute in total (19 × 60 / 90), comfortably under 19. Cost on the user side is a one-time ~500 MB model download, which we'd lazy-load.
Today: fixed kill rules, currently disabled because at N=5 they fired prematurely.
Want: an online change-point detector that distinguishes "noise streak" from "regime shift." Bayesian online change-point (Adams & MacKay 2007), CUSUM with adaptive threshold, or a simpler Beta-Binomial posterior over win-rate that triggers when posterior mass below break-even crosses 0.95. We have ~2,500 trade-decision rows with binary outcomes, market state, and entry conditions — enough to backtest detector candidates.
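The simplest of the three candidates, the Beta-Binomial trigger, can be sketched with the standard library alone. This assumes a uniform Beta(1, 1) prior, a break-even win rate of 0.5, and a minimum sample size of 20; all three are placeholder values chosen for the example, and the function names are hypothetical rather than those in scripts/bayesian_kill.py.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a),
    which avoids any dependency on a numerical library.
    """
    n = a + b - 1
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(a, n + 1))


def should_kill(wins, losses, break_even=0.5, mass=0.95, min_n=20):
    """Trigger when posterior mass below break-even exceeds `mass`.

    break_even, mass, and min_n are assumed defaults for illustration.
    """
    if wins + losses < min_n:
        return False  # min-N gate: never fire on a handful of trades
    p_below = beta_cdf_int(break_even, 1 + wins, 1 + losses)
    return p_below > mass


early_streak = should_kill(wins=1, losses=4)    # gated: N=5 is too small
clear_loser = should_kill(wins=5, losses=25)    # posterior sits well below 0.5
coin_flip = should_kill(wins=15, losses=15)     # mass below 0.5 is exactly half
```

The min-N gate addresses the premature firing at N=5 directly, but it does not distinguish a noise streak from a regime shift; that still needs a change-point method such as the ones named above.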
Today: rule-based gating with hand-tuned thresholds. The features are there; nothing learns from them.
Want: a tiny model — gradient-boosted trees, or a 2-layer MLP — that consumes the decision feature vector (entry price, vol band, time-of-day, seconds-into-window, on-chain liquidity, previous-N-decisions outcome) and predicts probability of profitable settlement. Trained per-strategy. A clear evaluation protocol matters more than model choice — temporal CV with embargo, calibration plots, decision-curve analysis.
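As one possible reading of "temporal CV with embargo", the sketch below generates expanding-window splits over time-ordered rows and drops a fixed number of rows immediately before each test block, so outcomes that overlap the test period cannot leak into training. The function name and the equal-sized-block scheme are assumptions for illustration, not the evaluation protocol of the existing walk-forward code.

```python
def walk_forward_splits(n_rows, n_folds, embargo):
    """Yield (train_indices, test_indices) over rows sorted by time.

    Rows are split into n_folds + 1 equal blocks; the first block is
    train-only, and each later block is tested once against everything
    before it, minus `embargo` rows adjacent to the test block.
    """
    fold = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold
        # The final fold absorbs any remainder rows.
        test_end = n_rows if k == n_folds else test_start + fold
        train_end = max(0, test_start - embargo)
        yield list(range(train_end)), list(range(test_start, test_end))


splits = list(walk_forward_splits(n_rows=100, n_folds=4, embargo=5))
first_train, first_test = splits[0]
# first_train covers rows 0..14, first_test covers rows 20..39;
# rows 15..19 are embargoed.
```

With per-strategy training, the embargo length would presumably be set from the longest settlement window of that strategy, which is a choice this sketch leaves open.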
→ Longer answer in the appendix: /research-decision-log — the real data sizes (it's smaller and more interesting than this section suggests), the schema, a 3-week deliverable spec, and five sub-questions ranked for back-and-forth.
Today: brute-force cosine over 989 chunks. ~30 ms per query.
Want: an HNSW index with optional product quantisation (PQ) or RaBitQ, gated to flip on at ~10 k chunks. We have a build path queued; the question is more nuanced than "when to flip." Specifically: at our scale, how much of the latency budget should go to retrieval vs. re-ranking with a small cross-encoder? Where does ANN error start to dominate retrieval quality?
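One way to make "where does ANN error start to dominate" measurable is to keep the brute-force path as ground truth and report recall@k of any approximate index against it. The sketch below is pure Python with hypothetical function names and toy two-dimensional vectors; it does not use the sqlite-vec API.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def exact_top_k(query, chunks, k):
    """Brute-force top-k chunk ids by cosine similarity (the ground truth)."""
    ranked = sorted(chunks.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0


chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
truth = exact_top_k([1.0, 0.1], chunks, k=2)        # ids of the two nearest chunks
ann_recall = recall_at_k(["a", "b"], truth)          # pretend an ANN index returned a, b
```

Tracking this number alongside a downstream ranking metric would show whether a recall drop at a given index setting actually changes which notes get used.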
Today: each agent's world_model is ~12 components × ~30 dimensions of state, all stored independently per-agent. Storage works; the agents don't share structure.
Want: a low-rank decomposition that lets shared world-model structure live in one tensor and per-agent specialisation be a small adapter. Tensor trains (Oseledets, Khrulkov-Cichocki) are the obvious candidate. Quantum-inspired in the sense of borrowing the math from MPS/PEPS literature without needing actual qubits. Useful only if it improves cross-agent generalisation, which is an empirical claim.
Today: Beta(α, β) posteriors over each agent's track preferences, updated with classical Bayes.
Want: an honest test of whether quantum-probability theory (Busemeyer & Bruza 2012) — non-commutative belief operators, projection onto context-dependent subspaces — predicts our agents' decision biases (order effects, conjunction fallacies) better than classical Bayes. The agents are simulated, so we can ground-truth them. This is a test of whether QP cognition has predictive content at our scale or whether it's mathematical theatre. Either answer is useful.
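To make the order-effect claim concrete: with non-commuting projectors, the probability of answering yes to A and then yes to B generally differs from the reverse order, which classical joint probability cannot produce. The sketch below uses rank-1 projectors in a real two-dimensional space; the angles are arbitrary illustrative values, not fitted to any agent.

```python
import math


def unit(theta):
    """Unit vector at angle theta (radians) in a real 2-D belief space."""
    return (math.cos(theta), math.sin(theta))


def overlap_sq(u, v):
    """Squared inner product, i.e. the Born-rule probability of projecting u onto v."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2


def sequential_yes(state, first, second):
    """P(yes to `first`, then yes to `second`).

    After a yes on `first` the state collapses onto that axis, so the
    second probability is the overlap between the two question axes.
    """
    return overlap_sq(state, first) * overlap_sq(first, second)


psi = unit(0.0)                  # initial belief state
question_a = unit(math.pi / 6)   # 30 degrees from psi
question_b = unit(math.pi / 3)   # 60 degrees from psi

p_a_then_b = sequential_yes(psi, question_a, question_b)   # 0.75 * 0.75
p_b_then_a = sequential_yes(psi, question_b, question_a)   # 0.25 * 0.75
order_effect = p_a_then_b - p_b_then_a
```

A test against the simulated agents would then fit the angles to observed answer frequencies and compare held-out predictive accuracy with a classical Bayesian model that has the same number of free parameters.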
Today: agents pick next-room by score = info-gain + goal-distance with hand-set weights.
Want: full active-inference machinery, meaning a generative model over (state, action, observation), expected free energy decomposed into pragmatic + epistemic value, and action selected by softmax over EFE. This is Friston's framework, and practical libraries (pymdp) exist. The question is what the right level of abstraction is for our agents: are they Friston-grade Markov-blanket entities, or just rule-followers we'd be cargo-culting the framework onto?
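The selection step alone can be sketched as follows, under the simplifying assumption that pragmatic and epistemic values per candidate room are already available as numbers. In a full implementation they would be derived from the generative model. The function names and the precision parameter are illustrative and are not pymdp's API.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(pragmatic, epistemic, precision=1.0):
    """Action distribution from expected free energy (EFE).

    EFE is taken as G = -(pragmatic + epistemic), and actions are chosen
    with probability proportional to exp(-precision * G), so lower G
    (more expected value) means a more likely action.
    """
    efe = [-(p + e) for p, e in zip(pragmatic, epistemic)]
    return softmax([-precision * g for g in efe])


# Three candidate rooms: one goal-relevant, one informative, one neither.
probs = action_probs(pragmatic=[1.0, 0.0, 0.0], epistemic=[0.0, 1.0, 0.0])
```

Structurally this is close to the current info-gain + goal-distance score; the difference active inference would add is that both terms come from one generative model instead of hand-set weights.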
Today: we count component-coupling and call it "integration." It's a soft measurement.
Want: a tractable IIT-flavoured proxy. Approximations exist — ΦR, Φ* (geometric integrated information, Oizumi et al. 2014; Tegmark 2016), causal-emergence measures (Hoel 2017). At 12 components per agent, even an approximate Φ is computationally feasible. The point isn't to claim consciousness — it's to have a number that distinguishes a tightly-coupled simulation from a loosely-coupled one, so we can tell which architectural changes help and which are decoration.
These are the questions where I genuinely don't have an answer I trust. They aren't interview questions. If your answer to any of them is "the question is wrong, here's the right one," that's the most useful possible response.
If you want to read code:
- office.html: the simulation, all of it. ~17 k lines, cleanly tagged with === COG-A === / === COG-B === block markers. The __cog global is the foundation API.
- brain.py quickstart: single CLI entry. brain ask "<question>" shows retrieval; brain decide writes to the ledger.
- ~/vault/memory/PERMANENT_FINDINGS.md: the small set of things we've decided are true.
- ~/vault/memory/KILL_LIST.md: the smaller set of things we've decided are dead. Both are the principled side of avoiding research debt.

If you want to read prose:
Email Neo at theprofessor.alexander@gmail.com. Mention this brief — the inbox is loud, so a one-line subject like research / Q3 / [your name] lands faster than "Hi!".
No formal interview process. The conversation is the conversation. If you'd rather write something first, even better — pick any of the ten questions and tell me what I got wrong about it.