Brief · v0.2 · technical · 2026-05-03

Open problems on the brain & the office sim.

Two systems we run today. A personal retrieval-and-decision engine called the brain. A 19-agent canvas simulation that doubles as a testbed for proto-cognitive mechanisms. Both work. Both have honest places where machine learning, classical statistics, or quantum-inspired methods would do real work, and a handful of places where they wouldn't. This document is what we'd hand to a researcher before a first conversation, ranked by how far we are from already knowing the answer.

Both systems live · ~10 minute read · Prepared by Neo · Volia Ventures
Contents
01 Summary · 02 The brain · 03 The office · 04 Methodology · 05 Where ML / DS / quantum would help · 06 Ten open questions · 07 Code pointers

01 Summary

The brain is a local-first retrieval system over ~990 distilled Markdown notes from ~730 source files. A small CLI wraps it; a Bayesian confidence ledger weights each note by counterfactual track record; daily crons rebuild the index, surface contradictions, and decay stale notes. ~2,650 decisions logged in the past month feed back into that ledger.

The office is a canvas-2D pixel-art simulation of a 19-agent hedge fund running in the browser. Beneath the visualisation sits a four-block cognitive substrate (cog-A/B/C/D) implementing per-agent persistent memory, an inter-agent message bus, predictive-coding ticks with prediction-error-driven attention reallocation, Bayesian Beta posteriors over preferences, recursive meta-thought, mood/affect vectors with a synchroniser pulling four sub-systems toward a shared mean, and per-agent phenomenological vocabulary tracking. Roughly 100 ideas from a recent design pass have been implemented; more are scaffolded.

Both are honest engineering. Neither pretends to be intelligent in any strong sense. We think there are about a dozen places where genuinely interesting mathematics or ML would change what these systems can do, and ten of those places are open enough that we'd want a real researcher's view rather than a senior engineer's guess.

02 The brain: what's already there

Files indexed: 732 (7 domains, .md only)
Vector chunks: 989 (sqlite-vec, brute-force cosine)
Decisions logged: ~2,650 (last 30 days)
Token corpus: ~383k (≈ a thin novel)

Files arrive through three paths — a desktop drop-zone watched by a launchd daemon, a browser bookmarklet that POSTs highlighted text to a local capture server on port 11455, and a CLI ingest. Each ingest produces a distillation pass through a local LLM (Llama-3 8B-Instruct on Ollama), a frontmatter block with confidence and decay metadata, and an embedding stored in sqlite-vec. The decisions log is separate — every code edit, every trade gate, every "I'm changing X because Y" call writes a row.

[Figure: vector chunks indexed, weekly, since launch. X-axis: Mar 1 to May 3. Y-axis: 0 to 1000 chunks. Latest value: 989 (today). Annotation: decay-flagging cron deployed.]
Fig 1. Steeper slope = research weeks. Flat segments = trading-bot sprints. The bend after the 04:00 Sunday decay cron started is real — auto-decay catches stale notes that would otherwise bloat the corpus.
[Figure: brain corpus by namespace, 989 chunks. learning: 577 (58%) · claude: 249 (25%) · live: 113 (11%) · vault: 50 (5%). Legend: learning = ingested research, papers, news; vault = canonical findings (counterfactually validated); claude = working session memory; live = system state snapshots.]
Fig 2. Most chunks are external research. The vault is small on purpose — only findings that survive a counterfactual gate (would I act differently if this turned out wrong?) graduate into it.

03 The office: what's already there

Live at voliaventures.com/office. Press I for the debug HUD; the lamps go green when each cog sub-system fires, the ticker streams real bus events. The spec was a whitepaper I wrote two weeks ago about "what would be needed to push a 2D agent sim toward a meaningful point on the consciousness spectrum." Roughly 100 ideas from that whitepaper are now in the codebase. The fundamental data structure is a per-agent agent.cog JSON object that survives across sessions in localStorage and is the substrate every other system reads from.

[Diagram: the four cog blocks, the affect synchroniser, and the shared substrate.]
cog-A: memory · message bus · attention · scene binding
cog-B: preferences · voting · theory of mind · grudges
cog-C: predictive coding · recursive thought · learning
cog-D: mood · affect · valence · phenom-vocab · culture
Affect Synchroniser: 1 Hz, α=0.20 EWMA over mood / energy / focus / anxiety / boredom
agent.cog: per-agent persistent substrate (localStorage, ~6 kB / agent)
Fields: affect (5d) · working_memory (k=4) · attention_vec · self_model · world_model · phenom_vocab · self_boundary · scene · journal · prediction_errors · valence_log · subjective_dt · integration_proxy_phi · interoceptive_anomaly
Fig 3. Each agent ticks once per second through all four cog blocks. The synchroniser pulls A/C/D affect toward a shared mean — the four blocks would otherwise drift into independent mood machines. The substrate is the only persistent state; everything else regenerates from it.
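One way to read the synchroniser in Fig 3 is as a convex pull of each block's affect toward the cross-block mean, applied once per tick. The sketch below is a minimal illustration under that assumption. Only α=0.20 and the five dimension names come from the figure; the function name `synchronise` and the dict layout are hypothetical, not the code in office.html.

```python
from statistics import fmean

ALPHA = 0.20  # pull strength, taken from Fig 3
DIMS = ("mood", "energy", "focus", "anxiety", "boredom")

def synchronise(blocks, alpha=ALPHA):
    """Return new per-block affect after one tick of pull toward the shared mean.

    `blocks` maps a block name to a dict of affect values (hypothetical layout).
    """
    # Shared mean of each affect dimension across all blocks.
    shared = {d: fmean(b[d] for b in blocks.values()) for d in DIMS}
    # Each block moves a fraction `alpha` of the way toward that mean.
    return {
        name: {d: (1 - alpha) * b[d] + alpha * shared[d] for d in DIMS}
        for name, b in blocks.items()
    }
```

Under this reading the shared mean is preserved and the spread between blocks shrinks by a factor of (1 − α) per tick, so the blocks converge without being forced to agree immediately.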

04 Methodology: what we measure today

This is the section most relevant to deciding where we're being honest and where we're being optimistic.

| Quantity | How we measure it today | Honest assessment |
| --- | --- | --- |
| Brain retrieval quality | Top-k cosine over sqlite-vec; offline checks against held-out notes I tagged | Works at this scale; we have no end-to-end ranking metric |
| Decision-ledger calibration | Brier score on resolved trade decisions (wins/losses with predicted probability) | Honest. Weekly Brier is logged. Bayesian update rule is α/β with decay |
| Trade strategy edge | Wilson 95% CI on win rate, conditional on entry-price band | Honest. We disable kill switches at small N because the CIs are wide |
| Agent integration (Φ-flavoured) | Heuristic: count of components an action touches per tick | soft · No real Φ measurement; this is a placeholder |
| Phenomenal vocabulary stability | Per-agent descriptor co-occurrence count across sessions | soft · Useful signal; no theory of why we'd expect it to converge |
| Predictive surprise | L1 distance between agent's last-tick prediction and observed | Honest. Computable. Drives attention reallocation in cog-C |
| Self-model accuracy | None today; the self_model is a hand-written JSON | missing |
| Cross-bot trade-decision schema | Unified pipeline at scripts/decision_log_pipeline.py ingests all 6 bot histories into ~/vault/research/decision-log/unified.jsonl with Wilson CIs + Beta posteriors | shipped 2026-05-03 |
| Bayesian kill rule | scripts/bayesian_kill.py: Beta posterior over WR with min-N gate. Replaces fixed-threshold rules. | shipped 2026-05-03 |
| Brain ML stack (13 modules) | ~/.openclaw/workspace/brain-system/ml/brainml_*: walk-forward GBT (+19.6% Brier), Platt calibration (ECE 0.21→0.10), decision curves, BM25+RRF, regime-aware rerank, per-domain Brier, Poisson-tail anomaly on decision flow, KL composition drift, model versioning with sha256, retrieval eval harness. Wired via brain ml CLI. | shipped 2026-05-03 |
| Trading-bot ML stack (5 modules) | scripts/ml/xml_*: split-conformal + Mondrian quantiles, Thompson-sampling 8-arm kill bandit, combined drift detector (PSI textbook + ΔWR + sKL on signal_mode), Brier+ECE regression test, meta-gradient over learning rate. Each runs against the unified ~/vault/research/decision-log/unified.jsonl. | shipped 2026-05-03 |
| Office ML stack (10 features) | office.html: Bayesian self-model preferences (5-axis Beta posteriors writing back to self_model.preferences), TD(0) on persuasion via music-vote bus, valence-shaped reservoir, soft-attention with EMA-smoothed temperature, dynamic capacity from affect, phenom-vocab drift detector, latency budget gating per-tick. All persistent via __cog.markDirty() wired through cog-A flush. | shipped 2026-05-03 |
| Audit infrastructure | 2-phase audit pattern (wiring + feature-completeness) ran on the 28 modules above; flagged 2 BLOCKERS + 4 HIGH; all fixed pre-deploy. Reports at ~/Desktop/ml-shipped-audit-{WIRING,FEATURES}-2026-05-03.md. | shipped 2026-05-03 |
The fact that two of these say "soft" and one says "missing" is the honest answer to "what could a researcher actually contribute." Everything below this point is downstream of fixing one of those three. (The rows marked "shipped" reflect work done after the first conversation; see /research-decision-log for what those artifacts revealed.)
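For reference, the two measurements marked "Honest" for trading, the Brier score and the Wilson 95% interval, are small enough to write out. This is a generic stdlib sketch of the standard formulas, not the code in scripts/decision_log_pipeline.py; the function names are illustrative.

```python
from math import sqrt

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is the whole range
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

A forecaster who always says 0.5 scores a Brier of 0.25, which is the baseline any calibrated model has to beat. `wilson_interval(0, 5)` comes out at roughly (0.0, 0.43), which shows how wide the interval stays at small N.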

05 Where ML, data science, or quantum methods would help

We've put these in rough order of what we'd actually start on, not what's most ambitious. Quantum-inspired classical methods (tensor networks, quantum-walk samplers, quantum-cognition probability) are listed where we think they're more than theatre.

5.1 — In-browser per-agent thought generation

Today: hand-rolled templates over a seeded vocabulary. Pattern-repeat is visible after ~30 minutes of watching.

Want: a small model that takes {room, mood-vector, last-3-journal-entries} and emits a single first-person sentence. ONNX runtime in the browser, WebGPU when available. Phi-3-mini, TinyLlama-1B, and Qwen-0.5B-Distill are all candidates. Latency budget: 19 agents each firing roughly once every 90 s comes to about 13 inferences per minute in total. Cost on the user side is a one-time ~500 MB model download, which we'd lazy-load.

5.2 — Bayesian regime-change detector on PnL

Today: fixed kill rules, currently disabled because at N=5 they fired prematurely.

Want: an online change-point detector that distinguishes "noise streak" from "regime shift." Bayesian online change-point (Adams & MacKay 2007), CUSUM with adaptive threshold, or a simpler Beta-Binomial posterior over win-rate that triggers when posterior mass below break-even crosses 0.95. We have ~2,500 trade-decision rows with binary outcomes, market state, and entry conditions — enough to backtest detector candidates.
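The simplest of the three candidates, the Beta-Binomial trigger, fits in a few lines. The sketch below assumes a uniform Beta(1, 1) prior and a placeholder break-even of 0.5; both are assumptions, as are the function names. It uses the binomial-tail identity for the Beta CDF, which holds for integer parameters, so it needs nothing beyond the standard library.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

def posterior_mass_below(wins, losses, break_even=0.5):
    """P(win rate < break_even | data) under a uniform Beta(1, 1) prior."""
    return beta_cdf_int(break_even, 1 + wins, 1 + losses)

def should_kill(wins, losses, break_even=0.5, threshold=0.95):
    """True when the posterior mass below break-even exceeds the threshold."""
    return posterior_mass_below(wins, losses, break_even) > threshold
```

With these defaults, five straight losses already put about 98% of the posterior mass below break-even, so this rule alone would fire at N=5. The choice of prior and any minimum-N gate carry most of the weight, and that is the part worth a researcher's opinion.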

5.3 — Outcome model on the trade-decision log

Today: rule-based gating with hand-tuned thresholds. The features are there; nothing learns from them.

Want: a tiny model — gradient-boosted trees, or a 2-layer MLP — that consumes the decision feature vector (entry price, vol band, time-of-day, seconds-into-window, on-chain liquidity, previous-N-decisions outcome) and predicts probability of profitable settlement. Trained per-strategy. A clear evaluation protocol matters more than model choice — temporal CV with embargo, calibration plots, decision-curve analysis.
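To make "temporal CV with embargo" concrete, here is a minimal walk-forward splitter. It assumes rows are already sorted by decision time and expresses the embargo as a number of rows dropped between the training window and each test block; a time-based embargo would be the more faithful choice if settlement delays vary. The function name and default values are hypothetical.

```python
def walk_forward_splits(n_rows, n_folds=5, embargo=10, min_train=50):
    """Yield (train_idx, test_idx) pairs over time-ordered rows.

    Training uses every row before the test block except the last `embargo`
    rows, so outcomes still settling at the boundary cannot leak into training.
    """
    fold = (n_rows - min_train) // n_folds
    if fold <= 0:
        raise ValueError("not enough rows for the requested folds")
    for k in range(n_folds):
        test_start = min_train + k * fold
        # The last fold absorbs any remainder rows.
        test_end = test_start + fold if k < n_folds - 1 else n_rows
        train_end = max(0, test_start - embargo)
        yield list(range(train_end)), list(range(test_start, test_end))
```

Each fold trains only on the past and tests on the next block, so the per-fold Brier scores and calibration plots reflect what the model would have known at the time.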

→ Longer answer in the appendix: /research-decision-log — the real data sizes (it's smaller and more interesting than this section suggests), the schema, a 3-week deliverable spec, and five sub-questions ranked for back-and-forth.

5.4 — HNSW & quantisation for the brain at 10–100k

Today: brute-force cosine over 989 chunks. ~30 ms per query.

Want: an HNSW index with optional product quantisation (PQ) or RaBitQ, gated to flip on at ~10 k chunks. We have a build path queued; the question is more nuanced than "when to flip." Specifically: at our scale, how much of the latency budget should go to retrieval vs. re-ranking with a small cross-encoder? Where does ANN error start to dominate retrieval quality?
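One cheap way to see where ANN error starts to matter is to keep the brute-force ranking as ground truth and measure recall@k of any approximate index against it. The sketch below is plain Python over in-memory vectors, not the sqlite-vec query path; the function names and the `chunks` dict layout are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def exact_top_k(query, chunks, k):
    """Brute-force cosine ranking; `chunks` maps chunk id -> embedding."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate index also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0
```

Running a fixed query set through both paths and tracking recall@k as the corpus grows would give an empirical answer to the "when to flip" question, separate from any latency measurement.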

5.5 — Tensor-network compression of agent world-models

Today: each agent's world_model is ~12 components × ~30 dimensions of state, all stored independently per-agent. Storage works; the agents don't share structure.

Want: a low-rank decomposition that lets shared world-model structure live in one tensor and per-agent specialisation be a small adapter. Tensor trains (Oseledets, Khrulkov-Cichocki) are the obvious candidate. Quantum-inspired in the sense of borrowing the math from MPS/PEPS literature without needing actual qubits. Useful only if it improves cross-agent generalisation, which is an empirical claim.

5.6 — Quantum-cognition probability for preference posteriors

Today: Beta(α, β) posteriors over each agent's track preferences, updated with classical Bayes.

Want: an honest test of whether quantum-probability theory (Busemeyer & Bruza 2012) — non-commutative belief operators, projection onto context-dependent subspaces — predicts our agents' decision biases (order effects, conjunction fallacies) better than classical Bayes. The agents are simulated, so we can ground-truth them. This is a test of whether QP cognition has predictive content at our scale or whether it's mathematical theatre. Either answer is useful.

5.7 — Active inference / free-energy goal selection

Today: agents pick next-room by score = info-gain + goal-distance with hand-set weights.

Want: full active-inference machinery — a generative model over (state, action, observation), expected free energy decomposed into pragmatic + epistemic value, action selected by softmax over EFE. Friston's framework. Practical libraries (pymdp) exist. Question is what the right level of abstraction is for our agents — are they Friston-grade Markov-blanket entities, or just rule-followers we'd be cargo-culting onto?
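The last step of that pipeline, a softmax over negative expected free energy, is the easy part and can be sketched on its own. The example below assumes the pragmatic and epistemic values per action are already computed and supplied by hand. It does not build a generative model, which is where a library such as pymdp would come in. The function name, the `precision` parameter, and the input layout are hypothetical.

```python
from math import exp

def efe_policy(actions, precision=1.0):
    """Action probabilities from a softmax over negative expected free energy.

    `actions` maps an action name to (pragmatic_value, epistemic_value).
    Expected free energy is taken as G = -(pragmatic + epistemic), so
    higher combined value means lower G and higher probability.
    """
    neg_g = {a: precision * (p + e) for a, (p, e) in actions.items()}
    top = max(neg_g.values())  # subtract the max for numerical stability
    weights = {a: exp(v - top) for a, v in neg_g.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}
```

In this toy form the current score (info-gain plus goal-distance) maps roughly onto epistemic plus pragmatic value, and the hand-set weights become the open question of how to derive both terms from a model rather than tune them.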

5.8 — Φ-proxy measurement on the cog substrate

Today: we count component-coupling and call it "integration." It's a soft measurement.

Want: a tractable IIT-flavoured proxy. Approximations exist — ΦR, Φ* (geometric integrated information, Oizumi et al. 2014; Tegmark 2016), causal-emergence measures (Hoel 2017). At 12 components per agent, even an approximate Φ is computationally feasible. The point isn't to claim consciousness — it's to have a number that distinguishes a tightly-coupled simulation from a loosely-coupled one, so we can tell which architectural changes help and which are decoration.

06 Ten open questions

These are the questions where I genuinely don't have an answer I trust. They aren't interview questions. If your answer to any of them is "the question is wrong, here's the right one," that's the most useful possible response.

Q1 — measurement
Is there a Φ-proxy that's tractable at 12 components × 19 agents × 60 fps and actually means something?
The IIT 3.0 partition lattice is intractable above ~6 elements. Approximations (ΦR, ΦG, ΦH) trade rigor for compute. At our scale, do any of them produce numbers that correlate with the architectural changes we'd want to optimise for? Or do we need a different framework entirely (causal emergence, partial information decomposition)?
Q2 — foundations
Free-energy principle vs. predictive processing — same theory or different?
Friston-FEP claims to subsume PP. Andy Clark and Jakob Hohwy seem to think they're separable. When we implement "active inference" do we actually need full FEP machinery (Markov blankets, ergodicity) or is plain prediction-error minimisation sufficient for our scale? More concretely — do we lose something specific by using PP without the variational-Bayes wrapper?
Q3 — falsifiability
When agents emit phenomenal reports, is there any principled way to falsify "they're experiencing something" vs. "they're computing tokens that look like reports"?
This is the hard problem applied to sims. I don't expect an answer to consciousness; I'd be happy with a falsifiable test that distinguishes the two cases for our agents specifically. Behavioural Turing-style tests don't seem to. Φ doesn't quite either. Is there a third direction — predictive coherence under perturbation, perhaps?
Q4 — quantum cognition
Does quantum-probability theory have empirical content for an artificial agent, or is it post-hoc curve-fitting?
Busemeyer's quantum-cognition program predicts certain order effects in human decisions that classical Bayes doesn't. Our agents are deterministic; we can ground-truth them. If we model their preferences with QP and don't see order effects, the framework is theatre at our scale. If we do, it's not. What experimental design would actually settle this for a synthetic system?
Q5 — universal approximation
Why bother with cognitive structure if a sufficiently large MLP can fit any function our cog systems compute?
Inductive bias for sample efficiency is the standard answer. But at our compute budget, sample efficiency isn't the bottleneck. Is there something the structure (separate self-model, world-model, attention) does that no monolithic MLP can — interpretability counts as something — or are we picking structure for aesthetic reasons we should be honest about?
Q6 — embedding topology
Is hyperbolic-space embedding meaningfully better than Euclidean for our brain corpus, or is the corpus too flat?
Poincaré-ball embeddings (Nickel & Kiela 2017) represent tree-like hierarchies with low distortion in far fewer dimensions than Euclidean embeddings need. Our notes have a clear hierarchy (research → finding → kill-list-entry). At ~1k chunks the gain might be invisible. At what corpus size does it start to matter, and is there a cheap signal (graph clustering coefficient, maybe) that would tell us before we commit?
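For context, the distance function on the Poincaré ball is short enough to state directly. This is the standard formula, written as a stdlib sketch; the function name is illustrative and nothing here is part of the brain's current embedding path.

```python
from math import acosh

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1 - sq_norm(u)) * (1 - sq_norm(v))
    if denom <= 0:
        raise ValueError("points must lie strictly inside the unit ball")
    return acosh(1 + 2 * diff / denom)
```

Distances grow without bound as points approach the boundary, which is what leaves room for many leaves under each parent. Whether a ~1k-chunk corpus has enough depth to use that room is the open part of the question.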
Q7 — small-N edge detection
A trading bot with 5 / 5 losses. The Wilson 95% CI on win rate is [0%, 43%] — statistically that's noise. But the streak feels meaningful. What's a principled middle ground?
"Wait until N=600" is the textbook answer and it costs $600. "Pull at N=5" is what feels right and overreacts to noise constantly. Half-Cauchy prior on the rate? Bayesian change-point with broad priors? Sequential probability-ratio test? What does an honest mid-N decision policy look like for someone who can't run 600 trials?
Q8 — stochasticity
Does our cog substrate need genuine stochasticity (quantum noise, hardware RNG) or is pseudo-randomness sufficient?
Most predictive-processing accounts treat noise as a feature — variational sampling, exploration. Our PRNG is a Mersenne Twister; the agents are technically deterministic given a seed. Is there an experiment that would show qualitative behaviour differences with true entropy (atomic decay, vacuum fluctuation), or is this a distinction without a difference at our scale?
Q9 — sleep and consolidation
Human memory consolidation has a sleep cycle. Our brain has a 03:00 cron. Is one a degenerate version of the other, or are we missing something deeper?
REM/NREM cycles do replay-and-prune in patterns the brain RAG's nightly cron doesn't model — selective replay, theta-gamma coupling, dream-state recombination. If we wanted the brain's consolidation to actually mimic something biological rather than aesthetically resemble it, what's the minimum mechanism that would buy us a real benefit?
Q10 — the dumb practical one
If we wanted to hand you a real slice of work — a piece, not a "playground task" — what would it look like in your hands?
I'd rather ask you what's interesting than guess. The slices I can imagine are: (a) the Bayesian regime-change detector for trading, (b) the Φ-proxy measurement implementation, (c) the in-browser thought generator with ONNX. But you might see a fourth I'm not naming because I don't have your context. What would you pick up if it were yours?

07 Code & references

If you want to read code:

If you want to read prose:


Contact

Email Neo at theprofessor.alexander@gmail.com. Mention this brief — the inbox is loud, so a one-line subject like research / Q3 / [your name] lands faster than "Hi!".

No formal interview process. The conversation is the conversation. If you'd rather write something first, even better — pick any of the ten questions and tell me what I got wrong about it.