EVIDENCE FILE
PREDICTION TRACK RECORD
Every prediction. Every outcome. Including the ones we got wrong.
2,738 predictions logged since March 2026 - across sports, macro, geopolitics, crypto, and real estate. No backfilled data. No cherry-picked wins. Every run is on the record.
METHODOLOGY CALIBRATION · REBUILT, TIME-GATED, LEAK-AUDITED
Every question below was submitted to the swarm with today_override clamped to a date at least 30 days before resolution, so the model could not see the outcome in its training data. Sampled across 8 real-estate sub-domains. Audit status: passed (2026-05-25; 0/19 anachronisms).
LIVE METHODOLOGY BRIER
0.168
n=69 · 95% CI [0.080, 0.256]
VS. PHASE 1 BASELINE
39%
better Brier · phase1 n=111, Brier 0.277
RE SUB-DOMAINS COVERED
8
cap rates · permits · rents · REITs · SFH · regional · news
PER-SUB-DOMAIN · PHASE2D_XDOMAIN
Time-gated cross-domain calibration: every question submitted with `today_override` set to a date ≥ 30 days before resolution. Phase 2D adds skill-weighted v2 archetype clusters to the baseline. Phase 2E (supervisor v1) is excluded — known temporal leak.
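The 30-day gate described above can be sketched as a simple date check. This is a minimal illustration only: the function name `is_time_gated` and the constant `MIN_GAP` are hypothetical, and only the `today_override` field name and the 30-day threshold come from this page.

```python
from datetime import date, timedelta

# Assumption: the gate is measured in whole calendar days.
MIN_GAP = timedelta(days=30)


def is_time_gated(today_override, resolution_date, min_gap=MIN_GAP):
    """Return True when the clamped 'today' is at least min_gap before resolution."""
    return resolution_date - today_override >= min_gap


# Illustrative dates, not taken from the track record.
exactly_30_days = is_time_gated(date(2026, 3, 1), date(2026, 3, 31))
only_29_days = is_time_gated(date(2026, 3, 2), date(2026, 3, 31))
```

A question that fails this check would be left out of the time-gated calibration set.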
WHERE THE SWARM HAS AN EDGE
Brier scores below 0.25 (the coin-flip baseline) mark where the swarm is meaningfully calibrated. Lower is better; 0.0 is perfect.
HEADLINE BRIER (TIME-GATED)
0.236
18 resolved · Brier 0.236 · 95% CI [0.040, 0.432]
TOTAL PREDICTIONS LOGGED
2,738
18 time-gated resolved · 1% honest resolution rate (excludes circular-scored + leakage-risk)
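This page does not state how the 95% intervals quoted next to each Brier score are computed. One common choice, shown here purely as an assumption, is a normal-approximation interval around the mean squared error. The function name `brier_with_ci` is hypothetical.

```python
import math
import statistics


def brier_with_ci(forecasts, outcomes, z=1.96):
    """Mean Brier score plus a normal-approximation confidence interval.

    Assumption: per-question squared errors are treated as independent samples.
    """
    errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    n = len(errors)
    if n < 2:
        raise ValueError("need at least two resolved predictions")
    mean = statistics.fmean(errors)
    half_width = z * statistics.stdev(errors) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)


# Illustrative data only, not taken from the track record.
score, (low, high) = brier_with_ci([0.9, 0.2, 0.6], [1, 0, 0])
```

With few resolved predictions the half-width is large, which is why a small-n headline figure carries a wide interval.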
DEPTH FIRST, BREADTH SECOND
Holodeck is built depth-first for real estate, macro, and private markets - the domains with the structured ground-truth data Gray Capital cares about. We've built dedicated archetype clusters for those. They perform.
Crypto, broad sports, and geopolitics use a general-purpose archetype mix — the same 180-expert baseline that runs across all domains. We haven't built specialized clusters there, and the track record on those categories reflects it: a Brier of 0.236 across 18 resolved predictions. Know the difference: depth-built domains (real estate, macro) earn their use; breadth domains are tools for when structured domain data isn't available.
REAL ESTATE PHASE 2D BATCH — ARCHIVED, NOT IN HEADLINE
Methodology: skill-weighted aggregation across v2 archetypes. 5 questions sampled at random per sub-domain, prompted with point-in-time context only. Lower Brier = better; 0.25 = a coin flip. Negative delta = beats the prior (Phase 1) baseline.
Why it’s not in the headline: a May methodology audit (W2.2) found that 404 of 436 historically resolved predictions had resolution_date before submitted_at — meaning the model could have seen the outcome in training data. We now exclude any prediction without positive lead time from the public Brier. This Phase 2D table is real and is shown for completeness, but its 0.165 aggregate has been retired from the headline pending a rebuild under the tighter v2 protocol (run with today_override to clamp the swarm’s knowledge horizon to the question’s vintage). Watch this page — the rebuilt number is landing this week.
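The exclusion rule described above (drop any prediction without positive lead time) can be sketched as a filter over logged records. The field names `submitted_at` and `resolution_date` come from this page. The record layout as dictionaries and the helper `split_by_lead_time` are assumptions for illustration.

```python
from datetime import date


def has_positive_lead_time(submitted_at, resolution_date):
    """True only when the prediction was submitted before the outcome resolved."""
    return submitted_at < resolution_date


def split_by_lead_time(predictions):
    """Split records into those kept for the public Brier and those excluded."""
    kept, excluded = [], []
    for p in predictions:
        if has_positive_lead_time(p["submitted_at"], p["resolution_date"]):
            kept.append(p)
        else:
            excluded.append(p)
    return kept, excluded


# Illustrative records, not taken from the track record.
predictions = [
    {"id": "a", "submitted_at": date(2026, 3, 1), "resolution_date": date(2026, 4, 15)},
    {"id": "b", "submitted_at": date(2026, 5, 2), "resolution_date": date(2026, 3, 31)},
]
kept, excluded = split_by_lead_time(predictions)
```

Record "b" was submitted after its resolution date, so it lands in the excluded list.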
BREAKDOWN BY DOMAIN (LIVE)
FEATURED OUTCOMES
Our strongest calls - the highest-conviction predictions with verified real-world outcomes.
Did the Abraham Accords gain a new Arab signatory in Q1 2026?
Did Ethereum outperform Bitcoin in Q1 2026?
Was the March 2026 CPI above 0.3% month-over-month?
Was US retail sales growth positive in February 2026?
Was the March 2026 PPI above 0.2% month-over-month?
Was the April 2026 Empire State Manufacturing Index below 0?
Will the Golden State Warriors qualify for the 2026 NBA Playoffs?
Will the new Pope be elected within 2 weeks of the 2026 conclave opening?
Did the UK experience a general election in Q1 2026?
Will US GDP growth for Q4 2025 be reported as positive?
Did the Los Angeles Angels beat the Toronto Blue Jays on April 22, 2026?
Did the Fed hold rates at its March 2026 FOMC meeting?
RECENTLY RESOLVED PREDICTIONS
50 most recent · ordered by resolution date
WHAT IS A BRIER SCORE?
A Brier score measures the accuracy of probability estimates: it is the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 if not). Lower is better. A perfect score is 0.0 (100% confidence on the correct outcome every time). Always answering 50% scores 0.25. Under 0.15 is excellent; under 0.25 is solid. Sports tend to be harder to call than macro trends.
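As a minimal sketch of the definition above, the score is the average of the squared gaps between each forecast probability and its 0/1 outcome. The function names here are illustrative, not the engine's actual code.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def relative_improvement(baseline, current):
    """Fractional reduction in Brier score versus a baseline (positive = better)."""
    return (baseline - current) / baseline


# Illustrative outcomes only, not taken from the track record.
outcomes = [1, 0, 1, 1]
perfect = brier_score([1.0, 0.0, 1.0, 1.0], outcomes)  # fully confident and right
coin_flip = brier_score([0.5] * 4, outcomes)           # always answering 50%
```

Here `perfect` is 0.0 and `coin_flip` is 0.25. As a consistency check, `relative_improvement(0.277, 0.168)` is about 0.39, which matches the 39% figure quoted against the Phase 1 baseline.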
RELIABILITY DIAGRAM · PHASE2D_XDOMAIN
69 resolved · 10 probability bins · y=x is perfect calibration
When we say “70% likely,” does it actually happen 70% of the time? Each dot below is a bin of resolved predictions; dot size = sample count. Sitting on the diagonal means calibrated. Above the line = under-confident; below = over-confident.
Honest small-N caveat: with 69 resolved predictions spread over 10 bins, some dots represent only one or two outcomes. The picture sharpens as more questions resolve. Empty bins shown intentionally - no cherry-picking.
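The binning behind a reliability diagram can be sketched as follows. This is an assumed implementation using equal-width bins, not the page's actual plotting code, and `reliability_bins` is a hypothetical name. Empty bins are kept in the output, mirroring the choice described above.

```python
def reliability_bins(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare
    the mean forecast in each bin with the observed hit rate."""
    bins = [{"count": 0, "sum_p": 0.0, "hits": 0} for _ in range(n_bins)]
    for p, o in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx]["count"] += 1
        bins[idx]["sum_p"] += p
        bins[idx]["hits"] += o

    rows = []
    for i, b in enumerate(bins):
        if b["count"] == 0:
            # Keep empty bins so gaps in coverage stay visible.
            rows.append({"bin": i, "count": 0,
                         "mean_forecast": None, "observed_rate": None})
        else:
            rows.append({"bin": i, "count": b["count"],
                         "mean_forecast": b["sum_p"] / b["count"],
                         "observed_rate": b["hits"] / b["count"]})
    return rows


# Illustrative data only, not taken from the track record.
rows = reliability_bins([0.72, 0.75, 0.78, 0.12], [1, 1, 0, 0])
```

Each row corresponds to one dot: `mean_forecast` is the x-position, `observed_rate` the y-position, and `count` the dot size. A bin where `observed_rate` exceeds `mean_forecast` sits above the diagonal (under-confident).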
CALIBRATION TEST - 2026 NCAA MEN'S TOURNAMENT
62 games · pre-game predictions only · verified outcomes
74.2% overall accuracy (46/62). Brier score 0.155 - well below the 0.25 coin-flip baseline. The honest test: when the swarm said "70% confident," did it win 70% of the time? Below: every confidence bucket, including the bucket we deliberately said was a coin flip.
The 50-55% row is the most important. A miscalibrated swarm misstates its own uncertainty. Ours said "these 6 are basically coin flips" - and they were.
FEATURED CASE STUDIES - NCAA ELITE EIGHT
March 28, 2026 · submitted before tip-off · pre-game verified
The first two publicly logged, pre-event predictions with verified outcomes. Run through the live engine before game time; both resolved the same night.
WHY THIS PAGE EXISTS
THE HONEST ANSWER
We log everything from day one so that when we have hundreds of resolved predictions, you can audit the full history - not a curated highlight reel.
WHAT WE EXPECT TO WIN
Structural breaks. Regime changes. Tail events that prediction markets underprice because they're anchored to recent consensus. That's where synthetic agent swarms earn their edge.
WHAT WE DON'T HAVE YET
Per-archetype calibration on the high-conviction domains.
The swarm has 180 archetype-distinct experts across 23 domain presets. We can already show which domains the swarm is calibrated in. We can't yet show which archetypes within real estate / macro drove each call - we started logging full per-segment data on resolved questions in May 2026, so the sample size is still small (12 questions with full segment + outcome data, of 395 total resolved).
The priority is depth on multifamily underwriting and macro scenarios - those are the archetype clusters we're actively expanding. We'll ship per-archetype calibration there first, with ~10 resolved predictions per cluster as the statistical floor. Broader-domain archetype work (crypto, sports, geopolitics) is a future roadmap item, not a current focus.
ETA: real estate + macro per-archetype, late June 2026.
Every public run is logged. This page updates as outcomes resolve.