HOLODECK by DarkGray

EVIDENCE FILE

PREDICTION TRACK RECORD

Every prediction. Every outcome. Including the ones we got wrong.

2,738 predictions logged since March 2026 - across sports, macro, geopolitics, crypto, and real estate. No backfilled data. No cherry-picked wins. Every run is on the record.

2,738 PREDICTIONS LOGGED · 18 RESOLVED (TIME-GATED) · 0.236 BRIER - LIVE · REBUILDING SAMPLE - METHODOLOGY V2

METHODOLOGY CALIBRATION · REBUILT, TIME-GATED, LEAK-AUDITED

Every question below was submitted to the swarm with today_override clamped to a date at least 30 days before resolution, so the model could not see the outcome in its training data. Sampled across 8 real-estate sub-domains. Audit status: passed (2026-05-25; 0/19 anachronisms). See the full anachronism scan →
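The time gate described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: only `today_override` and the 30-day lead come from this page, while `MIN_LEAD_DAYS`, `clamp_today_override`, and `is_time_gated` are hypothetical names chosen for the example.

```python
from datetime import date, timedelta

MIN_LEAD_DAYS = 30  # lead time stated on this page; the constant name is hypothetical


def clamp_today_override(resolution_date: date, requested: date) -> date:
    """Return a knowledge-horizon date at least MIN_LEAD_DAYS before resolution."""
    latest_allowed = resolution_date - timedelta(days=MIN_LEAD_DAYS)
    return min(requested, latest_allowed)


def is_time_gated(today_override: date, resolution_date: date) -> bool:
    """True when the horizon sits at least MIN_LEAD_DAYS before resolution."""
    return (resolution_date - today_override).days >= MIN_LEAD_DAYS


resolution = date(2026, 5, 15)
horizon = clamp_today_override(resolution, requested=date(2026, 5, 1))
print(horizon, is_time_gated(horizon, resolution))  # 2026-04-15 True
```

A requested horizon that is already early enough passes through unchanged; one that is too close to resolution is pulled back to the latest allowed date.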

LIVE METHODOLOGY BRIER

0.168

n=69 · 95% CI [0.080, 0.256]

VS. PHASE 1 BASELINE

39%

better Brier · phase1 n=111, Brier 0.277

RE SUB-DOMAINS COVERED

8

cap rates · permits · rents · mortgage rates · REITs · SFH · regional · news

METHODOLOGY | N | BRIER (EXT) | BRIER (MED)
Phase 1 - baseline (no v2 archetypes) | 111 | 0.277 | 0.241
Phase 2D - v2 archetypes + skill weighting (within-domain) | 21 | 0.194 | 0.196
Phase 2D - v2 archetypes + skill weighting (cross-domain) ← HEADLINE | 69 | 0.168 | 0.175

PER-SUB-DOMAIN · PHASE2D_XDOMAIN

SUB-DOMAIN | N | BRIER (EXT) | AVG P(YES) | ACTUAL YES RATE
Construction Permits | 2 | 0.012 | 0.46 | 0.50
Commercial RE Cap Rates | 4 | 0.030 | 0.34 | 0.25
Regional Submarket | 4 | 0.051 | 0.60 | 0.75
Single Family Housing | 9 | 0.087 | 0.23 | 0.33
Multifamily Rent Vacancy | 4 | 0.096 | 0.54 | 0.75
Mortgage Rates Treasury Fed | 40 | 0.210 | 0.39 | 0.50
REIT Performance | 3 | 0.247 | 0.48 | 0.00
Major News Events | 3 | 0.316 | 0.52 | 1.00

Time-gated cross-domain calibration: every question was submitted with `today_override` set to a date ≥ 30 days before resolution. Phase 2D adds skill-weighted v2 archetype clusters to the baseline. Phase 2E (supervisor v1) is excluded because of a known temporal leak.
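As a check on how the per-sub-domain figures roll up, the sketch below pools them with an N-weighted mean. It assumes the headline is a plain per-question average, which this page does not state outright; the function name `pooled_brier` is hypothetical, and the rows are copied from the PHASE2D_XDOMAIN table.

```python
# (sub-domain, N, Brier) copied from the PHASE2D_XDOMAIN table on this page.
rows = [
    ("Construction Permits", 2, 0.012),
    ("Commercial RE Cap Rates", 4, 0.030),
    ("Regional Submarket", 4, 0.051),
    ("Single Family Housing", 9, 0.087),
    ("Multifamily Rent Vacancy", 4, 0.096),
    ("Mortgage Rates Treasury Fed", 40, 0.210),
    ("REIT Performance", 3, 0.247),
    ("Major News Events", 3, 0.316),
]


def pooled_brier(rows):
    """N-weighted mean of per-group Brier scores (hypothetical helper)."""
    total = sum(n for _, n, _ in rows)
    return sum(n * b for _, n, b in rows) / total


total_n = sum(n for _, n, _ in rows)
headline = pooled_brier(rows)
phase1_brier = 0.277  # Phase 1 baseline from the methodology table

print(total_n, round(headline, 3))  # 69 0.168
print(f"{(phase1_brier - headline) / phase1_brier:.0%}")  # 39%
```

Under that assumption the pooled value matches the 0.168 headline and the 39% improvement over Phase 1. It also shows that the 40 mortgage-rate questions carry most of the weight.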

WHERE THE SWARM HAS AN EDGE

Domains where the swarm's Brier score is below 0.25 (the coin-flip baseline), meaning it is meaningfully calibrated. Lower is better; 0.0 is perfect.

HEADLINE BRIER (TIME-GATED)

0.236

18 resolved · Brier 0.236 · 95% CI [0.040, 0.432]
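This page does not say how its confidence interval is computed. One common choice is a normal-approximation interval around the mean of the per-question Brier scores; the sketch below shows that approach only, under that assumption, and it will not necessarily reproduce the bracket above. `brier_mean_ci` is a hypothetical name and the input scores are illustrative, not taken from the log.

```python
import math
from statistics import mean, stdev


def brier_mean_ci(scores, z=1.96):
    """Mean Brier score with a normal-approximation confidence interval.

    scores: per-question Brier scores (squared errors), at least two values.
    z: critical value; 1.96 corresponds to roughly 95% coverage.
    """
    m = mean(scores)
    se = stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    return m, m - z * se, m + z * se


# Illustrative scores only, not taken from this page.
toy_scores = [0.0, 0.25, 0.25, 0.5]
m, lo, hi = brier_mean_ci(toy_scores)
print(f"Brier {m:.3f} · 95% CI [{lo:.3f}, {hi:.3f}]")
```

With very small samples the approximation is loose and the lower bound can dip below 0; a bootstrap over questions is a frequently used alternative.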

TOTAL PREDICTIONS LOGGED

2,738

18 time-gated resolved · 1% honest resolution rate (excludes circular-scored + leakage-risk)

DEPTH FIRST, BREADTH SECOND

Holodeck is built depth-first for real estate, macro, and private markets - the domains with the structured ground-truth data Gray Capital cares about. We've built dedicated archetype clusters for those. They perform.

Crypto, broad sports, and geopolitics use a general-purpose archetype mix — the same 180-expert baseline that runs across all domains. We haven't built specialized clusters there, and the track record on those categories reflects it: a Brier of 0.236 across 18 resolved predictions. Know the difference: depth-built domains (real estate, macro) earn their use; breadth domains are tools for when structured domain data isn't available.

REAL ESTATE PHASE 2D BATCH — ARCHIVED, NOT IN HEADLINE

Methodology: skill-weighted aggregation across v2 archetypes. 5 questions sampled at random per sub-domain, prompted with point-in-time context only. Lower Brier = better; 0.25 = a coin flip. Negative delta = beats the prior (Phase 1) baseline.

Why it’s not in the headline: a May methodology audit (W2.2) found that 404 of 436 historically resolved predictions had resolution_date before submitted_at — meaning the model could have seen the outcome in training data. We now exclude any prediction without positive lead time from the public Brier. This Phase 2D table is real and is shown for completeness, but its 0.165 aggregate has been retired from the headline pending a rebuild under the tighter v2 protocol (run with today_override to clamp the swarm’s knowledge horizon to the question’s vintage). Watch this page — the rebuilt number is landing this week.
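The lead-time rule from the audit can be expressed as a simple filter. This is a sketch under stated assumptions: the field names `submitted_at` and `resolution_date` come from the paragraph above, while the `Prediction` class, the helper functions, and the sample records are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Prediction:
    question: str
    submitted_at: datetime
    resolution_date: datetime


def has_positive_lead_time(p: Prediction) -> bool:
    """True only when the question resolved after the prediction was submitted."""
    return p.resolution_date > p.submitted_at


def split_by_lead_time(predictions):
    """Return (kept, excluded) lists for the public Brier."""
    kept = [p for p in predictions if has_positive_lead_time(p)]
    excluded = [p for p in predictions if not has_positive_lead_time(p)]
    return kept, excluded


# Illustrative records only, not real log entries.
sample = [
    Prediction("resolved after submission", datetime(2026, 4, 1), datetime(2026, 5, 15)),
    Prediction("resolved before submission", datetime(2026, 5, 20), datetime(2026, 4, 30)),
]
kept, excluded = split_by_lead_time(sample)
print(len(kept), len(excluded))  # 1 1
```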

SUB-DOMAIN | N | PHASE 1 | PHASE 2D | Δ VS BASELINE
Commercial cap rates | 5 | 0.227 | 0.098 | -0.129
Multifamily rent & vacancy | 5 | 0.209 | 0.133 | -0.075
REIT performance | 5 | 0.289 | 0.242 | -0.047
Construction permits | 5 | 0.189 | 0.176 | -0.013
Regional submarkets | 5 | 0.138 | 0.127 | -0.011
Single-family housing | 5 | 0.144 | 0.140 | -0.005
Major news events | 5 | 0.242 | 0.241 | -0.001
OVERALL | 35 | 0.206 | 0.165 | -0.040

BREAKDOWN BY DOMAIN (LIVE)

DOMAIN | TOTAL | RESOLVED | AVG BRIER | AVG PROB
Macro | 602 | 15 | 0.277 | 63%
Geopolitics | 432 | 3 | 0.03 | 85%
Sports | 425 | 0 | - | -
Real Estate | 378 | 0 | - | -
Crypto | 355 | 0 | - | -
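The per-domain averages can be recomputed from the rows in RECENTLY RESOLVED PREDICTIONS further down this page. The sketch below assumes the listed probability is the forecast for YES, which matches the per-row Brier values shown there; the variable and function names are hypothetical.

```python
from collections import defaultdict

# (domain, forecast P(YES), outcome) in the order listed under
# RECENTLY RESOLVED PREDICTIONS; outcome is 1 for YES and 0 for NO.
resolved = [
    ("Macro", 0.58, 1), ("Geopolitics", 0.92, 1), ("Macro", 0.58, 1),
    ("Macro", 0.62, 1), ("Geopolitics", 0.72, 1), ("Geopolitics", 0.92, 1),
    ("Macro", 0.55, 1), ("Macro", 0.68, 1), ("Macro", 0.65, 1),
    ("Macro", 0.67, 1), ("Macro", 0.68, 1), ("Macro", 0.72, 0),
    ("Macro", 0.32, 0), ("Macro", 0.65, 0), ("Macro", 0.65, 0),
    ("Macro", 0.72, 0), ("Macro", 0.72, 0), ("Macro", 0.72, 0),
]


def brier(probability_yes, outcome):
    """Squared error of a single probability forecast."""
    return (probability_yes - outcome) ** 2


scores = defaultdict(list)
for domain, p, y in resolved:
    scores[domain].append(brier(p, y))

by_domain = {d: sum(v) / len(v) for d, v in scores.items()}
overall = sum(sum(v) for v in scores.values()) / len(resolved)

for d, b in by_domain.items():
    print(d, len(scores[d]), round(b, 3))
print("All", len(resolved), round(overall, 3))
```

Run as written, this gives 0.277 over 15 Macro questions, 0.03 over 3 Geopolitics questions, and 0.236 over all 18, in line with the table above and the time-gated headline.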

FEATURED OUTCOMES

Our strongest calls - highest conviction predictions with verified real-world outcomes.

GEOPOLITICS · 🎯 Accurate
Did the Abraham Accords gain a new Arab signatory in Q1 2026?
Swarm called 5% · ✗ NO · Brier 0.003

CRYPTO · 🎯 Accurate
Did Ethereum outperform Bitcoin in Q1 2026?
Swarm called 5% · ✗ NO · Brier 0.003

MACRO · 🎯 Accurate
Was the March 2026 CPI above 0.3% month-over-month?
Swarm called 95% · ✓ YES · Brier 0.003

MACRO · 🎯 Accurate
Was US retail sales growth positive in February 2026?
Swarm called 95% · ✓ YES · Brier 0.003

MACRO · 🎯 Accurate
Was the March 2026 PPI above 0.2% month-over-month?
Swarm called 95% · ✓ YES · Brier 0.003

MACRO · 🎯 Accurate
Was the April 2026 Empire State Manufacturing Index below 0?
Swarm called 95% · ✓ YES · Brier 0.003

SPORTS · 🎯 Accurate
Will the Golden State Warriors qualify for the 2026 NBA Playoffs?
Swarm called 7% · ✗ NO · Brier 0.005

GEOPOLITICS · 🎯 Accurate
Will the new Pope be elected within 2 weeks of the 2026 conclave opening?
Swarm called 92% · ✓ YES · Brier 0.006

GEOPOLITICS · 🎯 Accurate
Did the UK experience a general election in Q1 2026?
Swarm called 8% · ✗ NO · Brier 0.006

ECONOMICS · 🎯 Accurate
Will US GDP growth for Q4 2025 be reported as positive?
Swarm called 85% · ✓ YES · Brier 0.022

SPORTS · 🎯 Accurate
Did the Los Angeles Angels beat the Toronto Blue Jays on April 22, 2026?
Swarm called 85% · ✓ YES · Brier 0.023

MACRO · 🎯 Accurate
Did the Fed hold rates at its March 2026 FOMC meeting?
Swarm called 85% · ✓ YES · Brier 0.023

RECENTLY RESOLVED PREDICTIONS

50 most recent · ordered by resolution date
QUESTION | RESOLUTION DATE | DOMAIN | PROBABILITY | OUTCOME | BRIER
Will the Cleveland Fed's nowcast for May 2026 CPI show above 0.25% MoM? | 2026-05-15 | Macro | 58% | ✓ YES | 0.176
Will the new Pope be elected within 2 weeks of the 2026 conclave opening? | 2026-05-15 | Geopolitics | 92% | ✓ YES | 0.006
Will the Cleveland Fed's nowcast for May 2026 CPI show above 0.25% MoM? | 2026-05-15 | Macro | 58% | ✓ YES | 0.176
Will the Cleveland Fed's nowcast for May 2026 CPI show above 0.25% MoM? | 2026-05-15 | Macro | 62% | ✓ YES | 0.144
Will the new Pope be elected within 2 weeks of the 2026 conclave opening? | 2026-05-15 | Geopolitics | 72% | ✓ YES | 0.078
Will the new Pope be elected within 2 weeks of the 2026 conclave opening? | 2026-05-15 | Geopolitics | 92% | ✓ YES | 0.006
Will the Fed cut rates at its May 6, 2026 FOMC meeting? | 2026-05-07 | Macro | 55% | ✓ YES | 0.202
Will the May 2026 FOMC statement signal more than 2 total cuts for 2026? | 2026-05-07 | Macro | 68% | ✓ YES | 0.102
Will the May 2026 FOMC statement signal more than 2 total cuts for 2026? | 2026-05-07 | Macro | 65% | ✓ YES | 0.122
Will the May 2026 FOMC statement signal more than 2 total cuts for 2026? | 2026-05-07 | Macro | 67% | ✓ YES | 0.109
Will the April 2026 jobs report show more than 100,000 new payrolls? | 2026-05-01 | Macro | 68% | ✓ YES | 0.102
Did Q1 2026 GDP growth come in positive (above 0%)? | 2026-04-30 | Macro | 72% | ✗ NO | 0.518
Will China's GDP growth come in below 4% for Q1 2026? | 2026-04-30 | Macro | 32% | ✗ NO | 0.102
Will China's GDP growth come in below 4% for Q1 2026? | 2026-04-30 | Macro | 65% | ✗ NO | 0.423
Will China's GDP growth come in below 4% for Q1 2026? | 2026-04-30 | Macro | 65% | ✗ NO | 0.423
Will the 2026 September fed funds futures price (as of April 23) imply more than 1 cut? | 2026-04-24 | Macro | 72% | ✗ NO | 0.518
Will the 2026 September fed funds futures price (as of April 23) imply more than 1 cut? | 2026-04-24 | Macro | 72% | ✗ NO | 0.518
Will the 2026 September fed funds futures price (as of April 23) imply more than 1 cut? | 2026-04-24 | Macro | 72% | ✗ NO | 0.518

WHAT IS A BRIER SCORE?

A Brier score measures the accuracy of probability forecasts: it is the squared difference between the forecast probability and the outcome (1 for yes, 0 for no), averaged over predictions. Lower is better. A perfect score is 0.0 (100% confidence on the correct outcome). Always forecasting 50% scores 0.25. Under 0.15 is excellent; under 0.25 is solid. Sports tend to be harder to call than macro trends.
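For a binary question the score is the squared gap between the forecast probability of YES and the outcome. A minimal sketch, with hypothetical function names, showing a confident correct call, a confident miss, and the 50% baseline:

```python
def brier(probability_yes: float, outcome: int) -> float:
    """Squared error between the forecast P(YES) and the outcome (1 = YES, 0 = NO)."""
    return (probability_yes - outcome) ** 2


def mean_brier(forecasts) -> float:
    """Average Brier score over (probability_yes, outcome) pairs."""
    forecasts = list(forecasts)
    return sum(brier(p, y) for p, y in forecasts) / len(forecasts)


print(round(brier(0.95, 1), 4))  # 0.0025: confident and right
print(round(brier(0.72, 0), 4))  # 0.5184: confident and wrong
print(brier(0.5, 1), brier(0.5, 0))  # 0.25 0.25: a 50% forecast scores 0.25 either way
```

Because a 50% forecast scores 0.25 whichever way the question resolves, 0.25 is the natural reference line for a forecaster with no information.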

RELIABILITY DIAGRAM · PHASE2D_XDOMAIN

69 resolved · 10 probability bins · y=x is perfect calibration

When we say “70% likely,” does it actually happen 70% of the time? Each dot below is a bin of resolved predictions; dot size = sample count. Sitting on the diagonal means calibrated. Above the line = under-confident; below = over-confident.

[Reliability diagram: predicted probability (x-axis, 0 to 1) against observed rate (y-axis, 0 to 1), one dot per bin; the per-bin values are tabulated below.]
BIN | N | PREDICTED | OBSERVED | CALIBRATION
0.0-0.1 | 14 | 0.03 | 0.14 | +11pp (under)
0.1-0.2 | 14 | 0.14 | 0.14 | on target
0.2-0.3 | 7 | 0.26 | 0.43 | +16pp (under)
0.3-0.4 | 2 | 0.34 | 0.50 | +16pp (under)
0.4-0.5 | 4 | 0.41 | 1.00 | +59pp (under)
0.5-0.6 | 5 | 0.56 | 0.60 | on target
0.6-0.7 | 6 | 0.65 | 0.67 | on target
0.7-0.8 | 5 | 0.74 | 0.80 | +6pp (under)
0.8-0.9 | 10 | 0.84 | 0.90 | +6pp (under)
0.9-1.0 | 2 | 0.95 | 1.00 | +5pp (under)

Honest small-N caveat: with 69 resolved predictions spread over 10 bins, several dots rest on only a handful of outcomes (as few as 2 per bin). The picture sharpens as more questions resolve. Any empty bins are shown intentionally - no cherry-picking.
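A reliability table like the one above can be built by grouping forecasts into equal-width probability bins and comparing the average forecast with the observed YES rate in each. The sketch below is one straightforward way to do that, not necessarily how this page's diagram is produced; `reliability_table` is a hypothetical name and the sample forecasts are illustrative.

```python
def reliability_table(forecasts, n_bins=10):
    """Bin (probability, outcome) pairs and summarize each bin.

    Returns one (low, high, n, mean_predicted, observed_rate) tuple per bin;
    empty bins are kept with n = 0 and None for both rates.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[idx].append((p, y))

    rows = []
    for i, items in enumerate(bins):
        low, high = i / n_bins, (i + 1) / n_bins
        if not items:
            rows.append((low, high, 0, None, None))
            continue
        n = len(items)
        mean_predicted = sum(p for p, _ in items) / n
        observed_rate = sum(y for _, y in items) / n
        rows.append((low, high, n, mean_predicted, observed_rate))
    return rows


# Illustrative forecasts only, not taken from this page.
sample = [(0.05, 0), (0.05, 0), (0.15, 1), (0.95, 1), (1.0, 1)]
table = reliability_table(sample)
for low, high, n, predicted, observed in table:
    if n:
        print(f"{low:.1f}-{high:.1f}: n={n}, predicted {predicted:.2f}, observed {observed:.2f}")
```

An observed rate above the mean forecast in a bin corresponds to the "under" label in the table; the gap in percentage points is the difference between the two columns.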

CALIBRATION TEST - 2026 NCAA MEN'S TOURNAMENT

62 games · pre-game predictions only · verified outcomes

74.2% overall accuracy (46/62). Brier score 0.155 - well below the 0.25 coin-flip baseline. The honest test: when the swarm said "70% confident," did it win 70% of the time? Below: every confidence bucket, including the bucket we deliberately said was a coin flip.

SWARM CONFIDENCE | N PICKS | CORRECT | ACTUAL ACCURACY | CALIBRATION
50-55% | 6 | 1 | 16.7% | said coin flip → was coin flip
55-65% | 22 | 14 | 63.6% | well-calibrated (predicted ~60%)
65-75% | 16 | 13 | 81.2% | slightly underconfident
75-85% | 7 | 7 | 100% | underconfident - every pick hit
85-100% | 11 | 11 | 100% | high conviction well-justified

The 50-55% row is the most important. A miscalibrated swarm overclaims certainty. Ours said "these 6 are basically coin flips" and went 1 for 6. With a sample that small, the result is consistent with a coin flip but cannot confirm it.
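The bucket accuracies and the overall hit rate follow directly from the counts in the table. The sketch below recomputes them. As an addition that is not on this page, it also computes how often a genuine 50/50 picker would get at most 1 of 6 right, as a rough read on how much a 6-game bucket can show. All names are hypothetical.

```python
from math import comb

# (confidence bucket, picks, correct) copied from the table above.
buckets = [
    ("50-55%", 6, 1),
    ("55-65%", 22, 14),
    ("65-75%", 16, 13),
    ("75-85%", 7, 7),
    ("85-100%", 11, 11),
]

for label, n, correct in buckets:
    print(f"{label}: {correct}/{n} = {correct / n:.1%}")

total_picks = sum(n for _, n, _ in buckets)
total_correct = sum(c for _, _, c in buckets)
print(f"overall: {total_correct}/{total_picks} = {total_correct / total_picks:.1%}")  # 46/62 = 74.2%


def binom_cdf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of at most k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Chance that a true coin-flip picker gets 1 or fewer of 6 games right.
print(round(binom_cdf(1, 6), 3))  # 0.109
```

Roughly an 11% chance means a 1-for-6 result is unusual but not implausible for true coin flips, which is why six games cannot settle the question on their own.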

FEATURED CASE STUDIES - NCAA ELITE EIGHT

March 28, 2026 · submitted before tip-off · pre-game verified

These are the first two publicly logged, pre-event predictions with verified outcomes. Both were run through the live engine before game time and resolved the same night.

QUESTION | OUR CALL | MARKET | OUTCOME | BRIER (US) | BRIER (MKT) | VERDICT
#9 Iowa to upset #3 Illinois (Elite Eight), Mar 28, 2026 | 33% (67% Illinois - NO upset) | 28% Iowa (Vegas moneyline implied) | ✗ NO upset (Illinois 73-64) | 0.109 | 0.078 | ✓ CORRECT (Holodeck called it)
#2 Purdue to upset #1 Arizona (Elite Eight), Mar 28, 2026 | 31% (69% Arizona - NO upset) | 25% Purdue (spread implied) | ✗ NO upset (Arizona 79-64) | 0.096 | 0.063 | ✓ CORRECT (Holodeck called it)
2 predictions - 2 correct (100%) · Avg Holodeck Brier: 0.103 · Avg Market Brier: 0.071 · Markets slightly better calibrated on these two; both made correct directional calls

WHY THIS PAGE EXISTS

THE HONEST ANSWER

We log everything from day one so that when we have hundreds of resolved predictions, you can audit the full history - not a curated highlight reel.

WHAT WE EXPECT TO WIN

Structural breaks. Regime changes. Tail events that prediction markets underprice because they're anchored to recent consensus. That's where synthetic agent swarms earn their edge.

WHAT WE DON'T HAVE YET

Per-archetype calibration on the high-conviction domains.

The swarm has 180 archetype-distinct experts across 23 domain presets. We can already show which domain the swarm is calibrated in. We can't yet show which archetypes within real estate / macro drove each call - we started logging full per-segment data on resolved questions in May 2026, so the sample size is still small (12 questions with full segment + outcome data, of 395 total resolved).

The priority is depth on multifamily underwriting and macro scenarios - those are the archetype clusters we're actively expanding. We'll ship per-archetype calibration there first, with ~10 resolved predictions per cluster as the statistical floor. Broader-domain archetype work (crypto, sports, geopolitics) is a future roadmap item, not a current focus.

ETA: real estate + macro per-archetype, late June 2026.


Every public run is logged. This page updates as outcomes resolve.