Theory of Space

Pingyue Zhang1*, Zihan Huang*, Yue Wang4*, Jieyu Zhang3*, Letian Xue1, Zihan Wang1, Qineng Wang1

Keshigeyan Chandrasegaran2, Ruohan Zhang2, Yejin Choi2, Ranjay Krishna3, Jiajun Wu2, Li Fei-Fei2, Manling Li1

1Northwestern University, 2Stanford University, 3University of Washington, 4Cornell University

*equal contribution

What is Theory of Space?

Theory of Space is the ability to build a mental map from partial views. We define it as three coupled abilities:

01

Construct

Actively explore and integrate partial observations into a globally consistent belief.

02

Revise

Revise the belief when new evidence conflicts with earlier assumptions.

03

Exploit

Use the current belief to answer spatial queries.

Theory of Space decomposes into constructing, revising, and exploiting an internal spatial belief.

We operationalize this by explicitly probing the agent’s cognitive map during exploration, not just its final answers. This exposes belief accuracy, uncertainty, and how the map evolves over time.

Why is this problem important?

  • Active Exploration: Embodied agents must actively explore to construct spatial beliefs, unlike passive systems.
  • Efficiency Challenge: Effective exploration requires efficient decision-making under uncertainty, prioritizing information gain.

Current Limitations: Multimodal models excel at passive tasks but struggle with active exploration under partial observability. Key bottlenecks include:

  • Active vs. Passive Gap: Models significantly underperform in active settings compared to passive reasoning.
  • Modality Gap: Vision-based performance lags behind text-based reasoning.
  • Inefficiency: Models require far more actions than optimal proxies.
Cost vs. evaluation accuracy in the Vision World; faded icons indicate passive logs (reasoning only).

What do we do?

We build a benchmark in which models must explore and reveal their spatial beliefs. Text-based and vision-based worlds are paired so we can isolate reasoning from exploration.

Exploration Environment

Procedurally generated multi-room layouts on an N×M grid, with paired environments:

  • Text World: symbolic observations with direction/distance bins (pure reasoning)
  • Vision World: egocentric RGB images from ThreeDWorld (perception + reasoning)
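To make the pairing concrete, here is a minimal sketch of what a symbolic Text World observation might look like; the field names and bin vocabularies (DIRECTION_BINS, DISTANCE_BINS, SymbolicObservation) are illustrative assumptions, not the benchmark's actual schema:

from dataclasses import dataclass

# Illustrative direction/distance bins; the benchmark's exact vocabulary may differ.
DIRECTION_BINS = ["front", "front-left", "left", "back-left",
                  "back", "back-right", "right", "front-right"]
DISTANCE_BINS = ["near", "mid", "far"]

@dataclass
class SymbolicObservation:
    object_name: str  # e.g., "door"
    direction: str    # one of DIRECTION_BINS
    distance: str     # one of DISTANCE_BINS

# The step-0 observation from the example below, rendered symbolically
# (the "near" distance is assumed for illustration).
obs = SymbolicObservation("door", "front-left", "near")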
Action Space

  • Goto: move to a visible object
  • Rotate: turn 90°/180°/270°
  • Observe (cost: 1): get a text/visual view
  • Query (cost: 2): get coordinates
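A minimal sketch of this action interface and its cost accounting; the benchmark specifies only the four actions and the costs of Observe and Query, so the zero cost for Goto/Rotate and all names here are assumptions:

from enum import Enum

class Action(Enum):
    GOTO = "goto"        # move to a visible object
    ROTATE = "rotate"    # turn 90°/180°/270°
    OBSERVE = "observe"  # get a text/visual view (cost 1)
    QUERY = "query"      # get coordinates (cost 2)

# Assumed: movement and rotation are free; only sensing spends budget.
ACTION_COST = {Action.GOTO: 0, Action.ROTATE: 0,
               Action.OBSERVE: 1, Action.QUERY: 2}

def episode_cost(actions):
    """Total exploration cost of an action sequence."""
    return sum(ACTION_COST[a] for a in actions)

# Observe (1) + Observe (1) + Query (2) = 4
assert episode_cost([Action.ROTATE, Action.OBSERVE, Action.GOTO,
                     Action.OBSERVE, Action.QUERY]) == 4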

Active Exploration Example

Step through the agent's interaction to see what it observes, chooses, believes, and how that aligns with the ground truth.

Example (step 0 of 6). Agent action: move forward to inspect the door. Observation: in the Vision World the agent sees a partial room view with a door to the front-left, while the Text World provides the corresponding symbolic view. The ground-truth map and the agent's belief map are shown side by side for comparison.

Task Suite

We evaluate how the learned map is used at two levels:

  • Route-level tasks test egocentric, path-based reasoning.
  • Survey-level tasks test allocentric, map-like reasoning.

Survey-level probes ask whether a model can infer unseen views and handle geometric transformations beyond memorized paths.

Task suite for Theory of Space covering route-level and survey-level belief.

Belief Revision

False belief revision task

We introduce a dynamic perturbation task to probe Belief Revision. After exploration, objects are secretly relocated, creating a "false belief" that conflicts with new observations.

  • Task: The agent must actively re-explore to identify changes and revise its map.

Disentangling Exploration from Reasoning

Active vs Passive exploration comparison

We disentangle exploration ability from reasoning ability using scripted proxy agents:

  • SCOUT (vision): rotates 360° at each location for full coverage
  • STRATEGIST (text): belief-driven policy to maximally reduce ambiguity
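For intuition, a minimal sketch of a SCOUT-style policy under these assumptions (four 90° headings per location, observing at each); the real scripted agent may differ:

def scout_policy(locations):
    """Scripted SCOUT-style proxy (illustrative): visit each location and
    observe all four 90-degree headings, guaranteeing full visual coverage
    without consulting any belief about the map."""
    actions = []
    for loc in locations:
        actions.append(("goto", loc))
        for _ in range(4):  # headings 0, 90, 180, 270 degrees
            actions.append(("observe",))
            actions.append(("rotate", 90))
    return actions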

Active Exploration

Agent plans its own actions, balancing cost vs. information gain.

Passive Comprehension

Model reasons from proxy-collected logs, isolating reasoning from exploration.

Probing Internal Belief

Spatial belief probing: externalize a cognitive map and identify unexplored regions.

We provide a direct window into the agent's spatial belief via explicit cognitive-map probing:

  • Agent outputs a structured cognitive map (N×M grid) at each step
  • Scored on correctness (ground-truth alignment) and other dynamic metrics
  • Also probes uncertainty by identifying unobserved regions
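As a hedged sketch of how such a probe could be scored, assume the map is serialized as a mapping from grid cells to object labels (the benchmark's actual JSON schema also covers orientation and other fields):

def map_correctness(predicted, ground_truth):
    """Fraction of ground-truth (cell -> object) entries that the predicted
    cognitive map reproduces. Illustrative position-only scoring."""
    if not ground_truth:
        return 1.0
    hits = sum(1 for cell, obj in ground_truth.items()
               if predicted.get(cell) == obj)
    return hits / len(ground_truth)

# Example on a 2x2 grid: one of two objects placed correctly -> 0.5
gt   = {(0, 0): "door", (1, 1): "chair"}
pred = {(0, 0): "door", (1, 0): "chair"}
assert map_correctness(pred, gt) == 0.5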

Evaluation Results

Performance comparison of different models on Theory of Space (ToS) benchmark. Dark highlighting indicates the best result within each category, light highlighting denotes the second-best.


Performance Conclusion

  • 🎯 Active exploration is the bottleneck: GPT-5.2 drops from 0.57 (passive) to 0.46 (active); Gemini-3 Pro from 0.60 to 0.57.
  • 📊 Substantial modality gap: GPT-5.2 scores 0.72 in text vs. 0.46 in vision; Gemini-3 Pro 0.81 vs. 0.57.
  • Inefficient exploration: proxy agents reach coverage in ~9 steps; models take far more steps with no accuracy gain.
  • 🔍 Exploration patterns differ: GPT-5.2 jumps to doors early; Gemini-3 Pro rotates in place first, like the scout proxy.

Spatial Belief Probing

We probe the agent's internal belief state to understand why failures occur.

  • Correctness (final): Evaluates the predicted global map at the last turn.
  • Perception: Compares the predicted local map to the ground-truth local map for the current field of view.
  • Self-tracking: Compares the agent pose inferred from the predicted global map to the ground-truth agent state.
  • Local ↔ Global: Compares local and global predictions within the same turn (coherence check).
  • Stability: Checks whether beliefs about previously observed objects degrade in the map over time (a sketch of one possible reading follows the table below).
  • Uncertainty: Can the agent identify which regions it hasn't seen yet?
Methods          Correctness              Perception     Local ↔ Global   Stability      Self-tracking   Uncertainty
                 Ori.   Pos.   Overall    Ori.   Pos.    Ori.   Pos.      Ori.   Pos.    Ori.   Pos.
Vision-based World
GPT-5.2          20.2   42.0   32.2       33.5   72.4    57.9   58.7      65.4   56.4    93.3   64.7     53.7
Gemini-3 Pro     32.2   62.5   52.1       43.8   68.5    52.9   68.3      61.8   62.0    98.8   73.9     70.2
Text-based World
GPT-5.2          91.0   75.1   80.0       100    86.8    96.4   86.0      96.7   67.6    98.0   86.7     64.5
Gemini-3 Pro     92.5   75.5   81.4       99.9   88.2    91.6   84.8      90.8   67.7    99.9   85.2     79.2
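As promised above, one possible reading of Stability, sketched under the same position-only map encoding; this is an assumption about the metric, not the paper's exact formula:

def stability(map_history, ground_truth):
    """Of the (cell, object) facts ever predicted correctly during the
    episode, what fraction is still correct in the final map?"""
    ever_correct = {(cell, obj)
                    for m in map_history
                    for cell, obj in m.items()
                    if ground_truth.get(cell) == obj}
    if not ever_correct:
        return 1.0
    final = map_history[-1]
    kept = sum(1 for cell, obj in ever_correct if final.get(cell) == obj)
    return kept / len(ever_correct)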

Probing Conclusion

  • 👁️ Perception is the bottleneck: vision perception remains a key bottleneck, especially for object orientation.
  • 📉 Unstable belief: unstable cognitive-map prediction degrades spatial belief beyond initial perception.

Belief Revision Task

We introduce a dynamic perturbation task to probe the agent's ability to handle non-stationarity. After exploration, 4 objects are secretly relocated/reoriented. The agent must re-explore to identify changes.

  • Identification Accuracy: F1 score for correctly identifying which objects changed (a minimal sketch follows the table below).
  • Average Steps: cost of belief revision, including Redundancy (steps taken after all changes have been found).
  • Belief Correctness: map accuracy on the changed objects.
  • Belief Inertia: how strongly the agent clings to its obsolete map; a low score means stale beliefs are overwritten, a high score means they persist despite new evidence.
Methods          Avg. Steps ↓     Identification (%) ↑   Belief Correctness (%) ↑   Belief Inertia (%) ↓
                 All     Red.     Ori.    Pos.           Ori.    Pos.               Ori.    Pos.
Text-based World
GPT-5.2          6.92    0.55     97.9    98.4           89.5    69.7               5.5     12.5
Gemini-3 Pro     7.79    0.18     98.7    98.8           91.8    72.9               7.9     5.7
Vision-based World
GPT-5.2          13.06   6.20     14.3    68.0           16.7    42.9               68.9    34.7
Gemini-3 Pro     10.29   3.23     23.9    82.5           30.3    63.1               51.1    14.4
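A minimal sketch of the identification metric referenced above, assuming the agent reports the set of objects it believes changed (the exact reporting protocol is an assumption):

def identification_f1(predicted_changed, actually_changed):
    """F1 between the objects the agent flags as changed and those that
    were actually relocated/reoriented. Illustrative."""
    predicted, actual = set(predicted_changed), set(actually_changed)
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(actual)
    return 2 * precision * recall / (precision + recall)

# 4 objects moved; the agent flags 3, of which 2 are real changes.
assert round(identification_f1({"a", "b", "x"}, {"a", "b", "c", "d"}), 3) == 0.571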

Belief Revision Conclusion

  • 🔄 Severe belief inertia in vision: vision agents fail to overwrite obsolete orientation beliefs despite new evidence (orientation Belief Inertia of 51.1–68.9%).
  • 🛑 High redundancy: vision agents take many redundant steps after changes become visible, indicating a failure to recognize that revision is complete.

Key Findings

01

Modality Gap

Text significantly outperforms vision.

A clear modality gap persists: text-based settings consistently outperform vision-based settings in spatial belief construction and exploitation.

Vision vs. text gap under the passive and active settings.
02

Active Exploration is the Bottleneck

Performance drops when models must actively explore.

a) Performance and Efficiency Deficit

Active agents score lower than when they reason over histories collected by the rule-based proxy, and they explore less efficiently than the proxy.

Passive vs. active gap in the text and vision worlds; information-gain analysis.

b) Incomplete Coverage

Active agents fail to achieve complete information coverage: models explore redundantly, taking ≥14 steps without improving belief accuracy, while rule-based proxies reach target coverage in ~9 steps.

c) Complexity-Widened Gap

The active–passive gap widens with complexity, and Gemini-3 Pro scales better. As the number of rooms grows, exploration cost rises accordingly; for both GPT-5.2 and Gemini-3 Pro, performance declines as the room count increases, and the active–passive gap widens with it.

Methods          2-room                     4-room
                 pass.    act.    exp.      pass.    act.    exp.
Text-based World
GPT-5.2          92.3     77.8    6.2       86.5     66.0    16.4
Gemini-3 Pro     86.7     80.6    6.2       81.2     77.7    19.7
Vision-based World
GPT-5.2          59.3     51.5    10.8      52.6     40.3    23.2
Gemini-3 Pro     58.3     57.8    6.6       56.2     51.5    19.7
03

Cognitive Map Failures

Orientation, stability, and belief drift issues.

  • Orientation Gap: Vision perception is a bottleneck, especially for object orientation.
  • Unstable Map: Beliefs about previously observed objects degrade over time.
  • Belief Drift: New updates corrupt earlier correct perceptions, lowering final correctness.
04

Maps as a Diagnostic Proxy

Map correctness correlates with downstream success.

  • Sufficiency Test: Conditioning on ground-truth maps yields near-perfect accuracy (~95%), confirming the JSON map format captures all necessary information for tasks.
  • Alignment Test: Prompting models to explicitly generate maps before answering slightly degrades performance. This externalization gap indicates the model's latent internal belief is richer than its discretized JSON output.

While lossy, the explicit map remains a strong diagnostic proxy. Map correctness correlates significantly with downstream success:

Methods          Text (%)   Vision (%)
GPT-5.2          41.8       57.0
Gemini-3 Pro     46.6       64.5

Pearson correlation (r) between spatial-belief correctness and downstream evaluation performance. All correlations are significant (p<.001).
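For concreteness, the reported statistic is the standard Pearson r; a sketch of how it could be computed per model from per-episode pairs, with made-up scores (belief_correctness and task_accuracy are illustrative, not the paper's data):

from scipy.stats import pearsonr

# Hypothetical per-episode scores; the paper pairs spatial-belief
# correctness with downstream task performance and reports r (p < .001).
belief_correctness = [0.2, 0.4, 0.5, 0.7, 0.9]
task_accuracy      = [0.3, 0.35, 0.5, 0.6, 0.8]

r, p = pearsonr(belief_correctness, task_accuracy)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")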

05

Vision Deficiencies & Belief Inertia

Vision agents persist in obsolete beliefs.

  • Vision-based Revision Failures: Vision agents suffer from excessive exploration redundancy and poor accuracy in identifying object shifts.
  • Belief Inertia: Agents, especially vision-based ones, persist in obsolete spatial coordinates despite new observations.

Citation

If you find our work useful in your research, please cite:

@inproceedings{zhang2026theoryofspace,
  title     = {Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?},
  author    = {Zhang, Pingyue and Huang, Zihan and Wang, Yue and Zhang, Jieyu and Xue, Letian and Wang, Zihan and Wang, Qineng and Chandrasegaran, Keshigeyan and Zhang, Ruohan and Choi, Yejin and Krishna, Ranjay and Wu, Jiajun and Fei-Fei, Li and Li, Manling},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}