Project Eden

The First World Model for AI-native Multiplayer and Agent Interaction in a Consistent World State

2026/05/31

Generative AI has made extraordinary progress in video. With the rise of action-conditioned generation (where models respond to user inputs to produce continuous visual motion), the industry has increasingly begun to frame these systems as "world models."

But does predicting the next sequence of pixels equate to simulating a world?

Not quite. A video model predicts how pixels should change.
A true world model must reason about what those pixels represent: objects, spaces, events, actions, memory, and physical consequences that carry forward over time. To date, research chasing this vision has largely fractured into two directions, each hitting a fundamental wall.

The first direction is action-conditioned video generation. Today’s monocular video models operate primarily through autoregressive prediction in 2D pixel space. Because the evolution of the world and the rendering of the current camera view are tightly coupled, the model's entire "understanding" of the world is compressed into a short context window of recent frames.
This creates a hard limit on persistence. When an object leaves the camera's field of view, there is no independent state preserving it. If the camera returns, the model must infer the object again from context, often through hallucination. The result is familiar: objects drift, disappear, or reappear inconsistently. This path captures time and motion, but lacks a durable world state.

The second direction is static 3D scene generation. While these systems provide strong spatial structure and allow users to move through navigable environments, they treat the scene as a fixed asset. Time, physics, and state transitions are not native to the architecture. This path captures space, but lacks continuous world evolution.

One path captures motion without persistence. The other captures structure without evolution.

For VAST, a foundational world model must possess both.
It requires solving two foundational problems simultaneously:
1. State: Defining the objective condition of the world at any given moment, independent of the camera.
2. Transition: Driving that world forward as actions, events, and rules unfold over time.
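
As a minimal sketch of these two pieces, assume a toy one-dimensional world. The `Body` type, the `transition` function, and the velocity-change actions are hypothetical illustrations, not Eden's actual representation. State is a plain mapping from object IDs to physical quantities, and transition is a pure function from one state to the next. No camera appears anywhere in it:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Body:
    """Hypothetical per-object state: 1D position and velocity."""
    position: float
    velocity: float


def transition(state, actions, dt=1.0):
    """Advance the world by one step and return a new state.

    state:   dict mapping object id -> Body
    actions: dict mapping object id -> velocity change applied this step
    """
    next_state = {}
    for obj_id, body in state.items():
        velocity = body.velocity + actions.get(obj_id, 0.0)
        next_state[obj_id] = Body(body.position + velocity * dt, velocity)
    return next_state


world = {"car": Body(position=0.0, velocity=1.0)}
world_next = transition(world, {"car": 1.0})  # the car accelerates, then moves
```

Because `transition` returns a new mapping instead of mutating its input, the earlier state stays available for comparison or replay.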

Today, we are sharing a research preview of our approach: Project Eden.

Project Eden is a persistent, multiplayer world model that fundamentally breaks from existing paradigms by decoupling the underlying world state from visual rendering. Instead of treating the world as a sequence of transient frames, Eden treats it as a structured, evolving environment that runs continuously, can be modified by user actions, and can be consistently observed from any viewpoint.

The Principle of Decoupling: State Before Rendering

Eden starts from a simple design principle: space, events, viewpoints, object identity, physical changes, and visual appearance should not all be compressed into pixel history.

In a real interactive world, the world exists before any single camera observes it. A player can look away from a wall, and the wall should still be there. A fire can be extinguished, and the world should remember that it is out. Two players can race on the same track from different angles, and they should still be acting inside one synchronized reality.

These are state problems before they are rendering problems.

A capable world model requires an underlying state that persists independently of camera view. Visual rendering should be a way to observe that state, not the medium where the entire state is stored. This is the core philosophy behind Project Eden: separating the world state from visual generation.

Under the Hood: A Three-Layer Architecture

To achieve this, Project Eden replaces the traditional monolithic video generator with a three-layer architecture, assigning clear responsibilities to each component.

1. The Evolving Structured State

Eden maintains a global world state that persists over time, can be updated by actions, and can be queried by different cameras. For efficiency and temporal rigor, this state is not a massive 4D point cloud. It is a compact implicit or structured representation that carries the world’s underlying content, coarse geometry, object semantics, and the consequences of user actions.

This is where the world lives.

Objects that leave the camera view are not discarded. Changes caused by user actions can be written into the world state. The world is not regenerated from scratch every time the camera moves; it is queried from the same underlying state.

2. The State-to-Observation Interface

When the system needs to render a specific view, it converts the evolving world state into camera-conditioned constraints: local semantics, geometry cues, and event changes. These intermediate representations always come from the same underlying state, so different viewpoints remain physically aligned with the same objective world.

The renderer does not have to guess the scene structure from pixels alone. It receives conditions grounded in a world state that already exists.
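
One way to picture this interface is a function that takes the shared state and a camera and returns only the constraints relevant to that view. The sketch below assumes a flat 2D world; `Camera`, `observe`, and the `label` / `distance` / `offset_deg` fields are hypothetical stand-ins for the semantics and geometry cues described above, not Eden's actual format:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    x: float
    y: float
    heading_deg: float
    fov_deg: float = 90.0


def observe(objects, camera):
    """Return per-object constraints for the objects inside the camera's view.

    objects: dict mapping object id -> (label, x, y) in world coordinates
    """
    constraints = {}
    for obj_id, (label, ox, oy) in objects.items():
        dx, dy = ox - camera.x, oy - camera.y
        bearing = math.degrees(math.atan2(dy, dx))
        # Signed angle between the camera heading and the object, in [-180, 180).
        offset = (bearing - camera.heading_deg + 180.0) % 360.0 - 180.0
        if abs(offset) <= camera.fov_deg / 2:
            constraints[obj_id] = {
                "label": label,
                "distance": math.hypot(dx, dy),
                "offset_deg": offset,
            }
    return constraints


objects = {"wall": ("wall", 10.0, 0.0), "tree": ("tree", -10.0, 0.0)}
front = observe(objects, Camera(0.0, 0.0, heading_deg=0.0))
back = observe(objects, Camera(0.0, 0.0, heading_deg=180.0))
front_again = observe(objects, Camera(0.0, 0.0, heading_deg=0.0))
```

Looking away and back yields identical constraints for the wall, because both queries read the same `objects` mapping. Nothing is re-inferred from earlier frames.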

3. Generative Neural Rendering

The renderer receives state-derived constraints and produces high-fidelity visual output: texture, lighting, material detail, motion, smoke, fire, water, and other local dynamics. Its role is not to carry the entire burden of world memory, but to translate the underlying state into what the viewer sees.

This gives Eden a different foundation from video-first world models. The world is maintained below the image. The image becomes a view into that world.

The Data Paradigm: Aligning Structure and Vision

A state-based world model requires a different kind of data.

For Eden, native training data is not just video. The key signal is alignment between two forms of the same world: the underlying simulation state, which contains structure and logic, and the rendered observation, which contains high-fidelity visual experience.

To build this dual-state training substrate, VAST uses a layered data strategy.

Large-Scale Deconstruction of Internet Video

Internet video provides diversity, scale, and broad visual commonsense, but it arrives as 2D pixels. Using Tripo’s accumulated 3D foundation model capabilities, VAST reverse-engineers structural signals from unlabelled video, including depth, camera pose, and geometric trajectories. This turns ordinary video into a more structured state-observation signal and gives the model generalization across many types of environments.

Engine-Synthesized Simulation Data

Game engines naturally maintain both internal state and rendered output. They can provide precise 3D state annotations, action instructions, camera poses, object identities, and environmental changes. This gives Eden controlled data for learning physical evolution, action response, and scene logic.
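
To make the idea of aligned state and observation concrete, here is a small sketch of what one engine-logged training record could look like. Everything in it is an assumption for illustration: the `AlignedSample` fields, the one-object toy engine in `record_episode`, and the text "frame" that stands in for a rendered image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedSample:
    """One training record pairing engine state with what the camera saw."""
    tick: int
    action: str
    object_positions: dict   # object id -> grid cell index (the state)
    camera_cell: int         # camera pose, reduced to a cell index here
    frame: str               # stand-in for the rendered image


def render(object_positions, camera_cell, width=5):
    """Toy 'renderer': draw the `width` cells starting at camera_cell."""
    cells = ["."] * width
    for obj_id, cell in object_positions.items():
        if camera_cell <= cell < camera_cell + width:
            cells[cell - camera_cell] = obj_id[0].upper()
    return "".join(cells)


def record_episode(actions, camera_cell=0):
    """Step a one-object toy engine and log a sample after every action."""
    positions = {"car": 0}
    samples = []
    for tick, action in enumerate(actions):
        if action == "forward":
            positions["car"] += 1
        samples.append(AlignedSample(
            tick, action, dict(positions), camera_cell,
            render(positions, camera_cell),
        ))
    return samples


samples = record_episode(["forward", "wait", "forward"])
```

Each record carries the action, the resulting state, the camera pose, and the observation together, so a model trained on such data can be supervised on state changes and on appearance at the same time.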

Together, these two pillars of data help the model learn not only how worlds look, but how they change.

What Eden Unlocks

By decoupling world state from visual rendering, Eden is designed to unlock capabilities that pure video generation and static 3D generation struggle to provide together.

Environmental Persistence and Viewpoint Consistency

In Eden, objects do not disappear when they leave the camera frustum. They continue to exist in the underlying state. When the same scene is revisited later or observed from another camera, the model queries a state that has persisted over time.

This makes long-horizon memory possible. No matter how long the user looks away, the world is still there when they turn back.

The preview demonstrates this through persistent change.
In the fire-extinguishing demo, the user action does not merely produce a temporary visual effect. The fire is extinguished, and the environment enters a changed state.
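
The difference between a temporary visual effect and a changed state can be sketched with a toy comparison. The code below is an illustrative assumption, not Eden's implementation: `frame_window` mimics a frame-only model whose memory is a short window of recent frames, `world_state` mimics a store that lives outside the frames, and the 8-frame context length is arbitrary.

```python
from collections import deque

CONTEXT_FRAMES = 8  # hypothetical context length for the frame-only baseline

# Frame-only baseline: all it "knows" is the last few frames it produced.
frame_window = deque(maxlen=CONTEXT_FRAMES)

# State-based approach: facts about the world live outside the frames.
world_state = {"campfire": "burning"}


def step(visible_ids, action=None):
    """Advance one frame; `visible_ids` are the objects the camera can see."""
    if action == "extinguish":
        world_state["campfire"] = "extinguished"
    # A frame can only record the status of objects that are on screen.
    frame_window.append({i: world_state[i] for i in visible_ids})


step(["campfire"], action="extinguish")   # the user puts the fire out
for _ in range(100):                      # then looks away for 100 frames
    step([])

evidence_in_window = any("campfire" in frame for frame in frame_window)
remembered_status = world_state["campfire"]
```

After the camera has looked away, the frame window holds no trace of the campfire at all, while the state store still records it as extinguished.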

Rich Physical Dynamics and Diverse Control

The underlying state registers diverse user inputs and updates the physical dynamics accordingly.

Reusable and Editable Worlds

Traditional video world models often behave like one-way generations. Once the timeline moves forward, direct intervention in the world itself is limited.

Eden allows users to repeatedly intervene in a running world state. A user can modify the environment, leave marks, change objects, or create consequences that persist.
The user is no longer generating a new, separate video for each interaction. They are continuing inside the same reusable, modular world.

Because changes are stored in the underlying state, other users entering that world can observe the same changes. Generated worlds become persistent interactive spaces rather than disposable clips.

Native Multiplayer and Multi-Agent Interaction

In a pure video-based approach, multiple players often mean multiple unrelated pixel histories. As the number of viewpoints grows, consistency becomes harder and compute cost rises quickly.

In Eden, multiple agents share the same compact underlying state. The system renders separate views according to each agent’s camera and position, while their actions update the same world. This makes concurrent, multi-view interaction a native property of the architecture.

The preview demonstrates this with shared-world scenes. In the racing demo, two cars drive on the same track. Each player can observe the race from a different camera, but the world underneath remains synchronized.

In the shooting-range demo, different players take different actions in the same environment, and Eden produces different results according to the rules of that world.
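
A rough sketch of this pattern, loosely modeled on the racing scene: every player's action is applied to one shared state, and each player's view is derived from that state afterwards. The `SharedWorld` class, its integer "throttle" actions, and the relative-distance view are all hypothetical simplifications.

```python
class SharedWorld:
    """Toy shared state: car id -> distance travelled along one track."""

    def __init__(self, car_ids):
        self.tick = 0
        self.distance = {car_id: 0 for car_id in car_ids}

    def step(self, actions):
        """Apply every player's action for this tick to the same state."""
        for car_id, throttle in actions.items():
            self.distance[car_id] += throttle
        self.tick += 1

    def view_for(self, car_id):
        """Per-player observation: other cars relative to this player's car."""
        own = self.distance[car_id]
        return {
            other: dist - own
            for other, dist in self.distance.items()
            if other != car_id
        }


world = SharedWorld(["red", "blue"])
world.step({"red": 3, "blue": 2})
world.step({"red": 1, "blue": 3})
red_view = world.view_for("red")    # blue is ahead of red
blue_view = world.view_for("blue")  # red is behind blue
```

The two views differ, but they cannot contradict each other: both are computed from the same `distance` mapping, so adding a player adds a view rather than a second copy of the world.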

Agent Training

A world with stable physical logic, temporal consistency, and long-term persistence can become more than a content medium. It can serve as a training and evaluation environment for embodied intelligence.

This direction shares the broader ambition of foundational world model research, including systems such as Genie: playable environments, action control, long-horizon consistency, and agent evaluation. Tripo’s emphasis is different. We focus on structured state and state-rendering decoupling as the path toward worlds that can persist.

Why This Matters

Project Eden is positioned as a foundational engine for next-generation interactive content, and also as a high-quality simulation base for embodied intelligence and agent research.

For interactive content, Eden points toward low-barrier world creation: a creator can generate an environment, define or trigger interactions, and let multiple users enter the same persistent world.

For research, Eden points toward simulation environments with long-horizon consistency, physical rules, editable scenarios, and measurable action consequences. Embodied agents need worlds where actions have stable outcomes and where the environment does not reset or drift unpredictably after every observation.

This is why VAST does not treat world models as a subproblem of video generation. A world model needs a state that can evolve.

Outlook: Toward General Interactive Worlds

Project Eden is a research preview. It is not the final version of a general-purpose world model.

The path ahead is still early.

We are working on richer physical dynamics, more complex scene evolution, broader free-viewpoint exploration, larger environments, and finer-grained object interaction.

We are also building toward a stronger State Transition Model: a system capable of continuously updating the underlying world state from agent actions, world rules, environmental feedback, and visual observations.

Real-time rendering and system efficiency also need to keep improving as the number of users, viewpoints, objects, and events grows. Evaluation must also move beyond visual quality to test persistence, object identity, causal consistency, rule-following, cross-view consistency, action consequences, and multi-agent synchronization.
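
As one illustration of what a non-visual metric might look like, the sketch below scores persistence as the fraction of objects whose attributes are unchanged when a scene is revisited. The `persistence_score` function and the attribute dictionaries are hypothetical. This is not a benchmark VAST has described, only an example of the kind of check the list above points toward.

```python
def persistence_score(before, after):
    """Fraction of objects whose recorded attributes survive a revisit.

    before, after: dicts mapping object id -> attribute dict, as recovered
    from the scene before the camera left and after it returned.
    """
    if not before:
        return 1.0
    kept = sum(1 for obj_id, attrs in before.items() if after.get(obj_id) == attrs)
    return kept / len(before)


before = {
    "wall": {"color": "red"},
    "crate": {"color": "brown"},
    "fire": {"lit": False},
    "door": {"open": True},
}
after = {
    "wall": {"color": "red"},
    "crate": {"color": "green"},   # appearance drifted
    "fire": {"lit": False},
    # "door" vanished entirely
}
score = persistence_score(before, after)
```

A score like this penalizes both drifting attributes and disappearing objects, two failure modes that frame-level visual quality metrics can miss.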

The shift from predicting the next pixel to simulating the next state is more than an architectural change. It is a step toward AI systems that can create, remember, and reason inside persistent interactive worlds.

About VAST AI Research

VAST AI Research is building 3D foundation models and world models.
Learn more at www.tripo3d.ai/research and follow us at @vastairesearch.
