April 2026 · The Moonlake Team

Introducing Moonlake's 3D Agent: Computer Use Capabilities for World Modeling

Last month, we showed an early demo of our world modeling agent generating persistent, interactive 3D environments. Since our beta, we have been collecting one of the largest datasets of reasoning trajectories over real-world world-building tasks. Today, we are bringing computer use capabilities to our world modeling agent, allowing it to act directly inside tools like Blender to automate real work at scale.

Automating Real Work While Preserving Fine-Grained Editability

AI has made rapid progress in automating knowledge work such as software development, where the loop is well defined: code, execute, verify, and iterate. World-building follows a similar pattern, with one major difference: verification requires causal, geometric, and spatial understanding of the world, priors that are never stated in the task or instruction.

This is a distinct category of work that goes beyond text-based reasoning and represents more than $100B each year across simulation, games, animation, film, and VFX workflows.

Today, we are starting with software that developers and artists are already familiar with: Blender. By building systems that allow agents to iteratively refine and improve over long-horizon trajectories rather than producing one-shot outputs, our world modeling agent can build and reconstruct 3D scenes, model assets, and create articulated assets.

These long-running agents optimize a unified reward that reflects the layered nature of world-building. The top layer scores overall scene quality: visual fidelity, completeness, realism, and prompt alignment. The middle layer scores consistency with reference images or concept art. The bottom layer scores structural correctness: whether objects are properly connected, placed, and aligned. This bottom layer is enforced through code-based verification, which addresses a key limitation of vision-language models: they often miss fine-grained spatial inconsistencies that are better captured programmatically.
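To make the bottom layer concrete, here is a minimal sketch of what a code-based structural check could look like. It is not Moonlake's implementation: the `Box` type, the `structural_issues` function, and the tolerance value are all hypothetical, and parts are approximated by axis-aligned bounding boxes rather than real meshes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical stand-in for a scene object)."""
    name: str
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


def gap(a: Box, b: Box) -> float:
    """Largest per-axis separation; a value <= 0 means the boxes touch or overlap."""
    return max(max(a.lo[i] - b.hi[i], b.lo[i] - a.hi[i]) for i in range(3))


def penetration(a: Box, b: Box) -> float:
    """Smallest per-axis overlap depth; 0.0 when the boxes do not interpenetrate."""
    depth = min(min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i]) for i in range(3))
    return max(depth, 0.0)


def structural_issues(pairs, tol=1e-3):
    """Check pairs of parts that are expected to be in contact.

    Returns a list of human-readable findings: parts that float apart
    and parts that cut into each other. The tolerance is an assumed value.
    """
    issues = []
    for a, b in pairs:
        g = gap(a, b)
        if g > tol:
            issues.append(f"{a.name} floats {g:.3f} from {b.name}")
        elif penetration(a, b) > tol:
            issues.append(f"{a.name} intersects {b.name}")
    return issues


cabinet = Box("cabinet", (0.0, 0.0, 0.0), (1.0, 1.0, 2.0))
door_ok = Box("door", (1.0, 0.0, 0.0), (1.05, 1.0, 2.0))      # flush against the cabinet
door_loose = Box("door", (1.2, 0.0, 0.0), (1.25, 1.0, 2.0))   # detached from the cabinet

print(structural_issues([(door_ok, cabinet)]))
print(structural_issues([(door_loose, cabinet)]))
```

A door hovering 20 cm from its cabinet can look plausible in a single rendered view, which is why a numeric check of this kind complements image-based scoring.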

Figure 1: Articulated mechanical hinge assembly.

Figure 2: Articulated small parts storage cabinet.

Figure 3: Articulated refrigerator.

Figure 4: Articulated dishwasher.

Figure 5: Articulated filing cabinet.

Figure 6: Articulated paper cup.

Figure 7: Articulated cup dispenser.

Figure 8: Outdoor scene — water park.

Figure 9: Outdoor scene — kart racing circuit.

Figure 10: Outdoor scene — beach town marina.

Figure 11: Indoor scene — shopping mall interior.

Figure 12: Indoor scene — factory floor orbit.

More importantly, our agent can integrate directly into existing workflows, whether through digital asset management pipelines or by stepping in directly within Blender for fine-grained edits.

It can also learn from expert demonstrations and generalize these workflows across tasks, turning one-off effort into reusable and agentic procedures. That means no more burning hours on the boring stuff: checking naming, object states, camera framing, material consistency, light setups, export-readiness, and recurring cleanup tasks.
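As an illustration of the kind of recurring check listed above, the sketch below lints a list of object records for naming, unapplied scale, and missing materials. It is an assumption-laden example: the record layout, the naming convention, and the `lint_objects` function are hypothetical, and it works on plain dictionaries rather than calling Blender's API.

```python
import re

# Hypothetical studio convention: PascalCase words joined by underscores.
NAME_PATTERN = re.compile(r"^[A-Z][A-Za-z0-9]*(_[A-Za-z0-9]+)*$")
# Names left at a primitive default, optionally with a numeric suffix such as ".001".
DEFAULT_NAME = re.compile(r"^(Cube|Sphere|Cylinder|Plane)(\.\d{3})?$")


def lint_objects(objects):
    """Return (object name, finding) pairs for a list of object records.

    Each record is a dict with "name", "scale", and "materials" keys
    (an assumed export of scene state, not a real file format).
    """
    findings = []
    for obj in objects:
        name = obj["name"]
        if DEFAULT_NAME.match(name):
            findings.append((name, "default primitive name"))
        elif not NAME_PATTERN.match(name):
            findings.append((name, "name breaks convention"))
        if tuple(obj.get("scale", (1, 1, 1))) != (1, 1, 1):
            findings.append((name, "unapplied scale"))
        if not obj.get("materials"):
            findings.append((name, "no material assigned"))
    return findings


scene = [
    {"name": "Cube.001", "scale": (1, 1, 1), "materials": ["Steel"]},
    {"name": "Fridge_Door", "scale": (2, 1, 1), "materials": []},
    {"name": "Fridge_Body", "scale": (1, 1, 1), "materials": ["Steel"]},
]

for name, finding in lint_objects(scene):
    print(f"{name}: {finding}")
```

Checks like these are cheap to run after every agent step, so problems surface while they are still easy to fix instead of at export time.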

Towards a General-Purpose Benchmark for World-Building

The worlds we build are governed by expectations that are rarely articulated and often taken for granted: objects should maintain spatial coherence, events should unfold with temporal consistency, interactions should respect causal ordering, and entities should persist across time. Each of these must be taken into account during evaluation, and game and simulation development are natural testbeds for this broader challenge.

Existing benchmarks are useful first steps, but they are limited. GameDevBench [1], for example, is largely built from tutorial-derived tasks and highly specified implementation instructions, making it closer to evaluating whether an agent can reproduce instructional artifacts than whether it can infer a goal-oriented behavioral contract. Its distribution also skews toward clean feature construction, underrepresenting tasks where agents must diagnose or modify existing mechanics, state transitions, and spatial interactions.

OpenGame-Bench [2] moves closer to interactive evaluation through end-to-end web game construction, but its success criteria still emphasize whether a project compiles, loads, and renders without critical errors. These are necessary signals, but many world-level failures are silent: a desynchronized item system, broken state transition, or incorrect interaction may produce no compiler warning, runtime crash, or obvious trace signal. This creates two coupled challenges: benchmarks struggle to observe such failures, and agents struggle to use them as feedback during automated debugging.

At Moonlake, we are taking steps to address this gap. We take the implicit expectations of how worlds should behave, as they surface in real development tasks, and turn them into executable scenario tests: small, controlled interaction setups that place the world in a particular state, perform an action, and check the resulting player-visible behavior. We ground these tasks in real issues and pull requests so that scenarios preserve the original goal-oriented language of game development, including both feature construction and debugging.

On the evaluation side, scenario tests make silent behavioral failures visible by probing whether worlds respond correctly under specific conditions. For example, an item-system test should verify that different items produce distinct observable effects, empty-slot use is safely ignored, and pickup behavior preserves the correct state when the player is already holding an item. In this sense, scenario tests codify how a human playtester probes a game under different conditions, but make that process reproducible and quantifiable. To avoid penalizing correct behavior for incidental interface differences, we use an agent grader to adapt the test harness to each candidate implementation while keeping candidate functionality and behavioral assertions fixed.
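The item-system example can be sketched as a set of scenario tests in the "set state, act, check" shape described above. Everything here is hypothetical: the `Player` class is a toy stand-in for a candidate implementation, and the item names, effect values, and scenario functions are invented for illustration rather than taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Player:
    """Toy candidate implementation of an item system (hypothetical)."""
    health: int = 50
    speed: float = 1.0
    held: Optional[str] = None

    def pickup(self, item: str) -> bool:
        # Refuse the pickup if a slot is already occupied.
        if self.held is not None:
            return False
        self.held = item
        return True

    def use(self) -> None:
        # Using an empty slot is a no-op.
        if self.held is None:
            return
        if self.held == "potion":
            self.health = min(100, self.health + 25)
        elif self.held == "boots":
            self.speed *= 1.5
        self.held = None


def observe(p: Player):
    """Player-visible state that the assertions are written against."""
    return (p.health, p.speed, p.held)


def distinct_effects() -> bool:
    """Different items must change observable state in different ways."""
    outcomes = []
    for item in ("potion", "boots"):
        p = Player()
        p.pickup(item)
        p.use()
        outcomes.append(observe(p))
    baseline = observe(Player())
    return outcomes[0] != outcomes[1] and baseline not in outcomes


def empty_slot_use_ignored() -> bool:
    """Using an empty slot must leave the player unchanged."""
    p = Player()
    before = observe(p)
    p.use()
    return observe(p) == before


def pickup_while_holding_keeps_item() -> bool:
    """A second pickup must not overwrite the item already held."""
    p = Player(held="potion")
    p.pickup("boots")
    return p.held == "potion"


SCENARIOS = [distinct_effects, empty_slot_use_ignored, pickup_while_holding_keeps_item]
results = {fn.__name__: fn() for fn in SCENARIOS}
print(results)
```

None of these failures would raise an exception in a buggy implementation. For instance, a `pickup` that silently overwrote the held item would still compile, load, and render, and only the third scenario would report it.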

We are early, but our goal is to build a broader benchmark for this distinct category of work. We are expanding in three directions: (a) covering more representative real-world development tasks, (b) evaluating beyond code-level checks with interaction traces and multimodal signals, and (c) extending evaluation to be simulation-engine- and model-agnostic. If you are excited about building benchmarks that evaluate whether agents can construct, debug, and verify worlds, beyond isolated code or assets, please visit jobs.ashbyhq.com/Moonlake.

Outlook

Today, Moonlake works with partners with a combined market cap of over $8T to automate mundane work and unlock more time for downstream model training and creative work. If you would like to use our agent or partner with us, please email [email protected].

[1] Chi, Wayne, et al. "GameDevBench: Evaluating Agentic Capabilities Through Game Development." arXiv preprint arXiv:2602.11103 (2026).
[2] Jiang, Yilei, et al. "OpenGame: Open Agentic Coding for Games." arXiv preprint arXiv:2604.18394 (2026).