Developing safe autonomous driving systems requires massive amounts of data. Simulators offer scalable data generation, but the visual gap between simulation and reality limits the usefulness of simulated data for downstream perception and planning tasks.
Diffusion models can narrow this gap by translating simulated images toward photorealism, but standard diffusion models tend to alter the structural layout of a scene during generation. Conventional conditioning approaches preserve structure at the cost of significant computational overhead; our framework instead optimizes the data representation within the diffusion process itself. This avoids the added complexity while better preserving both the geometric structure and the semantics of the original simulation. As a result, we provide high-fidelity data for training autonomous driving stacks and enable safe, reliable closed-loop end-to-end evaluation.
