CrossOver

3D Scene Cross-Modal Alignment


Sayan Deb Sarkar1          Ondrej Miksik2          Marc Pollefeys2,3          Dániel Béla Baráth3          Iro Armeni1

1Stanford University 2Microsoft Spatial AI Lab 3ETH Zurich

TL;DR: CrossOver is a cross-modal alignment method for 3D scenes that learns a unified, modality-agnostic embedding space, enabling scene-level alignment without semantic annotations.


Video


Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver’s adaptability for real-world applications in 3D scene understanding.

Method

Given a scene and its instances represented across different modalities, namely RGB images, point clouds, CAD meshes, text referrals, and floorplans, the goal is to align all modalities within a shared embedding space. The Instance-Level Multimodal Interaction module learns a multi-modal embedding space for independent object instances. It captures modality interactions at the instance level within the context of a scene, using spatial pairwise relationships between the object instances. This is further enhanced by the Scene-Level Multimodal Interaction module, which jointly processes all instances to represent the scene with a single feature vector Fs. The instance and scene modules provide a unified, modality-agnostic embedding space; however, this requires semantic instance information that is consistent across modalities at inference time, which is challenging to obtain in practice. The Unified Dimensionality Encoders eliminate the dependency on precise semantic instance information by learning to process each scene modality independently while interacting with the feature vector Fs, thus progressively building a modality-agnostic embedding space. A minimal sketch of the underlying alignment idea is given below.
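To make the alignment objective concrete, the following is a minimal sketch (not the released implementation) of how scene features from two modality encoders, e.g. a point-cloud feature F3D and a floorplan feature F2D, could be pulled into a shared embedding space with a symmetric contrastive loss. The encoder outputs, feature dimension, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F


def contrastive_alignment_loss(f_3d: torch.Tensor,
                               f_2d: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """f_3d, f_2d: (batch, dim) scene features from two modality encoders.

    Matching rows are treated as positive pairs; all other rows in the batch
    act as negatives, pulling paired scenes together in the shared embedding
    space and pushing unpaired scenes apart.
    """
    f_3d = F.normalize(f_3d, dim=-1)
    f_2d = F.normalize(f_2d, dim=-1)

    logits = f_3d @ f_2d.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(f_3d.size(0), device=f_3d.device)

    # Symmetric loss: 3D -> 2D retrieval and 2D -> 3D retrieval.
    loss_3d_to_2d = F.cross_entropy(logits, targets)
    loss_2d_to_3d = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_3d_to_2d + loss_2d_to_3d)


if __name__ == "__main__":
    batch, dim = 8, 512
    f_3d = torch.randn(batch, dim)   # stand-in for point-cloud encoder output
    f_2d = torch.randn(batch, dim)   # stand-in for floorplan encoder output
    print(contrastive_alignment_loss(f_3d, f_2d))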

Cross-Modal Inference Pipeline

During inference for scene retrieval, we use our unified dimensionality encoders. Given a point cloud as the query modality representing the scene, we extract a feature vector F3D in the shared embedding space. To find the most similar 2D floorplan in a database, we locate the closest feature F2D and retrieve the corresponding scene.
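The retrieval step amounts to a nearest-neighbor search in the shared embedding space. Below is a minimal sketch of this step, assuming precomputed features; the function and variable names (retrieve_scene, scene_ids) are illustrative, not from the released code.

import numpy as np


def retrieve_scene(query_f3d: np.ndarray,
                   database_f2d: np.ndarray,
                   scene_ids: list[str]) -> str:
    """query_f3d: (dim,) feature of the query scene's point cloud.
    database_f2d: (num_scenes, dim) floorplan features in the shared space.
    Returns the id of the scene whose floorplan feature is closest.
    """
    q = query_f3d / np.linalg.norm(query_f3d)
    db = database_f2d / np.linalg.norm(database_f2d, axis=1, keepdims=True)
    similarities = db @ q                      # cosine similarity per scene
    return scene_ids[int(np.argmax(similarities))]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = rng.normal(size=(3, 512))                      # stand-in floorplan features
    query = db[1] + 0.01 * rng.normal(size=512)         # query close to scene_b
    print(retrieve_scene(query, db, ["scene_a", "scene_b", "scene_c"]))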

Qualitative Results - Cross-Modal Scene Retrieval

Given a scene in the query modality (floorplan), we aim to retrieve the same scene in the target modality (point cloud).



Citation
@article{sarkar_crossover,
      title={CrossOver: 3D Scene Cross-Modal Alignment},
      author={Sarkar, Sayan Deb and Miksik, Ondrej and Pollefeys, Marc and Baráth, Dániel Béla and Armeni, Iro}
      }


Acknowledgements
We thank Tao Sun, Jianhao Zheng, and Liyuan Zhu for the fruitful discussions.