CrossOver

3D Scene Cross-Modal Alignment


Sayan Deb Sarkar1          Ondrej Miksik2          Marc Pollefeys2,3          Dániel Béla Baráth3          Iro Armeni1

1Stanford University 2Microsoft Spatial AI Lab 3ETH Zurich

TL;DR: CrossOver is a cross-modal alignment method for 3D scenes that learns a unified, modality-agnostic embedding space, enabling scene-level alignment without semantic annotations.


Video


Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver’s adaptability for real-world applications in 3D scene understanding.

Method

Given a scene and its instances represented across different modalities, namely RGB images, point clouds, CAD meshes, text referrals, and floorplans, the goal is to align all modalities within a shared embedding space. The Instance-Level Multimodal Interaction module learns a multi-modal embedding space for independent object instances. It captures modality interactions at the instance level within the context of a scene, using spatial pairwise relationships between the object instances. This is further enhanced by the Scene-Level Multimodal Interaction module, which jointly processes all instances to represent the scene with a single feature vector Fs. The instance and scene modules provide a unified, modality-agnostic embedding space; however, this requires semantic instance information that is consistent across modalities at inference time, which is challenging to obtain in practice. The Unified Dimensionality Encoders eliminate the dependency on precise semantic instance information by learning to process each scene modality independently while interacting with the feature vector Fs, thus progressively building a modality-agnostic embedding space. A minimal sketch of the underlying alignment idea is given below.
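To make the alignment objective concrete, the following is a minimal sketch (not the released implementation) of how scene features from two modality encoders, e.g. a point-cloud feature F3D and a floorplan feature F2D, could be pulled into a shared embedding space with a symmetric contrastive loss. The encoder outputs, feature dimension, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F


def contrastive_alignment_loss(f_3d: torch.Tensor,
                               f_2d: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """f_3d, f_2d: (batch, dim) scene features from two modality encoders.

    Matching rows are treated as positive pairs; all other rows in the batch
    act as negatives, pulling paired scenes together in the shared embedding
    space and pushing unpaired scenes apart.
    """
    f_3d = F.normalize(f_3d, dim=-1)
    f_2d = F.normalize(f_2d, dim=-1)

    logits = f_3d @ f_2d.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(f_3d.size(0), device=f_3d.device)

    # Symmetric loss: 3D -> 2D retrieval and 2D -> 3D retrieval.
    loss_3d_to_2d = F.cross_entropy(logits, targets)
    loss_2d_to_3d = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_3d_to_2d + loss_2d_to_3d)


if __name__ == "__main__":
    batch, dim = 8, 512
    f_3d = torch.randn(batch, dim)   # stand-in for point-cloud encoder output
    f_2d = torch.randn(batch, dim)   # stand-in for floorplan encoder output
    print(contrastive_alignment_loss(f_3d, f_2d))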

Cross-Modal Inference Pipeline

During inference for scene retrieval, we use our unified dimensionality encoders. Given a point cloud as the query modality representing the scene, we extract a feature vector F3D in the shared embedding space. To find the most similar 2D floorplan in a database, we locate the closest feature F2D and retrieve the corresponding scene.
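The retrieval step amounts to a nearest-neighbor search in the shared embedding space. Below is a minimal sketch of this step, assuming precomputed features; the function and variable names (retrieve_scene, scene_ids) are illustrative, not from the released code.

import numpy as np


def retrieve_scene(query_f3d: np.ndarray,
                   database_f2d: np.ndarray,
                   scene_ids: list[str]) -> str:
    """query_f3d: (dim,) feature of the query scene's point cloud.
    database_f2d: (num_scenes, dim) floorplan features in the shared space.
    Returns the id of the scene whose floorplan feature is closest.
    """
    q = query_f3d / np.linalg.norm(query_f3d)
    db = database_f2d / np.linalg.norm(database_f2d, axis=1, keepdims=True)
    similarities = db @ q                      # cosine similarity per scene
    return scene_ids[int(np.argmax(similarities))]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = rng.normal(size=(3, 512))                      # stand-in floorplan features
    query = db[1] + 0.01 * rng.normal(size=512)         # query close to scene_b
    print(retrieve_scene(query, db, ["scene_a", "scene_b", "scene_c"]))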

Qualitative Results - Cross-Modal Scene Retrieval

Given a scene in the query modality (floorplan), we aim to retrieve the same scene in the target modality (point cloud).



Citation
@article{sarkar_crossover,
      title={CrossOver: 3D Scene Cross-Modal Alignment},
      author={Sarkar, Sayan Deb and Miksik, Ondrej and Pollefeys, Marc and Baráth, Dániel Béla and Armeni, Iro}
      }


Acknowledgements
We thank Tao Sun, Jianhao Zheng, and Liyuan Zhu for the fruitful discussions.