| 
          
            | 
                Sayan Deb Sarkar
               I'm a 2nd-year PhD student at Stanford University in the Gradient Spaces Group, 
                advised by Prof. Iro Armeni, 
                part of the Stanford Vision Lab (SVL). In summer '25, I interned with the Microsoft Spatial AI Lab, working on efficient video understanding in spatial context. 
               
                Before starting my PhD, I was a CS master's student at ETH Zürich, supervised by Prof. Marc Pollefeys, working on 
                aligning real-world 3D environments from multi-modal data. I graduated with a Bachelor's in 
                Information Technology from Manipal University, India, where I spent time working on face recognition and medical imaging problems.
               
                In 2020-21, I spent a wonderful time working with Shreyas Hampali and Mahdi Rad at 
                Prof. Vincent Lepetit's 
                lab on hand-object pose estimation and Monte Carlo scene search for 3D scene understanding. 
                I view them as the mentors who introduced me to research, and I strive to learn from them.
               
                My research interests are in multimodal 3D scene understanding and interactive editing. I am always looking for research collaborations; get in touch if you have something relevant. 
                If you're around the Bay Area, feel free to reach out for a cup of coffee!
               
                Email  / 
                CV  / 
                Google Scholar  / 
                Github  / 
                Twitter  / 
                LinkedIn
               |   |  
          
            
              | Research 
                  My research interests lie at the intersection of Computer Vision and Machine Learning, specifically in the areas of multimodal data representations for spatial understanding. 
                 |  
          
            
              |  | GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer Sayan Deb Sarkar, Sinisa Stekovic, Vincent Lepetit, Iro Armeni
 arXiv |    
                  Project Page |
                   Video |
                  Code
 Neural Information Processing Systems (NeurIPS), 2025
 
 
                  A training-free method that steers a pre-trained generative rectified flow with differentiable guidance for robust, geometry-aware 3D appearance transfer across shapes and modalities. 
                   |  
              |  | SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment Binod Singh*, Sayan Deb Sarkar*, Iro Armeni
 arXiv |    
                  Project Page |
                   Video |
                  Code
 arXiv 2025
 
 
                  3D Scene Graph alignment framework across modalities using open-vocabulary cues and learned joint embeddings, achieving robust performance under noise and low overlap. 
                  Master's Student Project.
 |  
              |  | CrossOver: 3D Scene Cross-Modal Alignment Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Dániel Béla Baráth, Iro Armeni
 arXiv |    
                  Project Page |
                   Video |
                  Code
 Computer Vision and Pattern Recognition (CVPR), 2025
 🏆 Highlight (top 3%)
 Featured: Open Robotics
 
 
                  Cross-modal alignment method for 3D scenes that learns a unified, modality-agnostic embedding space, enabling scene-level alignment without semantic annotations.
                   |  
              |  | SGAligner: 3D Scene Alignment with Scene Graphs Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Dániel Béla Baráth, Iro Armeni
 arXiv |       
                  Project Page |
                   Video |
                  Code
 International Conference on Computer Vision (ICCV), 2023
 Featured: RSIP Computer Vision Magazine, Learn OpenCV Blog
 
 
                  3D Scene Graph Alignment robust to in-the-wild scenarios, powering point cloud registration and map integration.
                   |  
              |  | Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation Shreyas Hampali, Sayan Deb Sarkar, Mahdi Rad, Vincent Lepetit
 Computer Vision and Pattern Recognition (CVPR), 2022
 🏆 Oral (top 4.2%)
 arXiv |       
                  Project Page |
                  Video |
                  Code
  Efficient network for joint two-hand and object pose estimation in complex interactions, paired with the new H2O-3D dataset of two-hand interaction with YCB objects.  |  
              |  | Monte Carlo Scene Search for 3D Scene Understanding Shreyas Hampali*, Sinisa Stekovic*, Sayan Deb Sarkar, Chetan Srinivasa Kumar, Friedrich Fraundorfer, Vincent Lepetit
 Computer Vision and Pattern Recognition (CVPR), 2021
 arXiv  |
                  Project Page |
                  Video |
                  Code
  Monte Carlo Tree Search (MCTS)-based analysis-by-synthesis method that recovers the complete scene (3D layout + objects) from a noisy RGB-D scan. 
 |  
              |  | General 3D Room Layout from a Single View by Render-and-Compare Sinisa Stekovic, Shreyas Hampali, Mahdi Rad, Sayan Deb Sarkar, Friedrich Fraundorfer, Vincent Lepetit
 European Conference on Computer Vision (ECCV), 2020
 arXiv  |
                  Project Page |
                  Video  |
                  Code
  3D layout estimation from a single perspective view that recovers complex non-cuboid layouts by solving a constrained discrete optimization problem.  |  
          
          | Misc 
              Workshop Organisation: CV4AEC@CVPR 2023, 2024
              
               
                Conference Review: CVPR, ICCV, ECCV, NeurIPS, ICRA |  |