I am a researcher working on cutting-edge research projects at the intersection of
Computer Vision, Computational Photography, and Machine Learning on the Computational Photography team of Meta Reality Labs.
My research interests cover 3D vision, neural rendering, low-level vision, and visual-linguistic understanding.
Google Scholar
Boosting View Synthesis with Residual Transfer
Xuejian Rong, Jia-Bin Huang,
Changil Kim, Johannes Kopf
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
We present a simple but effective technique to boost the rendering quality, which can be easily integrated with most volumetric view synthesis methods.
The core idea is to transfer color residuals (the difference between the input images and their reconstruction) from training views to novel views.
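As a minimal illustration of this idea, residual transfer can be sketched as a weighted blend of training-view residuals into a novel rendering. This is a hypothetical NumPy sketch, not the paper's actual method: it assumes the training views have already been warped into alignment with the novel view, and the function and parameter names are my own.

```python
import numpy as np

def transfer_residuals(novel_render, train_images, train_renders, weights):
    """Illustrative residual transfer (hypothetical sketch, not the paper's code).

    novel_render:  (H, W, 3) rendered novel view in [0, 1]
    train_images:  (N, H, W, 3) training photos, pre-warped to the novel view
    train_renders: (N, H, W, 3) reconstructions of those training views
    weights:       (N,) blending weights, e.g. from view proximity
    """
    # Residual = what the volumetric model failed to reconstruct per training view.
    residuals = train_images - train_renders            # (N, H, W, 3)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                     # normalize blend weights
    blended = np.tensordot(w, residuals, axes=1)        # (H, W, 3)
    # Add the transferred residual back onto the novel rendering.
    return np.clip(novel_render + blended, 0.0, 1.0)
```

In the actual method the residuals must be transported geometrically between viewpoints; the uniform pre-alignment assumed here is only for brevity.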
Robust Consistent Video Depth Estimation
Johannes Kopf, Xuejian Rong, Jia-Bin Huang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021   (Oral)
We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video.
Burst Denoising via Temporally Shifted Wavelet Transforms
Xuejian Rong, Denis Demandolx, Kevin Matzen, Priyam Chatterjee, Yingli Tian
European Conference on Computer Vision (ECCV), 2020
We propose an end-to-end trainable burst denoising pipeline that jointly captures high-resolution and high-frequency deep features derived from wavelet transforms.
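For readers unfamiliar with wavelet transforms, the building block underlying such features is easy to sketch. The following is a plain single-level 2D Haar transform, shown only to illustrate how an image splits into a coarse approximation and three high-frequency detail bands; it is not the paper's learned pipeline.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D Haar wavelet transform (illustrative only).

    x: (H, W) array with even H and W.
    Returns (ll, lh, hl, hh), each (H/2, W/2).
    """
    # Pairwise averages/differences along rows (vertical direction).
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical detail
    # Then along columns (horizontal direction).
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # coarse approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return ll, lh, hl, hh
```

The detail bands (lh, hl, hh) carry exactly the high-frequency content that denoising must preserve while suppressing noise.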
Unambiguous Text Localization, Retrieval, and Recognition for Cluttered Scenes
Xuejian Rong, Chucai Yi, Yingli Tian
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Accepted.
Extended our previous CVPR paper into an end-to-end pipeline spanning scene text detection, retrieval, and recognition.
Incremental Scene Synthesis
Benjamin Planche, Xuejian Rong, Ziyan Wu, Srikrishna Karanam, Harald Kosch, Yingli Tian, and Jan Ernst
Thirty-third Conference on Neural Information Processing Systems (NeurIPS), 2019
We incrementally generate complete and consistent 2D or 3D scenes with learned scene priors; real observations of an actual scene can be incorporated, while unobserved parts of the scene are hallucinated from the priors.
Applications include autonomous agent exploration and few-shot learning.
Towards Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes
Haiyan Wang, Xuejian Rong, Liang Yang, Yingli Tian
British Machine Vision Conference (BMVC), 2019
Presents a method for 3D point cloud segmentation using only 2D supervision. A graph-based pyramid feature
network is proposed to capture both global and local features of points, and a perspective rendering and
semantic fusion module is introduced to offer refined 2D supervision.
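The bridge between 2D labels and 3D points in such approaches is perspective projection. The sketch below shows only that basic pinhole operation, mapping 3D points into pixel coordinates where 2D labels can be looked up; the paper's rendering and fusion module is considerably more involved, and the names here are my own.

```python
import numpy as np

def project_points(points, K):
    """Pinhole projection of 3D points into the image plane (illustrative only).

    points: (N, 3) points in camera coordinates, z > 0
    K:      (3, 3) camera intrinsics matrix
    Returns (N, 2) pixel coordinates.
    """
    uvw = points @ K.T                  # apply intrinsics: (N, 3) homogeneous
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide by depth
```

Given these pixel coordinates, each 3D point can inherit the semantic label of the pixel it projects to, which is the essence of supervising 3D segmentation with 2D annotations.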
Towards Accurate Instance-level Text Spotting With Guided Attention
Haiyan Wang, Xuejian Rong, Yingli Tian
IEEE International Conference on Multimedia and Expo (ICME), 2019
Presents an effective end-to-end framework for detecting multi-lingual scene text in arbitrary
orientations by integrating a text attention model and a global enhancement block with the pixel-link
method, without adopting pretrained weights or extra synthetic datasets.
Unambiguous Scene Text Segmentation with Referring Expression Comprehension
Xuejian Rong, Chucai Yi, Yingli Tian
IEEE Transactions on Image Processing (TIP), Accepted.
Combines the strengths of both instance-level scene text segmentation and visual phrase grounding.
Unambiguous Text Localization and Retrieval for Cluttered Scenes
Xuejian Rong, Chucai Yi, Yingli Tian
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017   (Spotlight)
To utilize text instances for understanding natural scenes, we propose a framework that combines
image-based text localization with language-based context description for text instances.
Specifically, we explore the task of unambiguous text localization and retrieval: accurately localizing
a specific targeted text instance in a cluttered image, given a natural language description that refers
to it.
Evaluation of Low-Level Features for Real-World Surveillance Event Detection
Yang Xian, Xuejian Rong, Xiaodong Yang, Yingli Tian
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2017
We evaluate several of the most commonly used low-level features for real-world surveillance event detection.
Assistive Indoor Navigation for the Visually Impaired in Multi-Floor Environments
J. Pablo Munoz, Bing Li, Xuejian Rong, Jizhong Xiao, Yingli Tian, and Aris Arditi
IEEE International Conference on Cyber Technology (CYBER), 2017
(Best Paper Award)
Our system allows blind users to explore multi-floor environments with a wearable Tango device.
Adaptive Shrinkage Cascades for Blind Image Deconvolution
Xuejian Rong and Yingli Tian
IEEE International Conference on Digital Signal Processing (DSP), 2016   (Oral)
A framework is proposed for blind image deconvolution with patch-wise priors and adaptive shrinkage cascades.
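The elementary operator behind shrinkage-based restoration is soft-thresholding, which suppresses small (noise-dominated) coefficients while shrinking large ones. This is only the textbook operator for context; the paper's contribution lies in adapting and cascading it, which is not shown here.

```python
import numpy as np

def soft_shrink(x, t):
    """Soft-thresholding (shrinkage) operator (textbook form, illustrative only).

    Zeroes out coefficients with magnitude below threshold t and
    shrinks the rest toward zero by t, preserving sign.
    """
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```

A shrinkage cascade applies such operators in stages, with the threshold t adjusted per stage rather than fixed.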
Region Trajectories for Video Semantic Concept Detection
Yuancheng Ye, Xuejian Rong, and Yingli Tian
ACM International Conference on Multimedia Retrieval (ICMR), 2016
We introduce an algorithm based on region trajectories to establish the connections between object
localization in individual frames and video sequences.
ISANA: Wearable Context-Aware Indoor Assistive Navigation with Obstacle Avoidance for the Blind
Bing Li, J. Pablo Munoz, Xuejian Rong, Jizhong Xiao, Yingli Tian
ECCV Workshop on Assistive Computer Vision and Robotics (ACVR), 2016
We present a novel wearable context-aware indoor mapping and navigation system with obstacle avoidance for the blind.
Assisting Blind People to Avoid Obstacles: A Wearable Obstacle Stereo Feedback System Based on 3D Detection
Bing Li, Xiaochen Zhang, J. Pablo Munoz, Jizhong Xiao, Xuejian Rong, and Yingli Tian
IEEE International Conference on Robotics and Biomimetics (ROBIO), 2015
A wearable Obstacle Stereo Feedback (OSF) system for blind people, based on 3D obstacle detection, is presented to assist navigation.
Scene Text Recognition in Multiple Frames based on Text Tracking
Xuejian Rong, Chucai Yi, Xiaodong Yang, and Yingli Tian
IEEE International Conference on Multimedia and Expo (ICME), 2014
We propose a multi-frame scene text recognition method that tracks text regions in a video captured by a moving camera.
Template from Jon Barron