Xuejian Rong

I am a researcher at the Media Lab of The City College, City University of New York, where I work at the intersection of deep learning, computer vision, and image processing. I am finishing a Ph.D. advised by Prof. Yingli Tian. My current research focuses on inference and learning for scene text detection and recognition in the wild, and on visual-linguistic understanding of scene text images.

My thesis proposal is titled "Deep Features for Context-aware Text Extraction" and the slides are available as linked. I interned at Siemens Corporate Research, where I worked on visual representation learning for novel view synthesis. I obtained my B.E. degree from Nanjing University of Aeronautics and Astronautics in 2013.

Email  /  Resume  /  Google Scholar  /  LinkedIn



  • I will be working at Siemens Corporate Research as a research intern until December 2018.

  • Thanks to NVIDIA for supporting my research with the NVIDIA GPU Grant.


Unambiguous Text Localization, Retrieval, and Recognition for Cluttered Scenes
Xuejian Rong, Chucai Yi, Yingli Tian
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Under Review

Extends our previous CVPR paper into an end-to-end pipeline spanning scene text detection, retrieval, and recognition.


Incremental Scene Synthesis
Benjamin Planche, Xuejian Rong, Ziyan Wu, Srikrishna Karanam, Harald Kosch, Yingli Tian, and Jan Ernst
arXiv preprint arXiv:1811.12297, Under Review

We incrementally generate complete and consistent 2D or 3D scenes using learned scene priors. Real observations of an actual scene can be incorporated, and unobserved parts of the scene can be hallucinated.

Applications include autonomous agent exploration and few-shot learning.


Unambiguously Indicated Characterness for Referring Scene Text Segmentation
Xuejian Rong, Chucai Yi, Yingli Tian
IEEE Transactions on Image Processing (TIP), Under Review
code / dataset

Combines instance-level scene text segmentation with visual phrase grounding.


Unambiguous Text Localization and Retrieval for Cluttered Scenes
Xuejian Rong, Chucai Yi, Yingli Tian
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017   (Spotlight)
code / dataset

To utilize text instances for understanding natural scenes, we have proposed a framework that combines image-based text localization with language-based context description for text instances.

Specifically, we explore the task of unambiguous text localization and retrieval, to accurately localize a specific targeted text instance in a cluttered image given a natural language description that refers to it.


Evaluation of Low-Level Features for Real-World Surveillance Event Detection
Yang Xian, Xuejian Rong, Xiaodong Yang, Yingli Tian
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2017

We evaluate several of the most commonly used low-level features for real-world surveillance event detection tasks.


Assistive Indoor Navigation for the Visually Impaired in Multi-Floor Environments
J. Pablo Munoz, Bing Li, Xuejian Rong, Jizhong Xiao, Yingli Tian, and Aris Arditi
IEEE International Conference on Cyber Technology (CYBER), 2017   (Best Paper Award)

Our system allows blind users to explore multi-floor environments with a wearable Tango device.


Guided Text Spotting for Assistive Blind Navigation in Unfamiliar Environments
Xuejian Rong, Bing Li, Jizhong Xiao, Aris Arditi, and Yingli Tian
The 12th International Symposium on Visual Computing (ISVC), 2017   (Oral)

We propose an assistive navigation system built on text spotting, using stroke-specific features and a subsequent text tracking process.


Adaptive Shrinkage Cascades for Blind Image Deconvolution
Xuejian Rong and Yingli Tian
IEEE International Conference on Digital Signal Processing (DSP), 2016   (Oral)

A framework is proposed for blind image deconvolution using a patch-wise prior and adaptive shrinkage cascades.


Recognizing Text-based Traffic Guide Panels with Cascaded Localization Network
Xuejian Rong and Yingli Tian
ECCV Workshop on Computer Vision for Road Scene Understanding and Autonomous Driving (CVRSUAD), 2016
poster / dataset

A top-down framework is introduced for automatic localization and recognition of text-based traffic guide panels in natural scene images captured by car-mounted cameras.


Region Trajectories for Video Semantic Concept Detection
Yuancheng Ye, Xuejian Rong, and Yingli Tian
ACM International Conference on Multimedia Retrieval (ICMR), 2016

We introduce an algorithm based on region trajectories to establish the connections between object localization in individual frames and video sequences.


ISANA: Wearable Context-Aware Indoor Assistive Navigation with Obstacle Avoidance for the Blind
Bing Li, J. Pablo Munoz, Xuejian Rong, Jizhong Xiao, Yingli Tian, and Aris Arditi
ECCV Workshop on Assistive Computer Vision and Robotics (ACVR), 2016

We present a novel mobile, wearable, context-aware indoor mapping and navigation system with obstacle avoidance for the blind.


Assisting Blind People to Avoid Obstacles: An Wearable Obstacle Stereo Feedback System based on 3D Detection
Bing Li, Xiaochen Zhang, J. Pablo Munoz, Jizhong Xiao, Xuejian Rong, and Yingli Tian
IEEE International Conference on Robotics and Biomimetics (ROBIO), 2015

A wearable Obstacle Stereo Feedback (OSF) system for blind people, based on obstacle detection in 3D space, is presented to assist navigation.


CCNY at TRECVID 2015: Video Semantic Concept Localization
Yuancheng Ye, Xuejian Rong, Xiaodong Yang, and Yingli Tian
NIST TREC Video Retrieval Evaluation Workshop (TRECVID), 2015

We present a novel video-based object localization system, which is developed for the Semantic Localization task of TRECVID 2015.

We won 1st place in this track. This work was extended into an ICMR 2016 paper.


CCNY at TRECVID 2014: Surveillance Event Detection
Yang Xian, Xuejian Rong, Xiaodong Yang, and Yingli Tian
NIST TREC Video Retrieval Evaluation Workshop (TRECVID), 2014
poster / dataset

We present two video-based event detection systems for the Surveillance Event Detection (SED) task of TRECVID 2014.

We won 3rd place in this track. This work was extended into a TCSVT paper.


Scene Text Recognition in Multiple Frames based on Text Tracking
Xuejian Rong, Chucai Yi, Xiaodong Yang, and Yingli Tian
IEEE International Conference on Multimedia and Expo (ICME), 2014

We propose a multi-frame scene text recognition method that tracks text regions in video captured by a moving camera.