
iTERM: Intelligent TExt Reading in ubiquitous Multimedia

            -- Robust Text Detection, Recognition, Retrieval and Mining in Natural Scenes, Web Images, Ubiquitous Documents, and Videos

Robust Text Detection

    Text detection in natural scenes and web images is an important prerequisite for many content-based image analysis tasks. We propose an accurate and robust method for detecting text in natural scenes and web images. A fast and effective pruning algorithm extracts Maximally Stable Extremal Regions (MSERs) as character candidates by minimizing regularized variations. Character candidates are then grouped into text candidates by a single-link clustering algorithm, whose distance weights and clustering threshold are learned automatically by a novel self-training distance metric learning algorithm. A character classifier estimates the posterior probability that each text candidate is non-text; candidates with high non-text probabilities are eliminated, and the remaining text is identified with a text classifier. The proposed system is evaluated on the ICDAR 2011 Robust Reading Competition database, where its f-measure exceeds 76%, compared with 71% for the previous state of the art. Experiments on multilingual, street view, multi-orientation, and even born-digital databases also demonstrate the effectiveness of our method.
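The grouping step can be illustrated with a minimal sketch. This is not the implementation from [1]: the three pairwise features, the weights, and the threshold below are hypothetical placeholders (in the actual system the weights and threshold are learned). It relies only on the fact that single-link clustering cut at a fixed threshold is equivalent to taking connected components over all pairs whose distance does not exceed the threshold.

```python
from itertools import combinations


def weighted_distance(a, b, weights):
    """Weighted sum of simple pairwise features (hypothetical feature set)."""
    mean_h = (a["h"] + b["h"]) / 2.0
    features = (
        abs(a["x"] - b["x"]) / mean_h,  # horizontal gap, normalized by height
        abs(a["y"] - b["y"]) / mean_h,  # vertical offset, normalized by height
        abs(a["h"] - b["h"]) / mean_h,  # relative height difference
    )
    return sum(w * f for w, f in zip(weights, features))


def single_link_clusters(candidates, weights, threshold):
    """Group candidates: two items share a cluster if a chain of pairs,
    each with distance <= threshold, connects them (single-link rule)."""
    parent = list(range(len(candidates)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(candidates)), 2):
        if weighted_distance(candidates[i], candidates[j], weights) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(candidates)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy character candidates: three characters on one line plus a far-away region.
chars = [
    {"x": 0, "y": 0, "h": 10},
    {"x": 12, "y": 1, "h": 10},
    {"x": 24, "y": 0, "h": 11},
    {"x": 200, "y": 150, "h": 30},
]
text_candidates = single_link_clusters(chars, weights=(1.0, 1.0, 1.0), threshold=2.0)
```

With these toy values, candidates 0 and 2 are farther apart than the threshold but still end up in the same group because each is close to candidate 1. That chaining behaviour is what lets single-link clustering follow a line of characters.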
    Our technology ("USTB_TexStar") won first place in both "Text Localization in Real Scenes" (Challenge 2) and "Text Localization in Born-Digital Images (Web and Email)" (Challenge 1) in the ICDAR 2013 Robust Reading Competition (see Chinese news coverage, e.g., Science and Technology Daily and China Science Daily). We also set up online demos and Android smartphone demos. Technical details are given in our IEEE TPAMI [1], SIGIR [2], and ICPR [3] papers.

Multi-Orientation and Multi-View Text Detection
    Most current research focuses only on horizontal or near-horizontal scene text. In our project [1, 4], we first design a unified distance metric learning framework for adaptive hierarchical clustering, which simultaneously learns the similarity weights (to adaptively combine different feature similarities) and the clustering threshold (to automatically determine the number of clusters). We then propose an effective multi-orientation and multi-view scene text detection system that constructs text candidates by grouping character candidates with this adaptive hierarchical clustering. The system is evaluated on a public multi-orientation scene text database (MSRA-TD500), where its f-measure is 71%, compared with 60% for a recent state-of-the-art method. We also collect and construct a more practical and challenging multi-orientation and multi-view scene text dataset (USTB-SV1K) from Google Street View. A demo of our technology and the USTB-SV1K dataset are available online.
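To make the idea of learning the weights and the threshold together more concrete, here is a small sketch that fits both with plain logistic regression on labeled pairs. This is a stand-in technique chosen for brevity, not the distance metric learning framework of [1, 4]. The two pairwise features and the toy data are hypothetical. A pair is predicted to belong to different text lines when its weighted distance exceeds the learned threshold.

```python
import math


def learn_weights_and_threshold(pairs, labels, lr=0.5, epochs=5000):
    """Fit p(different) = sigmoid(w . x - t) by batch gradient descent.

    pairs  : list of pairwise feature vectors x
    labels : 0 if the pair belongs to the same text, 1 otherwise
    Returns (weights, threshold): the pair distance is w . x, and the
    decision boundary w . x = t acts as the clustering threshold.
    """
    n_features = len(pairs[0])
    weights = [0.0] * n_features
    threshold = 0.0
    n = len(pairs)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_t = 0.0
        for x, y in zip(pairs, labels):
            score = sum(w * f for w, f in zip(weights, x)) - threshold
            err = 1.0 / (1.0 + math.exp(-score)) - y
            for k in range(n_features):
                grad_w[k] += err * x[k] / n
            grad_t -= err / n  # d(score)/d(threshold) = -1
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        threshold -= lr * grad_t
    return weights, threshold


# Hypothetical pairwise features: (normalized spatial gap, height difference).
same_text = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.3, 0.2)]
different_text = [(1.0, 0.8), (0.9, 1.2), (1.5, 0.7), (1.2, 1.0)]

weights, threshold = learn_weights_and_threshold(
    same_text + different_text,
    [0] * len(same_text) + [1] * len(different_text),
)


def distance(x):
    """Learned weighted distance for one pairwise feature vector."""
    return sum(w * f for w, f in zip(weights, x))
```

The learned `weights` and `threshold` could then be passed to any threshold-based hierarchical clustering routine, so that the number of clusters follows from the data rather than being fixed in advance.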


Effective End-to-End Scene Text Recognition
    Currently, many end-to-end scene text recognition technologies focus only on word spotting with a small, fixed lexicon, while other state-of-the-art open-vocabulary systems have very limited performance. Building on our previous scene text detection technology, we propose an accurate open-vocabulary end-to-end scene text recognition system [5]. For text segmentation, we extract text by combining results from both pruned Maximally Stable Extremal Regions and local thresholding. More importantly, for word recognition, we set up a classification switching strategy for fusing “shallow” and “deep” classifiers, which dynamically combines a conventional open-source OCR engine with convolutional neural networks. The proposed system is evaluated on a popular and challenging public database (ICDAR 2015 Robust Reading Competition Challenge 2); its f-measure is 77.37% on the End-To-End Recognition (Generic) task, better than the previous state-of-the-art performance. Moreover, our system won 1st place in three ICDAR 2015 Robust Reading Competition tasks: "End-To-End Focused Scene Text Recognition (Generic)", "End-To-End Born-Digital Image Text Recognition (Generic)", and "End-To-End Born-Digital Image Text Recognition (Weak)".
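One plausible reading of a classification switching strategy is sketched below, purely as an illustration: the switching rule used in [5] is not described here, so the confidence threshold, the function names, and the stub recognizers are all assumptions. The sketch trusts the shallow recognizer when it is confident and otherwise consults the deep one, keeping whichever answer has the higher confidence.

```python
from typing import Callable, Tuple

# A recognizer maps a word image to (text, confidence in [0, 1]).
Recognizer = Callable[[object], Tuple[str, float]]


def switch_recognize(
    word_image: object,
    shallow: Recognizer,
    deep: Recognizer,
    min_confidence: float = 0.8,  # hypothetical switching threshold
) -> Tuple[str, float, str]:
    """Return (text, confidence, source) using a confidence-based switch."""
    text, conf = shallow(word_image)
    if conf >= min_confidence:
        return text, conf, "shallow"
    # The shallow result is uncertain: fall back to the deep classifier,
    # but keep the shallow answer if the deep one is even less confident.
    deep_text, deep_conf = deep(word_image)
    if deep_conf >= conf:
        return deep_text, deep_conf, "deep"
    return text, conf, "shallow"


# Stub recognizers standing in for an OCR engine and a CNN (illustrative only).
def stub_ocr(image):
    return ("CHURCH", 0.95) if image == "clean" else ("CHVRCH", 0.40)


def stub_cnn(image):
    return ("CHURCH", 0.90)


clean_result = switch_recognize("clean", stub_ocr, stub_cnn)
noisy_result = switch_recognize("noisy", stub_ocr, stub_cnn)
```

Calling the deep model only when needed keeps the cheaper recognizer on the fast path. A real system would need confidence scores from the two models that are calibrated to be comparable.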

Text Detection, Recognition and Retrieval in Street Views and Videos
    When driving to an unfamiliar place in a city, a driver typically relies on a GPS navigator and still has to look around once near the target location. Even so, it is easy to miss the target and then have to drive a long way to turn around. Others use Google Street View to search for a place, but browsing online to find a target sign on a small display screen is time-consuming and tedious. In our project [6], we propose an automatic text retrieval system for street views, intended for both driving and online browsing. Our system integrates a variety of information retrieval, text detection and recognition, word spotting, and image retrieval techniques. With it, the target place can be located automatically as you drive close to it, and the target view can be retrieved automatically as you approach the target while browsing online street views.
    We implement a typical and realistic example system of our technology, named GooStreet [6]. Type in a place name, e.g., "Times Square Church, New York", and our text retrieval technology, combined with Google Maps and Google Street View, returns the target place automatically marked in the street view image.
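The retrieval step can be sketched as follows, under the assumption that text detection and recognition have already produced a list of words for each street view frame. The frame identifiers, the similarity threshold, and the scoring rule are hypothetical and are not taken from GooStreet. Fuzzy matching with Python's standard `difflib` tolerates small recognition errors.

```python
from difflib import SequenceMatcher


def word_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_frames(query, frame_words, min_similarity=0.75):
    """Rank frames by the fraction of query words found among their words.

    frame_words maps a frame id to the words recognized in that frame.
    Frames that match no query word are left out of the ranking.
    """
    query_words = query.lower().replace(",", " ").split()
    if not query_words:
        return []
    ranked = []
    for frame_id, words in frame_words.items():
        hits = sum(
            1
            for q in query_words
            if any(word_similarity(q, w) >= min_similarity for w in words)
        )
        score = hits / len(query_words)
        if score > 0:
            ranked.append((frame_id, score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical recognition output for three street view frames.
frames = {
    "f1": ["TIMES", "SQUARE", "CHURCH"],
    "f2": ["PIZZA", "OPEN"],
    "f3": ["TIMFS", "SQUARE"],  # "TIMES" misread by the recognizer
}
ranking = rank_frames("Times Square Church", frames)
```

Here frame `f1` matches all three query words, `f3` still matches two despite the misread character, and `f2` is dropped. A deployed system would combine such text scores with location and image retrieval cues.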
    For text detection and recognition in videos, our technology [9] also won 1st place in the ICDAR 2015 Robust Reading Competition "Video Text Detection" task.


Text Detection and Recognition from Biomedical Literature Figures

Hundreds of millions of figures are available in the biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. We develop DeTEXT, a database for evaluating text extraction from biomedical literature figures [10]. It is the first publicly available, human-annotated, high-quality, and large-scale figure-text dataset, with 288 full-text articles, 500 biomedical figures, and 9308 text regions. The article [10] describes how figures were selected from open-access full-text biomedical articles and how the annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations, summarize the statistics of the DeTEXT data, and make evaluation protocols for DeTEXT available. Finally, we lay out the challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for download (see [10]).


[1] Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao, “Robust text detection in natural scene images,” IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), vol. 36, no. 5, pp. 970-983, 2014.
[2] Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao, “Accurate and robust text detection: A step-in for text retrieval in natural scene images,” Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’13), 2013.
[3] Xuwang Yin, Xu-Cheng Yin, Hong-Wei Hao, and Khalid Iqbal, “Effective text localization in natural scene images with MSER, geometry-based grouping and AdaBoost,” Proceedings of the 21st International Conference on Pattern Recognition (ICPR’12), 2012.
[4] Xu-Cheng Yin, Wei-Yi Pei, Xuwang Yin, Jun Zhang, and Hong-Wei Hao, “Multi-orientation scene text detection with adaptive clustering,” IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), vol. 37, no. 9, pp. 1930-1937, 2015.
[5] Xu-Cheng Yin, Chun Yang, Wei-Yi Pei, and Hong-Wei Hao, “End-to-end scene text recognition by fusing shallow and deep classifiers,” Technical Report, 2013.
[6] Xu-Cheng Yin, Xuwang Yin, Wei-Yi Pei, Jun Zhang, Pei-Feng Hu, and Hong-Wei Hao, “Text retrieval in street views,” Technical Report, 2013.
[7] Khalid Iqbal, Xu-Cheng Yin, Xuwang Yin, Hazrat Ali, and Hong-Wei Hao, “Classifier comparison for MSER-based text classification in scene images,” Proceedings of International Joint Conference on Neural Networks (IJCNN’13), 2013.
[8] Khalid Iqbal, Xu-Cheng Yin, Hong-Wei Hao, Sohail Asghar, and Hazrat Ali, “Bayesian network scores based text localization in scene images,” Proceedings of International Joint Conference on Neural Networks (IJCNN’14), 2014.
[9] Ze-Yu Zuo, Shu Tian, Wei-Yi Pei, and Xu-Cheng Yin, “Multi-strategy tracking based text detection in scene videos,” Proceedings of 13th International Conference on Document Analysis and Recognition (ICDAR’15), 2015.
[10] Xu-Cheng Yin, Chun Yang, Wei-Yi Pei, Haixia Man, Jun Zhang, Erik Learned-Miller, and Hong Yu, “DeTEXT: A database for evaluating text extraction from biomedical literature figures,” PLoS ONE, vol. 10, no. 5, e0126200, 2015.


Funding and Sponsoring
This research project is partly supported by the National Natural Science Foundation of China (grants 61105018 and 61473036) and Samsung Research.

