ICDAR2017 Competition on Text Extraction from Biomedical Literature Figures

(ICDAR2017 DeTEXT Competition)



     Figures are ubiquitous in biomedical literature, and they represent important biomedical knowledge. The sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Consequently, during the last few years, figure classification, retrieval and mining have garnered significant attention in the biomedical research communities. Since text frequently appears in figures, semantic analysis of such text may assist the task of mining information from figures. Little research, however, has specifically explored automated text extraction from biomedical figures and their semantic analysis.

    Unlike images in the open domain, biomedical figures present unique challenges. For example, biomedical figures typically have complex layouts, small font sizes, short text, domain-specific text, complex symbols and irregular text arrangement. The quality of figures varies across publishers. Consequently, conventional OCR technologies and systems, which are typically trained on open-domain images, do not work well on biomedical figures. To better leverage biomedical figures in future research and analysis, and to make them more searchable and computable, we propose Semantic Interpretation of Biomedical Figure Mining to address various challenges related to semantic biomedical figure mining.


    The Semantic Interpretation of Biomedical Figure Mining Challenge is being conducted to assess the capability of text detection, recognition, mining and even NLP algorithms to correctly detect and recognize text appearing in biomedical literature figures. This ICDAR2017 Competition (the ICDAR2017 DeTEXT Competition) focuses on extracting, that is, detecting and recognizing, text from biomedical literature figures.

2017-Jan-22 -- ICDAR2017 DeTEXT Competition Announcement    

    This ICDAR2017 Competition is based on an open dataset, DeTEXT: A Database for Extracting TEXT from biomedical literature figures (Dataset Paper). DeTEXT is used to evaluate text detection and recognition algorithms for complex images (specifically for biomedical literature figures), and has several important features. First, DeTEXT is composed of 500 typical biomedical literature figures drawn from about 300 full-text articles randomly picked from PubMed Central. Second, figures in DeTEXT are annotated with not only the text region's orientation, location and ground truth text, but also the image quality and the text importance. Third and foremost, DeTEXT is the first public image dataset for biomedical literature figure detection, recognition, and retrieval. It can easily be extended to a larger set by adding more figures randomly selected from PubMed Central.

    The 14th IAPR International Conference on Document Analysis and Recognition (ICDAR 2017) will take place in Kyoto, Japan. ICDAR, held every two years, is the premier international forum for researchers and practitioners in the document analysis and recognition community to identify, encourage and exchange ideas on state-of-the-art technology in document analysis, understanding, retrieval, and performance evaluation, as well as related pattern recognition methods and trends.


Competition Tasks

    The competition is set up around three tasks:
    (1) Text Localization (Text Detection), where the objective is to obtain a rough estimation of the text areas in the figure, in terms of bounding boxes (with four corner points) that correspond to parts of text (words or text lines).
    (2) Word Recognition, where the locations (bounding boxes) of words in the figure are assumed to be known and the corresponding text transcriptions are sought (without specific dictionaries).
    (3) End-to-End Text Recognition, where the objective is to localize and recognize all words in the figure in a single step (without specific dictionaries).
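
    To make the first two tasks more concrete, the following is a minimal sketch of how a detected box could be compared with a ground-truth box, and a predicted transcription with a ground-truth word. It is not the competition's evaluation protocol, which is not described on this page. Intersection-over-union (IoU) and edit distance are simply common measures in text detection and recognition, and are used here as assumptions. The function names, and the simplification of each four-corner box to its enclosing axis-aligned rectangle, are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Quad = List[Point]  # four corner points of a bounding box, as in Task 1


def enclosing_rect(quad: Quad) -> Tuple[float, float, float, float]:
    """Axis-aligned rectangle (x0, y0, x1, y1) enclosing a four-corner box.

    Simplification (assumption): rotated boxes are approximated by
    their enclosing rectangle.
    """
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return min(xs), min(ys), max(xs), max(ys)


def rect_iou(a: Quad, b: Quad) -> float:
    """Intersection-over-union of the enclosing rectangles of two boxes."""
    ax0, ay0, ax1, ay1 = enclosing_rect(a)
    bx0, by0, bx1, by1 = enclosing_rect(b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between a predicted and a ground-truth word."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


# Illustrative values only; these are not taken from DeTEXT.
ground_truth_box = [(0, 0), (10, 0), (10, 10), (0, 10)]
detected_box = [(5, 0), (15, 0), (15, 10), (5, 10)]
overlap = rect_iou(ground_truth_box, detected_box)
word_errors = edit_distance("kinase", "kinose")
```

    Under these assumptions, a detection would typically count as correct when its IoU with a ground-truth box exceeds a chosen threshold, and a recognized word when its edit distance to the ground truth is zero. End-to-end recognition (Task 3) would combine both checks.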

Competition Platform

    This ICDAR2017 Competition (on Text Extraction from Biomedical Literature Figures) will use the Robust Reading Competition (RRC) portal for organizing the competition, serving the participants, and evaluating the submissions.

    HERE is the competition platform for ICDAR2017 Competition on Text Extraction from Biomedical Literature Figures (ICDAR2017 DeTEXT Competition).


Tentative Timeline 

Registration, DeTEXT training set release: until May 31, 2017

DeTEXT data (testing set) release: June 10, 2017

Submission of results deadline: June 30, 2017

Results presentation: November 10-15, 2017



     For further information please contact Xu-Cheng YIN (xuchengyin AT ustb.edu.cn) and Hong YU (Hong.Yu AT umassmed.edu).



Chun YANG, PhD Student at Department of Computer Science and Technology, University of Science and Technology Beijing, China.

Xu-Cheng YIN, Professor and Deputy Chair at Department of Computer Science and Technology, University of Science and Technology Beijing, China.

Hong YU, Professor, Dept of Quantitative Health Sciences, University of Massachusetts Medical School Worcester; Adjunct Professor, School of Computer Science, University of Massachusetts Amherst; Research Health Scientist, VA Central Western Massachusetts, USA.

Dimosthenis KARATZAS, Senior Research Fellow and Associate Director at the Computer Vision Centre, Universitat Autònoma de Barcelona, Spain.


Yu CAO, Associate Professor, Department of Computer Science, Co-director, UMass Center for Digital Health, University of Massachusetts Lowell, MA, USA.


Dexter PRATT,  NDEx Project Director, Department of Medicine, University of California, San Diego, CA, USA.


Guangping GAO, Professor, Dept of Microbiology and Physiological Systems, University of Massachusetts Medical School Worcester, MA, USA.