DeTEXT: A Database for Extracting TEXT from biomedical literature figures

Xu-Cheng Yin1*, Chun Yang1, Wei-Yi Pei1, Haixia Man2, Jun Zhang1, Erik Learned-Miller3 , Hong Yu3,4*

1 Department of Computer Science and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China

2 School of Foreign Studies, University of Science and Technology Beijing, Beijing 100083, China

3 School of Computer Science, University of Massachusetts Amherst, MA 01002, USA

4 Department of Quantitative Health Sciences, University of Massachusetts Medical School, MA 01605, USA

 

1. Abstract

     Hundreds of millions of figures are available in the biomedical literature, representing important biomedical experimental evidence. Since text appears richly in figures, automatically extracting such text may assist the task of mining information from figures. A high-quality gold standard is a necessary first step for building any automated system. We describe DeTEXT: A Database for Extracting TEXT from biomedical literature figures, the first publicly available, human-annotated, high-quality figure-text dataset. We describe how we selected representative figures from open-access full-text biomedical articles, and how we created the annotation guideline and the annotation tool. We then describe a variety of challenges for figure text detection, and report the annotation agreement and the statistics of the DeTEXT data. Finally, we recommend evaluation protocols for DeTEXT and discuss research directions concerning automated systems for extracting text from biomedical figures. DeTEXT is available at http://prir.ustb.edu.cn/DeTEXT/.

2. Dataset Construction and Image Annotation

     Following general strategies in document analysis and recognition (DAR), we construct and report DeTEXT: A Database for Extracting TEXT from biomedical literature figures. DeTEXT is designed for evaluating text detection and recognition algorithms on complex images (specifically, biomedical literature figures), and has several important features. First, similar to the figure dataset used in FigTExT (Kim and Yu, PLoS ONE 2011) but with a larger number of figures and articles, DeTEXT comprises 500 typical biomedical literature figures from about 300 full-text articles randomly picked from PubMed Central. Second, similar to the image dataset used in the recent ICDAR 2013 Robust Reading Competition but with much richer information, figures in DeTEXT are annotated not only with each text region’s orientation, location and ground-truth text, but also with the image quality and the text importance. Third and foremost, DeTEXT is the first public image dataset for biomedical literature figure text detection, recognition, and retrieval. It can easily be extended to a larger scale by adding more figures randomly selected from PubMed Central, and it is also well suited to serve as a benchmark dataset for international competitions (such as the ICDAR Robust Reading Competition).

    DeTEXT is collected from PubMed Central. It comprises 500 typical biomedical literature figures appearing in about 300 randomly selected full-text articles. The dataset is divided into three non-overlapping subsets: training, validation and testing. Details are shown in Table 1 below. We also construct publicly available 5-fold and 10-fold cross-validation datasets. Figure 1 shows some representative biomedical figures and their embedded text.

Table 1. Descriptions of training, validation and testing sets in DeTEXT, where full-text articles are randomly selected from PubMed Central.

 

Set            | No. of figures | No. of articles | Remarks
---------------|----------------|-----------------|--------------------------------------------------------
Training set   | 100            | 100             | One figure selected from each article
Validation set | 100            | 45              | Articles randomly selected; all common figures in these articles included until 100 figures are reached
Testing set    | 300            | 143             | Articles randomly selected; all common figures in these articles included until 300 figures are reached

    Firstly, the training set comprises 100 figures from 100 articles (one figure per article), maximizing the number of both figures and articles used for training. Generally speaking, the training set is used for selecting parameters and rules, and for training models and classifiers in text detection and recognition methods and systems. These resources may be altered, amended or annotated in any way that facilitates related research. Secondly, the validation set is composed of 100 figures from dozens of articles: we randomly select full-text articles and then include all common figures (where the text can be seen by a person) until 100 figures are collected. The validation set is used to evaluate, select and combine text detection and recognition systems; in a broader sense, it can be considered another part of the training set. Thirdly, the testing set comprises 300 figures and is used for measuring final performance and publishing results on common data across different text detection and recognition systems. Its figure collection strategy is the same as the one used for the validation set. Researchers are not allowed to alter, amend, or annotate these data in any way.

    The separate training, validation and testing sets above constitute one evaluation strategy for text detection and recognition in biomedical figures. At the same time, in order to utilize the available data more fully, there is another popular strategy for using the dataset: cross-validation. For example, if we pool all of the images (from the training, validation and testing sets) and perform 5-fold cross-validation, then each fold uses 400 images for training and 100 for testing. Under this plan, all 500 images can be used for testing, and each training set has size 400, increasing the number of images available for training, validation, and testing. Accordingly, we also construct 5-fold and 10-fold cross-validation datasets, which are publicly available at http://prir.ustb.edu.cn/DeTEXT/.
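The 5-fold scheme described above can be sketched as follows. This is an illustrative helper only (the official DeTEXT folds are fixed files published on the project site), and the `kfold_splits` function name and `img_*` identifiers are hypothetical:

```python
import random

def kfold_splits(items, k=5, seed=0):
    """Partition items into k disjoint folds; yield (train, test) lists.

    Each item appears in exactly one test fold; the remaining k-1 folds
    form the corresponding training set.
    """
    items = list(items)
    rng = random.Random(seed)
    rng.shuffle(items)
    folds = [items[i::k] for i in range(k)]          # k near-equal folds
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

# With all 500 DeTEXT figures pooled, each fold trains on 400 and tests on 100.
images = [f"img_{n:03d}" for n in range(500)]
for train, test in kfold_splits(images, k=5):
    assert len(train) == 400 and len(test) == 100
```

Running the same helper with `k=10` would give the 10-fold variant (450 training and 50 testing images per fold).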

Figure 1. Representative biomedical figures and their texts: (1) experimental results (gene sequence), (2) research models, and (3) biomedical objects.

    Before annotation, we set some basic requirements for properly detecting and recognizing figure text. The first is the figure quality requirement: we select biomedical figures whose text can be read by the annotators. Second, under the region unit requirement, each annotated region unit should consist of one term that includes one or more words. Here, a “word” unit is a set of sufficiently close characters lying on one alignment line. Most text regions are horizontal; a few have other orientations (including vertical). The third requirement concerns word length: a word to be annotated should contain at least two characters.

    Each figure in the database corresponds to a ground-truth file (a “.txt” file storing the annotation information), in which each line records the information of the text in the corresponding text region. The format of the ground-truth file (e.g., “ex.txt”) is illustrated in Figure 2.

Figure 2. An example of the annotation information.

    In Figure 2, the “difficulty” level indicates how difficult it is to detect and recognize the text in the annotated region, as determined by the image quality. It can be “normal”, “blur”, “small”, “color”, “short”, “complex_layout”, “complex_symbol”, or “specific_text”. “LT”, “TR”, “RB” and “BL” are the left-top, top-right, right-bottom, and bottom-left points of the text region, respectively. The “horizontal/oriented” flag indicates whether the text region is aligned horizontally (0) or is oriented, including vertically (1). “Importance” represents the importance of the text for NLP and information retrieval.
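A reader of the ground-truth files could parse each annotation line along these lines. Note that the exact field order and separators are an illustrative assumption here (the authoritative layout is the “ex.txt” example shown in Figure 2); the `TextRegion` and `parse_gt_line` names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TextRegion:
    quad: list        # [LT, TR, RB, BL] corner points as (x, y) tuples
    oriented: int     # 0 = horizontal, 1 = oriented (incl. vertical)
    difficulty: str   # "normal", "blur", "small", "color", "short", ...
    importance: str   # importance of the text for NLP / retrieval
    text: str         # ground-truth transcription

def parse_gt_line(line):
    """Parse one annotation line, assumed to be comma-separated as:
    x1,y1,x2,y2,x3,y3,x4,y4,orientation,difficulty,importance,"text"
    """
    parts = line.rstrip("\n").split(",", 11)
    coords = list(map(int, parts[:8]))
    quad = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    return TextRegion(
        quad=quad,
        oriented=int(parts[8]),
        difficulty=parts[9],
        importance=parts[10],
        text=parts[11].strip().strip('"'),
    )
```

Splitting with a maximum of 11 splits keeps any commas inside the quoted transcription intact.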

 3. Data Analysis

      We specifically analyze the various text-related challenges for text detection and recognition in biomedical literature figures, briefly categorized in Table 2. More details are given in our description paper (see the Reference section).

Table 2. Challenges for text detection and recognition from biomedical literature figures.

Challenges                                               | Sub-challenges       | Remarks (“difficulty”)
---------------------------------------------------------|----------------------|-------------------
General challenges for text extraction from common       | Blurred text         | “blur”
complex images                                           | Small-size character | “small”
                                                         | Color image / text   | “color”
                                                         | Short word           | “short”
Specific challenges for text extraction from biomedical  | Complex layout       | “complex_layout”
literature figures                                       | Complex symbol       | “complex_symbol”
                                                         | Specific text        | “specific_text”
                                                         | Skew text            |

4. DeTEXT Online

     Our DeTEXT dataset (version 0.1) with the annotation information is available for download HERE (the training and validation sets; for the testing set, please refer to the platform of the ICDAR 2017 DeTEXT Competition, the Competition on Text Extraction from Biomedical Literature Figures).

     Our annotation tool is also available at http://prir.ustb.edu.cn/DeTEXT/setup4.8.1.zip.

5. Reference

      Xu-Cheng Yin, Chun Yang, Wei-Yi Pei, Haixia Man, Jun Zhang, Erik Learned-Miller, and Hong Yu, "DeTEXT: A database for evaluating text extraction from biomedical literature figures," PLoS ONE, vol. 10, no. 5, e0126200, 2015. (Paper Link)

6. Corresponding Author

        Xu-Cheng Yin   Ph.D.,  Professor

   Mail:      Department of Computer Science and Technology, School of Computer and Communication Engineering,

                 University of Science and Technology Beijing,

                 No. 30, Xueyuan Road, Haidian District, Beijing 100083, China

    Office:   ROOM 1005, Information Building

    Tel:       +86-10-8237-1191

    Fax:      +86-10-6233-2873  

    Homepage: http://prir.ustb.edu.cn/yin/

    Email:      

---------------------------------------------------------------------

Last Modified:  Jan 22, 2017