DocFigure: A Dataset for Scientific Document Figure Classification

Authors: K V Jobin, Ajoy Mondal and C. V. Jawahar

Abstract

Document figure classification (DFC) is an important stage of a document image understanding system. Designing a DFC system requires well-defined figure categories and a dataset. To the best of the authors' knowledge, the existing datasets for classifying figures in document images are limited in both size and number of categories [1]–[3]. In this paper, we introduce a scientific figure classification dataset named DocFigure. The dataset consists of 33K annotated figures in 28 categories, taken from document images of scientific articles published at conferences such as CVPR, ECCV and ICCV over the last several years. Manually annotating such a large number (33K) of figures is time-consuming and costly. We therefore design a web-based annotation tool that efficiently assigns category labels to a large number of figures with minimal effort from human annotators. To benchmark the dataset on the classification task, we propose three baseline classification techniques using a deep feature, a deep texture feature, and a combination of both. In our analysis, we find that the combination of the deep feature and the deep texture feature is more effective for document figure classification than either feature alone.

Keywords: Document figure classification; transfer learning; deep feature; deep texture feature.

Dataset: DocFigure


Fig1: Visual illustration of category-wise sample figure images from our DocFigure dataset. The 28 categories correspond to (a) Line graph, (b) Natural image, (c) Table, (d) 3D object, (e) Bar plot, (f) Scatter plot, (g) Medical image, (h) Sketch, (i) Geographic map, (j) Flow chart, (k) Heat map, (l) Mask, (m) Block diagram, (n) Venn diagram, (o) Confusion matrix, (p) Histogram, (q) Box plot, (r) Vector plot, (s) Pie chart, (t) Surface plot, (u) Algorithm, (v) Contour plot, (w) Tree diagram, (x) Bubble chart, (y) Polar plot, (z) Area chart, (A) Pareto chart and (B) Radar chart.


Comparison of our DocFigure dataset with existing datasets


Table I: Comparison of our DocFigure dataset with the existing Deepchart [2], Figureseer [3] and Revision [1] datasets with respect to category labels and samples. The last column indicates the number of images in each class.


Work-flow for generation of DocFigure dataset


Fig2: Work-flow for generation of the DocFigure dataset. Annotation is done in two stages. In stage I, part of the dataset is annotated using incremental learning with the help of an annotator. In stage II, the remaining part is annotated using similarity scores for the rest of the images, again with the help of an annotator. Finally, the completely annotated dataset is generated.
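The stage II idea of labelling by similarity score can be illustrated with a minimal sketch. The caption does not say which similarity measure is used or what each remaining image is compared against, so everything here is an assumption made for illustration: cosine similarity, comparison against the images labelled in stage I, the `threshold` parameter, and the function names `cosine_similarity` and `suggest_labels`. It is not the authors' annotation tool.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_labels(labeled, unlabeled, threshold=0.9):
    """Suggest a label for each unlabeled image (hypothetical helper).

    labeled:   list of (feature_vector, label) pairs, e.g. from stage I.
    unlabeled: dict mapping image id -> feature vector.
    Returns a dict mapping image id -> (label or None, best score).
    A None label means the best match fell below the assumed threshold,
    so the image is left for the human annotator to decide.
    """
    suggestions = {}
    for image_id, feature in unlabeled.items():
        best_label, best_score = None, -1.0
        for ref_feature, ref_label in labeled:
            score = cosine_similarity(feature, ref_feature)
            if score > best_score:
                best_label, best_score = ref_label, score
        if best_score < threshold:
            best_label = None
        suggestions[image_id] = (best_label, best_score)
    return suggestions
```

In a workflow like this, an annotator would only need to confirm or correct the suggested labels and handle the images returned with `None`, which is one way the manual effort could be reduced.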


Link to the Download: [ Dataset ] * [ Code ] *

Note: * Their use is restricted to non-commercial research and educational purposes.


Baseline


Fig3: Basic framework for the three proposed baseline approaches. The red dotted rectangle corresponds to the FC-CNN feature extraction block, the blue dotted rectangle indicates the FV-CNN feature extraction module, and the black dotted rectangle corresponds to the classification module (best viewed in color).
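The third baseline combines the FC-CNN and FV-CNN features before classification. This page does not state how the two features are fused, so the sketch below assumes one common choice: L2-normalise each feature vector and concatenate them. The function names `l2_normalize` and `combine_features` are hypothetical, and the short input vectors stand in for real network outputs.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


def combine_features(fc_feature, fv_feature):
    """Concatenate the two normalised feature vectors (assumed fusion step)."""
    return l2_normalize(fc_feature) + l2_normalize(fv_feature)


# Toy vectors standing in for an FC-CNN feature and an FV-CNN feature.
combined = combine_features([3.0, 4.0], [0.0, 2.0, 0.0])
print(len(combined))  # 5
```

Normalising before concatenation keeps one feature from dominating the other purely because of its scale; the combined vector would then be passed to the classification module.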


Results:

Labels             FC-CNN    FV-CNN    FC-CNN + FV-CNN
3D objects         98.24%    94.73%    98.53%
Algorithm          93.81%    91.75%    93.81%
Bar plots          93.97%    91.97%    93.64%
Box plot           91.39%    88.07%    92.05%
Flow chart         92.53%    91.14%    97.01%
Heat map           99.25%    95.89%    99.62%
Histogram          94.89%    88.26%    94.89%
Medical images     97.87%    92.55%    98.93%
Pie chart          91.66%    89.81%    94.44%
Polar plot         85.71%    78.57%    85.71%
Area chart         84.61%    91.02%    92.30%
Block diagram      97.26%    97.65%    98.43%
Bubble Chart       80.95%    91.66%    90.47%
Confusion matrix   85.22%    89.65%    93.10%
Contour plot       59.34%    74.72%    72.52%
Geographic map     88.59%    95.81%    95.43%
Line graph *       98.49%    98.84%    99.33%
Mask               99.23%    99.23%    99.23%
Natural images     98.04%    98.25%    99.23%
Pareto charts      87.17%    96.15%    97.43%
Radar chart        78.94%    86.84%    85.52%
Scatter plot       90.14%    91.19%    93.66%
Sketches           95.65%    96.37%    98.18%
Surface plot       76.76%    89.89%    88.88%
Tables             97.25%    98.73%    97.67%
Tree Diagram       67.04%    68.18%    70.45%
Vector plot        79.86%    81.94%    86.80%
Venn Diagram       87.03%    93.51%    93.05%
Average            88.96%    90.80%    92.90%

Table3: The class-wise accuracy of the 28 classes in our proposed DocFigure dataset using the shape feature (FC-CNN), the texture feature (FV-CNN) and the combination of both (FC-CNN + FV-CNN). The labels written in italics are more discriminative with the shape feature than with the texture feature.

Note: * In the main paper, Line graph was mistakenly named Graph plots.
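For readers reproducing the numbers, the sketch below shows one way to compute class-wise accuracy and its unweighted (macro) average from true and predicted labels. The function names `class_wise_accuracy` and `macro_average` are hypothetical, and the authors' evaluation code may differ. As a consistency check, the FC-CNN column of the table above is copied into `FC_CNN`; its unweighted mean matches the Average row.

```python
from collections import defaultdict


def class_wise_accuracy(true_labels, predicted_labels):
    """Percentage of correctly classified samples for each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return {label: 100.0 * correct[label] / total[label] for label in total}


def macro_average(per_class_accuracy):
    """Unweighted mean over classes, so every class counts equally."""
    values = list(per_class_accuracy.values())
    return sum(values) / len(values)


# FC-CNN column copied from the results table (28 classes, in table order).
FC_CNN = [
    98.24, 93.81, 93.97, 91.39, 92.53, 99.25, 94.89,
    97.87, 91.66, 85.71, 84.61, 97.26, 80.95, 85.22,
    59.34, 88.59, 98.49, 99.23, 98.04, 87.17, 78.94,
    90.14, 95.65, 76.76, 97.25, 67.04, 79.86, 87.03,
]
print(round(sum(FC_CNN) / len(FC_CNN), 2))  # 88.96, as in the Average row
```

Because the average is unweighted, small classes with low accuracy (for example Contour plot or Tree Diagram) pull it down as much as large classes would.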


Publication

  • K V Jobin, Ajoy Mondal and C V Jawahar, DocFigure: A Dataset for Scientific Document Figure Classification, 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, NSW, Australia, 2019

Bibtex

@InProceedings{jobin2019,
author = "K V Jobin and Ajoy Mondal and C V Jawahar",
title = "DocFigure: A Dataset for Scientific Document Figure Classification",
booktitle = "2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)",
year = "2019"
}