Conference Paper (published)
Details
Citation
Böschen F & Scherp A (2015) Formalization and preliminary evaluation of a pipeline for text extraction from infographics. In: Görg S, Bergmann R & Müller G (eds.) Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB, volume 1458. CEUR Workshop Proceedings, 1458. LWA 2015 Workshops: KDML, FGWM, IR, FGDB, Trier, Germany, 07.10.2015-09.10.2015. Aachen, Germany: CEUR Workshop Proceedings, pp. 20-31. http://ceur-ws.org/Vol-1458/D03_CRC13_Boeschen.pdf
Abstract
We propose a pipeline for text extraction from infographics that makes use of a novel combination of data mining and computer vision techniques. The pipeline defines a sequence of steps to identify characters, cluster them into text lines, determine their rotation angle, and apply state-of-the-art OCR to recognise the text. In this paper, we formally define the pipeline and present its current implementation. In addition, we have conducted preliminary evaluations over a data corpus of 121 manually annotated infographics from a broad range of illustration types, such as bar charts, pie charts, line charts, maps, and others. We assess the results of our text extraction pipeline by comparing them with two baselines. Finally, we sketch an outline for future work and possibilities for improving the pipeline.
Keywords
Infographics; OCR; multi-oriented text extraction; formalization
Journal
CEUR Workshop Proceedings: Volume 1458
Field | Value |
---|---|
Status | Published |
Title of series | CEUR Workshop Proceedings |
Number in series | 1458 |
Publication date | 31/12/2015 |
URL | http://hdl.handle.net/1893/28051 |
Publisher | CEUR Workshop Proceedings |
Publisher URL | http://ceur-ws.org/Vol-1458/D03_CRC13_Boeschen.pdf |
Place of publication | Aachen, Germany |
ISSN of series | 1613-0073 |
ISBN | N/A |
Conference | LWA 2015 Workshops: KDML, FGWM, IR, FGDB |
Conference location | Trier, Germany |
Dates | 07.10.2015-09.10.2015 |