Apache POI - Text Extraction

Overview

For a number of years now, Apache POI has provided basic text extraction for all of the file formats the project supports. As well as the plain text, these extractors provide access to the metadata associated with a given file, such as title and author.

For more advanced text extraction needs, including Rich Text extraction (such as formatting and styling), along with XML and HTML output, Apache POI works closely with Apache Tika to deliver POI-powered Tika Parsers for all the project supported file formats.

If you are after turn-key text extraction, including the latest format support, styles, etc., you are strongly advised to make use of Apache Tika, which builds on top of POI to provide text and metadata extraction. If you want something very simple and stand-alone, or you wish to make heavy modifications, then the POI-provided text extractors documented below might be a better fit for your needs.

Common functionality

All of the POI text extractors extend from org.apache.poi.POITextExtractor. This provides a common method across all extractors, getText(). In many cases, the text returned will be all you need. However, many extractors also provide more targeted text extraction methods, so you may wish to use these in some cases.

All POIFS / OLE 2 based text extractors also extend from org.apache.poi.POIOLE2TextExtractor. This additionally provides common methods to get at the HPSF document metadata.
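As a minimal sketch of these common methods on an OLE 2 extractor, the following builds a small .xls workbook in memory (purely so the example is self-contained), then reads its text with getText() and its title through getSummaryInformation(). The class name Ole2MetadataSketch and the sample values are invented for illustration. ExcelExtractor is used as the concrete type so that the sketch does not depend on which package a given POI release places the base classes in.

```java
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

// Hypothetical demo class; not part of POI.
final class Ole2MetadataSketch {

    /** Builds a tiny in-memory .xls workbook with a title set, for demonstration only. */
    static HSSFWorkbook sampleWorkbook() {
        HSSFWorkbook workbook = new HSSFWorkbook();
        workbook.createSheet("Data").createRow(0).createCell(0).setCellValue("Hello");
        // A freshly created workbook has no property sets yet, so create empty ones first.
        workbook.createInformationProperties();
        workbook.getSummaryInformation().setTitle("Quarterly report");
        return workbook;
    }

    /** Reads the title through the extractor's OLE 2 (HPSF) metadata accessor. */
    static String titleOf(ExcelExtractor extractor) {
        SummaryInformation summary = extractor.getSummaryInformation();
        return summary == null ? null : summary.getTitle();
    }

    static String sampleTitle() {
        return titleOf(new ExcelExtractor(sampleWorkbook()));
    }

    static String sampleText() {
        return new ExcelExtractor(sampleWorkbook()).getText();
    }

    public static void main(String[] args) {
        System.out.println("Title: " + sampleTitle());
        System.out.println(sampleText());
    }
}
```

With a real file you would normally construct the extractor from a POIFSFileSystem or obtain it from ExtractorFactory rather than building a workbook by hand.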

All OOXML based text extractors also extend from org.apache.poi.POIXMLTextExtractor. This additionally provides common methods to get at the OOXML metadata.
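The OOXML side can be sketched in the same way. This example assumes the poi-ooxml jar and its dependencies are on the classpath; it sets a title on an in-memory .xlsx workbook and reads it back through the extractor's getCoreProperties() accessor. The class name OoxmlMetadataSketch and the sample values are hypothetical.

```java
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// Hypothetical demo class; not part of POI.
final class OoxmlMetadataSketch {

    /** Builds a tiny in-memory .xlsx workbook with a core-properties title. */
    static XSSFWorkbook sampleWorkbook() {
        XSSFWorkbook workbook = new XSSFWorkbook();
        workbook.createSheet("Data").createRow(0).createCell(0).setCellValue("Hello");
        workbook.getProperties().getCoreProperties().setTitle("Quarterly report");
        return workbook;
    }

    /** Reads the title via the extractor rather than the workbook itself. */
    static String sampleTitle() {
        XSSFExcelExtractor extractor = new XSSFExcelExtractor(sampleWorkbook());
        return extractor.getCoreProperties().getTitle();
    }

    static String sampleText() {
        return new XSSFExcelExtractor(sampleWorkbook()).getText();
    }

    public static void main(String[] args) {
        System.out.println("Title: " + sampleTitle());
        System.out.println(sampleText());
    }
}
```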

Text Extractor Factory

POI provides a common class to select the appropriate text extractor for you, based on the supplied document's contents. org.apache.poi.extractor.ExtractorFactory provides a similar function to WorkbookFactory. You simply pass it an InputStream, a File, a POIFSFileSystem or an OOXML Package. It figures out the correct text extractor for you, and returns it.
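A short sketch of the InputStream route follows. To stay self-contained it serialises a one-cell .xls workbook to a byte array and feeds that to the factory; in practice the stream would come from a file or upload. The class ExtractorFactorySketch and its helper methods are hypothetical names, and the broad "throws Exception" is a deliberate simplification because the checked exceptions declared by createExtractor differ between POI releases.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

// Hypothetical demo class; not part of POI.
final class ExtractorFactorySketch {

    /** Serialises a one-cell .xls workbook to bytes; stands in for a real file. */
    static byte[] sampleXlsBytes() throws IOException {
        HSSFWorkbook workbook = new HSSFWorkbook();
        workbook.createSheet("Data").createRow(0).createCell(0).setCellValue("Hello");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        workbook.write(out);
        return out.toByteArray();
    }

    /** Lets ExtractorFactory choose the extractor from the stream's contents. */
    static String extractText(InputStream in) throws Exception {
        return ExtractorFactory.createExtractor(in).getText();
    }

    /** Runs the extraction on the sample bytes, wrapping any failure. */
    static String sampleText() {
        try (InputStream in = new ByteArrayInputStream(sampleXlsBytes())) {
            return extractText(in);
        } catch (Exception e) {
            throw new IllegalStateException("Extraction failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleText());
    }
}
```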

For complete detection and text extractor auto-selection, users are strongly encouraged to investigate Apache Tika.
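To give a feel for what content-based detection involves at its very simplest, the sketch below classifies a file by its leading bytes: OLE 2 containers start with the signature D0 CF 11 E0 A1 B1 1A E1, while OOXML files are ZIP archives and so start with "PK" 03 04. This is not POI's or Tika's actual code, and the names ContainerSniffer, Container and classify are invented here. Note that a ZIP signature alone does not prove a file is OOXML, which is one reason fuller detection is best left to a dedicated library.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; real detection inspects far more than 8 bytes.
final class ContainerSniffer {

    enum Container { OLE2, ZIP_BASED, UNKNOWN }

    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file from its first bytes. */
    static Container classify(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP_BASED;
        return Container.UNKNOWN;
    }

    /** Peeks at a mark-supporting stream without consuming it. */
    static Container sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] buffer = new byte[8];
        int read = 0;
        while (read < buffer.length) {
            int n = in.read(buffer, read, buffer.length - read);
            if (n < 0) break;
            read += n;
        }
        in.reset();
        byte[] header = new byte[read];
        System.arraycopy(buffer, 0, header, 0, read);
        return classify(header);
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```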

Excel

For .xls files, there is org.apache.poi.hssf.extractor.ExcelExtractor, which will return text, optionally with formulas instead of their contents. Similarly, for .xlsx files there is org.apache.poi.xssf.extractor.XSSFExcelExtractor, which provides the same functionality.
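The formula option can be sketched as follows, using setFormulasNotResults on ExcelExtractor. The workbook is built in memory only to keep the example self-contained, and the class name ExcelFormulaSketch and the cell contents are hypothetical. The formulas are evaluated once up front so that the cell carries a cached result as well as the formula text.

```java
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFFormulaEvaluator;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

// Hypothetical demo class; not part of POI.
final class ExcelFormulaSketch {

    /** Builds a one-row workbook holding a label and a simple formula. */
    static HSSFWorkbook sampleWorkbook() {
        HSSFWorkbook workbook = new HSSFWorkbook();
        HSSFRow row = workbook.createSheet("Data").createRow(0);
        row.createCell(0).setCellValue("Total");
        row.createCell(1).setCellFormula("1+2");
        // Store evaluated results alongside the formulas.
        HSSFFormulaEvaluator.evaluateAllFormulaCells(workbook);
        return workbook;
    }

    /** Extracts text, showing either formula strings or their cached results. */
    static String extract(boolean formulasNotResults) {
        ExcelExtractor extractor = new ExcelExtractor(sampleWorkbook());
        extractor.setFormulasNotResults(formulasNotResults);
        return extractor.getText();
    }

    public static void main(String[] args) {
        System.out.println("With results:\n" + extract(false));
        System.out.println("With formulas:\n" + extract(true));
    }
}
```

XSSFExcelExtractor offers the same setFormulasNotResults switch for .xlsx workbooks.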

For those working in constrained memory footprints, there are two more Excel text extractors available. For .xls files, it's org.apache.poi.hssf.extractor.EventBasedExcelExtractor, based on the streaming EventUserModel code, and will generally deliver a lower memory footprint for extraction. However, it will have problems correctly outputting more complex formulas, as it works with records as they pass, and so doesn't have access to all parts of complex and shared formulas. For .xlsx files the equivalent is org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor, which is based on the XSSF SAX Event codebase.
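A minimal sketch of the event-based route for .xls follows. It assumes the EventBasedExcelExtractor constructor that takes a POIFSFileSystem, and it serialises a small workbook to bytes only so that there is something to read; the class EventExtractorSketch and its helpers are hypothetical. The memory benefit is not visible with a file this small; the point is only to show that the extractor is driven from the file system rather than from a fully loaded HSSFWorkbook.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.poi.hssf.extractor.EventBasedExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical demo class; not part of POI.
final class EventExtractorSketch {

    /** Serialises a one-cell .xls workbook to bytes; stands in for a real file. */
    static byte[] sampleXlsBytes() throws IOException {
        HSSFWorkbook workbook = new HSSFWorkbook();
        workbook.createSheet("Data").createRow(0).createCell(0).setCellValue("Hello");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        workbook.write(out);
        return out.toByteArray();
    }

    /** Extracts text by streaming over the records, without building a usermodel workbook. */
    static String extractText(byte[] xlsBytes) throws IOException {
        POIFSFileSystem fileSystem = new POIFSFileSystem(new ByteArrayInputStream(xlsBytes));
        EventBasedExcelExtractor extractor = new EventBasedExcelExtractor(fileSystem);
        return extractor.getText();
    }

    static String sampleText() {
        try {
            return extractText(sampleXlsBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleText());
    }
}
```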

Word

For .doc files from Word 97 - Word 2003, in scratchpad there is org.apache.poi.hwpf.extractor.WordExtractor, which will return text for your document.

Those using POI 3.7 can also extract simple textual content from older Word 6 and Word 95 files, using the scratchpad class org.apache.poi.hwpf.extractor.Word6Extractor.

For .docx files, the relevant class is org.apache.poi.xwpf.extractor.XWPFWordExtractor.
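As a rough sketch of .docx extraction, the example below writes a two-paragraph document to a byte array, reopens it as an XWPFDocument, and passes that to the extractor. It assumes poi-ooxml is on the classpath; the class DocxTextSketch, its helpers and the sample sentences are hypothetical. The round trip through bytes is there so that the document is read the same way a file on disk would be.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Hypothetical demo class; not part of POI.
final class DocxTextSketch {

    /** Serialises a two-paragraph .docx to bytes; stands in for a real file. */
    static byte[] sampleDocxBytes() throws IOException {
        XWPFDocument document = new XWPFDocument();
        document.createParagraph().createRun().setText("First paragraph.");
        document.createParagraph().createRun().setText("Second paragraph.");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        document.write(out);
        return out.toByteArray();
    }

    /** Opens the stream as a .docx and returns its plain text. */
    static String extractText(InputStream in) throws IOException {
        XWPFDocument document = new XWPFDocument(in);
        return new XWPFWordExtractor(document).getText();
    }

    static String sampleText() {
        try (InputStream in = new ByteArrayInputStream(sampleDocxBytes())) {
            return extractText(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleText());
    }
}
```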

PowerPoint

For .ppt files, in scratchpad there is org.apache.poi.hslf.extractor.PowerPointExtractor, which will return text for your slideshow, optionally restricted to just slide text or notes text. For .pptx files, the class to use is org.apache.poi.xslf.extractor.XSLFPowerPointExtractor.

Publisher

For .pub files, in scratchpad there is org.apache.poi.hpbf.extractor.PublisherExtractor, which will return text for your file.

Visio

For .vsd files, in scratchpad there is org.apache.poi.hdgf.extractor.VisioTextExtractor, which will return text for your file.

Embedded Objects

Extractors already exist for Excel, Word, PowerPoint and Visio; if one of these objects is embedded into a worksheet, the ExtractorFactory class can be used to recover an extractor for it.

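import java.io.FileInputStream;
import org.apache.poi.POIOLE2TextExtractor;
import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.hdgf.extractor.VisioTextExtractor;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// inputFile identifies the host .xls workbook
FileInputStream fis = new FileInputStream(inputFile);
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
// Firstly, get an extractor for the Workbook
POIOLE2TextExtractor oleTextExtractor =
   ExtractorFactory.createExtractor(fileSystem);
// Then an array of extractors for any Excel, Word, PowerPoint
// or Visio objects embedded into it.
// (The method name really is spelled "Embeded" in the POI API.)
POITextExtractor[] embeddedExtractors =
   ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);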
for (POITextExtractor textExtractor : embeddedExtractors) {
   // If the embedded object was an Excel spreadsheet.
   if (textExtractor instanceof ExcelExtractor) {
      ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
      System.out.println(excelExtractor.getText());
   }
   // A Word Document
   else if (textExtractor instanceof WordExtractor) {
      WordExtractor wordExtractor = (WordExtractor) textExtractor;
      String[] paragraphText = wordExtractor.getParagraphText();
      for (String paragraph : paragraphText) {
         System.out.println(paragraph);
      }
      // Display the document's header and footer text
      System.out.println("Footer text: " + wordExtractor.getFooterText());
      System.out.println("Header text: " + wordExtractor.getHeaderText());
   }
   // PowerPoint Presentation.
   else if (textExtractor instanceof PowerPointExtractor) {
      PowerPointExtractor powerPointExtractor =
         (PowerPointExtractor) textExtractor;
      System.out.println("Text: " + powerPointExtractor.getText());
      System.out.println("Notes: " + powerPointExtractor.getNotes());
   }
   // Visio Drawing
   else if (textExtractor instanceof VisioTextExtractor) {
      VisioTextExtractor visioTextExtractor = 
         (VisioTextExtractor) textExtractor;
      System.out.println("Text: " + visioTextExtractor.getText());
   }
}

by Nick Burch

Copyright © 2002-2014 The Apache Software Foundation. All rights reserved. Apache POI, POI, Apache, the Apache feather logo, and the Apache POI project logo are trademarks of The Apache Software Foundation.