J2EE Labs: Apache Tika - Metadata and Structured Text Content Extraction

Apache Tika is used for detecting and extracting metadata and structured text content from different documents using existing parser libraries.

Apache Tika provides an interface called Parser (in the org.apache.tika.parser package) with a method called parse, whose job is to parse a document stream into a sequence of XHTML SAX events.
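To make "a sequence of XHTML SAX events" more concrete, here is a minimal sketch that uses only the JDK's own SAX parser, not Tika. The XHTML string is a hand-written stand-in for the kind of events a Tika parser emits, and the class name SaxEventsSketch and its BodyTextHandler are hypothetical names chosen for illustration. The handler keeps only the text inside the body element, which is roughly the role BodyContentHandler plays in the program further down.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {

    /** Hypothetical handler: collects character data found inside <body>. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default JDK factory is not namespace-aware, so qName holds the tag name
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Ignore text outside <body>, such as the <title> in <head>
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Feeds the given XHTML through the JDK SAX parser and returns the body text. */
    static String bodyText(String xhtml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in input; with Tika the events would come from Parser.parse instead
        String xhtml = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello Tika</p></body></html>";
        System.out.println(bodyText(xhtml)); // prints "Hello Tika"
    }
}
```

Because the consumer is just a ContentHandler, the same parse call can feed plain-text extraction, XHTML serialization, or any custom handler without changing the parser.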

Tika supports many formats, such as text, audio, image, video, Word documents, OpenDocument, PDF, XML and HTML. Please refer to Apache Tika Supported Documents Format for more details. You may also want to have a look at the Apache Tika 1.4 API documentation.

What is important when using parse is that we pass an InputStream as a parameter, and the document stream is consumed but not closed by this method. The responsibility for closing the stream remains with the caller. If the document cannot be parsed, a TikaException is thrown; a SAXException is thrown if the SAX events cannot be processed, and an IOException is thrown if the document stream cannot be read.
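The ownership rule for the stream can be sketched without Tika on the classpath. In the snippet below, consume is a hypothetical stand-in for Parser.parse (it reads the stream but never closes it), and TrackingStream is a small helper invented here to record whether close() was called. It is only an illustration of the contract, not Tika code; the point is that a try-with-resources block in the caller is what closes the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamOwnershipSketch {

    /** Hypothetical stand-in for Parser.parse: consumes the stream, never closes it. */
    static int consume(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count; // deliberately no in.close() here
    }

    /** Helper stream that remembers whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Returns { closed right after consume, closed after the try block }. */
    static boolean[] run() throws IOException {
        TrackingStream stream = new TrackingStream("hello".getBytes(StandardCharsets.UTF_8));
        boolean closedAfterConsume;
        // The caller owns the stream, so the caller closes it
        try (InputStream in = stream) {
            consume(in);
            closedAfterConsume = stream.closed;
        }
        return new boolean[] { closedAfterConsume, stream.closed };
    }

    public static void main(String[] args) throws IOException {
        boolean[] result = run();
        System.out.println("closed after consume: " + result[0]);   // false
        System.out.println("closed after try block: " + result[1]); // true
    }
}
```

With Tika, the same shape would apply by opening the document stream in the try-with-resources header and calling parse inside the block, which removes the need for a manual finally clause.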

If you are using Maven, add the following dependencies to pom.xml:

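<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
</dependency>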

Java Program to extract metadata and content using Apache Tika :

package com.anuj.apachetika;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * 
 * @author Anuj
 */
public class ApacheTikaParser {

    private void parseResource(String resourceName) {
        System.out.println("Parsing resource : " + resourceName);
        InputStream inputStream = null;

        try {
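            inputStream = getClass().getClassLoader().getResourceAsStream(resourceName);
            if (inputStream == null) {
                // getResourceAsStream returns null when the resource is not on the classpath
                System.out.println("Resource not found : " + resourceName);
                return;
            }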

            Parser parser = new AutoDetectParser();
            ContentHandler contentHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();

            parser.parse(inputStream, contentHandler, metadata, new ParseContext());

            for (String name : metadata.names()) {
                String value = metadata.get(name);
                System.out.println("Metadata Name: " + name);
                System.out.println("Metadata Value: " + value);
            }

            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + contentHandler.toString());

        } catch (IOException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } finally {
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        ApacheTikaParser apacheTikaParser = new ApacheTikaParser();
        apacheTikaParser.parseResource("NB Shortcuts.pdf");
        apacheTikaParser.parseResource("netbeans.txt");
        apacheTikaParser.parseResource("J2EE.docx");
    }
}

Output :
Parsing resource : NB Shortcuts.pdf
Metadata Name: xmpTPg:NPages
Metadata Value: 2
Metadata Name: Creation-Date
Metadata Value: 2011-08-28T12:53:16Z
Metadata Name: meta:creation-date
Metadata Value: 2011-08-28T12:53:16Z
Metadata Name: created
Metadata Value: Sun Aug 28 18:23:16 IST 2011
Metadata Name: dcterms:created
Metadata Value: 2011-08-28T12:53:16Z
Metadata Name: producer
Metadata Value: LibreOffice 3.3
Metadata Name: xmp:CreatorTool
Metadata Value: Writer
Metadata Name: Content-Type
Metadata Value: application/pdf
.....

Parsing resource : netbeans.txt
Metadata Name: Content-Encoding
Metadata Value: windows-1252
Metadata Name: Content-Type
Metadata Value: text/plain; charset=windows-1252
.....

Parsing resource : J2EE.docx
Metadata Name: cp:revision
Metadata Value: 25
Metadata Name: meta:last-author
Metadata Value: Anuj J Patel
Metadata Name: Last-Author
Metadata Value: Anuj J Patel
Metadata Name: meta:save-date
Metadata Value: 2008-09-17T12:08:00Z
Metadata Name: Application-Name
Metadata Value: Microsoft Office Word
Metadata Name: Author
Metadata Value: Anuj J Patel
Metadata Name: dcterms:created
Metadata Value: 2008-09-17T12:03:00Z
....

Please refer to Apache License before using.
