Wednesday, August 14, 2013

Apache Tika - Metadata and Structured Text Content Extraction

Apache Tika is used for detecting and extracting metadata and structured text content from different documents using existing parser libraries.

Apache Tika provides an interface called Parser (org.apache.tika.parser.Parser) with a method called parse, whose job is to parse a document stream into a sequence of XHTML SAX events.
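To make "a sequence of XHTML SAX events" concrete, here is a minimal sketch of a handler that records each callback it receives. It uses the JDK's own SAX parser as a stand-in for Tika so that it runs without extra dependencies; the class and method names (SaxEventDemo, RecordingHandler, collectEvents) are made up for illustration and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventDemo {

    /** Hypothetical handler that records every SAX callback as a readable string. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        // characters() may be called several times for one text run,
        // so text is buffered and emitted at the next element boundary.
        private void flushText() {
            String s = text.toString().trim();
            if (!s.isEmpty()) {
                events.add("text:" + s);
            }
            text.setLength(0);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flushText();
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flushText();
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses a small XHTML string with the JDK SAX parser and returns the recorded events. */
    static List<String> collectEvents(String xhtml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : collectEvents("<html><body><p>Hi</p></body></html>")) {
            System.out.println(event);
        }
    }
}
```

A handler written this way could, in principle, be passed to a Tika parser in place of BodyContentHandler, since both implement org.xml.sax.ContentHandler.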

Tika supports many formats, such as text, audio, image, video, Word documents, OpenDocument, PDF, XML and HTML. Please refer to the Apache Tika Supported Document Formats page for more details. You may also want to have a look at the Apache Tika 1.4 API documentation.


    What's important when using parse is that we pass an InputStream as a parameter, and the given document stream is consumed but not closed by this method. The responsibility for closing the stream remains with the caller. If the document cannot be parsed, a TikaException is thrown; if the SAX events cannot be processed, a SAXException is thrown; and if the stream cannot be read, an IOException is thrown.
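The "consumed but not closed" contract can be sketched without Tika at all. In the example below, consume is a hypothetical stand-in for parse (it reads the stream to the end and never closes it), and TrackingStream is an illustrative helper that remembers whether close() was called. The caller uses try-with-resources so the stream is closed even if reading fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamOwnershipDemo {

    /** Illustrative stream that remembers whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for parse: reads to the end of the stream but never closes it. */
    static int consume(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    /** Returns { closed right after consume?, closed after the try block? }. */
    static boolean[] run() throws IOException {
        TrackingStream stream = new TrackingStream("hello".getBytes(StandardCharsets.UTF_8));
        boolean closedAfterConsume;
        // The caller owns the stream: try-with-resources closes it on exit.
        try (InputStream in = stream) {
            consume(in);
            closedAfterConsume = stream.closed;
        }
        return new boolean[] { closedAfterConsume, stream.closed };
    }

    public static void main(String[] args) throws IOException {
        boolean[] result = run();
        System.out.println("Closed right after consume: " + result[0]);
        System.out.println("Closed after try block: " + result[1]);
    }
}
```

On Java 7 or later, the same try-with-resources pattern is a shorter alternative to a finally block with a null check when calling Tika's parse method.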

    If you are using Maven, add the following dependencies to pom.xml:
    
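    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.4</version>
    </dependency>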
    

    Java program to extract metadata and content using Apache Tika:

    package com.anuj.apachetika;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    
    /**
     * 
     * @author Anuj
     */
    public class ApacheTikaParser {
    
        private void parseResource(String resourceName) {
            System.out.println("Parsing resource : " + resourceName);
            InputStream inputStream = null;
    
            try {
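                inputStream = getClass().getClassLoader().getResourceAsStream(resourceName);
                // getResourceAsStream returns null when the resource is not on the classpath
                if (inputStream == null) {
                    System.out.println("Resource not found : " + resourceName);
                    return;
                }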
    
                Parser parser = new AutoDetectParser();
                ContentHandler contentHandler = new BodyContentHandler();
                Metadata metadata = new Metadata();
    
                parser.parse(inputStream, contentHandler, metadata, new ParseContext());
    
                for (String name : metadata.names()) {
                    String value = metadata.get(name);
                    System.out.println("Metadata Name: " + name);
                    System.out.println("Metadata Value: " + value);
                }
    
                System.out.println("Title: " + metadata.get("title"));
                System.out.println("Author: " + metadata.get("Author"));
                System.out.println("content: " + contentHandler.toString());
    
            } catch (IOException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            } finally {
                if (inputStream != null) {
                    try {
                        inputStream.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    
        public static void main(String[] args) {
            ApacheTikaParser apacheTikaParser = new ApacheTikaParser();
            apacheTikaParser.parseResource("NB Shortcuts.pdf");
            apacheTikaParser.parseResource("netbeans.txt");
            apacheTikaParser.parseResource("J2EE.docx");
        }
    }
    

    Output :
    Parsing resource : NB Shortcuts.pdf
    Metadata Name: xmpTPg:NPages
    Metadata Value: 2
    Metadata Name: Creation-Date
    Metadata Value: 2011-08-28T12:53:16Z
    Metadata Name: meta:creation-date
    Metadata Value: 2011-08-28T12:53:16Z
    Metadata Name: created
    Metadata Value: Sun Aug 28 18:23:16 IST 2011
    Metadata Name: dcterms:created
    Metadata Value: 2011-08-28T12:53:16Z
    Metadata Name: producer
    Metadata Value: LibreOffice 3.3
    Metadata Name: xmp:CreatorTool
    Metadata Value: Writer
    Metadata Name: Content-Type
    Metadata Value: application/pdf
    .....

    Parsing resource : netbeans.txt
    Metadata Name: Content-Encoding
    Metadata Value: windows-1252
    Metadata Name: Content-Type
    Metadata Value: text/plain; charset=windows-1252
    .....

    Parsing resource : J2EE.docx
    Metadata Name: cp:revision
    Metadata Value: 25
    Metadata Name: meta:last-author
    Metadata Value: Anuj J Patel
    Metadata Name: Last-Author
    Metadata Value: Anuj J Patel
    Metadata Name: meta:save-date
    Metadata Value: 2008-09-17T12:08:00Z
    Metadata Name: Application-Name
    Metadata Value: Microsoft Office Word
    Metadata Name: Author
    Metadata Value: Anuj J Patel
    Metadata Name: dcterms:created
    Metadata Value: 2008-09-17T12:03:00Z
    ....

    Please refer to the Apache License before using.


    Author : Anuj Patel
    Blog : http://goldenpackagebyanuj.blogspot.in/