Apache Tika is used for detecting and extracting metadata and structured text content from different documents using existing parser libraries.
Apache Tika provides an interface called Parser (in the org.apache.tika.parser package) whose parse method parses a document stream into a sequence of XHTML SAX events.
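To make "a sequence of XHTML SAX events" more concrete, here is a minimal sketch that uses only the JDK's own SAX classes, not Tika. The class name SaxTextCollector and its TextHandler are hypothetical names chosen for illustration. The handler receives SAX events from a small XHTML string and collects the character data, which is roughly the role a content handler plays when Tika pushes events to it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextCollector {

    /** Collects character data from SAX events, adding a line break after each closed p element. */
    static class TextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Called for every run of text found between tags.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // The default factory is not namespace aware, so the tag name arrives in qName.
            if ("p".equals(qName)) {
                text.append('\n');
            }
        }

        @Override
        public String toString() {
            return text.toString().trim();
        }
    }

    /** Feeds the XHTML string through a SAX parser and returns the collected text. */
    static String extractText(String xhtml) throws Exception {
        TextHandler handler = new TextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

With Tika the events come from the document parser rather than from an XML parser, but the handler side works in the same way.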
Tika supports many formats, such as text, audio, image, video, Word documents, OpenDocument, PDF, XML and HTML. Please refer to Apache Tika Supported Documents Format for more details. You may also want to have a look at the Apache Tika 1.4 API documentation.
The important point when using parse is that we pass an InputStream as a parameter, and the given document stream is consumed but not closed by this method. The responsibility for closing the stream remains with the caller. If the document cannot be parsed, a TikaException is thrown, and a SAXException is thrown if the SAX events cannot be processed.
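The "consumed but not closed" contract can be illustrated without Tika on the classpath. The sketch below is an assumption-laden stand-in: consume is a hypothetical method that behaves like parse with respect to the stream (it reads everything and never calls close), and CloseTrackingStream is a hypothetical wrapper that records whether close was called. It shows that a try-with-resources block in the caller is enough to release the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamOwnershipDemo {

    /** Wraps a stream and records whether close() has been called. */
    static class CloseTrackingStream extends FilterInputStream {
        boolean closed = false;

        CloseTrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for Parser.parse: reads to the end of the stream, never closes it. */
    static int consume(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        CloseTrackingStream tracked =
                new CloseTrackingStream(new ByteArrayInputStream("sample".getBytes()));
        // The caller owns the stream, so the caller closes it (here via try-with-resources).
        try (InputStream stream = tracked) {
            int bytes = consume(stream);
            System.out.println("consumed " + bytes + " bytes, closed=" + tracked.closed);
        }
        System.out.println("after try block, closed=" + tracked.closed);
    }
}
```

The same try-with-resources pattern should apply when the stream is handed to a real Tika parser, since the parser leaves closing to the caller.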
If you are using Maven, add the following dependencies to pom.xml:
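<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
</dependency>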
Java Program to extract metadata and content using Apache Tika :
package com.anuj.apachetika;

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Extracts metadata and body text from classpath resources using Apache Tika.
 *
 * @author Anuj
 */
public class ApacheTikaParser {

    private void parseResource(String resourceName) {
        System.out.println("Parsing resource : " + resourceName);
        InputStream inputStream = null;
        try {
            inputStream = getClass().getClassLoader().getResourceAsStream(resourceName);
            // getResourceAsStream returns null when the resource is not on the classpath.
            if (inputStream == null) {
                System.out.println("Resource not found : " + resourceName);
                return;
            }

            // AutoDetectParser detects the document type and delegates to a matching parser.
            Parser parser = new AutoDetectParser();
            // BodyContentHandler collects the text content of the XHTML body.
            ContentHandler contentHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();

            parser.parse(inputStream, contentHandler, metadata, new ParseContext());

            // Print every metadata entry found in the document.
            for (String name : metadata.names()) {
                String value = metadata.get(name);
                System.out.println("Metadata Name: " + name);
                System.out.println("Metadata Value: " + value);
            }

            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + contentHandler.toString());
        } catch (IOException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } finally {
            // parse() does not close the stream, so the caller closes it here.
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        ApacheTikaParser apacheTikaParser = new ApacheTikaParser();
        apacheTikaParser.parseResource("NB Shortcuts.pdf");
        apacheTikaParser.parseResource("netbeans.txt");
        apacheTikaParser.parseResource("J2EE.docx");
    }
}
Output :
Parsing resource : NB Shortcuts.pdf
Metadata Name: xmpTPg:NPages
Metadata Value: 2
Metadata Name: Creation-Date
Metadata Value: 2011-08-28T12:53:16Z
Metadata Name: meta:creation-date
Metadata Value: 2011-08-28T12:53:16Z
Metadata Name: created
Metadata Value: Sun Aug 28 18:23:16 IST 2011
Metadata Name: dcterms:created
Metadata Value: 2011-08-28T12:53:16Z
Metadata Name: producer
Metadata Value: LibreOffice 3.3
Metadata Name: xmp:CreatorTool
Metadata Value: Writer
Metadata Name: Content-Type
Metadata Value: application/pdf
.....
Parsing resource : netbeans.txt
Metadata Name: Content-Encoding
Metadata Value: windows-1252
Metadata Name: Content-Type
Metadata Value: text/plain; charset=windows-1252
.....
Parsing resource : J2EE.docx
Metadata Name: cp:revision
Metadata Value: 25
Metadata Name: meta:last-author
Metadata Value: Anuj J Patel
Metadata Name: Last-Author
Metadata Value: Anuj J Patel
Metadata Name: meta:save-date
Metadata Value: 2008-09-17T12:08:00Z
Metadata Name: Application-Name
Metadata Value: Microsoft Office Word
Metadata Name: Author
Metadata Value: Anuj J Patel
Metadata Name: dcterms:created
Metadata Value: 2008-09-17T12:03:00Z
....
Please refer to Apache License before using.