search - Lucene 4.2.0 index pdf

August 15, 2012

I am using the example source code from the Lucene 4.2.0 demo API: http://lucene.apache.org/core/4_2_0/demo/overview-summary.html

I run IndexFiles.java to create an index of a directory of RTF, PDF, DOC, and DOCX files. I then run SearchFiles.java and notice several instances where searches fail, i.e. they do not return a document that contains the word searched for.

I suspect this has to do with Lucene 4.2.0 not being able to correctly index non-.txt files without additional customization.

Question: can the IndexFiles.java source code (Lucene 4.2.0) correctly index PDF, DOC, and DOCX files as written in the provided link? Does anyone have examples or references on how to code this functionality?

Thanks.

No, it can't. IndexFiles is a demo, an example to learn from, not designed for production use. If you take a look at the code, you'll see it uses a FileInputStream (wrapped in an InputStreamReader, wrapped in a BufferedReader), so every file is read as if it were plain text. Generally, Lucene won't handle how to parse different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
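To illustrate why that reader chain misses words in binary formats, here is a minimal sketch using only the JDK. It assumes UTF-8 decoding, and it uses a Flate-compressed byte array as a hypothetical stand-in for the compressed content typically found inside a PDF stream or a DOCX (ZIP) archive. The class and method names are made up for this illustration and are not part of the Lucene demo.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

public class RawReaderDemo {

    // Same kind of chain the answer describes: raw bytes decoded as text
    // (UTF-8 is an assumption here), with no knowledge of the file format.
    static String readAsPlainText(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Hypothetical stand-in for a binary document: the same text, Flate-compressed.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length + 64]; // ample for short inputs
        int length = deflater.deflate(buffer);
        deflater.end();
        return Arrays.copyOf(buffer, length);
    }

    public static void main(String[] args) {
        byte[] plain = "the searchable word is lucene".getBytes(StandardCharsets.UTF_8);
        byte[] binary = deflate(plain);

        String fromTxt = readAsPlainText(new ByteArrayInputStream(plain));
        String fromBinary = readAsPlainText(new ByteArrayInputStream(binary));

        // An analyzer can only tokenize what the reader hands it.
        System.out.println("plain text contains 'lucene': " + fromTxt.contains("lucene"));
        System.out.println("compressed bytes contain 'lucene': " + fromBinary.contains("lucene"));
    }
}
```

The point of the sketch is that the word is present in the document, but the characters handed to the analyzer never spell it out, so no matching term ends up in the index.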

Apache Tika might be a good place to look for that functionality. Simple examples of using Tika with Lucene are available.
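As a rough sketch of how the two libraries can be combined, the following assumes Lucene 4.2 (lucene-core and lucene-analyzers-common) and Apache Tika 1.x with its parsers on the classpath. The class name TikaIndexer, the command-line arguments, and the field names "path" and "contents" are choices made for this illustration; it is not the demo's code, and error handling for unparseable files is left out.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaIndexer {

    // Lets Tika detect the format and extract plain text, then adds one
    // Lucene document for the file.
    static void indexFile(IndexWriter writer, Tika tika, File file)
            throws IOException, TikaException {
        String text = tika.parseToString(file);
        Document doc = new Document();
        // Stored, untokenized path so search results can point back to the file.
        doc.add(new StringField("path", file.getPath(), Field.Store.YES));
        // Tokenized, unstored body text: this is what queries run against.
        doc.add(new TextField("contents", text, Field.Store.NO));
        writer.addDocument(doc);
    }

    // Usage (hypothetical): java TikaIndexer <docsDir> <indexDir>
    public static void main(String[] args) throws Exception {
        File docsDir = new File(args[0]);
        File indexDir = new File(args[1]);

        Tika tika = new Tika();
        tika.setMaxStringLength(-1); // lift the default cap on extracted text

        Directory dir = FSDirectory.open(indexDir);
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
        IndexWriter writer = new IndexWriter(dir, config);
        try {
            File[] files = docsDir.listFiles();
            if (files != null) {
                for (File file : files) {
                    if (file.isFile()) {
                        indexFile(writer, tika, file);
                    }
                }
            }
        } finally {
            writer.close();
        }
    }
}
```

The only real change from a plain-text indexer is that the "contents" field receives text extracted by Tika rather than the file's raw bytes. A search tool that queries a "contents" field should then be able to find words inside PDF, DOC, and DOCX files, provided Tika can parse them.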

You might also consider using Solr.
