search - Lucene 4.2.0 index pdf -


I am using the example source code from the Lucene 4.2.0 demo API: http://lucene.apache.org/core/4_2_0/demo/overview-summary.html

I ran IndexFiles.java to create an index of a directory of RTF, PDF, DOC, and DOCX files. I then ran SearchFiles.java and noticed several instances where searches fail, i.e. they do not return a document that contains the word searched for.

I suspect this has to do with Lucene 4.2.0 not being able to correctly index non-.txt files without additional customization.

Question: can the IndexFiles.java source code (Lucene 4.2.0) correctly index PDF, DOC, and DOCX files as written in the provided link? Does anyone have examples or references on how to code this functionality?

Thanks.

No, it can't. IndexFiles is a demo, an example to learn from, not designed for production use. If you take a look at the code, you'll see it uses a FileInputStream (wrapped in an InputStreamReader, wrapped in a BufferedReader), so every file is read as if it were plain text. Generally, Lucene won't handle how to parse different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
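To illustrate why the plain-text read misses words, here is a minimal sketch using only the JDK. It does not use Lucene at all. The class name `RawVersusExtracted`, its helper methods, and the in-memory "DOCX" (a ZIP holding a single `word/document.xml` entry, which is far from a complete Word document) are hypothetical stand-ins made up for this illustration. The tag-stripping regex is deliberately crude, and a real project would hand this step to a proper parsing library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class RawVersusExtracted {

    /** Builds a tiny DOCX-like ZIP in memory (assumption: stand-in for a real file). */
    public static byte[] buildFakeDocx(String sentence) throws IOException {
        StringBuilder xml = new StringBuilder("<w:document><w:body>");
        for (int i = 0; i < 50; i++) {
            xml.append("<w:p><w:r><w:t>").append(sentence).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Mimics the demo's approach: decode the raw file bytes as UTF-8 text. */
    public static String readAsPlainText(byte[] file) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(file), StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    /** Crude format-aware extraction: unzip word/document.xml and drop the tags. */
    public static String extractDocxText(byte[] file) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(file))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    ByteArrayOutputStream xml = new ByteArrayOutputStream();
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = zip.read(buf)) != -1) {
                        xml.write(buf, 0, n);
                    }
                    String raw = new String(xml.toByteArray(), StandardCharsets.UTF_8);
                    return raw.replaceAll("<[^>]+>", " ");
                }
            }
        }
        return "";
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = buildFakeDocx("searchable needle");
        // Compare what a plain-text read sees with what a format-aware read sees.
        System.out.println("raw contains word:       " + readAsPlainText(docx).contains("needle"));
        System.out.println("extracted contains word: " + extractDocxText(docx).contains("needle"));
    }
}
```

The string returned by the extraction step is what you would pass to Lucene as the document's contents field, instead of the reader over the raw file that the demo uses.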

Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.

You might also consider using Solr.

