Information extraction, indexing and search of PDF, Word and text documents with MongoDB


Does MongoDB have a feature to store PDF, text, or .doc/.docx documents, search them, or match two documents on a keyword found in their content?

For example:

I might want to store one document called 'claim.txt' that has values for a
diagnosis code, short description, date, and amount in it.
I also need to store one called 'physician_diagnosis.pdf' that has, among other text, the matching short description in it.

I would then issue a query to find the documents that have both the matching date and the same diagnosis (e.g. 'pneumonia', '12/12/2012').

Is this possible with MongoDB using its API, or do I need to do pre-processing?

If it is possible, please point me to an example or documentation.

Your task is better suited to Solr (http://lucene.apache.org/solr/), which accepts many different document formats as input (http://wiki.apache.org/solr/extractingrequesthandler). You will still have to write some code for proper extraction, though.

MongoDB is meant more for structured data. Although they call its records documents, that does not mean "PDF documents" or "Word documents" here. A document is a generic format that supports nested field types, as opposed to a relational database row, which doesn't allow that.
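To make the pre-processing idea concrete, here is a minimal sketch in plain Python of what "structured data" could look like for your two files. Everything in it is an assumption for illustration: the helper names (extract_fields, find_matching), the sample text, the made-up diagnosis code and the small list of known diagnoses are hypothetical, and it assumes the text has already been pulled out of the PDF by some other tool.

```python
import re

# Hypothetical vocabulary of diagnoses to look for (an assumption, not real data).
KNOWN_DIAGNOSES = ["pneumonia", "sprained ankle"]


def extract_fields(name, text, known_diagnoses=KNOWN_DIAGNOSES):
    """Turn raw text into a small structured record with nested list fields."""
    # Dates in the MM/DD/YYYY style used in the question.
    dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text)
    lowered = text.lower()
    diagnoses = [d for d in known_diagnoses if d in lowered]
    return {"name": name, "dates": dates, "diagnoses": diagnoses}


def find_matching(records, diagnosis, date):
    """Return the names of records that contain both the diagnosis and the date."""
    return [
        r["name"]
        for r in records
        if diagnosis in r["diagnoses"] and date in r["dates"]
    ]


# Made-up sample content; the PDF text is assumed to be extracted already.
claim_text = (
    "diagnosis code: D123\n"
    "short description: Pneumonia\n"
    "date: 12/12/2012\n"
    "amount: 250.00"
)
physician_text = "Patient seen on 12/12/2012. Findings are consistent with pneumonia."
other_text = "Patient seen on 01/05/2013 for a sprained ankle."

records = [
    extract_fields("claim.txt", claim_text),
    extract_fields("physician_diagnosis.pdf", physician_text),
    extract_fields("other_note.txt", other_text),
]

matches = find_matching(records, "pneumonia", "12/12/2012")
print(matches)
```

Records shaped like these are the kind of thing MongoDB stores well. If you inserted them into a collection, a filter along the lines of {"diagnoses": "pneumonia", "dates": "12/12/2012"} should express the same match, since a scalar in a query matches an array field that contains it. The extraction step itself would still be your own code.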

