Information extraction, indexing and search of PDF, Word and text documents with MongoDB

April 15, 2012

Does MongoDB have a feature to store PDF, text or .doc/.docx documents, search them, or match two documents on a keyword found in their content?

For example:

I might want to store one document called 'claim.txt' that has values for a diagnosis code, short description, date and amount in it.
I also need to store one called 'physician_diagnosis.pdf' that has, among other text, a matching short description in it.

I then issue a query to find the document that has both a matching date and the same diagnosis (e.g. 'pneumonia', '12/12/2012').

Is this possible with MongoDB using an API, or does it need pre-processing?

If it is possible, please point me to an example and documentation.

Your task is better suited to Solr (http://lucene.apache.org/solr/), which accepts many different document formats as input (http://wiki.apache.org/solr/extractingrequesthandler). You will have to write some code for proper extraction, though.
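As a rough illustration of what sending a file to Solr's extracting handler involves, here is a minimal sketch that only builds the request URL. The `/update/extract` path and the `literal.id` and `commit` parameters come from the ExtractingRequestHandler wiki page; the host, port and the helper name `build_extract_url` are assumptions made for this example, and nothing is actually sent.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, commit=True):
    """Build the URL for posting one file to Solr's extracting handler.

    base_url is a hypothetical Solr location; doc_id becomes the
    document's unique id via the literal.id parameter.
    """
    params = {"literal.id": doc_id}
    if commit:
        # Ask Solr to commit right away so the document is searchable.
        params["commit"] = "true"
    return f"{base_url}/update/extract?{urlencode(params)}"


# Assumed local Solr instance; adjust host, port and path to your setup.
url = build_extract_url("http://localhost:8983/solr", "physician_diagnosis.pdf")
print(url)
```

The PDF itself would then be sent as the body of an HTTP POST to that URL, and the extracted text ends up in whatever fields your Solr schema maps it to.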

MongoDB is meant more for structured data. Although we call them documents, that does not mean "PDF documents" or "Word documents" here. It's a generic format that supports nested field types, which we call a document, as opposed to a relational database row, which doesn't allow that.
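To make the distinction concrete, here is a minimal sketch of the pre-processing route in plain Python. It assumes 'claim.txt' uses simple `key: value` lines and that the text of 'physician_diagnosis.pdf' has already been extracted by some other tool; the sample values, the layout and the helper names `parse_claim` and `matches_claim` are all made up for illustration. The nested dictionaries are the kind of "document" MongoDB stores.

```python
def parse_claim(text):
    """Parse 'key: value' lines (an assumed layout for claim.txt) into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            # Normalise 'short description' to 'short_description', etc.
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


def matches_claim(claim, extracted_text):
    """Return True if the text mentions both the claim's description and date."""
    text = extracted_text.lower()
    return claim["short_description"].lower() in text and claim["date"] in text


# Invented sample content, not real data.
claim_txt = """diagnosis code: 486
short description: pneumonia
date: 12/12/2012
amount: 250.00"""

# Assumed to be the already-extracted text of physician_diagnosis.pdf.
physician_text = "Patient seen on 12/12/2012. Findings consistent with pneumonia."

# Structured, nested records: this is what "document" means in MongoDB.
claim_doc = {"filename": "claim.txt", "fields": parse_claim(claim_txt)}
physician_doc = {"filename": "physician_diagnosis.pdf", "text": physician_text}

print(matches_claim(claim_doc["fields"], physician_doc["text"]))
```

Once the data is in this shape, records like `claim_doc` can be stored and queried by field in MongoDB; getting from a PDF or Word file to that shape is the part you have to do yourself.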
