Elasticsearch with MongoDB: Searching PDFs

I was trying to save my PDF file in MongoDB's GridFS and then search
inside those PDFs using Elasticsearch. I performed the following steps:

  1. MongoDB side:

    mongod --port 27017 --replSet rs0 --dbpath "D:\Mongo-DB\mongodb-win32-i386-2.0.7\data17"
    mongod --port 27018 --replSet rs0 --dbpath "D:\Mongo-DB\mongodb-win32-i386-2.0.7\data18"
    mongod --port 27019 --replSet rs0 --dbpath "D:\Mongo-DB\mongodb-win32-i386-2.0.7\data19"

    mongo localhost:27017
    rs.initiate()
    rs.add("hostname:27018")
    rs.add("hostname:27019")

    mongofiles -hlocalhost:27017 --db testmongo --collection files --type application/pdf put D:\Sherlock-Holmes.pdf

  2. Elasticsearch side (installed plugins: bigdesk, head, mapper-attachments, river-mongodb):

Using Elasticsearch Head, I sent the following request from the "Any request" tab:

PUT http://localhost:9200/_river/mongodb/_meta

{
  "type": "mongodb",
  "mongodb": {
    "db": "testmongo",
    "collection": "fs.files",
    "gridfs": true,
    "contentType": "",
    "content": "base64 /path/filename | perl -pe 's/\n/\\n/g'"
  },
  "index": {
    "name": "testmongo",
    "type": "files",
    "content_type": "application/pdf"
  }
}

Then I accessed the following URL:

http://localhost:9200/testmongo/files/508e82e21e43def09b5e1602?pretty=true

I got the following response (which I believe is as expected):

{
  "_index" : "testmongo",
  "_type" : "files",
  "_id" : "508e82e21e43def09b5e1602",
  "_version" : 1,
  "exists" : true, "_source" : 

{"_id":"508e82e21e43def09b5e1602","filename":"D:\Sherlock-Holmes.pdf","chunkSize":262144,"uploadDate":"2012-10-29T13:21:38.969Z","md5":"025fa2046f9254d2aecb9e52ae851065","length":98272,"contentType":"application/pdf"}
}

But when I tried to search this PDF using the following URL:

http://localhost:9200/testmongo/files/_search?q=Albers&pretty=true

it gave me the following result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

It shows no hits, even though the word "Albers" is present in this PDF.
Please help. Thanks in advance.

--

Can anybody help me with this?

--

Hi Vimlesh,

I haven't worked with the MongoDB river, so maybe somebody else will
jump in and give specific tips.

I suppose your document isn't indexed as you'd expected it. Getting
the document should produce a rather big base64 string in the _source,
as opposed to what you got there.
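
To make that check concrete, here is a minimal Python sketch that looks at a parsed GET response and reports whether `_source` carries an attachment payload. The field name `content` is an assumption (it matches the attachment field in the mapping posted later in this thread), and the helper name is hypothetical.

```python
import base64
import binascii


def has_attachment_payload(doc, field="content"):
    """Return True if doc["_source"][field] holds a non-empty base64 string.

    `field` is an assumed name for the attachment field; adjust it to
    whatever your mapping declares as type "attachment".
    """
    payload = doc.get("_source", {}).get(field)
    if not isinstance(payload, str) or not payload.strip():
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet
        base64.b64decode("".join(payload.split()), validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


# The document from the original post: only GridFS metadata, no payload.
metadata_only = {
    "_id": "508e82e21e43def09b5e1602",
    "_source": {
        "filename": "D:\\Sherlock-Holmes.pdf",
        "length": 98272,
        "contentType": "application/pdf",
    },
}
print(has_attachment_payload(metadata_only))  # False
```

If this prints False for a document the river has indexed, the file body never reached Elasticsearch, so there is nothing for the attachment mapper to extract.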

If you follow the wiki here (the GridFS part), does it work for you?
https://github.com/richardwilly98/elasticsearch-river-mongodb/wiki

If not, please post your mapping here.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

--

Hi Vimlesh,

Is your question related to the MongoDB river, the attachment mapper, or
PDF indexing?

Where did you read this syntax? It looks weird:

"content": "base64 /path/filename | perl -pe 's/\n/\\n/g'"

But, as you can imagine, without knowing the PDF files, it is not possible
to find out whether they can be processed by Apache Tika (which is the
workhorse of the attachment mapper plugin).

Jörg

--

Yes, that looks strange to me too.

I don’t see this option on the wiki page.

--

Hi Radu,

I already went through the GridFS wiki page, but it's not working for me.
Which mapping do you want me to post?

Vimlesh

--

Hello Jörg,

My question is related to both the MongoDB river and the attachment mapper.
I want to use Elasticsearch to search inside PDF files that are stored in
MongoDB's GridFS.

I followed the GridFS section of the wiki for this configuration, but it
didn't help me search the text in those PDFs. The actual problem, as I see
it, is that the attachment mapper plugin is not able to extract text from
the PDFs that the MongoDB river plugin fetches from MongoDB (that is why I
tried that weird content configuration :( ). Can you send me the exact
steps/configuration to enable this? I am really stuck at this point.

Thanks
Vimlesh

--

Hi Vimlesh,

The river does not work with MongoDB 2.2.1 installed here. It connects but
does not fetch any data.

My steps:

https://gist.github.com/3987668

So I can confirm
https://github.com/richardwilly98/elasticsearch-river-mongodb/issues/37

Hopefully you have more luck with 2.0.7.

Best regards,

Jörg

--

Hello Jörg,

I was already using MongoDB 2.0.7, and I also tried the steps you
provided at https://gist.github.com/3987668

When I executed curl -XGET 'http://localhost:9200/mongoindex/files/_search?q=*'

I got the following response (which confirms that the river is able to fetch
files from MongoDB):

{"took":0,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":1,"max_score":1.0,"hits":
[{"_index":"mongoindex","_type":"files","_id":"5092077ee572eaa4f5a2cc3a","_score":1.0, "_source" :
{"_id":"5092077ee572eaa4f5a2cc3a","filename":"D:\Sherlock-Holmes.pdf","chunkSize":262144,
"uploadDate":"2012-11-01T05:24:14.608Z","md5":"025fa2046f9254d2aecb9e52ae851065","length":98272}}]}}

But when I tried curl -XGET 'http://localhost:9200/mongoindex/files/_search?q=Albers'

I got the following response:

{"took":0,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

So the actual problem is that the attachment mapper plugin is not able to
extract text from the PDF file; otherwise the above query should return some
hits, because the PDF file actually contains the word "Albers". I am not sure
whether I have missed some configuration or whether this is simply not feasible.
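
The reasoning above, comparing the match-all result with the term search, can be written down as a small Python sketch. The function name and the wording of the messages are hypothetical; the two inputs are trimmed copies of the responses shown above.

```python
def diagnose(match_all_response, term_response):
    """Classify the situation from two parsed _search responses."""
    fetched = match_all_response["hits"]["total"]
    matched = term_response["hits"]["total"]
    if fetched == 0:
        return "river has not indexed any documents"
    if matched == 0:
        return "documents are indexed, but the term is not in any indexed field"
    return "term found"


# Trimmed copies of the q=* and q=Albers responses from this post.
match_all = {"hits": {"total": 1, "max_score": 1.0}}
albers = {"hits": {"total": 0, "max_score": None, "hits": []}}
print(diagnose(match_all, albers))
```

For the two responses in this post it reports the second case: the river created the document, but the text of the PDF is not in the index.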

Also, this is the mapping configuration I got when I executed curl -XGET 'http://localhost:9200/_mapping':

{
  "_river": {
    "mongodb": {
      "properties": {
        "_last_ts": { "type": "string" },
        "index": {
          "dynamic": true,
          "properties": {
            "name": { "type": "string" },
            "type": { "type": "string" }
          }
        },
        "mongodb": {
          "dynamic": true,
          "properties": {
            "collection": { "type": "string" },
            "db": { "type": "string" },
            "gridfs": { "type": "boolean" }
          }
        },
        "node": {
          "dynamic": true,
          "properties": {
            "id": { "type": "string" },
            "name": { "type": "string" },
            "transport_address": { "type": "string" }
          }
        },
        "ok": { "type": "boolean" },
        "type": { "type": "string" }
      }
    }
  },
  "mongoindex": {
    "files": {
      "properties": {
        "chunkSize": { "type": "long" },
        "content": {
          "type": "attachment",
          "path": "full",
          "fields": {
            "content": { "type": "string" },
            "author": { "type": "string" },
            "title": { "type": "string" },
            "name": { "type": "string" },
            "date": { "type": "date", "format": "dateOptionalTime" },
            "keywords": { "type": "string" },
            "content_type": { "type": "string" }
          }
        },
        "contentType": { "type": "string" },
        "filename": { "type": "string" },
        "length": { "type": "long" },
        "md5": { "type": "string" },
        "uploadDate": { "type": "date", "format": "dateOptionalTime" }
      }
    }
  }
}

--

Yes, without knowing the characteristics of the PDF it is hard to find out
whether the Tika PDF processing works in this case. You can test it by using
PDFBox directly, see
http://pdfbox.apache.org/commandlineutilities/ExtractText.html
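
As a rough sketch of such a test, the Python below wraps the PDFBox ExtractText tool and checks the extracted text for a word. It assumes the PDFBox 1.x/2.x app-jar command-line syntax (`java -jar <jar> ExtractText <input> <output>`); the jar and file paths are hypothetical placeholders, and the helper names are made up for illustration.

```python
import subprocess
from pathlib import Path


def extract_text_command(jar, pdf, out_txt):
    """Build the PDFBox ExtractText command line (1.x/2.x app-jar syntax)."""
    return ["java", "-jar", str(jar), "ExtractText", str(pdf), str(out_txt)]


def contains_word(text, word):
    """Case-insensitive substring check on the extracted text."""
    return word.lower() in text.lower()


def pdf_contains(jar, pdf, word, out_txt="extracted.txt"):
    """Run PDFBox on `pdf` and report whether `word` appears in its text."""
    subprocess.run(extract_text_command(jar, pdf, out_txt), check=True)
    text = Path(out_txt).read_text(encoding="utf-8", errors="replace")
    return contains_word(text, word)


# Example call (hypothetical paths, needs Java and a PDFBox app jar):
# pdf_contains("pdfbox-app.jar", "Sherlock-Holmes.pdf", "Albers")
```

If the word cannot be found this way either, the PDF itself is the problem (for example scanned pages without a text layer) rather than the river or the mapper.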

Best regards,

Jörg

--

Jörg, don't we have any way/configuration in the MongoDB river itself to
extract text from PDFs (using the attachment mapper) while fetching them
from MongoDB?

--

Jörg, I am waiting for your response. Don't we have any way/configuration in the
MongoDB river itself to extract text from PDFs [using attachment-mapper]
while fetching them from MongoDB?

--

Hi,

The collection in the river settings should be "fs" and not "fs.files". See
example here:
https://github.com/richardwilly98/elasticsearch-river-mongodb/blob/master/src/test/java/test/elasticsearch/plugin/river/mongodb/test-gridfs-mongodb-river.json

I have just published version 1.5.0 which has been tested with MongoDB
2.2.1 and ES 0.19.11

Thanks,
Richard.
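
Some background on why the name matters: GridFS keeps each bucket in two collections, `<bucket>.files` for the metadata and `<bucket>.chunks` for the data, and `fs` is the default bucket name. Judging by Richard's answer, the river setting expects the bucket name rather than one of the two collections. A small sketch (the helper names are illustrative and not part of the river) that normalises a name before it goes into the settings:

```python
def gridfs_bucket(name):
    """Strip a trailing ".files" or ".chunks" to get the GridFS bucket name."""
    for suffix in (".files", ".chunks"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def gridfs_collections(bucket):
    """The two MongoDB collections that back a GridFS bucket."""
    return bucket + ".files", bucket + ".chunks"


print(gridfs_bucket("fs.files"))   # fs
print(gridfs_collections("fs"))    # ('fs.files', 'fs.chunks')
```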

--

Hi Vimlesh,

I had a second look at the wiki page. It says:

Create the river as follows:

$ curl -XPUT "localhost:9200/_river/mongogridfs/_meta" -d'
{
  type: "mongodb",
  mongodb: {
    db: "testmongo",
    collection: "files",
    gridfs: true
  },
  index: {
    name: "testmongo",
    type: "files"
  }
}'

Import the document using the command line:

%MONGO_HOME%\bin\mongofiles.exe --host localhost:27017 --db testmongo --collection files put test-large-document.pdf

Please let me know if the wiki page needs to be updated.

Thanks,
Richard.

--

Hello Richard,

As you suggested, I tried the above configuration, and now everything is working
as expected. Thanks a ton.

I also tried the "fs" configuration and it works too, so the above commands become:

mongofiles.exe --host localhost:27017 --db testmongo --collection fs put test-large-document.pdf

and

$ curl -XPUT "localhost:9200/_river/mongogridfs/_meta" -d'
{
  type: "mongodb",
  mongodb: {
    db: "testmongo",
    collection: "fs",
    gridfs: true
  },
  index: {
    name: "testmongo",
    type: "files"
  }
}'
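
For anyone scripting this instead of using curl, here is a rough Python equivalent of the river registration above. It uses only the standard library; the function names are invented for this sketch, and the settings mirror the ones reported as working in this thread.

```python
import json
import urllib.request


def river_settings(db, bucket, index, doc_type):
    """River _meta body for a GridFS bucket, mirroring the curl example."""
    return {
        "type": "mongodb",
        "mongodb": {"db": db, "collection": bucket, "gridfs": True},
        "index": {"name": index, "type": doc_type},
    }


def river_request(es_url, river_name, settings):
    """Build (but do not send) the PUT request that registers the river."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        es_url.rstrip("/") + "/_river/" + river_name + "/_meta",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = river_request(
    "http://localhost:9200", "mongogridfs",
    river_settings("testmongo", "fs", "testmongo", "files"),
)
print(request.get_method(), request.full_url)
# PUT http://localhost:9200/_river/mongogridfs/_meta

# To actually register the river against a running node:
# urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the exact URL and body before anything touches the cluster.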

--

How to create an index for a PDF attachment using
elasticsearch-river-mongodb 1.6.9 (no hits, or missing fields)

Dear all,
I am new to Elasticsearch. For days I have tried to follow the various tutorials
and posts on indexing and mapping PDF documents attached in a MongoDB database,
without success. After running the code below I do not get any hits for words
that exist in the files attached in MongoDB.

Software versions:
MongoDB: mongodb-linux-x86_64-2.4.3
elasticsearch-river-mongodb: 1.6.9
elasticsearch: 0.90
elasticsearch-mapper-attachments: 1.7.0

Problem No. 1


  1. BSON structure: the PDF attachment is in the "FileContent" field; the
    attachment is not in GridFS.
    // Read the PDF into a byte array and count its pages.
    byte[] fileser = iou.read(file);
    Pagecount = getpagenum(file);
    // Copy the article metadata from the JSON array into one document.
    BasicDBObject articleobject = new BasicDBObject();
    articleobject.put("Title", jsonArray.getJSONObject(i).get("Title"));
    articleobject.put("Authors", jsonArray.getJSONObject(i).get("Authors"));
    articleobject.put("Organization", jsonArray.getJSONObject(i).get("Organization"));
    articleobject.put("Media", jsonArray.getJSONObject(i).get("Media"));
    articleobject.put("ISSN", jsonArray.getJSONObject(i).get("ISSN"));
    articleobject.put("Pages", jsonArray.getJSONObject(i).get("Pages"));
    articleobject.put("Pagecount", Pagecount);
    articleobject.put("Abstracts", jsonArray.getJSONObject(i).get("Abstracts"));
    articleobject.put("Keywords", "");
    // Store the raw PDF bytes next to the metadata.
    articleobject.put("FileContent", fileser);
    collection.insert(articleobject);

  2. Create an index
    curl -XPUT "http://localhost:9200/articleindex"

  3. Create a mapping
    curl -XPUT 'http://localhost:9200/articleindex/cardiopathy/_mapping' -d '
    {
      "cardiopathy" : {
        "properties" : {
          "Authors" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "Media" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "Organization" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "Keywords" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "Title" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "ISSN" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "Pages" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "Abstracts" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
          "FileContent" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"}
        }
      }
    }'

  4. Create the river
    curl -XPUT "http://localhost:9200/_river/mongodb/_meta" -d '
    {
      "type": "mongodb",
      "mongodb": {
        "host": "192.168.1.112",
        "port": "27107",
        "options": {"drop_collection": true},
        "db": "ftsearch1",
        "collection": "pdf"
      },
      "index": {
        "name": "articleindex",
        "type": "cardiopathy"
      }
    }'

  5. Search the indexed documents by keyword
    curl -XGET http://localhost:9200/articleindex/cardiopathy/_search -d '
    {
      "fields" : ["Title"],
      "query" : { "text" : { "FileContent" : "高血压病辨证分型与靶器官相关性研究的新进展" }}
    }'

{"took":179,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
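
One thing that stands out in Problem No. 1: `FileContent` holds the raw PDF bytes but is mapped as a plain string, so nothing extracts text from it. The mapper-attachments plugin expects an `attachment`-typed field whose value is the base64-encoded file. The sketch below shows only that encoding step. Whether the river converts a BSON binary field this way by itself is not something this thread confirms, and the helper name is hypothetical.

```python
import base64


def attachment_document(title, pdf_bytes):
    """Build an index-ready document with the PDF as base64 text."""
    return {
        "Title": title,
        # An attachment-typed field takes base64 text, not raw bytes.
        "FileContent": base64.b64encode(pdf_bytes).decode("ascii"),
    }


# Placeholder bytes standing in for a real PDF file.
doc = attachment_document("example", b"%PDF-1.4 placeholder bytes")
print(doc["FileContent"][:8])   # JVBERi0x
```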

Problem No. 2


Altered mapping:

curl -XPUT 'http://localhost:9200/articleindex/cardiopathy/_mapping' -d '
{
  "cardiopathy" : {
    "file" : {
      "properties" : {
        "Authors" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "Media" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "Organization" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "Keywords" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "Title" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "ISSN" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "Pages" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "Abstracts" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes"},
        "FileContent" : {
          "type" : "attachment",
          "fields" : {
            "file" : {"indexAnalyzer": "ik", "searchAnalyzer": "ik", "store": "yes", "index": "analyzed"},
            "date" : {"store": "yes"},
            "author" : {"store": "yes"},
            "keywords" : {"store": "yes"},
            "content_type" : {"store": "yes"},
            "title" : {"store": "yes"}
          }
        }
      }
    }
  }
}'

Search the indexed documents by keyword:

{"took":63,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
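
A possible cause worth ruling out in Problem No. 2: a put-mapping body is normally shaped as the type name with `"properties"` directly beneath it, whereas the body above has an extra `"file"` level between `"cardiopathy"` and `"properties"`. The check below is just a sketch with made-up helper names. It does not talk to Elasticsearch; it only reports where `properties` sits in a mapping body.

```python
def properties_path(mapping, type_name):
    """Return the key path from the type name down to the first "properties".

    A conventional body gives [type_name, "properties"]; anything longer
    means there is an extra nesting level. Returns None if no "properties"
    key is found under the type.
    """
    def walk(node, path):
        if not isinstance(node, dict):
            return None
        if "properties" in node:
            return path + ["properties"]
        for key, value in node.items():
            found = walk(value, path + [key])
            if found:
                return found
        return None

    return walk(mapping.get(type_name), [type_name])


# Trimmed-down versions of the posted body and of the conventional shape.
posted = {"cardiopathy": {"file": {"properties": {"Title": {"type": "string"}}}}}
usual = {"cardiopathy": {"properties": {"Title": {"type": "string"}}}}

print(properties_path(posted, "cardiopathy"))  # ['cardiopathy', 'file', 'properties']
print(properties_path(usual, "cardiopathy"))   # ['cardiopathy', 'properties']
```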

Problem No. 3


  1. The attachment is in GridFS; in addition, we define the other fields
    on the GridFS file.
    // Create the GridFS file and name it after the source file, minus its extension.
    GridFSInputFile gfsFile = gfsPhoto.createFile(file);
    String filename = file.getName();
    filename = filename.substring(0, filename.lastIndexOf("."));
    gfsFile.setFilename(filename);
    // Attach the article metadata as extra fields on the GridFS file document.
    gfsFile.put("Title", jsonArray.getJSONObject(i).get("Title"));
    gfsFile.put("Authors", jsonArray.getJSONObject(i).get("Authors"));
    gfsFile.put("Organization", jsonArray.getJSONObject(i).get("Organization"));
    gfsFile.put("Media", jsonArray.getJSONObject(i).get("Media"));
    gfsFile.put("ISSN", jsonArray.getJSONObject(i).get("ISSN"));
    gfsFile.put("Pages", jsonArray.getJSONObject(i).get("Pages"));
    gfsFile.put("Pagecount", Pagecount);
    gfsFile.put("Abstracts", jsonArray.getJSONObject(i).get("Abstracts"));
    gfsFile.put("Keywords", "");
    gfsFile.save();

  2. Create an index
    curl -XPUT "http://localhost:9200/articleindex"

  3. Create a mapping
    curl -XPUT 'http://localhost:9200/cardiopathyindex/cardiopathy/_mapping' -d '
    {
      "cardiopathy": {
        "properties" : {
          "content" : {
            "path" : "full",
            "type" : "attachment",
            "fields" : {
              "content" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "Authors" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "Media" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "Organization" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "Keywords" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "Title" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "ISSN" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "Pages" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "Abstracts" : {"type": "string", "indexAnalyzer": "ik", "searchAnalyzer": "ik"},
              "date" : {"format": "dateOptionalTime", "type": "date"},
              "content_type" : {"type": "string"}
            }
          },
          "chunkSize" : {"type": "long"},
          "md5" : {"type": "string"},
          "length" : {"type": "long"},
          "filename" : {"type": "string"},
          "contentType" : {"type": "string"},
          "uploadDate" : {"format": "dateOptionalTime", "type": "date"},
          "metadata" : {"type": "object"}
        }
      }
    }'

  4. Create the river
    curl -XPUT "http://localhost:9200/_river/mongodb/_meta" -d '
    {
      "type": "mongodb",
      "mongodb": {
        "host": "192.168.1.112",
        "port": "27107",
        "options": {"drop_collection": true},
        "db": "ftsearch",
        "collection": "fs",
        "gridfs": true
      },
      "index": {
        "name": "cardiopathyindex",
        "type": "cardiopathy",
        "content_type": "application/pdf"
      }
    }'

  5. Search the indexed documents by keyword. This returns hits, but the
    "Title" and "Authors" fields are missing from the result.
    curl -XGET http://localhost:9200/cardiopathyindex/cardiopathy/_search -d '
    {
      "fields" : ["Title","Authors"],
      "query" : { "text" : { "content" : "高血压病辨证分型与靶器官相关性研究的新进展" }}
    }'
    {"took":1005,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":96,"max_score":0.68918943,"hits":[{"_index":"cardiopathyindex","_type":"cardiopathy","_id":"51d972d5948516489d1674d1","_score":0.68918943},{"_index":"cardiopathyindex","_type":"cardiopathy","_id":"51d972db948516489d167545","_score":0.22994329},{"_index":"cardiopathyindex","_type":"cardiopathy","_id":"51d972da948516489d16752c","_score":0.20929527},.......
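
Until the missing fields are sorted out, one workaround is to take the `_id` of each hit and look the metadata up in MongoDB. This is a sketch, not something proposed in the thread: it assumes the Elasticsearch `_id` equals the `_id` of the matching `fs.files` document, which is how the ids in the first question of this thread look. The sample response holds the first two hits from the output above, and the helper name is made up.

```python
def hit_ids(search_response):
    """Collect the _id of every hit, in the order Elasticsearch returned them."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


# First two hits from the response above.
response = {"hits": {"total": 96, "max_score": 0.68918943, "hits": [
    {"_index": "cardiopathyindex", "_type": "cardiopathy",
     "_id": "51d972d5948516489d1674d1", "_score": 0.68918943},
    {"_index": "cardiopathyindex", "_type": "cardiopathy",
     "_id": "51d972db948516489d167545", "_score": 0.22994329},
]}}

print(hit_ids(response))
# ['51d972d5948516489d1674d1', '51d972db948516489d167545']
```

Each id can then be used to fetch "Title" and "Authors" from `fs.files` with whichever MongoDB driver is already in use.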
