Elastic search With MongoDB : Searching PDFs

Vimlesh_Mishra · October 29, 2012, 1:46pm

I am trying to store PDF files in MongoDB's GridFS and then search inside
those PDFs using Elasticsearch. I performed the following steps:

MongoDB side:

mongod --port 27017 --replSet rs0 --dbpath "D:\Mongo-DB\mongodb-win32-i386-2.0.7\data17"
mongod --port 27018 --replSet rs0 --dbpath "D:\Mongo-DB\mongodb-win32-i386-2.0.7\data18"
mongod --port 27019 --replSet rs0 --dbpath "D:\Mongo-DB\mongodb-win32-i386-2.0.7\data19"

mongo localhost:27017
rs.initiate()
rs.add("hostname:27018")
rs.add("hostname:27019")

mongofiles -hlocalhost:27017 --db testmongo --collection files --type application/pdf put D:\Sherlock-Holmes.pdf

Elasticsearch side (installed plugins: bigdesk/head/mapper-attachments/river-mongodb):

-> Using Elasticsearch Head, I sent the following request from the "Any request" tab:

URL: http://localhost:9200/_river/mongodb/_meta (method: PUT)

{
  "type": "mongodb",
  "mongodb": {
    "db": "testmongo",
    "collection": "fs.files",
    "gridfs": true,
    "contentType": "",
    "content": "base64 /path/filename | perl -pe 's/\n/\\n/g'"
  },
  "index": {
    "name": "testmongo",
    "type": "files",
    "content_type": "application/pdf"
  }
}

Now I am trying to access the following URL:

http://localhost:9200/testmongo/files/508e82e21e43def09b5e1602?pretty=true

I got the following response (which I believe is as expected):

{
  "_index" : "testmongo",
  "_type" : "files",
  "_id" : "508e82e21e43def09b5e1602",
  "_version" : 1,
  "exists" : true,
  "_source" : {"_id":"508e82e21e43def09b5e1602","filename":"D:\Sherlock-Holmes.pdf","chunkSize":262144,"uploadDate":"2012-10-29T13:21:38.969Z","md5":"025fa2046f9254d2aecb9e52ae851065","length":98272,"contentType":"application/pdf"}
}

But when I tried to search this PDF using the following URL:

http://localhost:9200/testmongo/files/_search?q=Albers&pretty=true

it gave me the following result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

It shows no hits, even though the word "Albers" is present in this PDF.
Please help. Thanks in advance.

--

Vimlesh_Mishra · October 30, 2012, 7:51am

Can anybody help me with this?

--

radu_gheorghe · October 30, 2012, 2:43pm

Hi Vimlesh,

I haven't worked with the MongoDB river, so maybe somebody else will
jump in and give specific tips.

I suppose your document isn't indexed as you'd expected it. Getting
the document should produce a rather big base64 string in the _source,
as opposed to what you got there.
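
A quick way to check this on a fetched document is sketched below. It is only an illustration: the helper name has_attachment_content is made up, and the field name "content" is an assumption about where the river would place the base64 data.

```python
def has_attachment_content(doc: dict, field: str = "content") -> bool:
    """Return True if the document's _source carries a non-empty string in `field`."""
    source = doc.get("_source", {})
    value = source.get(field)
    return isinstance(value, str) and len(value) > 0


if __name__ == "__main__":
    # Metadata-only document, shaped like the GET response posted in this thread
    doc = {
        "_index": "testmongo",
        "_type": "files",
        "_id": "508e82e21e43def09b5e1602",
        "_source": {
            "filename": "D:\\Sherlock-Holmes.pdf",
            "length": 98272,
            "contentType": "application/pdf",
        },
    }
    print(has_attachment_content(doc))
```

If the check fails, the index only holds the GridFS file metadata, so there is no PDF text for a query to match.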

If you follow the wiki here (the GridFS part), does it work for you?
Home · richardwilly98/elasticsearch-river-mongodb Wiki · GitHub

If not, please post your mapping here.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

jprante · October 30, 2012, 3:31pm

Hi Vimlesh,

Is your question related to the mongodb river, the attachment mapper, or
to the PDF indexing?

Where did you read this syntax? It looks weird:

"content": "base64 /path/filename | perl -pe 's/\n/\\n/g'"

But, as you can imagine, without knowing the PDF files, it is not possible
to find out whether they can be processed by Apache Tika (which is the
workhorse of the attachment mapper plugin).
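
For reference, the quoted string reads like a shell pipeline rather than a setting value: it base64-encodes a file and replaces each newline with a literal \n so that the result fits inside a JSON string. A minimal Python sketch of the same transformation, assuming the file bytes are already in memory (the function name encode_attachment is made up for illustration):

```python
import base64


def encode_attachment(data: bytes) -> str:
    # Rough equivalent of the shell pipeline: base64 FILE | perl -pe 's/\n/\\n/g'
    # encodebytes emits line-wrapped base64, each line ending in a newline
    wrapped = base64.encodebytes(data).decode("ascii")
    # Turn every real newline into the two characters backslash + n
    return wrapped.replace("\n", "\\n")


if __name__ == "__main__":
    print(encode_attachment(b"%PDF-1.4 sample bytes"))
```

The output of such a step is document content to be indexed; pasting the command itself into a settings field, as in the request above, leaves it as an ordinary string that nothing executes.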

Jörg

--

dadoonet · October 30, 2012, 3:59pm

Yes. That’s strange to me also.

I don’t see this option in the Wiki page:

--

Vimlesh_Mishra · October 31, 2012, 6:32am

Hi Radu,

I already went through the GridFS wiki page, but it's not working for me.
Which mapping would you like me to post?

Vimlesh

--

Vimlesh_Mishra · October 31, 2012, 6:50am

Hello Jörg,

My question is related to the mongodb river and the attachment mapper. I want
to use Elasticsearch to search inside PDF files that are stored in MongoDB's
GridFS.

I followed the GridFS wiki page for this configuration, but that didn't
help me search the text in those PDFs. The actual problem I am seeing is
that the attachment mapper plugin is not able to extract text from the PDFs
that the mongodb river plugin fetched from MongoDB [to enable that, I tried
that weird content configuration :(]. Can you send me the exact
steps/configuration to enable that? I am really stuck at this point.

Thanks
Vimlesh

--

jprante · October 31, 2012, 3:36pm

Hi Vimlesh,

The river does not work here with MongoDB 2.2.1 installed. It connects but
does not fetch any data.

My steps:

https://gist.github.com/jprante/3987668

Jorg-Prantes-MacBook-Pro:mongodb-osx-x86_64-2.2.1 joerg$ mkdir -p data/rs0-0
Jorg-Prantes-MacBook-Pro:mongodb-osx-x86_64-2.2.1 joerg$ ./bin/mongod --dbpath data/rs0-0 --port 27017 --replSet rs0 
Wed Oct 31 15:57:06 [initandlisten] MongoDB starting : pid=13387 port=27017 dbpath=data/rs0-0 64-bit host=Jorg-Prantes-MacBook-Pro.local
Wed Oct 31 15:57:06 [initandlisten] db version v2.2.1, pdfile version 4.5
Wed Oct 31 15:57:06 [initandlisten] git version: d6764bf8dfe0685521b8bc7b98fd1fab8cfeb5ae
Wed Oct 31 15:57:06 [initandlisten] build info: Darwin erh-tnt.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_49
Wed Oct 31 15:57:06 [initandlisten] options: { dbpath: "data/rs0-0", port: 27017, replSet: "rs0" }
Wed Oct 31 15:57:06 [initandlisten] journal dir=data/rs0-0/journal
Wed Oct 31 15:57:06 [initandlisten] recover : no journal files present, no recovery needed
Wed Oct 31 15:57:06 [websvr] admin web console waiting for connections on port 28017

(log truncated in the gist preview)

So I can confirm issues with mongo 2.2.1 (Issue #37 · richardwilly98/elasticsearch-river-mongodb · GitHub).

Hopefully you have more luck with 2.0.7.

Best regards,

Jörg

--

Vimlesh_Mishra · November 1, 2012, 5:53am

Hello Jörg,

I was already trying with MongoDB 2.0.7, and I also tried the steps you
provided at [MongoDB river not working · GitHub].

When I executed

curl -XGET 'http://localhost:9200/mongoindex/files/_search?q=*'

I got the following response (which confirms that the river is able to fetch
files from MongoDB):

{"took":0,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":1,"max_score":1.0,"hits":
[{"_index":"mongoindex","_type":"files","_id":"5092077ee572eaa4f5a2cc3a","_score":1.0, "_source" :
{"_id":"5092077ee572eaa4f5a2cc3a","filename":"D:\Sherlock-Holmes.pdf","chunkSize":262144,
"uploadDate":"2012-11-01T05:24:14.608Z","md5":"025fa2046f9254d2aecb9e52ae851065","length":98272}}]}}

But when I tried

curl -XGET 'http://localhost:9200/mongoindex/files/_search?q=Albers'

I got the following response:

{"took":0,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

So the actual problem is that the attachment mapper plugin is not able to
extract text from the PDF file; otherwise the above query should return some
hits, because the PDF file actually contains the word "Albers". I am not sure
whether I have missed some configuration or whether this is not feasible.

I also found the following mapping configuration when I executed
curl -XGET 'http://localhost:9200/_mapping':

{
  "_river": {
    "mongodb": {
      "properties": {
        "_last_ts": { "type": "string" },
        "index": {
          "dynamic": true,
          "properties": {
            "name": { "type": "string" },
            "type": { "type": "string" }
          }
        },
        "mongodb": {
          "dynamic": true,
          "properties": {
            "collection": { "type": "string" },
            "db": { "type": "string" },
            "gridfs": { "type": "boolean" }
          }
        },
        "node": {
          "dynamic": true,
          "properties": {
            "id": { "type": "string" },
            "name": { "type": "string" },
            "transport_address": { "type": "string" }
          }
        },
        "ok": { "type": "boolean" },
        "type": { "type": "string" }
      }
    }
  },
  "mongoindex": {
    "files": {
      "properties": {
        "chunkSize": { "type": "long" },
        "content": {
          "type": "attachment",
          "path": "full",
          "fields": {
            "content": { "type": "string" },
            "author": { "type": "string" },
            "title": { "type": "string" },
            "name": { "type": "string" },
            "date": { "type": "date", "format": "dateOptionalTime" },
            "keywords": { "type": "string" },
            "content_type": { "type": "string" }
          }
        },
        "contentType": { "type": "string" },
        "filename": { "type": "string" },
        "length": { "type": "long" },
        "md5": { "type": "string" },
        "uploadDate": { "type": "date", "format": "dateOptionalTime" }
      }
    }
  }
}
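
One way to read the mapping above is to list which properties are mapped with type attachment and compare them with the fields that actually appear in the indexed _source. A small sketch, assuming the mapping has been loaded into a Python dict (the helper name attachment_fields is made up for illustration):

```python
def attachment_fields(mapping: dict, index: str, doc_type: str) -> list:
    """List the property names mapped with type "attachment" for one index/type."""
    properties = mapping.get(index, {}).get(doc_type, {}).get("properties", {})
    return [name for name, spec in properties.items() if spec.get("type") == "attachment"]


if __name__ == "__main__":
    # Trimmed copy of the mongoindex mapping shown above
    mapping = {
        "mongoindex": {
            "files": {
                "properties": {
                    "chunkSize": {"type": "long"},
                    "content": {"type": "attachment", "path": "full"},
                    "filename": {"type": "string"},
                }
            }
        }
    }
    print(attachment_fields(mapping, "mongoindex", "files"))
```

In this thread the mapping declares content as an attachment, while the _source returned by the searches has no content field, which fits the zero hits for "Albers".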

--

jprante · November 1, 2012, 10:49am

Yes, without knowing the characteristics of the PDF it is hard to find out
if the Tika PDF processing works in this case. You can test it by using
PDFBox directly, see
Apache PDFBox | Command-Line Tools

Best regards,

Jörg

On Thursday, November 1, 2012 6:53:14 AM UTC+1, Vimlesh wrote:

So the actual problem is that the attachment mapper plugin is not able to extract text from the PDF file; otherwise the above query
should return some hits, because the PDF file actually contains the word "Albers". I am not sure whether I have missed some configuration
or whether this is simply not feasible.
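
Jörg's suggestion can be tried outside Elasticsearch first. Below is a minimal Python sketch, assuming a PDFBox 1.x/2.x app jar is available locally. The jar path and file names are placeholders, not values from this thread. It builds the ExtractText command and then checks the extracted text for the search term:

```python
import subprocess
from pathlib import Path


def extract_text_command(pdfbox_jar, pdf_path, txt_path):
    """Build the PDFBox ExtractText command line (PDFBox 1.x/2.x style)."""
    return ["java", "-jar", str(pdfbox_jar), "ExtractText", str(pdf_path), str(txt_path)]


def contains_word(text, word):
    """Case-insensitive substring check on the extracted text."""
    return word.lower() in text.lower()


def pdf_contains_word(pdfbox_jar, pdf_path, txt_path, word):
    """Run PDFBox, then look for the word in the text file it wrote."""
    subprocess.run(extract_text_command(pdfbox_jar, pdf_path, txt_path), check=True)
    text = Path(txt_path).read_text(encoding="utf-8", errors="replace")
    return contains_word(text, word)


if __name__ == "__main__":
    # Placeholder paths: adjust to the local PDFBox jar and PDF file.
    print(extract_text_command("pdfbox-app.jar", "Sherlock-Holmes.pdf", "out.txt"))
```

If the extracted text does not contain the word, the cause is more likely the PDF itself (for example, scanned pages without a text layer) than the river or the mapping.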

--

Vimlesh_Mishra · November 1, 2012, 11:01am

Jörg, don't we have any way or configuration in the MongoDB river itself to
extract text from PDFs [using attachment-mapper] while fetching them from MongoDB?

--

Vimlesh_Mishra · November 5, 2012, 7:39am

Jörg, I am waiting for your response. Don't we have any way or configuration
in the MongoDB river itself to extract text from PDFs [using attachment-mapper]
while fetching them from MongoDB?

--

Richard_Louapre · November 5, 2012, 8:18am

Hi,

The collection in the river settings should be "fs" and not "fs.files". See
example here:
https://github.com/richardwilly98/elasticsearch-river-mongodb/blob/master/src/test/java/test/elasticsearch/plugin/river/mongodb/test-gridfs-mongodb-river.json

I have just published version 1.5.0 which has been tested with MongoDB
2.2.1 and ES 0.19.11

Thanks,
Richard.

--

Richard_Louapre · November 5, 2012, 4:58pm

Hi Vimlesh,

I had a second look at the wiki page. It says:

Create the river as follows:

$ curl -XPUT "localhost:9200/_river/mongogridfs/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "db": "testmongo",
    "collection": "files",
    "gridfs": true
  },
  "index": {
    "name": "testmongo",
    "type": "files"
  }
}'

Import the document using the command line:

%MONGO_HOME%\bin\mongofiles.exe --host localhost:27017 --db testmongo --collection files put test-large-document.pdf

Please let me know if the wiki page needs to be updated.

Thanks,
Richard.

--

Vimlesh_Mishra · November 7, 2012, 10:08am

Hello Richard,

As you suggested, I tried the above configuration, and now everything is
working as expected. Thanks a ton.

I also tried the "fs" configuration and it works too, so the commands above
become:

mongofiles.exe --host localhost:27017 --db testmongo --collection fs put test-large-document.pdf

and

$ curl -XPUT "localhost:9200/_river/mongogridfs/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "db": "testmongo",
    "collection": "fs",
    "gridfs": true
  },
  "index": {
    "name": "testmongo",
    "type": "files"
  }
}'
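
The same river registration can also be built programmatically, which avoids shell quoting problems and guarantees that the body is valid JSON. This is a minimal Python sketch using only the standard library. The helper name is hypothetical, and the request is only constructed, not sent:

```python
import json
import urllib.request


def gridfs_river_request(es_url, river_name, db, collection, index_name, index_type):
    """Build (but do not send) the PUT request that registers a GridFS river."""
    settings = {
        "type": "mongodb",
        "mongodb": {"db": db, "collection": collection, "gridfs": True},
        "index": {"name": index_name, "type": index_type},
    }
    return urllib.request.Request(
        url=f"{es_url}/_river/{river_name}/_meta",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = gridfs_river_request(
        "http://localhost:9200", "mongogridfs", "testmongo", "fs", "testmongo", "files"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it against a running node: urllib.request.urlopen(req)
```

The values passed in the example are the ones confirmed to work in this post: database "testmongo", collection "fs", index "testmongo", type "files".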

--

Jordon · July 8, 2013, 10:45am

How to create an index for a PDF attachment using
elasticsearch-river-mongodb 1.6.9 (no hits, or missing fields)

Dear All,
I am new to Elasticsearch. For days I have tried to follow the different
tutorials and posts on indexing and mapping PDF documents attached in a
MongoDB database, without success. After running the code below, I don't get
any hits for words that exist in the files attached in MongoDB.

software version:
MongoDB: mongodb-linux-x86_64-2.4.3
elasticsearch-river-mongodb: 1.6.9
elasticsearch: 0.90
elasticsearch-mapper-attachments: 1.7.0

Problem No. 1

BSON structure: the PDF attachment is in the "FileContent" field; the
attachment is not in GridFS.
byte[] fileser = iou.read(file);
Pagecount = getpagenum(file);
BasicDBObject articleobject = new BasicDBObject();
articleobject.put("Title", jsonArray.getJSONObject(i).get("Title"));
articleobject.put("Authors", jsonArray.getJSONObject(i).get("Authors"));
articleobject.put("Organization", jsonArray.getJSONObject(i).get("Organization"));
articleobject.put("Media", jsonArray.getJSONObject(i).get("Media"));
articleobject.put("ISSN", jsonArray.getJSONObject(i).get("ISSN"));
articleobject.put("Pages", jsonArray.getJSONObject(i).get("Pages"));
articleobject.put("Pagecount", Pagecount);
articleobject.put("Abstracts", jsonArray.getJSONObject(i).get("Abstracts"));
articleobject.put("Keywords", "");
articleobject.put("FileContent", fileser);
collection.insert(articleobject);

Create an index:
curl -XPUT "http://localhost:9200/articleindex"

Create a mapping:
curl -XPUT 'http://localhost:9200/articleindex/cardiopathy/_mapping' -d '
{
  "cardiopathy" : {
    "properties" : {
      "Authors" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "Media" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "Organization" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "Keywords" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "Title" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "ISSN" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "Pages" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "Abstracts" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
      "FileContent" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" }
    }
  }
}'

Create the river:
curl -XPUT "http://localhost:9200/_river/mongodb/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "host": "192.168.1.112",
    "port": "27107",
    "options": { "drop_collection": true },
    "db": "ftsearch1",
    "collection": "pdf"
  },
  "index": {
    "name": "articleindex",
    "type": "cardiopathy"
  }
}'

Retrieve the indexed document by keyword:
curl -XGET http://localhost:9200/articleindex/cardiopathy/_search -d '
{
  "fields" : ["Title"],
  "query" : { "text" : { "FileContent" : "高血压病辨证分型与靶器官相关性研究的新进展" } }
}'

{"took":179,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
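
One thing to check in Problem No. 1: the mapper-attachments plugin works on base64-encoded content in a field of type "attachment", whereas here raw bytes are stored and the field is mapped as a plain string. Below is a minimal Python sketch, outside the river, of preparing such a document. The helper name and the sample bytes are illustrative assumptions, and how the river itself serializes a BSON binary field is not established in this thread:

```python
import base64
import json


def attachment_document(title, pdf_bytes):
    """Return a JSON-serializable document with base64-encoded file content."""
    return {
        "Title": title,
        # An "attachment" field expects base64 text, not raw bytes.
        "FileContent": base64.b64encode(pdf_bytes).decode("ascii"),
    }


if __name__ == "__main__":
    sample = b"%PDF-1.4 sample bytes"  # stand-in for real file content
    doc = attachment_document("Example title", sample)
    print(json.dumps(doc))
```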

Problem No. 2

Alter the mapping:

curl -XPUT 'http://localhost:9200/articleindex/cardiopathy/_mapping' -d '
{
  "cardiopathy" : {
    "file" : {
      "properties" : {
        "Authors" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "Media" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "Organization" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "Keywords" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "Title" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "ISSN" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "Pages" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "Abstracts" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes" },
        "FileContent" : {
          "type" : "attachment",
          "fields" : {
            "file" : { "indexAnalyzer" : "ik", "searchAnalyzer" : "ik", "store" : "yes", "index" : "analyzed" },
            "date" : { "store" : "yes" },
            "author" : { "store" : "yes" },
            "keywords" : { "store" : "yes" },
            "content_type" : { "store" : "yes" },
            "title" : { "store" : "yes" }
          }
        }
      }
    }
  }
}'

Retrieve the indexed document by keyword:

{"took":63,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
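
In the altered mapping above, "properties" sits under an extra "file" level rather than directly under the type name "cardiopathy", which is where a put-mapping body normally carries it (as in Problem No. 1). Below is a minimal Python sketch of a structural check. The function name and the trimmed-down mappings are illustrative, not the full mapping from the post:

```python
def properties_at_type_root(mapping, type_name):
    """True if mapping looks like {type_name: {"properties": {...}}}."""
    type_body = mapping.get(type_name, {})
    return isinstance(type_body.get("properties"), dict)


# Shape used in Problem No. 2: "properties" nested under an extra "file" key.
as_posted = {
    "cardiopathy": {"file": {"properties": {"FileContent": {"type": "attachment"}}}}
}

# Shape used in Problem No. 1: "properties" directly under the type name.
expected_shape = {
    "cardiopathy": {"properties": {"FileContent": {"type": "attachment"}}}
}

if __name__ == "__main__":
    print(properties_at_type_root(as_posted, "cardiopathy"))
    print(properties_at_type_root(expected_shape, "cardiopathy"))
```

Fetching the mapping back from the index after the PUT is another way to confirm whether the "attachment" type was actually applied to "FileContent".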

Problem No. 3

The attachment is in GridFS; in addition, we define the other fields on the GridFS file.
GridFSInputFile gfsFile = gfsPhoto.createFile(file);
String filename = file.getName();
filename = filename.substring(0, filename.lastIndexOf("."));
gfsFile.setFilename(filename);
gfsFile.put("Title", jsonArray.getJSONObject(i).get("Title"));
gfsFile.put("Authors",jsonArray.getJSONObject(i).get("Authors"));
gfsFile.put("Organization", jsonArray.getJSONObject(i).get("Organization"));
gfsFile.put("Media", jsonArray.getJSONObject(i).get("Media"));
gfsFile.put("ISSN", jsonArray.getJSONObject(i).get("ISSN"));
gfsFile.put("Pages", jsonArray.getJSONObject(i).get("Pages"));
gfsFile.put("Pagecount", Pagecount);
gfsFile.put("Abstracts", jsonArray.getJSONObject(i).get("Abstracts"));
gfsFile.put("Keywords", "");
gfsFile.save();

Create an index:
curl -XPUT "http://localhost:9200/articleindex"

Create a mapping:
curl -XPUT 'http://localhost:9200/cardiopathyindex/cardiopathy/_mapping' -d '
{
  "cardiopathy" : {
    "properties" : {
      "content" : {
        "path" : "full",
        "type" : "attachment",
        "fields" : {
          "content" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "Authors" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "Media" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "Organization" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "Keywords" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "Title" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "ISSN" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "Pages" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "Abstracts" : { "type" : "string", "indexAnalyzer" : "ik", "searchAnalyzer" : "ik" },
          "date" : { "format" : "dateOptionalTime", "type" : "date" },
          "content_type" : { "type" : "string" }
        }
      },
      "chunkSize" : { "type" : "long" },
      "md5" : { "type" : "string" },
      "length" : { "type" : "long" },
      "filename" : { "type" : "string" },
      "contentType" : { "type" : "string" },
      "uploadDate" : { "format" : "dateOptionalTime", "type" : "date" },
      "metadata" : { "type" : "object" }
    }
  }
}'

Create the river:
curl -XPUT "http://localhost:9200/_river/mongodb/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "host": "192.168.1.112",
    "port": "27107",
    "options": { "drop_collection": true },
    "db": "ftsearch",
    "collection": "fs",
    "gridfs": true
  },
  "index": {
    "name": "cardiopathyindex",
    "type": "cardiopathy",
    "content_type": "application/pdf"
  }
}'

Retrieving the indexed document by keyword now returns hits, but the query
result is missing the "Title" and "Authors" fields:
curl -XGET http://localhost:9200/cardiopathyindex/cardiopathy/_search -d '
{
  "fields" : ["Title","Authors"],
  "query" : { "text" : { "content" : "高血压病辨证分型与靶器官相关性研究的新进展" } }
}'
{"took":1005,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":96,"max_score":0.68918943,"hits":[{"_index":"cardiopathyindex","_type":"cardiopathy","_id":"51d972d5948516489d1674d1","_score":0.68918943},{"_index":"cardiopathyindex","_type":"cardiopathy","_id":"51d972db948516489d167545","_score":0.22994329},{"_index":"cardiopathyindex","_type":"cardiopathy","_id":"51d972da948516489d16752c","_score":0.20929527},.......
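
The hits above carry no "fields" entry at all. Note also that the Java code sets Title, Authors, and the other values as top-level keys of the GridFS file, while the mapping declares them inside the "fields" of the "content" attachment. Whether the river copies those extra keys into the index is not established in this thread, so a first diagnostic step is to list which requested fields each hit actually returned. Below is a minimal Python sketch. The helper name is hypothetical, and the sample is the first hit from the response above:

```python
def missing_fields(hit, requested):
    """Return the requested field names that a search hit did not return."""
    returned = hit.get("fields", {})
    return [name for name in requested if name not in returned]


if __name__ == "__main__":
    first_hit = {
        "_index": "cardiopathyindex",
        "_type": "cardiopathy",
        "_id": "51d972d5948516489d1674d1",
        "_score": 0.68918943,
    }
    print(missing_fields(first_hit, ["Title", "Authors"]))
```

Fetching the same document by ID without a "fields" restriction, as in the first post of this thread, shows the full _source and therefore whether Title and Authors were indexed at all.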

