Hello -
I'm using Elasticsearch 1.2.1, with mapper-attachments-2.0.0. I'm a little
baffled by how to surface the text that Tika extracts from a PDF into the
structured document that ES is storing.
Long story short, with a trivial PDF file with one line of text, I'm
getting something like this:
{
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"_score" : 0.067124054,
"fields" : {
"my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
"my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
"my_attachment.title" : [ "Untitled" ]
}
}
When what I want is this (with the content of the file included):
{
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"_score" : 0.067124054,
"fields" : {
"my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
"my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
"my_attachment.title" : [ "Untitled" ],
"my_attachment.file" : "This is the easiest PDF ever."
}
}
A somewhat related question: I'm also a bit confused as to the difference
between the "fields" from the attachment, and other fields in my document
that I'm storing in my _source. If I ask for the attachment fields, I don't
get anything else I stored in the document; if I don't ask for any fields,
I get everything from _source. Is there a way I can make the
my_attachment.* fields and the "Thing" field I store in my document
co-equals? I think what I want is for the my_attachment fields to show up
without having to explicitly ask for them.
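To make that second question concrete, here is a minimal sketch of the kind of request I have in mind. It assumes that ES 1.x accepts a top-level "_source" option alongside "fields" in the same search body, which I have not confirmed on 1.2.1. The variable name body and the full my_attachment.* field paths are my own choices for illustration.

```shell
# Hypothetical request body: ask for the whole _source and for the stored
# attachment metadata fields in the same search.
# Assumption: ES 1.x honors "_source": true when "fields" is also present.
body='{
  "_source": true,
  "fields": ["my_attachment.title", "my_attachment.date", "my_attachment.keywords"],
  "query": { "match": { "_all": "easiest" } }
}'

# Print the body so it can be inspected before sending.
echo "$body"

# To send it against the test index used in this post:
# curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d "$body"
```

If this behaves the way I hope, each hit would carry both a "_source" object (with "Thing") and a "fields" object (with the attachment metadata), but they would still be two separate sections of the hit rather than one merged document.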
My sample PDF documents are here:
And my curl/shell is below, followed by the sample output of a run.
curl -X DELETE localhost:9200/test
curl -X PUT localhost:9200/test
curl -X PUT localhost:9200/test/doc/_mapping -d '
{
"doc" : {
"properties" : {
"my_attachment" : {
"type" : "attachment",
"fields": {
"title" : { "store" : "yes" },
"date" : {"store" : "yes"},
"author" : {"store" : "yes"},
"keywords" : {"store" : "yes"},
"content_type" : {"store" : "yes"},
"content_length" : {"store" : "yes"},
"language" : {"store" : "yes"},
"file": { "store" : "yes", "term_vector": "with_positions_offsets" }
}
}
}
}
}'
echo
echo "Uploading a PDF with 'This is the easiest PDF ever'"
coded=$(cat simple/Untitled1.pdf | base64)
json="{\"Thing\":\"first\",\"my_attachment\":\"${coded}\"}"
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @json.file
rm json.file
echo
echo "Uploading a PDF with 'This is the second easiest PDF ever'"
coded=$(cat simple/Untitled2.pdf | base64)
json="{\"Thing\": \"followup\", \"my_attachment\":\"${coded}\"}"
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/2?refresh=true' -d @json.file
rm json.file
echo
echo "Querying: Should get two hits"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
"fields": ["title", "author", "date", "file", "keywords"],
"query" : { "match" : { "_all" : "easiest" } }
}'
echo
echo
echo "Querying: Should get one hit"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
"fields": "*",
"query" : { "match" : { "_all" : "second" } }
}'
echo
echo
echo "Directly loading object 1"
echo
curl 'localhost:9200/test/doc/1'
echo
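The two upload steps in the script repeat the same encode-and-wrap logic, so they could be folded into one helper. This is only a sketch: make_doc_json is a hypothetical name, and it assumes the "Thing" value contains no characters that need JSON escaping. It also strips newlines from the base64 output, because some base64 implementations (GNU coreutils, for one) wrap their output by default, and a raw newline inside a JSON string is not valid.

```shell
# Hypothetical helper: print the JSON body for one document.
#   $1 = value for the "Thing" field (assumed to need no JSON escaping)
#   $2 = path to the PDF to attach
make_doc_json() {
  thing=$1
  file=$2
  # Encode the file and remove any line wrapping added by base64.
  coded=$(base64 < "$file" | tr -d '\n')
  printf '{"Thing":"%s","my_attachment":"%s"}' "$thing" "$coded"
}

# Example usage with the paths from the script:
# make_doc_json first simple/Untitled1.pdf > json.file
# curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @json.file
# rm json.file
```

Writing the body to a file and passing it with -d @json.file, as the script already does, keeps a large base64 payload off the curl command line.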
And the output
{"acknowledged":true}{"acknowledged":true}{"acknowledged":true}
Uploading a PDF with 'This is the easiest PDF ever'
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}
Uploading a PDF with 'This is the second easiest PDF ever'
{"_index":"test","_type":"doc","_id":"2","_version":1,"created":true}
Querying: Should get two hits
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.067124054,
"hits" : [ {
"_index" : "test",
"_type" : "doc",
"_id" : "2",
"_score" : 0.067124054,
"fields" : {
"my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
"my_attachment.keywords" : [ "" ],
"my_attachment.title" : [ "Untitled" ]
}
}, {
"_index" : "test",
"_type" : "doc",
"_id" : "1",
"_score" : 0.067124054,
"fields" : {
"my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
"my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
"my_attachment.title" : [ "Untitled" ]
}
} ]
}
}
Querying: Should get one hit
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.067124054,
"hits" : [ {
"_index" : "test",
"_type" : "doc",
"_id" : "2",
"_score" : 0.067124054,
"fields" : {
"my_attachment.content_type" : [ "application/pdf" ],
"my_attachment.keywords" : [ "" ],
"my_attachment.title" : [ "Untitled" ],
"my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
"my_attachment.content_length" : [ 9458 ]
}
} ]
}
}
Directly loading object 1
{"_index":"test","_type":"doc","_id":"1","_version":1,"found":true,"_source":{"Thing":"first","my_attachment":"JVBERi0xLjMKJcTl....lots of base64 data removed....VPRgo="}}
Thanks for any help you can point me at!
-Erik