Extracted text visibility from a Tika-processed attachment


(Erik Paulson) #1

Hello -

I'm using Elasticsearch 1.2.1, with mapper-attachments-2.0.0. I'm a little
baffled by how to surface the text that Tika extracts from a PDF into the
structured document that ES is storing.

Long story short, with a trivial PDF file with one line of text, I'm
getting something like this:

{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_score" : 0.067124054,
  "fields" : {
    "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
    "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
    "my_attachment.title" : [ "Untitled" ]
  }
}

What I want is this (with the content of the file included):

{
  "_index" : "test",
  "_type" : "doc",
  "_id" : "1",
  "_score" : 0.067124054,
  "fields" : {
    "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
    "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
    "my_attachment.title" : [ "Untitled" ],
    "my_attachment.file" : "This is the easiest PDF ever."
  }
}

A somewhat related question: I'm also a bit confused as to the difference
between the "fields" from the attachment, and other fields in my document
that I'm storing in my _source. If I ask for the attachment fields, I don't
get anything else I stored in the document; if I don't ask for any fields,
I get everything from _source. Is there a way I can make the
my_attachment.* fields and the "Thing" field I store in my document
co-equals? I think what I want is for the my_attachment fields to show up
without having to explicitly ask for them.
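
For illustration, one request shape that might get closer to this is to ask for the stored attachment fields and a filtered `_source` in the same search body. This is a minimal sketch under assumptions: that the 1.x search API accepts `_source` and `fields` together (worth checking against the search documentation for this version), and that restricting `_source` to `Thing` is wanted so the base64 payload is not returned. The `search_body` helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints a search body that requests the stored
# attachment fields and, in the same request, a _source limited to "Thing".
# Assumption: the search API accepts "_source" and "fields" together.
# The term is inserted verbatim, so it must not contain double quotes
# or backslashes.
search_body() {
  term=$1
  printf '{"_source":["Thing"],"fields":["my_attachment.title","my_attachment.date","my_attachment.keywords"],"query":{"match":{"_all":"%s"}}}\n' "$term"
}

# Usage against a running node (not executed here):
#   curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d "$(search_body easiest)"
```

If the assumption holds, each hit would carry a `_source` object and a `fields` object side by side; they would still be two separate objects rather than one merged document.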

My sample PDF documents are here:

And my curl/shell is below, followed by the sample output of a run.

curl -X DELETE localhost:9200/test
curl -X PUT localhost:9200/test

# Map my_attachment as an attachment and mark each sub-field as stored
curl -X PUT localhost:9200/test/doc/_mapping -d '
{
  "doc" : {
    "properties" : {
      "my_attachment" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "date" : { "store" : "yes" },
          "author" : { "store" : "yes" },
          "keywords" : { "store" : "yes" },
          "content_type" : { "store" : "yes" },
          "content_length" : { "store" : "yes" },
          "language" : { "store" : "yes" },
          "file" : { "store" : "yes", "term_vector" : "with_positions_offsets" }
        }
      }
    }
  }
}'

echo
echo "Uploading a PDF with 'This is the easiest PDF ever'"
# Base64-encode the PDF and wrap it in a JSON document
coded=$(cat simple/Untitled1.pdf | base64)
json="{\"Thing\":\"first\",\"my_attachment\":\"${coded}\"}"
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @json.file
rm json.file

echo
echo "Uploading a PDF with 'This is the second easiest PDF ever'"
coded=$(cat simple/Untitled2.pdf | base64)
json="{\"Thing\":\"followup\",\"my_attachment\":\"${coded}\"}"
echo "$json" > json.file
curl -X PUT 'localhost:9200/test/doc/2?refresh=true' -d @json.file
rm json.file

echo
echo "Querying: Should get two hits"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
"fields": ["title", "author", "date", "file", "keywords"],
"query" : { "match" : { "_all" : "easiest" } }
}'
echo
echo
echo "Querying: Should get one hit"
curl -X POST 'localhost:9200/test/doc/_search?pretty=true' -d '{
  "fields" : "*",
  "query" : { "match" : { "_all" : "second" } }
}'
echo
echo
echo "Directly loading object 1"
echo
curl 'localhost:9200/test/doc/1'
echo
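
The two upload steps in the script above repeat the same encode-and-wrap logic, which could be sketched as a single helper. This is illustrative only: `build_payload` is a hypothetical name, and stripping newlines from the `base64` output is an assumption that matters on systems where `base64` wraps its output, since a raw newline inside a JSON string is not valid JSON.

```shell
#!/bin/sh
# Hypothetical helper: prints {"Thing":"<label>","my_attachment":"<base64>"}
# for the given label and file. The label is inserted verbatim, so it must
# not contain double quotes or backslashes.
build_payload() {
  label=$1
  file=$2
  # Strip newlines: some base64 implementations wrap their output.
  coded=$(base64 < "$file" | tr -d '\n')
  printf '{"Thing":"%s","my_attachment":"%s"}\n' "$label" "$coded"
}

# Usage against a running node (not executed here):
#   build_payload first simple/Untitled1.pdf > json.file
#   curl -X PUT 'localhost:9200/test/doc/1?refresh=true' -d @json.file
```

Building the body with `printf` avoids the nested-quote escaping that an inline `json="..."` assignment needs.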

And the output:

{"acknowledged":true}{"acknowledged":true}{"acknowledged":true}

Uploading a PDF with 'This is the easiest PDF ever'

{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}

Uploading a PDF with 'This is the second easiest PDF ever'

{"_index":"test","_type":"doc","_id":"2","_version":1,"created":true}

Querying: Should get two hits

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.067124054,
    "hits" : [ {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
        "my_attachment.keywords" : [ "" ],
        "my_attachment.title" : [ "Untitled" ]
      }
    }, {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.date" : [ "2014-07-31T23:29:45.000Z" ],
        "my_attachment.keywords" : [ "TestKeyword1, TestKeyword2" ],
        "my_attachment.title" : [ "Untitled" ]
      }
    } ]
  }
}

Querying: Should get one hit

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.067124054,
    "hits" : [ {
      "_index" : "test",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 0.067124054,
      "fields" : {
        "my_attachment.content_type" : [ "application/pdf" ],
        "my_attachment.keywords" : [ "" ],
        "my_attachment.title" : [ "Untitled" ],
        "my_attachment.date" : [ "2014-07-31T21:48:21.000Z" ],
        "my_attachment.content_length" : [ 9458 ]
      }
    } ]
  }
}

Directly loading object 1
{"_index":"test","_type":"doc","_id":"1","_version":1,"found":true,"_source":{"Thing":"first","my_attachment":"JVBERi0xLjMKJcTl....lots of base64 data removed....VPRgo="}}

Thanks for any help you can point me at!

-Erik


