Indexing PDF documents


(ElasticSearchUser07) #1

Hi All,

ES version - 2.1
Mapper attachments plugin - 3.1

I am trying to index a PDF document in Elasticsearch. I used the NEST and Elasticsearch.Net APIs to index the document, and I have installed the mapper-attachments plugin. Indexing works, and when I query the "file" index I get the mapping and settings below:
{
  "file" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string"
              },
              "author" : {
                "type" : "string"
              },
              "title" : {
                "type" : "string"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "id" : {
            "type" : "integer"
          },
          "name" : {
            "type" : "string"
          },
          "title" : {
            "type" : "string",
            "store" : true
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1480322538205",
        "number_of_shards" : "5",
        "number_of_replicas" : "0",
        "uuid" : "XnDEug4OQ6205Ja_1_RSMA",
        "version" : {
          "created" : "2010199"
        }
      }
    },
    "warmers" : { }
  }
}
The PDF file is stored as a base64-encoded string, and I can see it when I query
localhost:9200/file/_search?q=sometextinsidepdf
which returns this JSON:
{
  "took": 25,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 0.02117227,
    "hits": [
      {
        "_index": "file",
        "_type": "document",
        "_id": "1",
        "_score": 0.02117227,
        "_source": {
          "id": 1,
          "title": "test",
          "file": {
            "_content": "....base 64 encoded string ....."
          }
        }
      }
    ]
  }
}

The search does match when the search text is present in the PDF, but the response contains only the encoded string. Ideally I would like to get back the actual text of the PDF, with the search text highlighted, instead of the encoded string. How can I achieve this? Any help is really appreciated.
Thanks,
Sooraj.S


(David Pilato) #2

Please format your code using the </> icon, as explained in this guide. It will make your post more readable.

Have a look at: https://www.elastic.co/blog/the-future-of-attachments-for-elasticsearch-and-dotnet

Note that the original text won't be part of the _source document when using the mapper-attachments plugin. You should use ingest-attachment instead, as explained in "The future" part of the page I linked to.
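
For anyone following up on the ingest-attachment suggestion, here is a minimal sketch in Python (standard library only) of what that flow might look like. It is not taken from the linked post. Assumptions: Elasticsearch 5.0 or later (ingest pipelines do not exist in 2.1) with the plugin installed via bin/elasticsearch-plugin install ingest-attachment; a node on localhost:9200; the pipeline name "attachment", the field name "data", and the run_example helper are made up for illustration. The index and type names are reused from the question; typed URLs like these fit 5.x and 6.x, while later versions move away from custom types.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, as in the question
PIPELINE_ID = "attachment"        # hypothetical pipeline name
INDEX = "file"                    # index name from the question
DOC_TYPE = "document"             # type name from the question


def pipeline_body(field="data"):
    """Ingest pipeline that runs the attachment processor on `field`."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def document_body(doc_id, title, pdf_bytes, field="data"):
    """Document whose `field` holds the base64-encoded file."""
    return {
        "id": doc_id,
        "title": title,
        field: base64.b64encode(pdf_bytes).decode("ascii"),
    }


def search_body(text, field="data"):
    """Match on the extracted text, highlight it, and leave out the base64 blob."""
    return {
        "_source": {"excludes": [field]},
        "query": {"match": {"attachment.content": text}},
        "highlight": {"fields": {"attachment.content": {}}},
    }


def send(method, path, body):
    """Send a JSON request to Elasticsearch and return the decoded response."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_example(pdf_path):
    """Create the pipeline, index one PDF through it, then search with highlighting."""
    with open(pdf_path, "rb") as handle:
        pdf_bytes = handle.read()
    send("PUT", f"/_ingest/pipeline/{PIPELINE_ID}", pipeline_body())
    # refresh=true makes the document searchable right away (fine for a demo).
    send(
        "PUT",
        f"/{INDEX}/{DOC_TYPE}/1?pipeline={PIPELINE_ID}&refresh=true",
        document_body(1, "test", pdf_bytes),
    )
    return send("POST", f"/{INDEX}/_search", search_body("sometextinsidepdf"))
```

With this setup the attachment processor writes the extracted text to attachment.content, which is a regular part of _source. Calling run_example with the path to a PDF should therefore return hits with a highlight section containing fragments of that field, with the matched terms wrapped in the highlighter's default <em> tags, and without the base64 data field in _source.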

