Is there any way to search the text contained in images or PDF files through Elasticsearch?
Suppose I have a PDF or image file indexed in ES (stored in base64 format). If that file contains the text "prashant", is there a way to search for "prashant" and get back the record for that file?
So with this plugin I can index attachments (say, a PDF file), and they will be stored base64-encoded. Does the plugin also make the text inside the PDF file searchable?
If yes, what will the result be when I search for a keyword in an attachment: will it return the extracted text or the base64-encoded data?
You'll need to send the file contents to Elasticsearch in base64 form,
and Elasticsearch will use Tika to extract data from the file.
However, in a typical case you would store not the whole contents
of the binary file (as it can be quite big), but rather a path to the
file, so that the application that queries Elasticsearch knows where
to look for the original file itself.
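As a rough illustration of the advice above, here is a minimal Python sketch that base64-encodes a file's bytes and builds a JSON document carrying both the encoded content and the path to the original file. The field names `file` and `path`, the sample bytes, and the example path are all hypothetical; the sketch only prepares the request body and does not talk to Elasticsearch.

```python
import base64
import json


def build_document(file_bytes, file_path):
    """Build a JSON-serializable document for indexing.

    'file' and 'path' are hypothetical field names chosen for this sketch.
    """
    # Elasticsearch expects binary content as a base64 string.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    # Keep the path so the querying application can find the original file.
    return {"file": encoded, "path": file_path}


# Stand-in bytes; in practice these would come from open(path, "rb").read().
sample_bytes = b"%PDF-1.4 example bytes"
doc = build_document(sample_bytes, "/data/docs/report.pdf")
body = json.dumps(doc)
print(body)
```

The resulting `body` is what would be sent as the document in an index request; the path travels with it so the application can open the original file after a search hit.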
So can I say that the mapper-attachment plugin works as follows?
Whether I send a text file, a PDF file, or an image file to ES, the plugin will extract the text content in all three cases and store it in ES, and it will then be available for search as well?
The attachment plugin will use Tika to extract the text from the binary
file content that you send in base64. Tika does a good job with
text extraction; however, you have to test for yourself whether your files
are parsed well enough for your use case.
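For extraction to happen at all, the field that receives the base64 content has to be mapped with the plugin's field type. The sketch below builds such a mapping and a simple match query as plain Python dictionaries. It assumes the `attachment` type registered by the mapper-attachments plugin; the type name `doc`, the field name `file`, and the `file.content` sub-field used in the query are assumptions that may differ between plugin versions, so check them against your installation.

```python
import json

# Assumption: the mapper-attachments plugin provides the "attachment" type.
# "doc" (document type) and "file" (field name) are hypothetical names.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "file": {"type": "attachment"}
            }
        }
    }
}


def build_search(keyword, field="file.content"):
    """Build a match query against the extracted text.

    The default field name is an assumption; adjust it to your mapping.
    """
    return {"query": {"match": {field: keyword}}}


print(json.dumps(mapping, indent=2))
print(json.dumps(build_search("prashant"), indent=2))
```

Without a mapping like this, the base64 string would most likely be indexed as an ordinary string, and a keyword search would not match the text inside the file.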
Hi Rafał Kuć,
I tried doing the same, but I didn't get the result I wanted.
To explain the problem in detail:
I have a PDF file which contains the text "There is already a big market for mid-range 4G LTE market, being pushed by telecom operators and device manufacturers."
I indexed this file in ES, and when I checked in ES the stored content was an encoded string (it looks like base64 rather than readable text), like "PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwiDQp4bWxuczpvPSJ1cm46c2NoZW1hHAtZXF1aXY9Q"
So if I search for "LTE" it won't return any result, because the content stored in ES is in that encoded format.
So my question is: is there any way, or any plugin, to store the PDF content as a normal string so that I can search on it?
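One quick check that may help here: the stored string quoted in this message can be base64-decoded to see what bytes were actually sent. The sketch below decodes only the first 20 characters of that string (the rest is left alone, since the quoted value appears truncated). The variable names are illustrative.

```python
import base64

# First 20 characters of the stored value quoted in the message.
stored_prefix = "PGh0bWwgeG1sbnM6dj0i"

decoded = base64.b64decode(stored_prefix)
print(decoded)  # b'<html xmlns:v="'
```

The decoded prefix is HTML markup, which suggests (as an assumption, based only on this prefix) that the indexed file may have been an HTML export rather than a true PDF. It also shows that the stored value is the base64 payload that was sent, not the extracted text, so searching it for "LTE" as a plain string cannot match.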
Hi,
I installed tesseract-alpha for Windows, following this instruction:
"To deal with images containing text, just install Tesseract. Tesseract will be auto-detected by Tika. Then add an image (png, jpg, ...) into your Fscrawler root directory. After the next index update, the text will be indexed and placed in "_source.content". "
I added images to the fscrawler root directory and ran fscrawler in cmd.
It gives the following output....