Can we search text contained in images or PDF files through Elasticsearch?


(Prashant Agrawal) #1

Hi ES users,

Is there any way to search the text contained in images or PDF files through Elasticsearch?

For example, suppose I have a PDF or image file indexed in ES (stored in base64 format). If that file contains the text "prashant", is there a way to search for "prashant" and get the record for that file back?


(Rafał Kuć) #2

Hello!

Please look at the attachment plugin for Elasticsearch: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html

It uses Apache Tika under the hood. The list of supported formats is
available here: http://tika.apache.org/0.10/formats.html
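
As a rough illustration of what using the plugin involves, the Python sketch below only builds the JSON bodies for a mapping with an `attachment` field, a document carrying base64 content, and a match query. The type and field names (`doc`, `file`) are made up for this example, and the exact mapping and query syntax depends on the plugin version, so treat it as an assumption to check against the plugin README rather than as a reference.

```python
import base64
import json

# Hypothetical field name chosen for this sketch.
FIELD = "file"


def attachment_mapping(doc_type: str) -> dict:
    """Mapping body declaring one field of type 'attachment'."""
    return {doc_type: {"properties": {FIELD: {"type": "attachment"}}}}


def attachment_document(raw: bytes) -> dict:
    """Document body: the binary file content is sent as a base64 string."""
    return {FIELD: base64.b64encode(raw).decode("ascii")}


def text_query(term: str) -> dict:
    """Match query against the attachment field."""
    return {"query": {"match": {FIELD: term}}}


if __name__ == "__main__":
    # Print the three request bodies; nothing is sent to a server here.
    print(json.dumps(attachment_mapping("doc"), indent=2))
    print(json.dumps(attachment_document(b"placeholder file bytes"), indent=2))
    print(json.dumps(text_query("prashant"), indent=2))
```

The bodies would then be sent with whatever HTTP client or Elasticsearch client you already use.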

--
Regards,
Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Can-we-perform-the-text-search-presnet-in-the-images-or-pdf-files-through-elasticsearch-tp4054367.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/588849345.20140418080555%40alud.com.pl.
For more options, visit https://groups.google.com/d/optout.


(Prashant Agrawal) #3

Hi,

If I am not mistaken, you are talking about https://github.com/elasticsearch/elasticsearch-mapper-attachments

So with this plugin I can index attachments (say, a PDF file), and they will be stored base64-encoded. Does the plugin also make the text inside the PDF file searchable?

If so, what is returned when I search for a keyword in an attachment: the readable text or the base64-encoded data?

~Prashant


(Rafał Kuć) #4

Hello!

You'll need to send the file contents to Elasticsearch in base64 form
and Elasticsearch will use Tika to extract data from the file.

However, in a typical case you would not store the whole content of
the binary file (as it can be quite big), but rather a path to the
file, so that the application that queries Elasticsearch knows where
to look for the original file.
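
A minimal Python sketch of that idea: the document carries the base64 content (needed for extraction) together with the file's path, and the application reads the path back from a search hit. The field names `file` and `path` and both helper functions are hypothetical; whether the base64 content is also kept in the stored source is a separate mapping decision that is not shown here.

```python
import base64
import tempfile
from pathlib import Path


def build_document(path: Path) -> dict:
    """Build a document: base64 content for extraction plus the file's location."""
    return {
        "file": base64.b64encode(path.read_bytes()).decode("ascii"),  # hypothetical field
        "path": str(path),  # hypothetical field
    }


def original_location(hit: dict) -> Path:
    """Given one search hit, return where the application should open the original."""
    return Path(hit["_source"]["path"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "report.pdf"
        sample.write_bytes(b"placeholder bytes, not a real PDF")
        doc = build_document(sample)
        # Trimmed-down shape of a search hit, built by hand for the demo.
        hit = {"_source": {"path": doc["path"]}}
        print(original_location(hit))
```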


(Prashant Agrawal) #5

So can I say that the mapper-attachments plugin works as follows?
Whether I send a text file, a PDF file, or an image file to ES, the plugin extracts the text content in all three cases and stores it in ES, where it is then available for search?


(Rafał Kuć) #6

Hello!

The attachment plugin will use Tika to extract the text from the
binary file content that you send in base64. Tika does a good job
with text extraction; however, you have to test for yourself whether
your files are parsed well enough for your use case.
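
One simple way to run such a test is to take a few files whose text you already know and check that the extracted text contains the terms you expect to search for. The Python helper below is a hypothetical sketch of that check; it assumes you have obtained the extracted text by some other means and does not call Tika itself.

```python
from typing import List, Sequence


def missing_terms(extracted_text: str, expected_terms: Sequence[str]) -> List[str]:
    """Return the expected terms that do not occur in the extracted text.

    The comparison is a case-insensitive substring check, which is cruder
    than real analysis but enough to spot files that were not parsed at all.
    """
    haystack = extracted_text.lower()
    return [term for term in expected_terms if term.lower() not in haystack]


if __name__ == "__main__":
    # Made-up sample text standing in for the output of an extraction run.
    sample = "Quarterly Report 2014"
    print(missing_terms(sample, ["report", "invoice"]))
```

An empty result means every expected term was found; anything else lists the terms that extraction lost.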


(Prashant Agrawal) #7

Hi Rafał Kuć,
I tried this but did not get the result I wanted.
Here is the problem in detail:

  1. I have a PDF file that contains the text "There is already a big market for mid-range 4G LTE market, being pushed by telecom operators and device manufacturers."

  2. I indexed this file in ES, and when I checked the document in ES the content was not readable text but an encoded string (it looks like base64), for example "PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwiDQp4bWxuczpvPSJ1cm46c2NoZW1hHAtZXF1aXY9Q"

  3. So if I search for "LTE" it won't return any result, because the content stored in ES is in that encoded format.

So my question is: is there any way, or any plugin, to store the PDF content as a normal string so that I can search on it?
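
The quoted string looks like base64. As a quick check, the Python snippet below decodes only its first 60 characters (a multiple of 4, so that part decodes on its own; the rest of the quoted string appears truncated and is left alone). The decoded prefix reads like the start of an HTML document.

```python
import base64

# First 60 characters of the string quoted above.
stored_prefix = "PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwi"

decoded = base64.b64decode(stored_prefix).decode("ascii")
print(decoded)  # <html xmlns:v="urn:schemas-microsoft-com:vml"

# A literal substring test on the base64 string misses text that is
# present once the string has been decoded.
print("html" in stored_prefix)  # False
print("html" in decoded)        # True
```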


(David Pilato) #8

It should work with mapper attachment. Remember that what you see in _source is not what gets indexed.
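
To make that distinction concrete, here is a toy Python model, not Elasticsearch's actual implementation: the stored source keeps the base64 string that was sent, while search runs against terms taken from the extracted text. The class `ToyIndex`, the field name `file`, and the tokenizer are all invented for this illustration.

```python
import base64
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Toy model: keeps the original source and a separate inverted index."""

    def __init__(self):
        self.sources = {}                 # what fetching the document shows
        self.postings = defaultdict(set)  # what search actually consults

    def index(self, doc_id, raw_bytes, extracted_text):
        # The source stays exactly as sent: a base64 string.
        self.sources[doc_id] = {"file": base64.b64encode(raw_bytes).decode("ascii")}
        # The inverted index is built from the extracted text instead.
        for term in tokenize(extracted_text):
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


if __name__ == "__main__":
    idx = ToyIndex()
    idx.index("1", b"placeholder binary content", "big market for mid-range 4G LTE")
    print(idx.sources["1"])   # base64 only
    print(idx.search("LTE"))  # the document is still found
```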

About extracting and storing text content, fsriver does it. See https://github.com/dadoonet/fsriver#generated-fields

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


(Nikhil Chandrakant Parab) #9

Hi,
I installed tesseract-alpha for Windows, following this instruction:
"To deal with images containing text, just install Tesseract. Tesseract will be auto-detected by Tika. Then add an image (png, jpg, ...) into your Fscrawler root directory. After the next index update, the text will be indexed and placed in "_source.content"."

I added images to the FSCrawler root directory and ran FSCrawler from cmd.
It gives the following output:

      {
        "_index": "photo",
        "_type": "image",
        "_id": "54d256ed121e93f6946f8e177634ff0",
        "_score": 1,
        "_source": {
          "meta": {},
          "file": {
            "extension": "jpg",
            "content_type": "image/jpeg",
            "last_modified": "2016-12-20T10:37:22.125",
            "indexing_date": "2017-04-15T15:34:16.65",
            "filesize": 12952,
            "filename": "bigdata.jpg",
            "url": """file://C:\tmp\image\bigdata.jpg"""
          },
          "path": {
            "encoded": "45f07b74406231761c074a1189bc9aa",
            "root": "45f07b74406231761c074a1189bc9aa",
            "virtual": "/",
            "real": """C:\tmp\image\bigdata.jpg"""
          }
        }
      }
    ]
  }
}

The output does not include the text from the image (there is no "content" field in "_source").
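
Two things that could be checked first, sketched in Python: whether a hit really lacks a `content` field, and whether a `tesseract` executable is visible on PATH for the process that runs the crawler. That auto-detection relies on PATH is an assumption here, not something confirmed in this thread, and both helper names are hypothetical.

```python
import shutil


def has_extracted_text(hit: dict) -> bool:
    """True if the hit's _source carries a non-empty 'content' string."""
    content = hit.get("_source", {}).get("content")
    return isinstance(content, str) and content.strip() != ""


def tesseract_on_path() -> bool:
    """True if an executable named 'tesseract' is found on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    # Trimmed-down copy of the hit shown above (no 'content' field).
    hit = {"_source": {"meta": {}, "file": {"filename": "bigdata.jpg"}}}
    print("content extracted:", has_extracted_text(hit))
    print("tesseract on PATH:", tesseract_on_path())
```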

