Can we index images with extension types like .jpeg, .img, .jpg in Elasticsearch?

pyerunka · August 19, 2019, 6:17am

Hello,

Can we index images with extension types like .jpeg, .img, and .jpg in Elasticsearch?
If yes, can anyone explain how I can achieve it?

Thanks & Regards,
Priyanka Yerunkar.

spinscale · August 19, 2019, 7:05am

Hey,

Can you explain your use case? You could either convert the images to Base64 or use SMILE as a binary format to index them. However, I would try to avoid this: it increases the size of your index, and a smaller index generally means faster searches, lower memory requirements, etc.

--Alex
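
For illustration, the Base64 route mentioned above could look roughly like the sketch below. The field names `filename` and `data` are assumptions made for this example, and the bytes are a stand-in for a real image file. As far as I know, a field mapped as `binary` accepts such a Base64 string but is not searchable, which matches the caveat about index size without search benefit.

```python
import base64
import json


def build_image_document(filename: str, image_bytes: bytes) -> str:
    """Return a JSON document carrying the image as a Base64 string.

    The field names ("filename", "data") are illustrative assumptions,
    not names that Elasticsearch requires.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"filename": filename, "data": encoded})


if __name__ == "__main__":
    # Stand-in bytes; in practice you would read the image in binary mode.
    fake_image = bytes(range(256)) * 4
    doc = build_image_document("photo.jpg", fake_image)
    encoded_length = len(json.loads(doc)["data"])
    print(len(fake_image), "raw bytes ->", encoded_length, "Base64 characters")
```

Base64 output is roughly a third larger than the raw bytes, so the size concern applies before Elasticsearch stores anything.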

pyerunka · August 19, 2019, 7:23am

Hello @spinscale,

Thanks for the reply!
We have a file system where users upload various types of files from a frontend application, and we index those files.
Likewise, can we index images in ES so that a user who searches for an image gets that image back?

Regards,
Priyanka

spinscale · August 19, 2019, 7:37am

Unless you write some logic to make an image searchable, there is currently no out-of-the-box solution. You could try the ingest attachment processor, but it will merely try to extract metadata. If you need to search the contents of an image, you need to do the preprocessing yourself.
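
One way to picture such preprocessing is the rough sketch below: derive text and metadata from the image outside Elasticsearch, then index only that. The `extract_text` hook is hypothetical (it could be OCR, captioning, or manual tagging), and the field names are assumptions chosen for this example.

```python
import json
from pathlib import PurePosixPath
from typing import Callable


def describe_image(path: str, extract_text: Callable[[str], str]) -> dict:
    """Build a searchable document from an image path.

    `extract_text` is a hypothetical hook: plug in OCR, captioning, or
    manual tagging. Only its text output is indexed, not the pixels.
    """
    p = PurePosixPath(path)
    return {
        "filename": p.name,
        "extension": p.suffix.lstrip(".").lower(),
        "content": extract_text(path),
    }


if __name__ == "__main__":
    # Dummy extractor standing in for a real preprocessing step.
    doc = describe_image("/uploads/Invoice_01.JPG", lambda _path: "invoice total 42 EUR")
    print(json.dumps(doc, indent=2))
```

The resulting document is plain text and keywords, so ordinary full-text queries can find it, while the image itself stays on the file system.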

pyerunka · August 19, 2019, 10:50am

Hello @spinscale,

Thanks for the reply!
Yes, we will have to write some logic to make images searchable.
I have one more question:
Currently I use FSCrawler to index attachments, so that files with various extensions get indexed. I have tried .pdf, .pptx, .txt, .doc, and .docx; when I index a file with one of these extensions, its content is indexed and we can read it.
Is there any way to do the same for images? You suggested preprocessing for this. Could you please elaborate on that?

Regards,
Priyanka

dadoonet · August 19, 2019, 11:03am

If you activate the OCR option of FSCrawler, text should be extracted from images.

pyerunka · August 19, 2019, 11:10am

Hello @dadoonet,

Thanks for the reply!
How do I activate OCR?

Thanks & Regards,
Priyanka Yerunkar.

dadoonet · August 19, 2019, 11:41am

Read https://fscrawler.readthedocs.io/en/latest/user/ocr.html

pyerunka · August 20, 2019, 6:14am

Hello @dadoonet,

I tried the following configuration:

ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
  output_type: "hocr"
  path: "C:\Users\ITS-BETA\AppData\Local\Tesseract-OCR\tesseract.exe"

but I am still not able to index the content of images. Could you please suggest what is wrong in the lines above? Which other parameters do I need to add?

Regards,
Priyanka

dadoonet · August 20, 2019, 6:32am

Could you run with the debug option and share the output?
I think it's not finding Tesseract.

Also make sure the indentation is correct. The sample you pasted is not.

Could you share also a file you'd like to index?

pyerunka · August 20, 2019, 6:44am

Hello @dadoonet,

How do I set the debug option?
I used fscrawler att --debug when running the job from CMD, but it asks me to create a job again, even though I have already created it.

Regards,
Priyanka

dadoonet · August 20, 2019, 6:56am

What are you running exactly? Did you look at the documentation?

pyerunka · August 20, 2019, 6:57am

Hello @dadoonet,

I am running an fscrawler job to index images. Yes, I have gone through the documentation.
As suggested, I have also downloaded Tesseract.

Regards,
Priyanka

dadoonet · August 20, 2019, 8:30am

I meant: "what exact command are you running to launch fscrawler with the debug option?"

pyerunka · August 20, 2019, 8:42am

Hello @dadoonet,

I am running "fscrawler att --debug".

Regards,
Priyanka

dadoonet · August 20, 2019, 8:54am

And can you tell me what the output of the command is?

pyerunka · August 20, 2019, 9:03am

Hello @dadoonet,

Please find below:

E:\ES\fscrawler-es7-2.7-20190625.065648-37\fscrawler-es7-2.7-SNAPSHOT\bin>fscrawler att --debug
08:58:06,261 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [att]...
08:58:06,683 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
08:58:07,277 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.2.0
08:58:07,324 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
08:58:07,324 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
08:58:07,324 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.2.0] node.
08:58:07,324 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att]
08:58:07,480 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att]
08:58:07,496 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att_folder]
08:58:07,496 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att_folder]
08:58:07,512 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [att] for [C:\tmp\Test] every [1m]
08:58:07,512 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [att] for [C:\tmp\Test] every [1m]
08:58:07,512 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [att] is now running. Run #1...
08:58:07,512 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\tmp\Test] content
08:58:07,512 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C:\tmp\Test
08:58:07,527 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], excludes = [[*/~*]]
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], includes = [null]
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract] [/0000_6954754_01.jpg] can be indexed: [true]
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /0000_6954754_01.jpg
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract]     - not modified: creation date 2019-08-20T08:56:37.015500 , file date 2017-05-25T13:55:06.660, last scan date 2019-08-20T08:56:40.569
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:\tmp\Test]...

Regards,
Priyanka

dadoonet · August 20, 2019, 9:17am

It's not, right?

Could you use the --restart option as well?
So launch:

fscrawler att --debug --restart

And share the logs.

Please format your code, logs or configuration files using </> icon as explained in this guide and not the citation button. It will make your post more readable.
I updated your post.

pyerunka · August 20, 2019, 9:27am

Hello @dadoonet,

Thanks for the reply!
I changed some parameters in the .settings.yaml file, and after that it started running without asking me to create a new job again.

Please find below the logs from launching "fscrawler att --debug --restart":

E:\ES\fscrawler-es7-2.7-20190625.065648-37\fscrawler-es7-2.7-SNAPSHOT\bin>fscrawler att --debug --restart
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [att]...
09:20:46,995 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [att]...
09:20:47,323 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
09:20:47,918 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.2.0
09:20:47,950 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
09:20:47,950 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
09:20:47,950 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.2.0] node.
09:20:47,950 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att]
09:20:48,106 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att]
09:20:48,137 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att_folder]
09:20:48,153 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att_folder]
09:20:48,153 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [att] for [C:\tmp\Test] every [1m]
09:20:48,153 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [att] for [C:\tmp\Test] every [1m]
09:20:48,153 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [att] is now running. Run #1...
09:20:48,168 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test) = /
09:20:48,168 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing att_folder/3db93f18113340589d6d165775ec8a24?pipeline=null
09:20:48,168 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\tmp\Test] content
09:20:48,168 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C:\tmp\Test
09:20:48,168 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], excludes = [[*/~*]]
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], includes = [null]
09:20:48,184 DEBUG [f.p.e.c.f.FsParserAbstract] [/0000_6954754_01.jpg] can be indexed: [true]
09:20:48,184 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /0000_6954754_01.jpg
09:20:48,184 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [C:\tmp\Test],[0000_6954754_01.jpg]
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
09:20:48,215 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
09:20:48,231 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
09:20:48,512 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
 for optional dependencies.

09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [C:\Users\ITS-BETA-ENDECA\AppData\Local\Tesseract-OCR\tesseract.exe].
09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Output Type set to [txt].
09:20:48,950 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing att/d659b7f72abeb26397c659773471bdf8?pipeline=null
09:20:48,965 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:\tmp\Test]...
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], excludes = [[*/~*]]
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], includes = [null]
09:20:49,028 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [C:\tmp\Test]...
09:20:49,028 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 1m

I have used </> to format the logs.

Regards,
Priyanka

dadoonet · August 20, 2019, 9:46am

Please don't use the citation icon for code/logs.

In the logs you have:

But Tesseract is not installed so we won't run OCR.
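
When DEBUG output is long, or hard-wrapped by the console, a message like this is easy to miss. The sketch below is a small helper, not part of FSCrawler, that searches a pasted log while ignoring whitespace, so a line break in the middle of the sentence cannot hide it. The function name is made up for this example.

```python
import re


def contains_message(log_text: str, message: str) -> bool:
    """Whitespace-insensitive search, so console wrapping cannot hide a match."""
    def squash(text: str) -> str:
        return re.sub(r"\s+", "", text)

    return squash(message) in squash(log_text)


if __name__ == "__main__":
    # Two log records as they might appear after wrapping at 80 columns.
    wrapped_log = (
        "09:20:48,215 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.\n"
        "09:20:48,231 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so\n"
        "we won't run OCR.\n"
    )
    print(contains_message(wrapped_log, "Tesseract is not installed so we won't run OCR"))
```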

I think the problem is that you are using \ in the OCR path. Use / instead, or escape each backslash as \\.
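
As a small illustration of the two spellings suggested here, the sketch below rewrites the Tesseract path from the posted configuration both ways. The helper name is made up for this example, and whether this alone resolves the "not installed" message is not confirmed in the thread.

```python
from pathlib import PureWindowsPath


def yaml_safe_variants(windows_path: str) -> dict:
    """Return two spellings of a Windows path that avoid bare backslashes."""
    return {
        "forward_slashes": PureWindowsPath(windows_path).as_posix(),
        "escaped_backslashes": windows_path.replace("\\", "\\\\"),
    }


if __name__ == "__main__":
    # Raw string so Python itself does not interpret the backslashes.
    original = r"C:\Users\ITS-BETA\AppData\Local\Tesseract-OCR\tesseract.exe"
    for name, value in yaml_safe_variants(original).items():
        print(f'{name}: path: "{value}"')
```

Either printed value can be pasted as the `path` setting; the forward-slash form is usually the easier one to read.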
