Hello,
Can we index images with extension types like .jpeg,.img,jpg in elasticsearch?
If yes, then can anyone explain how i can achieve it?
Thanks & Regards,
Priyanka Yerunkar.
Hey,
Can you explain your use case? You could either convert the images to base64 or use SMILE as a protocol to index them in binary format. However, I would try not to do this: it increases the size of your index, and a smaller index results in faster searches, lower memory requirements, etc.
--Alex
Hello @spinscale,
Thanks for reply!!!
We have a file system where users upload various types of files from a frontend application, and we index those files.
Likewise, can we index images in ES so that a user who searches for an image gets that image back?
Regards,
Priyanka
Unless you write some logic to make an image searchable, there is currently no out-of-the-box solution. You could try the ingest attachment processor, but it will merely try to extract metadata. If you need to search the contents of an image, you need to do the preprocessing yourself.
Hello @spinscale,
Thanks for reply!!.
Yeah we have to write some logic to make an image searchable.
I have one question again:
Currently I am using FSCrawler to index attachment data, so that attachments with various extensions get indexed. I have tried .pdf, .pptx, .txt, .doc and .docx: when I index a file with one of these extensions, its content is indexed, meaning we can read the content of the attachment.
Is there any way to do the same for images? You suggested preprocessing for this. Could you please elaborate on that?
Regards,
Priyanka
If you activate the OCR option of FSCrawler, text should be extracted from images.
Hello @dadoonet,
I tried using the following settings:
ocr: language: "eng" enabled: true pdf_strategy: "ocr_and_text" follow_symlinks: false output_type: "hocr" path: "C:\Users\ITS-BETA\AppData\Local\Tesseract-OCR\tesseract.exe"
But I am still not able to index the content of images. Could you please suggest what is wrong in the lines above? What other parameters do I need to add?
Regards,
Priyanka
Could you run with the debug option and share the output?
I think it's not finding Tesseract.
Also make sure the indentation is correct. The sample you pasted is not.
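For reference, here is a sketch of how a correctly indented job settings file might look. It assumes the layout described in the FSCrawler documentation, where `ocr` is nested under `fs` and `follow_symlinks` sits directly under `fs` rather than inside `ocr`. The `url` value is a hypothetical placeholder, and the Tesseract path is the one from the post above, written with forward slashes as an assumption about what YAML will accept.

```yaml
# Sketch of a job _settings.yaml (assumed layout, check against the FSCrawler docs).
name: "att"
fs:
  # Hypothetical placeholder: the directory FSCrawler should crawl.
  url: "C:/path/to/your/files"
  # follow_symlinks is a sibling of ocr, not a child of it.
  follow_symlinks: false
  ocr:
    enabled: true
    language: "eng"
    pdf_strategy: "ocr_and_text"
    output_type: "hocr"
    # Path from the post above, written with forward slashes.
    path: "C:/Users/ITS-BETA/AppData/Local/Tesseract-OCR/tesseract.exe"
```

If a key ends up at the wrong nesting level, the YAML may still parse while the setting is not applied where it is expected, so it is worth comparing the values printed in the debug log with what the file specifies.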
Could you share also a file you'd like to index?
Hello @dadoonet,
How do I set the debug option?
I ran fscrawler att --debug from CMD to start the job, but it asks me to create the job again, although I have already created it.
Regards,
Priyanka
What are you running exactly? Did you look at the documentation?
Hello @dadoonet,
I am running an FSCrawler job to index images. Yes, I have gone through the documentation.
As suggested, I have also downloaded Tesseract.
Regards,
Priyanka
I meant: "what exact command are you running to launch fscrawler with the debug option?"
And can you tell what is the output of the command?
Hello @dadoonet,
Please find below:
E:\ES\fscrawler-es7-2.7-20190625.065648-37\fscrawler-es7-2.7-SNAPSHOT\bin>fscrawler att --debug
08:58:06,261 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
08:58:06,293 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [att]...
08:58:06,683 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
08:58:07,277 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.2.0
08:58:07,324 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
08:58:07,324 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
08:58:07,324 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.2.0] node.
08:58:07,324 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att]
08:58:07,480 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att]
08:58:07,496 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att_folder]
08:58:07,496 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att_folder]
08:58:07,512 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [att] for [C:\tmp\Test] every [1m]
08:58:07,512 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [att] for [C:\tmp\Test] every [1m]
08:58:07,512 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [att] is now running. Run #1...
08:58:07,512 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\tmp\Test] content
08:58:07,512 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C:\tmp\Test
08:58:07,527 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], excludes = [[*/~*]]
08:58:07,527 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], includes = [null]
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract] [/0000_6954754_01.jpg] can be indexed: [true]
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /0000_6954754_01.jpg
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2019-08-20T08:56:37.015500 , file date 2017-05-25T13:55:06.660, last scan date 2019-08-20T08:56:40.569
08:58:07,527 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:\tmp\Test]...
Regards,
Priyanka
It's not, right?
Could you use the --restart option as well?
So launch:
fscrawler att --debug --restart
And share the logs.
Please format your code, logs or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.
I updated your post.
Hello @dadoonet,
Thanks for reply!!
I changed some parameters in the _settings.yaml file; after that, it started running without asking me to create a new job again.
Please find below the logs from launching "fscrawler att --debug --restart":
E:\ES\fscrawler-es7-2.7-20190625.065648-37\fscrawler-es7-2.7-SNAPSHOT\bin>fscrawler att --debug --restart
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
09:20:46,995 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [att]...
09:20:46,995 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [att]...
09:20:47,323 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
:20:47,918 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.2.0
09:20:47,950 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
09:20:47,950 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
09:20:47,950 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.2.0] node.
09:20:47,950 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att]
09:20:48,106 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att]
09:20:48,137 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [att_folder]
09:20:48,153 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [att_folder]
09:20:48,153 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [att] for [C:\tmp\Test] every [1m]
09:20:48,153 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [att] for [C:\tmp\Test] every [1m]
09:20:48,153 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [att] is now running. Run #1...
09:20:48,168 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test) = /
09:20:48,168 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing att_folder/3db93f18113340589d6d165775ec8a24?pipeline=null
09:20:48,168 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\tmp\Test] content
09:20:48,168 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C:\tmp\Test
09:20:48,168 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], excludes = [[*/~*]]
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], includes = [null]
09:20:48,184 DEBUG [f.p.e.c.f.FsParserAbstract] [/0000_6954754_01.jpg] can be indexed: [true]
09:20:48,184 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /0000_6954754_01.jpg
09:20:48,184 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [C:\tmp\Test],[0000_6954754_01.jpg]
09:20:48,184 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
09:20:48,215 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
09:20:48,231 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
09:20:48,512 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [C:\Users\ITS-BETA-ENDECA\AppData\Local\Tesseract-OCR\tesseract.exe].
09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
09:20:48,809 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Output Type set to [txt].
09:20:48,950 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing att/d659b7f72abeb26397c659773471bdf8?pipeline=null
09:20:48,965 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:\tmp\Test]...
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\Test, C:\tmp\Test\0000_6954754_01.jpg) = /0000_6954754_01.jpg
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], excludes = [[*/~*]]
09:20:49,028 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg], includes = [null]
09:20:49,028 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [C:\tmp\Test]...
09:20:49,028 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 1m
I have used </> to format the logs.
Regards,
Priyanka
Please don't use the citation icon for code or logs.
In the logs you have:
But Tesseract is not installed so we won't run OCR.
I think the problem is that you are using \ in the OCR path instead of /, or you should use \\.
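As a sketch of what that change could look like: in a double-quoted YAML string, a backslash starts an escape sequence, so a Windows path written with single backslashes is not read literally. The fragment below shows the two forms suggested above, plus a single-quoted form in which YAML leaves backslashes untouched. The nesting under `fs.ocr` is an assumption based on the FSCrawler documentation, and the path is the one from the earlier post in this thread.

```yaml
fs:
  ocr:
    # Option 1: forward slashes, no escaping needed.
    path: "C:/Users/ITS-BETA/AppData/Local/Tesseract-OCR/tesseract.exe"
    # Option 2: doubled backslashes inside a double-quoted string.
    # path: "C:\\Users\\ITS-BETA\\AppData\\Local\\Tesseract-OCR\\tesseract.exe"
    # Option 3: single quotes, where backslashes are taken literally.
    # path: 'C:\Users\ITS-BETA\AppData\Local\Tesseract-OCR\tesseract.exe'
```

Only one `path` line should be active at a time. After changing it, running `fscrawler att --debug --restart` again and checking whether the "Tesseract is not installed" line still appears should show whether the path is now being picked up.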