Can we index images with extensions like .jpeg, .img, .jpg in Elasticsearch?

Hello @dadoonet,

I have tried with
path: "C:\\Users\\ITS-BETA\\AppData\\Local\\Tesseract-OCR\\tesseract.exe"

I am still getting the same output.
Tesseract is already installed.

Regards,
Priyanka

If FSCrawler is running from C: could you try:

path: "/Users/ITS-BETA/AppData/Local/Tesseract-OCR/tesseract.exe" 

If it still doesn't work, could you try adding the Tesseract dir to your PATH?

If it still fails, could you run with --trace instead of --debug?
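
Before re-running the crawler, it may help to confirm from outside FSCrawler that the configured location exists and whether `tesseract` is reachable through PATH. Below is a minimal Python sketch, assuming a Python interpreter is available on the machine. The function name `check_tesseract` is made up for illustration, and the path is the one quoted in this thread. Whether the setting should name the executable or its folder is not settled here, so the sketch reports both.

```python
import os
import shutil


def check_tesseract(configured_path):
    """Report what the OS sees for the configured path and for PATH lookup."""
    return {
        # True if the configured value points at an existing file (e.g. tesseract.exe)
        "is_file": os.path.isfile(configured_path),
        # True if the configured value points at an existing folder
        "is_dir": os.path.isdir(configured_path),
        # Full path of a `tesseract` executable found via PATH, or None
        "on_path": shutil.which("tesseract"),
    }


if __name__ == "__main__":
    # Path taken from the post above; adjust it to the local install.
    print(check_tesseract("C:/Users/ITS-BETA/AppData/Local/Tesseract-OCR/tesseract.exe"))
```

If `on_path` comes back as `None`, the Tesseract dir is probably not on PATH for the account that runs the crawler.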

Hello @dadoonet,

I have tried with path: "/Users/ITS-BETA/AppData/Local/Tesseract-OCR/tesseract.exe". When I try to run the FSCrawler job, it asks me to create a new job again even though the job already exists.

One more thing: what is the difference between the hocr and txt options for the output_type parameter?

I have also run with the --trace option. Please find the log lines below:

06:18:19,338 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health
on index [img]
06:18:19,354 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"clus
ter_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":
1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocati
ng_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_s
hards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_wai
ting_in_queue_millis":0,"active_shards_percent_as_number":54.166666666666664}
06:18:19,354 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [img_folde
r]
06:18:19,354 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] index settings: [{
  "settings": {
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties" : {
      "real" : {
        "type" : "keyword",
        "store" : true
      },
      "root" : {
        "type" : "keyword",
        "store" : true
      },
      "virtual" : {
        "type" : "keyword",
        "store" : true
      }
    }
  }
}
]
06:18:19,370 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health
on index [img_folder]
06:18:19,370 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"clus
ter_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":
1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocati
ng_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_s
hards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_wai
ting_in_queue_millis":0,"active_shards_percent_as_number":54.166666666666664}
06:18:19,370 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [img]
 for [C:\tmp\images] every [15m]
06:18:19,370 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [img] for
 [C:\tmp\images] every [15m]
06:18:19,370 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [img] is now r
unning. Run #1...
06:18:19,385 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\tmp\images] content
06:18:19,385 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C
:\tmp\images
06:18:19,385 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped fo
r file [C:\tmp\images\0000_6954754_01.jpg] on [windows server 2012 r2]
06:18:19,385 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped fo
r file [C:\tmp\images\0000_6954754_01.jpg] on [windows server 2012 r2]
06:18:19,385 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
06:18:19,385 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstract
Model{name='0000_6954754_01.jpg', file=true, directory=false, lastModifiedDate=2
017-05-25T13:55:06.660, creationDate=2019-08-20T08:56:37.015500, accessDate=2019
-08-20T11:48:28.843397, path='C:\tmp\images', owner='BUILTIN\Administrators', gr
oup='null', permissions=-1, extension='jpg', fullpath='C:\tmp\images\0000_695475
4_01.jpg', size=151201}
06:18:19,385 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\ima
ges, C:\tmp\images\0000_6954754_01.jpg) = /0000_6954754_01.jpg
06:18:19,385 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [
/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
06:18:19,385 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg]
, excludes = [[*/~*]]
06:18:19,385 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
06:18:19,385 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude patter
n
06:18:19,385 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg]
, includes = [null]
06:18:19,385 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
06:18:19,385 DEBUG [f.p.e.c.f.FsParserAbstract] [/0000_6954754_01.jpg] can be in
dexed: [true]
06:18:19,385 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /0000_6954754_01.jpg
06:18:19,385 DEBUG [f.p.e.c.f.FsParserAbstract]     - not modified: creation dat
e 2019-08-20T08:56:37.015500 , file date 2017-05-25T13:55:06.660, last scan date
 2019-08-21T06:06:02.847
06:18:19,385 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:
\tmp\images]...
06:18:19,401 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files
 in dir [path.root:d87deebc67187943f7c35ecc9869f9]
06:18:19,432 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearc
h.crawler.fs.client.ESSearchResponse@1ffd1bda]
06:18:19,432 TRACE [f.p.e.c.f.FsParserAbstract] We found: [0000_6954754_01.jpg]
06:18:19,432 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [0000_6954754_01.j
pg]
06:18:19,432 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\ima
ges, C:\tmp\images\0000_6954754_01.jpg) = /0000_6954754_01.jpg
06:18:19,432 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [
/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
06:18:19,432 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg]
, excludes = [[*/~*]]
06:18:19,432 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
06:18:19,432 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude patter
n
06:18:19,432 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg]
, includes = [null]
06:18:19,432 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
06:18:19,432 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories
in [C:\tmp\images]...
06:18:19,448 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for
 15m        

Regards,
Priyanka

You did not run it with --restart, right?

Hello,

Earlier in this thread you mentioned using --restart. No, I have not used it since then.

Regards,
Priyanka

If you don't use that option, it won't pick up old files.
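
The first trace shows this: the file is reported as `not modified` because both its creation date and its file date are older than the last scan date. Here is a rough Python sketch of that comparison, assuming the crawler skips a file when neither date is newer than the last scan. The function name `needs_indexing` is hypothetical and this is not FSCrawler's own code.

```python
from datetime import datetime


def needs_indexing(creation, modified, last_scan):
    """Assumed rule: index only if the file changed after the last scan."""
    return creation > last_scan or modified > last_scan


# Dates copied from the "not modified" line in the first trace.
creation = datetime.fromisoformat("2019-08-20T08:56:37.015500")
modified = datetime.fromisoformat("2017-05-25T13:55:06.660")
last_scan = datetime.fromisoformat("2019-08-21T06:06:02.847")

print(needs_indexing(creation, modified, last_scan))  # False
```

Running with --restart is meant to make the job start from scratch, so files like this one would be looked at again regardless of the earlier scan date.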

Hello @dadoonet,

Do you want me to run with the --restart option and --debug?

Regards,
Priyanka

Restart and trace options.

Hello @dadoonet,

I have run it with the --restart and --trace options. It gives me the following logs.

04:36:27,502 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstract
Model{name='8_2_Edit Content_6159418_01.jpg', file=true, directory=false, lastMo
difiedDate=2014-01-02T20:46:04.561, creationDate=2019-08-22T04:34:30.810227, acc
essDate=2019-08-22T04:34:30.810227, path='C:\tmp\images', owner='BUILTIN\Adminis
trators', group='null', permissions=-1, extension='jpg', fullpath='C:\tmp\images
\8_2_Edit Content_6159418_01.jpg', size=16635}
04:36:27,502 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\ima
ges, C:\tmp\images\8_2_Edit Content_6159418_01.jpg) = /8_2_Edit Content_6159418_
01.jpg
04:36:27,502 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [
/8_2_Edit Content_6159418_01.jpg], includes = [null], excludes = [[*/~*]]
04:36:27,502 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/8_2_Edit Content_615
9418_01.jpg], excludes = [[*/~*]]
04:36:27,502 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
04:36:27,502 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude patter
n
04:36:27,502 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/8_2_Edit Content_615
9418_01.jpg], includes = [null]
04:36:27,502 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
04:36:27,502 DEBUG [f.p.e.c.f.FsParserAbstract] [/8_2_Edit Content_6159418_01.jp
g] can be indexed: [true]
04:36:27,518 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /8_2_Edit Content_6159
418_01.jpg
04:36:27,518 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [C:\tmp\im
ages],[8_2_Edit Content_6159418_01.jpg]
04:36:27,518 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\ima
ges, C:\tmp\images\8_2_Edit Content_6159418_01.jpg) = /8_2_Edit Content_6159418_
01.jpg
04:36:27,518 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [C:\tmp\image
s\8_2_Edit Content_6159418_01.jpg]
04:36:27,518 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
04:36:27,534 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
04:36:27,534 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
04:36:27,534 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matc
hes.
04:36:27,534 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing img/ee91153ce51b141bbd0
2098585f8fbc?pipeline=null
04:36:27,534 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "meta" : { },
  "file" : {
    "extension" : "jpg",
    "content_type" : "image/jpeg",
    "created" : "2019-08-22T04:34:30.810+0000",
    "last_modified" : "2014-01-02T20:46:04.561+0000",
    "last_accessed" : "2019-08-22T04:34:30.810+0000",
    "indexing_date" : "2019-08-22T04:36:27.518+0000",
    "filesize" : 16635,
    "filename" : "8_2_Edit Content_6159418_01.jpg",
    "url" : "file://C:\\tmp\\images\\8_2_Edit Content_6159418_01.jpg"
  },
  "path" : {
    "root" : "d87deebc67187943f7c35ecc9869f9",
    "virtual" : "/8_2_Edit Content_6159418_01.jpg",
    "real" : "C:\\tmp\\images\\8_2_Edit Content_6159418_01.jpg"
  }
}
04:36:27,534 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:
\tmp\images]...
04:36:27,534 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files
 in dir [path.root:d87deebc67187943f7c35ecc9869f9]
04:36:28,065 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearc
h.crawler.fs.client.ESSearchResponse@15ec5d6a]
04:36:28,081 TRACE [f.p.e.c.f.FsParserAbstract] We found: [0000_6954754_01.jpg,
8_2_Edit Content_6159418_01.jpg]
04:36:28,081 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [0000_6954754_01.j
pg]
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\ima
ges, C:\tmp\images\0000_6954754_01.jpg) = /0000_6954754_01.jpg
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [
/0000_6954754_01.jpg], includes = [null], excludes = [[*/~*]]
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg]
, excludes = [[*/~*]]
04:36:28,081 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
04:36:28,081 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude patter
n
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/0000_6954754_01.jpg]
, includes = [null]
04:36:28,081 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
04:36:28,081 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [8_2_Edit Content_
6159418_01.jpg]
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\tmp\ima
ges, C:\tmp\images\8_2_Edit Content_6159418_01.jpg) = /8_2_Edit Content_6159418_
01.jpg
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [
/8_2_Edit Content_6159418_01.jpg], includes = [null], excludes = [[*/~*]]
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/8_2_Edit Content_615
9418_01.jpg], excludes = [[*/~*]]
04:36:28,081 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
04:36:28,081 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude patter
n
04:36:28,081 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/8_2_Edit Content_615
9418_01.jpg], includes = [null]
04:36:28,081 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
04:36:28,081 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories
in [C:\tmp\images]...
04:36:28,096 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for
 15m
04:36:31,568 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request
of [3] requests
04:36:31,927 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request w
ith [3] requests

Regards,
Priyanka

I don't think this is the full log. I can't see the OCR messages here.
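
One thing the trace above does show: the JSON that was indexed contains `meta`, `file` and `path` but no extracted text. Here is a small Python sketch to check a document for that, assuming the extracted text would appear in a top-level `content` field. `has_content` is an illustrative helper, and the document is a trimmed copy of the one in the trace.

```python
import json


def has_content(doc):
    """True if the document carries non-empty text in a top-level `content` field."""
    text = doc.get("content")
    return isinstance(text, str) and bool(text.strip())


# Trimmed copy of the "JSon indexed" document from the trace above.
indexed = json.loads("""
{
  "meta": {},
  "file": {
    "extension": "jpg",
    "content_type": "image/jpeg",
    "filesize": 16635,
    "filename": "8_2_Edit Content_6159418_01.jpg"
  },
  "path": {
    "virtual": "/8_2_Edit Content_6159418_01.jpg"
  }
}
""")

print(has_content(indexed))  # False
```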

Hello @dadoonet,

The full log is huge, and the forum does not allow me to post more than 7000 characters.
Is there another way to share the logs?

Regards,
Priyanka

gist.github.com for example.
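
To cut a large trace down before sharing it, the lines that mention Tika or OCR can be pulled out first. Below is a minimal Python sketch. The keyword list and the helper name `ocr_lines` are assumptions for illustration, and because the console wraps long lines, a keyword split across two lines will be missed.

```python
import re

# Keywords assumed to be relevant to OCR troubleshooting.
PATTERN = re.compile(r"tika|ocr|tesseract", re.IGNORECASE)


def ocr_lines(lines):
    """Keep only the log lines that mention one of the keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


# Three lines copied from the trace earlier in this thread.
sample = [
    "04:36:27,518 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction",
    "04:36:27,534 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction",
    "04:36:28,096 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for",
]

for line in ocr_lines(sample):
    print(line)
```

The same function accepts an open file object, so a saved log can be passed in directly instead of the sample list.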

Hello @dadoonet,

I have reinstalled Tesseract-OCR and set the path to "C:/Program Files/Tesseract-OCR/tesseract.exe".
I am now able to run the FSCrawler job for images in .jpg and .jpeg format,
but I don't see any content being indexed when I check in Kibana.

Could you please guide me?
My configuration file:

ocr:
language: "eng"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
path: "C:/Program Files/Tesseract-OCR/tesseract.exe"

Regards,
Priyanka