Problem when using Elasticsearch and Tesseract-OCR

Hi everybody!
I have a directory containing text data at /opt/data/. When a new file is uploaded to that directory, is there a way to automatically use Tesseract-OCR to convert it to text in another language, and then have Elasticsearch automatically index it so I can search its content?
I'm using Ubuntu Server 16.04, Elasticsearch 7.7.1, and Tesseract-OCR 3.04.01.
Looking forward to your help.

You can use FSCrawler. There's a tutorial to help you get started.
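A minimal sketch of what that setup could look like (the job name `job_name`, the OCR language, and the Elasticsearch URL below are assumptions; check the FSCrawler documentation for the exact settings of your version):

```sh
# FSCrawler reads its job settings from ~/.fscrawler/<job_name>/_settings.yaml
mkdir -p ~/.fscrawler/job_name
cat > ~/.fscrawler/job_name/_settings.yaml <<'EOF'
name: "job_name"
fs:
  url: "/opt/data"        # directory to watch
  update_rate: "15m"      # how often FSCrawler rescans the directory
  ocr:
    language: "eng"       # Tesseract language used for OCR
elasticsearch:
  nodes:
    - url: "http://127.0.0.1:9200"
EOF

# Start the job; FSCrawler keeps running and rescans every update_rate.
bin/fscrawler job_name
```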

Thanks for the suggestion.
When I use FSCrawler, it indexes into Elasticsearch. The files are all stored in the tmp/es directory, but when a new file appears in tmp/es, it is not indexed automatically; I have to run `bin/fscrawler job-name --restart` again. I find this really inconvenient. Is there any way to run FSCrawler forever?
Best regards!

It's probably because you're moving the file into the directory instead of copying it. So it keeps its old creation date and is not picked up by FSCrawler.

You can activate the debug mode to see why it's ignored.

Otherwise, on Linux, you can also `touch` the file.
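For example, a quick sketch (the directory and file name are placeholders):

```sh
# cp (without -p) gives the copy a fresh modification time,
# so the next FSCrawler run picks it up:
cp report.pdf /tmp/es/

# If the file was moved instead, refresh its timestamp:
touch /tmp/es/report.pdf
```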


I have applied it to my project. I have the directory structure as shown and set up the _settings.yaml file as shown below. When I ran `bin/fscrawler job-name --restart`, I saw that FSCrawler started indexing, but it went very slowly and some files were missing. I have run the command again and waited a long time, but I still do not see the number of files in Elasticsearch change.
Looking forward to your help. Thank you!

Could you share the output of the following command:

```
GET /indexname/_search
```

How many files are you expecting?
Run FSCrawler with the --debug option and share the full logs.
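For example (the index name, job name, and host are placeholders):

```sh
# See what is actually indexed:
curl 'http://127.0.0.1:9200/indexname/_search?pretty'

# Rerun the crawler with debug logging and keep the output in a file:
bin/fscrawler job-name --debug > fscrawler-debug.log 2>&1
```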

Please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format.

If some outputs are too big, please share them on gist.github.com and link them here.

I realize that some xlsx files are not indexed; only one xlsx file appears among all the indexed files.
```
GET /indexname/_search
```

And here is the FSCrawler debug log: https://gist.github.com/hanguyenuet96/b7413659993d434ca6e869a4ebbcaa17

Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.

There are 31 documents indexed here. How many did you add to the folder? Which files are not indexed? What are their names?

There are 36 documents in total, and 31 files have been added to Elasticsearch. The names of the files that were not indexed are:

1. 20_8_73734_TestCase.xls
2. _380_filename (2).xls
3. 1_44548_73744_lichcoquan.xls

```
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 31,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "elasticsearch",
        "_type" : "_doc",
        "_id" : "407e16516d98ac317e46a9212a86b2",
        "_score" : 1.0,
        "_source" : {
          "content" : """
        }
      }
    ]
  }
}
```

I did not see 20_8_73734_TestCase.xls in the logs. What is its full path?

Thank you for helping me.
The full path is: /opt/lampp/htdocs/selab/Contents/OfficialDispatch/2020/07/15/20_8_73734_TestCase.xls

From the logs, it sounds like the 07 directory was not available when the crawler ran.

I searched for OfficialDispatch/2020/07 in the logs and it is not there, but OfficialDispatch/2020/05 is.

Could you run `ls -l /opt/lampp/htdocs/selab/Contents/OfficialDispatch/2020/`?

Also share the fscrawler job settings please.


I have checked, and the files were indeed missing; that was my mistake. Thank you for your help.
By the way, may I ask: is there a way to run FSCrawler forever?

On Windows, you can maybe follow this: https://fscrawler.readthedocs.io/en/latest/installation.html#running-as-a-service-on-windows

On Linux, I guess you need to create a service.
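A minimal sketch of a systemd unit, assuming FSCrawler is installed under /opt/fscrawler, runs as user fscrawler, and the job is named job_name (all of these paths and names are assumptions, not settings from this thread):

```sh
sudo tee /etc/systemd/system/fscrawler.service > /dev/null <<'EOF'
[Unit]
Description=FSCrawler file system crawler
After=network.target elasticsearch.service

[Service]
User=fscrawler
ExecStart=/opt/fscrawler/bin/fscrawler job_name
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, then start the service now and at every boot:
sudo systemctl daemon-reload
sudo systemctl enable --now fscrawler
```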


Yeah, thank you so much.
Best regards!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.