Problem when using Elasticsearch and Tesseract-OCR

Hi everybody!
I have a directory containing text data at /opt/data/. When a new file is uploaded to that directory, is there a way to automatically use Tesseract-OCR to convert it to text in another language, and then have Elasticsearch automatically index it so I can search its content?
I'm using Ubuntu Server 16.04, Elasticsearch 7.7.1, and Tesseract-OCR 3.04.01.
Looking forward to your help.

You can use FSCrawler. There's a tutorial to help you get started.
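A minimal sketch of what that setup could look like (the job name `job_name`, the OCR language, and the Elasticsearch URL below are assumptions; check the FSCrawler documentation for the exact settings of your version):

```sh
# FSCrawler reads its job settings from ~/.fscrawler/<job_name>/_settings.yaml
mkdir -p ~/.fscrawler/job_name
cat > ~/.fscrawler/job_name/_settings.yaml <<'EOF'
name: "job_name"
fs:
  url: "/opt/data"        # directory to watch
  update_rate: "15m"      # how often FSCrawler rescans the directory
  ocr:
    language: "eng"       # Tesseract language used for OCR
elasticsearch:
  nodes:
    - url: "http://127.0.0.1:9200"
EOF

# Start the job; FSCrawler keeps running and rescans every update_rate.
bin/fscrawler job_name
```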

Thanks for the suggestion.
When I use FSCrawler, it indexes into Elasticsearch. The files are all stored in the tmp/es directory, but when a new file appears in tmp/es, it is not indexed automatically; I have to run `bin/fscrawler job-name --restart` again. I find this really inconvenient. Is there any way to run FSCrawler forever?
Best regards!

It's probably because you're moving the file into the directory instead of copying it. So it keeps its old creation date and is not picked up by FSCrawler.

You can activate the debug mode to see why it's ignored.

Otherwise, on Linux, you can also `touch` the file.
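For example, a quick sketch (the directory and file name are placeholders):

```sh
# cp (without -p) gives the copy a fresh modification time,
# so the next FSCrawler run picks it up:
cp report.pdf /tmp/es/

# If the file was moved instead, refresh its timestamp:
touch /tmp/es/report.pdf
```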


I have applied it to my project. I have the directory structure as shown and set up the _settings.yaml file as shown below. When I ran `bin/fscrawler job-name --restart`, I saw that FSCrawler started indexing, but it went very slowly and some files were missing. I have run the command again and waited a long time, but I still do not see the number of files in Elasticsearch change.
Looking forward to your help. Thank you!

Could you share the output of the following command:

```
GET /indexname/_search
```

How many files are you expecting?
Run FSCrawler with the --debug option and share the full logs.
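For example (the index name, job name, and host are placeholders):

```sh
# See what is actually indexed:
curl 'http://127.0.0.1:9200/indexname/_search?pretty'

# Rerun the crawler with debug logging and keep the output in a file:
bin/fscrawler job-name --debug > fscrawler-debug.log 2>&1
```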

Please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format.

If some outputs are too big, please share them on gist.github.com and link them here.

I realize that some xlsx files are not indexed; only one xlsx file appears among all the indexed files.
```
GET /indexname/_search
```

And here is the FSCrawler debug log: https://gist.github.com/hanguyenuet96/b7413659993d434ca6e869a4ebbcaa17

Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.

There are 31 documents indexed here. How many did you add to the folder? Which files are not indexed? What are their names?

There are 36 documents in total, and 31 files have been added to Elasticsearch. The names of the files that were not indexed are:

1. 20_8_73734_TestCase.xls
2. _380_filename (2).xls
3. 1_44548_73744_lichcoquan.xls

```
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 31,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "elasticsearch",
        "_type" : "_doc",
        "_id" : "407e16516d98ac317e46a9212a86b2",
        "_score" : 1.0,
        "_source" : {
          "content" : """
        }
      }
    ]
  }
}
```

I did not see 20_8_73734_TestCase.xls in the logs. What is its full path?

Thank you for helping me.
The full path is: /opt/lampp/htdocs/selab/Contents/OfficialDispatch/2020/07/15/20_8_73734_TestCase.xls

From the logs, it sounds like the 07 directory was not available when the crawler ran.

I searched for OfficialDispatch/2020/07 in the logs and it is not there, but OfficialDispatch/2020/05 is.

Could you run `ls -l /opt/lampp/htdocs/selab/Contents/OfficialDispatch/2020/`?

Also share the fscrawler job settings please.


I have checked, and the files were indeed missing; that was my mistake. Thank you for your help.
By the way, may I ask: is there a way to run FSCrawler forever?

On Windows, you can maybe follow this: https://fscrawler.readthedocs.io/en/latest/installation.html#running-as-a-service-on-windows

On Linux, I guess you need to create a service.
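A minimal sketch of a systemd unit, assuming FSCrawler is installed under /opt/fscrawler, runs as user fscrawler, and the job is named job_name (all of these paths and names are assumptions, not settings from this thread):

```sh
sudo tee /etc/systemd/system/fscrawler.service > /dev/null <<'EOF'
[Unit]
Description=FSCrawler file system crawler
After=network.target elasticsearch.service

[Service]
User=fscrawler
ExecStart=/opt/fscrawler/bin/fscrawler job_name
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, then start the service now and at every boot:
sudo systemctl daemon-reload
sudo systemctl enable --now fscrawler
```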


Yeah, thank you so much.
Best regards!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.