Enhance performance when using FSCrawler and Elasticsearch together

Hello
I know this question and scenario are general and the answer depends on the circumstances, but I would like to know the best approach and settings for a case like this.

What we are trying to do:
Suppose we have a directory where files are added at a high rate in various formats: txt, HTML, PDF, office files, audio and video files, images, compressed files, etc.

To extract content from these files and also to index them, FSCrawler and Elasticsearch are used together.

The problem is that the indexing rate is very low: it takes a long time for files to be indexed and become searchable.

Before the questions, here is the essential information:

system specification:

OS:        CentOS 7
Memory:    ~ 120GB
SSD:       > 2TB 

FSCrawler settings:

{
  "name" : "job_name",
  "fs" : {
    "url" : "/home/dir/",
    "update_rate" : "30s",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "indexed_chars": "100%",
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng+fas"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8081,
    "endpoint" : "fscrawler"
  }
}

Elasticsearch:

number of clusters:  1
number of Nodes:     1
number of shards:    5
number of replicas:  1
number of indices:   1

Suppose the number of files is on the scale of millions (3, 4, 5, 6 million or more), and OCR is applied to the files.

The questions are:

  • How can I find out where the bottleneck is: FSCrawler or Elasticsearch?
  • How can I measure the content extraction rate in FSCrawler and the indexing rate in Elasticsearch?
  • What are the best values for refresh_interval, flush_interval and update_rate in such a scenario?
  • Which settings matter most for improving performance? (For example, the timing and size of merges.)
  • Suppose we have several indices instead of one, and run a separate FSCrawler instance (with its own job) for each. Using the includes and excludes settings, each job would extract content only from the file formats assigned to it. Would this have any effect on performance?

For example:

Index 1: job_1 ---> for PDF, Office
Index 2: job_2 ---> for txt, source_code, json, xml
Index 3: job_3 ---> for audio, video, images
Index 4: job_4 ---> for other formats like compressed files, etc
...

In general, we would welcome any idea that speeds up content extraction and indexing and brings the indexing rate closer to the rate at which files are added to the directory. :wink:

Thank You ...

As the author of FSCrawler, I can tell you that the bottleneck is definitely FSCrawler and not Elasticsearch.

Why?

First, it's single-threaded. Second, we use Tika behind the scenes, and when combined with OCR it can take several seconds before the content gets extracted.
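
To check this on your own setup, one rough option is to sample the document count in Elasticsearch twice and compare the resulting rate with the rate at which files arrive. The sketch below is only an illustration: it assumes the index is named job_name (as in the settings above) and that Elasticsearch listens on 127.0.0.1:9200. The function names are made up for this example.

```python
import json
import time
import urllib.request


def docs_per_second(count_start, count_end, elapsed_seconds):
    """Average indexing rate between two document-count samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_end - count_start) / elapsed_seconds


def fetch_count(base_url, index):
    """Read the current document count from the Elasticsearch _count API."""
    with urllib.request.urlopen(f"{base_url}/{index}/_count") as resp:
        return json.load(resp)["count"]


def sample_rate(base_url="http://127.0.0.1:9200", index="job_name", window=60):
    """Sample the count twice, `window` seconds apart, and return docs/second.

    The default URL and index name are assumptions taken from the settings
    in the question; adjust them to your cluster.
    """
    first = fetch_count(base_url, index)
    started = time.monotonic()
    time.sleep(window)
    second = fetch_count(base_url, index)
    return docs_per_second(first, second, time.monotonic() - started)
```

The count only reflects documents that have been refreshed, so the window should be much longer than the refresh interval. If the measured rate stays far below what a single Elasticsearch node can normally ingest, the time is most likely being spent in extraction rather than in indexing.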

You can indeed try starting multiple instances of FSCrawler, each watching a different subdirectory.
You can also use the FSCrawler REST interface so that FSCrawler doesn't have to scan your drive, and "just" upload your documents to it.
That will also give you an idea of the processing time.
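
A minimal sketch of timing a single upload through the REST interface, using only the Python standard library. The URL in the usage note is an assumption: it combines the host, port and endpoint from the rest settings above with an _upload path and a form field named file, and the path name may differ between FSCrawler versions, so check the documentation for the version you run. The helper names are hypothetical.

```python
import time
import urllib.request
import uuid


def build_multipart(filename, data):
    """Encode one file as a multipart/form-data body with a field named 'file'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def timed_upload(url, path):
    """POST one file to `url` and return the seconds the request took."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(path.rsplit("/", 1)[-1], handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    started = time.monotonic()
    with urllib.request.urlopen(request) as response:
        response.read()  # wait for the full response before stopping the clock
    return time.monotonic() - started
```

For example, timed_upload("http://127.0.0.1:8081/fscrawler/_upload", "/home/dir/sample.pdf") would time one document, where sample.pdf is a placeholder file name. Running it over a few representative files of each format (with and without OCR) shows which formats dominate the processing time.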

You can start multiple FSCrawler instances in parallel, on different machines or on different ports, and manage that on your side.

I have some plans for future optimizations, but nothing concrete yet.
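
One way to manage this is to derive one job per subdirectory from a shared base configuration, giving each job its own name, watched directory and REST port. The sketch below only builds the settings as Python dictionaries. The subdirectory names, the job naming scheme and the port numbering are assumptions for illustration, and writing the files to the FSCrawler configuration directory is left out.

```python
import copy
import json


def split_jobs(base_settings, subdirs, first_rest_port=8081):
    """Return one job settings dict per subdirectory.

    Each job gets a distinct name, watches a single subdirectory, and
    exposes its REST interface on its own port so the instances do not clash.
    """
    jobs = []
    for offset, subdir in enumerate(subdirs):
        job = copy.deepcopy(base_settings)  # never mutate the shared base
        job["name"] = f"{base_settings['name']}_{offset + 1}"
        job["fs"]["url"] = subdir
        job["rest"]["port"] = first_rest_port + offset
        jobs.append(job)
    return jobs


# Trimmed-down base settings modelled on the job in the question.
base = {
    "name": "job_name",
    "fs": {"url": "/home/dir/", "update_rate": "30s"},
    "rest": {"scheme": "HTTP", "host": "127.0.0.1", "port": 8081, "endpoint": "fscrawler"},
}

# Hypothetical subdirectories; use whatever layout your directory has.
jobs = split_jobs(base, ["/home/dir/a", "/home/dir/b"])
for job in jobs:
    print(json.dumps(job, indent=2))
```

Each resulting dictionary can be saved as the settings file of its own job and started as a separate FSCrawler process. Since every instance is single-threaded, the number of instances is what sets how many files are extracted concurrently.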

I'll be happy to hear more if you are able to find a good way to deploy it.
