Hello
I know this question and the scenario I'm going to describe are general, and that the answer depends on the circumstances, but I would like to know the best approach and settings for such a scenario.
Here is the situation:
Suppose we have a directory where files are added at a high rate, in various formats such as txt, html, PDF, office files, audio and video files, image files, compressed files, etc.
We use FSCrawler together with Elasticsearch to extract the content of these files and index them.
The problem is that the indexing rate is very low, and it takes a long time for the files to be indexed and become searchable.
Before asking my questions, here is the essential information.
System specification:
OS: Centos 7
Memory: ~ 120GB
SSD: > 2TB
FSCrawler settings:
{
  "name" : "job_name",
  "fs" : {
    "url" : "/home/dir/",
    "update_rate" : "30s",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "indexed_chars" : "100%",
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng+fas"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8081,
    "endpoint" : "fscrawler"
  }
}
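For completeness, I start the job with the standard FSCrawler launcher, roughly like this (the job name is just mine, and as far as I understand --loop -1 keeps it crawling continuously):

bin/fscrawler job_name --loop -1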
Elasticsearch:
Number of clusters: 1
Number of nodes: 1
Number of shards: 5
Number of replicas: 1
Number of indices: 1
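To give an idea of what I can already see on the Elasticsearch side, I have been watching the document count and indexing stats with the stats API, and I assume refresh_interval could be relaxed during the initial crawl, something like the commands below (I'm assuming the index is named job_name, since FSCrawler uses the job name by default; the 30s value is just a guess, not a recommendation):

# current indexing totals and timings for the index
curl -s "127.0.0.1:9200/job_name/_stats/indexing?pretty"

# relax the refresh interval while the initial crawl is running
curl -s -X PUT "127.0.0.1:9200/job_name/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index" : { "refresh_interval" : "30s" } }'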
Suppose the number of files is on the scale of millions (3, 4, 5, or 6 million or more), and also keep in mind that OCR is performed on the files.
The questions are:
- How can I determine where our bottleneck is: FSCrawler or Elasticsearch?
- How can I measure the content extraction rate in FSCrawler as well as the indexing rate in Elasticsearch?
- What are the best values for refresh_interval, flush_interval and update_rate for such a scenario?
- What are the important settings we need to apply to improve performance? (For example, merge timing and size settings.)
- Suppose we have several indices instead of just one, and for each we run a separate FSCrawler instance (with its own job), so that, through the includes and excludes settings, each job is responsible for extracting content from specific file formats (see the sketch after the example below). Does this have any effect on performance?
For example:
Index 1: job_1 ---> for PDF, Office
Index 2: job_2 ---> for txt, source_code, json, xml
Index 3: job_3 ---> for audio, video, images
Index 4: job_4 ---> for other formats like compressed files, etc.
...
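A rough sketch of what I have in mind for job_1 above, restricting the crawl by extension through includes and pointing it at its own index (the patterns, index name, and the exact glob syntax for includes are only illustrative; I am not sure this is the right way to split the work):

{
  "name" : "job_1",
  "fs" : {
    "url" : "/home/dir/",
    "includes" : [ "*.pdf", "*.doc", "*.docx", "*.xls", "*.xlsx", "*.ppt", "*.pptx" ],
    "excludes" : [ "~*" ]
  },
  "elasticsearch" : {
    "index" : "index_1"
  }
}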
In general, any solution or idea that can improve performance, speed up content extraction and indexing, and bring the indexing rate closer to the rate at which files are added to the directory would make us very happy.
Thank you!