We are considering the question: how does FSCrawler manage failures?
What happens if FSCrawler fails to upload a certain file because of a failure on the Elasticsearch side? Does FSCrawler implement an automatic failure-handling mechanism?
Consider this scenario: FSCrawler fails to upload a file because Elasticsearch is down.
Can we have FSCrawler retry uploading the file? Can we configure the number of retries (something like number_retries)? (See the sketch after the list below.)
Can we have FSCrawler log the following:
a. The file on which the failure happened
b. Exception information
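As far as I can tell from the documentation, FSCrawler does not expose a number_retries setting, so the following is only a sketch of the kind of retry-with-backoff logic I have in mind. The names here (indexWithRetry, indexDocument, maxRetries) are hypothetical placeholders, not FSCrawler or Elasticsearch client APIs:

```java
import java.io.IOException;
import java.nio.file.Path;

public class RetrySketch {

    /**
     * Hypothetical retry wrapper: tries to index a file up to maxRetries times,
     * doubling the wait between attempts. indexDocument() stands in for whatever
     * call actually ships the file to Elasticsearch; it is NOT an FSCrawler API.
     */
    static void indexWithRetry(Path file, int maxRetries, long initialBackoffMillis)
            throws InterruptedException {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                indexDocument(file);          // placeholder for the real indexing call
                return;                       // success: stop retrying
            } catch (IOException e) {
                // Log exactly the two things asked for: the file and the exception.
                System.err.printf("Attempt %d/%d failed for %s: %s%n",
                        attempt, maxRetries, file, e);
                if (attempt == maxRetries) {
                    System.err.printf("Giving up on %s after %d attempts%n", file, maxRetries);
                    return;
                }
                Thread.sleep(backoff);        // wait before the next attempt
                backoff *= 2;                 // exponential backoff
            }
        }
    }

    // Placeholder: declared only so the sketch compiles. In a real setup this
    // would be the call that sends the file to Elasticsearch (e.g. a bulk request).
    static void indexDocument(Path file) throws IOException {
        throw new IOException("Elasticsearch is down (simulated)");
    }

    public static void main(String[] args) throws InterruptedException {
        indexWithRetry(Path.of("/tmp/example.pdf"), 3, 1000);
    }
}
```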
What is the structure (i.e. the fields) of the log file "documents.log"? I looked at that file but did not find any reference to the name of the file on which the failure happened.
But when the document is sent to Elasticsearch, the path value is "unknown" within the bulk request. I think that could be a great thing to add. Would you like to open a feature request?
If you activate debug mode for documents, you can see the id of the document together with its full path; then, when you get an error, you can try to find the same id in the debug output...
2021-07-01 22:11:53,627 [ERROR] [375040c5d4baa5408ae296233dc6e79c][null] ElasticsearchException[Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];]]
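If it helps, here is a minimal sketch that scans documents.log for [ERROR] lines of the shape shown above and pulls out the timestamp, the id, the path and the exception text, so the id can then be searched for in the debug output. The regex is only inferred from that single example line, not from a documented format, and the log path is an assumption:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocumentsLogScan {

    // Matches lines like:
    // 2021-07-01 22:11:53,627 [ERROR] [375040c5d4baa5408ae296233dc6e79c][null] ElasticsearchException[...]
    // The pattern reflects that one example, not a documented format.
    private static final Pattern ERROR_LINE = Pattern.compile(
            "^(\\S+ \\S+) \\[ERROR\\] \\[([^\\]]*)\\]\\[([^\\]]*)\\] (.*)$");

    public static void main(String[] args) throws IOException {
        Path log = Path.of(args.length > 0 ? args[0] : "logs/documents.log"); // assumed location
        try (var lines = Files.lines(log)) {
            lines.forEach(line -> {
                Matcher m = ERROR_LINE.matcher(line);
                if (m.matches()) {
                    System.out.println("timestamp : " + m.group(1));
                    System.out.println("doc id    : " + m.group(2)); // search for this id in the debug output
                    System.out.println("path      : " + m.group(3)); // currently often "null"/"unknown"
                    System.out.println("exception : " + m.group(4));
                    System.out.println();
                }
            });
        }
    }
}
```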
As per my understanding of your explanation, it logs mainly three things:
375040c5d4baa5408ae296233dc6e79c --> the id
null --> the path (to be added)
Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];] --> the exception
I am assuming that the Id field is what FSCrawler assigns to the Elasticsearch document. Is that true?
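On that point, the 32-hex-character shape of the id looks like an MD5 digest, and my assumption (to be confirmed against the FSCrawler docs or source, including whether it hashes the file name or the full path) is that the id is derived from the file name. If that assumption holds, a quick check like the one below could map an id from documents.log back to a candidate file; the candidate names here are made up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

public class IdGuess {

    // ASSUMPTION: the document id is an MD5 hex digest of the file name.
    // Whether it is really MD5, and whether the name or the full path is hashed,
    // needs to be confirmed against FSCrawler itself.
    static String md5Hex(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String idFromLog = "375040c5d4baa5408ae296233dc6e79c";                  // taken from documents.log
        List<String> candidates = List.of("report.pdf", "data.csv", "notes.txt"); // hypothetical file names
        for (String name : candidates) {
            String hash = md5Hex(name);
            System.out.printf("%s -> %s%s%n", name, hash,
                    hash.equals(idFromLog) ? "   <== matches the failing document" : "");
        }
    }
}
```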