FSCrawler failure handling

Hello,

We are considering the question: how does FSCrawler manage failures?

What happens if FSCrawler fails to upload a certain file because of a failure on the Elasticsearch side? Does FSCrawler implement an automatic failure handling mechanism?

Consider the scenario where FSCrawler fails to upload a file because Elasticsearch is down.

  1. Can we have FSCrawler retry uploading the file? Can we configure number_retries?
  2. Can we have FSCrawler log the following:
    a. The file on which the failure happened
    b. The exception information

What is the structure (i.e. the fields) of the log file "documents.log"? I looked at that file but I didn't find a reference to the "file name" on which the failure happened.

Thank you

Hello François

FSCrawler will just log the failure to the logs and/or to the console. I'm not sure how the fs.continue-on-error setting (https://fscrawler.readthedocs.io/en/fscrawler-2.7/admin/fs/local-fs.html#continue-on-error) plays a role here. I did not look at the code, but I assume that this setting only applies to the "input" part and not the "output" part.

  1. Can we have FSCrawler retry uploading the file? Can we configure number_retries?

No and no. That's something that should be implemented. A Dead Letter Queue (DLQ) mechanism would be great to have.
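
Just to illustrate the idea (this is not FSCrawler code; every name below is made up for the sketch), a retry-then-dead-letter wrapper around a single document upload could look roughly like this:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Purely illustrative sketch: FSCrawler has no retry or DLQ mechanism today.
    // "IndexOperation" stands in for whatever sends one document to Elasticsearch.
    public class RetryWithDlq {

        @FunctionalInterface
        interface IndexOperation {
            void run() throws Exception;
        }

        // Try the upload up to maxRetries times (assumes maxRetries >= 1), then
        // append the failure (id, path, error) to a dead letter file so it can
        // be replayed later.
        static void indexWithRetry(String id, String path, IndexOperation op,
                                   int maxRetries, Path deadLetterFile) throws Exception {
            Exception last = null;
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                try {
                    op.run();
                    return;
                } catch (Exception e) {
                    last = e;
                    Thread.sleep(1000L * attempt); // naive linear backoff
                }
            }
            String line = String.format("%s\t%s\t%s%n", id, path, last.getMessage());
            Files.writeString(deadLetterFile, line,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

The real thing would have to plug into FSCrawler's bulk processor, but that's the general shape such a feature would take.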

  2. Can we have FSCrawler log the following:
    a. The file on which the failure happened
    b. The exception information

See my comments below. I think the exception is provided. Is that not the case?

The code is:

    // Writes one line per failed document to documents.log: [id][path] error
    public static void documentError(String id, String path, String error) {
        documentLogger.error("[{}][{}] {}", id, path, error);
    }

But when the document is sent to Elasticsearch, the path value is "unknown" within the bulk request. I think that would be a great thing to add. Would you like to open a feature request?

If you activate debug mode for documents, you can see the id of the document together with the full path; then, when you have an error, you can try to find the same id in the debug output...

Not ideal for sure.

I created this new feature request:

Hello David, thanks for your quick reply.

Below is an example from the documents log:

2021-07-01 22:11:53,627 [ERROR] [375040c5d4baa5408ae296233dc6e79c][null] ElasticsearchException[Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];]]

As per my understanding of your explanation, it logs mainly three things:

375040c5d4baa5408ae296233dc6e79c --> the id
null --> the path (to be added)
Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];] --> the exception

I am supposing that the id field is what FSCrawler assigns to the Elasticsearch document. Is that true?

Great. Thanks!

That's true.
It's using the full path to generate an id.
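
The id in your example is 32 hexadecimal characters, so it looks like an MD5 digest of that path string. Assuming that's the case (sketch only, the exact string that gets hashed may differ), you can recompute the id for a candidate file and compare it with what documents.log prints:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Sketch: recompute the 32-character hex id for a given path and compare it
    // with the id printed in documents.log. The path below is just an example.
    public class IdFromPath {
        public static void main(String[] args) throws Exception {
            String path = "/tmp/es/my-file.pdf"; // hypothetical path
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(path.getBytes(StandardCharsets.UTF_8));
            System.out.println(String.format("%032x", new BigInteger(1, digest)));
        }
    }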

Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];

About this error: did you intentionally change the watermark settings to make FSCrawler fail? For test purposes, I mean.

Thanks.

No, this failure happened without any intervention from my side.

Ok. So that's something you really need to fix then! :slight_smile:

Make sure you have enough disk space on your machine, especially on a production machine.

Yes I will. Thanks for mentioning this.
This was from my development machine, which is full of databases :slight_smile:

On a dev machine, you can change the watermark thresholds or disable the behavior by setting cluster.routing.allocation.disk.threshold_enabled to false. See Cluster-level shard allocation and routing settings | Elasticsearch Guide [7.14] | Elastic for more info.
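
For example, on a throwaway dev cluster (assuming it is reachable at http://localhost:9200 with security disabled; adjust otherwise), you could apply it as a transient cluster setting like this:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Dev-only example: disable the disk threshold decider on a local, unsecured
    // cluster. Never do this in production; free up disk space instead.
    public class DisableDiskThreshold {
        public static void main(String[] args) throws Exception {
            String body = "{\"transient\":{\"cluster.routing.allocation.disk.threshold_enabled\":false}}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/_cluster/settings"))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // the cluster should acknowledge the change
        }
    }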
