FSCrawler failure handling

Hello,

We are considering the question: how does FSCrawler manage failures?

What happens if FSCrawler fails to upload a certain file because of a failure on the Elasticsearch side? Does FSCrawler implement an automatic failure handling mechanism?

Consider the scenario where FSCrawler fails to upload a file because Elasticsearch is down.

  1. Can we have FSCrawler retry uploading the file? Can we configure number_retries?
  2. Can we have FSCrawler log the following:
    a. The file on which the failure happened
    b. The exception information

What is the structure (i.e. the fields) of the log file "documents.log"? I looked at that file but I didn't find a reference to the "file name" on which the failure happened.

Thank you

Hello François

FSCrawler will just log the failure to the logs and/or to the console. I'm not sure how the fs.continue-on-error setting (https://fscrawler.readthedocs.io/en/fscrawler-2.7/admin/fs/local-fs.html#continue-on-error) plays a role here. I did not look at the code, but I assume that this setting only applies to the "input" part and not the "output" part.

  1. Can we have FSCrawler retry uploading the file? Can we configure number_retries?

No and no. That's something that should be implemented. A Dead Letter Queue (DLQ) mechanism would be great to have.
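
Just to illustrate the idea (this is not FSCrawler code; every name below is made up for the sketch), a retry-then-dead-letter wrapper around a single document upload could look roughly like this:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Purely illustrative sketch: FSCrawler has no retry or DLQ mechanism today.
    // "IndexOperation" stands in for whatever sends one document to Elasticsearch.
    public class RetryWithDlq {

        @FunctionalInterface
        interface IndexOperation {
            void run() throws Exception;
        }

        // Try the upload up to maxRetries times (assumes maxRetries >= 1), then
        // append the failure (id, path, error) to a dead letter file so it can
        // be replayed later.
        static void indexWithRetry(String id, String path, IndexOperation op,
                                   int maxRetries, Path deadLetterFile) throws Exception {
            Exception last = null;
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                try {
                    op.run();
                    return;
                } catch (Exception e) {
                    last = e;
                    Thread.sleep(1000L * attempt); // naive linear backoff
                }
            }
            String line = String.format("%s\t%s\t%s%n", id, path, last.getMessage());
            Files.writeString(deadLetterFile, line,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

The real thing would have to plug into FSCrawler's bulk processor, but that's the general shape such a feature would take.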

  2. Can we have FSCrawler log the following:
    a. The file on which the failure happened
    b. The exception information

See my comments below. I think the exception is provided. Is that not the case?

The code is:

    // Writes one line per failed document to documents.log: [id][path] error
    public static void documentError(String id, String path, String error) {
        documentLogger.error("[{}][{}] {}", id, path, error);
    }

But when the document is sent to Elasticsearch, the path value is "unknown" within the bulk request. I think that would be a great thing to add. Would you like to open a feature request?

If you activate debug mode for documents, you can see the id of the document together with the full path; then, when you have an error, you can try to find the same id in the debug output...

Not ideal for sure.

I created this new feature request:

Hello David, thanks for your quick reply.

Below is an example from the documents log:

2021-07-01 22:11:53,627 [ERROR] [375040c5d4baa5408ae296233dc6e79c][null] ElasticsearchException[Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];]]

As per my understanding of your explanation, it logs mainly three things:

375040c5d4baa5408ae296233dc6e79c --> the id
null --> the path (to be added)
Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];] --> the exception

I am supposing that the id field is what FSCrawler assigns to the Elasticsearch document. Is that true?

Great. Thanks!

That's true.
It's using the full path to generate an id.
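
The id in your example is 32 hexadecimal characters, so it looks like an MD5 digest of that path string. Assuming that's the case (sketch only, the exact string that gets hashed may differ), you can recompute the id for a candidate file and compare it with what documents.log prints:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Sketch: recompute the 32-character hex id for a given path and compare it
    // with the id printed in documents.log. The path below is just an example.
    public class IdFromPath {
        public static void main(String[] args) throws Exception {
            String path = "/tmp/es/my-file.pdf"; // hypothetical path
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(path.getBytes(StandardCharsets.UTF_8));
            System.out.println(String.format("%032x", new BigInteger(1, digest)));
        }
    }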

Elasticsearch exception [type=cluster_block_exception, reason=index [data_files] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];

About this error: did you intentionally change the watermark settings to make FSCrawler fail? For test purposes, I mean.

Thanks.

No, this failure happened without any intervention from my side.

Ok. So that's something you really need to fix then! :slight_smile:

Make sure you have enough disk space on your machine, especially on a production machine.

Yes I will. Thanks for mentioning this.
This was from my development machine, which is full of databases :slight_smile:

On a dev machine, you can change the watermark thresholds or disable the behavior by setting cluster.routing.allocation.disk.threshold_enabled to false. See Cluster-level shard allocation and routing settings | Elasticsearch Guide [7.14] | Elastic for more info.
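
For example, on a throwaway dev cluster (assuming it is reachable at http://localhost:9200 with security disabled; adjust otherwise), you could apply it as a transient cluster setting like this:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Dev-only example: disable the disk threshold decider on a local, unsecured
    // cluster. Never do this in production; free up disk space instead.
    public class DisableDiskThreshold {
        public static void main(String[] args) throws Exception {
            String body = "{\"transient\":{\"cluster.routing.allocation.disk.threshold_enabled\":false}}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/_cluster/settings"))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // the cluster should acknowledge the change
        }
    }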
