ES cluster seems to crash in the middle of snapshot creation and then returns Not Found errors when queried through the API wrapper

Hello!

We have an Elasticsearch cluster running in production. We are using github.com/olivere/elastic (v7) to create snapshots.

Here is how it goes:

  1. Create a *elastic.Client and verify that authentication works by fetching the ES version.

  2. Trigger snapshot creation in async mode. Then, every minute, check the status of the snapshot creation.

Here is what the logs look like:

time="2020-05-17T03:36:12Z" level=warning msg="error in fetching snapshot state. Try 1 of 5: elastic: Error 503 (Service Unavailable)" repository=esbackup-mw-elk-prod snapshot=20200516032609
time="2020-05-18T03:32:12Z" level=warning msg="error in fetching snapshot state. Try 2 of 5: elastic: Error 429 (Too Many Requests): [parent] Data too large, data for [<http_request>] would be [8162093384/7.6gb], which is larger than the limit of [8127315968/7.5gb], real usage: [8162093384/7.6gb], new bytes reserved: [0/0b] [type=circuit_breaking_exception]" repository=esbackup-mw-elk-prod snapshot=20200516032609
time="2020-05-18T03:40:12Z" level=warning msg="error in fetching snapshot state. Try 3 of 5: elastic: Error 503 (Service Unavailable)" repository=esbackup-mw-elk-prod snapshot=20200516032609
time="2020-05-20T10:15:09Z" level=warning msg="error in fetching snapshot state. Try 4 of 5: elastic: Error 404 (Not Found): [esbackup-mw-elk-prod:20200516032609] is missing [type=snapshot_missing_exception]" repository=esbackup-mw-elk-prod snapshot=20200516032609

Upon googling, I learned that circuit_breaking_exception can occur if there is less memory than needed to complete an operation [1].

What I don't understand is why it starts returning "Not Found" after a while and basically never recovers.

If I send a curl request to get the snapshot state, it reports just fine. If I use the elastic wrapper, it keeps returning Not Found errors.

[1] https://stackoverflow.com/questions/37216300/elasticsearch-circuit-breaking-exception-data-too-large-with-significant-terms
