Elasticsearch restoration of a huge dump fails with client request timeout errors

Hi Team,

I am trying to restore a 50 GB Elasticsearch dump into a new Elasticsearch cluster running with 3 replicas on a Kubernetes cluster.

Elasticsearch is running the 7.10.2 OSS image and uses managed-nfs-storage as the storage class. The restoration started without any issues, and the indices were in a yellow state at the very beginning.

Later, when I try to check the state of the indices, the API call continuously fails with a client request timeout error.
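
The index state check is just a listing of the indices through Kibana, something like the following (shown only as an illustration):

GET _cat/indices?v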

Error:
{"statusCode":502,"error":"Bad Gateway","message":"Client request timeout"}

I don't see any errors in the Elasticsearch pod logs, but I noticed some read/write errors from the NFS client deployed on the worker nodes.

NFS client errors observed in /var/log/messages on the worker nodes:

  Jun 21 14:57:48 cstream8-node kernel: NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
  Jun 21 14:57:53 cstream8-node kernel: __nfs4_reclaim_open_state: 603 callbacks suppressed
  Jun 21 14:57:53 cstream8-node kernel: NFS: __nfs4_reclaim_open_state: Lock reclaim failed!

Read/write errors from the nfsiostat command executed on the worker nodes:

nfs-server:/mnt/k8sMount/elastic-index-persistent-storage-index-1-pvc-ece1d0d2-29c7-4298-9732-4ccee96d29ed mounted on /var/lib/kubelet/pods/19a68986-2bc7-40ea-b439-1a234493d857/volumes/kubernetes.io~nfs/pvc-ece1d0d2-29c7-4298-9732-4ccee96d29ed:

           ops/s       rpc bklog
         700.248           0.000

read:              ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   5.752         710.568         123.532        0 (0.0%)           1.250          66.305          64.939     1763 (0.8%)
write:             ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   1.961         999.801         509.876        0 (0.0%)           8.403        2147.932        2139.478      813 (1.0%)

I have 3 questions regarding the issue:

  1. Is there any limitation on the data size of the dump being restored?
  2. Does Elasticsearch support NFS as a storage class?
  3. Is there any way to restore such a huge dump without hitting any such errors?

Details:
OS - CentOS 8 Stream
Platform - RKE2 cluster
NFS version - NFSv4

Please let me know if you need any more details.

How are you restoring it?

  • Is there any limitation on the data size of the dump being restored?
    --> Not sure about the upper limit, but 50 GB is a really small data set and you should be able to restore it easily, depending on how powerful your data nodes are.
  • Does Elasticsearch support NFS as a storage class?
    --> ECK does not come with its own storage mechanism for Elasticsearch data. It is compatible with any Kubernetes storage option. It is recommended to use PersistentVolumes by configuring the volumeClaimTemplates section of the Elasticsearch resource.
    Refer to https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-storage-recommendations.html for more details.
  • Is there any way to restore such a huge dump without hitting any such errors?
    --> Are you trying to query the cluster where the restore is in progress?

Can you provide more context on what you are trying to restore and how? It is not clear.

Did you create a dump file using elasticdump or a similar tool? If yes, this is not an officially supported way of restoring data to an Elasticsearch cluster; the official and recommended way to restore data into a cluster is using the Snapshot and Restore APIs.

NFS is supported, but it is not recommended for data paths, only for snapshot repositories.

Need more context, as it is not clear what you are trying to restore and if this is an Elasticsearch issue or an infrastructure issue.

  • Not sure about the upper limit, but 50 GB is a really small data set and you should be able to restore it easily, depending on how powerful your data nodes are.
    -> Can you please tell me how to check if the data nodes are really capable of handling this request?

  • Are you trying to query the cluster where the restore is in progress?
    -> Yes, I am trying to list the indices while the restoration is in progress, because the restoration is taking a very long time.

  • Did you create a dump file using elasticdump or a similar tool?
    -> No, I am not using any tools for the dump and restore. Instead, I used the Elasticsearch snapshot APIs to take the dump and restore it (see also the snapshot-creation sketch after this list):
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/usr/share/elasticsearch/backups",
    "compress": true
  }
}

POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices" : "filebeat-*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "include_aliases": false
}
  • NFS is supported, but it is not recommended for data paths, only for snapshot repositories.
    -> Actually, I am using NFS storage for both the snapshot repository and the data paths of each Elasticsearch replica.

  • Need more context, as it is not clear what you are trying to restore and if this is an Elasticsearch issue or an infrastructure issue.
    -> I also suspect the NFS storage, since there are read/write errors appearing in the NFS statistics.
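
For completeness, the snapshot itself was created against the same repository with the create snapshot API, roughly like the following (shown for illustration; the exact options may have differed):

PUT _snapshot/my_backup/snapshot_1?wait_for_completion=false
{
  "indices": "filebeat-*",
  "include_global_state": false
}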

This version is very old, long past EOL. You must upgrade as a matter of urgency.

This error is not coming from Elasticsearch; it's a timeout imposed by something else. It does not mean that the restore has failed. I don't remember if anything changed in the few years since 7.10 was released, but in supported versions the restore will carry on in the background and eventually complete successfully.
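
If you want to confirm that the restore is still making progress rather than stuck, one illustration (not the only way) is to poll the recovery API and look for active recoveries of type snapshot:

GET _cat/recovery?v&active_only=true

As long as shards of the restored indices show up there, the restore is still running.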

It's a little more subtle than either of these statements, and I think the manual describes it best:

Elasticsearch requires the filesystem to act as if it were backed by a local disk, but this means that it will work correctly on properly-configured remote block devices (e.g. a SAN) and remote filesystems (e.g. NFS) as long as the remote storage behaves no differently from local storage. [...] The performance of an Elasticsearch cluster is often limited by the performance of the underlying storage, so you must ensure that your storage supports acceptable performance. Some remote storage performs very poorly, especially under the kind of load that Elasticsearch imposes, so make sure to benchmark your system carefully before committing to a particular storage architecture.

i.e. you're free to use NFS if you want, but it's on you to make sure it is properly configured and performs adequately. If it doesn't, you need someone with NFS expertise to help you (and this is probably not the best forum to find such help).

Thanks @DavidTurner for the details. Actually, I am using version 7.10.2 because it is the last version available in the OSS release.

I would like your suggestion on using the latest versions of ELK, since there have been a lot of changes to the licensing.

I have gone through the documentation but have some doubts about using the latest versions.

  • Are they really free, the same as OSS?
  • Are there any restrictions that I need to consider while using the latest versions of ELK?

Also, I suspect that I might hit the same error with the latest versions of ELK if the issue is with NFS and not with ELK.

Can you please share any pointers in case the issue is with the NFS client?

7.10.2 is EOL and no longer supported. Please upgrade ASAP.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

See FAQ on 2021 License Change | Elastic for further details. Basically the only restriction is you cannot sell Elasticsearch-as-a-service.

I have no further information beyond the docs link in my previous post. You will need to find some NFS experts to help you, and this isn't the right forum for that.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.