HDFS storage options

I spent some time looking into how ES can integrate with HDFS.
We have an ES cluster running on top of YARN and want the cluster to be fail-safe, e.g. to survive a YARN restart.

My conclusion is:

  • (1) you can mount HDFS as NFS and point ES to an NFS path (downside: slowdown)
  • (2) you can use repository-hdfs and 'manually' take care of backup and restore to and from HDFS
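
For reference, option (2) boils down to registering an HDFS snapshot repository and then calling the snapshot and restore endpoints. Below is a minimal Python sketch of those three calls. It assumes the repository-hdfs plugin is installed on every node; the node address, NameNode URI, repository path, and the repository and snapshot names are all made up for illustration, and the `uri`/`path` setting names should be checked against the plugin version in use.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of an ES node


def hdfs_repository_body(namenode_uri, path):
    """Build the body that registers an HDFS snapshot repository."""
    return {"type": "hdfs", "settings": {"uri": namenode_uri, "path": path}}


def snapshot_url(repo, snapshot=None, restore=False):
    """Build the _snapshot endpoint for a repository, a snapshot, or a restore."""
    url = f"{ES_URL}/_snapshot/{repo}"
    if snapshot is not None:
        url += f"/{snapshot}"
        if restore:
            url += "/_restore"
    return url


def send(method, url, body=None):
    """Send a JSON request to ES and return the decoded response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def backup_and_restore_example():
    """Needs a live cluster with repository-hdfs installed; not called here."""
    repo, snap = "hdfs_backup", "snapshot_1"  # hypothetical names
    # 1. Register the repository (hypothetical NameNode URI and path).
    send("PUT", snapshot_url(repo),
         hdfs_repository_body("hdfs://namenode:8020", "elasticsearch/backups"))
    # 2. Take a snapshot of all indices and wait until it finishes.
    send("PUT", snapshot_url(repo, snap) + "?wait_for_completion=true")
    # 3. Restore it, e.g. on a fresh cluster after a YARN restart.
    send("POST", snapshot_url(repo, snap, restore=True))
```

Note that a restore only works for indices that are closed or do not exist yet on the target cluster, and that anything indexed after the last snapshot is lost, so step 2 would have to run on a schedule.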

Are there any other options?
Also, I'm still undecided on whether to use ES 1.x or 2.x. Does it matter in this respect?

Running ES on top of NFS-mounted HDFS, as in option (1), will be really slow, to the point where it'd be unusable, and we do not recommend it.

OK, understood. Don't use option (1) / NFS-HDFS.
But are those all my options? Isn't there an option (3) where all my data is persisted in HDFS but the nodes operate on a local copy, or anything like that?

Actually, you can. You can keep HDFS as the primary storage and load the data from HDFS into ES, where it lives on the local disks of the ES nodes.

And how is that configured?

What I think @ssatapathy is suggesting is to keep your data in HDFS (primary storage) and load it into ES through Hadoop jobs. ES itself runs with its out-of-the-box configuration, writing data to the local disks/storage.
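
To make the loading step a bit more concrete: whatever runs the job (a connector such as elasticsearch-hadoop, or a hand-written task), each batch ends up as a bulk indexing request against ES. The Python sketch below only illustrates the shape of that request body. It assumes the records in HDFS are stored as one JSON document per line; the function names, index name, and type name are hypothetical, and `doc_type` is optional because only older ES versions such as 1.x and 2.x need a type.

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def bulk_body(docs, index, doc_type=None):
    """Build a _bulk request body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        meta = {"_index": index}
        if doc_type is not None:
            meta["_type"] = doc_type  # needed on ES 1.x/2.x
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(doc))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Example: two records as they might be read from a file in HDFS.
records = parse_json_lines('{"user": "a", "hits": 1}\n\n{"user": "b", "hits": 2}\n')
body = bulk_body(records, index="logs", doc_type="event")
```

The resulting `body` would be POSTed to the `_bulk` endpoint of any ES node. Since HDFS stays the source of truth, a lost ES cluster can be rebuilt by re-running the job rather than by restoring a snapshot.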