Is there a Hadoop InputFormat to read ES snapshot files in Hadoop?


I need to import the data from an Elasticsearch cluster, but I am not allowed to read from the cluster directly. I know I could restore the snapshot into another cluster. I am just wondering if I can read the snapshot files directly from Hadoop.

Another question: does repository-hdfs write snapshots in the same format as a local file system repository?

(Costin Leau) #2

You can read the snapshot, but you are on your own. Note that there are no guarantees about its format, or that the format will remain the same across versions.

repository-hdfs only exposes HDFS to the Snapshot API; it does not alter or interfere with the file format.
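
To illustrate what "exposing HDFS to the Snapshot API" looks like in practice, here is a minimal sketch that builds (but does not send) the repository registration request. The `hdfs` type and the `uri` and `path` settings come from the repository-hdfs plugin; the helper name, host names, and paths are hypothetical, so adjust them for your cluster and plugin version.

```python
import json
import urllib.request


def build_hdfs_repo_request(es_url, repo_name, hdfs_uri, hdfs_path):
    """Build the PUT /_snapshot/<repo> request that registers an HDFS repository.

    Hypothetical helper: it only constructs the request, nothing is sent.
    """
    body = {
        "type": "hdfs",           # repository type provided by repository-hdfs
        "settings": {
            "uri": hdfs_uri,      # e.g. hdfs://namenode:8020/ (assumed host)
            "path": hdfs_path,    # directory inside HDFS for the snapshot files
        },
    }
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/_snapshot/{repo_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_hdfs_repo_request(
        "http://localhost:9200", "cold_data", "hdfs://namenode:8020/", "es/snapshots"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would register the repository. Snapshots taken into it use the same layout as any other repository type, only stored on HDFS.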

(Chenryn) #3

Hundreds of days later, is there any new open source repo that implements this?

(Costin Leau) #4

Not that I'm aware of, and furthermore it's not something I'd recommend. It's much faster and easier to just export the data as JSON or in another format, since a snapshot contains not just the data but also the ES metadata, which is version-specific and meant for ES only.
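
One common way to do such an export is the scroll API. The sketch below is a rough illustration, not a definitive implementation: it assumes the request shapes of recent Elasticsearch versions (`POST /<index>/_search?scroll=...`, then `POST /_search/scroll`), and `send` is a hypothetical callable that you supply to perform the HTTP call and return the parsed JSON response.

```python
import json


def scroll_export(send, index, out, page_size=1000, keep_alive="1m"):
    """Write the _source of every document in `index` to `out` as JSON lines.

    `send(method, path, body)` is a hypothetical transport function that
    performs the HTTP request and returns the decoded JSON response.
    Returns the number of documents written.
    """
    # Open the scroll; sorting by _doc is the cheapest order for a full export.
    resp = send(
        "POST",
        f"/{index}/_search?scroll={keep_alive}",
        {"size": page_size, "sort": ["_doc"], "query": {"match_all": {}}},
    )
    scroll_id = resp["_scroll_id"]
    count = 0
    try:
        # An empty page of hits means the scroll is exhausted.
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                out.write(json.dumps(hit["_source"]) + "\n")
                count += 1
            resp = send(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = resp["_scroll_id"]
    finally:
        # Release the server-side scroll context even if writing failed.
        send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
    return count
```

Called with a file opened on HDFS or local disk as `out`, this produces one JSON document per line, which Hadoop's plain text input formats can split and read without knowing anything about ES internals.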

(Chenryn) #5

Using the scroll API to export data as JSON is much slower than the snapshot API.
I just thought that running Hadoop jobs by writing ES queries could be easier than writing pure batch MapReduce code. We could also save space: instead of transferring the data to both ES and Hadoop, we would only need to snapshot the cold data to Hadoop.

Something like Splunk's Hunk product.
