I am evaluating ElasticSearch with the Hadoop gateway.
We are planning to store one year's worth of log data in Hadoop and use ElasticSearch to index and search real-time log data. One year of data will be a few hundred terabytes.
In our test environment we have a 5-node Hadoop cluster and a 3-node ElasticSearch cluster.
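For context, the ElasticSearch nodes point at HDFS with a gateway section along these lines in elasticsearch.yml (if I have the setting names right; the namenode host and path below are placeholders, not our real values):

gateway:
    type: hdfs
    hdfs:
        uri: hdfs://namenode-host:9000
        path: /elasticsearch/gateway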
ElasticSearch keeps writing to local disk and also writes to the Hadoop gateway. Every time I test, I run into either memory issues or disk space issues. I resolved the memory issue by increasing the JVM memory on the machines.
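To be concrete about the memory fix, I raised the heap that the stock startup script gives ElasticSearch, roughly like this (the value is just what I used on the test boxes; I understand the usual advice is about half of physical RAM):

export ES_HEAP_SIZE=8g    # picked up by the standard bin/elasticsearch script, sets -Xms/-Xmx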
Is there any guideline for how much disk space we should allocate on the ElasticSearch data nodes?
For instance, if there are 200 terabytes of data to recover from the Hadoop gateway, how much disk space should be available on each data node of a 4-node ElasticSearch cluster?
In my testing, I found that the ElasticSearch data nodes collectively need enough local disk space to recover all the indices. So if there is 100 TB of data, does each data node in a 4-node ES cluster need 25 TB of local disk space? That is not scalable in production. How do we scale this kind of scenario? Do we have to keep adding nodes to the ElasticSearch cluster, or is there a way to keep only a certain amount of data on each ElasticSearch data node?
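Put another way, the sizing math I am assuming is roughly:

disk per data node ~= total index size * (1 + replica count) / number of data nodes

so 100 TB with no replicas on a 4-node cluster is 100 TB * 1 / 4 = 25 TB per node, and with one replica it would be 50 TB per node. Please correct me if that reasoning is wrong.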
I tried testing compression, but it did not work as expected.
I changed the compress configuration by running this curl request:
curl -XPUT localhost:9200/_settings -d '{"order": {"_source" : {"compress" : true}}}'
The index metadata shows that compression is set to true, but it still does not compress the data. Do I need to install any compression libraries, or does it require some other config change?
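In case it helps, the metadata I am looking at is what comes back from requests like these (the index name "order" is the same one from the request above):

curl -XGET 'localhost:9200/order/_settings?pretty=true'
curl -XGET 'localhost:9200/order/_mapping?pretty=true'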
What is the advantage of using the Hadoop gateway (or some other shared storage) when ElasticSearch also keeps writing data to local disk? The only advantage I see is HA.
Please give me some suggestions on how to scale to this amount of data.
Thanks in advance.