It's recommended to have 3 nodes (at least 2 data/master nodes plus one master-only node), in which case you'll have a copy of the data on the other nodes.
With Instance Storage, you're right, you'll need to make use of the features provided by Elasticsearch to make sure your data isn't lost.
First, have multiple nodes, preferably at least 3, and make sure to set index.number_of_replicas on each index to 1 or more. This ensures that Elasticsearch keeps copies of your data on multiple nodes: each shard is stored on index.number_of_replicas + 1 nodes. The higher you set this on each index, the more nodes can die without losing data.
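To make the replica arithmetic above concrete, here is a minimal sketch. The helper names are hypothetical and nothing is sent to a cluster; the JSON body is the shape accepted by the index settings update API (PUT /<index>/_settings).

```python
import json


def replica_settings_body(number_of_replicas):
    """JSON body for `PUT /<index>/_settings` (hypothetical helper name)."""
    if number_of_replicas < 0:
        raise ValueError("number_of_replicas must be >= 0")
    return json.dumps({"index": {"number_of_replicas": number_of_replicas}})


def copies_per_shard(number_of_replicas):
    """One primary plus its replicas."""
    return number_of_replicas + 1


def node_losses_tolerated(number_of_replicas, data_node_count):
    """Data nodes that can be lost at once with every shard still available.

    Elasticsearch never places two copies of the same shard on one node,
    so the number of copies actually allocated is capped by the node count.
    """
    allocated = min(copies_per_shard(number_of_replicas), data_node_count)
    return allocated - 1
```

For example, with index.number_of_replicas set to 1 on two data nodes, node_losses_tolerated(1, 2) gives 1: either data node can die and a full copy remains.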
Second, use Snapshots to periodically take backups of your data to a storage service like S3. This will ensure that you can recover your data even if your entire cluster dies.
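As a rough sketch of what the snapshot setup looks like at the API level, the helpers below only build the requests; they do not send anything. The function names, repository name and bucket name are placeholders I made up. The paths follow the snapshot API (PUT /_snapshot/<repository> and PUT /_snapshot/<repository>/<snapshot>). Depending on your Elasticsearch version, the "s3" repository type may require the repository-s3 plugin, and credentials are configured separately.

```python
import json
from datetime import datetime, timezone


def register_s3_repo_request(repo, bucket):
    """Return (method, path, body) that registers an S3 snapshot repository."""
    body = {"type": "s3", "settings": {"bucket": bucket}}
    return "PUT", "/_snapshot/" + repo, json.dumps(body)


def create_snapshot_request(repo, when):
    """Return (method, path) that creates a snapshot named after a UTC time.

    Snapshot names must be lowercase, which this timestamp format satisfies.
    """
    name = when.astimezone(timezone.utc).strftime("snapshot-%Y%m%d-%H%M%S")
    return "PUT", "/_snapshot/" + repo + "/" + name
```

You would call create_snapshot_request on a schedule (cron or similar) and send the result with whatever HTTP client you already use.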
Which EC2 instance you should use depends almost entirely on the workload you'll have and your budget. I can't really give concrete guidance here other than in the most general terms:
A server that's running Elasticsearch and nothing else should typically top out at 64GB of RAM, with a little less than 32GB given to Elasticsearch's heap and the rest left for the OS to use for filesystem caching. If you need more performance beyond that, add more nodes rather than making existing nodes bigger.
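The sizing rule above can be sketched as a small calculation. Two things here are assumptions on my part rather than anything stated above: "a little less than 32GB" is taken as a 31GB cap, and smaller machines get half their RAM as heap, which is the usual rule of thumb. The function names are hypothetical; -Xms and -Xmx are the standard JVM heap flags.

```python
def suggested_heap_gb(ram_gb, heap_cap_gb=31):
    """Half of RAM (whole GB), capped at heap_cap_gb.

    The 31GB default is an assumed stand-in for "a little less than 32GB";
    whatever is not given to the heap stays with the OS filesystem cache.
    """
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb // 2, heap_cap_gb)


def jvm_heap_options(heap_gb):
    """JVM flags setting the minimum and maximum heap to the same size."""
    return ["-Xms%dg" % heap_gb, "-Xmx%dg" % heap_gb]
```

For a 64GB machine this suggests a 31GB heap, leaving roughly 33GB for filesystem caching.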
Beyond that and the disk IO considerations I mentioned above, the most reliable way to determine your hardware needs is to benchmark. We have a custom tool for benchmarking Elasticsearch called Rally that can be customized pretty easily to replicate your workload.
Thanks Gordon, I am actually using three master nodes across two availability zones. From a workload point of view, it's a brand new setup. Take a retail store like "amazon.com" as an example: I would expect a similar kind of load in about two years, so my EC2 instances should be able to last for at least three years. Do these EC2 instance recommendations hold true for data/coordinating nodes as well?