We have around 5 billion records per year, with a retention period of 1 year. A typical query will involve a date range selection, around 10 filters, and some aggregations on top. Query latency should be as low as possible, and the data will be queried very frequently.
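For illustration, the queries would look roughly like this (index and field names are simplified placeholders), e.g. via the Python client:

```python
# Hypothetical sketch of the query shape described above: a date range,
# several exact-match filters, and an aggregation. Index and field names
# are made up for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="records",
    body={
        "size": 0,  # only the aggregations matter, skip the hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": "2020-01-01", "lt": "2020-02-01"}}},
                    {"term": {"status": "active"}},
                    {"term": {"region": "eu-west"}},
                    # ... around 10 filters of this shape in practice
                ]
            }
        },
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "@timestamp", "calendar_interval": "day"}
            }
        },
    },
)
print(response["aggregations"]["per_day"]["buckets"])
```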
What would the hardware requirements be for such a scenario? I know it depends on various other factors as well, but could you give a general idea of how many indices/nodes/shards should be created?
If I have calculated correctly, that corresponds to around 2TB of data on disk, assuming it grows proportionally.
If you truly want optimal performance, you generally want all data cached in memory on each host in the operating system's file system page cache. That will result in quite a large cluster with a lot of RAM.
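As a rough back-of-the-envelope sketch of why, using the ~2TB figure above (the per-node numbers are assumptions for illustration, not a sizing recommendation):

```python
# Back-of-the-envelope sizing sketch based on the ~2TB estimate above.
# All per-node numbers are assumptions for illustration only.
records_per_year = 5_000_000_000
data_on_disk_tb = 2                       # estimate from above
bytes_per_record = data_on_disk_tb * 1024**4 / records_per_year
print(f"~{bytes_per_record:.0f} bytes per record on disk")   # ~440 bytes

# To keep all data in the page cache: assume 64GB-RAM nodes where ~50%
# goes to the JVM heap, leaving ~32GB per node for the page cache.
page_cache_per_node_gb = 32
nodes_for_full_caching = data_on_disk_tb * 1024 / page_cache_per_node_gb
print(f"~{nodes_for_full_caching:.0f} nodes to cache everything")  # ~64 nodes
```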
If that is not an option, you are going to need storage that is as fast as possible, as storage is often the limiting factor in Elasticsearch. I would therefore recommend hosts with fast local SSDs.
As for node count and the exact CPU and RAM specification, I believe this is something you need to benchmark to find out.
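A quick-and-dirty latency check could look like the sketch below (reusing the hypothetical index from earlier); for proper benchmarking, our Rally tool is built exactly for this.

```python
# Minimal query-latency check: a sketch, not a substitute for a real
# benchmarking tool such as Rally. Index and query are hypothetical.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = {"bool": {"filter": [{"range": {"@timestamp": {"gte": "now-30d"}}}]}}

latencies = []
for _ in range(50):
    start = time.perf_counter()
    es.search(index="records", body={"size": 0, "query": query})
    latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
print(f"median: {latencies[len(latencies) // 2]:.1f} ms")
print(f"p95:    {latencies[int(len(latencies) * 0.95)]:.1f} ms")
```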
I watched the video, and it has cleared up many of my doubts.
I will look into structured data and custom mappings.
One thing: in my scenario, new data will be ingested on a monthly basis, and only read operations will be performed daily. In this case, will creating replica shards result in faster responses compared to using the primary shard alone? Currently I have only 1 primary shard, and the replica shard is unassigned.
Also, one node can have a maximum of 16GB RAM, so does creating multiple nodes sound right to you, maintaining the 1:16 ratio?
Replica shards help with high availability and are also the way to increase query throughput. If you only have a single node, replicas cannot be assigned, as Elasticsearch will not place multiple copies of the same shard on a single node.
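For example, once a second node has joined, you can raise the replica count dynamically (hypothetical index name):

```python
# Increase the replica count on an existing index. This is a dynamic
# setting; the replica will only be assigned once a second node joins.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.put_settings(
    index="records",
    body={"index": {"number_of_replicas": 1}},
)
```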
Nodes usually go up to around 64GB RAM (as we recommend staying below 32GB of heap and using 50% of RAM for heap), so I am not sure where you got this from.
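On a 64GB node that would translate to something like the following in jvm.options (31g is just an example value; the point is to stay below ~32GB so the JVM can use compressed object pointers):

```
# jvm.options on a hypothetical 64GB-RAM node: ~50% of RAM for heap,
# kept below ~32GB to stay within the compressed-oops limit.
# Min and max heap should be set to the same value.
-Xms31g
-Xmx31g
```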
Yes, a minimum of 3 nodes is required in order to achieve high availability.
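As a minimal sketch (hypothetical host names), each node's elasticsearch.yml would look something like:

```yaml
# elasticsearch.yml sketch for a 3-node cluster (hypothetical names).
cluster.name: records-cluster
node.name: node-1                      # node-2 / node-3 on the other hosts
discovery.seed_hosts: ["node-1", "node-2", "node-3"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
```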
The amount of heap you require will depend on your data and your query patterns, so you need to test to find out. It is recommended to run with as small a heap as possible (as long as this does not cause issues), as this generally results in faster GC.
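While testing, you can keep an eye on heap usage via the nodes stats API, for example:

```python
# Check heap usage per node while testing (sketch using the Python client).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    heap = node["jvm"]["mem"]
    print(node["name"], f'{heap["heap_used_percent"]}% heap used')
```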