Hi all,
In my case, I need to handle 3 TB of data in Elasticsearch. We have a job that rotates the index once a day, so each day a new index of roughly 100 GB is created. Over 30 days that gives 30 indices (30 * 100 GB = 3,000 GB). We need to build monthly reports over this data using aggregation queries (mostly terms aggregations), but running those aggregations across the full 3 TB crashes the client (coordinating) node. Could anyone help me with this?
Cluster details:
3 master nodes: 500 vCPU and 1 GB physical memory
5 data nodes: 1 vCPU and 1 GB physical memory
1 client node: 500 vCPU and 1 GB physical memory
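For reference, the monthly report queries look roughly like the sketch below (the `logs-*` index pattern, the `@timestamp` and `user.id` fields, and the bucket size are just placeholders, not our real mapping):

```
GET logs-*/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": "now-30d/d", "lt": "now/d" }
    }
  },
  "aggs": {
    "by_user": {
      "terms": { "field": "user.id", "size": 1000 }
    }
  }
}
```

Even with "size": 0 so no hits are returned, the terms buckets from all 30 daily indices have to be merged on the client node, and that is where it runs out of memory.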
That sounds like very limited resources in terms of CPU, memory, and heap for that data size, especially on the data nodes, so I am not surprised you are having issues aggregating over that data set.
I do not know exactly what will be required, so I would recommend gradually increasing resources until performance is sufficient. I would probably start with 16 GB RAM and 8 GB heap for the data nodes and a bit less for the coordinating-only node.
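As an example of what that looks like in practice (assuming a 16 GB data node and a recent Elasticsearch version that supports the jvm.options.d directory), you would set the heap on each data node with something like:

```
# config/jvm.options.d/heap.options on each 16 GB data node:
# give Elasticsearch half the RAM and leave the rest to the OS file system cache
-Xms8g
-Xmx8g
```

Keep -Xms and -Xmx equal so the heap is not resized at runtime, and leave the other half of the RAM to the operating system file system cache, which Lucene relies on heavily.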