I would like to get advice on scaling up an Elasticsearch cluster I currently manage. The cluster stores logs from many different environments (dev, staging, ...). Each environment gets its own index template, and indices are created daily (Logstash-style date-suffixed naming). We currently keep indices for 10 days per environment.
Index size varies between environments.
Our biggest daily index has around 35GB of data, for 140 million documents.
Our smallest daily index has around 100MB of data.
At the moment, in total, the cluster ingests about 250 million documents each day (~3k/s on average).
I would like to configure it to handle 1 billion documents per day (~12k/s).
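For reference, the per-second rates above are just the daily totals divided out; a quick sanity check (pure arithmetic, no cluster-specific assumptions):

```python
# Sanity-check the ingest rates quoted above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

current_docs_per_day = 250_000_000
target_docs_per_day = 1_000_000_000

current_rate = current_docs_per_day / SECONDS_PER_DAY  # ~2,894 docs/s (~3k/s)
target_rate = target_docs_per_day / SECONDS_PER_DAY    # ~11,574 docs/s (~12k/s)

print(f"current: {current_rate:.0f} docs/s, target: {target_rate:.0f} docs/s")
```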
The data is sent to Elasticsearch using Filebeat and Logstash, from various locations.
Our current setup is starting to struggle with the load. We see various errors, such as Logstash losing its connection to Elasticsearch, and occasionally Elasticsearch nodes crashing. Search is also getting slower and slower as more and more people use Kibana to browse logs.
The cluster setup is very basic: 3 nodes (EC2 m4.xlarge), each with 4 vCPUs, 16GB of RAM and an 800GB SSD. There are no dedicated master nodes, only data nodes.
In terms of config:

```
# elasticsearch.yml
thread_pool.search.queue_size: 100000
thread_pool.search.size: 20

# jvm.options
-Xms8g
-Xmx8g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Djdk.io.permissionsUseCanonicalPath=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
```
My first question is: do you see any issues with our current setup? Are there any config changes that would already improve performance?
My second question is: what would you recommend to scale this cluster to handle 1 billion documents per day?
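To make the second question more concrete, here is the back-of-the-envelope storage math I've been using. It assumes the ~250 bytes/doc of our biggest index is representative of the whole cluster and that each shard has one replica; both are assumptions on my part, not measured values:

```python
# Rough storage projection for the 1B docs/day target.
# Assumptions (not measured cluster-wide): on-disk size derived from our
# biggest index (35 GB / 140M docs), 1 replica per shard, 10-day retention.
BYTES_PER_DOC = 35e9 / 140e6   # = 250 bytes/doc
REPLICAS = 1
RETENTION_DAYS = 10

target_docs_per_day = 1_000_000_000
daily_primary_gb = target_docs_per_day * BYTES_PER_DOC / 1e9           # 250 GB/day
total_on_disk_gb = daily_primary_gb * (1 + REPLICAS) * RETENTION_DAYS  # 5,000 GB

current_capacity_gb = 3 * 800  # 3 nodes x 800 GB SSD = 2,400 GB
print(f"projected: {total_on_disk_gb:.0f} GB vs capacity: {current_capacity_gb} GB")
```

Under those assumptions, the projected on-disk footprint at the target rate already exceeds our total SSD capacity, which is part of why I'm asking about scaling out rather than just tuning.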