Currently we have two clusters. The larger one has 15 nodes with 61GB of memory each; the smaller one has a single node with 8GB of memory, but we see very similar behavior on both.
On each cluster we maintain a single index: ~2 billion documents on the larger cluster and ~20 million on the smaller one. We use a Spark job to write to Elasticsearch: an hourly job does a partial update of the existing index, and a nightly job recreates the index from scratch. Each hourly run updates roughly 5-10% of the documents in the index (we want to move to a more real-time architecture, so this is our medium-term solution). Both indexes use 120 shards with 3 replicas, 360 shards in total.
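For reference, the hourly job is essentially a bulk upsert through the elasticsearch-hadoop connector. The snippet below is a simplified sketch of that write path, not the exact job; the index name, ID column, paths, host, and batch settings are all illustrative placeholders:

```python
# Simplified sketch of the hourly partial-update job (illustrative values only).
# Assumes the elasticsearch-hadoop (elasticsearch-spark) connector is on the
# Spark classpath; "items" and "doc_id" are placeholder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-es-upsert").getOrCreate()

# Hypothetical source of the documents that changed in the last hour.
updates_df = spark.read.parquet("s3://bucket/hourly-updates/")  # placeholder path

(updates_df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-host:9200")            # placeholder host
    .option("es.resource", "items")                # placeholder index name
    .option("es.write.operation", "upsert")        # partial update of existing docs
    .option("es.mapping.id", "doc_id")             # placeholder ID column
    .option("es.batch.size.entries", "1000")       # docs per bulk request per task
    .option("es.batch.write.refresh", "false")     # don't force a refresh after each bulk
    .mode("append")
    .save())
```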
We use these indexes to serve our web application directly, so avoiding downtime is a big priority. What we find on both clusters, however, is that during the hourly Spark job, while documents are being indexed, we always hit ~75% JVM memory pressure on one or more nodes. CPU then spikes to around 100% on all data nodes, and any requests to Elasticsearch take an unacceptably long time to be served, often 20 or more seconds if they don't time out entirely. Our theory is that garbage collection is being triggered, and because a large number of documents sit in the indexing buffer and cannot be collected, the "stop-the-world" collector takes over (I'm not a JVM expert, so this is mostly what I've gathered from research).
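To sanity-check that theory, something like the following can pull per-node heap usage and old-generation GC activity from the nodes stats API while the hourly job runs (the host is a placeholder):

```python
# Check heap usage and old-generation GC activity per node via the
# _nodes/stats/jvm endpoint (host is a placeholder).
import requests

resp = requests.get("http://es-host:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(f"{node['name']}: heap {heap_pct}%, "
          f"old GC count {old_gc['collection_count']}, "
          f"old GC time {old_gc['collection_time_in_millis']} ms")
```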
We thought this issue could be solved by simply adding more memory, but even after scaling the smaller cluster up 6x (to 3 nodes with 16GB of memory each), JVM memory pressure still sits at around 60-70%, and the bulk indexing from the Spark job consistently causes the CPU spike and failing requests.
Can anyone lend some guidance on where we might be going wrong? Is Elasticsearch just not designed to handle indexing operations this large on a regular basis? Is the number of shards/replicas we're using way off for the hardware we're running on? Could it just be that the write concurrency from Spark is too high and we need to throttle it (something like the sketch below)?
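If throttling is the right direction, this is roughly what we'd try first, building on the write sketch above: fewer concurrent writer tasks and smaller bulk requests, so fewer documents pile up in the indexing buffers at once. The numbers are guesses, not tested values:

```python
# Hypothetical throttled write: cap the number of concurrent bulk writers and
# shrink each bulk request. Values are guesses, not tested settings.
# updates_df is the hourly-changes DataFrame from the sketch above.
throttled_df = updates_df.repartition(24)  # at most 24 writer tasks in parallel

(throttled_df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-host:9200")          # placeholder host
    .option("es.resource", "items")              # placeholder index name
    .option("es.write.operation", "upsert")
    .option("es.mapping.id", "doc_id")           # placeholder ID column
    .option("es.batch.size.entries", "500")      # smaller bulk requests
    .option("es.batch.size.bytes", "1mb")
    .mode("append")
    .save())
```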
Happy to provide any more information that might be helpful (apologies if I've left out something obvious; it's my first time asking a question here).