I've been using the ES-Hadoop (elasticsearch-hadoop) connector for Spark for the past 6 months, and I've never been able to sustain a bulk indexing rate above 70k docs per second.
Six months ago, with Spark 1.6 and ES 2.4, our indexing rate ranged from 10k to 20k docs per second. This cluster has dedicated master and client nodes (m3.large) with 6 data nodes (c3.2xlarge). This Marvel graph shows the indexing rate from one of our tests.
Since then we have upgraded to emr-5.2.0 with Spark 2.0, es-hadoop 5.1.1, and Elasticsearch 5.2.0. However, we have hit a ceiling of ~70k docs indexed per second.
Many of the bulk loading articles state that you need to start with one node, one shard, and one Spark partition to understand your baseline performance. We ran over 30 tests and found that our optimal rate is ~73k docs per second with two nodes. Adding more nodes simply decreased performance. Here are those tests.
So our next test would use a cluster with dedicated master, ingest and data nodes. We're using i2.2xlarge for all of them for the purpose of this test.
During bulk loading the 5 ingest nodes show relatively high spikes in CPU, but the 15-minute load average is around 1.5. The data nodes have higher CPU usage, at 60-80%. I'm also monitoring I/O wait with sysstat's sar and I never see any I/O wait time.
So the bottleneck isn't disk I/O and appears to be CPU. Separating ingest nodes from data nodes lets us scale them independently, and it looks like we need more ingest nodes than data nodes.
However, every time we scale the cluster and attempt to push more data over from Spark we end up getting the same indexing rate.
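One way to reason about a rate that stays flat no matter how large the Elasticsearch cluster gets is a back-of-the-envelope bound on the client side. The sketch below is only a model, not a measurement: it assumes each running Spark write task keeps a single bulk request in flight at a time, so the rate cannot exceed concurrent tasks × docs per bulk / bulk round-trip time. The function names and the example numbers (80 tasks, 1.0 s round trip) are hypothetical; 1000 docs per bulk is the es-hadoop default for `es.batch.size.entries`.

```python
import math


def max_docs_per_sec(concurrent_tasks, docs_per_bulk, bulk_round_trip_sec):
    """Upper bound on docs/sec, assuming one in-flight bulk request per task."""
    if bulk_round_trip_sec <= 0:
        raise ValueError("bulk_round_trip_sec must be positive")
    return concurrent_tasks * docs_per_bulk / bulk_round_trip_sec


def tasks_needed(target_docs_per_sec, docs_per_bulk, bulk_round_trip_sec):
    """Smallest number of concurrent tasks whose bound reaches the target rate."""
    if docs_per_bulk <= 0:
        raise ValueError("docs_per_bulk must be positive")
    return math.ceil(target_docs_per_sec * bulk_round_trip_sec / docs_per_bulk)


# Hypothetical numbers, for illustration only:
# 80 concurrent tasks, 1000 docs per bulk, 1.0 s per bulk round trip.
print(max_docs_per_sec(80, 1000, 1.0))   # 80000.0
print(tasks_needed(140000, 1000, 1.0))   # 140
```

If the observed rate sits close to this bound, adding Elasticsearch nodes would not be expected to help unless the bulk round-trip time drops or the number of concurrent write tasks goes up.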
Other notes:
- We ensure shards are evenly distributed across the nodes.
- We have tested different numbers of shards per node, but that has very little effect on performance.
- We have been controlling Spark parallelism with DataFrame repartitioning. I believe that is working well.
- We verified that I/O on the Spark side was not an issue.
- On the Spark side we are running 10x r3.2xlarge nodes.
- I recently ran a cluster with 11 Spark nodes, 5x ingest, 2x master and 10x data nodes with the same results.
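For anyone comparing notes, here is a minimal PySpark sketch of the kind of write described in the list above: repartition the DataFrame to control parallelism, then write through es-hadoop with explicit bulk settings. It is an illustration, not our exact job. The helper names, node names, index name, and partition count are hypothetical; the `es.*` keys are standard es-hadoop settings, and the defaults shown for batch size (1000 entries, 1mb) are the connector's documented defaults.

```python
def es_write_options(nodes, batch_entries=1000, batch_bytes="1mb",
                     refresh_after_bulk=False):
    """Build es-hadoop write settings as a dict of string options."""
    return {
        # Comma-separated list of Elasticsearch nodes to connect to.
        "es.nodes": ",".join(nodes),
        # Max docs and max bytes per bulk request, per Spark task.
        "es.batch.size.entries": str(batch_entries),
        "es.batch.size.bytes": batch_bytes,
        # Whether to refresh the index after each bulk write.
        "es.batch.write.refresh": "true" if refresh_after_bulk else "false",
    }


def write_to_es(df, resource, num_partitions, options):
    """Repartition a Spark DataFrame and append it to an index via es-hadoop."""
    (df.repartition(num_partitions)       # one write task per partition
       .write
       .format("org.elasticsearch.spark.sql")
       .options(**options)
       .mode("append")
       .save(resource))


# Hypothetical usage (requires a SparkSession and the es-hadoop jar on the classpath):
# opts = es_write_options(["ingest-1", "ingest-2"], batch_entries=5000, batch_bytes="5mb")
# write_to_es(df, "myindex/mytype", num_partitions=80, options=opts)
```

The number of partitions sets how many tasks write concurrently (up to the available executor cores), and each task sends its own bulk requests, so these two knobs together determine how much load Spark can actually put on the cluster.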
So I'm curious: are other people getting better indexing performance than we are?