I have an ES cluster running on cheap spot instances: 1 master, 5 data nodes, 10 shards, and 1 replica. A round-robin load balancer sits in front of the cluster.
During ingest, the load is heavily concentrated on two (small) nodes. I can't figure out why this is the case, as I have shards assigned to every node. Can anyone help me troubleshoot this?
Hard to tell from the picture. Are you just importing, or do you have an
ingest processor or something? Those nodes look like they have more disk
usage, so I wonder if you have a hot spot created by something like
parent/child.
I am just running an import, using a small Spark cluster to shove data into my ES cluster.
The two servers with high load do have less disk space. They were also the first data nodes to join the cluster, although I can't imagine why that matters. I don't have any custom routing.
I'm more of a data scientist than an infra guy, so I'm quite stumped!
In situations like this I tend to use the hot_threads API to look at the
guts and check for a smoking gun. If you post a gist of that output I can
have a look.
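For anyone following along, here is a minimal sketch of pulling that output with Python's standard library. The base URL and the helper names (`hot_threads_url`, `fetch_hot_threads`) are placeholders for illustration, and the `threads` and `interval` values are just reasonable starting points, not anything tuned for this cluster.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """Build the URL for the nodes hot threads endpoint.

    base_url is a placeholder such as "http://localhost:9200".
    """
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url, **kwargs):
    """Return the plain-text hot threads report for all nodes."""
    with urlopen(hot_threads_url(base_url, **kwargs)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Assumes a node is reachable at this address; adjust for your cluster.
    print(fetch_hot_threads("http://localhost:9200"))
```

The report is plain text, so it can be pasted straight into a gist.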
Thanks for pointing me in the right direction. I looked into hot_threads, went down a few other rabbit holes, and discovered that two of my shards were stuck "relocating." After I solved that issue, the cluster was able to ingest at 35k documents/sec, with a uniform load distribution.
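In case it helps someone with the same symptom, here is a rough sketch of how stuck shards can be spotted from the output of `GET _cat/shards?format=json`. The function names and the sample rows are made up for illustration and are not the actual data from my cluster. The handling of the `node` field for relocating shards (source node, then ` -> `, then the target) is my assumption about the format, so check it against your own output.

```python
import json
from collections import Counter


def non_started_shards(cat_shards_json):
    """Return rows whose state is anything other than STARTED
    (for example RELOCATING, INITIALIZING, or UNASSIGNED)."""
    rows = json.loads(cat_shards_json)
    return [r for r in rows if r.get("state") != "STARTED"]


def shards_per_node(cat_shards_json):
    """Count shard copies per node, skipping unassigned shards.

    Assumption: a relocating shard reports its node as
    "source -> target details", so only the part before " -> " is kept.
    """
    rows = json.loads(cat_shards_json)
    return Counter(
        r["node"].split(" -> ")[0] for r in rows if r.get("node")
    )


if __name__ == "__main__":
    # Hypothetical sample, shaped like _cat/shards?format=json output.
    sample = json.dumps([
        {"index": "docs", "shard": "0", "prirep": "p",
         "state": "STARTED", "node": "data-1"},
        {"index": "docs", "shard": "1", "prirep": "p",
         "state": "RELOCATING", "node": "data-1 -> 10.0.0.7 xyz data-4"},
        {"index": "docs", "shard": "1", "prirep": "r",
         "state": "STARTED", "node": "data-2"},
    ])
    for row in non_started_shards(sample):
        print(row["index"], row["shard"], row["prirep"], row["state"])
    print(shards_per_node(sample))
```

A shard that stays in a non-STARTED state across several polls is the kind of thing worth digging into.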