JVM version:
openjdk version "1.8.0_162"
OpenJDK Runtime Environment (build 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.162-b12, mixed mode)
OS version:
Linux elasticsearch-6-2-3-client-eu-0 4.13.0-1012-gcp #16-Ubuntu SMP Thu Mar 15 12:00:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
The instances are running on Google Cloud. Client and data node instances are both n1-standard-16s (16 CPUs and 60GB memory).
I'm trying to optimise the indexing rate, which is currently 25k/sec but this needs to be much faster. It seems like one of the possible issues is that one data node is taking on most of the work when bulk importing.
I bulk import in 10MB pieces and I have tried lowering that all the way to 3MB with no change. A new connection is created to each client and I round robin between them when importing. I have also tried a single connection rather than creating a new one for each client, with no difference.
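For reference, the batching and round-robin approach described above can be sketched roughly as follows. This is only an illustration, not the actual import code: the function names and the host names are hypothetical, and each entry is assumed to be one complete bulk action (the action line plus its source line) so a batch never splits a pair.

```python
from itertools import cycle
from typing import Iterable, Iterator, List, Tuple


def batch_by_bytes(entries: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group bulk entries into batches whose UTF-8 size stays within max_bytes.

    Each entry is assumed to be a complete action/source pair.
    An entry larger than max_bytes is still emitted, alone in its own batch.
    """
    batch: List[str] = []
    size = 0
    for entry in entries:
        entry_size = len(entry.encode("utf-8")) + 1  # +1 for the trailing newline
        if batch and size + entry_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(entry)
        size += entry_size
    if batch:
        yield batch


def assign_round_robin(
    batches: Iterable[List[str]], hosts: List[str]
) -> Iterator[Tuple[str, List[str]]]:
    """Pair each batch with the next host in rotation."""
    return zip(cycle(hosts), batches)


# Hypothetical client node names, for illustration only.
HOSTS = ["client-0", "client-1"]
```

With a 10MB limit, `max_bytes` would be `10 * 1024 * 1024`; each `(host, batch)` pair would then be sent as one bulk request to that client node.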
The setup that I have is as follows:
Master Nodes: Three
Client Nodes: Two
Data Nodes: Four
The bulk thread pool size is set to 17 on the client and data nodes.
What is the size of your documents? The number of events per second is generally not a very good measurement of performance as the document size has a significant impact on the amount of work needed to be done per document.
How are you indexing the documents? Do requests go to the client nodes or directly to the data nodes? How many concurrent indexing threads/processes do you use? How many indices and shards are you actively indexing into? Are these evenly distributed across the data nodes?
The size of the documents ranges pretty widely: some are less than 1KB, others go up to 15KB.
The requests go to the client nodes. I run 15 concurrent processes, each of them reusing the same HTTP connections to the clients. I make a new HTTP connection to each ES client and round robin between them when making a request. In the case I have described I am indexing into one index, and its shards seem to be evenly distributed across the data nodes.
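One way to confirm the distribution rather than eyeball it is to count primary shards per node from the `_cat/shards` API. A minimal sketch, assuming the cluster is reachable over plain HTTP without authentication; the base URL, index name and helper names are placeholders, not taken from this cluster.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_shards(base_url: str, index: str) -> list:
    """Fetch the shard rows for one index from _cat/shards as JSON."""
    url = f"{base_url}/_cat/shards/{index}?format=json&h=index,shard,prirep,state,node"
    with urlopen(url) as resp:
        return json.load(resp)


def primaries_per_node(rows: list) -> Counter:
    """Count started primary shards on each node."""
    return Counter(
        row["node"]
        for row in rows
        if row["prirep"] == "p" and row["state"] == "STARTED"
    )
```

For example, `primaries_per_node(fetch_shards("http://localhost:9200", "my-index"))` would return a mapping of node name to primary shard count; a node with a higher count than the others is expected to take more of the indexing work.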
Have you determined what is limiting performance? Is the above monitoring screenshot from when you are indexing? What do CPU, disk I/O and iowait look like? Are you using nested documents and/or parent/child? Are you specifying the document IDs from the application or allowing Elasticsearch to set them?
Well I'm pretty certain it is to do with the data nodes being unbalanced during a bulk import. I would assume that the load should be distributed across all data nodes rather than giving one data node most of the load?
Yes, the screenshot above was taken during indexing. CPU is fine aside from one data node being given most of the work. I just imported again and got this as a result:
5 shards across 4 data nodes means one of the nodes will have twice the load of the others. Do you by any chance have 2 shards on the node with higher load?
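The arithmetic behind that can be sketched as below. It assumes shards are spread as evenly as possible, only primary shards are counted, and every shard receives the same indexing load; the function names are just for illustration.

```python
def shards_per_node(num_shards: int, num_nodes: int) -> list:
    """Most even possible split of shards over nodes, largest counts first."""
    base, extra = divmod(num_shards, num_nodes)
    return [base + 1 if i < extra else base for i in range(num_nodes)]


def load_ratio(num_shards: int, num_nodes: int) -> float:
    """Ratio between the busiest and the least busy node."""
    counts = shards_per_node(num_shards, num_nodes)
    return max(counts) / min(counts) if min(counts) else float("inf")


print(shards_per_node(5, 4), load_ratio(5, 4))  # [2, 1, 1, 1] 2.0
print(shards_per_node(4, 4), load_ratio(4, 4))  # [1, 1, 1, 1] 1.0
```

A shard count that is a multiple of the number of data nodes avoids this kind of built-in imbalance.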
I just dropped down to 2 data nodes using 5 shards and the same problem occurs. One data node is being given most of the load. Any idea why one data node is being given most of the load instead of it being balanced?