What are you using for your benchmarks? We recommend Rally. Benchmarking is a tricky thing to get right and it's very easy to introduce errors or to measure something different from what you think you're measuring.
Are you indexing via the single client node? If you add a second client node, do you get higher throughput? If so, it looks like you're measuring the throughput of a single client node rather than the capacity of the whole cluster.
We are using our own simulator to generate the documents.
Yes, we are using a single coordinating (client) node. We also tried with 2 coordinating nodes, but no luck.
Is it possible to index 100k documents per second? Do we need to change any cluster configuration?
My document size is around 1 KB, and each document contains 20 attributes (key-value pairs).
I went through the Kibana monitoring and it shows low CPU and disk utilization.
I don't know what is blocking. Please guide me on how to analyse it.
If it's not the client node, the next thing I'd suspect is the test harness itself. You're trying to push about 100MB/s of data at the cluster, and it's possible that the test harness you've written just isn't fast enough to do this.
If the bottleneck is Elasticsearch then I'd expect to see tasks building up on the client node. You can see how busy the client node is by looking at things like GET /_nodes/CLIENT-NODE-NAME/stats/thread_pool and GET /_tasks?nodes=CLIENT-NODE-NAME. If the client node looks quiet then the bottleneck is outside the system.
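If it helps, here is a minimal sketch of those same checks using the Python client; the node name client-node-1, the localhost address and the thresholds are placeholders, not values from this thread:

```python
# Minimal sketch, assuming elasticsearch-py and a coordinating node named
# "client-node-1"; both the node name and the address are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Thread pool stats for the client node: a growing "queue" or a non-zero
# "rejected" count on the write pool ("bulk" on older versions) suggests
# the cluster, not the test harness, is the bottleneck.
stats = es.nodes.stats(node_id="client-node-1", metric="thread_pool")
for node in stats["nodes"].values():
    write_pool = node["thread_pool"]["write"]
    print(node["name"], "queue:", write_pool["queue"], "rejected:", write_pool["rejected"])

# Write tasks currently in flight on that node; a quiet task list points
# at a bottleneck outside Elasticsearch.
tasks = es.tasks.list(nodes="client-node-1", actions="indices:data/write*", detailed=True)
in_flight = sum(len(n.get("tasks", {})) for n in tasks.get("nodes", {}).values())
print("write tasks in flight:", in_flight)
```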
I was able to achieve 100K rps using Rally (the benchmark tool) via the bulk API. But using a larger number of single-document index requests (in parallel), indexing throughput could not go beyond 1000 rps.
Is the bulk API the only way to achieve 100K rps? Or does something else need to be tuned for the single-document API?
Indexing a single document at a time results in a lot more overhead per document in terms of request processing and syncing to disk and is therefore always going to be considerably slower than using bulk requests.
To get the best throughput (and lower latency too) it's normally best to collect your requests into reasonable-sized batches before indexing them. You say you have a realtime scenario: what exactly are your latency targets?
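As an illustration (not from this thread), a minimal client-side batching sketch with the Python client might look like the following; the index name events, the chunk size and the generated documents are all assumptions:

```python
# Minimal sketch of client-side batching with elasticsearch-py; the index
# name "events", the chunk size and the generated documents are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(docs):
    # Wrap each ~1 KB, 20-attribute document as a bulk index action.
    for doc in docs:
        yield {"_index": "events", "_source": doc}

docs = ({f"field_{i}": "value" for i in range(20)} for _ in range(100_000))

# streaming_bulk groups documents into one bulk request per chunk instead of
# one request per document, which is what keeps the per-document overhead low.
for ok, result in helpers.streaming_bulk(es, actions(docs), chunk_size=2000):
    if not ok:
        print("failed:", result)
```

The chunk size is the main knob to trade throughput against how long a document waits in the client before it is sent, which is why your latency targets matter here.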