I am using a Kafka connector to load data from a Kafka topic into an Elasticsearch index. Throughput into the Kafka topic is around 2000 requests per second, but throughput at Elasticsearch is much lower, around 500-800 requests per second (5 primary shards + 1 replica). Any suggestions to improve the indexing rate?
Thanks a lot for replying. Each event is about 16 KB.
We have an Elasticsearch cluster with 6 nodes: 3 masters (only one dedicated), 4 data nodes and 1 client node. The bulk size from the connector is currently 2000. I tried changing the batch size to 3000, but there was no improvement. With replicas set to zero we get a throughput of around 1200 requests per second, but we need at least one replica in production and are aiming for an indexing throughput of 2000 requests per second.
Do you have monitoring installed? What is the CPU usage and disk I/O looking like on the data nodes during indexing? How many CPU cores does each node have? What type of storage is being used?
How many concurrent connections/threads does the Kafka connector use to index into the cluster?
We are monitoring via X-Pack in Kibana. CPU usage is below 50% most of the time on all nodes. Each node is configured with 14 GB RAM and an 8-core processor. There are 5 concurrent tasks running in the Kafka connector to load the data.
Assuming that you are sending bulk requests to all data nodes, I would recommend increasing the number of parallel indexing threads. Increase slowly and monitor indexing throughput until you see no further gain. 5 connections/threads sounds a bit low for a cluster that size, in my opinion.
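To make the "increase slowly until there is no further gain" procedure concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `find_plateau`, `measure` and `min_gain` are hypothetical names, and `measure` stands for whatever benchmark run you do at a given concurrency (for a Kafka Connect sink that would typically mean changing `tasks.max`, which is capped by the number of topic partitions). The throughput figures in the usage lines are placeholders, not measurements from this cluster.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_plateau(
    measure: Callable[[int], float],
    levels: Iterable[int],
    min_gain: float = 0.05,
) -> Tuple[Optional[int], float]:
    """Step through concurrency levels until throughput stops improving.

    measure(level) is expected to run a benchmark at that concurrency and
    return documents per second. The loop stops at the first level whose
    throughput is less than min_gain (a fraction) better than the best so far.
    """
    best_level: Optional[int] = None
    best_rate = 0.0
    for level in levels:
        rate = measure(level)
        if best_level is not None and rate < best_rate * (1.0 + min_gain):
            break  # no meaningful gain any more, keep the previous level
        best_level, best_rate = level, rate
    return best_level, best_rate


# Placeholder numbers for illustration only, NOT measurements from this cluster.
fake_results = {5: 700.0, 8: 1000.0, 12: 1400.0, 16: 1420.0, 20: 1400.0}
level, rate = find_plateau(fake_results.__getitem__, sorted(fake_results))
print(f"stop at {level} tasks, about {rate:.0f} docs/s")
```

In a real run, `measure` would apply the new concurrency, let indexing settle for a few minutes, and read the indexing rate from monitoring.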
As your documents are quite large, a bulk size of 2000 does sound a bit big. I would probably recommend going with a smaller bulk size rather than a larger one, but what you have may also be appropriate. You need to benchmark to know for sure.
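A quick back-of-the-envelope calculation shows why the bulk size looks large with 16 KB events. This is a rough Python sketch: `bulk_payload_mb` is a hypothetical helper, and it ignores the small action/metadata line that the bulk format adds per document.

```python
def bulk_payload_mb(docs_per_bulk: int, doc_size_kb: float) -> float:
    """Approximate bulk request body size in MB (document bodies only)."""
    return docs_per_bulk * doc_size_kb / 1024


# 16 KB is the event size mentioned in this thread; the bulk sizes are candidates to benchmark.
for docs in (250, 500, 1000, 2000, 3000):
    size_mb = bulk_payload_mb(docs, 16)
    print(f"{docs:>5} docs x 16 KB = about {size_mb:.1f} MB per bulk request")
```

At 2000 documents per bulk, each request carries roughly 31 MB, and at 3000 roughly 47 MB, so trying smaller batches with more parallel requests is a reasonable direction to benchmark.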
I was just going through blogs on improving indexing performance and came across the indices.cluster.send_refresh_mapping property. I tried setting it to false in elasticsearch.yml, but it is reported as an unknown setting. Could you tell me how to set this? We are using Elasticsearch 5.4.0.
I would recommend optimising throughput by benchmarking different bulk sizes and numbers of concurrent connections before starting to experiment with expert-level settings, as Elasticsearch generally comes with good defaults. One thing you may however want to change at the index level is the refresh interval.
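For reference, the refresh interval is a dynamic index setting (`index.refresh_interval`, default `1s`) that can be changed through the index `_settings` endpoint. The Python sketch below only builds the request using the standard library. The host `http://localhost:9200`, the index name `my-index` and the value `30s` are assumptions for illustration, and `refresh_interval_request` is a hypothetical helper, so pick a value based on how fresh your searches need to be.

```python
import json
import urllib.request


def refresh_interval_request(base_url: str, index: str, interval: str) -> urllib.request.Request:
    """Build a PUT <index>/_settings request that changes index.refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Host, index name and interval are placeholders, not values from this thread.
req = refresh_interval_request("http://localhost:9200", "my-index", "30s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually apply the setting: urllib.request.urlopen(req)
```

A longer refresh interval means newly indexed documents take longer to become searchable, so this trades search freshness for indexing throughput.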