I am using the Kafka connector to load data from a Kafka topic into an Elasticsearch index. Throughput into the Kafka topic is around 2000 requests per second, but throughput at Elasticsearch is much lower, around 500-800 requests per second (5 primary shards + 1 replica). Any suggestions to improve the indexing rate?
Thanks in advance.
How big is your cluster, and are you sending all your data to just one node or several?
Also, have you tried routing your Kafka output through Logstash?
What is the specification of your cluster? How large are your events? What bulk size is being used? How many concurrent connections?
Thanks a lot for replying. One event is 16 KB in size.
We have an Elasticsearch cluster with 6 nodes: 3 master-eligible nodes (only one dedicated), 4 data nodes, and 1 client node. The bulk size from the connector is currently 2000. I tried changing the batch size to 3000, but there was no improvement. With replicas set to zero we get a throughput of around 1200 requests per second, but we need at least one replica in production, and we are aiming for an indexing throughput of 2000 requests per second.
Thanks in advance
Do you have monitoring installed? What is the CPU usage and disk I/O looking like on the data nodes during indexing? How many CPU cores does each node have? What type of storage is being used?
How many concurrent connections/threads does the Kafka connector use to index into the cluster?
We are monitoring via X-Pack in Kibana. CPU usage is mostly below 50% on all nodes. Each node has 14 GB RAM and an 8-core processor. There are 5 concurrent tasks loading via the Kafka connector.
Assuming that you are sending bulk requests to all data nodes, I would recommend increasing the number of parallel indexing threads. Increase slowly and monitor indexing throughput until you see no further gain. 5 connections/threads sounds a bit low for a cluster that size in my opinion.
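As a sketch, increasing parallelism might look something like this in the connector configuration. The property names below are from the Confluent Elasticsearch sink connector and the exact values are illustrative assumptions, not recommendations; tune them incrementally while watching throughput:

```
# Hypothetical Kafka Connect Elasticsearch sink settings (adjust to taste)
tasks.max=8                  # more parallel sink tasks than the current 5
batch.size=1000              # documents per bulk request
max.in.flight.requests=10    # concurrent bulk requests per task
linger.ms=1000               # wait up to 1s to fill a batch
```

Raise `tasks.max` (and, if supported in your connector version, `max.in.flight.requests`) one step at a time, and stop when throughput plateaus or CPU/disk on the data nodes saturates.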
And what about bulk size, considering each event is 16 KB? How much do you suggest? Currently it is 2000.
As your documents are quite large, that does sound a bit big: 2000 documents at 16 KB each is roughly 32 MB per bulk request, well above the few-megabyte range that usually works well. I would probably recommend going with a smaller bulk size rather than a larger one, but what you have may also be appropriate. You need to benchmark to know for sure.
I was going through blogs on improving indexing performance and came across the indices.cluster.send_refresh_mapping property. I tried setting it to false in elasticsearch.yml, but it is reported as an unknown setting. Could you tell me how to set this? We are using Elasticsearch 5.4.0.
I would recommend optimising throughput by benchmarking different bulk sizes and numbers of concurrent connections before experimenting with expert-level settings, as Elasticsearch generally ships with good defaults. One thing you may however want to change at the index level is the refresh interval.
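For reference, the refresh interval is a dynamic index setting, so it can be changed at runtime with the settings API. A minimal example (the index name `my-index` and the `30s` value are placeholders; the default is `1s`, and a longer interval trades search freshness for indexing throughput):

```
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```

You can set it back to `1s` (or remove the setting) once the heavy indexing phase is over.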