We have a microservice application that reads data from Kafka and imports it into Elasticsearch using bulk import operations.
The microservice runs in Docker via docker-compose and is scaled out to run in parallel with multiple threads.
Elasticsearch (2.4.1) also runs in Docker via docker-compose, with the following configuration: 1 master node with a 4GB Java heap, 1 client node with a 4GB Java heap, 5 data nodes with 8GB Java heaps, 24 shards, and 1 index of about 7.89GB.
The VM has 256GB of RAM, 24 CPU cores, and 500GB of ext4 disk space.
We noticed that the application pauses for about 20s between some bulk imports before continuing with the next ones. In the end, importing a full index of 7.49GB (6.2 million hits) takes 4h40m, which is not what we expected.
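To put those figures in perspective, here is a quick back-of-the-envelope calculation of the average rate. It assumes the 4h40m covers all 6.2 million documents and takes 1GB as 10^9 bytes; the variable names are only illustrative.

```python
# Average indexing throughput from the figures quoted above.
DOCS = 6_200_000                    # documents ("hits") imported
INDEX_BYTES = 7.49e9                # index size, assuming 1 GB = 10**9 bytes
ELAPSED_S = 4 * 3600 + 40 * 60      # 4h40m expressed in seconds

def average_rates(docs, size_bytes, elapsed_s):
    """Return (documents per second, megabytes per second)."""
    return docs / elapsed_s, size_bytes / 1e6 / elapsed_s

docs_per_s, mb_per_s = average_rates(DOCS, INDEX_BYTES, ELAPSED_S)
print(f"{docs_per_s:.0f} docs/s, {mb_per_s:.2f} MB/s")
```

That works out to roughly 370 documents per second and under half a megabyte per second on average.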
We already tried:
- Disabling refresh and replicas for the initial load
- Raising ulimits
- Tuning the thread pool configuration
No luck.
Do you have any suggestions for increasing indexing speed?
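For reference, disabling refresh and replicas for an initial load is normally done through the index settings API. The sketch below only builds the request; the host `http://localhost:9200`, the index name `my_index`, and the values restored afterwards are assumptions, not details from our setup.

```python
import json
import urllib.request

def bulk_load_settings(enable):
    """Settings body to apply before (enable=True) or after (enable=False) a bulk load."""
    if enable:
        # "-1" turns periodic refresh off; 0 replicas avoids indexing every document twice.
        return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
    # Values to restore are assumptions; use whatever the index normally runs with.
    return {"index": {"refresh_interval": "1s", "number_of_replicas": 1}}

def settings_request(base_url, index, body):
    """Build (but do not send) a PUT <base_url>/<index>/_settings request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Hypothetical host and index name.
req = settings_request("http://localhost:9200", "my_index", bulk_load_settings(True))
# urllib.request.urlopen(req)  # uncomment to send against a live cluster
```

Calling `bulk_load_settings(False)` after the import gives the body that switches refresh and replicas back on.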
Having a single master-eligible node makes it a single point of failure and is not recommended. Also make sure you are sending data directly to the data nodes so the client node does not become a bottleneck.
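One way to send data directly to the data nodes is to spread bulk requests across them in round-robin order instead of routing everything through the client node. This is a minimal sketch of that idea, not the application's actual code; the node URLs, index name, and type name used with it would be placeholders.

```python
import itertools
import json

def bulk_body(index, doc_type, docs):
    """Serialize documents into the newline-delimited format expected by _bulk."""
    lines = []
    for doc in docs:
        # Action line first, then the document source on the following line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

def assign_batches(batches, data_nodes):
    """Pair each batch with a data node URL in round-robin order."""
    return list(zip(itertools.cycle(data_nodes), batches))
```

Each `(node_url, batch)` pair can then be posted to `node_url + "/_bulk"` with the body produced by `bulk_body`, so no single node receives every request.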
Thanks for the fast reply @Christian_Dahlqvist. This environment is just a prototype to collect metrics that are useful for production, where we have 3 client nodes, 3 master nodes, and 12 data nodes.
Anyway, good point!
Do you know whether the client node limits the number of requests or the request rate?
Have a look at the Elasticsearch nodes and see if you can identify what is limiting throughput. Elasticsearch is often very disk I/O intensive, so slow storage is a common bottleneck, but it could also be CPU or GC.
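As a starting point for that check, the node stats API (`GET /_nodes/stats/jvm,thread_pool`) exposes heap usage, GC time, and bulk thread pool counters per node. The helper below is a rough sketch that pulls a few of those fields out of an already-parsed response; the function name and the choice of fields are illustrative, and missing fields simply come back as `None`.

```python
def summarize_node_stats(stats):
    """Extract a few bottleneck indicators per node from a _nodes/stats response."""
    summary = {}
    for node in stats.get("nodes", {}).values():
        bulk = node.get("thread_pool", {}).get("bulk", {})
        jvm = node.get("jvm", {})
        old_gc = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        summary[node.get("name", "?")] = {
            "bulk_queue": bulk.get("queue"),        # requests currently waiting
            "bulk_rejected": bulk.get("rejected"),  # requests refused because the queue was full
            "heap_used_percent": jvm.get("mem", {}).get("heap_used_percent"),
            "old_gc_millis": old_gc.get("collection_time_in_millis"),
        }
    return summary
```

A growing `bulk_rejected` count or long old-generation GC times would point at the cluster itself; disk I/O is better observed on the host, for example with `iostat`.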