I have been trying to index data into Elasticsearch using the bulk index API. I am using elastic4s to construct the bulk request and execute it through its client, which uses the Java API under the hood.
I have a 5-node Elasticsearch cluster (16 GB RAM, 8 cores each).
I have set ES_HEAP to 8G on all nodes.
The data I have is very small, and I use custom routing when bulk indexing. The routing value is set to a city ID, which is the key I want to distribute the data on, so I am indexing many different documents belonging to different cities. I also use the cityId as the 'type' when indexing, so that when searching I can set the type to that particular cityId. Overall I have data for 60,000 different cities.
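To make the routing and type setup concrete, here is a minimal sketch of what such a bulk request body could look like on the wire. It is not the elastic4s code from this setup: the index name `cities` and the document fields are hypothetical, and the `_routing` key in the action metadata is an assumption based on older Elasticsearch releases (newer releases call it `routing`).

```python
import json

def build_bulk_body(index, docs):
    """Build a newline-delimited bulk body.

    Each document is preceded by an action line that uses the city ID
    both as the type and as the routing value.
    """
    lines = []
    for doc in docs:
        city_id = str(doc["cityId"])
        action = {"index": {"_index": index, "_type": city_id, "_routing": city_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

# Hypothetical documents: mostly numeric fields, all for one city.
docs = [
    {"cityId": 101, "population": 52000, "avgTemp": 18.4},
    {"cityId": 101, "population": 52010, "avgTemp": 18.9},
]
body = build_bulk_body("cities", docs)
print(body)
```

With this layout, every document of a city goes to the same shard, and every distinct cityId becomes its own type in the index.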
I batch several documents for each city (10-12) into a single bulk call. If I batch more, performance deteriorates even faster. The size of each call varies between 20 and 150 KB. I have a total of about 1 million documents to index.
Each call is synchronous, so I wait for each bulk call to return. I use 3 parallel workers to push data to Elasticsearch, so each of them is making similar calls at the same time.
Most of the fields are long / double, with only 2-3 fields being strings.
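As an illustration of the batching and worker pattern described above (not the actual elastic4s code), the flow can be sketched as follows. `send_bulk` is a hypothetical stand-in for the blocking bulk call; it only counts documents so that the sketch runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def send_bulk(batch):
    # Hypothetical placeholder for a synchronous bulk request.
    # A real implementation would block here until the cluster responds.
    return len(batch)

def index_all(docs, batch_size=12, workers=3):
    """Send batches with at most `workers` bulk calls in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread blocks on its own call before taking the next batch.
        return sum(pool.map(send_bulk, chunked(docs, batch_size)))

docs = [{"cityId": i % 5, "value": i} for i in range(100)]
total = index_all(docs)
print(total)
```

With 3 workers and 10-12 documents per call, at most about 36 documents are in flight across the whole cluster at any moment.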
I have also set the refresh interval to -1, memory.index_buffer_size to 40%, merge.scheduler.max_thread_count to 1 and store.throttle.type to "none". In addition, I have set the number of replicas for the index to 0 while indexing the data.
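For reference, the settings listed above might be written out as below. The fully qualified names and the scope of each setting (index, node, or cluster level) are assumptions based on older Elasticsearch releases, not something confirmed for this cluster, so check them against the documentation for the version in use.

```python
import json

# Index-level settings (assumed names), sent as the body of an
# index settings update.
index_settings = {
    "index": {
        "refresh_interval": "-1",
        "number_of_replicas": 0,
        "merge": {"scheduler": {"max_thread_count": 1}},
    }
}

# Assumed to be a node-level setting that lives in elasticsearch.yml.
node_settings = {"indices.memory.index_buffer_size": "40%"}

# Assumed to be a cluster-level setting in older releases.
cluster_settings = {"transient": {"indices.store.throttle.type": "none"}}

print(json.dumps(index_settings, indent=2))
```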
Right at the start I get an indexing speed of about 150-200 documents / sec, which quickly deteriorates. After about 10-15 min it falls to 40-50 documents / sec and sometimes goes even below that. It randomly picks up to 100 documents / sec in between, but keeps falling back to 30-40 documents / sec. After about 30 min I also see, at one particular point, the bulk queue entirely full on all five nodes, but it recovers from that after some time.
What more can I do to get better performance than this?
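To put these rates in perspective, a quick back-of-the-envelope calculation for the full set of about 1 million documents, assuming the rate stayed constant for the whole run (which it does not in practice):

```python
def hours_to_index(total_docs, docs_per_sec):
    """Hours needed to index `total_docs` at a constant rate."""
    return total_docs / docs_per_sec / 3600.0

# Rates taken from the observations above.
for rate in (200, 150, 50, 40, 30):
    print(f"{rate:>3} docs/s -> {hours_to_index(1_000_000, rate):.1f} h")
```

At the initial 200 documents / sec the whole load would take roughly 1.4 hours; at 40 documents / sec it takes nearly 7 hours.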
Any pointers are highly appreciated.