Slow bulk indexing with lots of different 'types'

saucam · October 17, 2016, 11:06am

Hi all,

I have been trying to index data into Elasticsearch using the bulk index API. I am using elastic4s to construct the bulk request and execute it through the client, which uses the Java API under the hood.

I have a 5-node Elasticsearch cluster (16 GB RAM, 8 cores each).
I have set ES_HEAP to 8G on all nodes.

The data I have is very small, and I use custom routing when bulk indexing. The routing field is set to a city id on which I want to distribute the data. So I am trying to index many different documents pertaining to different cities. I also use the cityId as 'type' when indexing so when searching I can set the type to that particular cityId. Overall I have data for 60000 different cities.

I group several documents for each city (10 - 12) into a single bulk call. If I group more, the performance deteriorates even faster. The size of each call varies between 20 and 150 KB. I have about 1 million documents to index in total.

Each call is synchronous, so I wait for each bulk call to return. I use 3 parallel workers to push data to Elasticsearch, so each of them is making similar calls at the same time.

Most of the data is long / double with only 2 - 3 fields as string.

I have also set refresh interval to -1, memory.index_buffer_size to 40%, merge.scheduler.max_thread_count to 1 and store.throttle.type to "none". I have also set the number of replicas for the index to 0 while indexing the data.

Right at the start I get an indexing speed of about 150 - 200 documents/sec, which quickly deteriorates. After about 10 - 15 minutes it falls to 40 - 50 documents/sec and sometimes goes even lower. It occasionally picks up to 100 documents/sec but keeps falling back to 30 - 40. After about 30 minutes I also saw the bulk queue completely full on all five nodes at one point, but the cluster recovered from that after some time.

What more can I do to get better performance than this?

Any pointers are highly appreciated.

Best Regards
Yash

Christian_Dahlqvist · October 17, 2016, 11:13am

Try indexing your data using a single type. Having that many different types is bad practice: the mappings, and therefore the cluster state, will need to be updated frequently, which will hurt performance, especially when the value used as the type is a high-cardinality field.

saucam · October 17, 2016, 11:15am

Hi,

Thanks for the prompt response. Can I still use cityId as the routing parameter and just eliminate the type? Then while querying I can at least make use of the routing field?

I have data for 60K unique cities btw.

Best Regards
Yash

Christian_Dahlqvist · October 17, 2016, 11:17am

Yes, you can certainly still use routing.
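
The advice in this exchange (one shared type, with cityId kept as the routing value) can be sketched as a bulk request body. This is only an illustration under assumptions, not the poster's elastic4s code: the index name `cities`, the type name `doc`, and the `city_id` field are hypothetical, and the `routing` key in the action line should be checked against the bulk API documentation for your Elasticsearch version (older releases also document it as `_routing`).

```python
import json

INDEX = "cities"   # hypothetical index name
DOC_TYPE = "doc"   # one shared type instead of one type per city


def bulk_lines(docs):
    """Build an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        city_id = str(doc["city_id"])
        # Route by city so that all documents for a city go to the same shard.
        action = {"index": {"_index": INDEX, "_type": DOC_TYPE, "routing": city_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline after the last line.
    return "\n".join(lines) + "\n"


def city_query(city_id):
    """Search body that filters on the city_id field instead of on a type."""
    return {"query": {"term": {"city_id": city_id}}}
```

At search time the same city id would be passed as the routing parameter of the search request together with the `city_query` body. The filter is still needed because several cities can be routed to the same shard.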

saucam · October 17, 2016, 11:20am

Thank you!

Also, are there any other parameters I should play with?

Christian_Dahlqvist · October 17, 2016, 11:25am

You should be able to use a larger bulk size - a good starting point might be around 500.
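
A minimal sketch of batching by a fixed size rather than by city, assuming "500" means roughly 500 documents per bulk request. The helper name `batches` is made up for illustration; each yielded list would become one bulk call.

```python
from itertools import islice


def batches(docs, size=500):
    """Yield lists of up to `size` documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Since 500 is only suggested as a starting point, the size is a parameter: try a few values and compare the indexing rate for each.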

saucam · October 17, 2016, 11:55am

Thanks Christian,

I definitely get a LOT better performance when using a single type. Will try a larger bulk size as well.
