Slow bulk indexing with lots of different 'types'


(Yash) #1

Hi all,

I have been trying to index data into Elasticsearch using the bulk index API. I am using elastic4s to construct the bulk request and execute it through its client, which uses the Java API under the hood.

I have a 5-node Elasticsearch cluster (16 GB RAM, 8 cores each).
I have set ES_HEAP to 8G on all nodes.

The data I have is very small, and I use custom routing when bulk indexing. The routing field is set to a city id on which I want to distribute the data. So I am trying to index many different documents pertaining to different cities. I also use the cityId as 'type' when indexing so when searching I can set the type to that particular cityId. Overall I have data for 60000 different cities.

I group several documents for each city (10 - 12) into a single bulk call. If I group more, performance deteriorates even faster. The size of each call varies between 20 and 150 KB. I have a total of about 1 million documents to index.

Each call is synchronous, so I wait for each bulk call to return. I use 3 parallel workers to push data to Elasticsearch, so each of them is making similar calls at the same time.

Most of the data is long / double with only 2 - 3 fields as string.

I have also set the refresh interval to -1, memory.index_buffer_size to 40%, merge.scheduler.max_thread_count to 1, and store.throttle.type to "none". I have also set the number of replicas for the index to 0 while indexing.

Right at the start I get an indexing speed of about 150 - 200 documents/sec, which quickly deteriorates. After about 10 - 15 min it falls to 40 - 50 documents/sec and sometimes goes even lower. It occasionally picks up to 100 documents/sec but keeps falling back to 30 - 40 documents/sec. After about 30 min, I also see the bulk queue completely full on all five nodes at one point, but it recovers after some time.

What more can I do to get better performance than this?

Any pointers are highly appreciated.

Best Regards
Yash


(Christian Dahlqvist) #2

Try indexing your data using a single type. Having that many different types is bad practice: the mappings, and therefore the cluster state, will need to be updated frequently, which causes bad performance, especially when the type comes from a high-cardinality field.
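As a rough sketch of what that could look like, the snippet below builds the two lines of a bulk request for one document, with the city id stored as an ordinary field rather than as the type. The index name "cities", the type name "doc", the field name "cityId" and the helper name are all made up for illustration, and the snippet only builds the request lines; it does not talk to a cluster.

```python
import json

INDEX = "cities"   # hypothetical index name
DOC_TYPE = "doc"   # one shared type instead of one type per city


def to_bulk_lines(city_id, doc_id, source):
    """Return the action line and the source line for one document."""
    action = {"index": {"_index": INDEX, "_type": DOC_TYPE, "_id": doc_id}}
    # The city id moves out of the type name and into a regular field.
    body = dict(source, cityId=city_id)
    return [json.dumps(action), json.dumps(body)]


lines = to_bulk_lines(42, "a1", {"population": 1000})
print("\n".join(lines))
```

With this layout every city shares one mapping, so indexing a document for a new city does not add a new type.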


(Yash) #3

Hi,

Thanks for the prompt response. Can I still use cityId as the routing parameter and just eliminate the type? Then, when querying, I can at least make use of the routing field?

I have data for 60K unique cities btw.

Best Regards
Yash


(Christian Dahlqvist) #4

Yes, you can certainly still use routing.
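A minimal sketch of the idea, assuming the city id is also stored in a field called "cityId" (a made-up name, like the index and type names below). Several cities can end up on the same shard, so routing only narrows the search to one shard and the query still needs to filter on the city. The routing key in the bulk action line is written here as "_routing", which is what older releases use; newer ones spell it "routing", so check the docs for your version.

```python
import json
from urllib.parse import urlencode


def index_action(index, doc_type, doc_id, city_id):
    """Bulk action line that routes the document by its city id."""
    # "_routing" is an assumption about the version in use; newer
    # releases expect the key "routing" instead.
    return {"index": {"_index": index, "_type": doc_type,
                      "_id": doc_id, "_routing": str(city_id)}}


def search_request(index, city_id):
    """Path and body for a search limited to the shard holding city_id."""
    path = "/{}/_search?{}".format(index, urlencode({"routing": city_id}))
    # Routing picks the shard; the term query picks the city on that shard.
    body = {"query": {"term": {"cityId": city_id}}}
    return path, body


action = index_action("cities", "doc", "a1", 42)
path, body = search_request("cities", 42)
print(json.dumps(action))
print(path)
print(json.dumps(body))
```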


(Yash) #5

Thank you!

Are there any other parameters I should tune?


(Christian Dahlqvist) #6

You should be able to use a larger bulk size - a good starting point might be around 500.
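Something along these lines could be used to cut the document stream into bulk requests, assuming "500" is read as documents per request. The helper name is made up, and the right size for your data is something to measure rather than take from this sketch.

```python
from itertools import islice


def batches(docs, size=500):
    """Yield consecutive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# 1200 dummy documents split into bulk-sized groups.
sizes = [len(b) for b in batches(range(1200))]
print(sizes)  # [500, 500, 200]
```

Each yielded list would then go out as one bulk request, regardless of how many cities it covers.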


(Yash) #7

Thanks Christian,

I definitely get a LOT better performance when using a single type. I will try a larger bulk size as well.

