We have an indexing process that reads batches of data from a datasource and indexes them into an Elasticsearch cluster using bulk index requests (roughly one bulk per batch). The data in a single batch can go to several indices.
While doing some testing we noticed a significant degradation in bulk index request performance when the documents in a batch go to several indices. The more mixed the batch (and therefore the bulk), the slower the bulk index request.
We are on Elasticsearch version 2.2. Our cluster has 11 nodes (15 GB RAM and 8 CPU cores per node) and holds around 400 indices.
In one of our tests a single bulk contained documents for every one of the 400 indices. Another important detail is that we are performing around 500 bulk index requests per second.
The bulk requests use the default configuration (1,000 documents per bulk) and our mappings are static. We also have index auto-creation disabled.
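A simplified sketch of how one of our batches becomes a bulk request (using the elasticsearch-py client; field names like `target_index` and `body` are placeholders, not our real schema):

```python
# Simplified sketch of how one batch becomes a bulk request in our indexer,
# using the elasticsearch-py client. Field names like "target_index" and
# "body" are placeholders, not our real schema.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def index_batch(batch):
    # Every document carries the index it belongs to, so a single bulk
    # request can fan out across many indices (the "mixed" case).
    actions = [
        {
            "_op_type": "index",
            "_index": doc["target_index"],  # e.g. logs_2015-08_performance_0
            "_type": "doc",                 # ES 2.x still requires a type
            "_source": doc["body"],
        }
        for doc in batch
    ]
    helpers.bulk(es, actions, chunk_size=1000)  # default 1k documents per bulk
```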
I think I know what might be happening. Can you take a few hot thread traces while repeatedly executing bulk requests with payloads targeting many indices, and again targeting one or a small number of indices?
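For example, something along these lines will grab a few snapshots from the hot threads API while your indexer is running (host and output file names are just placeholders):

```python
# Capture a handful of hot-threads snapshots while bulk indexing is running.
# Host/port and output file names are placeholders.
import time
import requests

for i in range(5):
    resp = requests.get("http://localhost:9200/_nodes/hot_threads")
    with open("hot_threads_{}.txt".format(i), "w") as f:
        f.write(resp.text)
    time.sleep(2)
```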
The main differences we found between the two traces were pieces like this:
[2017-01-12 15:39:37,595][TRACE][index.shard ] [node-5] [logs_2015-08_performance_0][2] skip writing shard state, has been written before; previous version: [4] current version [4]
[2017-01-12 15:39:37,595][TRACE][index.shard ] [node-5] [logs_2015-08_performance_0][26] skip writing shard state, has been written before; previous version: [4] current version [4]
[2017-01-12 15:39:37,595][TRACE][index.shard ] [node-5] [logs_2015-08_performance_0][18] skip writing shard state, has been written before; previous version: [4] current version [4]
.......
[2017-01-12 15:40:00,371][TRACE][transport.tracer ] [node-5] [2015940][indices:data/write/bulk[s]] sent error response (error: [[node-5][10.75.114.160:9305][indices:data/write/bulk[s][p]]])
[2017-01-12 15:40:00,371][TRACE][transport.tracer ] [node-5] [2015949][indices:data/write/bulk[s]] received request
[2017-01-12 15:40:00,371][TRACE][transport.tracer ] [node-5] [1995643][indices:data/write/bulk[s][p]] received response from [{node-5}{IOKuMzJbRjesv8NInGiuBQ}{10.75.114.160}{10.75.114.160:9305}{ips_id=6D46:6324, master=false}]
[2017-01-12 15:40:00,375][TRACE][indices.breaker.request ] [request] Adjusted breaker by [16440] bytes, now [16440]
[2017-01-12 15:40:00,375][TRACE][index.store.deletes ] [node-5][logs_2003-10_performance_0][14] StoreDirectory.deleteFile: delete file _2.fdt
[2017-01-12 15:40:00,375][TRACE][transport.tracer ] [node-5] [1995649][indices:data/write/bulk[s][p]] received response from [{node-5}{IOKuMzJbRjesv8NInGiuBQ}{10.75.114.160}{10.75.114.160:9305}{ips_id=6D46:6324, master=false}]
.......
[2017-01-12 15:40:00,416][TRACE][indices.breaker.request ] [request] Adjusted breaker by [-16440] bytes, now [0]
[2017-01-12 15:40:00,417][TRACE][transport.tracer ] [node-5] [1994226][indices:data/write/bulk[s][p]] received response from [{node-5}{IOKuMzJbRjesv8NInGiuBQ}{10.75.114.160}{10.75.114.160:9305}{ips_id=6D46:6324, master=false}]
[2017-01-12 15:40:00,417][TRACE][indices.breaker.request ] [request] Adjusted breaker by [16440] bytes, now [16440]
[2017-01-12 15:40:00,417][TRACE][transport.tracer ] [node-5] [2004520][indices:data/write/bulk[s]] sent response
[2017-01-12 15:40:00,417][TRACE][indices.breaker.request ] [request] Adjusted breaker by [-16440] bytes, now [0]
[2017-01-12 15:40:00,417][TRACE][transport.tracer ] [node-5] [1993906][indices:data/write/bulk[s][p]] received response from [{node-5}{IOKuMzJbRjesv8NInGiuBQ}{10.75.114.160}{10.75.114.160:9305}{ips_id=6D46:6324, master=false}]
[2017-01-12 15:40:00,417][TRACE][indices.breaker.request ] [request] Adjusted breaker by [16440] bytes, now [16440]
[2017-01-12 15:40:00,417][TRACE][transport.tracer ] [node-5] [2001951][indices:data/write/bulk[s]] sent response
[2017-01-12 15:40:00,418][DEBUG][indices.memory ] [node-5] recalculating shard indexing buffer, total is [1.4gb] with [195] active shards, each shard set to indexing=[7.7mb], translog=[64kb]
[2017-01-12 15:40:00,418][DEBUG][index.shard ] [node-5] [logs_2000-12_performance_0][2] updating index_buffer_size from [7.8mb] to [7.7mb]; IndexWriter now using [427204] bytes
[2017-01-12 15:40:00,418][DEBUG][index.shard ] [node-5] [logs_2000-12_performance_0][18] updating index_buffer_size from [7.8mb] to [7.7mb]; IndexWriter now using [431760] bytes
[2017-01-12 15:40:00,418][DEBUG][index.shard ] [node-5] [logs_2000-12_performance_0][26] updating index_buffer_size from [7.8mb] to [7.7mb]; IndexWriter now using [431760] bytes
[2017-01-12 15:40:00,418][DEBUG][index.shard ] [node-5] [logs_2000-12_performance_0][10] updating index_buffer_size from [7.8mb] to [7.7mb]; IndexWriter now using [430140] bytes
[2017-01-12 15:40:00,419][DEBUG][index.shard ] [node-5] [logs_2012-05_performance_0][24] updating index_buffer_size from [7.8mb] to [7.7mb]; IndexWriter now using [425584] bytes
[2017-01-12 15:40:00,419][TRACE][index.engine.lucene.iw ] [node-5] [logs_1998-01_performance_0][16] elasticsearch[node-5][refresh][T#7] DW: publishFlushedSegment seg-private updates=null
[2017-01-12 15:40:00,419][TRACE][index.engine.lucene.iw ] [node-5] [logs_1998-01_performance_0][16] elasticsearch[node-5][refresh][T#7] IW: publishFlushedSegment
Okay, that is not what I was looking for, but it still gives me enough information to conclude that it is what I suspected.
I was looking for the output of the hot threads API. If you can still provide that, and at the same time turn on trace logging for org.elasticsearch.indices.IndexingMemoryController (which is exactly where my suspicion lies), we can nail this down even tighter. It's not strictly necessary, though; I think we already have the answer.
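If you do decide to turn that logging on, it can be enabled at runtime through the cluster settings API. Judging by the [indices.memory] category in your log, the key is probably logger.indices.memory, but adjust it if your version expects a different name:

```python
# Sketch: raise the indexing memory controller's log level to TRACE at runtime.
# The logger key and host are assumptions; adjust for your setup.
import requests

requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"transient": {"logger.indices.memory": "TRACE"}},
)
```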
Here's what I think is happening: there is a memory budget that is shared among the index writers of all active shards. If you're writing to 400 indices, that budget is shared across the active shards of those 400 indices. I don't know how many shards you have, but the default budget is 10% of the heap, and if you divide that across 400 shards (assuming 1 shard per index) or 2,000 shards (assuming the default of 5 shards per index), you are left with a very small budget per shard. This means the shards are constantly flushing, which is terrible for performance.
You either need to change your write patterns (so that not all shards are active at the same time) or increase the memory budget (via indices.memory.index_buffer_size; the default is 10% of the heap).
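For the first option, one possible shape (only a sketch; the field names and threshold are made up) is to buffer documents per index on the client side and flush each index in larger, less frequent bulks, so that only a few indices' shards are actively being written to at any moment:

```python
# Sketch of a client-side write-pattern change: accumulate documents per index
# and flush an index only when its buffer is large enough, so that far fewer
# shards are actively being written to at the same time. The threshold and
# field names are placeholders.
from collections import defaultdict
from elasticsearch import helpers

FLUSH_THRESHOLD = 5000  # documents per index before we send a bulk

class PerIndexBuffer:
    def __init__(self, es):
        self.es = es
        self.buffers = defaultdict(list)

    def add(self, doc):
        index_name = doc["target_index"]
        self.buffers[index_name].append(doc)
        if len(self.buffers[index_name]) >= FLUSH_THRESHOLD:
            self.flush(index_name)

    def flush(self, index_name):
        docs = self.buffers.pop(index_name, [])
        actions = [
            {"_index": index_name, "_type": "doc", "_source": d["body"]}
            for d in docs
        ]
        helpers.bulk(self.es, actions)
```

For the second option, indices.memory.index_buffer_size is, as far as I know, a node-level setting in elasticsearch.yml (for example indices.memory.index_buffer_size: 20%) and needs a node restart to take effect.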