I have data that is updated very frequently, so I use bulk requests (50k documents, ~25MB each) to update the data in Elasticsearch.
If a document is already present, I use a scripted update (to increment a counter); if not, the upsert document is indexed instead.
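For reference, a bulk action along those lines can be built as NDJSON pairs (the index name, type, and counter field here are made up; the flat `script`/`params`/`upsert` layout matches the older Elasticsearch update API this config suggests):

```python
import json

def bulk_update_action(index, doc_type, doc_id, count):
    """Build one bulk action (two NDJSON lines): a scripted update that
    increments a counter, plus an upsert document used when the document
    does not exist yet. Names here are illustrative, not from the post."""
    action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
    body = {
        "script": "ctx._source.counter += count",  # inline script, old-style
        "params": {"count": count},
        "upsert": {"counter": count},              # first-sight document
    }
    return json.dumps(action) + "\n" + json.dumps(body) + "\n"

# Concatenate many of these and POST the result to /_bulk.
payload = "".join(bulk_update_action("events", "event", str(i), 1) for i in range(3))
```

Each document costs two lines in the payload, which is why 50k documents come out around the 25MB mentioned above.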
While this works great on a fresh index (one bulk takes about 15 seconds), from the second bulk onward (which mostly consists of updates) each bulk takes around 3-4 minutes.
Elasticsearch is running on a single server (36GB RAM, 20GB heap, 24 cores, dedicated 1Gbit/s NIC).
My elasticsearch.yml:
<redacted>
## Threading
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100
# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 300
# Index pool
threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 100
# Indices settings
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb
index.translog.flush_threshold_ops: 50000
</redacted>
Before updating the documents, refresh_interval is set to -1.
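That toggle is just a settings update on the index; a minimal sketch of the payload (the endpoint shown in the comment is the standard `_settings` API, the index name is a placeholder):

```python
import json

def refresh_settings_body(interval):
    """Payload for PUT /<index>/_settings: "-1" disables refresh during
    bulk loading, a value like "1s" restores it afterwards."""
    return json.dumps({"index": {"refresh_interval": interval}})

# e.g.  curl -XPUT 'localhost:9200/myindex/_settings' -d '<body>'
disable = refresh_settings_body("-1")
restore = refresh_settings_body("1s")
```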
I've been monitoring the progress using the great bigdesk plugin and didn't notice any changes: the thread pools are fine (no queued requests), GC behavior is no different, et cetera.
Do you have any more hints on where I could look for this bottleneck? Can I provide further details?
From here you'll have to learn how to read Java stack traces and identify hot spots in them. There are two tools available to you right now: the hot_threads API and jstack.
hot_threads attempts to guess which threads are causing trouble and gets you a snapshot of them. It works fine when one action is slow, but if you have lots of actions that are slow yet still faster than the hot_threads sampling window, it doesn't work well and you have to fall back to jstack.
jstack you have to run multiple times yourself and classify the threads manually. That isn't as hard as it sounds - I've done it with sed.
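The manual classification can be scripted in a few lines; a rough sketch (the thread-name pattern is an assumption based on Elasticsearch's usual `node[pool][T#n]` naming, so adjust the regex to your actual jstack output):

```python
import re
from collections import Counter

def classify_threads(jstack_output):
    """Count jstack threads by Elasticsearch thread-pool name, e.g.
    'elasticsearch[node1][bulk][T#3]' -> 'bulk'. Run jstack several
    times and compare counts and states to see where time is spent."""
    names = re.findall(r'^"([^"]+)"', jstack_output, re.MULTILINE)
    pools = Counter()
    for name in names:
        m = re.search(r"\[(\w+)\]\[T#\d+\]", name)
        pools[m.group(1) if m else "other"] += 1
    return pools

sample = "\n".join([
    '"elasticsearch[node1][bulk][T#1]" #12 daemon prio=5 RUNNABLE',
    '"elasticsearch[node1][bulk][T#2]" #13 daemon prio=5 BLOCKED',
    '"elasticsearch[node1][search][T#1]" #14 daemon prio=5 WAITING',
])
counts = classify_threads(sample)
```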
Also have a look at the Elasticsearch logs to see whether it is logging messages about merges falling behind. If it is, you might want to look into merge throttling.
BTW, the %d makes me think you are building the whole JSON blob with string substitution. That is probably safe for things like this, but you have to be super careful with escaping. Going with a JSON-building library is probably safer.
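To illustrate the escaping hazard (the substitution template and field names here are hypothetical, not from the original code):

```python
import json

counter = 7
message = 'he said "hi"\n'  # user-supplied text with quotes and a newline

# Naive substitution produces invalid JSON once a value needs escaping:
naive = '{"counter": %d, "message": "%s"}' % (counter, message)

# A JSON library escapes the quotes and the newline for you:
safe = json.dumps({"counter": counter, "message": message})
```

Numeric values like a plain `%d` counter are harmless, which is why the string approach often works until the first awkward string value shows up.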