Slow document upsert time

I'm experiencing unexpectedly high update times when upserting documents
into a single index using a custom river plugin that I wrote. When I start
the river plugin, the elapsed time to upsert a document into my index is on
the order of tens of milliseconds, with very little variance, and throughput
hovers around 2k documents per minute. However, after the cluster (6 nodes)
and the river have run for a few hours, the variance in update times
increases dramatically: a fraction of the upserts still take tens of
milliseconds, but others take tens of seconds (sometimes up to 40 or 50
seconds total). Throughput slowly decreases until it reaches about 100 per
minute.

I've seen this thread
(https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/qmCI4sA6lPE)
, but wanted to get another opinion.

The code that we use inside our river plugin to upsert documents looks like
this:

// client is the Client handed to the river; the index name, document type,
// id, and payload (the document source) come from the incoming record
IndexRequestBuilder builder = client.prepareIndex(elasticSearchIndex,
        elasticSearchDocumentType, id);

// don't force a refresh on every request
builder.setRefresh(false);

long beforeIndexUpdateTime = System.currentTimeMillis();
// blocks until the index request has completed
builder.setSource(payload).execute().actionGet();
long afterIndexUpdateTime = System.currentTimeMillis();

long elapsedTime = afterIndexUpdateTime - beforeIndexUpdateTime;

When I notice that average upsert times are growing beyond 1 second, I
begin restarting cluster nodes one at a time. As I restart nodes, the
average upsert time slowly drops, until I get to the last node of the
cluster. When I restart the last node, average upsert times immediately
drop back to about 10 ms, and throughput for upserts goes back to normal
(2k per minute). I typically have to do this once per day to keep the river
processing records in an expedient manner.

Something else curious: we set refresh = false on our index requests and the
index refresh interval to -1, yet node stats still report time spent in
refresh operations.
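
For reference, disabling the refresh interval through the Java admin API
would look roughly like this (a sketch only, not our exact code; the
settings-builder calls assume the 0.19/0.20-era client, and
places_2012_10_24 is the index from the stats below):

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

// sketch: turn off periodic refreshes so segments are only refreshed
// when explicitly requested
Settings settings = ImmutableSettings.settingsBuilder()
        .put("index.refresh_interval", "-1")
        .build();

client.admin().indices().prepareUpdateSettings("places_2012_10_24")
        .setSettings(settings)
        .execute().actionGet();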

Some facts about our cluster:

  • 6 identical hardware nodes
  • 64 GB RAM per node, 32 GB allocated to ES
  • Over time (< 1 day), the ES heap is maxed out and the remainder of
    system memory is consumed by the OS file cache, which is expected
  • Our ES cluster is under sustained query load all of the time, but the
    load is not very high - typically a few queries per second
  • Index refresh interval is -1
  • We have not tried bulk insert

Any ideas about what might be going on, or what I could do to prevent these
upsert slowdowns? Is bulk insert the only possible way to alleviate this,
or are there other parameters which could be tuned?
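
For what it's worth, if we did switch to bulk, I'd expect it to look roughly
like the sketch below with the same Java client (untested on our side;
Document, batch, and logger are placeholders for our own types):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;

// sketch: send a batch of documents in one bulk request instead of one
// round trip per document
BulkRequestBuilder bulkRequest = client.prepareBulk();
for (Document doc : batch) {
    bulkRequest.add(client.prepareIndex(elasticSearchIndex,
            elasticSearchDocumentType, doc.getId())
            .setSource(doc.getPayload()));
}

BulkResponse bulkResponse = bulkRequest.execute().actionGet();
if (bulkResponse.hasFailures()) {
    // inspect/retry the failed items
    logger.warn("bulk failures: {}", bulkResponse.buildFailureMessage());
}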

Thanks!
-Taylor

Node stats:
{"cluster_name":"merchant_data_prod_snc1","nodes":{"TzYLULMaScSAQvhwh7EuUQ":{"timestamp":1354560090787,"name":"Stevie
Hunter","transport_address":"inet[/10.20.71.146:9300]","hostname":"merchant-data-esmaster1","indices":{"store":{"size":"23.3gb","size_in_bytes":25025106146,"throttle_time":"0s","throttle_time_in_millis":0},"docs":{"count":5672834,"deleted":1581696},"indexing":{"index_total":35687,"index_time":"54.2s","index_time_in_millis":54298,"index_current":1,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":7397,"time":"1.2s","time_in_millis":1209,"exists_total":7374,"exists_time":"1.2s","exists_time_in_millis":1209,"missing_total":23,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":331360,"query_time":"10.3m","query_time_in_millis":618379,"query_current":0,"fetch_total":16036,"fetch_time":"2s","fetch_time_in_millis":2045,"fetch_current":0},"cache":{"field_evictions":0,"field_size":"227.8mb","field_size_in_bytes":238918776,"filter_count":349,"filter_evictions":0,"filter_size":"2.2mb","filter_size_in_bytes":2398392},"merges":{"current":0,"current_docs":0,"current_size":"0b","current_size_in_bytes":0,"total":4704,"total_time":"5.1m","total_time_in_millis":307563,"total_docs":1404914,"total_size":"4.7gb","total_size_in_bytes":5140763017},"refresh":{"total":20348,"total_time":"5.8m","total_time_in_millis":352872},"flush":{"total":62,"total_time":"2.2m","total_time_in_millis":137203}}},"rukevpq_QYSMprkFwBw4hg":{"timestamp":1354560090788,"name":"Sekhmet","transport_address":"inet[/10.20.71.154:9300]","hostname":"merchant-data-search3","indices":{"store":{"size":"18.4gb","size_in_bytes":19854219732,"throttle_time":"0s","throttle_time_in_millis":0},"docs":{"count":4253940,"deleted":1050026},"indexing":{"index_total":6666,"index_time":"10.9s","index_time_in_millis":10937,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":1515,"time":"256ms","time_in_millis":256,"exists_total":1508,"exists_time":"256ms","exists_time_in_millis":256,"missing_total":7,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":66775,"query_time":"2m","query_time_in_millis":125095,"query_current":0,"fetch_total":3133,"fetch_time":"307ms","fetch_time_in_millis":307,"fetch_current":0},"cache":{"field_evictions":0,"field_size":"88mb","field_size_in_bytes":92364704,"filter_count":137,"filter_evictions":0,"filter_size":"356.9kb","filter_size_in_bytes":365520},"merges":{"current":1,"current_docs":33,"current_size":"296kb","current_size_in_bytes":303191,"total":870,"total_time":"41.6s","total_time_in_millis":41605,"total_docs":252721,"total_size":"849.9mb","total_size_in_bytes":891273472},"refresh":{"total":4258,"total_time":"28.9s","total_time_in_millis":28973},"flush":{"total":5,"total_time":"660ms","total_time_in_millis":660}}},"JxtjB0NxSXm5nFMlVERvlw":{"timestamp":1354560090788,"name":"Hammerhead","transport_address":"inet[/10.20.73.21:9300]","hostname":"merchant-data-esmaster2","indices":{"store":{"size":"40gb","size_in_bytes":42955598450,"throttle_time":"0s","throttle_time_in_millis":0},"docs":{"count":9465644,"deleted":2395535},"indexing":{"index_total":35631,"index_time":"1m","index_time_in_millis":64685,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":8860,"time":"19.8s","time_in_millis":19894,"exists_total":8842,"exists_time":"19.8s","exists_time_in_millis":19894,"missing_total":18,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":390440,"query_time":"14m","query_t
ime_in_millis":843619,"query_current":0,"fetch_total":18766,"fetch_time":"48.1s","fetch_time_in_millis":48148,"fetch_current":0},"cache":{"field_evictions":0,"field_size":"296.1mb","field_size_in_bytes":310580060,"filter_count":264,"filter_evictions":0,"filter_size":"2.3mb","filter_size_in_bytes":2464480},"merges":{"current":0,"current_docs":0,"current_size":"0b","current_size_in_bytes":0,"total":4642,"total_time":"6.7m","total_time_in_millis":402696,"total_docs":1293547,"total_size":"4.7gb","total_size_in_bytes":5113798115},"refresh":{"total":24125,"total_time":"6.4m","total_time_in_millis":385515},"flush":{"total":47,"total_time":"1.4m","total_time_in_millis":86634}}},"mCUKsUBqRnyJ_iHrR0zaYQ":{"timestamp":1354560090787,"name":"Whitemane,
Kofi","transport_address":"inet[/10.20.70.120:9300]","hostname":"merchant-data-search1","indices":{"store":{"size":"30.7gb","size_in_bytes":33065685443,"throttle_time":"0s","throttle_time_in_millis":0},"docs":{"count":7265217,"deleted":1865636},"indexing":{"index_total":24208,"index_time":"36.2s","index_time_in_millis":36276,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":6779,"time":"17.2s","time_in_millis":17295,"exists_total":6752,"exists_time":"17.2s","exists_time_in_millis":17295,"missing_total":27,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":276755,"query_time":"9.1m","query_time_in_millis":547407,"query_current":0,"fetch_total":13301,"fetch_time":"32.3s","fetch_time_in_millis":32368,"fetch_current":0},"cache":{"field_evictions":0,"field_size":"234.1mb","field_size_in_bytes":245493260,"filter_count":167,"filter_evictions":0,"filter_size":"1.1mb","filter_size_in_bytes":1194392},"merges":{"current":0,"current_docs":0,"current_size":"0b","current_size_in_bytes":0,"total":3195,"total_time":"3.2m","total_time_in_millis":193794,"total_docs":864134,"total_size":"3.2gb","total_size_in_bytes":3524762890},"refresh":{"total":17687,"total_time":"2.4m","total_time_in_millis":148159},"flush":{"total":16,"total_time":"4.9s","total_time_in_millis":4992}}},"WBvkgmDcQUG7rYbRhKTGgQ":{"timestamp":1354560090789,"name":"Typhoid
Mary","transport_address":"inet[/10.20.73.22:9300]","hostname":"merchant-data-search4","indices":{"store":{"size":"5.7gb","size_in_bytes":6215870795,"throttle_time":"0s","throttle_time_in_millis":0},"docs":{"count":1464830,"deleted":403712},"indexing":{"index_total":4138,"index_time":"3.3s","index_time_in_millis":3345,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":658,"time":"112ms","time_in_millis":112,"exists_total":654,"exists_time":"112ms","exists_time_in_millis":112,"missing_total":4,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":29127,"query_time":"55.4s","query_time_in_millis":55428,"query_current":0,"fetch_total":1453,"fetch_time":"145ms","fetch_time_in_millis":145,"fetch_current":0},"cache":{"field_evictions":0,"field_size":"59.4mb","field_size_in_bytes":62355984,"filter_count":85,"filter_evictions":0,"filter_size":"240.1kb","filter_size_in_bytes":245896},"merges":{"current":0,"current_docs":0,"current_size":"0b","current_size_in_bytes":0,"total":654,"total_time":"38.8s","total_time_in_millis":38889,"total_docs":155980,"total_size":"503.9mb","total_size_in_bytes":528471197},"refresh":{"total":1986,"total_time":"11.7s","total_time_in_millis":11700},"flush":{"total":6,"total_time":"189ms","total_time_in_millis":189}}},"GzpOGQUnSQunG7lBXOh9aQ":{"timestamp":1354560090787,"name":"Jones,
Gabe","transport_address":"inet[/10.20.70.119:9300]","hostname":"merchant-data-search2","indices":{"store":{"size":"31.2gb","size_in_bytes":33527799828,"throttle_time":"0s","throttle_time_in_millis":0},"docs":{"count":7280148,"deleted":1696965},"indexing":{"index_total":12755,"index_time":"15.7s","index_time_in_millis":15717,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":3320,"time":"3.7s","time_in_millis":3722,"exists_total":3310,"exists_time":"3.7s","exists_time_in_millis":3722,"missing_total":10,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":138146,"query_time":"4.5m","query_time_in_millis":272979,"query_current":0,"fetch_total":6607,"fetch_time":"6.9s","fetch_time_in_millis":6930,"fetch_current":0},"cache":{"field_evictions":0,"field_size":"139.3mb","field_size_in_bytes":146091020,"filter_count":129,"filter_evictions":0,"filter_size":"644.2kb","filter_size_in_bytes":659728},"merges":{"current":0,"current_docs":0,"current_size":"0b","current_size_in_bytes":0,"total":1789,"total_time":"1.9m","total_time_in_millis":115117,"total_docs":405066,"total_size":"1.4gb","total_size_in_bytes":1607732305},"refresh":{"total":8965,"total_time":"1m","total_time_in_millis":63856},"flush":{"total":11,"total_time":"3.9s","total_time_in_millis":3983}}}}}

Index stats:
{"ok":true,"_shards":{"total":60,"successful":60,"failed":0},"_all":{"primaries":{"docs":{"count":8675769,"deleted":1679654},"store":{"size":"38.1gb","size_in_bytes":40998312655,"throttle_time":"0s","throttle_time_in_millis":0},"indexing":{"index_total":0,"index_time":"0s","index_time_in_millis":0,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":0,"time":"0s","time_in_millis":0,"exists_total":0,"exists_time":"0s","exists_time_in_millis":0,"missing_total":0,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":0,"query_time":"0s","query_time_in_millis":0,"query_current":0,"fetch_total":0,"fetch_time":"0s","fetch_time_in_millis":0,"fetch_current":0}},"total":{"docs":{"count":26027307,"deleted":5038962},"store":{"size":"114.5gb","size_in_bytes":122994937965,"throttle_time":"0s","throttle_time_in_millis":0},"indexing":{"index_total":0,"index_time":"0s","index_time_in_millis":0,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":0,"time":"0s","time_in_millis":0,"exists_total":0,"exists_time":"0s","exists_time_in_millis":0,"missing_total":0,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":0,"query_time":"0s","query_time_in_millis":0,"query_current":0,"fetch_total":0,"fetch_time":"0s","fetch_time_in_millis":0,"fetch_current":0}},"indices":{"places_2012_10_24":{"primaries":{"docs":{"count":8675769,"deleted":1679654},"store":{"size":"38.1gb","size_in_bytes":40998312655,"throttle_time":"0s","throttle_time_in_millis":0},"indexing":{"index_total":0,"index_time":"0s","index_time_in_millis":0,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":0,"time":"0s","time_in_millis":0,"exists_total":0,"exists_time":"0s","exists_time_in_millis":0,"missing_total":0,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":0,"query_time":"0s","query_time_in_millis":0,"query_current":0,"fetch_total":0,"fetch_time":"0s","fetch_time_in_millis":0,"fetch_current":0}},"total":{"docs":{"count":26027307,"deleted":5038962},"store":{"size":"114.5gb","size_in_bytes":122994937965,"throttle_time":"0s","throttle_time_in_millis":0},"indexing":{"index_total":0,"index_time":"0s","index_time_in_millis":0,"index_current":0,"delete_total":0,"delete_time":"0s","delete_time_in_millis":0,"delete_current":0},"get":{"total":0,"time":"0s","time_in_millis":0,"exists_total":0,"exists_time":"0s","exists_time_in_millis":0,"missing_total":0,"missing_time":"0s","missing_time_in_millis":0,"current":0},"search":{"query_total":0,"query_time":"0s","query_time_in_millis":0,"query_current":0,"fetch_total":0,"fetch_time":"0s","fetch_time_in_millis":0,"fetch_current":0}}}}}}

--

You are using very large heaps. From what it seems, you are suffering from
long garbage collection (GC) runs. GC pauses cannot be influenced by
switching to bulk inserts or by changing refresh settings.

I would like to recommend the following steps for better diagnostics:

  • take a fraction of the data you are indexing, configure a test cluster
    with a single node and a small heap, and index that data into it with
    your river setup
  • enable debug logging and Java heap/GC logging
  • monitor your test node; there are tools like BigDesk out there
  • just as with any Java app, analyze what is on your heap and how it
    grows; if necessary take heap snapshots and inspect them with tools
    like jvisualvm (a small snippet for watching the heap from inside the
    river follows this list)
  • watch out for memory leaks, especially in your custom river
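
If you want a quick look from inside the JVM without extra tooling, a plain
JMX snippet along these lines (just a sketch) can log heap usage and
accumulated GC time from a scheduled thread in your river:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// sketch: log current heap usage and cumulative GC time; run periodically
MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
long gcMillis = 0;
for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
    gcMillis += gc.getCollectionTime();   // total GC time in milliseconds
}
System.out.println("heap used=" + (heap.getUsed() >> 20) + "mb"
        + " max=" + (heap.getMax() >> 20) + "mb"
        + " gc time=" + gcMillis + "ms");

If the GC time grows by many seconds between samples while the heap stays
near its maximum, long GC runs are the most likely cause of the slow upserts.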

For more about understanding Java Virtual Machine (JVM) settings, the
challenge of large heaps, and some strategies against performance
degradation, see also

http://jprante.github.com/2012/11/28/Elasticsearch-Java-Virtual-Machine-settings-explained.html

Best regards,

Jörg

--