Elasticsearch 5 stops during long indexing

Hi all,

I'm indexing 2B small documents on a single node with ES 5.4.

Documents are organized in ~3K indexes, for a total of 2TB. Index sizes vary from a few KB to hundreds of GB, and each index is sharded so that every shard stays under 5GB.

I'm indexing concurrently with 14 threads, each working on a different index and sending bulk requests of 2K documents. The server has 16 CPUs and 32GB of RAM, and no reads are performed during the indexing process.
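
The indexing loop is roughly the following; this is only a minimal sketch to show the shape of the workload, not my real code (the index names, the document shape, the type name and the chunking generator are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(["localhost:9200"])

    def index_chunk(index_name, docs):
        # One bulk request of ~2K documents, all targeting the same index.
        actions = ({"_index": index_name, "_type": "doc", "_source": d} for d in docs)
        bulk(es, actions, chunk_size=2000)

    # 14 worker threads, each submitting chunks that belong to a different index.
    with ThreadPoolExecutor(max_workers=14) as pool:
        for index_name, docs in chunks_per_index():  # placeholder generator
            pool.submit(index_chunk, index_name, docs)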

Relevant ES configurations are (see the file excerpts after the list):

  • -Xms12g -Xmx12g in jvm.options
  • bootstrap.memory_lock: true
  • indices.memory.index_buffer_size: 95%
  • MAX_OPEN_FILES=1000000
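
Spelled out in file form (heap settings in jvm.options, the rest in elasticsearch.yml; the file-descriptor limit lives in the service environment, whose exact location depends on how ES was installed):

    # jvm.options
    -Xms12g
    -Xmx12g

    # elasticsearch.yml
    bootstrap.memory_lock: true
    indices.memory.index_buffer_size: 95%

    # service environment (location depends on the install, e.g. an init defaults file or systemd unit)
    MAX_OPEN_FILES=1000000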

After 50-100M indexed documents, ES starts to slow down drastically.

In the logs I see warnings like:

[2017-05-11T16:44:24,751][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][2141] overhead, spent [13.5s] collecting in the last [14.3s]
[2017-05-11T16:44:40,323][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][old][2143][337] duration [14s], collections [1]/[14.5s], total [14s]/[54.2m], memory [11.8gb]->[11.4gb]/[11.8gb], all_pools {[young] [865.3mb]->[533.8mb]/[865.3mb]}{[survivor] [75mb]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-05-11T16:44:40,323][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][2143] overhead, spent [14s] collecting in the last [14.5s]
[2017-05-11T16:44:53,004][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][old][2145][338] duration [10.8s], collections [1]/[11.6s], total [10.8s]/[54.4m], memory [11.8gb]->[11.4gb]/[11.8gb], all_pools {[young] [865.3mb]->[538.9mb]/[865.3mb]}{[survivor] [44.9mb]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-05-11T16:44:53,004][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][2145] overhead, spent [10.8s] collecting in the last [11.6s]
[2017-05-11T16:45:05,141][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][old][2147][339] duration [11s], collections [1]/[11.1s], total [11s]/[54.6m], memory [11.8gb]->[11.4gb]/[11.8gb], all_pools {[young] [865.3mb]->[558.3mb]/[865.3mb]}{[survivor] [103.1mb]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-05-11T16:45:05,142][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][2147] overhead, spent [11s] collecting in the last [11.1s]
[2017-05-11T16:45:19,928][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][old][2149][340] duration [13s], collections [1]/[13.7s], total [13s]/[54.8m], memory [11.8gb]->[11.5gb]/[11.8gb], all_pools {[young] [865.3mb]->[570.1mb]/[865.3mb]}{[survivor] [48.9mb]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-05-11T16:45:19,928][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][2149] overhead, spent [13s] collecting in the last [13.7s]
[2017-05-11T16:45:35,926][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][old][2152][341] duration [13.7s], collections [1]/[13.8s], total [13.7s]/[55.1m], memory [11.8gb]->[11.5gb]/[11.8gb], all_pools {[young] [865.3mb]->[575mb]/[865.3mb]}{[survivor] [104.6mb]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-05-11T16:45:35,931][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][2152] overhead, spent [13.7s] collecting in the last [13.8s]
[2017-05-11T16:45:49,919][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][old][2154][342] duration [12.8s], collections [1]/[12.9s], total [12.8s]/[55.3m], memory [11.8gb]->[11.5gb]/[11.8gb], all_pools {[young] [865.3mb]->[577.1mb]/[865.3mb]}{[survivor] [102.3mb]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}
[2017-05-11T16:45:49,919][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][2154] overhead, spent [12.8s] collecting in the last [12.9s]
[2017-05-11T16:46:03,976][WARN ][o.e.m.j.JvmGcMonitorService] [VT0Xr1c] [gc][old][2156][343] duration [12.1s], collections [1]/[13s], total [12.1s]/[55.5m], memory [11.8gb]->[11.5gb]/[11.8gb], all_pools {[young] [865.3mb]->[601mb]/[865.3mb]}{[survivor] [50.9mb]->[0b]/[108.1mb]}{[old] [10.9gb]->[10.9gb]/[10.9gb]}

Looking at the timestamps, each GC ends almost exactly when the next one starts. Each of these collections takes more than 10 seconds, and they appear to be stop-the-world pauses that keep the JVM from running out of memory. While ES is stalled in GC, the bulk requests time out (30 seconds or more) and the indexing process fails.
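
For completeness, the heap pressure is also visible from the nodes stats API, and the per-request timeout on the bulk calls can be raised on the client side; a minimal sketch (the 120-second value is just an illustration, and it only hides the symptom rather than removing the GC pressure):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(["localhost:9200"])

    # Heap usage per node; the old-gen pool pinned at ~100% matches the
    # [old] 10.9gb/10.9gb figures in the warnings above.
    stats = es.nodes.stats(metric="jvm")
    for node_id, node in stats["nodes"].items():
        print(node_id, "heap used:", node["jvm"]["mem"]["heap_used_percent"], "%")

    # A longer per-request timeout keeps a bulk call alive through a long pause.
    # 'actions' is the same generator of bulk actions as in the earlier sketch.
    bulk(es, actions, request_timeout=120)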

I didn't have this problem with ES 2.4. I'm using ES 5.4 because I read that it now divides the heap among the indexes more effectively.

Am I doing anything wrong? What should I change in order to keep the performance high during the entire indexing process?

PS: the same question is posted on Stack Overflow: Elasticsearch 5 stops during long indexing

Thanks!
