We use Elasticsearch to store log data. Currently we have 16 data nodes and 3 master nodes. Each data node has 24 CPU cores, 12 x 2.6 TB disks, and 32 GB of memory, and we give half of the memory to the Elasticsearch heap.
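(A quick way to double-check the heap each node actually got is the cat nodes API; a minimal sketch in Python with the requests library, assuming HTTP is enabled on port 9200 of any node:)

```python
import requests

ES = "http://10.83.56.10:9200"  # any node with HTTP enabled (port is an assumption)

# Print each node's configured max heap next to its total RAM
nodes = requests.get(ES + "/_cat/nodes",
                     params={"h": "name,heap.max,ram.max", "format": "json"}).json()
for n in nodes:
    print(n["name"], "heap", n["heap.max"], "of", n["ram.max"], "RAM")
```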
We use daily index names like bussness-2018.01.01. When the shard count grew to about 27,000, GC started to behave like this:
[2018-01-08T13:32:13,437][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295560] overhead, spent [15.4s] collecting in the last [16.2s]
[2018-01-08T13:32:53,044][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295585][122299] duration [15.3s], collections [2]/[15.4s], total [15.3s]/[12.2h], memory [13.7gb]->[12.4gb]/[13.8gb], all_pools {[young] [1.1gb]->[2.4mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.4gb]/[12.5gb]}
[2018-01-08T13:32:53,044][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295585] overhead, spent [15.3s] collecting in the last [15.4s]
[2018-01-08T13:33:16,008][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295593][122300] duration [14.4s], collections [1]/[15.9s], total [14.4s]/[12.2h], memory [13.6gb]->[12.7gb]/[13.8gb], all_pools {[young] [1gb]->[253.4mb]/[1.1gb]}{[survivor] [131.8mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.4gb]/[12.5gb]}
[2018-01-08T13:33:16,023][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295593] overhead, spent [14.9s] collecting in the last [15.9s]
[2018-01-08T13:33:33,060][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295596][122301] duration [14.7s], collections [1]/[15s], total [14.7s]/[12.2h], memory [13.4gb]->[12.2gb]/[13.8gb], all_pools {[young] [859.2mb]->[2.5mb]/[1.1gb]}{[survivor] [113.3mb]->[0b]/[149.7mb]}{[old] [12.5gb]->[12.1gb]/[12.5gb]}
[2018-01-08T13:33:33,060][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295596] overhead, spent [14.7s] collecting in the last [15s]
[2018-01-08T13:35:32,803][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295700][122316] duration [15.8s], collections [1]/[16.2s], total [15.8s]/[12.2h], memory [13.4gb]->[12.5gb]/[13.8gb], all_pools {[young] [985.1mb]->[163.6mb]/[1.1gb]}{[survivor] [125mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.4gb]/[12.5gb]}
[2018-01-08T13:35:32,804][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295700] overhead, spent [15.8s] collecting in the last [16.2s]
[2018-01-08T13:35:49,828][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295703][122317] duration [14.7s], collections [1]/[15s], total [14.7s]/[12.2h], memory [13gb]->[12.1gb]/[13.8gb], all_pools {[young] [448.8mb]->[27mb]/[1.1gb]}{[survivor] [116.6mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.1gb]/[12.5gb]}
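(For reference, the total shard count can be checked with the cluster health API; a minimal sketch, same assumptions as above:)

```python
import requests

ES = "http://10.83.56.10:9200"  # any node with HTTP enabled (port is an assumption)

health = requests.get(ES + "/_cluster/health").json()
# active_shards counts primaries plus replicas across the whole cluster
print(health["active_shards"], "active shards on",
      health["number_of_data_nodes"], "data nodes")
```

With 16 data nodes, 27,000 shards works out to roughly 1,700 shards per data node.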
Then we deleted some indices to bring the shard count down to about 17,000, and everything was OK. But two days later the same problem occurred. So we had to merge some small indices into bigger ones (sketched below), which brought the number of shards down to about 9,000. That went well for about 3 days, then the GC problems showed up again. We deleted more indices to get down to about 7,000, and for now it's OK.
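The merge step boils down to reindexing several small daily indices into one bigger index and then deleting the small ones. A rough sketch of that kind of call (the index names here are placeholders, not our real ones):

```python
import requests

ES = "http://10.83.56.10:9200"  # any node with HTTP enabled (port is an assumption)

# Placeholder example: combine three small daily indices into one monthly index
body = {
    "source": {"index": ["bussness-2018.01.01",
                         "bussness-2018.01.02",
                         "bussness-2018.01.03"]},
    "dest": {"index": "bussness-2018.01"},
}
task = requests.post(ES + "/_reindex",
                     params={"wait_for_completion": "false"}, json=body).json()
print(task)  # task id, can be polled with the tasks API

# Once the reindex has been verified, the small source indices are deleted,
# which is what actually frees their shards:
# requests.delete(ES + "/bussness-2018.01.01,bussness-2018.01.02,bussness-2018.01.03")
```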
How should we improve the GC performance? We can't keep deleting indices.
And here are the JVM stats of a freshly restarted data node:
{
"_nodes" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"cluster_name" : "myesdb",
"nodes" : {
"pw4f0oWfT9KGSGi84AKm4w" : {
"timestamp" : 1515421426980,
"name" : "es-data-m-9",
"transport_address" : "10.83.56.10:9300",
"host" : "10.83.56.10",
"ip" : "10.83.56.10:9300",
"roles" : [
"data",
"ingest"
],
"jvm" : {
"timestamp" : 1515421426981,
"uptime_in_millis" : 11791868,
"mem" : {
"heap_used_in_bytes" : 10351075064,
"heap_used_percent" : 69,
"heap_committed_in_bytes" : 14875361280,
"heap_max_in_bytes" : 14875361280,
"non_heap_used_in_bytes" : 145224888,
"non_heap_committed_in_bytes" : 152145920,
"pools" : {
"young" : {
"used_in_bytes" : 988229576,
"max_in_bytes" : 1256259584,
"peak_used_in_bytes" : 1256259584,
"peak_max_in_bytes" : 1256259584
},
"survivor" : {
"used_in_bytes" : 154718320,
"max_in_bytes" : 157024256,
"peak_used_in_bytes" : 157024256,
"peak_max_in_bytes" : 157024256
},
"old" : {
"used_in_bytes" : 9208127168,
"max_in_bytes" : 13462077440,
"peak_used_in_bytes" : 13088728336,
"peak_max_in_bytes" : 13462077440
}
}
},
"threads" : {
"count" : 272,
"peak_count" : 314
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 2845,
"collection_time_in_millis" : 168440
},
"old" : {
"collection_count" : 981,
"collection_time_in_millis" : 202767
}
}
},
"buffer_pools" : {
"direct" : {
"count" : 274,
"used_in_bytes" : 815798639,
"total_capacity_in_bytes" : 815798638
},
"mapped" : {
"count" : 22171,
"used_in_bytes" : 1962403489862,
"total_capacity_in_bytes" : 1962403489862
}
},
"classes" : {
"current_loaded_count" : 12300,
"total_loaded_count" : 12511,
"total_unloaded_count" : 211
}
}
}
}
}
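(The stats above come from the node stats API; they can be pulled with something like this, same assumptions as before:)

```python
import requests

ES = "http://10.83.56.10:9200"  # any node with HTTP enabled (port is an assumption)

# JVM stats for a single data node, filtered by node name
stats = requests.get(ES + "/_nodes/es-data-m-9/stats/jvm").json()
for node in stats["nodes"].values():
    print(node["name"], node["jvm"]["mem"]["heap_used_percent"], "% heap used")
```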