Uncontrolled merge process on a few indices

Hi,
We have a 7-node cluster running ES version 5.2.1. We have observed that the internal ES merge process sometimes goes berserk on a couple of indices: it runs for more than 3 days and consumes server resources (such as CPU) until they max out. I am interested in knowing what might be causing such an uncontrolled merge. Both of these indices have certain other types as their children.
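For reference, the hot threads and index stats linked below were collected roughly like this (a sketch, assuming the cluster is reachable on localhost:9200; the index name is a placeholder for one of the affected indices):

# Hottest threads on every node; the merge threads show up here
curl -s 'localhost:9200/_nodes/hot_threads?threads=10'

# Merge and segment statistics for the affected index
curl -s 'localhost:9200/my_index/_stats/merge,segments?pretty'
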
Here is the output of the hot threads API:

https://pastebin.com/A0ZcF3P3

Here are the index stats:

https://pastebin.com/KB2hT0Yv

Are there any pointers on where I should be looking? Thanks.

What is the specification of the cluster? What kind of storage are you using? How frequently do you refresh the index?

Hi, thanks for your response. Here is the output of the cluster stats command:
{
  "_nodes" : {
    "total" : 8,
    "successful" : 8,
    "failed" : 0
  },
  "cluster_name" : "SearchCluster",
  "timestamp" : 1544426303031,
  "status" : "green",
  "indices" : {
    "count" : 30,
    "shards" : {
      "total" : 68,
      "primaries" : 68,
      "replication" : 0.0,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 12,
          "avg" : 2.2666666666666666
        },
        "primaries" : {
          "min" : 1,
          "max" : 12,
          "avg" : 2.2666666666666666
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.0,
          "avg" : 0.0
        }
      }
    },
    "docs" : {
      "count" : 8034382162,
      "deleted" : 446758535
    },
    "store" : {
      "size" : "1.4tb",
      "size_in_bytes" : 1548871164733,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "1.1gb",
      "memory_size_in_bytes" : 1231750200,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "10.7gb",
      "memory_size_in_bytes" : 11498162014,
      "total_count" : 8145076647,
      "hit_count" : 1328346575,
      "miss_count" : 6816730072,
      "cache_size" : 170525,
      "cache_count" : 122852859,
      "evictions" : 122682334
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1691,
      "memory" : "3.1gb",
      "memory_in_bytes" : 3360026726,
      "terms_memory" : "2.7gb",
      "terms_memory_in_bytes" : 2899292132,
      "stored_fields_memory" : "128.1mb",
      "stored_fields_memory_in_bytes" : 134352368,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "5mb",
      "norms_memory_in_bytes" : 5327360,
      "points_memory" : "54.9mb",
      "points_memory_in_bytes" : 57603434,
      "doc_values_memory" : "251.2mb",
      "doc_values_memory_in_bytes" : 263451432,
      "index_writer_memory" : "294.1mb",
      "index_writer_memory_in_bytes" : 308432515,
      "version_map_memory" : "1.7mb",
      "version_map_memory_in_bytes" : 1839883,
      "fixed_bit_set" : "1.5gb",
      "fixed_bit_set_memory_in_bytes" : 1650422272,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 8,
      "data" : 7,
      "coordinating_only" : 1,
      "master" : 7,
      "ingest" : 7
    },
    "versions" : [
      "5.2.1"
    ],
    "os" : {
      "available_processors" : 88,
      "allocated_processors" : 88,
      "names" : [
        {
          "name" : "Linux",
          "count" : 8
        }
      ],
      "mem" : {
        "total" : "239.9gb",
        "total_in_bytes" : 257681920000,
        "free" : "4.1gb",
        "free_in_bytes" : 4407902208,
        "used" : "235.8gb",
        "used_in_bytes" : 253274017792,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 121
      },
      "open_file_descriptors" : {
        "min" : 441,
        "max" : 1125,
        "avg" : 629
      }
    },
    "jvm" : {
      "max_uptime" : "96.7d",
      "max_uptime_in_millis" : 8360773114,
      "versions" : [
        {
          "version" : "1.8.0_161",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.161-b14",
          "vm_vendor" : "Oracle Corporation",
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used" : "64.7gb",
        "heap_used_in_bytes" : 69473495008,
        "heap_max" : "145.3gb",
        "heap_max_in_bytes" : 156103606272
      },
      "threads" : 1073
    },
    "fs" : {
      "total" : "4.1tb",
      "total_in_bytes" : 4596416278528,
      "free" : "2.7tb",
      "free_in_bytes" : 3020499034112,
      "available" : "2.5tb",
      "available_in_bytes" : 2797994000384
    },
    "plugins" : ,
    "network_types" : {
      "transport_types" : {
        "netty4" : 8
      },
      "http_types" : {
        "netty4" : 8
      }
    }
  }
}

  1. We are using EBS volumes (SSDs) on AWS for the disks. I checked the disk stats (iowait); they do not show any kind of saturation.

  2. As for the refresh interval, we have not changed the defaults that come with ES. On one of the troublesome indices we receive ~50 bulk indexing requests per minute, which translates to ~100 documents per minute. The rough commands behind both checks are sketched after this list.
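
A sketch of those checks (assuming access on localhost:9200; the index name is a placeholder):

# Disk stats on each data node; look at %util, await and iowait
iostat -x 5

# Confirm the refresh interval on the troublesome index (the ES default is 1s)
curl -s 'localhost:9200/my_index/_settings?pretty'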

What kind of storage are you using? How frequently do you refresh the index?

Hi, I have updated my answer above.

Additionally, continuously hitting the hot threads endpoint reveals these stack traces as the common ones. Could this be the culprit?

org.apache.lucene.util.LongValues.get(LongValues.java:45)
org.apache.lucene.index.SingletonSortedNumericDocValues.setDocument(SingletonSortedNumericDocValues.java:52)
org.apache.lucene.codecs.DocValuesConsumer$SortedNumericDocValuesSub.nextDoc(DocValuesConsumer.java:449)
org.apache.lucene.index.DocIDMerger$SequentialDocIDMerger.next(DocIDMerger.java:100)
org.apache.lucene.codecs.DocValuesConsumer$3$1.setNext(DocValuesConsumer.java:511)
org.apache.lucene.codecs.DocValuesConsumer$3$1.hasNext(DocValuesConsumer.java:491)
org.apache.lucene.codecs.DocValuesConsumer.isSingleValued(DocValuesConsumer.java:998)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addSortedNumericField(Lucene54DocValuesConsumer.java:586)
org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)

*****************************************************************************

org.apache.lucene.codecs.DocValuesConsumer$10$1.next(DocValuesConsumer.java:1024)
org.apache.lucene.codecs.DocValuesConsumer$10$1.next(DocValuesConsumer.java:1015)
java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1812)
java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
java.util.stream.StreamSpliterators$WrappingSpliterator$$Lambda$1481/1465281058.getAsBoolean(Unknown Source)
java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:169)
java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addNumericField(Lucene54DocValuesConsumer.java:105)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addNumericField(Lucene54DocValuesConsumer.java:296)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addNumericField(Lucene54DocValuesConsumer.java:89)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addSortedNumericField(Lucene54DocValuesConsumer.java:589)
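
In case it helps, this is how I am watching the ongoing merges and segment counts while those stacks keep appearing (a sketch, assuming access on localhost:9200; the index name is a placeholder):

# Per-shard merge stats: current merges, total merge time, throttled time
curl -s 'localhost:9200/my_index/_stats/merge?level=shards&pretty'

# Segment breakdown for the index: sizes, live and deleted doc counts
curl -s 'localhost:9200/_cat/segments/my_index?v&h=shard,segment,size,docs.count,docs.deleted'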

@Christian_Dahlqvist, I request you to have a look. Is there any additional info I can provide?

@dadoonet, as an active member, it would be helpful if you could take a look at this as well.

Are you performing a lot of updates? Do you have any non-default node or index settings?

We are not performing a lot of updates. As I mentioned above, we are indexing ~100 documents per minute.
There is nothing out of the ordinary in our index settings. I am putting them here:
{
  "index.name.primary" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "type" : "primary"
            }
          }
        },
        "mapping" : {
          "ignore_malformed" : "true"
        },
        "refresh_interval" : "1s",
        "number_of_shards" : "1",
        "provided_name" : "index.name.primary",
        "max_result_window" : "1000000",
        "mapper" : {
          "dynamic" : "false"
        },
        "creation_date" : "1505657680807",
        "number_of_replicas" : "0",
        "uuid" : "LMCLskY4S4iPooQKY2pIzQ",
        "version" : {
          "created" : "5020199"
        }
      }
    }
  }
}

Node settings are also the defaults.
If you check the thread above, there are two methods in the hot threads output that catch my attention, those being mergeSortedNumericField and addSortedNumericField. They are present in almost all of the stack traces. Can these lead us to the issue? (The merge-tuning settings I am considering as a workaround are sketched below.)
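
If nothing else turns up, these are the merge-related knobs I am considering; the values are illustrative, not what we run today, and the index name is a placeholder:

# Refresh less often so indexing produces fewer tiny segments to merge
curl -s -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{ "index.refresh_interval": "30s" }'

# Cap concurrent merge threads per shard if merges keep saturating CPU
curl -s -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{ "index.merge.scheduler.max_thread_count": 1 }'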

What type of storage do you have?

We are using General Purpose (gp2) volumes on AWS EBS.
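
Since gp2 volumes have a burst-credit model, I am also keeping an eye on whether they run out of burst credits during the long merges (a sketch; the volume ID and time range are placeholders):

# BurstBalance of 0% means the volume is throttled down to its baseline IOPS
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistics Average --period 300 \
  --start-time 2018-12-09T00:00:00Z --end-time 2018-12-10T00:00:00Z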

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.