Uncontrolled merge process on a few indices

Hi,
We have a 7-node cluster running ES version 5.2.1. We have observed that the internal ES merge process sometimes goes berserk on a couple of indices: it runs for more than 3 days and consumes server resources (such as CPU) until they max out. I am interested in knowing what might be causing such an uncontrolled merge. Both of these indices have certain other types as their children.
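For reference, the hot threads and index stats linked below were collected roughly like this (a sketch, assuming the cluster is reachable on localhost:9200; the index name is a placeholder for one of the affected indices):

# Hottest threads on every node; the merge threads show up here
curl -s 'localhost:9200/_nodes/hot_threads?threads=10'

# Merge and segment statistics for the affected index
curl -s 'localhost:9200/my_index/_stats/merge,segments?pretty'
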
Here is the output of the hot threads API:

https://pastebin.com/A0ZcF3P3

Here are the index stats:

https://pastebin.com/KB2hT0Yv

Are there any pointers on where I should be looking? Thanks.

What is the specification of the cluster? What kind of storage are you using? How frequently do you refresh the index?

Hi, thanks for your response. Here is the output of the cluster stats command:
{
  "_nodes" : {
    "total" : 8,
    "successful" : 8,
    "failed" : 0
  },
  "cluster_name" : "SearchCluster",
  "timestamp" : 1544426303031,
  "status" : "green",
  "indices" : {
    "count" : 30,
    "shards" : {
      "total" : 68,
      "primaries" : 68,
      "replication" : 0.0,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 12,
          "avg" : 2.2666666666666666
        },
        "primaries" : {
          "min" : 1,
          "max" : 12,
          "avg" : 2.2666666666666666
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.0,
          "avg" : 0.0
        }
      }
    },
    "docs" : {
      "count" : 8034382162,
      "deleted" : 446758535
    },
    "store" : {
      "size" : "1.4tb",
      "size_in_bytes" : 1548871164733,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "1.1gb",
      "memory_size_in_bytes" : 1231750200,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "10.7gb",
      "memory_size_in_bytes" : 11498162014,
      "total_count" : 8145076647,
      "hit_count" : 1328346575,
      "miss_count" : 6816730072,
      "cache_size" : 170525,
      "cache_count" : 122852859,
      "evictions" : 122682334
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1691,
      "memory" : "3.1gb",
      "memory_in_bytes" : 3360026726,
      "terms_memory" : "2.7gb",
      "terms_memory_in_bytes" : 2899292132,
      "stored_fields_memory" : "128.1mb",
      "stored_fields_memory_in_bytes" : 134352368,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "5mb",
      "norms_memory_in_bytes" : 5327360,
      "points_memory" : "54.9mb",
      "points_memory_in_bytes" : 57603434,
      "doc_values_memory" : "251.2mb",
      "doc_values_memory_in_bytes" : 263451432,
      "index_writer_memory" : "294.1mb",
      "index_writer_memory_in_bytes" : 308432515,
      "version_map_memory" : "1.7mb",
      "version_map_memory_in_bytes" : 1839883,
      "fixed_bit_set" : "1.5gb",
      "fixed_bit_set_memory_in_bytes" : 1650422272,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 8,
      "data" : 7,
      "coordinating_only" : 1,
      "master" : 7,
      "ingest" : 7
    },
    "versions" : [
      "5.2.1"
    ],
    "os" : {
      "available_processors" : 88,
      "allocated_processors" : 88,
      "names" : [
        {
          "name" : "Linux",
          "count" : 8
        }
      ],
      "mem" : {
        "total" : "239.9gb",
        "total_in_bytes" : 257681920000,
        "free" : "4.1gb",
        "free_in_bytes" : 4407902208,
        "used" : "235.8gb",
        "used_in_bytes" : 253274017792,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 121
      },
      "open_file_descriptors" : {
        "min" : 441,
        "max" : 1125,
        "avg" : 629
      }
    },
    "jvm" : {
      "max_uptime" : "96.7d",
      "max_uptime_in_millis" : 8360773114,
      "versions" : [
        {
          "version" : "1.8.0_161",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.161-b14",
          "vm_vendor" : "Oracle Corporation",
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used" : "64.7gb",
        "heap_used_in_bytes" : 69473495008,
        "heap_max" : "145.3gb",
        "heap_max_in_bytes" : 156103606272
      },
      "threads" : 1073
    },
    "fs" : {
      "total" : "4.1tb",
      "total_in_bytes" : 4596416278528,
      "free" : "2.7tb",
      "free_in_bytes" : 3020499034112,
      "available" : "2.5tb",
      "available_in_bytes" : 2797994000384
    },
    "plugins" : ,
    "network_types" : {
      "transport_types" : {
        "netty4" : 8
      },
      "http_types" : {
        "netty4" : 8
      }
    }
  }
}

  1. We are using EBS volumes (SSDs) on AWS for the disks. I checked the disk stats (iowait); they do not show any kind of saturation.

  2. As for the refresh interval, we have not changed the defaults that come with ES. On one of the troublesome indices we receive ~50 bulk indexing requests per minute, which translates to ~100 documents per minute. The rough commands behind both checks are sketched after this list.
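
A sketch of those checks (assuming access on localhost:9200; the index name is a placeholder):

# Disk stats on each data node; look at %util, await and iowait
iostat -x 5

# Confirm the refresh interval on the troublesome index (the ES default is 1s)
curl -s 'localhost:9200/my_index/_settings?pretty'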

What kind of storage are you using? How frequently do you refresh the index?

Hi, I have updated my answer above.

Additionally, continuously hitting the hot threads endpoint reveals these stack traces as the common ones. Could this be the culprit?

org.apache.lucene.util.LongValues.get(LongValues.java:45)
org.apache.lucene.index.SingletonSortedNumericDocValues.setDocument(SingletonSortedNumericDocValues.java:52)
org.apache.lucene.codecs.DocValuesConsumer$SortedNumericDocValuesSub.nextDoc(DocValuesConsumer.java:449)
org.apache.lucene.index.DocIDMerger$SequentialDocIDMerger.next(DocIDMerger.java:100)
org.apache.lucene.codecs.DocValuesConsumer$3$1.setNext(DocValuesConsumer.java:511)
org.apache.lucene.codecs.DocValuesConsumer$3$1.hasNext(DocValuesConsumer.java:491)
org.apache.lucene.codecs.DocValuesConsumer.isSingleValued(DocValuesConsumer.java:998)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addSortedNumericField(Lucene54DocValuesConsumer.java:586)
org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)

*****************************************************************************

org.apache.lucene.codecs.DocValuesConsumer$10$1.next(DocValuesConsumer.java:1024)
org.apache.lucene.codecs.DocValuesConsumer$10$1.next(DocValuesConsumer.java:1015)
java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1812)
java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
java.util.stream.StreamSpliterators$WrappingSpliterator$$Lambda$1481/1465281058.getAsBoolean(Unknown Source)
java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:169)
java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addNumericField(Lucene54DocValuesConsumer.java:105)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addNumericField(Lucene54DocValuesConsumer.java:296)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addNumericField(Lucene54DocValuesConsumer.java:89)
org.apache.lucene.codecs.lucene54.Lucene54DocValuesConsumer.addSortedNumericField(Lucene54DocValuesConsumer.java:589)
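
In case it helps, this is how I am watching the ongoing merges and segment counts while those stacks keep appearing (a sketch, assuming access on localhost:9200; the index name is a placeholder):

# Per-shard merge stats: current merges, total merge time, throttled time
curl -s 'localhost:9200/my_index/_stats/merge?level=shards&pretty'

# Segment breakdown for the index: sizes, live and deleted doc counts
curl -s 'localhost:9200/_cat/segments/my_index?v&h=shard,segment,size,docs.count,docs.deleted'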

@Christian_Dahlqvist, I request you to have a look. Is there any additional info I can provide?

@dadoonet, as an active member, it would be helpful if you could take a look at this as well.

Are you performing a lot of updates? Do you have any non-default node or index settings?

We are not performing a lot of updates. As I mentioned above, we are indexing ~100 documents per minute.
There is nothing out of the ordinary in our index settings. I am putting them here:
{
  "index.name.primary" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "type" : "primary"
            }
          }
        },
        "mapping" : {
          "ignore_malformed" : "true"
        },
        "refresh_interval" : "1s",
        "number_of_shards" : "1",
        "provided_name" : "index.name.primary",
        "max_result_window" : "1000000",
        "mapper" : {
          "dynamic" : "false"
        },
        "creation_date" : "1505657680807",
        "number_of_replicas" : "0",
        "uuid" : "LMCLskY4S4iPooQKY2pIzQ",
        "version" : {
          "created" : "5020199"
        }
      }
    }
  }
}

Node settings are also the defaults.
If you check the thread above, there are two methods in the hot threads output that catch my attention, those being mergeSortedNumericField and addSortedNumericField. They are present in almost all of the stack traces. Can these lead us to the issue? (The merge-tuning settings I am considering as a workaround are sketched below.)
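
If nothing else turns up, these are the merge-related knobs I am considering; the values are illustrative, not what we run today, and the index name is a placeholder:

# Refresh less often so indexing produces fewer tiny segments to merge
curl -s -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{ "index.refresh_interval": "30s" }'

# Cap concurrent merge threads per shard if merges keep saturating CPU
curl -s -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{ "index.merge.scheduler.max_thread_count": 1 }'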

What type of storage do you have?

We are using General Purpose (gp2) volumes on AWS EBS.
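
Since gp2 volumes have a burst-credit model, I am also keeping an eye on whether they run out of burst credits during the long merges (a sketch; the volume ID and time range are placeholders):

# BurstBalance of 0% means the volume is throttled down to its baseline IOPS
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistics Average --period 300 \
  --start-time 2018-12-09T00:00:00Z --end-time 2018-12-10T00:00:00Z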

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.