Hi,
Background:
Over the past 2 months, 4 of the servers in our 18-server cluster have been misbehaving (their GCs take 3 times longer).
After digging deep into the cluster's memory usage, we noticed that _parent field cache usage is high on exactly the problematic servers.
The _parent field cache is a result of the parent-child relationship. ElasticSearch holds the parent-child mapping (one to many) in memory, which means it keeps every parent’s _id string in memory along with the corresponding children.
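For anyone who wants to reproduce this kind of per-node comparison, here is a minimal Python sketch. It assumes the node stats endpoint with a `fields=_parent` breakdown, and a response shaped as `nodes.<id>.indices.fielddata.fields._parent.memory_size_in_bytes`. Both are assumptions to verify against your own 1.6.0 cluster; the helper names are hypothetical and not part of any client library.

```python
import json
from urllib.request import urlopen

# Assumed endpoint: node stats limited to fielddata, broken down for _parent.
STATS_PATH = "/_nodes/stats/indices/fielddata?fields=_parent"


def fetch_node_stats(base_url):
    """Fetch per-node fielddata stats; base_url is e.g. "http://localhost:9200"."""
    with urlopen(base_url.rstrip("/") + STATS_PATH) as resp:
        return json.loads(resp.read().decode("utf-8"))


def parent_fielddata_by_node(stats):
    """Return (node_name, bytes) pairs sorted by _parent fielddata size, largest first."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        fielddata = node.get("indices", {}).get("fielddata", {})
        parent = fielddata.get("fields", {}).get("_parent", {})
        # Nodes that report nothing for _parent are counted as 0 bytes.
        rows.append((node.get("name", node_id), parent.get("memory_size_in_bytes", 0)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `parent_fielddata_by_node(fetch_node_stats("http://localhost:9200"))` should list the nodes with the largest _parent cache first, which makes it easy to check whether they are the same nodes that show the long GCs.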
We tried:
• Analyzing how children and parents are spread across shards, to detect an anomaly. No anomaly was found.
• Removing the data from the problematic servers and letting it replicate back.
• Removing an entire server and recreating it.
• Switching shards between a problematic server and a healthy one.
None of the above worked. What raised a red flag was the last attempt, which made the healthy server problematic but did not make the problematic one healthy. I started wondering whether this cache behaves differently from the rest of the field data and is never cleaned up.
I searched for a clear-cache API for the _parent field, found one, and ran it against the cluster. As expected, this cleared the cache.
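The post does not say which call was used, so the following is only a sketch of one plausible form: the clear cache API restricted to fielddata for the `_parent` field. The `fielddata` and `fields` query parameters are assumed from the 1.x clear cache documentation, and the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def clear_cache_url(base_url, fields=("_parent",), index=None):
    """Build the clear-cache URL, optionally scoped to a single index."""
    prefix = base_url.rstrip("/")
    if index:
        prefix += "/" + index
    # Assumed parameters: clear only fielddata, and only for the given fields.
    query = urlencode({"fielddata": "true", "fields": ",".join(fields)})
    return prefix + "/_cache/clear?" + query


def clear_parent_cache(base_url, index=None):
    """POST the clear-cache request and return the raw response body."""
    request = Request(clear_cache_url(base_url, index=index), data=b"", method="POST")
    with urlopen(request) as resp:
        return resp.read().decode("utf-8")
```

Scoping the call to one index first (the `index` argument) is a cautious way to try it, since clearing fielddata cluster-wide forces it to be rebuilt on the next queries that need it.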
I expected the cache to spike again as soon as a parent-child query ran, but that did not happen.
Since yesterday we have run thousands of parent-child queries, but none of them has made the cache grow.
Can you please help me dig deeper into this case?
- ElasticSearch version 1.6.0
Thank you.