Elasticsearch performance issue (possibly too large filter cache)

I've noticed a performance issue that's stayed with us since version 19.8.
(We're currently using 0.90.0.) Elasticsearch operates normally for a
while, but CPU utilization steadily climbs on the nodes it's installed on
until it pegs at about 40-50%. (This is averaged over time by our
monitors.) At that point, after a day or so, nodes begin timing out and
dropping out of the cluster.

Just to cut to the chase: after getting some useful output from
hot_threads, I suspect our filter cache size is too large. The filter cache
size is set to 20% of the 11G of heap allocated to Elasticsearch, so it can
grow to roughly 2.2G, and the nodes that are struggling have larger filter
caches. Is it reasonable to turn the filter cache size down? Will a node
struggle anyway once the cache reaches the new, lower maximum? I suspect
the answers to those questions are "yes" and "no" respectively, but I
thought I'd ask before I cranked things down. These are production
machines, so I can't just clear the filter cache right now to observe the
behavior, but I will soon.
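
For reference, a node-stats query along these lines is enough to compare
filter cache sizes across nodes. It's only a sketch: it assumes a node
reachable on localhost:9200 and the 0.90.x stats layout, where each node's
"indices" stats carry a "filter_cache" section with memory_size and
evictions; adjust the field names if your version returns something else.

import json
import urllib.request

HOST = "http://localhost:9200"

def filter_cache_by_node():
    # Node stats; on 0.90.x the per-node indices stats include filter_cache.
    with urllib.request.urlopen(HOST + "/_nodes/stats") as resp:
        stats = json.load(resp)
    out = {}
    for node_id, node in stats["nodes"].items():
        cache = node.get("indices", {}).get("filter_cache", {})
        out[node.get("name", node_id)] = (
            cache.get("memory_size_in_bytes", 0),
            cache.get("evictions", 0),
        )
    return out

if __name__ == "__main__":
    for name, (size, evictions) in sorted(filter_cache_by_node().items()):
        print("%-25s filter_cache=%dmb evictions=%d"
              % (name, size // (1024 * 1024), evictions))

In our case the struggling nodes are simply the ones with the larger
filter_cache numbers.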

A full restart of every node in the cluster fixes things, at which point
the CPU utilization goes down to about where it started. After a few weeks
or months, the process repeats. As the data on the nodes grows, the time
before a restart is required shrinks, though adding more nodes wins some of
that time back. At the moment, we have 22,625,304 documents, or about
77.1gb of data
(replicated twice to 231.6gb). We have this divided into 12 shards
(replicated twice to 36 shards) over 10 nodes.

I see a few changes in the 0.90.5 version of IndicesFilterCache.java, which
appears to be the troublesome class. Those changes look like refactors,
though, and it would be nice to know whether the problem has been fixed in
a later version before upgrading.

I say that IndicesFilterCache.java is likely the culprit because I see that
the 'generic' thread is always consuming about 20% of the CPU on an
affected machine, node 11:

https://gist.github.com/joshbronson/7089089#file-ps-txt

Meanwhile, it's also about 20% on another machine experiencing trouble and
about 11% on another one that is approaching trouble. We get vastly
different results with each query to hot_threads, but here is one that
shows the 'generic' thread as the hottest thread for node 11:

https://gist.github.com/joshbronson/7089089#file-first_run-txt

Every time 'generic' shows up as the hottest thread, the stack trace points
to the same place.
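
Since single hot_threads snapshots vary so much, something along these
lines makes it easy to grab several in a row and compare them. Again just a
sketch: localhost:9200 is assumed, and the threads/interval parameters are
the ones I see documented for 0.90.x.

import time
import urllib.request

HOST = "http://localhost:9200"

def sample_hot_threads(nodes="_all", samples=5, pause=10):
    # Grab several snapshots, since any single run can be misleading.
    for i in range(samples):
        url = ("%s/_nodes/%s/hot_threads?threads=3&interval=500ms"
               % (HOST, nodes))
        with urllib.request.urlopen(url) as resp:
            print("--- sample %d ---" % (i + 1))
            print(resp.read().decode("utf-8"))
        time.sleep(pause)

if __name__ == "__main__":
    sample_hot_threads()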

Just for completeness, here are two other responses from the hot_threads
API for node 11:

https://gist.github.com/joshbronson/7089089#file-second_run-txt
https://gist.github.com/joshbronson/7089089#file-third_run-txt

We've left the merge settings at the default, relaxed the refresh interval
to 120s, and have the following translog settings:

index:
  translog:
    flush_threshold_ops: 5000
    flush_threshold_size: 200mb
    flush_threshold_period: 60s
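
(If it helps anyone reproduce this, here is a sketch of setting those same
values at runtime through the update-index-settings API instead of
elasticsearch.yml. The index name "myindex" and localhost:9200 are
placeholders, and it assumes refresh_interval and the translog flush
thresholds are dynamically updatable, which I believe they are on 0.90.x.)

import json
import urllib.request

HOST = "http://localhost:9200"

def update_index_settings(index, settings):
    # PUT /{index}/_settings with an {"index": {...}} body.
    req = urllib.request.Request(
        "%s/%s/_settings" % (HOST, index),
        data=json.dumps({"index": settings}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(update_index_settings("myindex", {
        "refresh_interval": "120s",
        "translog.flush_threshold_ops": 5000,
        "translog.flush_threshold_size": "200mb",
        "translog.flush_threshold_period": "60s",
    }))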

Thanks for any guidance or ideas as I investigate.

Best,
Josh Bronson


Oh, and just to clarify: 20+% means 20+% utilization of the 4 cores on an
AWS instance. I suspect a single thread is pegging its core, and that this
is what's causing the issue.


Yup, the filter cache was too large. Cranking it down to 700m brought the
CPU utilization down across the cluster within about a minute. I'm not sure
how filter caches get dirty or why they must be cleaned in a
single-threaded manner, but in case anyone else has the same problem, make
sure your cache isn't too large!
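
For anyone who lands here later, the change amounts to something like this.
Treat it as a sketch: localhost:9200 is a placeholder, it assumes your
version accepts indices.cache.filter.size as a transient cluster setting,
and that the clear-cache endpoint takes ?filter=true as it does on 0.90.x.
The static equivalent is "indices.cache.filter.size: 700mb" in
elasticsearch.yml plus a rolling restart.

import json
import urllib.request

HOST = "http://localhost:9200"

def shrink_filter_cache(size="700mb"):
    # Lower the cap cluster-wide; "transient" is lost on a full cluster
    # restart, so use "persistent" if you want it to stick.
    req = urllib.request.Request(
        HOST + "/_cluster/settings",
        data=json.dumps(
            {"transient": {"indices.cache.filter.size": size}}
        ).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def clear_filter_cache():
    # Drop the existing filter cache entries so the effect shows immediately.
    req = urllib.request.Request(HOST + "/_cache/clear?filter=true",
                                 data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(shrink_filter_cache("700mb"))
    print(clear_filter_cache())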
