Elasticsearch performance issue (possibly too large filter cache)

I've noticed a performance issue that's stayed with us since version 19.8.
(We're currently using 0.90.0.) Elasticsearch operates normally for a
while, but CPU utilization steadily climbs on the nodes it's installed on
until it pegs at about 40-50%. (This is averaged over time by our
monitors.) At that point, after a day or so, nodes begin timing out and
dropping out of the cluster.

Just to cut to the chase: after getting some useful output from
hot_threads, I suspect our filter cache size is too large. The filter cache
size is set to 20% of the 11G of heap allocated to Elasticsearch, so it can
grow to roughly 2.2G, and the nodes that are struggling have larger filter
caches. Is it reasonable to turn the filter cache size down? Will a node
struggle anyway once the cache reaches the new, lower maximum? I suspect
the answers to those questions are "yes" and "no" respectively, but I
thought I'd ask before I cranked things down. These are production
machines, so I can't just clear the filter cache right now to observe the
behavior, but I will soon.
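
For reference, a node-stats query along these lines is enough to compare
filter cache sizes across nodes. It's only a sketch: it assumes a node
reachable on localhost:9200 and the 0.90.x stats layout, where each node's
"indices" stats carry a "filter_cache" section with memory_size and
evictions; adjust the field names if your version returns something else.

import json
import urllib.request

HOST = "http://localhost:9200"

def filter_cache_by_node():
    # Node stats; on 0.90.x the per-node indices stats include filter_cache.
    with urllib.request.urlopen(HOST + "/_nodes/stats") as resp:
        stats = json.load(resp)
    out = {}
    for node_id, node in stats["nodes"].items():
        cache = node.get("indices", {}).get("filter_cache", {})
        out[node.get("name", node_id)] = (
            cache.get("memory_size_in_bytes", 0),
            cache.get("evictions", 0),
        )
    return out

if __name__ == "__main__":
    for name, (size, evictions) in sorted(filter_cache_by_node().items()):
        print("%-25s filter_cache=%dmb evictions=%d"
              % (name, size // (1024 * 1024), evictions))

In our case the struggling nodes are simply the ones with the larger
filter_cache numbers.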

A full restart of every node in the cluster fixes things, at which point
the CPU utilization goes down to about where it started. After a few weeks
or months, the process repeats. As the data on the nodes grows, the time
before a restart is required shrinks, though adding more nodes wins some of
that time back. At the moment, we have 22,625,304 documents, or about
77.1gb of data
(replicated twice to 231.6gb). We have this divided into 12 shards
(replicated twice to 36 shards) over 10 nodes.

I see a few changes in the 0.90.5 version of IndicesFilterCache.java, which
appears to be the troublesome class. Those changes look like refactors,
though, and it would be nice to know whether the problem has been fixed in
a later version before upgrading.

I say that IndicesFilterCache.java is likely the culprit because I see that
the 'generic' thread is always consuming about 20% of the CPU on an
affected machine, node 11:

https://gist.github.com/joshbronson/7089089#file-ps-txt

Meanwhile, it's also about 20% on another machine experiencing trouble and
about 11% on another one that is approaching trouble. We get vastly
different results with each query to hot_threads, but here is one that
shows the 'generic' thread as the hottest thread for node 11:

https://gist.github.com/joshbronson/7089089#file-first_run-txt

Every time 'generic' shows up as the hottest thread, the stack trace points
to the same place.
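
Since single hot_threads snapshots vary so much, something along these
lines makes it easy to grab several in a row and compare them. Again just a
sketch: localhost:9200 is assumed, and the threads/interval parameters are
the ones I see documented for 0.90.x.

import time
import urllib.request

HOST = "http://localhost:9200"

def sample_hot_threads(nodes="_all", samples=5, pause=10):
    # Grab several snapshots, since any single run can be misleading.
    for i in range(samples):
        url = ("%s/_nodes/%s/hot_threads?threads=3&interval=500ms"
               % (HOST, nodes))
        with urllib.request.urlopen(url) as resp:
            print("--- sample %d ---" % (i + 1))
            print(resp.read().decode("utf-8"))
        time.sleep(pause)

if __name__ == "__main__":
    sample_hot_threads()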

Just for completeness, here are two other responses from the hot_threads
API for node 11:

https://gist.github.com/joshbronson/7089089#file-second_run-txt
https://gist.github.com/joshbronson/7089089#file-third_run-txt

We've left the merge settings at the default, relaxed the refresh interval
to 120s, and have the following translog settings:

index:
  translog:
    flush_threshold_ops: 5000
    flush_threshold_size: 200mb
    flush_threshold_period: 60s
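
(If it helps anyone reproduce this, here is a sketch of setting those same
values at runtime through the update-index-settings API instead of
elasticsearch.yml. The index name "myindex" and localhost:9200 are
placeholders, and it assumes refresh_interval and the translog flush
thresholds are dynamically updatable, which I believe they are on 0.90.x.)

import json
import urllib.request

HOST = "http://localhost:9200"

def update_index_settings(index, settings):
    # PUT /{index}/_settings with an {"index": {...}} body.
    req = urllib.request.Request(
        "%s/%s/_settings" % (HOST, index),
        data=json.dumps({"index": settings}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(update_index_settings("myindex", {
        "refresh_interval": "120s",
        "translog.flush_threshold_ops": 5000,
        "translog.flush_threshold_size": "200mb",
        "translog.flush_threshold_period": "60s",
    }))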

Thanks for any guidance or ideas as I investigate.

Best,
Josh Bronson


Oh, and just to clarify: 20+% means 20+% utilization of the 4 cores on an
AWS instance. I suspect a single thread is pegging its core, and that this
is what's causing the issue.


Yup, the filter cache was too large. Cranking it down to 700m brought the
CPU utilization down across the cluster within about a minute. I'm not sure
how filter caches get dirty or why they must be cleaned in a
single-threaded manner, but in case anyone else has the same problem, make
sure your cache isn't too large!
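
For anyone who lands here later, the change amounts to something like this.
Treat it as a sketch: localhost:9200 is a placeholder, it assumes your
version accepts indices.cache.filter.size as a transient cluster setting,
and that the clear-cache endpoint takes ?filter=true as it does on 0.90.x.
The static equivalent is "indices.cache.filter.size: 700mb" in
elasticsearch.yml plus a rolling restart.

import json
import urllib.request

HOST = "http://localhost:9200"

def shrink_filter_cache(size="700mb"):
    # Lower the cap cluster-wide; "transient" is lost on a full cluster
    # restart, so use "persistent" if you want it to stick.
    req = urllib.request.Request(
        HOST + "/_cluster/settings",
        data=json.dumps(
            {"transient": {"indices.cache.filter.size": size}}
        ).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def clear_filter_cache():
    # Drop the existing filter cache entries so the effect shows immediately.
    req = urllib.request.Request(HOST + "/_cache/clear?filter=true",
                                 data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(shrink_filter_cache("700mb"))
    print(clear_filter_cache())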
