In preparation for upgrading our Prod environment, we've built up our Dev environment to have more processors and memory and are testing out our heavy watcher load to see if all is well there before doing the same upgrade to processors/memory in Prod.
As of now, we have 325+ watchers running in prod every 2 minutes; in Dev that number is closer to 250.
The problem is that in Dev, I'm seeing the circuit breaking exception:
[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.max_compilations_rate] setting.
These settings are the same in both our Dev & Prod environments:
script.max_compilations_rate: 75/5m
cache.max_size: 100
cache.expire: 0
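For reference, here is a sketch of how these look as node settings, assuming the cache values above are the script.cache.max_size and script.cache.expire settings in elasticsearch.yml:

# elasticsearch.yml (per node); values copied from above
script.max_compilations_rate: 75/5m
script.cache.max_size: 100
script.cache.expire: 0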
Prod doesn't get these errors at all. Dev is 3 nodes and Prod is 6 nodes, though Prod has fewer processors (1 per node) than Dev (4 per node), so if anything I would expect Prod to be the one hitting this problem.
Scripting requires a bit of explanation before digging into the errors.
First, script.max_compilations_rate is a per-node setting: each node counts its own compilations and, once it hits that limit, rejects further compilations with this error. Second, there is also an in-memory script cache with a size of 100 and LRU eviction. A lot of different scripts can keep this cache churning, so the max compilation rate gets hit pretty quickly.
Are those watches on your two systems the same? Are they using the same scripts? You can compare the cache evictions of both clusters using the nodes stats. What do they look like?
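Something like this should show them (the field names come from the script section of node stats; the numbers here are just illustrative):

GET _nodes/stats/script

which returns, per node, something like:

"script" : {
  "compilations" : 4233,
  "cache_evictions" : 4168
}

If cache_evictions climbs nearly as fast as compilations, the cache is churning.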
If you set the log level for org.elasticsearch.script to DEBUG (which can be done dynamically, see here), you will see the reason for script cache expiry in the logs.
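Something along these lines does the dynamic change (setting it back to null resets it):

PUT _cluster/settings
{
  "transient" : {
    "logger.org.elasticsearch.script" : "DEBUG"
  }
}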
Just to be sure: do you have the same number of shards for the .watches index on each cluster, and are both clusters on the same version?
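For example:

GET _cat/shards/.watches?v
GET /

The first shows the shard and replica distribution, and the root endpoint returns the version.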
Our nodes in an environment are created from the same definition, so all settings amongst the nodes are the same.
The watchers that are installed in each environment are of the same exact design, using the same stored script.
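The watches are all roughly of this shape, referencing the stored script by id rather than compiling anything inline (the names, index pattern, and threshold below are made up for illustration, not our actual watch):

PUT _xpack/watcher/watch/example_watch
{
  "trigger" : { "schedule" : { "interval" : "2m" } },
  "input" : {
    "search" : {
      "request" : {
        "indices" : [ "some-index-*" ],
        "body" : { "query" : { "match_all" : {} } }
      }
    }
  },
  "condition" : {
    "script" : {
      "id" : "our_stored_condition",
      "params" : { "threshold" : 10 }
    }
  },
  "actions" : {
    "log_it" : {
      "logging" : { "text" : "threshold exceeded" }
    }
  }
}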
Our .watches settings (from GET /_cluster/state) for both environments:
"settings" : {
"index" : {
"format" : "6",
"number_of_shards" : "1",
"priority" : "800",
"auto_expand_replicas" : "0-1",
"number_of_replicas" : "0"
}
The version for elasticsearch is 6.5.4 for both environments.
The main differences between our Prod & Dev environments are:
Prod: 6 nodes, 1 processor per node.
Dev: 3 nodes, 4 processors per node.
Do you think the 6 nodes in Prod give a better replica distribution than our Dev setup, despite the higher processor count in Dev?
The number of processors is an important difference, as it determines the size of the watcher thread pool. With one processor you get a thread pool size of 5; with 4 processors you get 20.
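You can double-check this per node, e.g.:

GET _nodes/thread_pool

The watcher entry in each node's thread pool section should show a size of 5 on your 1-processor Prod nodes and 20 on your 4-processor Dev nodes.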
The distribution is the same: once you have at least two nodes, you should have one primary and one replica shard, so each setup has watches executed on two nodes.
Do you have other queries or ingest processors that also make use of scripting in one of the setups, but not in the other?
No, if anything we have fewer in Dev than in Prod. We're thinking that the number of replicas in Dev (with the lower node count) may be the cause of our problem.
Here is our Dev:
.watches 0 r STARTED 301 1.7mb x.x.x.x rSN194T
.watches 0 p STARTED 301 1.6mb x.x.x.x 1ig2Q3r
Prod:
.watches 0 p STARTED 391 3.1mb x.x.x.x 8P_x2aS
.watches 0 r STARTED 391 3.3mb x.x.x.x 3niQSo1
.watches 0 r STARTED 391 3.2mb x.x.x.x WwzteRf
.watches 0 r STARTED 391 3.2mb x.x.x.x VN63xLs
.watches 0 r STARTED 391 3.1mb x.x.x.x qbVE1BH
.watches 0 r STARTED 391 3.1mb x.x.x.x S8UGpES