Unfounded script recompilations

In preparation for upgrading our Prod environment, we've built up our Dev environment to have more processors and memory and are testing out our heavy watcher load to see if all is well there before doing the same upgrade to processors/memory in Prod.

As of now, we have 325+ watchers running in prod every 2 minutes; in Dev that number is closer to 250.

The problem is that in Dev, I'm seeing the circuit breaking exception:
[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.max_compilations_rate] setting.

Both our Dev & Prod environments use the same settings for this:
script.max_compilations_rate: 75/5m
cache.max_size: 100
cache.expire: 0

Prod doesn't get these errors at all. Dev is 3 nodes and Prod is 6 nodes, though Prod has fewer processors (1 per node) than Dev (4 per node), so I would expect Prod to be the one with these problems.

Does anybody know what I should check next?

Hey Matt,

scripting requires a bit of explanation before digging into the errors.

First, script.max_compilations_rate is a per-node setting. Each node counts its own compilations and, once it hits that limit, rejects further compilations with this error. Second, there is also an in-memory script cache with a size of 100 and LRU eviction. A lot of different scripts may just keep this cache churning, so that the max compilation rate is hit pretty quickly.
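To make the churning point concrete, here is a toy model of an LRU cache that counts how often a script would have to be compiled. It is only a sketch: `count_compilations` and `round_robin` are hypothetical helpers, and the model is a plain LRU, not Elasticsearch's actual cache implementation.

```python
from collections import OrderedDict


def count_compilations(script_ids, cache_size=100):
    """Count cache misses (compilations) for a sequence of script lookups.

    Models a plain LRU cache: a hit refreshes the entry, a miss compiles
    the script and evicts the least recently used entry when full.
    """
    cache = OrderedDict()
    compilations = 0
    for script_id in script_ids:
        if script_id in cache:
            cache.move_to_end(script_id)  # mark as most recently used
            continue
        compilations += 1
        cache[script_id] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)  # evict the least recently used entry
    return compilations


def round_robin(distinct_scripts, rounds):
    """Each round touches every distinct script once, in the same order."""
    return [i for _ in range(rounds) for i in range(distinct_scripts)]


# 100 distinct scripts fit in the cache: each is compiled once.
fits = count_compilations(round_robin(100, rounds=3))
# 101 distinct scripts in cyclic order: every single lookup misses.
churns = count_compilations(round_robin(101, rounds=3))
print(fits, churns)
```

In this model, one script more than the cache can hold is enough to turn every lookup into a compilation when the scripts are used in a fixed cycle, which is roughly how periodically triggered watches behave.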

Are those watches on your two systems the same? Are they using the same scripts? You can compare the cache evictions of both clusters using the nodes stats. What do they look like?
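A minimal sketch of that comparison, assuming the cluster is reachable over plain HTTP without authentication. It reads the `script` section of `GET /_nodes/stats/script` and reports `compilations` and `cache_evictions` per node; the function names and the sample values are made up for illustration.

```python
import json
import urllib.request


def fetch_script_stats(base_url):
    """Fetch GET /_nodes/stats/script (assumes no auth and no TLS setup)."""
    url = base_url.rstrip("/") + "/_nodes/stats/script"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def summarize_script_stats(stats):
    """Return {node_name: (compilations, cache_evictions)}."""
    summary = {}
    for node in stats.get("nodes", {}).values():
        script = node.get("script", {})
        summary[node.get("name", "?")] = (
            script.get("compilations", 0),
            script.get("cache_evictions", 0),
        )
    return summary


# Hand-made sample in the shape of the nodes stats response; values invented.
sample = {
    "nodes": {
        "id-a": {"name": "node-a", "script": {"compilations": 900, "cache_evictions": 800}},
        "id-b": {"name": "node-b", "script": {"compilations": 12, "cache_evictions": 0}},
    }
}

for name, (compilations, evictions) in sorted(summarize_script_stats(sample).items()):
    print(name, "compilations:", compilations, "evictions:", evictions)
```

Against a live cluster you would call `summarize_script_stats(fetch_script_stats("http://localhost:9200"))` on each environment (the URL is a placeholder) and compare the eviction counts.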

If you set the log level for org.elasticsearch.script to DEBUG (which can be done dynamically through the cluster settings API), you will see the reason for script cache expiry in the logs.
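For reference, changing a logger level dynamically means sending a `logger.<name>` key to `PUT /_cluster/settings`. The sketch below only builds and prints the request body; `logger_settings_body` is a hypothetical helper, and using a transient setting is an assumption so the change does not outlive a full cluster restart.

```python
import json


def logger_settings_body(logger_name, level, persistent=False):
    """Build the body for PUT /_cluster/settings that sets one logger level.

    Passing level=None produces a null value, which resets the logger
    to its default level.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {"logger." + logger_name: level}}


body = logger_settings_body("org.elasticsearch.script", "DEBUG")
print("PUT /_cluster/settings")
print(json.dumps(body, indent=2))
```

Remember to reset the logger afterwards (same request with a null value), since DEBUG logging on a busy node can be noisy.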

Just to be sure: do you have the same number of shards for the .watches index on each cluster, and are both clusters on the same version?

--Alex

Our nodes in an environment are created from the same definition, so all settings amongst the nodes are the same.

The watchers that are installed in each environment are of the same exact design, using the same stored script.

Our .watches settings (from GET /_cluster/state) for both environments:
"settings" : {
  "index" : {
    "format" : "6",
    "number_of_shards" : "1",
    "priority" : "800",
    "auto_expand_replicas" : "0-1",
    "number_of_replicas" : "0"
  }
}
The Elasticsearch version is 6.5.4 in both environments.

The main differences between our Prod & Dev environments are:
Prod: 6 nodes, 1 proc.
Dev: 3 nodes, 4 procs.

Do you think replicas are better distributed across the 6 nodes in Prod than across the 3 nodes in Dev, despite the higher processor count in Dev?

The number of processors is an important one, as it determines the size of the thread pool. With one processor you get a thread pool size of 5; with 4 cores you get 20.
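The arithmetic behind those two numbers, taking the 5-threads-per-processor figure from this reply at face value. `watcher_pool_size` is a hypothetical helper; the real calculation in Elasticsearch may apply an upper bound, which this sketch ignores.

```python
def watcher_pool_size(processors, threads_per_processor=5):
    """Thread pool size per node, using the 5-per-processor figure above."""
    return threads_per_processor * processors


for env, procs in (("Prod", 1), ("Dev", 4)):
    print(env, "threads per node:", watcher_pool_size(procs))
```

So a single Dev node can run four times as many watches concurrently as a single Prod node, which also means it can request script compilations four times as fast.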

The distribution is the same: once you have at least two nodes, you should have one primary and one replica shard, so each setup has watches executed on two nodes.

Do you have other queries or ingest processors that also make use of scripting in one of the setups, but not in the other?

No, if anything, we have fewer in Dev than in Prod. We’re thinking that the number of replicas in Dev may be the cause of our concern (with the lower node count).

Matt McGovern


How many shards of the .watches index are in both clusters? Try

GET /_cat/shards/.watches

Here is our Dev:
.watches 0 r STARTED 301 1.7mb x.x.x.x rSN194T
.watches 0 p STARTED 301 1.6mb x.x.x.x 1ig2Q3r

Prod:
.watches 0 p STARTED 391 3.1mb x.x.x.x 8P_x2aS
.watches 0 r STARTED 391 3.3mb x.x.x.x 3niQSo1
.watches 0 r STARTED 391 3.2mb x.x.x.x WwzteRf
.watches 0 r STARTED 391 3.2mb x.x.x.x VN63xLs
.watches 0 r STARTED 391 3.1mb x.x.x.x qbVE1BH
.watches 0 r STARTED 391 3.1mb x.x.x.x S8UGpES
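One way to read the two listings: if watch execution is spread evenly across the nodes that hold a copy of the .watches shard, as the earlier reply suggests, then fewer copies means more watches (and more script compilations) per node. The sketch below parses the output above and divides the document count by the number of started copies. The even-split assumption and the helper names are mine, not something confirmed in this thread.

```python
def parse_cat_shards(text):
    """Parse default-column `_cat/shards` lines into dicts."""
    copies = []
    for line in text.strip().splitlines():
        index, shard, prirep, state, docs, store, ip, node = line.split()
        copies.append({
            "index": index,
            "shard": int(shard),
            "prirep": prirep,   # "p" = primary, "r" = replica
            "state": state,
            "docs": int(docs),
            "node": node,
        })
    return copies


def watches_per_copy(copies):
    """Documents per started shard copy, assuming an even split of watches."""
    started = [c for c in copies if c["state"] == "STARTED"]
    total = max(c["docs"] for c in started)  # every copy holds the same docs
    return total / len(started)


dev = """
.watches 0 r STARTED 301 1.7mb x.x.x.x rSN194T
.watches 0 p STARTED 301 1.6mb x.x.x.x 1ig2Q3r
"""
prod = """
.watches 0 p STARTED 391 3.1mb x.x.x.x 8P_x2aS
.watches 0 r STARTED 391 3.3mb x.x.x.x 3niQSo1
.watches 0 r STARTED 391 3.2mb x.x.x.x WwzteRf
.watches 0 r STARTED 391 3.2mb x.x.x.x VN63xLs
.watches 0 r STARTED 391 3.1mb x.x.x.x qbVE1BH
.watches 0 r STARTED 391 3.1mb x.x.x.x S8UGpES
"""

for name, text in (("Dev", dev), ("Prod", prod)):
    copies = parse_cat_shards(text)
    print(name, "copies:", len(copies), "docs per copy:", round(watches_per_copy(copies), 1))
```

Under that assumption, each Dev node that holds a copy handles more than twice as many watches as a Prod node, and script.max_compilations_rate is enforced per node.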