I'm tuning the fielddata cache of an Elasticsearch cluster by following the wiki page Limiting Memory Usage. I found that the fielddata circuit breaker (indices.breaker.fielddata.limit) does not behave the way I understood the explanation. I expected it to block individual queries that would exceed the limit, but it seems to cap the total size of the fielddata cache instead.
- Elasticsearch version: 1.7 (old legacy system...)
- 8 hosts in the fleet
- Using doc_values for all non-analyzed fields
- Java Heap memory size: 4 GB
- indices.fielddata.cache.size = "2gb";
- indices.breaker.fielddata.limit = "200mb"; // I set it low for testing purposes
1. Cleared the cache with:
   curl -XPOST 'http://localhost:9200/_cache/clear?fielddata=true'
2. Issued a simple sort request on an analyzed field A through Kibana. It successfully loaded about 193.05 MB of fielddata on average, as reported by:
   curl -XGET 'localhost:9200/_cat/fielddata?v&pretty'
3. Cleared the cache again.
4. Issued a sort request on another analyzed field B through Kibana. It successfully loaded about 193.9 MB of fielddata on average. This showed that neither request on its own loaded more than the 200 MB limit.
5. Without clearing the cache, issued the same request as in step 2 (sort on field A). The cluster returned partial data and a shard failure error saying that the fielddata is too large. When I check the cache size, field A is only partially loaded on some hosts, and the combined cache size of fields A and B is close to 200 MB.
Why was the query in step 5 blocked even though the query itself did not exceed the 200 MB limit specified for the circuit breaker? It seems that the circuit breaker is limiting how much data the cluster can load into fielddata in total. Isn't that controlled by indices.fielddata.cache.size instead?
Besides, there are two passages in the wiki that I found confusing:
If the estimated query size is larger than the limit, the circuit breaker is tripped and the query will be aborted and return an exception. This happens before data is loaded, which means that you won’t hit an OutOfMemoryException.
As another experiment, I cleared the cache, lowered the circuit breaker limit to 10 MB, and ran the same request as in step 2. I got the shard failure error, and the total fielddata cache size reported by the cat command was around 9.5 MB. So some data was still loaded even though the breaker tripped?
However, with the default settings, the fielddata from the old indices is never evicted! fielddata will just keep on growing until you trip the fielddata circuit breaker (see Circuit Breaker), which will prevent you from loading any more fielddata.
So the circuit breaker controls how much fielddata the cluster can load in total? I thought it was about limiting how much a single query can use:
The circuit breaker estimates the memory requirements of a query by introspecting the fields involved (their type, cardinality, size, and so forth). It then checks to see whether loading the required fielddata would push the total fielddata size over the configured percentage of the heap.
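To make my reading of that passage concrete, here is a toy model of the check it seems to describe, using the numbers from my steps above. This is only a sketch of the quoted sentence, not Elasticsearch's actual code. The function name would_trip is made up, and I am assuming the comparison is simply "already loaded + estimate for the new request > limit".

```shell
#!/bin/sh
# Toy model of the check described in the quoted passage; NOT Elasticsearch code.
# would_trip is a hypothetical helper:
#   would_trip LOADED_MB ESTIMATE_MB LIMIT_MB  -> prints "trip" or "ok"
would_trip() {
  awk -v loaded="$1" -v estimate="$2" -v limit="$3" \
    'BEGIN { if (loaded + estimate > limit) print "trip"; else print "ok" }'
}

# Step 2: cache is empty, field A needs about 193.05 MB, limit is 200 MB.
would_trip 0 193.05 200        # prints "ok"

# Step 5: field B (about 193.9 MB) is still cached, field A is requested again.
would_trip 193.9 193.05 200    # prints "trip"
```

If this model is right, it would explain step 5: the request for field A is small enough on its own, but the total including the already cached field B is not.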
Could you please help me understand these configurations? Did I misunderstand something?