_id is consuming a lot of the fielddata memory

Hello,

I have a 3-node cluster (16 vCPU, 64 GB of RAM, 3 TB of data per node, JVM heap at 30 GB) with 450 indices (1 primary shard and 1 replica per index).

Following an upgrade from 6.7 to 6.8, the activation of TLS on Transport and HTTP and the activation of security (native authentication), we started seeing circuit breaking exceptions in the Elasticsearch logs.
After some investigation I found that the JVM heap is mainly used by fielddata, and most of the fielddata memory is used by the "_id" field:

GET _cat/fielddata?v&fields=*&s=size

-QKIH1UCRaKUmZSddRj6YQ x.x.x.x x.x.x.x NodeA type.raw                                5kb
JlIPM63SQ-OrLjcsr3q-yg y.y.y.y y.y.y.y NodeC type.raw                              5.2kb
Oj0_TCGcSWac4Zk-vhe3hA z.z.z.z z.z.z.z NodeB type.raw                              6.7kb
-QKIH1UCRaKUmZSddRj6YQ x.x.x.x x.x.x.x NodeA shard.state                             7kb
Oj0_TCGcSWac4Zk-vhe3hA z.z.z.z z.z.z.z NodeB shard.state                           8.1kb
JlIPM63SQ-OrLjcsr3q-yg y.y.y.y y.y.y.y NodeC shard.index                          21.9kb
Oj0_TCGcSWac4Zk-vhe3hA z.z.z.z z.z.z.z NodeB src_ip                               41.5kb
-QKIH1UCRaKUmZSddRj6YQ x.x.x.x x.x.x.x NodeA shard.index                          41.7kb
Oj0_TCGcSWac4Zk-vhe3hA z.z.z.z z.z.z.z NodeB shard.index                          42.5kb
-QKIH1UCRaKUmZSddRj6YQ x.x.x.x x.x.x.x NodeA src_ip                               97.8kb
JlIPM63SQ-OrLjcsr3q-yg y.y.y.y y.y.y.y NodeC src_ip                              103.2kb
-QKIH1UCRaKUmZSddRj6YQ x.x.x.x x.x.x.x NodeA _id                                  23.9gb
Oj0_TCGcSWac4Zk-vhe3hA z.z.z.z z.z.z.z NodeB _id                                    24gb
JlIPM63SQ-OrLjcsr3q-yg y.y.y.y y.y.y.y NodeC _id                                    24gb

Is this normal behavior? How can I reduce the memory used?

The third node was added to the cluster recently to try to spread the load, but it didn't change anything.
I have a lot of fields in my indices; would decreasing the number of fields change that?

Thanks

Antoine

Have you used the _id field for sorting or aggregations? If so, it's recommended not to do that.

Hi David,

We checked our searches, visualizations and dashboards and didn't find any sorting or aggregations using the _id field in them.
We are using ElastAlert (https://github.com/Yelp/elastalert) to query the logs we are ingesting into Elasticsearch, and none of our ElastAlert rules use it either.

Also, I don't know if this information will be of any use, but when I restart a node it takes a while for the _id fielddata memory to build up.
I'm restarting each node twice a day to free the memory.

Does the buildup correspond with shards being allocated to the node after its restart, or does it take longer than the shard allocation?

It takes longer, almost an hour and a half from what I can tell from the monitoring graphs.

This is consistent with using the _id field in sorting or aggregations.

I don't have any great ideas for tracking down the source of those searches. Maybe a good start would be to use the slow log to log all searches.
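As a rough sketch of what "log all searches" could look like: setting the search slow log thresholds to 0ms should cause every query and fetch phase to be logged. The base URL and index pattern below are placeholders, authentication and TLS handling are left out, and the helper name is made up for illustration. It only builds the request, it does not send it.

```python
import json
import urllib.request


def slowlog_request(base_url, index_pattern, threshold="0ms"):
    """Build (without sending) a PUT <index>/_settings request that lowers
    the search slow log trace thresholds, so every search gets logged.

    base_url and index_pattern are placeholders supplied by the caller.
    """
    body = {
        "index.search.slowlog.threshold.query.trace": threshold,
        "index.search.slowlog.threshold.fetch.trace": threshold,
        "index.search.slowlog.level": "trace",
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical usage (needs a reachable cluster and credentials):
#   urllib.request.urlopen(slowlog_request("https://localhost:9200", "my-index-*"))
```

Logging everything is noisy, so this is something to turn on briefly; passing "-1" as the threshold disables the slow log again.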

Does it only happen with certain indices?
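One way to check that might be the index stats API with a field filter (GET /_stats/fielddata?fields=_id), then ranking indices by how much fielddata the field holds. The sketch below works on the already-parsed JSON response; the response shape it assumes (indices.<name>.total.fielddata.fields.<field>.memory_size_in_bytes) is my reading of the stats output and worth confirming against a real response. The function name is illustrative.

```python
# Path assumed for the per-field fielddata stats request.
STATS_PATH = "/_stats/fielddata?fields=_id"


def fielddata_by_index(stats, field="_id"):
    """Return (index_name, bytes) pairs for indices holding fielddata for
    `field`, largest first.

    `stats` is the parsed JSON body of the index stats response; the nested
    layout read here is an assumption, not verified against 6.8.
    """
    usage = []
    for name, index_stats in stats.get("indices", {}).items():
        fields = (
            index_stats.get("total", {})
            .get("fielddata", {})
            .get("fields", {})
        )
        size = fields.get(field, {}).get("memory_size_in_bytes", 0)
        if size > 0:
            usage.append((name, size))
    return sorted(usage, key=lambda item: item[1], reverse=True)
```

If only a handful of indices show up, that narrows down which searches to look at.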

We found what causes the issue: when an ElastAlert rule matches, we add a link to Kibana in the alert with the _id of the log that matched the rule. When someone clicks on the link, _id values are loaded into the JVM heap.
I don't think that copying the _id value into another field would change that: we have a lot of logs, so at some point Elasticsearch will have to load these values to search in them. And doc_values would require reading this information from disk, so I guess performance will not be great either.

What happens from Elasticsearch's point of view between "someone clicks on the link" and "_id values are loaded in the JVM heap"? A search?

I recommend validating guesses of that nature with a proper experiment.

TIL we have an API for clearing caches which includes field data. It isn't a good long-term fix but it is a lot less disruptive than restarting nodes to clear this memory usage.
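For reference, a minimal sketch of calling that API for just the fielddata of one field, via POST /_cache/clear with the fielddata and fields parameters. The base URL is a placeholder, authentication is omitted, and the helper name is made up. It only builds the request so it can be inspected before sending.

```python
import urllib.parse
import urllib.request


def clear_fielddata_request(base_url, field="_id", index=None):
    """Build (without sending) a POST request to the clear cache API that
    targets only the fielddata cache of `field`.

    If `index` is given the call is scoped to that index or pattern,
    otherwise it applies to all indices. base_url is a placeholder.
    """
    query = urllib.parse.urlencode({"fielddata": "true", "fields": field})
    scope = f"/{index}" if index else ""
    return urllib.request.Request(
        url=f"{base_url}{scope}/_cache/clear?{query}",
        method="POST",
    )


# Hypothetical usage (needs a reachable cluster and credentials):
#   urllib.request.urlopen(clear_fielddata_request("https://localhost:9200"))
```

The memory will fill up again as soon as the same kind of search runs, so this only buys time while the source of the searches is tracked down.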

Hi David,

The link goes to the Discover tab in Kibana, so yes, a search is performed at that time.

I configured the "Logs" app in Kibana to display the logs that are in our logstash-* indices and performed a few searches on _id. I do not have the issue that way, so we will modify the links in ElastAlert to use this app.

Thanks David, I used it a few times and it worked great! It is indeed a lot less disruptive, and also a lot quicker, than restarting the nodes.
