Summary: After running facets, one node's CPU runs high at idle even
though it has a fair bit of memory left - is this expected? And can
someone sanity-check my field cache sizes?
My pseudo-operational cluster consists of 3 nodes, each with 10GB of
memory allocated (and mlockall set - in any case I have no swap space).
When my entire document set is loaded (2.5M top-level documents,
various embedded and nested sub-objects), bigdesk shows the store
sizes as: 6.1GB, 4.3GB and 3.7GB (these sizes vary from re-index to re-
index as you'd expect, but that sort of distribution is typical). Each
shard has 1 replica (60 shards in total).
The fun starts when I start to run facets.
First off, I perform a terms facet over all geo elements stored in the
documents (the facet's "total" field = 2163442). In terms of field
cache, the 3 nodes then have the following usage: 2.8GB, 2.5GB and 1GB.
First question: does that seem about right for the field cache taken
up by 2M 14-character strings? That comes out as ~3KB per geo field
instance, much more per unique token value.
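
For reference, this is the arithmetic behind the ~3KB figure, as a
quick sketch. It assumes bigdesk reports binary gigabytes and that the
whole field cache at this point is attributable to the geo facet; the
function name is just for illustration.

```python
GB = 1024 ** 3  # assumption: bigdesk reports binary gigabytes


def bytes_per_instance(cache_sizes_gb, instance_count):
    """Average field cache bytes per field instance across all nodes."""
    total_bytes = sum(cache_sizes_gb) * GB
    return total_bytes / instance_count


# Field cache per node after the geo facet, and the facet's "total" value.
geo_cache_gb = [2.8, 2.5, 1.0]
geo_total = 2163442

geo_kb = bytes_per_instance(geo_cache_gb, geo_total) / 1024
print(f"{geo_kb:.1f} KB per geo field instance")
```

That is roughly 3KB per instance, against 14 characters of actual
string data.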
Next I perform a terms facet on a different string field (average
length ~64B), part of a "nested" object (facet "total" = 1809227). This
increases the field cache usage to: 6.3GB, 3.3GB and 2.8GB.
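
The same back-of-the-envelope calculation for the second facet, as a
sketch. It assumes the growth in field cache between the two
measurements is entirely due to the nested string field, and again
that the sizes are binary gigabytes.

```python
GB = 1024 ** 3  # assumption: sizes are binary gigabytes

# Field cache per node before and after the second (nested field) facet.
before_gb = [2.8, 2.5, 1.0]
after_gb = [6.3, 3.3, 2.8]
nested_total = 1809227  # the facet's "total" value

# Per-node growth, attributed entirely to the second facet (assumption).
growth_gb = [after - before for before, after in zip(before_gb, after_gb)]
growth_bytes = sum(growth_gb) * GB

nested_kb = growth_bytes / nested_total / 1024
print(f"growth per node (GB): {[round(g, 1) for g in growth_gb]}")
print(f"{nested_kb:.1f} KB per nested field instance")
```

So the second field costs about 3.5KB per instance for ~64B strings,
and more than half of the growth lands on the first node.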
Looking at "top" on the three nodes, they have the following memory
usage (Virt/Res in GB): 10.1/8.1, 10.1/6.2, 10.1/5.8.
The node with 10.1/8.1 now uses 40% CPU at idle. Nothing is logged
(at debug level), and none of the timed events in the node stats show
anything unusual. Queries still return normally, though performance is
noticeably degraded (the average search time jumps by 1s or so).
I tried running with "index.cache.field.type" set to both "resident"
(i.e. the default) and "soft". The behavior seems the same in both cases.
For what it's worth, the "offending" node contains all 10 shards from
the largest index, though only 3 of them are "primary".
Second question: Is this expected behavior?
Bonus questions: If so, does it just indicate it's time to add a new
node, or do I have any other options left? Is there any way of
detecting this state other than monitoring CPU trends (e.g. so that an
alert email can be sent)?
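
One possible alternative to watching CPU would be to alert on field
cache size as a fraction of the memory allocated to each node. A
minimal sketch, assuming the per-node figures have already been
collected by some other means (e.g. read off bigdesk); the function
name and the 0.6 threshold are arbitrary choices for illustration, not
a recommended value.

```python
def nodes_over_threshold(field_cache_gb, allocated_gb, threshold=0.6):
    """Return indices of nodes whose field cache exceeds the given
    fraction of allocated memory. The threshold is an arbitrary guess."""
    return [
        index
        for index, used in enumerate(field_cache_gb)
        if used / allocated_gb > threshold
    ]


# Field cache per node after both facets; each node has 10GB allocated.
flagged = nodes_over_threshold([6.3, 3.3, 2.8], allocated_gb=10.0)
print(flagged)
```

With the numbers above this flags only the first node, which is the
one showing the high idle CPU.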
Many thanks for any insight anyone can provide!