During some experiments with fielddata vs doc_values I ran into a weird case. In my earlier mapping, I didn't use doc values at all. In my new mapping, I've added doc_values: true to all fields except analyzed string fields and booleans (doc values on booleans are not supported until 2.0).
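Concretely, "adding doc_values" just means setting the flag per field in the mapping. A minimal sketch (the field names here are made up, not from my real mapping):

"created_at": { "type": "date",   "doc_values": true },
"status":     { "type": "string", "index": "not_analyzed", "doc_values": true },
"title":      { "type": "string" }

title stays on fielddata because it is analyzed, and boolean fields keep the default since doc values for booleans only arrive in 2.0.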
So, in detail, here is how I proceeded:
Before reindexing all my data, I restarted my ES 1.7 cluster fresh and ran a query with sorting, aggregations and script fields to "warm up" the fielddata cache, along the lines of the sketch below.
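The warm-up request was shaped roughly like this (my_index, the aggregation names and some_numeric_field are placeholders rather than my real query, and the script field assumes dynamic Groovy scripting is enabled on the cluster):

# One request that loads fielddata for sorting, aggregating and scripting
curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "sort": [
    { "items.desc.raw": { "order": "asc", "nested_path": "items" } }
  ],
  "aggs": {
    "by_desc": {
      "nested": { "path": "items" },
      "aggs": {
        "top_descs": { "terms": { "field": "items.desc.raw" } }
      }
    }
  },
  "script_fields": {
    "doubled": { "script": "doc[\"some_numeric_field\"].value * 2" }
  }
}'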
Then I queried the _cat/fielddata endpoint to get an idea of the fielddata cache usage. It looked something like this:
curl -XGET 'localhost:9200/_cat/fielddata?v&fields=*'

id      host   ip            node total  items.desc.raw more_fields...
rKX7... myhost 192.168.1.100 Doom 32.9mb 2.3mb          ...
As you can see, the field items.desc.raw used 2.3mb of heap space. items is of type nested and contains a string multi-field with a not_analyzed sub-field called raw. In short, the mapping of that nested field looks like this:

"items": {
  "type": "nested",
  "properties": {
    "desc": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
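The change in the new mapping was simply to enable doc values on that sub-field; the updated raw block looks like this:

"raw": {
  "type": "string",
  "index": "not_analyzed",
  "doc_values": true
}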
After adding doc_values: true to items.desc.raw this way, reindexing the whole index, and running some aggregations, sorting and scripting again to warm up the fielddata cache, I queried the _cat/fielddata endpoint again. Here was the result:
curl -XGET 'localhost:9200/_cat/fielddata?v&fields=*'

id      host   ip            node total items.desc.raw some_bools...
tAB5... myhost 192.168.1.100 Yack 2.1mb 9.2kb          ...
So the fielddata usage has indeed been drastically lowered (which is good). I expected boolean fields (like some_bools above) to be the only entries left, but to my surprise my nested not_analyzed string field also showed up, albeit with a much lower space usage.
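In case it helps narrow things down, the entry can be watched in isolation by asking the cat API for just that field:

curl -XGET 'localhost:9200/_cat/fielddata?v&fields=items.desc.raw'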
What could be the cause of items.desc.raw still appearing in the fielddata cache?