Not_analyzed field with doc_values still in fielddata cache


(Val Crettaz) #1

During some experiments with fielddata vs. doc_values I ran into a weird case. In my earlier mapping, I didn't use doc values at all. In my new mapping, I've added doc_values: true to all fields except analyzed string fields and booleans (not supported until 2.0).

In detail, here is how I proceeded:

Before reindexing all my data, I restarted my ES 1.7 cluster fresh and ran a query with sorting, aggregations and script fields to "warm up" the fielddata cache. Then I queried the _cat/fielddata endpoint to get an idea of the fielddata cache usage. It looked something like this:

curl -XGET 'localhost:9200/_cat/fielddata?v&fields=*'

id      host   ip            node  total  items.desc.raw more_fields...
rKX7... myhost 192.168.1.100 Doom  32.9mb 2.3mb          ...

As you can see, the field items.desc.raw used 2.3mb of heap space. The items field is of type nested and contains a string multi-field with a not_analyzed sub-field called raw. In short, the mapping of that nested field looks like this:

    "items": {
      "type": "nested",
      "properties": {
        "desc": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
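
For reference, here is a sketch of what the same mapping might look like with doc values enabled on the raw sub-field, which is the change described below. It is written as a Python dict purely so it can be checked and dumped to JSON; the only key added to the mapping above is doc_values, and the variable name items_mapping is just illustrative.

```python
import json

# Same nested mapping as above, with doc_values enabled on the
# not_analyzed "raw" sub-field (the only change).
items_mapping = {
    "items": {
        "type": "nested",
        "properties": {
            "desc": {
                "type": "string",
                "fields": {
                    "raw": {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True,
                    }
                },
            }
        },
    }
}

# Print the JSON fragment that would go into the index mapping.
print(json.dumps(items_mapping, indent=2))
```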

After adding doc_values: true to items.desc.raw, reindexing the whole index and again running some aggregations, sorting and scripting to warm up the fielddata cache, I queried the _cat/fielddata endpoint again. Here is the result:

curl -XGET 'localhost:9200/_cat/fielddata?v&fields=*'

id      host   ip            node  total  items.desc.raw some_bools...
tAB5... myhost 192.168.1.100 Yack  2.1mb  9.2kb          ...

So the fielddata usage has indeed dropped drastically, which is good. Most of the fields I still see are boolean fields (i.e. some_bools above), which was expected. To my surprise, however, my nested not_analyzed string field also still appears, although with a much lower space usage.

What could be the cause of items.desc.raw still appearing in the fielddata cache?


(Colin Goodheart-Smithe) #2

When using doc values on a not_analyzed string field you may still see some fielddata usage from global ordinals. This is a data structure that assigns a number (ordinal) to each unique term in the index for that field, so that calculations can work with compact numbers instead of keeping multiple copies of the field's string values in memory. Global ordinals cannot be stored in doc values because they have to be computed at query time, by running over all the terms currently in the field and assigning each a unique number. This would explain why you still see a small amount of fielddata usage even though you are using doc values for a not_analyzed string field.
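
To make the idea a bit more concrete, here is a rough Python sketch of what "assigning each term a unique number" could look like. This is only an illustration, not how Elasticsearch or Lucene actually implements it: the function name build_global_ordinals, the split of terms into per-segment lists, and the sample terms are all assumptions made up for the example.

```python
from typing import List, Tuple


def build_global_ordinals(
    segments: List[List[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Toy model: each inner list holds one segment's sorted unique terms,
    so a term's position in that list is its per-segment ordinal."""
    # Global term dictionary: sorted union of the terms of all segments.
    global_terms = sorted({term for seg in segments for term in seg})
    # Global ordinal of a term = its position in the global dictionary.
    global_ord = {term: i for i, term in enumerate(global_terms)}
    # For every segment, map per-segment ordinal -> global ordinal.
    seg_to_global = [[global_ord[term] for term in seg] for seg in segments]
    return global_terms, seg_to_global


# Two hypothetical segments of the same field.
segments = [["apple", "cherry"], ["banana", "cherry", "date"]]
global_terms, seg_to_global = build_global_ordinals(segments)
print(global_terms)
print(seg_to_global)
```

In this toy model, "cherry" has ordinal 1 in both segments but a single global ordinal of 2, so results from different segments can be combined by number without comparing strings. The mapping tables are the part that has to be held in memory, which is much smaller than the terms themselves.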

Hope that helps


(Val Crettaz) #3

Thanks @colings86, that definitely helps. Somehow I missed the global ordinals bit, but it all makes sense now. Thanks again.

