Fielddata loaded for terms aggregation on not_analyzed string field

Hi,

I am using Elasticsearch 2.3.3 and am wondering why a terms aggregation on a not_analyzed string field causes fielddata to be loaded (the docs indicate that this should not happen).

The field in question has the following mapping:

...
"patentStatus": {
  "type": "string",
  "index": "not_analyzed"
},
...

And the terms aggregation looks like this:

{
  <query omitted for brevity>,
  "aggs": {
    "patentStatus": {
      "terms": {
        "field": "invention.patentStatus",
        "size": 10,
        "min_doc_count": 1
      }
    }
  }
}

After running this aggregation, I see about 20 MB of fielddata loaded for each participating shard.

I am aware that global ordinals can consume memory, and that this memory shows up in Elasticsearch's API as if it were fielddata. However, I don't think the global ordinal mapping is the cause here: there are only two distinct values for the invention.patentStatus field, so I wouldn't expect a mapping of two values to consume that much memory.
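One way I could test the global ordinals hypothesis directly is to ask the terms aggregation not to build them at all, using the `execution_hint` parameter. A minimal sketch of such a request follows; the `match_all` query and `"size": 0` are placeholders I am assuming in place of the real query, which is omitted above.

```json
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "patentStatus": {
      "terms": {
        "field": "invention.patentStatus",
        "size": 10,
        "min_doc_count": 1,
        "execution_hint": "map"
      }
    }
  }
}
```

If the reported fielddata stays flat with `"execution_hint": "map"` but grows with the default execution mode, that would suggest the memory is global ordinals after all rather than true fielddata.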

I also believe this query is causing fielddata to load because I tried reindexing the documents into another index with the same mapping, except I added the following:

{
  "fielddata": {
    "format": "disabled"
  } 
}

When I ran the terms aggregation against that index, I got an exception saying that fielddata loading was disabled.
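For reference, this is roughly how that setting sits inside the full field mapping on 2.x. It is only a sketch, assuming everything else is unchanged from the mapping shown earlier; `doc_values` is spelled out for illustration, even though it should already be the default for not_analyzed strings in 2.x.

```json
{
  "patentStatus": {
    "type": "string",
    "index": "not_analyzed",
    "doc_values": true,
    "fielddata": {
      "format": "disabled"
    }
  }
}
```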

So my questions:

  • Why is the terms aggregation loading fielddata? Are the Elasticsearch docs wrong?
  • How can I mitigate that? Is there something I need to add to either the mapping or the aggregation to make it use doc_values?

Thank you!

UPDATE:

I am seeing the same behavior on ES 6.5.1. The relevant portion of the mapping looks like:

...
"patentStatus": {
  "type": "keyword"
},
...

Running the same aggregation causes the same spike in fielddata. However, I'm able to confirm that doc values are created for this particular field with the following query:

{
  "script_fields": {
    "test_docvalues": {
      "script": {
        "lang":   "painless",
        "source": "doc['invention.patentCitations']"
      }
    }
  }
}

So I'm not really sure what to make of all this. The field is mapped to use doc values and, according to that script query, doc values are created for it. However, aggregating on the field still seems to cause an inexplicable spike in fielddata. Can anyone clarify why this might be? It doesn't match what the documentation led me to expect.
