Doing aggregations using flattened_fields is slow

We consume and store data containing a field filter_properties containing key => value fields.
Up until now we use dynamic mapping to store that data but that's creating a lot of unique fields.
So I decided to try the flattened_field type.

I now have a mapping looking like this:

    {
      "mappings" : {
        "dynamic_templates" : [
          {
            "filterprops" : {
              "path_match" : "filter_properties.*",
              "mapping" : {
                "fields" : {
                  "analyzed" : {
                    "normalizer" : "sort_normalizer",
                    "type" : "keyword"
                  }
                },
                "index" : true,
                "norms" : false,
                "type" : "keyword"
              }
            }
          }
       ],
       "properties": {
          "filters" : {
            "dynamic" : "strict",
            "properties" : {
              "analyzed" : {
                "type" : "flattened",
               "eager_global_ordinals": true
              },
              "raw" : {
                "type" : "flattened",
               "eager_global_ordinals": true
              }
            }
          }
       }
      }
    }

But doing aggregations using the new filters fields is "very" slow.
Doing a query like this (using the flattened_field) takes between 200/300 ms

POST my_index/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "foo": {
      "terms": {
        "field": "filters.raw.QBC",
        "size": 300
      }
    }
  }
}

Where this query on the same index takes between 20/30 ms

POST my_index/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "foo": {
      "terms": {
        "field": "filters_propertied.QBC",
        "size": 300
      }
    }
  }
}

Clearly I'm missing something here but I don't see it. Anyone have any pointers for me?

When you index concrete fields you have dedicated data structures on disk. e.g. given fields obj.foo and obj.bar you have 2 different doc value data structures you can effectively index directly into via the Lucene doc id:

obj.foo : ["a", , , , , "b", , , ]
obj.bar : ["c", , "a", , , ,]

So if your query matches docs 1, 4 and 7 you can index efficiently into the values for field obj.foo and retrieve the single values for this field.

With a flattened field the field names and values are combined so the above data structure is logically more like this on disk:

    obj:  [
                  [....]
                  ["obj.baz:other", "obj.foo:a", "obj.z:blah" , ...],
                  [....],
                  ["obj.blah:misc", "obj.foo:b", "obj.z:etc", ... ],
                  [...]
                  ["obj.baz:other", "obj.foo:c", "obj.z:blah" , ...],
    ] 

You can still index into this structure based on doc ID but the doc values aren't single fields. They are arrays of all fieldnames+values for a doc that you need to iterate across and filter for the obj.foo field name you're aggregating on. This is slower.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.