Facets - How to do proper Terms aggregations

ros · January 31, 2017, 1:43pm

Given an index with documents that have a brand property, we need to create a term aggregation that is case insensitive.

Data size

28 indices
each with 10.000 documents

Index definition

Please note the use of fielddata in the mapping below.

PUT demo_products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "text",
          "analyzer": "my_custom_analyzer",
          "fielddata": true
        }
      }
    }
  }
}

Data

POST demo_products/product
{
  "brand": "New York Jets"
}

POST demo_products/product
{
  "brand": "new york jets"
}

POST demo_products/product
{
  "brand": "Washington Redskins"
}

Query

GET demo_products/product/_search
{
  "size": 0,
  "aggs": {
    "brand_facet": {
      "terms": {
        "field": "brand"
      }
    }
  }
}

Query result

"aggregations": {
    "brand_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "new york jets",
          "doc_count": 2
        },
        {
          "key": "washington redskins",
          "doc_count": 1
        }
      ]
    }
  }

This is great, but the bucket keys are lowercased.
We could use a top_hits sub aggregation to retrieve the properly cased version of the field.
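
A minimal sketch of that top_hits idea, assuming the demo_products index and brand field defined above. The sub-aggregation name original_casing is a hypothetical label, and this request has not been run against the data set described here:

```
# original_casing is a hypothetical name chosen for illustration
GET demo_products/product/_search
{
  "size": 0,
  "aggs": {
    "brand_facet": {
      "terms": {
        "field": "brand"
      },
      "aggs": {
        "original_casing": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": [ "brand" ]
            }
          }
        }
      }
    }
  }
}
```

Each bucket should then carry one hit whose _source holds brand as it was originally indexed. With size 1 and no explicit sort, which casing variant comes back for a mixed bucket ("New York Jets" vs "new york jets") is not something to rely on.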

What to do?
If we use keyword instead of text, we end up with 2 buckets for New York Jets because of the differences in casing.

We're concerned about the performance implications of using fielddata. However, if fielddata is disabled we get the dreaded "Fielddata is disabled on text fields by default." error.

Any other tips to resolve this - or should we not be so concerned about fielddata?

ros · February 3, 2017, 1:30pm

To answer my own question: as of ES 5.2, keyword normalizers are the way to go.

PUT demo_products
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

Response

  "aggregations" : {
    "brand_facet" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "new york jets",
          "doc_count" : 2
        },
        {
          "key" : "washington redskins",
          "doc_count" : 1
        }
      ]
    }
  }

polyfractal · February 3, 2017, 10:24pm

Yep! Confirming that 5.2 normalizers are exactly the right solution in this case.

system · March 3, 2017, 10:24pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
