Facets - How to do propper Terms aggregations


#1

Given an index with documents that have a brand property, we need to create a term aggregation that is case insensitive.

Data size

  • 28 indices
  • each with 10.000 documents

Index definition

Please note that the use of fielddata

PUT demo_products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "text",
          "analyzer": "my_custom_analyzer",
          "fielddata": true,
        }
      }
    }
  }
}

Data

POST demo_products/product
{
  "brand": "New York Jets"
}

POST demo_products/product
{
  "brand": "new york jets"
}

POST demo_products/product
{
  "brand": "Washington Redskins"
}

Query

GET demo_products/product/_search
{
  "size": 0,
  "aggs": {
    "brand_facet": {
      "terms": {
        "field": "brand"
      }
    }
  }
}

Query result

"aggregations": {
    "brand_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "new york jets",
          "doc_count": 2
        },
        {
          "key": "washington redskins",
          "doc_count": 1
        }
      ]
    }
  }

This is great - but lowercased :frowning:
We could use a top_hits sub aggregation to the propper cased version of the field

What to do?
If we use keyword instead of text we end up the 2 buckets for New York Jets because of the differences in casing.

We're concerned about the performance implications by using fielddata. However if fielddata is disabled we get the dreaded "Fielddata is disabled on text fields by default."

Any other tips to resolve this - or should we not be so concerned about fielddata?


#2

To answer my own question. As of ES 5.2 keyword normalizers is the way to go.

PUT demo_products
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

Response

  "aggregations" : {
    "brand_facet" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "new york jets",
          "doc_count" : 2
        },
        {
          "key" : "washington redskins",
          "doc_count" : 1
        }
      ]
    }
  }

(Zachary Tong) #3

Yep! Confirming that 5.2 normalizers is exactly the right situation in this case :slight_smile:


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.