Given an index with documents that have a brand property, we need to create a term aggregation that is case insensitive.
Data size
- 28 indices
- each with 10.000 documents
Index definition
Please note that the use of fielddata
PUT demo_products
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"product": {
"properties": {
"brand": {
"type": "text",
"analyzer": "my_custom_analyzer",
"fielddata": true,
}
}
}
}
}
Data
POST demo_products/product
{
"brand": "New York Jets"
}
POST demo_products/product
{
"brand": "new york jets"
}
POST demo_products/product
{
"brand": "Washington Redskins"
}
Query
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"field": "brand"
}
}
}
}
Query result
"aggregations": {
"brand_facet": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new york jets",
"doc_count": 2
},
{
"key": "washington redskins",
"doc_count": 1
}
]
}
}
This is great - but lowercased
We could use a top_hits sub aggregation to the propper cased version of the field
What to do?
If we use keyword instead of text we end up the 2 buckets for New York Jets because of the differences in casing.
We're concerned about the performance implications by using fielddata. However if fielddata is disabled we get the dreaded "Fielddata is disabled on text fields by default."
Any other tips to resolve this - or should we not be so concerned about fielddata?