Terms Aggregation with include filter

ayushsangani · May 25, 2016, 5:31pm

Hey all,

ES version: 2.3.2 (recently upgraded)

I'm running a terms aggregation on a not_analyzed string field using the include filter.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2

Is it possible to add a regex flag that performs a "CASE_INSENSITIVE" terms aggregation on a string field?

PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": { 
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Aggregation Query:

POST /my_index/_search
{
 "aggregations": {
    "name_regex_terms_agg": {
      "terms": {
        "field": "name.raw",
        "size": 1000,
        "shard_size": 100000,
        "include": "adam.*|.*\\sadam.*"
      }
    }
  }
}

In other words, is there a regex flag that makes the include filter case-insensitive, so that the terms aggregation matches both "Adam" and "adam"?

Please let me know if there is any other information required.

Thanks for the help.

msimos · May 26, 2016, 12:09am

Hi,

You could try something like this:

{
  "size": 0, 
    "aggs" : {
        "buckets" : {
            "terms" : {
                "script" : "doc['name.raw'].value.toLowerCase()"
            }
        }
    }
}

Otherwise you could index the field with a custom analyzer that uses the keyword tokenizer and the lowercase token filter, so that it emits a single token that is all lowercase. Then run the aggregation on that field. That will probably be faster than using a script to lowercase the field at query time.

ayushsangani · May 26, 2016, 4:21pm

Thanks for the reply, Mike.
I like your second option, but it would require me to reindex all the documents, and it would also return lowercased keys in the terms aggregation (which is not desired).

I'm still not sure why the terms aggregation include filter doesn't have a CASE_INSENSITIVE regex flag.

msimos · May 26, 2016, 5:44pm

Refer to this breaking change:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/breaking_20_aggregation_changes.html#_including_excluding_terms

The flags parameter is no longer supported.
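
Since include no longer accepts flags, one way to keep the original-case keys without reindexing is to spell out both cases of every letter in the pattern itself. The following is only a sketch: the helper name `case_insensitive` and the reserved-character list are assumptions made for illustration, and a literal space is used in the pattern to keep the example simple.

```python
# Characters that carry special meaning in Lucene regular expressions.
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def case_insensitive(literal):
    """Turn a literal string into a regex fragment that ignores case."""
    parts = []
    for ch in literal:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            # Cased letter: accept either form, e.g. "a" -> "[aA]".
            parts.append(f"[{lower}{upper}]")
        elif ch in LUCENE_RESERVED:
            # Escape operators so they match literally.
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)


# Hypothetical include pattern: values that start with "adam" or contain " adam".
name = case_insensitive("adam")
include = f"{name}.*|.* {name}.*"
print(include)  # [aA][dD][aA][mM].*|.* [aA][dD][aA][mM].*
```

The resulting string would go into the "include" parameter of the terms aggregation in place of the lowercase-only pattern. Long literals produce long patterns, so this is probably best kept to short search terms.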

ayushsangani · May 26, 2016, 6:12pm

Ohh ok thanks!

So I found a workaround: run a regexp query on the analyzed field and do the terms aggregation on the not_analyzed field.

{
  "aggregations": {
    "name_regex_query": {
      "filter": {
        "regexp": {
          "name": {
            "value": "aust.*|.*\\saust.*",
            "flags_value": 65535
          }
        }
      },
      "aggregations": {
        "name_raw_terms_agg": {
          "terms": {
            "field": "name.raw",
            "size": 1000,
            "shard_size": 100000
          }
        }
      }
    }
  }
}

Curious to know if this has any performance impact?
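
To see why this workaround behaves case-insensitively, it may help to simulate it. A regexp query is matched against the individual indexed terms, and the sketch below assumes `name` is analyzed with the default standard analyzer, which lowercases each token. The `analyze` function is only a rough stand-in for that analyzer, and the sample names are made up.

```python
import re
from collections import Counter

# Hypothetical documents; each value stands for both name (analyzed) and name.raw.
names = ["Austin Powers", "austin powers", "Jane Austen", "Bob Dylan"]


def analyze(text):
    """Rough stand-in for the standard analyzer: split into words, lowercase."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def regexp_filter(text, pattern):
    """A regexp query must match a whole indexed term, so use fullmatch per token."""
    return any(re.fullmatch(pattern, tok) for tok in analyze(text))


# Filter on the analyzed tokens, then bucket on the untouched raw value.
buckets = Counter(n for n in names if regexp_filter(n, r"aust.*"))
print(dict(buckets))
```

Under these assumptions the buckets keep their original casing ("Austin Powers" and "austin powers" stay separate), while the filter itself ignores case. Because each token is matched on its own, an alternative that expects whitespace inside the value is unlikely to ever match on an analyzed field.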

subbu.nv · September 30, 2016, 9:06am

So are you getting the case insensitive results on the regex query?

ayushsangani · October 4, 2016, 1:53pm

@subbu.nv Yeah, I'm able to get case-insensitive results by using the keyword tokenizer and the lowercase token filter.
Note that the flags parameter for include/exclude regexes was removed in ES 2.x.

GregAtPareto · March 1, 2017, 9:59pm

I have a similar situation in that I want to do a case-insensitive aggregation on a keyword field, so the idea of a keyword tokenizer plus lowercasing makes sense. However, the syntax for setting this on a field eludes me.

First approach was this:

PUT authors
{
  "mappings": {
	"famousbooks": {
	  "properties": {
		"Author": {
		  "type": "text",
		  "fields": {
			"use_lowercase": {
			  "type": "text",
			  "analyzer": "keyword",
			  "tokenizer": "lowercase"
			}
		  }
		}
	  }
	}
  }
}

But this fails with

"error": {
"root_cause": [
  {
    "type": "mapper_parsing_exception",
    "reason": "Mapping definition for [fields] has unsupported parameters:  [tokenizer : lowercase]"
  }
],

So next step is a custom analyzer:

PUT authors
{
  "settings": {
	"analysis": {
	  "analyzer": {
		"myLowercase": {
		  "type": "custom",
		  "tokenizer": "keyword",
		  "filter" : ["lowercase"]
		}
	  }
	}
  },
  "mappings": {
	"famousbooks": {
	  "properties": {
		"Author": {
		  "type": "text",
		  "fields": {
			"use_lowercase": {
			  "type": "text",
			  "analyzer": "myLowercase"
			}
		  }
		}
	  }
	}
  }
}

So the aggregation query:

GET authors/famousbooks/_search
{
  "size": 0,
  "aggs": {
	"authors-aggs": {
	  "terms": {
		"field": "Author.use_lowercase"
	  }
	}
  }
}

returns an error as follows:

"error": {
"root_cause": [
  {
    "type": "illegal_argument_exception",
    "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
  }
],

So to make this work I evidently need to enable Fielddata despite the dire warnings of turning this on... it seems like a big stick to use for what seemingly is a simple thing.

I'm hoping I am missing something obvious here, though.

Thx in advance for the help!

GregAtPareto · March 1, 2017, 10:21pm

Of course I discover a solution soon after I post...

Normalizers! Help found in another post on this forum.

PUT authors
{
  "settings": {
    "analysis": {
      "normalizer": {
        "myLowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "keyword",
          "normalizer": "myLowercase"
        }
      }
    }
  }
}
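
A normalizer is applied to the keyword value before it is indexed, so aggregations on the field return the normalized values. A minimal simulation of the effect on bucket keys, with made-up author values and a plain function standing in for the `myLowercase` normalizer:

```python
from collections import Counter

# Hypothetical Author values as they appear in the source documents.
authors = ["Mark Twain", "mark twain", "MARK TWAIN", "Jane Austen"]


def lowercase_normalizer(value):
    """Stand-in for a custom normalizer whose only filter is lowercase."""
    return value.lower()


# The keyword field stores the normalized value, so buckets merge across case.
buckets = Counter(lowercase_normalizer(a) for a in authors)
print(buckets.most_common())  # [('mark twain', 3), ('jane austen', 1)]
```

This is the same trade-off raised earlier in the thread: the counts are case-insensitive, but the bucket keys come back lowercased.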

Martijn_Laarman · September 22, 2017, 10:27am

For future googlers:

"terms": {
    "field": "_index",
    "size": 10,
    "exclude" : "__BAD__",
    "script" : {
        "source" : "if (_value =~ /(^|.*\\s+)Index/i) { return _value } else { return '__BAD__' }",
        "lang" : "painless"
    }
}

Note that you have to enable regex in Painless, as it's disabled by default. Read the caveats of doing so here:

https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-examples.html#modules-scripting-painless-regex
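
The logic of the script above can be sketched outside Elasticsearch as follows. This is an approximation for illustration only: Python's `re.search` with `re.IGNORECASE` stands in for the Painless find operator `=~` with the `/i` flag, and the field values are invented.

```python
import re
from collections import Counter

# Same pattern as the Painless script; re.IGNORECASE plays the role of /i.
PATTERN = re.compile(r"(^|.*\s+)Index", re.IGNORECASE)
SENTINEL = "__BAD__"


def script(value):
    """Keep values where the pattern is found; map everything else to the sentinel."""
    return value if PATTERN.search(value) else SENTINEL


# Hypothetical field values.
values = ["index-a", "my Index", "reindex", "logs"]
buckets = Counter(script(v) for v in values)
buckets.pop(SENTINEL, None)  # mirrors "exclude": "__BAD__" dropping that bucket
print(dict(buckets))
```

Non-matching values all collapse into the single `__BAD__` bucket, which the exclude parameter then removes, so only the original-case keys that match the pattern remain.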
