Terms Aggregation with include filter


(Ayush Sangani) #1

Hey all,

ES version: 2.3.2 (recently upgraded)

I'm running a terms aggregation on a not_analyzed string field using the include filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2

Is it possible to add a regex flag (something like "CASE_INSENSITIVE") so that the include filter of a terms aggregation on a string field matches case-insensitively?

PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": { 
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Aggregation Query:

POST /my_index/_search
{
  "aggregations": {
    "name_regex_terms_agg": {
      "terms": {
        "field": "name.raw",
        "size": 1000,
        "shard_size": 100000,
        "include": "adam.*|.*\\sadam.*"
      }
    }
  }
}

For example, the terms aggregation should match both "Adam" and "adam".

Please let me know if there is any other information required.

Thanks for the help.


(Mike Simos) #2

Hi,

You could try something like this:

{
  "size": 0,
  "aggs": {
    "buckets": {
      "terms": {
        "script": "doc['name.raw'].value.toLowerCase()"
      }
    }
  }
}

Otherwise you could index the field using the keyword analyzer and the lowercase tokenizer to emit one token that is all lowercase, then create an aggregation on that field. That will probably be faster than using a script to lowercase the field at query time.
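
One way to get a single lowercased token per value is a custom analyzer that combines the keyword tokenizer with the lowercase token filter. A minimal sketch of the body for PUT /my_index on ES 2.x, reusing the mapping from the first post; the analyzer name "lowercase_keyword" and the sub-field "lower" are hypothetical names chosen for illustration:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" },
            "lower": { "type": "string", "analyzer": "lowercase_keyword" }
          }
        }
      }
    }
  }
}
```

A terms aggregation on name.lower would then bucket "Adam" and "adam" together, but the bucket keys come back lowercased and existing documents would need to be reindexed.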


(Ayush Sangani) #3

Thanks for the reply Mike.
I like your second option, but it would require me to reindex all the documents, and it would also return lowercased keys in the terms aggregation (which is not desired).

I'm still not sure why the terms aggregation include filter doesn't have a CASE_INSENSITIVE regex flag.


(Mike Simos) #4

Refer to this breaking change:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/breaking_20_aggregation_changes.html#_including_excluding_terms

The flags parameter is no longer supported.
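
Without a flags parameter, one workaround that keeps the original-case bucket keys is to spell out both cases of each letter in the include pattern. A minimal sketch of the body for POST /my_index/_search, assuming the name.raw field from the first post; replacing \\s with a literal space is my assumption, since shorthand character classes may not be available in the regex syntax used here:

```json
{
  "size": 0,
  "aggregations": {
    "name_regex_terms_agg": {
      "terms": {
        "field": "name.raw",
        "size": 1000,
        "include": "[aA][dD][aA][mM].*|.* [aA][dD][aA][mM].*"
      }
    }
  }
}
```

This only works when the literal part of the pattern is known up front, and the pattern gets verbose quickly, so it is better suited to short prefixes like the one above.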


(Ayush Sangani) #5

Ohh ok thanks!

So I found a workaround: run a regexp query on the analyzed field and do the terms aggregation on the not_analyzed field.

{
  "aggregations": {
    "name_regex_query": {
      "filter": {
        "regexp": {
          "name": {
            "value": "aust.*|.*\\saust.*",
            "flags_value": 65535
          }
        }
      },
      "aggregations": {
        "name_raw_terms_agg": {
          "terms": {
            "field": "name.raw",
            "size": 1000,
            "shard_size": 100000
          }
        }
      }
    }
  }
}

Curious to know if this has any performance impact?


(Subbu v) #6

So are you getting the case insensitive results on the regex query?


(Ayush Sangani) #7

@subbu.nv Yeah, I'm able to get case-insensitive results by using the keyword analyzer and the lowercase tokenizer.
Note that flags in the regex query were removed in ES 2.X.


(Greg Strauss) #8

I have a similar situation: I want to do a case-insensitive aggregation on a keyword field, so the idea of a keyword analyzer and lowercase tokenizer makes sense. However, the syntax for setting this up on a field has been elusive for me.

First approach was this:

PUT authors
{
  "mappings": {
	"famousbooks": {
	  "properties": {
		"Author": {
		  "type": "text",
		  "fields": {
			"use_lowercase": {
			  "type": "text",
			  "analyzer": "keyword",
			  "tokenizer": "lowercase"
			}
		  }
		}
	  }
	}
  }
}

But this fails with

"error": {
"root_cause": [
  {
    "type": "mapper_parsing_exception",
    "reason": "Mapping definition for [fields] has unsupported parameters:  [tokenizer : lowercase]"
  }
],

So next step is a custom analyzer:

PUT authors
{
  "settings": {
	"analysis": {
	  "analyzer": {
		"myLowercase": {
		  "type": "custom",
		  "tokenizer": "keyword",
		  "filter" : ["lowercase"]
		}
	  }
	}
  },
  "mappings": {
	"famousbooks": {
	  "properties": {
		"Author": {
		  "type": "text",
		  "fields": {
			"use_lowercase": {
			  "type": "text",
			  "analyzer": "myLowercase"
			}
		  }
		}
	  }
	}
  }
}

So the aggregation query:

GET authors/famousbooks/_search
{
  "size": 0,
  "aggs": {
	"authors-aggs": {
	  "terms": {
		"field": "Author.use_lowercase"
	  }
	}
  }
}

returns an error as follows:

"error": {
"root_cause": [
  {
    "type": "illegal_argument_exception",
    "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
  }
],

So to make this work I evidently need to enable Fielddata despite the dire warnings of turning this on... it seems like a big stick to use for what seemingly is a simple thing.

I'm hoping I am missing something obvious here, though.

Thx in advance for the help!


(Greg Strauss) #9

Of course I discover a solution soon after I post...

Normalizers! Help found in another post on this forum.

PUT authors
{
  "settings": {
    "analysis": {
      "normalizer": {
        "myLowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "keyword",
          "normalizer": "myLowercase"
        }
      }
    }
  }
}
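
With that mapping, the aggregation can target the keyword field directly and no fielddata is needed. A minimal sketch of the body for GET authors/famousbooks/_search, reusing the aggregation name from the earlier attempt:

```json
{
  "size": 0,
  "aggs": {
    "authors-aggs": {
      "terms": {
        "field": "Author"
      }
    }
  }
}
```

As far as I understand, the normalizer is applied before the value is indexed, so the bucket keys come back lowercased (for example, "Mark Twain" and "mark twain" would land in a single "mark twain" bucket).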

(Martijn Laarman) #11

For future googlers:

"terms": {
    "field": "_index",
    "size": 10,
    "exclude" : "__BAD__",
    "script" : {
        "source" : "if (_value =~ /(^|.*\\s+)Index/i) { return _value } else { return '__BAD__' }",
        "lang" : "painless"
    }
}

Note that you have to enable regex in Painless, as it's disabled by default. Read the caveats of doing so here:

https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-examples.html#modules-scripting-painless-regex