Using the Truncate filter on keywords


(Arik A) #1

Hi,
I'm using Elasticsearch 5.5.1.
In a template I've defined, a keyword sub-field is added to each text field, currently mainly for sorting.
These texts can be long, and keywords have a limited size. I know there's the ignore_above option, but values longer than that limit are then not indexed into the keyword field at all, so the results aren't sorted properly.
To be able to sort, I want to truncate the keyword field.
I tried adding the truncate filter to the keyword's normalizer definition, but then I get the error "Custom normalizer [keyword_norm] may not use filter [keyword_truncate]".
Relevant part of the template definition:

{
  "template": "xyz*",
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "keyword_truncate": {
            "type": "truncate",
            "length": 256
          }
        },
        "normalizer": {
          "keyword_norm": {
            "type": "custom",
            "filter": ["lowercase", "keyword_truncate"]
          }
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_as_text": {
            "match_mapping_type": "string",
            "mapping": {
              "analyzer": "standard",
              "type": "text",
              "fielddata": true,
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "normalizer": "keyword_norm"
                }
              }
            }
....

Is this the correct way to use it? And is there a way to make it work?


(Alexander Reelsen) #2

Only certain filters can be used in normalizers, and truncate is not one of them; see https://www.elastic.co/guide/en/elasticsearch/reference/6.4/analysis-normalizers.html

You could do this using a char filter, something like this:

GET _analyze
{
  "text": "aaaaaaaaaa",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "^(.{5})(.*)$",
      "replacement": "$1"
    }
  ]
}

Note: there may be more performant approaches; this is just the first one that came to mind.
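
To see what the pattern above does to different inputs, here is a minimal sketch using Python's re module. It is an assumption that Python's handling of this particular pattern matches the Java regex engine Elasticsearch uses; the function names are hypothetical. One thing it illustrates: without a "dot matches newline" flag, `.` stops at line breaks, so a multi-line value is left untouched.

```python
import re

# Same pattern as the char filter above (limit of 5 characters).
PATTERN = re.compile(r"^(.{5})(.*)$")

# Variant with DOTALL so that "." also matches newlines.
PATTERN_DOTALL = re.compile(r"^(.{5})(.*)$", re.DOTALL)


def truncate(text: str) -> str:
    """Keep only the first capture group when the pattern matches."""
    return PATTERN.sub(r"\1", text)


def truncate_dotall(text: str) -> str:
    """Same as truncate(), but the match may span line breaks."""
    return PATTERN_DOTALL.sub(r"\1", text)


if __name__ == "__main__":
    print(repr(truncate("aaaaaaaaaa")))
    print(repr(truncate("abc")))
    print(repr(truncate("aaaaaaaa\nbbb")))
    print(repr(truncate_dotall("aaaaaaaa\nbbb")))
```

If the same holds in Elasticsearch, prefixing the pattern with the inline flag `(?s)` might be worth trying for multi-line values, though that is untested here.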


(Arik A) #3

Thanks for your reply,
This does truncate short texts, but for texts that are longer than the keyword size limit, there's still an error while indexing. Apparently it checks the length before applying the character filter.

Document contains at least one immense term in field=\"description.keyword\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.

Is there a way to make it work?


(Alexander Reelsen) #4

You could modify the JSON before indexing (instead of changing only the field that is stored in Lucene) by using a gsub processor or a script processor in an ingest pipeline.
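
The same idea can also be applied on the client side before the document is sent. The following is a minimal sketch, not an Elasticsearch API: the helper names and the `<field>_sort` naming are hypothetical. It writes a truncated copy into a separate top-level field, which would need its own keyword mapping; it is not the `.keyword` multi-field from the template. The original text value is left unchanged. Truncation is done on the UTF-8 encoding, because the limit in the error message is expressed in bytes.

```python
# Lucene's per-term limit, as quoted in the "immense term" error message.
MAX_TERM_BYTES = 32766


def truncate_utf8(text: str, max_bytes: int = MAX_TERM_BYTES) -> str:
    """Cut text so its UTF-8 encoding is at most max_bytes long."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character split by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def add_sort_field(doc: dict, field: str, max_bytes: int = 256) -> dict:
    """Return a copy of doc with a truncated '<field>_sort' value added."""
    out = dict(doc)
    out[f"{field}_sort"] = truncate_utf8(doc[field], max_bytes)
    return out


if __name__ == "__main__":
    doc = {"description": "x" * 1000}
    prepared = add_sort_field(doc, "description")
    print(len(prepared["description"]), len(prepared["description_sort"]))
```

Sorting would then target the hypothetical `description_sort` field, while full-text search keeps using the untouched `description` field.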


(Arik A) #5

Thank you,
If I understand correctly, these will change the values in the actual document. I'm only interested in changing the keyword field under the text field, not the text itself.
Is there a way to do so?


(Mohammed Fayaz) #6

I am also looking for a solution to this problem. Elasticsearch 6 has increased the ignore_above limit (https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html), but the underlying Lucene limit is still 32766 bytes, and Elasticsearch fails with the "Document contains at least one immense term in" error. It seems the Elasticsearch community has no solution to this problem; see the topic "ElasticSearch 6.2.4 java.lang.IllegalArgumentException: Document contains at least one immense term".

