Using the Truncate filter on keywords

Arik_A · October 22, 2018, 8:04pm

Hi,
I'm using Elasticsearch 5.5.1.
In our index template, every text field gets a keyword sub-field, currently mainly for sorting.
These texts can be long, and keyword terms have a limited size. I know there's the ignore_above option, but values above that length then get no keyword at all, and the results aren't sorted properly.
To be able to sort, I want to truncate the keyword field.
I tried adding the truncate filter to the keyword's normalizer definition, but then I get the error "Custom normalizer [keyword_norm] may not use filter [keyword_truncate]".
Relevant part of the template definition:

{
  "template": "xyz*",
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "keyword_truncate": {
            "type": "truncate",
            "length": 256
          }
        },
        "normalizer": {
          "keyword_norm": {
            "type": "custom",
            "filter": ["lowercase", "keyword_truncate"]
          }
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_as_text": {
            "match_mapping_type": "string",
            "mapping": {
              "analyzer": "standard",
              "type": "text",
              "fielddata": true,
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "normalizer": "keyword_norm"
                }
              }
            }
....

Is this the correct way to use it? And is there a way to make it work?

spinscale · October 23, 2018, 8:20am

Only certain filters can be used in normalizers, and truncate is not on that list; see https://www.elastic.co/guide/en/elasticsearch/reference/6.4/analysis-normalizers.html

You could do this using a char filter, something like this:

GET _analyze
{
  "text": "aaaaaaaaaa",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "^(.{5})(.*)$",
      "replacement": "$1"
    }
  ]
}

Note: there may be more performant approaches; this is just what came to mind.
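
To see what that pattern does outside Elasticsearch, here is a rough Python equivalent. It is only an illustration: the helper name truncate_with_pattern is made up, Elasticsearch applies the pattern with Java regular expressions, and Python writes the backreference as \1 instead of $1.

```python
import re


def truncate_with_pattern(text: str, length: int = 5) -> str:
    """Hypothetical helper mirroring the pattern_replace char filter above."""
    # Group 1 captures the first `length` characters, group 2 the rest.
    pattern = r"^(.{%d})(.*)$" % length
    # Keep only group 1. Text shorter than `length` does not match the
    # pattern at all, so it passes through unchanged.
    return re.sub(pattern, r"\1", text)


print(truncate_with_pattern("aaaaaaaaaa"))
print(truncate_with_pattern("abc"))
```

The first call keeps the first five characters; the second returns its input unchanged because the text is shorter than the cut-off.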

Arik_A · October 23, 2018, 11:57am

Thanks for your reply,
This does truncate short texts, but for texts that are longer than the keyword size limit, there's still an error while indexing. Apparently it checks the length before applying the character filter.

Document contains at least one immense term in field=\"description.keyword\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.

Is there a way to make it work?

spinscale · October 23, 2018, 12:03pm

You could modify the JSON before indexing (instead of only changing the field that is stored in Lucene) by using a gsub processor or a script processor in an ingest pipeline.
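
A rough sketch of the same idea done on the client side instead of in an ingest processor (so this is a swapped-in technique, not the gsub or script processor itself). It assumes the truncated value goes into a separate field, here given the hypothetical name description_sort, which would have to be mapped as a keyword; the original text field is left as it is. The default of 256 is taken from the truncate length in the template earlier in the thread, and the limit is counted in UTF-8 bytes because that is what the "immense term" error measures.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a partial multi-byte sequence left at the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def add_sort_field(doc: dict, field: str, sort_field: str, max_bytes: int = 256) -> dict:
    """Return a copy of doc with a lowercased, truncated copy of doc[field].

    sort_field is a hypothetical extra field; doc[field] itself is not changed.
    """
    prepared = dict(doc)
    # Lowercase first, then truncate, since lowercasing can change byte length.
    prepared[sort_field] = truncate_utf8(doc[field].lower(), max_bytes)
    return prepared


doc = {"description": "Some Very Long Text " * 5000}
prepared = add_sort_field(doc, "description", "description_sort")
print(len(prepared["description_sort"].encode("utf-8")))
print(prepared["description"] == doc["description"])
```

Sorting would then use description_sort rather than description.keyword. Whether an extra top-level field is acceptable depends on the use case.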

Arik_A · October 23, 2018, 1:31pm

Thank you,
If I understand correctly, these will change the values on the actual document. I'm only interested in changing the keyword field under the text field, and not the text itself.
Is there a way to do so?

Fayaz · November 13, 2018, 7:33am

I am also looking for a solution to this problem. Elasticsearch 6 has increased the ignore_above limit (https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html), but the underlying Lucene limit is still 32766 bytes, and Elasticsearch fails with the "Document contains at least one immense term in" error. It seems the Elasticsearch community has no solution to this problem; see the topic "ElasticSearch 6.2.4 java.lang.IllegalArgumentException: Document contains at least one immense term".

system · December 11, 2018, 7:33am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
