Analyzers and Content Length

I have a wonderful little mapping and index that I use to ingest documents. Unfortunately, a few of the documents have long fields and the ingest process fails for those documents.

An example field mapping that I have contains

 {
     "PostBody" : {
         "type" : "text",
         "search_analyzer": "simple",
         "analyzer": "analyzer_startswith",
         "fields" : {
           "keyword" : {
             "type" : "keyword",
             "ignore_above" : 256
           },
           "ending": {
             "type": "text",
             "search_analyzer": "simple",
             "analyzer": "analyzer_endswith"
           },
           "cloud" : {
             "type" : "text",
             "analyzer" : "my_stop_analyzer",
             "search_analyzer" : "my_stop_analyzer",
             "fielddata" : true
           }
         }
       }
 }

and my index's settings are

 {
     "index" : {
       "number_of_shards" : "5",
       "number_of_replicas" : "1",
       "analysis": {
         "analyzer": {
           "analyzer_startswith" : {
             "tokenizer": "keyword",
             "filter": "lowercase"
           },
           "analyzer_endswith" : {
             "tokenizer": "keyword",
             "filter" : [
               "lowercase",
               "reverse"
             ]
           },
           "my_stop_analyzer" : {
             "type" : "stop",
             "stopwords_path" : "/etc/elasticsearch/word_cloud_stopwords.txt",
             "filter" : ["lowercase"]
           }
         }
       }
     }
 }

Now, the error message I am receiving is

 {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field=\"PostBody\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[78, 117, 108, 108, 97, 109, 32, 118, 97, 114, 105, 117, 115, 46, 32, 78, 117, 108, 108, 97, 32, 102, 97, 99, 105, 108, 105, 115, 105, 46]...', original message: bytes can be at most 32766 in length; got 37887"}],"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field=\"PostBody\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[78, 117, 108, 108, 97, 109, 32, 118, 97, 114, 105, 117, 115, 46, 32, 78, 117, 108, 108, 97, 32, 102, 97, 99, 105, 108, 105, 115, 105, 46]...', original message: bytes can be at most 32766 in length; got 37887","caused_by":{"type":"max_bytes_length_exceeded_exception","reason":"bytes can be at most 32766 in length; got 37887"}},"status":400}

I have tried any number of ways around this and have not found one that preserves the functionality AND mitigates the max length exception. Do you know how I can alter the index mapping or analyzers to accommodate such large documents?
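
For context on where the single oversized term comes from: both analyzer_startswith and analyzer_endswith use the keyword tokenizer, which emits the entire field value as one token, so any PostBody longer than 32766 UTF-8 bytes becomes exactly the kind of immense term that Lucene rejects. A quick way to confirm this against the index (the index name my_index and the sample text are placeholders) is the _analyze API, which comes back with a single token covering the whole input:

    POST /my_index/_analyze
    {
      "analyzer": "analyzer_startswith",
      "text": "Nullam varius. Nulla facilisi. <the rest of a long PostBody>"
    }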

The ignore_above mapping parameter has a hint as to why this happens and what you could do about it...

hope this helps!

Thanks, Alexander. I included ignore_above: 256 on the keyword sub-field and that didn't change or remove the error message. :confused:

Sorry, I misread your snippet. ignore_above only works for the keyword field mapper, but you seem to have those kinds of strings in fields you would like to keep searchable.

I actually don't know off the top of my head if there is any workaround for text fields - which I suppose is what is throwing this now? Can you index a document with the field mapped as a plain text field and see if this still happens?

Alexander, if I strip the PostBody mapping down to

    {
      "PostBody" : {
        "type" : "text"
      }
    }

then ingestion succeeds.

This also leads to successful ingestion.

     "PostBody" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }

And this also leads to successful ingestion

      "PostBody" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type": "keyword",
            "ignore_above": 256
          },
          "cloud" : {
            "type" : "text",
            "analyzer" : "my_stop_analyzer",
            "search_analyzer" : "my_stop_analyzer",
            "fielddata" : true
          }
        }
      }

Unfortunately, if I add the ending multi-field or the two root analyzers, then things explode with the same error message mentioned above in the original post.

Ideas? (Thanks for your help, by the way!)

Nothing comes to mind immediately in the analysis chain (this is not my area of strongest knowledge, though). One thing you could do is use an ingest processor that checks for a certain field length and, when it is exceeded, changes the field name or deletes its contents so it is not indexed (this would only act on the total field length, not on a term length, and would also cost some extra performance)?
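
A rough sketch of that idea, assuming a version that supports per-processor if conditions (6.5+); the pipeline name, the 30000-character threshold, and the PostBodyOverflow target field are made-up placeholders, and note that a Painless length() check counts characters, which is only an approximation of Lucene's 32766-byte limit:

    PUT _ingest/pipeline/postbody-length-guard
    {
      "description": "Move oversized PostBody values aside so the custom analyzers never see them",
      "processors": [
        {
          "rename": {
            "field": "PostBody",
            "target_field": "PostBodyOverflow",
            "if": "ctx.PostBody instanceof String && ctx.PostBody.length() > 30000"
          }
        }
      ]
    }

Documents would then be indexed with ?pipeline=postbody-length-guard (or by setting index.default_pipeline), so long posts are still stored, just not run through the startswith/endswith analyzers.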

Alexander, thanks. If you think of anything, please do let me know.
We're aiming for zero data loss on ingest. :confused:
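
For anyone landing here later, one analysis-chain option that did not come up in the thread is Elasticsearch's truncate token filter, which clips every token to a fixed number of characters. Appended to the two keyword-tokenizer analyzers, it keeps oversized posts indexable without dropping anything from _source; the trade-off is that prefix/suffix matching only sees the kept portion of very long posts. A sketch (the index name and the 8000-character length are arbitrary choices kept safely below the 32766-byte limit, and changing analyzers generally means creating a new index and reindexing):

    PUT /my_index_v2
    {
      "settings": {
        "analysis": {
          "filter": {
            "term_length_guard": {
              "type": "truncate",
              "length": 8000
            }
          },
          "analyzer": {
            "analyzer_startswith": {
              "tokenizer": "keyword",
              "filter": ["lowercase", "term_length_guard"]
            },
            "analyzer_endswith": {
              "tokenizer": "keyword",
              "filter": ["lowercase", "reverse", "term_length_guard"]
            }
          }
        }
      }
    }

Placing the filter after reverse in analyzer_endswith means the kept characters come from the end of the original text, so "ends with" matching still sees the tail of long posts.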

Very interested in this error output -->

the first immense term is: '[78, 117, 108, 108, 97, 109, 32, 118, 97

Are these really the tokens? 78, 117, 108, 108 -- this looks like a binary data representation, which would also explain the massiveness of the object. Is it an encoding of binary data, e.g. base64 of a PDF?

If that is the case, we would essentially be dealing with a sparse/dense vector data object rather than plain text.
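
As a side note on that prefix: the numbers are just the UTF-8 byte values of the first oversized term, so they can be decoded locally to see what the data actually is. A minimal sketch in Python, with the byte list copied from the error message above:

    # Byte values copied from "The prefix of the first immense term" in the error above.
    prefix = [78, 117, 108, 108, 97, 109, 32, 118, 97, 114, 105, 117, 115, 46, 32,
              78, 117, 108, 108, 97, 32, 102, 97, 99, 105, 108, 105, 115, 105, 46]

    print(bytes(prefix).decode("utf-8"))  # -> Nullam varius. Nulla facilisi.

Here the prefix decodes to ordinary lorem-ipsum-style text, which suggests a genuinely long PostBody rather than embedded base64 or binary content.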
