Analyzers and Content Length

I have a wonderful little mapping and index that I use to ingest documents. Unfortunately, a few of the documents have long fields and the ingest process fails for those documents.

An example field mapping that I have contains

 {
     "PostBody" : {
         "type" : "text",
         "search_analyzer": "simple",
         "analyzer": "analyzer_startswith",
         "fields" : {
           "keyword" : {
             "type" : "keyword",
             "ignore_above" : 256
           },
           "ending": {
             "type": "text",
             "search_analyzer": "simple",
             "analyzer": "analyzer_endswith"
           },
           "cloud" : {
             "type" : "text",
             "analyzer" : "my_stop_analyzer",
             "search_analyzer" : "my_stop_analyzer",
             "fielddata" : true
           }
         }
       }
 }

and my index's settings are

 {
     "index" : {
       "number_of_shards" : "5",
       "number_of_replicas" : "1",
       "analysis": {
         "analyzer": {
           "analyzer_startswith" : {
             "tokenizer": "keyword",
             "filter": "lowercase"
           },
           "analyzer_endswith" : {
             "tokenizer": "keyword",
             "filter" : [
               "lowercase",
               "reverse"
             ]
           },
           "my_stop_analyzer" : {
             "type" : "stop",
             "stopwords_path" : "/etc/elasticsearch/word_cloud_stopwords.txt",
             "filter" : ["lowercase"]
           }
         }
       }
     }
 }

Now, the error message I am receiving is

 {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field=\"PostBody\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[78, 117, 108, 108, 97, 109, 32, 118, 97, 114, 105, 117, 115, 46, 32, 78, 117, 108, 108, 97, 32, 102, 97, 99, 105, 108, 105, 115, 105, 46]...', original message: bytes can be at most 32766 in length; got 37887"}],"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field=\"PostBody\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[78, 117, 108, 108, 97, 109, 32, 118, 97, 114, 105, 117, 115, 46, 32, 78, 117, 108, 108, 97, 32, 102, 97, 99, 105, 108, 105, 115, 105, 46]...', original message: bytes can be at most 32766 in length; got 37887","caused_by":{"type":"max_bytes_length_exceeded_exception","reason":"bytes can be at most 32766 in length; got 37887"}},"status":400}

I have tried any number of ways around this and have not found one that preserves the functionality AND mitigates the max length exception. Do you know how I can alter the index mapping or analyzers to accommodate such large documents?
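
For context on where the single oversized term comes from: both analyzer_startswith and analyzer_endswith use the keyword tokenizer, which emits the entire field value as one token, so any PostBody longer than 32766 UTF-8 bytes becomes exactly the kind of immense term that Lucene rejects. A quick way to confirm this against the index (the index name my_index and the sample text are placeholders) is the _analyze API, which comes back with a single token covering the whole input:

    POST /my_index/_analyze
    {
      "analyzer": "analyzer_startswith",
      "text": "Nullam varius. Nulla facilisi. <the rest of a long PostBody>"
    }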

The ignore_above mapping parameter has a hint as to why this happens and what you could do about it...

hope this helps!

Thanks, Alexander. I included ignore_above: 256 on the keyword sub-field and that didn't change or remove the error message. :confused:

Sorry, I misread your snippet. ignore_above only works for the keyword field mapper, but you seem to have those kinds of strings in fields you would like to keep searchable.

I actually don't know off the top of my head if there is any workaround for text fields - which I suppose is what is throwing this now? Can you index a document with the field mapped as a plain text field and see if this still happens?

Alexander, if I strip the PostBody mapping down to

    {
      "PostBody" : {
        "type" : "text"
      }
    }

then ingestion succeeds.

This also leads to successful ingestion.

     "PostBody" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }

And this also leads to successful ingestion

      "PostBody" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type": "keyword",
            "ignore_above": 256
          },
          "cloud" : {
            "type" : "text",
            "analyzer" : "my_stop_analyzer",
            "search_analyzer" : "my_stop_analyzer",
            "fielddata" : true
          }
        }
      }

Unfortunately, if I add the ending multi-field or the two root analyzers, then things explode with the same error message mentioned above in the original post.

Ideas? (Thanks for your help, by the way!)

Nothing comes to mind immediately in the analysis chain (this is not my area of strongest knowledge, though). One thing you could do is use an ingest processor that checks for a certain field length and, when it is exceeded, changes the field name or deletes its contents so it is not indexed (this would only act on the total field length, not on a term length, and would also cost some extra performance)?
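
A rough sketch of that idea, assuming a version that supports per-processor if conditions (6.5+); the pipeline name, the 30000-character threshold, and the PostBodyOverflow target field are made-up placeholders, and note that a Painless length() check counts characters, which is only an approximation of Lucene's 32766-byte limit:

    PUT _ingest/pipeline/postbody-length-guard
    {
      "description": "Move oversized PostBody values aside so the custom analyzers never see them",
      "processors": [
        {
          "rename": {
            "field": "PostBody",
            "target_field": "PostBodyOverflow",
            "if": "ctx.PostBody instanceof String && ctx.PostBody.length() > 30000"
          }
        }
      ]
    }

Documents would then be indexed with ?pipeline=postbody-length-guard (or by setting index.default_pipeline), so long posts are still stored, just not run through the startswith/endswith analyzers.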

Alexander, thanks. If you think of anything, please do let me know.
We're aiming for zero data loss on ingest. :confused:
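
For anyone landing here later, one analysis-chain option that did not come up in the thread is Elasticsearch's truncate token filter, which clips every token to a fixed number of characters. Appended to the two keyword-tokenizer analyzers, it keeps oversized posts indexable without dropping anything from _source; the trade-off is that prefix/suffix matching only sees the kept portion of very long posts. A sketch (the index name and the 8000-character length are arbitrary choices kept safely below the 32766-byte limit, and changing analyzers generally means creating a new index and reindexing):

    PUT /my_index_v2
    {
      "settings": {
        "analysis": {
          "filter": {
            "term_length_guard": {
              "type": "truncate",
              "length": 8000
            }
          },
          "analyzer": {
            "analyzer_startswith": {
              "tokenizer": "keyword",
              "filter": ["lowercase", "term_length_guard"]
            },
            "analyzer_endswith": {
              "tokenizer": "keyword",
              "filter": ["lowercase", "reverse", "term_length_guard"]
            }
          }
        }
      }
    }

Placing the filter after reverse in analyzer_endswith means the kept characters come from the end of the original text, so "ends with" matching still sees the tail of long posts.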

Very interested in this error output -->

the first immense term is: '[78, 117, 108, 108, 97, 109, 32, 118, 97

Are these really the tokens? 78, 117, 108, 108 -- this looks like a binary data representation, which would also explain the massiveness of the object. Is it an encoding of binary data, e.g. base64 of a PDF?

If that is the case, we would essentially be dealing with a sparse/dense vector data object rather than plain text.
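
As a side note on that prefix: the numbers are just the UTF-8 byte values of the first oversized term, so they can be decoded locally to see what the data actually is. A minimal sketch in Python, with the byte list copied from the error message above:

    # Byte values copied from "The prefix of the first immense term" in the error above.
    prefix = [78, 117, 108, 108, 97, 109, 32, 118, 97, 114, 105, 117, 115, 46, 32,
              78, 117, 108, 108, 97, 32, 102, 97, 99, 105, 108, 105, 115, 105, 46]

    print(bytes(prefix).decode("utf-8"))  # -> Nullam varius. Nulla facilisi.

Here the prefix decodes to ordinary lorem-ipsum-style text, which suggests a genuinely long PostBody rather than embedded base64 or binary content.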
