I've seen a historical post (closed) without resolution on this, so apologies if I'm beating a dead horse.
tl;dr: implementing a custom analyzer leads to ingestion pipeline errors of the form "Document contains at least one immense term in ......"
Not applying the custom analyzer to the field does not generate this error. Assigning ignore_above to this field is not an option because it is "type": "text".
I recently had to set up a custom analyzer (truncated a bit for this post):
"settings": {
  "mapping": {
    "total_fields": {
      "limit": "2000"
    },
    "ignore_malformed": "true"
  },
  "analysis": {
    "filter": {
      "hyphen_delimiter": {
        "split_on_numerics": "false",
        "generate_word_parts": "true",
        "preserve_original": "true",
        "catenate_words": "false",
        "generate_number_parts": "true",
        "catenate_all": "false",
        "split_on_case_change": "false",
        "type": "word_delimiter",
        "catenate_numbers": "false",
        "stem_english_possessive": "false"
      }
    },
    "analyzer": {
      "slwhitespace": {
        "filter": [
          "lowercase",
          "asciifolding",
          "hyphen_delimiter"
        ],
        "type": "custom",
        "tokenizer": "punctuation"
      }
    },
    "tokenizer": {
      "punctuation": {
        "pattern": "[ ~`>=<;:,'&%#!"\?\+\*\|\{\}\[\]\(\)\\\^\$\r\n\t]",
        "type": "pattern"
      }
    }
  }
When the applied mapping references the custom analyzer:
"documentcontent": {
  "properties": {
    "content": {
      "type": "text",
      "norms": false,
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      },
      "analyzer": "slwhitespace"
    }
  }
}
A test document containing an exceptionally long string results in the following error:
{ServerError: 400Type: illegal_argument_exception Reason: "Document contains at least one immense term in field="documentcontent.content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[49, 50, 103, 101, 114, 115, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120]...', original message: bytes can be at most 32766 in length; got 1853700" CausedBy: "Type: max_bytes_length_exceeded_exception Reason: "bytes can be at most 32766 in length; got 1853700""}
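To make the error a bit more concrete, here is a rough Python sketch of what I believe is happening. It decodes the byte prefix from the message and imitates the "punctuation" tokenizer with a plain regex split. The helper names (pattern_tokens, immense_terms) and the sample text are made up for illustration, and the split only approximates the real pattern tokenizer; the 32766 limit is the one quoted in the error.

```python
import re

# Prefix bytes copied from the error message above.
prefix = bytes([49, 50, 103, 101, 114, 115, 97, 98, 99, 100,
                101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
                111, 112, 113, 114, 115, 116, 117, 118, 119, 120])
print(prefix.decode("utf-8"))  # start of the offending term

# Delimiter class from the "punctuation" tokenizer, written as a Python regex.
DELIMS = re.compile(r"""[ ~`>=<;:,'&%#!"\?\+\*\|\{\}\[\]\(\)\\\^\$\r\n\t]""")

# Maximum term size in bytes, as reported in the error message.
MAX_TERM_BYTES = 32766


def pattern_tokens(text):
    """Approximate the pattern tokenizer: split on delimiters, drop empties."""
    return [tok for tok in DELIMS.split(text) if tok]


def immense_terms(text):
    """Return lowercased tokens whose UTF-8 encoding exceeds the term limit."""
    tokens = [tok.lower() for tok in pattern_tokens(text)]
    return [tok for tok in tokens if len(tok.encode("utf-8")) > MAX_TERM_BYTES]


# Hypothetical document: a 40000-character run with no delimiter in it.
sample = "short words then " + "abcdefghij" * 4000 + " tail"
long_terms = immense_terms(sample)
print(len(long_terms), len(long_terms[0]))
```

Since the pattern only splits on the listed characters, any long run without one of them comes out as a single token, and with "preserve_original": "true" the word_delimiter filter keeps that original token alongside any parts it generates.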
When no analyzer is specified:
"documentcontent": {
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}
No error is generated and the field is ignored as expected.
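One possible explanation for the difference, as far as I can tell: with no analyzer set, the field falls back to the standard analyzer, whose max_token_length defaults to 255, and tokens longer than that are split at 255-character intervals instead of being emitted whole. The sketch below only imitates that splitting step in Python; split_at_max_token_length is a hypothetical helper, not an Elasticsearch API, and the input is invented.

```python
def split_at_max_token_length(token, max_token_length=255):
    """Cut an over-long token into consecutive chunks of at most max_token_length."""
    return [token[i:i + max_token_length]
            for i in range(0, len(token), max_token_length)]


# Hypothetical 40000-character run with no whitespace or punctuation.
long_run = "abcdefghij" * 4000
chunks = split_at_max_token_length(long_run)

# Largest chunk in UTF-8 bytes; far below the 32766-byte term limit.
largest = max(len(chunk.encode("utf-8")) for chunk in chunks)
print(len(chunks), largest)
```

If that is right, the default setup never hands Lucene a term anywhere near the limit, while the pattern tokenizer in the custom analyzer has no comparable length cap.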
Any recommendations or insight into this behavior would be greatly appreciated.
-N
Note: running an older version of ES, 5.4.1.