IllegalArgumentException: Document contains at least one immense term in field="abc" (whose UTF8 encoding is longer than the max length 32766)

I have a field whose mapping looks like the following:

new TextProperty
{
    Name = "allContent",
    Analyzer = "contentanalyzer",
    IndexOptions = IndexOptions.Offsets,
    Norms = false
}
The analyzer I am using is a custom one that basically does regex-based parsing. _source is enabled in my case.

During indexing, I get the following exception: java.lang.IllegalArgumentException: Document contains at least one immense term in field="allContent" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[48, 120, 52, 100, 53, 97, 57, 48, 48, 48...]...', original message: bytes can be at most 32766 in length; got 627762
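
For reference, the byte prefix reported in the exception can be decoded to see what kind of token the analyzer emitted. The following is a minimal sketch in plain Java, independent of Elasticsearch; the class name ImmenseTermPrefix and the method decodePrefix are made up for illustration, and the byte values are copied from the message above.

```java
import java.nio.charset.StandardCharsets;

class ImmenseTermPrefix {
    // Turn the UTF-8 byte values printed in the exception back into text.
    static String decodePrefix(int[] byteValues) {
        byte[] bytes = new byte[byteValues.length];
        for (int i = 0; i < byteValues.length; i++) {
            bytes[i] = (byte) byteValues[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First ten bytes of the immense term, as listed in the exception.
        int[] prefix = {48, 120, 52, 100, 53, 97, 57, 48, 48, 48};
        System.out.println(decodePrefix(prefix)); // prints 0x4d5a9000
    }
}
```

The prefix reads like the start of a long hexadecimal string, which suggests (though the message alone cannot confirm it) that the analyzer is emitting one unbroken run of such characters as a single 627762-byte term.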

I see that the ignore_above option could have been used had this field been of type keyword, but the field here is of type text.

What would be the solution here?

TIA ~Divya
[https://stackoverflow.com/questions/45653094/illegalargumentexception-document-contains-at-least-one-immense-term-in-field]

Basically, it would be helpful if there were some way to say: ignore any term that is longer than the supported byte length.
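
To make the request concrete, the behaviour being asked for amounts to the check sketched below, written as plain Java outside Elasticsearch. This is only an illustration of the idea: TermLengthGuard, fits and dropImmenseTerms are hypothetical names, not an Elasticsearch or Lucene API, and 32766 is the limit quoted in the exception. Note that the limit applies to the UTF-8 byte length, which can be larger than the character count.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TermLengthGuard {
    // Limit quoted in the exception message (UTF-8 bytes, not characters).
    static final int MAX_TERM_BYTES = 32766;

    // True when the term's UTF-8 encoding is within the limit.
    static boolean fits(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES;
    }

    // Keep only the terms that fit; oversized terms are silently skipped.
    static List<String> dropImmenseTerms(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (fits(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build a 627762-character ASCII term, the size reported in the exception.
        char[] chars = new char[627762];
        Arrays.fill(chars, 'a');
        String huge = new String(chars);

        List<String> kept = dropImmenseTerms(Arrays.asList("class", huge, "MyClass"));
        System.out.println(kept); // prints [class, MyClass]
    }
}
```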

Hello Divya,

One way to avoid this is to write the regex so that every repetition is
bounded, i.e. each quantifier has an explicit upper limit. This is
particularly useful when working with unknown content, since unbounded regex
matches can lead to catastrophic failures such as OOM. In practice, a regex
should not be allowed to repeat an unlimited number of times, especially in
production.

Check this link about limiting the matches to a limited length.

Here is an example.

Regex: ((class) {1,3060}(\w{1,10})[ ]{1,3060})\{
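matches content such as class MyClass { and captures the keyword and the class name. Note that \w{1,10} limits the name to 10 characters, so a longer name such as MyClassName would not match; raise the bound if needed.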

Check explanation here. https://regex101.com/
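
The bounded pattern above can be tried out with a small Java program. This is a minimal sketch, assuming the analyzer's pattern is evaluated as a Java regular expression; the class name BoundedRegexDemo, the method className and the sample inputs are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedRegexDemo {
    // Same pattern as above; every quantifier has an explicit upper bound.
    static final Pattern CLASS_DECL =
            Pattern.compile("((class) {1,3060}(\\w{1,10})[ ]{1,3060})\\{");

    // Returns the captured class name (group 3), or null if nothing matches.
    static String className(String input) {
        Matcher m = CLASS_DECL.matcher(input);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(className("class MyClass {"));            // prints MyClass
        System.out.println(className("class AVeryLongClassName {")); // prints null
    }
}
```

Because the name is limited to 10 word characters, the second input yields no match at all instead of one oversized capture, so a token produced from that group can never grow past the bound.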
H2H
~Rohit
