Indexing Large Documents in ES

Hello ES Team,

I am using Elasticsearch version 6.5.4.

When I crawl 11,000 documents, crawling stops at 555 and I get the following exception.

Exception:

[2020-03-24T23:38:48,325][DEBUG][o.e.a.b.TransportShardBulkAction] [ESC-CND-EXTSH01] [documentsearchindex_pub-itsupport][1] failed to execute bulk item (index) index {[documentsearchindex_pub-itsupport][_doc][38730], source[n/a, actual length: [43.7kb], max length: 2kb]}
java.lang.IllegalArgumentException: Document contains at least one immense term in field="documentTextPages.lowercase" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[13, 10, 13, 10, 32, 32, 32, 32, 13, 10, 32, 32, 32, 32, 32, 32, 32, 32, 13, 10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32]...', original message: bytes can be at most 32766 in length; got 40977

Can you please help me handle the above exception so that I can crawl all documents successfully?

I would appreciate your suggestions.

Let me know if you require more information.

Thanks,
Shweta

Your choice of analyzer chops text up into individual tokens (think "words") for the search index.
It has produced one token that is bigger than Lucene's limit of 32,766 bytes for a single token.

Ordinarily this might happen if someone stored something unusual, like a base64-encoded image with no spaces, in the text of a document. That would be treated as one word and could blow the limit. In your case I see the 13, 10, 32, etc. content is all whitespace (carriage returns, line feeds and spaces), so I'm wondering why your analyzer hasn't chosen to throw this away. We'd need to see your mappings/analyzer config to refine this.
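In the meantime, the _analyze API will show you exactly which tokens a field's analyzer produces for a sample of text. A quick check along these lines (the index and field names are taken from your error log; the sample text is made up) should reveal whether the whole value comes back as one huge token rather than individual words:

GET /documentsearchindex_pub-itsupport/_analyze
{
    "field": "documentTextPages.lowercase",
    "text": "a sample page of extracted document text"
}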


Hello,

Thank you.

Below is my mapping code.
{
"mappings":{
    "_doc":{
        "properties":{
            "documentId":{
                "type":"integer"
            },
            "documentName":{
                "type":"text",
                "fields":{
                    "lowercase":{
                        "type":"text",
                        "analyzer":"lowercase_analyzer"
                    }
                }
            },
            "documentType":{
                "type":"text",
                "fields":{
                    "lowercase":{
                        "type":"text",
                        "analyzer":"lowercase_analyzer"
                    }
                }
            },
            "documentTextPages":{
                "type":"text",
                "fields":{
                    "lowercase":{
                        "type":"text",
                        "analyzer":"lowercase_analyzer"
                    }
                }
            },
            "meetingID":{  
                "type":"text"
            },
            "meetingTypeID":{  
                "type":"text"
            },
            "meetingTypeName":{  
                "type":"text",
                "fields":{  
                    "lowercase":{  
                        "type":"text",
                        "analyzer":"lowercase_analyzer" 
                    }
                }
            },
             "portalID":{  
                "type":"text"
            },
            "locationTitle":{  
                "type":"text",
                "fields":{  
                    "lowercase":{  
                        "type":"text",
                        "analyzer":"lowercase_analyzer"
                    }
                }
            },
            "meetingStartDate":{  
                "type":"date"
            },
            "meetingEndDate":{  
                "type":"date"
            }
        }
    }
},
"settings":{  
    "index":{  
        "number_of_shards":"5",
        "analysis":{  
            "filter":{  
                "lowercase_token_filter":{  
                    "type":"lowercase"
                }
            },
            "analyzer":{  
                "lowercase_analyzer":{  
                    "filter":[  
                        "lowercase_token_filter"
                    ],
                    "type":"custom",
                    "tokenizer":"keyword"
                },
                "autocomplete":{  
                    "tokenizer":"autocomplete",
                    "filter":[  
                        "lowercase"
                    ]
                }
            },
            "tokenizer":{  
                "autocomplete":{  
                    "type":"edge_ngram",
                    "min_gram":2,
                    "max_gram":10,
                    "token_chars":[  
                        "letter",
						"digit"
                    ]
                }
            }
        }
    },
    "number_of_replicas":"0"
}
}

I believe the exception comes from the documentTextPages field.

Do I need to change anything in the above mapping?

Thanks.

I'd need to see your settings for the analyzer definition, but I imagine it looks something like this, where there's no tokenizing of the value into multiple words.
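Something along these lines, for example (a guess on my part, reusing the lowercase_analyzer name from your mapping), where the keyword tokenizer keeps the whole value as one single token:

"analysis":{
    "analyzer":{
        "lowercase_analyzer":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
                "lowercase"
            ]
        }
    }
}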
Try the "standard" analyzer?

I have used the above analyzer in the mapping I shared earlier.

Can you please suggest where I should use the standard analyzer, and which one, because I am stuck here. Also, what kind of settings do you need to see?

I appreciate your help. Kindly suggest how I can cope with this issue.

You haven't shown the index 'settings' - only the 'mappings'.

I have already attached the index settings with the mapping.
Anyway, here are my settings.

"settings":{
"index":{
"number_of_shards":"5",
"analysis":{
"filter":{
"lowercase_token_filter":{
"type":"lowercase"
}
},
"analyzer":{
"lowercase_analyzer":{
"filter":[
"lowercase_token_filter"
],
"type":"custom",
"tokenizer":"keyword"
},
"autocomplete":{
"tokenizer":"autocomplete",
"filter":[
"lowercase"
]
}
},
"tokenizer":{
"autocomplete":{
"type":"edge_ngram",
"min_gram":2,
"max_gram":10,
"token_chars":[
"letter",
"digit"
]
}
}
}
},
"number_of_replicas":"0"
}
}

Let me know if anything else is required.

Thanks.

My bad - the formatting was weird on my phone and I couldn't see the settings in your original post. In your analyzer config you are using the 'keyword' tokenizer, which does no tokenisation at all. It keeps the value as one single token, hence your problem with big documents. Pick a different tokenizer that suits your needs.
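If it helps, here is a minimal sketch of that change (assuming you still want a dedicated lowercase_analyzer): it swaps the keyword tokenizer for the standard tokenizer, so the text is split into words before lowercasing and no single token can grow to the size of a whole page:

"analysis":{
    "filter":{
        "lowercase_token_filter":{
            "type":"lowercase"
        }
    },
    "analyzer":{
        "lowercase_analyzer":{
            "type":"custom",
            "tokenizer":"standard",
            "filter":[
                "lowercase_token_filter"
            ]
        }
    }
}

Bear in mind that you can't change the analyzer of an existing field in place, so as far as I know you would need to recreate the index (or reindex into a new one) for this change to take effect.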

Thank you for your help.

Can you suggest which tokenizer I should use for large documents?

I have one more question: would upgrading to the latest version, 7.8, resolve my problem?

Why do you feel the need to use anything other than the standard analyzer? It is the default for text fields and already lowercases values. I'm not clear what problem you're trying to solve with the use of a custom analyzer.
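For what it's worth, if lowercased, word-level matching is all you need (my assumption about your requirements), a minimal sketch of that simplification is to drop the custom analyzer and the lowercase sub-field and let the default standard analyzer handle the field:

"documentTextPages":{
    "type":"text"
}

Queries that currently target documentTextPages.lowercase would then go against documentTextPages directly.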
