Indexing Large Documents in ES

Hello ES Team,

I am using Elasticsearch version 6.5.4.

When I crawl 11,000 documents, crawling stops at 555 and I get the following exception.

Exception:

[2020-03-24T23:38:48,325][DEBUG][o.e.a.b.TransportShardBulkAction] [ESC-CND-EXTSH01] [documentsearchindex_pub-itsupport][1] failed to execute bulk item (index) index {[documentsearchindex_pub-itsupport][_doc][38730], source[n/a, actual length: [43.7kb], max length: 2kb]}
java.lang.IllegalArgumentException: Document contains at least one immense term in field="documentTextPages.lowercase" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[13, 10, 13, 10, 32, 32, 32, 32, 13, 10, 32, 32, 32, 32, 32, 32, 32, 32, 13, 10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32]...', original message: bytes can be at most 32766 in length; got 40977

Can you please help me handle the above exception so I can crawl all documents successfully?

I would appreciate your suggestions.

Let me know if you require more information.

Thanks,
Shweta

Your choice of analyzer chops text up into individual tokens (think "words") for the search index.
It has produced one token that is bigger than Lucene's limit of 32,766 bytes for a single token.

Ordinarily this might happen if someone stored something unusual, like a base64-encoded image with no spaces, in the text of a document. That would be treated as one word and could blow the limit. In your case I see the 13, 10, 32, etc. content is all whitespace, so I'm wondering why your analyzer hasn't thrown it away. We'd need to see your mappings/analyzer config to refine this.
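As a quick check (a sketch only; the index name here is taken from your log and the sample text is made up), you can see exactly which tokens a field's analyzer emits with the `_analyze` API:

```json
GET documentsearchindex_pub-itsupport/_analyze
{
  "field": "documentTextPages.lowercase",
  "text": "Some sample page text with\r\n\r\n    lots of whitespace"
}
```

If this returns a single token containing the entire input, the analyzer isn't splitting on whitespace, which would explain the immense term.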


Hello,

Thank You.

Below is my mapping code.
{
    "mappings":{
        "_doc":{
            "properties":{
                "documentId":{
                    "type":"integer"
                },
                "documentName":{
                    "type":"text",
                    "fields":{
                        "lowercase":{
                            "type":"text",
                            "analyzer":"lowercase_analyzer"
                        }
                    }
                },
                "documentType":{
                    "type":"text",
                    "fields":{
                        "lowercase":{
                            "type":"text",
                            "analyzer":"lowercase_analyzer"
                        }
                    }
                },
                "documentTextPages":{
                    "type":"text",
                    "fields":{
                        "lowercase":{
                            "type":"text",
                            "analyzer":"lowercase_analyzer"
                        }
                    }
                },
                "meetingID":{
                    "type":"text"
                },
                "meetingTypeID":{
                    "type":"text"
                },
                "meetingTypeName":{
                    "type":"text",
                    "fields":{
                        "lowercase":{
                            "type":"text",
                            "analyzer":"lowercase_analyzer"
                        }
                    }
                },
                "portalID":{
                    "type":"text"
                },
                "locationTitle":{
                    "type":"text",
                    "fields":{
                        "lowercase":{
                            "type":"text",
                            "analyzer":"lowercase_analyzer"
                        }
                    }
                },
                "meetingStartDate":{
                    "type":"date"
                },
                "meetingEndDate":{
                    "type":"date"
                }
            }
        }
    },
    "settings":{
        "index":{
            "number_of_shards":"5",
            "analysis":{
                "filter":{
                    "lowercase_token_filter":{
                        "type":"lowercase"
                    }
                },
                "analyzer":{
                    "lowercase_analyzer":{
                        "filter":[
                            "lowercase_token_filter"
                        ],
                        "type":"custom",
                        "tokenizer":"keyword"
                    },
                    "autocomplete":{
                        "tokenizer":"autocomplete",
                        "filter":[
                            "lowercase"
                        ]
                    }
                },
                "tokenizer":{
                    "autocomplete":{
                        "type":"edge_ngram",
                        "min_gram":2,
                        "max_gram":10,
                        "token_chars":[
                            "letter",
                            "digit"
                        ]
                    }
                }
            }
        },
        "number_of_replicas":"0"
    }
}

I believe the exception comes from the documentTextPages field.

Do I need to change anything in above mapping?

Thanks.

I’d need to see your settings for the analyzer definition, but I imagine there’s no tokenizing of the value into multiple words.
Try the “standard” analyzer?
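For comparison (a minimal sketch, not tied to your index; the sample text is invented), the built-in `standard` analyzer splits on word boundaries and lowercases, so no single token can span a whole document:

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "Board Meeting Minutes\r\n\r\n    March 2020"
}
```

This returns the separate tokens `board`, `meeting`, `minutes`, `march`, and `2020` rather than one large token.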

I have used the above analyzer in the mapping I shared above.

Can you please suggest where I should use the standard analyzer, and which one, because I am stuck here. Also, what kind of settings do you need to see?

I appreciate your help. Kindly suggest how I can cope with this issue.

You haven’t shown the index ‘settings’ - only the ‘mappings’.

I have already attached the index settings with the mapping.
Anyway, here are my settings again.

"settings":{
    "index":{
        "number_of_shards":"5",
        "analysis":{
            "filter":{
                "lowercase_token_filter":{
                    "type":"lowercase"
                }
            },
            "analyzer":{
                "lowercase_analyzer":{
                    "filter":[
                        "lowercase_token_filter"
                    ],
                    "type":"custom",
                    "tokenizer":"keyword"
                },
                "autocomplete":{
                    "tokenizer":"autocomplete",
                    "filter":[
                        "lowercase"
                    ]
                }
            },
            "tokenizer":{
                "autocomplete":{
                    "type":"edge_ngram",
                    "min_gram":2,
                    "max_gram":10,
                    "token_chars":[
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    },
    "number_of_replicas":"0"
}
}

Let me know if anything else is required.

Thanks.

My bad - the formatting was weird on my phone and I couldn’t see the settings in your original post. In your analyzer config you are using the ‘keyword’ tokenizer which does no tokenisation at all. It keeps the value as one single token, hence your problem with big documents. Pick a different tokenizer that suits your needs.
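One possible fix (a sketch; keep or adjust the filter name to match your settings) is to keep the custom analyzer but swap the `keyword` tokenizer for the `standard` tokenizer, so text is split into words before lowercasing:

```json
"analyzer":{
  "lowercase_analyzer":{
    "type":"custom",
    "tokenizer":"standard",
    "filter":[
      "lowercase_token_filter"
    ]
  }
}
```

Note that an index-time analyzer on an existing field can’t be changed in place; you’d need to create a new index with the corrected settings and reindex into it.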

Thank You for your help.

Can you suggest which tokenizer I should use for large documents?

I have one more question: can upgrading to the latest version, 7.8, resolve my problem?

Why do you feel the need to use anything other than the standard analyzer? It is the default for text fields and already lowercases values. I’m not clear what problem you’re trying to solve with the use of a custom analyzer.
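To make that concrete (a sketch only, using one field from the mapping above), the field could simply be mapped as plain `text`. The default `standard` analyzer already emits lowercase terms, so the `lowercase` sub-field with its custom analyzer becomes unnecessary:

```json
"documentTextPages":{
    "type":"text"
}
```

Queries that previously targeted `documentTextPages.lowercase` would then go against `documentTextPages` directly.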
