Error "String length exceeds the maximum length (5000000)" when transferring a large document to the attachment pipeline

I'm using Elasticsearch 8.9.1
Using Python, I send an 8 MB .xlsx document to an Elasticsearch index via the attachment pipeline, but I get the error "String length (5046272) exceeds the maximum length (5000000)".
Documents of around 1 MB, for example, do not cause an error.
How can I increase the maximum string length or otherwise avoid this error?
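For reference, here is a minimal sketch of roughly what my indexing call looks like; the index name, pipeline name, and file name are placeholders, and it assumes a pipeline containing an attachment processor that reads the base64-encoded file from the data field.

```python
import base64

from elasticsearch import Elasticsearch

# Placeholder connection details; adjust for your cluster and credentials.
es = Elasticsearch("http://localhost:9200")

# The attachment processor expects the file content base64-encoded in a field
# (assumed here to be "data", read by a pipeline assumed to be named "attachment").
with open("report.xlsx", "rb") as f:
    payload = base64.b64encode(f.read()).decode("ascii")

# Indexing the 8 MB file this way is what triggers the string-length error.
es.index(index="documents", id="report-1", pipeline="attachment", document={"data": payload})
```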

Welcome!

Could you share the full stack trace please?

That's one of the reasons I prefer doing the extraction on the local machine before sending the data to Elasticsearch. This is what FSCrawler does.
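As a rough sketch of that idea in Python (assuming the third-party tika package, which runs Apache Tika locally; FSCrawler itself is a separate Java tool that automates this kind of workflow):

```python
from elasticsearch import Elasticsearch
from tika import parser  # third-party "tika" package, runs Apache Tika locally

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

# Extract the text on the local machine instead of inside the ingest pipeline.
parsed = parser.from_file("report.xlsx")
content = parsed.get("content") or ""

# Index the already-extracted text directly; no attachment pipeline involved.
es.index(index="documents", id="report-1", document={"content": content})
```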

But the error "String length (5046272) exceeds the maximum length (5000000)" appears.

The default limit of 5 MB has been increased to 20 MB. See:

It's part of Jackson 2.15.1, but AFAICS 2.15.1 is not yet part of 8.10.4...

Although it is supposed to be 2 GB...

I added a comment here:

The full error message is below. I don't know how to get the full stack trace.

{'root_cause': [{'type': 'document_parsing_exception', 'reason': "[-1:21] failed to parse field [attachment.content] of type [text] in document with id 'X5mWECsLu1Sp_VdsN7Bcng=='. Could not parse field value preview,"}], 'type': 'document_parsing_exception', 'reason': "[-1:21] failed to parse field [attachment.content] of type [text] in document with id 'X5mWECsLu1Sp_VdsN7Bcng=='. Could not parse field value preview,", 'caused_by': {'type': 'stream_constraints_exception', 'reason': 'String length (5046272) exceeds the maximum length (5000000)'}}
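That error body is what the 8.x Python client attaches to the exception it raises; a minimal sketch of how I read it (index, pipeline, and file names are placeholders), with the server-side stack trace itself presumably ending up in the Elasticsearch logs:

```python
import base64

from elasticsearch import ApiError, Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

with open("big_report.xlsx", "rb") as f:  # placeholder file name
    payload = base64.b64encode(f.read()).decode("ascii")

try:
    es.index(index="documents", pipeline="attachment", document={"data": payload})
except ApiError as e:
    print(e.meta.status)    # HTTP status of the failed request, e.g. 400
    print(e.body["error"])  # root_cause plus the nested caused_by chain shown above
```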

To avoid the problem described above, I'm using Elasticsearch 8.7.1. That version handles 5 MB documents successfully, but 44 MB documents cause the error below.

{'root_cause': [{'type': 'parse_exception', 'reason': 'Error parsing document in field [data]'}], 'type': 'parse_exception', 'reason': 'Error parsing document in field [data]', 'caused_by': {'type': 'tika_exception', 'reason': 'Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@5c3c5810', 'caused_by': {'type': 'record_format_exception', 'reason': 'Tried to allocate an array of length 108,463,468, but the maximum length for this record type is 100,000,000.\nIf the file is not corrupt and not large, please open an issue on bugzilla to request \nincreasing the maximum allowable size for this record type.\nYou can set a higher override value with IOUtils.setByteArrayMaxOverride()'}}}

Since I need to index very large MS Word/Excel documents, I would like to solve this problem too.

You can't do that with the ingest attachment plugin IMO. But the good news is that we just released this in 8.11: Content extraction | Enterprise Search documentation [8.11] | Elastic

I have not tested it yet, and I don't know whether you can use this service directly.

If it still does not work for you, maybe give FSCrawler a try.

It should be in the Elasticsearch logs.
