Hello, we are encountering issues with large documents in Elasticsearch. We index text content extracted from PDF/Word documents and then search on it in an enterprise search scenario.
We are using Elastic Cloud, which has a 100 MB limit per document. The best solution we've found is to split the textual content into chunks, index multiple documents, and then combine them at search time via aggregations.
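Roughly, our current approach looks like this (a minimal sketch with simplified index/field names, not our exact code; `parent_id` is mapped as a keyword so we can aggregate on it):

```python
# Sketch of the chunk-and-reassemble workaround described above.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://localhost:9200")

CHUNK_SIZE = 1_000_000  # ~1 MB of text per chunk, well under the 100 MB limit

def index_in_chunks(doc_id: str, title: str, content: str) -> None:
    """Split one extracted document into chunk documents sharing a parent id."""
    actions = [
        {
            "_index": "docs",
            "_id": f"{doc_id}-{i}",
            "_source": {
                "parent_id": doc_id,  # keyword field used to group chunks later
                "title": title,
                "chunk_no": i,
                "content": content[start:start + CHUNK_SIZE],
            },
        }
        for i, start in enumerate(range(0, len(content), CHUNK_SIZE))
    ]
    helpers.bulk(es, actions)

# At search time, aggregate on parent_id so each source document shows up
# once, no matter how many of its chunks matched.
resp = es.search(
    index="docs",
    size=0,
    query={"match": {"content": "some search terms"}},
    aggs={
        "per_document": {
            "terms": {"field": "parent_id", "size": 10},
            "aggs": {"best_chunk": {"top_hits": {"size": 1}}},
        }
    },
)
```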
What I want to know is: is there some out-of-the-box configuration we can apply to the index so WE don't have to do this?
E.g. I just want to give Elasticsearch my very large text content, and it splits the documents for indexing and merges them for searching as it sees fit?
Do you mean that you have one PDF document where the extracted text is more than 100 MB?
Or are you using the bulk API and sending more than one document?
The text extraction is done before sending to Elasticsearch, right? You are not also sending the PDF file as BASE64 content, right?
It's enterprise search, so users search for document content; the search result is just a link back to the document in its original context. Elastic used to offer something similar with the "Workplace Search" product, but it was limited to 100 KB.
Yeah, I don't think we can change this setting (`http.max_content_length`). It's really a "dangerous" one IMO in terms of node stability.
@catmanjan what is the size on disk of the PDF source file?
They are large government documents.
That looks huge! Not sure how a human can actually read such a document...
I guess you can't share one of those documents, right?
Are you trying to just index the text, or also running some vectorization? I'm curious about the current mapping of your documents. Could you share that?
Just in passing, I happened to read this today in the documentation:
In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field.
The "very large content field" reminded me of this thread. Possibly an optimization you have already, or could be very helpful in some scenarios.
You can close the thread by accepting one of the answers, and good luck with your project.