Dealing with indexing large single documents

Hello, we are encountering issues with large documents in Elasticsearch. We index text content extracted from PDF/Word documents and then search on it in an enterprise search scenario.

We are using Elastic Cloud, which has a 100MB limit per document. The best solution we've found is to split the textual content into chunks, index them as multiple documents, and then combine them at search time via aggregations.
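For context, the query-time merge looks roughly like this in REST form; a sketch only, assuming each chunk carries the parent document's key in a keyword field (reference in this sketch) and with the index name purely illustrative:

POST enterprise-search/_search
{
  "size": 0,
  "query": {
    "match": { "body": "maintenance schedule" }
  },
  "aggs": {
    "by_parent_document": {
      "terms": { "field": "reference", "size": 10 },
      "aggs": {
        "best_chunk": {
          "top_hits": { "size": 1, "_source": [ "title", "url" ] }
        }
      }
    }
  }
}

Relevance-based ordering of the buckets and paging are left out to keep the example short.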

What I want to know is: is there some out-of-the-box configuration we can apply to the index so we don't have to do this ourselves?

e.g. I just want to give Elasticsearch my very large text content and have it split the documents for indexing, and then merge them for searching, as it sees fit.

Do you mean that you have one PDF document where the extracted text is more than 100MB?

Or are you using the bulk API and sending more than one document?
The text extraction is done before sending to Elasticsearch, right? You are not also sending the PDF file as BASE64 content, right?

Yes, customers have single PDF/Word/other files whose textual content after extraction is over 100MB, usually long manuals, legal texts, etc.

We use Tika to extract the text content and send it to Elasticsearch via the Elasticsearch .NET API (the new one).

A text file of the Bible is around 5MB, so your extracted text is often bigger than roughly 20 Bibles.

And you want to index that as a single document?

Sorry for not proposing anything to help, but exactly how do you intend to structure such a document? What’s the mapping?

Yes they are large government documents

It's enterprise search, so users search for document content and each search result is just a link back to the document in its original context. Elastic used to offer something similar with the "Workplace Search" product, but it was limited to 100KB.

A (Google) search for 100MB on elastic.co led me to a page which mentions

http.max_content_length

I don't know if increasing that would help with your specific issue.
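On a self-managed cluster that setting would apparently go in elasticsearch.yml, something like this (the value is only illustrative; the default is 100mb):

http.max_content_length: 200mb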

Thanks, yes we found that one - if we were hosting Elasticsearch ourselves I believe this would be the best option, but it seems Elastic Cloud ignores that setting: Edit Elasticsearch user settings | Elasticsearch Service Documentation | Elastic

I overlooked the bit that you were using Elastic Cloud. Apologies.

I can make an educated guess what the answer might be, but ... no point in guessing, as someone from Elastic would need to confirm. Good luck.

Yeah I don't think we can change this setting. It's really a "dangerous" one IMO in terms of node stability.

@catmanjan what is the size on disk of the PDF source file?

they are large government documents

That looks huge! Not sure how a human can actually read such a document... :stuck_out_tongue:

I guess you cannot share one of those documents, right?
Are you trying to just index the text or also running some vectorization? I'm curious about the current mapping of your documents. Could you share that?

It's around 290MB. I think it's basically just a bunch of manuals stuck together, but I'm not actually allowed to see the contents...

Here is the model we use:

namespace Model
{
    /// <summary>
    /// This class defines the schema of documents stored in Elasticsearch.
    /// </summary>
    [ElasticsearchType(IdProperty = nameof(Uid))]
    public class ElasticDocument
    {
        [Keyword(Name = "_allow_permissions")]
        public IEnumerable<string>? AllowPermissions { get; set; }

        [Keyword(Name = "_deny_permissions")]
        public IEnumerable<string>? DenyPermissions { get; set; }

        [Keyword(Name = "_security_keys")]
        public IEnumerable<string>? SecurityKeys { get; set; }

        [Text(Name = "title")]
        public string? Title { get; set; }

        [Text(Name = "body")]
        public string? Body { get; set; }

        [Keyword(Name = "reference", Normalizer = "lowercase")]
        public string? Reference { get; set; }

        [Keyword(Name = "url", Normalizer = "lowercase")]
        public string? Url { get; set; }

        [Keyword(Name = "edit_url", Normalizer = "lowercase")]
        public string? EditUrl { get; set; }

        [Date(Name = "created_at")]
        public DateTime CreatedAt { get; set; }

        [Date(Name = "updated_at")]
        public DateTime UpdatedAt { get; set; }

        [Keyword(Name = "type", Normalizer = "lowercase")]
        public string? Type { get; set; }

        [Keyword(Name = "hash")]
        public string? Hash { get; set; }

        [Object(Name = "created_by")]
        public ElasticUserField? CreatedBy { get; set; }

        [Object(Name = "updated_by")]
        public ElasticUserField? UpdatedBy { get; set; }

        [Keyword(Name = "mime_type", Normalizer = "lowercase")]
        public string? MimeType { get; set; }

        [Keyword(Name = "message_id")]
        public string? MessageId { get; set; }

        [Keyword(Name = "extension", Normalizer = "lowercase")]
        public string? Extension { get; set; }

        [Keyword(Name = "icon")]
        public string? Icon { get; set; }

        [Keyword(Name = "size")]
        public long? Size { get; set; }

        [Keyword(Name = "breadcrumb")]
        public string? Breadcrumb { get; set; }

        [Keyword(Name = "repository_type")]
        public RepositoryType RepositoryType { get; set; }
    }

}

So I guess the only way to solve this is what you said at the beginning: splitting the content into multiple parts.
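A rough sketch of what that could look like on your side (the chunk size, overlap, and which metadata gets copied onto each chunk are all assumptions to adapt; this only builds the chunk documents, indexing them stays the same):

using System;
using System.Collections.Generic;
using Model;

public static class DocumentChunker
{
    // Illustrative values only; keep each chunk well under the per-request
    // limit once JSON and metadata overhead are taken into account.
    private const int ChunkSize = 1_000_000; // characters of extracted text per chunk
    private const int Overlap = 2_000;       // small overlap so phrases are not cut mid-chunk

    // Splits one extracted text blob into several ElasticDocument chunks that
    // share the parent's metadata, so results can be merged again at search time.
    public static IEnumerable<ElasticDocument> Split(ElasticDocument parent, string extractedText)
    {
        for (var start = 0; start < extractedText.Length; start += ChunkSize - Overlap)
        {
            var length = Math.Min(ChunkSize, extractedText.Length - start);
            yield return new ElasticDocument
            {
                Reference = parent.Reference, // shared key used to aggregate/collapse chunks
                Title = parent.Title,
                Url = parent.Url,
                MimeType = parent.MimeType,
                Body = extractedText.Substring(start, length)
            };
        }
    }
}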

just in passing, I happened to read this today in the documentation:

In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field

The "very large content field" reminded me of this thread. Possibly an optimization you already have, but it could be very helpful in some scenarios.

You can close the thread by accepting one of the answers, and good luck with your project.