Dealing with indexing large single documents

Hello, we are encountering issues with large documents in Elasticsearch. We index text content extracted from PDF/Word documents and then search on it in an enterprise search scenario.

We are using Elastic Cloud, which has a 100MB limit per document. The best solution we've found is to split the textual content into chunks, index them as multiple documents, and then combine them at search time via aggregations.
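For context, the query-time merge looks roughly like this in REST form; a sketch only, assuming each chunk carries the parent document's key in a keyword field (reference in this sketch) and with the index name purely illustrative:

POST enterprise-search/_search
{
  "size": 0,
  "query": {
    "match": { "body": "maintenance schedule" }
  },
  "aggs": {
    "by_parent_document": {
      "terms": { "field": "reference", "size": 10 },
      "aggs": {
        "best_chunk": {
          "top_hits": { "size": 1, "_source": [ "title", "url" ] }
        }
      }
    }
  }
}

Relevance-based ordering of the buckets and paging are left out to keep the example short.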

What I want to know is: is there some out-of-the-box configuration we can apply to the index so we don't have to do this ourselves?

e.g. I just want to give Elasticsearch my very large text content and have it split the documents for indexing, and then merge them for searching, as it sees fit.

Do you mean that you have one PDF document where the extracted text is more than 100MB?

Or are you using the bulk API and sending more than one document?
The text extraction is done before sending to Elasticsearch, right? You are not also sending the PDF file as BASE64 content, right?

Yes, customers have single PDF/Word/other files whose textual content after extraction is over 100MB, usually long manuals, legal texts, etc.

We use Tika to extract the text content and send it to Elasticsearch via the Elasticsearch .NET API (the new one).

A text file of the Bible is around 5MB, so your extracted text is often bigger than roughly 20 Bibles.

And you want to index that as a single document?

Sorry for not proposing anything to help, but exactly how do you intend to structure such a document? What’s the mapping?

Yes they are large government documents

It's enterprise search, so users search for document content and each search result is just a link back to the document in its original context. Elastic used to offer something similar with the "Workplace Search" product, but it was limited to 100KB.

A (Google) search for 100MB on elastic.co led me to a page which mentions

http.max_content_length

I don't know if increasing that would help with your specific issue.
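On a self-managed cluster that setting would apparently go in elasticsearch.yml, something like this (the value is only illustrative; the default is 100mb):

http.max_content_length: 200mb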

Thanks, yes we found that one - if we were hosting Elasticsearch ourselves I believe this would be the best option, but it seems Elastic Cloud ignores that setting: Edit Elasticsearch user settings | Elasticsearch Service Documentation | Elastic

I overlooked the bit that you were using Elastic Cloud. Apologies.

I can make an educated guess what the answer might be, but ... no point in guessing, as someone from Elastic would need to confirm. Good luck.

Yeah I don't think we can change this setting. It's really a "dangerous" one IMO in terms of node stability.

@catmanjan what is the size on disk of the PDF source file?

they are large government documents

That looks huge! Not sure how a human can actually read such a document... :stuck_out_tongue:

I guess you cannot share one of those documents, right?
Are you trying to just index the text or also running some vectorization? I'm curious about the current mapping of your documents. Could you share that?

It's around 290MB. I think it's basically just a bunch of manuals stuck together, but I'm not actually allowed to see the contents...

Here is the model we use:

namespace Model
{
    /// <summary>
    /// This class defines the schema of documents stored in Elasticsearch.
    /// </summary>
    [ElasticsearchType(IdProperty = nameof(Uid))]
    public class ElasticDocument
    {
        [Keyword(Name = "_allow_permissions")]
        public IEnumerable<string>? AllowPermissions { get; set; }

        [Keyword(Name = "_deny_permissions")]
        public IEnumerable<string>? DenyPermissions { get; set; }

        [Keyword(Name = "_security_keys")]
        public IEnumerable<string>? SecurityKeys { get; set; }

        [Text(Name = "title")]
        public string? Title { get; set; }

        [Text(Name = "body")]
        public string? Body { get; set; }

        [Keyword(Name = "reference", Normalizer = "lowercase")]
        public string? Reference { get; set; }

        [Keyword(Name = "url", Normalizer = "lowercase")]
        public string? Url { get; set; }

        [Keyword(Name = "edit_url", Normalizer = "lowercase")]
        public string? EditUrl { get; set; }

        [Date(Name = "created_at")]
        public DateTime CreatedAt { get; set; }

        [Date(Name = "updated_at")]
        public DateTime UpdatedAt { get; set; }

        [Keyword(Name = "type", Normalizer = "lowercase")]
        public string? Type { get; set; }

        [Keyword(Name = "hash")]
        public string? Hash { get; set; }

        [Object(Name = "created_by")]
        public ElasticUserField? CreatedBy { get; set; }

        [Object(Name = "updated_by")]
        public ElasticUserField? UpdatedBy { get; set; }

        [Keyword(Name = "mime_type", Normalizer = "lowercase")]
        public string? MimeType { get; set; }

        [Keyword(Name = "message_id")]
        public string? MessageId { get; set; }

        [Keyword(Name = "extension", Normalizer = "lowercase")]
        public string? Extension { get; set; }

        [Keyword(Name = "icon")]
        public string? Icon { get; set; }

        [Keyword(Name = "size")]
        public long? Size { get; set; }

        [Keyword(Name = "breadcrumb")]
        public string? Breadcrumb { get; set; }

        [Keyword(Name = "repository_type")]
        public RepositoryType RepositoryType { get; set; }
    }

}

So I guess the only way to solve this is what you said at the beginning: splitting the content into multiple parts.
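A rough sketch of what that could look like on your side (the chunk size, overlap, and which metadata gets copied onto each chunk are all assumptions to adapt; this only builds the chunk documents, indexing them stays the same):

using System;
using System.Collections.Generic;
using Model;

public static class DocumentChunker
{
    // Illustrative values only; keep each chunk well under the per-request
    // limit once JSON and metadata overhead are taken into account.
    private const int ChunkSize = 1_000_000; // characters of extracted text per chunk
    private const int Overlap = 2_000;       // small overlap so phrases are not cut mid-chunk

    // Splits one extracted text blob into several ElasticDocument chunks that
    // share the parent's metadata, so results can be merged again at search time.
    public static IEnumerable<ElasticDocument> Split(ElasticDocument parent, string extractedText)
    {
        for (var start = 0; start < extractedText.Length; start += ChunkSize - Overlap)
        {
            var length = Math.Min(ChunkSize, extractedText.Length - start);
            yield return new ElasticDocument
            {
                Reference = parent.Reference, // shared key used to aggregate/collapse chunks
                Title = parent.Title,
                Url = parent.Url,
                MimeType = parent.MimeType,
                Body = extractedText.Substring(start, length)
            };
        }
    }
}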

just in passing, I happened to read this today in the documentation:

In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field

The "very large content field" reminded me of this thread. Possibly an optimization you already have, but it could be very helpful in some scenarios.

You can close the thread by accepting one of the answers, and good luck with your project.