I’d appreciate your input on the following, please:
The business has a requirement to extract a lengthy string (text) from an external object, for example a MOBI file, and store it in an Elasticsearch index as a searchable field. The challenge is that the string could be quite large, potentially in the region of 40–50 MB (which I think significantly exceeds Lucene limits), and the business expects to be able to search within this field.
I imagine this isn’t the first time such a scenario has arisen, so I’m keen to understand what best practices are in place for handling this type of requirement.
I'm a little confused. Are you referring to a single token that is 40-50 MB, or a text field containing multiple strings/tokens (which seems to be what you are implying)? If it's the latter, storing 40-50 MB in a single text field in Elastic is fine.
Perhaps you are thinking of the single-token limit in Lucene, which is about 32 KB per token.
Lucene does still have a document limit (per stored field) of about 2 GB, so 40-50 MB is well within that.
That said, there are other areas of concern, like actually returning the data due to HTTP limits, etc.
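In case a concrete example helps, here is a minimal sketch of storing extracted text in a plain analyzed text field. The host, index and field names, the `load_mobi_text()` helper, and the use of the 8.x Python client are all illustrative assumptions, not something from your setup:

```python
from elasticsearch import Elasticsearch

# Illustrative connection details -- adjust to your cluster.
es = Elasticsearch("http://localhost:9200")

# A plain analyzed "text" field: the analyzer breaks the 40-50 MB string into
# many small tokens, so the ~32 KB per-token limit is not normally a concern.
es.indices.create(
    index="books",
    mappings={
        "properties": {
            "title": {"type": "keyword"},
            "content": {"type": "text"},
        }
    },
)

# Hypothetical extraction step -- however you pull the text out of the MOBI file.
extracted_text = load_mobi_text("example.mobi")

es.index(
    index="books",
    id="book-1",
    document={"title": "example.mobi", "content": extracted_text},
)
```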
Perhaps take a look at this thread
Minor side note: most of the text in a .mobi file is binary, so not sure if that is just an example... The header etc. is text, but the actual text of the document is binary AFAIK.
Thank you, @stephenb
Yes, I am referring to the extracted text, like:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
which should be searchable, i.e. if the customer searches for the phrase "Excepteur sint occaecat cupidatat non proident", it should return all the documents containing this phrase, as one would expect.
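A phrase query of that kind would look roughly like this (illustrative index/field names, sketched with the 8.x Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection details

# Find every document whose "content" field contains the exact phrase.
resp = es.search(
    index="books",
    query={
        "match_phrase": {
            "content": "Excepteur sint occaecat cupidatat non proident"
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```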
We face significant performance issues when storing very long text and searching across millions of documents, and I wonder if there is a recommended approach to optimise either the query or the storage (I guess nothing can be done about the latter) when querying this type of text field?
First: perhaps the search itself is quite fast, finding the document(s) that match the query, but then pulling those gigantic documents back out of Lucene, marshalling them up and sending them back to the client... yeah... that could be quite non-performant.
What we are doing on the semantic search side is chunking up these big docs into digestible parts...
Does your user really want to pull back the entire 50 MB just to find the 3 sentences or the paragraph that matches, along with the document name (could be a section etc.)?
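If the answer is no, one option is to not return the huge stored field at all and let highlighting hand back only the matching fragments. A rough sketch, again with illustrative names and the 8.x Python client (not something I have run against data of your size):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection details

resp = es.search(
    index="books",
    source=False,  # don't ship the 40-50 MB _source back to the client
    query={
        "match_phrase": {
            "content": "Excepteur sint occaecat cupidatat non proident"
        }
    },
    highlight={
        "fields": {
            "content": {"fragment_size": 200, "number_of_fragments": 3}
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit.get("highlight", {}).get("content", []))
```

One caveat: highlighting very large text fields can run into the index.highlight.max_analyzed_offset limit (on the order of a million characters by default, if I remember right), so for 40-50 MB fields you may need to index offsets on the field or raise that setting. Worth testing at realistic sizes.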
So perhaps chunking: maybe you do not want to do semantic search, but you could do something similar for normal search.
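And a very rough sketch of the chunking idea, with illustrative names and a naive paragraph-based splitter (chunk size, overlap, and the real splitting logic are things you would tune):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection details


def chunk_text(text, max_chars=5000):
    """Naively split text on blank lines into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks


extracted_text = load_mobi_text("example.mobi")  # hypothetical extraction helper

# Index each chunk as its own small document that points back to the source book.
for i, chunk in enumerate(chunk_text(extracted_text)):
    es.index(
        index="book-chunks",
        id=f"book-1-{i}",
        document={"book_id": "book-1", "chunk_no": i, "content": chunk},
    )
```

Each hit is then a small chunk rather than the whole 50 MB, and book_id tells you which source document it came from. The trade-off is that a phrase straddling a chunk boundary would be missed; overlapping chunks are one way to mitigate that.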