What is the maximum text size that can be indexed as a single term?

Hi,

I'm using 7.16.1,
When I save a large amount of text to an Elasticsearch field, it is not indexed as a single term.
What is the maximum text size for a field to be indexed as a single term?

Thanks,
Shay


I've put some pretty big pieces of text (multi-thousand-line stack traces) in a field and have not run into a problem. If your bulk insert is larger than 20MB it will be broken up into smaller pieces; however, if a single document is larger than 20MB it will just get indexed using the normal (non-bulk) API.

Thanks for your answer Andreas, but it doesn't answer my question. I was asking about something else.
I've saved a field with a few lines of text (let's say 200 words) and the field is marked as _ignored.
It is not indexed... I need to know the maximum field size that can be indexed as a single term.

Whoops. Sorry, you're right, I got mixed up with another post.
Check your ignore_above parameter... but 200 words is not a lot, and that should easily work if you have not changed any settings. There is also a limit of 100MB on the HTTP request size if I remember correctly, but that can be changed too.
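
For what it's worth, here is a rough sketch (Python client, with my-index as a placeholder index name) of how you could read the mapping back and see whether any keyword field, or keyword sub-field, has an ignore_above limit set:

```python
# Rough sketch (Python client, 7.x style): read the mapping back and check whether
# any keyword field has an ignore_above limit. "my-index" is a placeholder name.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

mapping = es.indices.get_mapping(index="my-index")
properties = mapping["my-index"]["mappings"]["properties"]

for name, spec in properties.items():
    if spec.get("type") == "keyword":
        print(name, "ignore_above =", spec.get("ignore_above", "not set"))
    # keyword sub-fields, e.g. the "field.keyword" sub-field created by dynamic mapping
    for sub_name, sub in spec.get("fields", {}).items():
        if sub.get("type") == "keyword":
            print(f"{name}.{sub_name}", "ignore_above =", sub.get("ignore_above", "not set"))
```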

Thanks again, but I will wait for an answer from an Elastic team member.

Hi @ShayWeizman

I am a bit confused as you are using some mixed terminology... let me try to clarify a bit.

First there is a source document that contains fields.

Those _source fields are then either "indexed", which makes them searchable, or not indexed, and thus not searchable.

Whether a field is indexed or not does not affect the _source unless you specifically drop the _source.

An indexed field is then searchable.

Fields that are indexed have field types, for example text (for full-text search) or keyword, which is for exact matches, aggregations, etc.

keywords are used in term searches, so many of us think of keyword and term as synonymous; that is why I am unclear what you are actually asking.

Also, when we think of full text, i.e. a sentence / paragraph etc., we think of every word as a token (I think you may be using "term" for this).
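
To make the terminology concrete, here is a minimal sketch (made-up index and field names) of one field mapped as text, which gets tokenized, next to one mapped as keyword, which is indexed as a single exact term:

```python
# Minimal sketch (placeholder index/field names): "message" is full text and gets
# tokenized, "status" is a keyword and is indexed as one exact term.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="logs-demo",
    body={
        "mappings": {
            "properties": {
                "message": {"type": "text"},    # analyzed: every word becomes a token
                "status": {"type": "keyword"},  # not analyzed: one term, exact match
            }
        }
    },
)

es.index(index="logs-demo", body={"message": "disk usage exceeded the threshold", "status": "WARN"})
```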

Perhaps you could help clarify exactly what you are trying to accomplish?

Are you asking what the longest text field is, or the longest keyword field, or the longest field in the _source?

text fields can easily be many MBs, but that may or may not be the most efficient approach.

keywords can be very long as well, but that is not efficient; typically you use ignore_above to limit the actual length.
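
As a rough sketch of that last point (placeholder names, adjust to your own data), ignore_above is just part of the keyword mapping:

```python
# Sketch only (placeholder names): a keyword field that indexes values up to
# 512 characters; longer values stay in _source but are not indexed for this field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="keywords-demo",
    body={
        "mappings": {
            "properties": {
                "tag": {"type": "keyword", "ignore_above": 512}
            }
        }
    },
)
```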

Or are you asking what the longest string IN a text field can be?

There is also a binary / blob type that I think goes up to 2GB.

Or are you asking about an ingest strategy, i.e. you have big docs and you are unclear how to ingest them the way you want? For example, you are trying to use HTTP and it is chunking up the data?

There is a built-in limit in the HTTP (chunk handling) layer that caps requests at 100MB. You can change it using http.max_content_length (for example, set it to a bigger value).
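
If you want to check what a node is actually running with, something like this sketch should do it; the setting itself is static, so it has to be changed in elasticsearch.yml and the node restarted, this only reads it back:

```python
# Sketch: read back what each node is configured with. http.max_content_length is a
# static node setting (elasticsearch.yml); it only appears here if it was set explicitly.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

for node_id, node in es.nodes.info()["nodes"].items():
    configured = node.get("settings", {}).get("http", {}).get("max_content_length")
    print(node["name"], "http.max_content_length =", configured or "100mb (default)")
```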

So back to: what are you really asking / trying to accomplish / what is the actual issue you are trying to solve?

This blog gives full details.


Hi Mark,

In this post you have said:
"This is typically ignored because the value is too large to be indexed as a single term."

So this is what I'm asking about.

Thanks,
Shay

Quoting from the blog post:

The other big issue with the keyword field is it can’t handle very long fields. The default string mapping ignores strings longer than 256 characters, silently dropping values from the list of indexed terms. The majority of Elasticsearch’s log file messages exceed this limit.

And even if you do raise the Elasticsearch limit, you cannot exceed the hard Lucene limit of 32k for a single token, and Elasticsearch certainly logs some messages that exceed this.
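
If it helps, here is a sketch of both halves of that (placeholder index and field names): raising ignore_above on a keyword field, which still has to stay under Lucene's 32766-byte single-term limit, and then finding the documents whose values were dropped, via the _ignored metadata field:

```python
# Sketch (placeholder names): raise ignore_above on a keyword field -- it still has to
# stay under Lucene's 32766-byte single-term limit -- and then find documents whose
# values were dropped at index time via the _ignored metadata field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="long-keywords-demo",
    body={
        "mappings": {
            "properties": {
                "message": {"type": "keyword", "ignore_above": 8191}
            }
        }
    },
)

# Any document with at least one ignored value carries the _ignored metadata field.
resp = es.search(index="long-keywords-demo", body={"query": {"exists": {"field": "_ignored"}}})
print(resp["hits"]["total"])
```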


Cool thanks!
