I store data in Elasticsearch, and some of it contains long fields of up to 2000 characters that I want to search by term. It is known that in Elasticsearch a keyword field only covers values up to 255 characters, so anything longer is effectively treated as text and is not searched exactly. Because of that, I raised ignore_above to 3000 so that the entire text could be searched exactly with a term query. But I noticed a big problem: the index for 200 million documents grew to 2884 gigabytes, whereas before I increased ignore_above it was only 77 gigabytes.
What is the solution to this problem?
Is there a way to search the entire text exactly without increasing ignore_above, so that the index stays the same size?
What kind of field is this? Is it some kind of concatenated ID field that you always search by exact match on the full term?
If that is the case, I do not think there is any built-in way to handle it more efficiently, as storing a lot of long, largely unique identifiers will take up a lot of space. A workaround I have seen used in similar scenarios is to create a hash of the field, of sufficient complexity to minimise hash collisions, and then index this in the document and use it for exact lookups.
The hashing needs to be handled outside Elasticsearch, so it will depend on how you are ingesting your data. Logstash has a fingerprint plugin, but as you need to generate the same hash when querying, it may be best to handle both sides of the hashing yourself.
Yes, create a hash of the field and add it to the document. That should be much shorter, so it should not cause the same problem. When you search, you then generate the same hash from the long field value and use it in your query against the field that holds the hash value.
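For illustration, a minimal sketch of this idea using hashlib and the official Python client (8.x style); the index and field names (my-index, long_id, long_id_hash) are assumptions for the example, not anything from this thread:

```python
import hashlib

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def sha256_hex(value: str) -> str:
    # Deterministic hash: the same input always produces the same lookup key.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Stand-in for one of the ~2000 character identifiers.
long_value = "very-long-identifier-" * 80

# At ingest time: keep the original value in the document and index its hash.
es.index(
    index="my-index",
    document={
        "long_id": long_value,                 # kept for retrieval, not for exact search
        "long_id_hash": sha256_hex(long_value),
    },
)

# At query time: hash the search input the same way and run a term query on the hash.
resp = es.search(
    index="my-index",
    query={"term": {"long_id_hash": sha256_hex(long_value)}},
)
print(resp["hits"]["total"])
```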
You mean I should split the long field into several fields in my code, and when querying combine them as if they were one field so the search is still exact?
Also, does this affect speed, or will it still be fast?
And will the index be just as large as when I increased ignore_above, because the data is spread over several fields, or will the size stay close to what it was before?
No, I am suggesting you store the field in the document, possibly without indexing it. You then create a separate field where you store the result of a hash function that you run on the long field. The Logstash fingerprint filter can calculate different types of cryptographic hashes, so you can look at that for inspiration. The field that contains the hash value should be reasonably short and should be indexed as keyword.
When you want to query based on one of the long identifiers, you calculate the same hash for it in your code and use that to query the field that holds the hash value.
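A sketch of what the mapping side of this could look like, again in Python and with the same hypothetical field names: the long value stays retrievable from _source but is not indexed, while the hash field is a plain keyword used for the exact lookups.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical mapping: the original value adds nothing to the inverted index,
# so raising ignore_above is no longer needed; exact matches go through the hash.
es.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "long_id": {"type": "keyword", "index": False, "doc_values": False},
            "long_id_hash": {"type": "keyword"},
        }
    },
)
```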
Now maybe you are involved in several different Elasticsearch projects, but the style of each thread is almost the same: you ask pretty vague questions, often things are "big" and you want "fast", and the clarifications betray a lack of any deep understanding. That's generally fine, this is a community forum, but ...
In the most recent of these I suggested you take a step back and start writing down actual requirements.
I am not at all convinced you really need to search for your "long fields of up to 2000 characters" by term; I'd take some convincing that it's a good architecture for whatever problem your solution is meant to solve. But nothing in your various problem descriptions suggests any sort of solution architect or designer has been involved, and this is directly leading to the number and content of the questions you have posted. I fear the "solution" you will end up with will not be fit for purpose because of this.
Here, you seem not to have completely appreciated what ignore_above does: the default is 256, so every keyword value longer than that is simply not indexed and cannot be matched in search contexts. But if you really want your very long strings to be matched on every character, i.e. they might differ in the 1st character or the 2999th and anywhere in between, then that has a cost. Using hashes as suggested by Christian (who has also answered a bunch of your questions) is a really good idea.
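To make that behaviour concrete, a small hypothetical sketch: a value longer than the ignore_above limit is kept in _source but never enters the index, so an exact term search on it simply finds nothing.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="demo-ignore-above",
    mappings={
        "properties": {
            # Values longer than 256 characters stay in _source
            # but are skipped entirely at indexing time.
            "code": {"type": "keyword", "ignore_above": 256}
        }
    },
)

long_value = "x" * 300  # longer than the ignore_above limit
es.index(index="demo-ignore-above", document={"code": long_value}, refresh=True)

# The document exists, but the term query cannot match the unindexed value.
resp = es.search(index="demo-ignore-above", query={"term": {"code": long_value}})
print(resp["hits"]["total"]["value"])  # 0
```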
There's a little discussion about ignore_above here:
That was something else.
As for the current topic, here is an example. I have the following email:
http://iopohpoeaiuh9p8weikjpoijthmahiujouieopnujrhou8iyrbgniwegpuwer0rguh7nr0ighnr0H7U9HRIURWEHNAH0U7I8ERPUHopfsmkj[y90mkgyhsinjmytjrtmytuhkyukeuytktyk/SDERnkojortieojihEA34KNGU9I4UPHNUJPI4UEu5y37uyhoiguwehoiuhI#HOI#@%HOIUBOU@VBO#%UYGihgoiuxhfijonitreshnj8O95Re8nsetuiinplosrcvyk,ero79c8ujhkolrtg084xg7hwxiog; 4thybewubl l9wygruorsLeghrt8oerthgu.,bnorsuigb8enrv8,yergb,irvonjbvoueybriohnboubyrurtsoidvutoyghnbvhuo vrngdsoiguy 8segion sio es 8rgsndnifjoui .bnrey8gt,ch9gwyhreyugc.uwy8gr,mgyuig,bisrugiucrbmfuigcwucmer,ucy,hbgihegb,iueygwhriumgwywerH
In my search I look up this text in full using a term query.
Well, when I search for this text exactly, the term search does not find it because its length is greater than 255. That led me to raise ignore_above to 3000, because some of my fields reach that many characters, and as a result the size of the index for 200,000,000 documents went from 77 GB to 2 TB.
If you really do have ridiculously long, effectively random strings and want exact matches on similarly long inputs, then, aside from correctly implementing the hash suggestion, I remain convinced you are simply looking at your problem the "wrong" way. Wrong here means not taking a step back and reconsidering the real-world problem your solution tries to address. Respectfully, a failure in understanding at that higher level often manifests itself in puzzling technical questions.
As I described in my post, you need to calculate the hash outside of Elasticsearch. You cannot do this through an analyzer (at least not unless you have developed a completely custom analyzer plugin that calculates a cryptographic hash over the full field).
You are 100% right that you could not "convey the idea clearly". The "I did the hash with the Analyzer" shows how far away from a correct understanding you are.
Christian is very patiently trying to help fill gaps in your knowledge. But I’m again asking you to seriously consider if you are really understanding the core problem here.