I store data in Elasticsearch, and some of it contains long fields of up to 2000 characters that I want to search by term. It is known that in Elasticsearch a keyword field only covers values up to 255 characters, so anything longer is effectively treated as text and is not searched exactly. Because of that, I raised ignore_above to 3000 so that the entire text could be searched exactly with a term query. But I noticed a big problem: the index for 200 million documents grew to 2884 gigabytes, whereas before I increased ignore_above it was only 77 gigabytes.
What is the solution to this problem?
Is there a way to search the entire text exactly without increasing ignore_above, so that the index stays the same size?
What kind of field is this? Is it some kind of concatenated ID field that you always search by exact match on the full term?
If that is the case, I do not think there is any built-in way to handle it more efficiently, as storing a lot of long, largely unique identifiers will take up a lot of space. A workaround I have seen used in similar scenarios is to create a hash of the field, of sufficient complexity to minimise hash collisions, and then index this in the document and use it for exact lookups.
The hashing needs to be handled outside Elasticsearch, so it will depend on how you are ingesting your data. Logstash has a fingerprint plugin, but as you need to generate the same hash when querying, it may be best to handle both sides of the hashing yourself.
Yes, create a hash of the field and add it to the document. That should be much shorter, so it should not cause the same problem. When you search, you then generate the same hash from the long field value and use it in your query against the field that holds the hash value.
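For illustration, a minimal sketch of this idea using hashlib and the official Python client (8.x style); the index and field names (my-index, long_id, long_id_hash) are assumptions for the example, not anything from this thread:

```python
import hashlib

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def sha256_hex(value: str) -> str:
    # Deterministic hash: the same input always produces the same lookup key.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Stand-in for one of the ~2000 character identifiers.
long_value = "very-long-identifier-" * 80

# At ingest time: keep the original value in the document and index its hash.
es.index(
    index="my-index",
    document={
        "long_id": long_value,                 # kept for retrieval, not for exact search
        "long_id_hash": sha256_hex(long_value),
    },
)

# At query time: hash the search input the same way and run a term query on the hash.
resp = es.search(
    index="my-index",
    query={"term": {"long_id_hash": sha256_hex(long_value)}},
)
print(resp["hits"]["total"])
```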
You mean I should split the long field into several fields in my code, and when querying combine them as if they were one field so the search is still exact?
Also, does this affect speed, or will it still be fast?
And will the index be just as large as when I increased ignore_above, because the data is spread over several fields, or will the size stay close to what it was before?
No, I am suggesting you store the field in the document, possibly without indexing it. You then create a separate field where you store the result of a hash function that you run on the long field. The Logstash fingerprint filter can calculate different types of cryptographic hashes, so you can look at that for inspiration. The field that contains the hash value should be reasonably short and should be indexed as keyword.
When you want to query based on one of the long identifiers, you calculate the same hash for it in your code and use that to query the field that holds the hash value.
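A sketch of what the mapping side of this could look like, again in Python and with the same hypothetical field names: the long value stays retrievable from _source but is not indexed, while the hash field is a plain keyword used for the exact lookups.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical mapping: the original value adds nothing to the inverted index,
# so raising ignore_above is no longer needed; exact matches go through the hash.
es.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "long_id": {"type": "keyword", "index": False, "doc_values": False},
            "long_id_hash": {"type": "keyword"},
        }
    },
)
```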
Now maybe you are involved in several different Elasticsearch projects, but the style of each thread is almost the same: you ask pretty vague questions, often things are "big" and you want "fast", and the clarifications betray a lack of any deep understanding. That's generally fine, this is a community forum, but ...
In the most recent of these I suggested you take a step back and start writing down actual requirements.
I am not at all convinced you really need to search for your "long fields of up to 2000 characters" by term; I'd take some convincing that it's a good architecture for whatever problem your solution is meant to solve. But nothing in your various problem descriptions suggests any sort of solution architect or designer has been involved, and this is directly leading to the number and content of the questions you have posted. I fear the "solution" you will end up with will not be fit for purpose because of this.
Here, you seem not to have completely appreciated what ignore_above does: the default is 256, so every keyword value longer than that is simply not indexed and cannot be matched in search contexts. But if you really want your very long strings to be matched on every character, i.e. they might differ in the 1st character or the 2999th and anywhere in between, then that has a cost. Using hashes as suggested by Christian (who has also answered a bunch of your questions) is a really good idea.
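To make that behaviour concrete, a small hypothetical sketch: a value longer than the ignore_above limit is kept in _source but never enters the index, so an exact term search on it simply finds nothing.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="demo-ignore-above",
    mappings={
        "properties": {
            # Values longer than 256 characters stay in _source
            # but are skipped entirely at indexing time.
            "code": {"type": "keyword", "ignore_above": 256}
        }
    },
)

long_value = "x" * 300  # longer than the ignore_above limit
es.index(index="demo-ignore-above", document={"code": long_value}, refresh=True)

# The document exists, but the term query cannot match the unindexed value.
resp = es.search(index="demo-ignore-above", query={"term": {"code": long_value}})
print(resp["hits"]["total"]["value"])  # 0
```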
There's a little discussion about ignore_above here:
That was something else.
As for the current topic, here is an example. I have the following email:
http://iopohpoeaiuh9p8weikjpoijthmahiujouieopnujrhou8iyrbgniwegpuwer0rguh7nr0ighnr0H7U9HRIURWEHNAH0U7I8ERPUHopfsmkj[y90mkgyhsinjmytjrtmytuhkyukeuytktyk/SDERnkojortieojihEA34KNGU9I4UPHNUJPI4UEu5y37uyhoiguwehoiuhI#HOI#@%HOIUBOU@VBO#%UYGihgoiuxhfijonitreshnj8O95Re8nsetuiinplosrcvyk,ero79c8ujhkolrtg084xg7hwxiog; 4thybewubl l9wygruorsLeghrt8oerthgu.,bnorsuigb8enrv8,yergb,irvonjbvoueybriohnboubyrurtsoidvutoyghnbvhuo vrngdsoiguy 8segion sio es 8rgsndnifjoui .bnrey8gt,ch9gwyhreyugc.uwy8gr,mgyuig,bisrugiucrbmfuigcwucmer,ucy,hbgihegb,iueygwhriumgwywerH
In my search I look up this text in full using a term query.
Well, when I search for this text exactly, the term search does not find it because its length is greater than 255. That led me to raise ignore_above to 3000, because some of my fields reach that many characters, and as a result the size of the index for 200,000,000 documents went from 77 GB to 2 TB.
If you really do have ridiculously long, effectively random strings and want exact matches on similarly long inputs, then, aside from correctly implementing the hash suggestion, I remain convinced you are simply looking at your problem the "wrong" way. Wrong here means not taking a step back and reconsidering the real-world problem your solution tries to address. Respectfully, a failure in understanding at that higher level often manifests itself in puzzling technical questions.
As I described in my post, you need to calculate the hash outside of Elasticsearch. You cannot do this through an analyzer (at least not unless you have developed a completely custom analyzer plugin that calculates a cryptographic hash over the full field).
You are 100% right that you could not "convey the idea clearly". The "I did the hash with the Analyzer" shows how far away from a correct understanding you are.
Christian is very patiently trying to help fill gaps in your knowledge. But I’m again asking you to seriously consider if you are really understanding the core problem here.