I store data in Elasticsearch, and some of the fields are long, up to 2000 characters. I want to search these fields by exact match using term queries. As far as I know, a keyword field only handles values up to a length of 255; anything longer is treated differently and cannot be matched accurately. So I raised ignore_above to 3000 so that the entire text can be matched exactly with term. But I noticed a big problem: 200 million documents now take up 2884 gigabytes, whereas before I increased ignore_above they took only 77 gigabytes.
What is the solution to this problem?
How can I search the entire text accurately without increasing ignore_above, so that the index size stays the same?
What kind of field is this? Is it some kind of concatenated ID field that you always search by exact match on the full term?
If that is what it is, I do not think there is any built-in way to handle it more efficiently, as storing a lot of long, largely unique identifiers will take up a lot of space. A workaround I have seen used in similar scenarios is to create a hash of the field, of sufficient complexity to minimise hash collisions, index this hash in the document, and use it for exact lookups.
The hashing needs to be handled outside Elasticsearch, so it will depend on how you are ingesting your data. Logstash has a fingerprint plugin, but as you need to generate the same hash when querying, it may be best to handle both aspects of hashing yourself.
Yes, create a hash of the field and add it to the document. The hash is much shorter than the original value, so it should not cause the same problem. When you search, you then generate the same hash from the long value you are looking for and use it in your query against the field that holds the hash value.
Do you mean I should split the long field in my code into several fields, and then combine them at query time as if they were one field, so the search is exact?
And does this affect the speed, or will it still be fast?
Will it take up the same space as when I increased ignore_above, because the data is now spread over several fields, or will the size stay at roughly what it was before?
No, I am suggesting you store the long field in the document as it is, possibly without indexing it. You then create a separate field where you store the result of a hash function that you run on the long field. The Logstash fingerprint plugin can calculate different types of cryptographic hashes, so you can look at that for inspiration. The field that contains the hash value should be reasonably short and be indexed as keyword.
When you want to query based on one of the long identifiers in your code you calculate the same hash based on this and use this to query the field that holds the hash value.
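To make this concrete, here is a rough sketch in Python of what I mean. It is only an illustration under some assumptions: the field names long_id and long_id_hash are made up, SHA-256 is just one reasonable choice of hash, and the dictionaries are the request bodies you would send with whatever client you use for indexing and searching.

```python
import hashlib


def fingerprint(value: str) -> str:
    """Return the SHA-256 hex digest (64 characters) of the UTF-8 encoded value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical mapping: the long value is kept in the document but not indexed,
# while the short hash is indexed as a keyword for exact lookups.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "long_id": {"type": "keyword", "index": False, "doc_values": False},
            "long_id_hash": {"type": "keyword"},
        }
    }
}


def build_document(long_value: str) -> dict:
    """Build the document to index: the original value plus its hash."""
    return {"long_id": long_value, "long_id_hash": fingerprint(long_value)}


def build_term_query(long_value: str) -> dict:
    """Build a term query that looks up the hash instead of the long value."""
    return {"query": {"term": {"long_id_hash": fingerprint(long_value)}}}


if __name__ == "__main__":
    value = "A" * 2000  # stand-in for one of the long identifiers
    print(build_document(value)["long_id_hash"])
    print(build_term_query(value))
```

The important part is that the same function is used at index time and at query time. If you are worried about collisions, you can also compare the long_id of each hit with the value you searched for in your own code.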
Now maybe you are involved in several different Elasticsearch projects, but the style of each thread is almost the same: you ask pretty vague questions, often things are "big" and you want "fast", and the clarifications betray a lack of any deep understanding. That's generally fine, this is a community forum, but ...
In the most recent of these I suggested you take a step back and start writing down actual requirements.
I am not at all convinced you really need to search for your "long fields of up to 2000 characters" by term, and I'd take some convincing that that's a good architecture for whatever problem your solution might solve. But there's nothing about your various problem descriptions that suggests any sort of solution architect or designer has been involved, and this is directly leading to the number and content of the questions you have posted. I fear the "solution" you will end up with will not be fit for purpose due to this.
Here, you seem not to have completely appreciated what ignore_above does. The default dynamic mapping sets it to 256, so any keyword value longer than that is not indexed and will not be found in search contexts. But if you really want your very long strings to be matched on every character, i.e. they might differ in the 1st char or the 2999th char, or any one in between, then that has a cost. Using hashes as suggested by Christian (who has also answered a bunch of your Qs) is a really good idea.
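A quick way to convince yourself the hash approach still distinguishes strings that differ only in the very last character is a sketch like the one below. The two test strings are made up, and SHA-256 is an assumption on my part, as nobody in this thread has fixed a specific hash function.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two made-up 3000-character strings that differ only in the final character.
base = "x" * 2999
first = base + "A"
second = base + "B"

hash_first = sha256_hex(first)
hash_second = sha256_hex(second)

# The digests differ even though only the last character changed,
# and each digest is 64 characters long regardless of the input length.
hashes_differ = hash_first != hash_second
digest_lengths = (len(hash_first), len(hash_second))

if __name__ == "__main__":
    print(hashes_differ, digest_lengths)
```

So each value costs a fixed 64 characters in the indexed field instead of up to 3000, while the lookup is still sensitive to every character of the original.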
There's a little discussion about ignore_above here: