At the moment, I have a dataset containing over 170 million records that I want to index in Elasticsearch. The problem is that the mapping I need generates too much data storage.
Some other fields are there as well, but they are not so dynamic. The mapping has to allow searches where the category (in_ or out_) starts with a prefix, contains some text, ends with a postfix, or matches exactly. For the filename (in_ and out_), we likewise need to be able to search for prefixes, postfixes, text within the filename, and exact matches.
So for example, let's take a filename "my_file_20170802-1454.pdf". I need to be able to find this when I enter one of the following queries:
my_file
20170802
pdf
2017
my_file_20170802-1454.pdf
Basically the same is true for the category, only with a different data structure, usually containing underscores, hyphens or spaces.
I have configured a mapping which allows this behavior, but it'll take up to 1,5TB of storage. Is there a simpler way to handle this?
IMO the way to solve this is by using ngrams (with a minimum size of 3, since the shortest query in your example, pdf, is 3 characters long). But maybe using a pattern tokenizer would help to produce the exact terms you mentioned?
FWIW, ngrams or pattern tokenizers will use more disk space for sure.
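To make the two suggestions concrete, here is a rough pure-Python sketch (no Elasticsearch involved) of which example queries each approach could match for the filename from the question. The helper names `pattern_terms` and `ngrams` are made up for illustration, and the split pattern is an assumption about what a pattern tokenizer might be configured with. The trigram check is only a necessary condition, so real results could include false positives.

```python
import re

def pattern_terms(text):
    """Split on runs of non-alphanumeric characters (underscore, hyphen, dot, space)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def ngrams(text, n):
    """Return the set of all character n-grams of exactly length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

filename = "my_file_20170802-1454.pdf"
queries = ["my_file", "20170802", "pdf", "2017", filename]

terms = pattern_terms(filename)
trigrams = ngrams(filename.lower(), 3)

results = {}
for q in queries:
    # Pattern approach: every term of the query must be an indexed term.
    by_terms = all(t in terms for t in pattern_terms(q))
    # Trigram approach: every trigram of the query must be indexed.
    by_trigrams = ngrams(q.lower(), 3) <= trigrams
    results[q] = (by_terms, by_trigrams)
    print(f"{q!r}: pattern terms={by_terms}, trigrams={by_trigrams}")
```

In this sketch the pattern-split terms cover every example query except "2017", which is only a prefix of the term "20170802" and would need a prefix query on top. The trigram set covers all five.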
In the mapping I currently have, I'm using ngrams, 3-25, and another field for exact matching. This gives us a calculated storage size of over 1,5TB. IMO, 500GB would be acceptable, but 1,5TB is a bit much.
Well, I took a subset of the data and loaded it into a test Elasticsearch instance. After that was done, I let the index optimize itself (from experience, I've noticed that just pushing the data in and then checking the disk size is not reliable). With that data I then extrapolated to get a "guess" about what the size would be with the full dataset.
That test already uses the n-gram filter, 2-20 for category and 3-25 for the filename, each applied twice (in_ and out_). I still have to run the test without the _all field.
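To get a feel for why a 3-25 ngram range is so expensive, the number of ngrams produced per value can be counted directly. This is a small sketch with a hypothetical helper `ngram_count`. It counts tokens only and says nothing about how well they compress on disk. The 3-5 range is included purely for comparison, not as a recommendation.

```python
def ngram_count(length, min_gram, max_gram):
    """Number of character n-grams a string of the given length yields."""
    upper = min(max_gram, length)
    return sum(length - n + 1 for n in range(min_gram, upper + 1))

filename = "my_file_20170802-1454.pdf"
full_range = ngram_count(len(filename), 3, 25)
narrow_range = ngram_count(len(filename), 3, 5)

print(len(filename))   # 25 characters
print(full_range)      # 276 ngrams for the 3-25 range
print(narrow_range)    # 66 ngrams for a 3-5 range
```

So a single 25-character filename turns into 276 indexed tokens with the 3-25 range, and that happens for both the in_ and out_ fields.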
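The extrapolation described above boils down to linear scaling, roughly as sketched below. The sample figures are placeholders, not measurements from this thread, and the function name is made up. It also assumes the index grows linearly with the document count, which is not guaranteed since term dictionaries tend to grow more slowly than the data.

```python
def extrapolate_index_size(sample_bytes, sample_docs, total_docs):
    """Scale the measured on-disk size of a sample index to the full dataset."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return sample_bytes / sample_docs * total_docs

# Placeholder sample values (hypothetical, replace with your own measurement
# taken after the test index has finished merging).
sample_docs = 1_000_000
sample_bytes = 9 * 1024**3
total_docs = 170_000_000  # "over 170 million records" from the question

estimate = extrapolate_index_size(sample_bytes, sample_docs, total_docs)
print(f"estimated index size: {estimate / 1024**4:.2f} TiB")
```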
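For reference, a test index along these lines could be described roughly as below. This is only a sketch and not the mapping from the original post. It assumes an Elasticsearch 5.x-style index body, where _all can still be disabled per mapping type, and the type name `record`, the field name `in_filename` and the analyzer names are all hypothetical.

```python
import json

# Hypothetical index body: ngram analysis on one filename field,
# a keyword sub-field for exact matches, and the _all field disabled.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "filename_ngram": {"type": "ngram", "min_gram": 3, "max_gram": 25},
            },
            "analyzer": {
                "filename_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "filename_ngram"],
                },
                # Used at search time so the query itself is not ngrammed.
                "lowercase_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "record": {
            "_all": {"enabled": False},
            "properties": {
                "in_filename": {
                    "type": "text",
                    "analyzer": "filename_ngram_analyzer",
                    "search_analyzer": "lowercase_keyword",
                    "fields": {"raw": {"type": "keyword"}},
                },
            },
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Running the same size test with and without the `_all` entry would show how much of the footprint comes from that field alone.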