At the moment I have a dataset containing over 170 million records that I want to index in Elasticsearch. The problem is that the mapping I need generates too much data storage.
Our data looks like this:
in_category (max 256 characters)
in_filename (max 256 characters)
out_category (max 256 characters)
out_filename (max 256 characters)
Some other fields are there as well, but they are not as dynamic. The mapping has to allow searches where the category (in_ or out_) starts with a prefix, contains some text, ends with a suffix, or matches exactly. Likewise, for the filename (in_ and out_) we need to be able to search by prefix, suffix, text within the filename, and exact match.
So for example, let's take the filename "my_file_20170802-1454.pdf". I need to be able to find it when I search for a prefix (e.g. my_file), a substring (e.g. 20170802), a suffix (e.g. .pdf), or the exact filename.
Basically the same is true for the category, only with a different data structure, usually containing underscores, hyphens or spaces.
I have configured a mapping that allows this behavior, but it takes up to 1.5 TB of storage. Is there a simpler, more storage-efficient way to handle this?
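To give an idea of the kind of mapping that supports these queries, here is a simplified sketch (not my exact configuration; the field, analyzer, and tokenizer names are illustrative): a keyword field for exact matches plus an ngram-analyzed text sub-field for substring/prefix/suffix matching. The ngram sub-field is what blows up the index size, since every 3-character window of every value gets indexed.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "in_filename": {
        "type": "keyword",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "trigram_analyzer"
          }
        }
      }
    }
  }
}
```

With a mapping like this, an exact match queries `in_filename` directly, while prefix, suffix, and contains queries go against `in_filename.ngram` — and the same pattern is repeated for the other three fields, which multiplies the storage cost.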