How to index a large dataset



At the moment, I have a dataset containing over 170 million records that I want to index in Elasticsearch. The problem is that the mapping I need generates too much data on disk.

Our data looks like this:

in_category (max 256 characters)
in_filename (max 256 characters)
out_category (max 256 characters)
out_filename (max 256 characters)

There are some other fields as well, but they are not as dynamic. The mapping has to support searches where the category (in_ or out_) starts with a prefix, contains some text, ends with a postfix, or matches exactly. For the filename (in_ and out_) we likewise need to search by prefix, by postfix, by text within the filename, and by exact match.

So for example, let's take a filename "my_file_20170802-1454.pdf". I need to be able to find this when I enter one of the following queries:

  • my_file
  • 20170802
  • pdf
  • 2017
  • my_file_20170802-1454.pdf

The same basically holds for the category, only the data is structured differently: it usually contains underscores, hyphens or spaces.

I have configured a mapping which allows this behavior, but it will take up to 1.5 TB of storage. Is there a simpler way to handle this?


(David Pilato) #2

IMO the way to solve this is to use ngrams (with a size of 3, since the shortest term in your example, pdf, has 3 characters). But maybe a pattern tokenizer would help produce the exact terms you mentioned?

FWIW, ngrams or pattern tokenizers will use more disk space for sure.
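
To make the two options concrete, here is a rough pure-Python approximation of what each would emit for the example filename. It is only a sketch: it does not call Elasticsearch, the split pattern `[^A-Za-z0-9]+` is an assumption, and the helper names are made up for illustration.

```python
import re

def pattern_tokens(text):
    """Split on runs of non-alphanumeric characters (assumed pattern)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def ngrams(text, min_gram, max_gram):
    """Return the set of all character n-grams between min_gram and max_gram."""
    text = text.lower()
    grams = set()
    for n in range(min_gram, max_gram + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

filename = "my_file_20170802-1454.pdf"
tokens = pattern_tokens(filename)   # word-like pieces of the filename
trigrams = ngrams(filename, 3, 3)   # every 3-character window

print(tokens)
print(len(trigrams))
```

Note that with this split pattern "2017" is not a token of its own, it is only a prefix of "20170802", so a search for it would still need a prefix query or some form of ngram on top of the pattern tokens.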


Thanks for the reply!

In the mapping I currently have, I'm using ngrams of 3-25 characters, plus another field for exact matching. This gives us a calculated storage size of over 1.5 TB. IMO, 500 GB would be acceptable, but 1.5 TB is a bit much.
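
As a rough illustration of why a 3-25 range is expensive, the sketch below counts how many grams a single value produces. It counts positions only, not unique terms or bytes on disk, and the function name is hypothetical.

```python
def ngram_count(length, min_gram, max_gram):
    """Number of n-grams emitted for a string of the given length."""
    total = 0
    for n in range(min_gram, min(max_gram, length) + 1):
        total += length - n + 1
    return total

# The example filename "my_file_20170802-1454.pdf" has 25 characters.
example_length = len("my_file_20170802-1454.pdf")

wide = ngram_count(example_length, 3, 25)    # the 3-25 range from the mapping
narrow = ngram_count(example_length, 3, 3)   # trigrams only

print(wide, narrow)
```

For this 25-character filename the 3-25 range emits twelve times as many grams as trigrams alone, which gives a feel for where the space goes.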

(David Pilato) #4

Well. I'm not sure what you can do.

I think it's a tradeoff between index space and speed at query time.

If you don't want to pay the space price, you'll have to pay the speed price by running super slow wildcard queries.
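
For reference, a "contains" search done with a wildcard query could look roughly like the body below. This is a sketch that assumes the field is indexed as a single unanalyzed term (for example a keyword field); the helper name is made up, and the field name is taken from the list in the first post.

```python
import json

def contains_query(field, text):
    """Build a wildcard query body matching values that contain the text."""
    return {"query": {"wildcard": {field: {"value": f"*{text}*"}}}}

body = contains_query("in_filename", "20170802")
print(json.dumps(body, indent=2))
```

The leading `*` is what makes this slow: Elasticsearch cannot jump to a position in the term dictionary and has to check many terms.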

BTW did you disable the _all field?


No, we did not. I'll run another test with the field disabled.

(David Pilato) #6

I'm not sure what you tested, but can you compare the index size with just an ngram-based field added against the size with no extra field added, i.e. against the default space usage?


Well, I took a subset of the data and loaded it into a test Elasticsearch instance. After that was done, I let the index optimize itself (from experience, I've noticed that just pushing the data in and then checking the disk size is not reliable). From that result I extrapolated to get a "guess" of what the size would be with the full dataset.

That test already uses the n-gram filter, 2-20 for category and 3-25 for the filename, each applied twice (in_ and out_). I still have to run the test without the _all field.
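
The extrapolation described above amounts to simple linear scaling, roughly as in this sketch. The sample figures are made-up placeholders, not the measured values, and assuming that index size grows linearly with document count is itself only an approximation.

```python
def estimate_full_size(sample_bytes, sample_docs, total_docs):
    """Scale the measured sample index size linearly to the full dataset."""
    return sample_bytes / sample_docs * total_docs

# Hypothetical sample: 1 million documents taking 9 GB after merging.
sample_docs = 1_000_000
sample_bytes = 9 * 10**9
total_docs = 170_000_000   # record count from the first post

estimate = estimate_full_size(sample_bytes, sample_docs, total_docs)
print(f"{estimate / 10**12:.2f} TB")
```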

(David Pilato) #8

You can call the force merge API (_forcemerge) if needed, so that you compare the indices with the same number of segments.
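
A minimal sketch of building that call from Python is shown below. The host and the index name `files_test` are assumptions for illustration, and the request is only constructed here, not sent.

```python
from urllib.request import Request

def forcemerge_request(base_url, index, max_num_segments=1):
    """Build a POST request for the _forcemerge endpoint of one index."""
    url = f"{base_url}/{index}/_forcemerge?max_num_segments={max_num_segments}"
    return Request(url, method="POST")

# Hypothetical local node and test index name.
req = forcemerge_request("http://localhost:9200", "files_test")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it.
```

Merging both test indices down to the same segment count before measuring makes their sizes on disk comparable.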
