How to index a large dataset

Erates · August 2, 2017, 1:00pm

Hello,

At the moment, I have a dataset containing over 170 million records that I want to index in Elasticsearch. The problem is that the mapping I need generates too much storage.

Our data looks like this:

in_category (max 256 characters)
in_filename (max 256 characters)
in_timestamp
out_category (max 256 characters)
out_filename (max 256 characters)
out_timestamp

Some other fields are there as well, but they are not so dynamic. The mapping has to allow searches where the category (in_ or out_) starts with a prefix, contains some text, ends with a suffix, or matches exactly. For the filename (in_ and out_) we likewise need to be able to search by prefix, by suffix, by text within the filename, and by exact match.

So for example, let's take a filename "my_file_20170802-1454.pdf". I need to be able to find this when I enter one of the following queries:

my_file
20170802
pdf
2017
my_file_20170802-1454.pdf

Basically the same is true for the category, only with a different data structure, usually containing underscores, hyphens or spaces.

I have configured a mapping which allows this behavior, but it'll take up to 1,5TB of storage. Is there a simpler way to handle this?

Erates

dadoonet · August 2, 2017, 1:23pm

IMO the way to solve this is either by using ngrams (with a minimum size of 3, since the shortest term in your example, pdf, has 3 characters) or, maybe, by using a pattern tokenizer to produce the exact terms you mentioned.

FWIW, ngrams or pattern tokenizers will use more disk space for sure.
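To make the difference between the two suggestions concrete, here is a minimal pure-Python sketch that imitates both approaches on the example filename. It does not call Elasticsearch; `ngrams` and `pattern_tokens` are hypothetical helpers, the split pattern `[\W_]+` is an assumption (underscores count as word characters, so a plain `\W+` split would keep "my_file_20170802" in one piece), and each query is assumed to be matched as a single term against the ngram field.

```python
import re

def ngrams(text, min_gram, max_gram):
    """Return the set of character n-grams with length in [min_gram, max_gram]."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams

def pattern_tokens(text, pattern=r"[\W_]+"):
    """Split on runs of non-word characters and underscores, dropping empty tokens."""
    return [token for token in re.split(pattern, text) if token]

filename = "my_file_20170802-1454.pdf"
queries = ["my_file", "20170802", "pdf", "2017", filename]

grams = ngrams(filename, 3, 25)
tokens = set(pattern_tokens(filename))

# ngram field: the query must be one of the indexed n-grams.
ngram_hits = {q: q in grams for q in queries}

# pattern field: every token of the query must be an indexed token.
pattern_hits = {q: all(t in tokens for t in pattern_tokens(q)) for q in queries}

for q in queries:
    print(f"{q!r}: ngram={ngram_hits[q]} pattern={pattern_hits[q]}")
```

In this sketch the ngram field matches all five example queries, while the pattern-based field misses "2017", because "2017" is only a prefix of the token "20170802" and not a token of its own.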

Erates · August 2, 2017, 1:35pm

Thanks for the reply!

In the mapping I currently have, I'm using ngrams, 3-25, and another field for exact matching. This gives us a calculated storage size of over 1,5TB. IMO, 500GB would be acceptable, but 1,5TB is a bit much.
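A rough way to see why a 3-25 ngram range is expensive is to count how many tokens a single value produces. The sketch below is plain arithmetic, not a measurement: `ngram_count` is a hypothetical helper, and it assumes the whole field value is treated as one token before the n-grams are generated.

```python
def ngram_count(length, min_gram, max_gram):
    """Number of n-gram tokens (duplicates included) emitted for one string."""
    return sum(max(length - size + 1, 0) for size in range(min_gram, max_gram + 1))

# The example filename "my_file_20170802-1454.pdf" has 25 characters.
example_grams = ngram_count(25, 3, 25)

# Worst case from the field description: a 256-character filename.
worst_case_grams = ngram_count(256, 3, 25)

print(example_grams)     # 276
print(worst_case_grams)  # 5589
```

So even the short example filename expands into a few hundred tokens per field, and that happens for each of the ngram-analyzed fields in each of the 170 million records.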

dadoonet · August 2, 2017, 2:34pm

Well. I'm not sure what you can do.

I think it's a tradeoff between index space and speed at query time.

If you don't want to pay the space price, you'll have to pay the speed price by running super slow wildcard queries.

BTW did you disable the _all field?
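For reference, the two options mentioned here could look roughly like the request bodies below, written as Python dicts. This is only a sketch under assumptions: the type name "doc" and the `keyword` field types are made up for illustration, and the `_all` setting applies to the 5.x line that was current at the time of this thread (later major versions dropped the `_all` field).

```python
import json

# Hypothetical Elasticsearch 5.x index body with the _all field disabled.
mapping_body = {
    "mappings": {
        "doc": {  # assumed type name
            "_all": {"enabled": False},
            "properties": {
                "in_filename": {"type": "keyword"},
                "out_filename": {"type": "keyword"},
            },
        }
    }
}

# Wildcard query: no ngram storage needed, but a leading "*" is slow
# because many terms have to be checked at query time.
wildcard_query = {
    "query": {"wildcard": {"in_filename": "*20170802*"}}
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(wildcard_query, indent=2))
```

The first body would be sent when creating the index, the second to the search endpoint of that index.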

Erates · August 2, 2017, 3:03pm

No we did not. I'll run another test with the field disabled.

dadoonet · August 2, 2017, 3:15pm

Not sure what you tested, but can you compare the index size with just an ngram-based field added against the size with no such field added? That is, in terms of the default space used.

Erates · August 2, 2017, 3:24pm

Well, I took a subset of the data and loaded it into a test Elasticsearch instance. After that was done, I let the index optimize itself (from experience, I've noticed that just pushing the data in and then checking the disk size is not reliable). From that data I extrapolated to get a "guess" of what the size would be with the full dataset.

That test already uses the n-gram filter, 2-20 for the category and 3-25 for the filename, each applied twice (in_ and out_). I still have to run the test without the _all field.
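The extrapolation described above amounts to a simple linear scaling. A minimal sketch follows; the sample figures (1 million documents, 9 GB) are purely hypothetical and are not the numbers from the actual test, and `extrapolate_size_gb` is an illustrative helper. It assumes index size grows linearly with document count, which is only an approximation.

```python
def extrapolate_size_gb(sample_docs, sample_size_gb, total_docs):
    """Linearly scale the on-disk size of a sample index to the full dataset."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return sample_size_gb / sample_docs * total_docs

# Hypothetical sample: 1 million documents taking 9 GB after merging.
estimate_gb = extrapolate_size_gb(1_000_000, 9.0, 170_000_000)
print(f"Estimated full index size: {estimate_gb:.0f} GB")
```

Running the same calculation for each mapping variant (with and without ngrams, with and without _all) on the same sample makes the variants directly comparable.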

dadoonet · August 2, 2017, 4:23pm

You can call the forcemerge API if needed, so that you can compare indices with the same number of segments.
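As a sketch of how such a comparison could be set up, the helpers below only build the URLs for the force merge and store stats endpoints; they do not send any request. The host `http://localhost:9200` and the index name `files_test` are assumptions for illustration.

```python
from urllib.parse import quote, urlencode

def forcemerge_url(base_url, index, max_num_segments=1):
    """URL for merging an index down to a fixed number of segments (use with POST)."""
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"

def store_stats_url(base_url, index):
    """URL for reading the on-disk store size of an index (use with GET)."""
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_stats/store"

merge_url = forcemerge_url("http://localhost:9200", "files_test")
stats_url = store_stats_url("http://localhost:9200", "files_test")
print(merge_url)
print(stats_url)
```

Merging each test index to the same segment count before reading its store size should remove most of the noise from partially merged segments.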

