Doc Values Storage

alexkru · August 17, 2015, 6:25am

Hi all,
We are using ES on AWS EC2 instances and we want to reduce our JVM memory usage. The most common solution is to use doc values, thereby reducing the in-memory field data. When we tried it on a test system, it resulted in high I/O consumption and very large storage.
We want to put the doc values on the ephemeral storage, but when I looked into it, the doc values files are stored with all the other data.
Is there a way to store only the doc values on another filesystem?

Thanks in advance,
Alex.
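
For readers following along, enabling doc values in a 1.x-era mapping might look roughly like the sketch below. It only builds and prints the index-creation body. The type name `log` and the field names `host` and `status` are hypothetical, and the `string` / `not_analyzed` syntax is assumed to match the Elasticsearch version in use (on string fields, doc values require `not_analyzed`).

```python
import json


def doc_values_mapping(type_name, fields):
    """Build an index-creation body that enables doc values on the given
    string fields. Assumes ES 1.x-style mapping syntax."""
    properties = {
        name: {
            "type": "string",
            "index": "not_analyzed",  # doc values are not available on analyzed strings
            "doc_values": True,       # keep sort/aggregation data on disk instead of heap
        }
        for name in fields
    }
    return {"mappings": {type_name: {"properties": properties}}}


# Hypothetical type and field names, for illustration only.
body = doc_values_mapping("log", ["host", "status"])
print(json.dumps(body, indent=2, sort_keys=True))
```

The printed JSON could then be sent as the body of an index-creation request. Existing indices would need to be reindexed for the change to take effect on old data.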

warkolm · August 17, 2015, 6:35am

It's not, they are stored with the shards.

However, doc values should not be generating large amounts of data. Do you have more information about what you saw?

alexkru · August 17, 2015, 7:35am

First of all, thanks a lot for the quick response!

The amount of files wasn't high; the I/O issues were due to read I/O, not write.
This happened when we were receiving about 1 million documents per minute and then ran a query for a few hours through Kibana. Before the change to doc values, the search crashed with an OutOfMemoryError. After we changed to doc values, the query could complete, but only after a very long time, and the read I/O on the instance was very high during the search.
We thought maybe we could put the doc values files on an SSD with high IOPS, but as you confirmed, it can't be done.
Can you suggest any other way to handle a very high volume of data without hurting search performance?

warkolm · August 17, 2015, 7:37am

SSDs will help.

Do you have graphs of before and after doc values were implemented?

alexkru · August 17, 2015, 8:02am

Hi warkolm,

We don't have graphs showing the described behavior.

I would like to emphasize that we have a few TB of data in a write-intensive cluster, and the total heap size in the cluster isn't big enough to hold the field data cache needed for searches.

To work around this, we manually clear the field data cache every 20 minutes, so it is regenerated from scratch each time a user sorts on the data.

Putting all of the data on ephemeral storage isn't an option (it would multiply the total cost of the cluster).

A few questions:

Is it a Lucene restriction or an Elasticsearch one that prevents putting the doc values on a different FS (storage)?
Do you plan to support such a feature in the future?
What do you suggest in these circumstances?

Your help is much appreciated.

Thanks,
Alex.

warkolm · August 17, 2015, 8:23am

Elasticsearch. Lucene doesn't have field data.
I'm not aware of any feature requests for this.
Use SSDs/faster storage or add more nodes.

alexkru · August 17, 2015, 8:55am

While researching the matter, I saw that doc values are a Lucene feature. You can see references to this here:
https://lucene.apache.org/core/5_1_0/core/org/apache/lucene/index/DocValues.html
https://lucene.apache.org/core/5_1_0/core/org/apache/lucene/codecs/lucene50/Lucene50DocValuesFormat.html

I wanted to know whether the doc values data path is hard-coded to sit with the other data in Lucene, or in Elasticsearch?

warkolm · August 17, 2015, 10:14am

Yep. You're right and I'm wrong! I thought it was purely an ES abstraction.

It may be defined in Lucene, but you won't be able to alter this in Elasticsearch without some hacking.
