Doc Values Storage


(Alex Kruchkov) #1

Hi all,
We are using ES on AWS EC2 instances and we want to reduce our JVM memory usage. The most common solution is to use doc values, thereby reducing the in-memory field data. When we tried it on a test system, it resulted in high I/O consumption and very large storage use.
We want to keep the doc values on the ephemeral storage, but when I looked into it, the doc values files are stored with all the other data.
Is there a way to store only the doc values on another filesystem?
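
For context, here is a minimal sketch of the kind of mapping change meant by "use doc values". It assumes the Elasticsearch 1.x-style `doc_values` flag on a `not_analyzed` string field; the type name `logs` and field name `hostname` are hypothetical, and the exact syntax may differ in other versions.

```python
import json


def doc_values_mapping(type_name, field_name):
    """Build a mapping body that stores one field's values on disk as doc values.

    Assumption: ES 1.x-style syntax, where doc values are enabled per field
    on a not_analyzed string. The names passed in are illustrative only.
    """
    return {
        type_name: {
            "properties": {
                field_name: {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,
                }
            }
        }
    }


# JSON body that could be sent when creating the index or updating its mapping.
body = json.dumps(doc_values_mapping("logs", "hostname"), sort_keys=True)
```

The field data for such a field is then read from disk (through the filesystem cache) instead of being built on the heap, which is why heap pressure drops and read I/O rises.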

Thanks in advance,
Alex.


(Mark Walkom) #2

It's not, they are stored with the shards.

However, doc values should not be generating large amounts of data. Do you have more information about what you saw?


(Alex Kruchkov) #3

First of all, thanks a lot for a quick response!

The file amount wasn't high; the I/O issues were caused by read I/O, not write I/O.
This happened when we were receiving about 1 million documents per minute and then ran a query for a few hours through Kibana. Before the change to doc values, the search crashed with an OutOfMemoryError. After we switched to doc values, the query could complete, but only after a very long time, and the read I/O on the instance was very high during the search.
We thought maybe we could put the doc values files on an SSD disk with high IOPS, but as you confirmed, it can't be done.
Can you suggest any other way to handle a very high volume of data without hurting search performance?


(Mark Walkom) #4

SSDs will help.

Do you have graphs of before and after doc values were implemented?


(Alex Kruchkov) #5

Hi warkolm,

We don't have the graphs showing the described behavior.

I would like to emphasize that we have a few TB of data in a write-intensive cluster, and the total heap size in the cluster isn't big enough to hold the field data cache needed for searches.

To work around this, we manually clear the field data cache every 20 minutes, so it is regenerated from scratch each time a user performs a sort on the data.
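
A rough sketch of that periodic clear, assuming the clear-cache REST endpoint with the `fielddata=true` parameter (the parameter name may vary between versions, so check the docs for yours). The base URL and index name are placeholders, and the function only builds the request; sending it and scheduling it every 20 minutes is left to cron or similar.

```python
import urllib.request


def build_clear_fielddata_request(base_url, index=None):
    """Build (but do not send) a POST request that clears only the field data cache.

    Assumption: the cluster exposes /_cache/clear and accepts fielddata=true.
    If no index is given, the request targets all indices.
    """
    path = "/{}/_cache/clear".format(index) if index else "/_cache/clear"
    url = base_url.rstrip("/") + path + "?fielddata=true"
    return urllib.request.Request(url, method="POST")


# Hypothetical node address and index name, for illustration only.
req = build_clear_fielddata_request("http://localhost:9200/", "logs")
# To actually send it: urllib.request.urlopen(req)
```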

Putting all of the data on ephemeral storage isn't an option (it would multiply the total cost of the cluster).

A few questions:

  1. Is it a Lucene restriction or an Elasticsearch one that doc values can't be kept on a different FS (storage)?
  2. Do you plan to support such a feature in the future?
  3. What do you suggest in such circumstances?

Your help is much appreciated.

Thanks,
Alex.


(Mark Walkom) #6

  1. Elasticsearch, Lucene doesn't have field data.
  2. I'm not aware of any feature requests for this.
  3. Use SSDs/faster storage or add more nodes.


(Alex Kruchkov) #7

While researching the matter, I saw that doc values are a Lucene feature. You can see references to this here:
https://lucene.apache.org/core/5_1_0/core/org/apache/lucene/index/DocValues.html
https://lucene.apache.org/core/5_1_0/core/org/apache/lucene/codecs/lucene50/Lucene50DocValuesFormat.html

I wanted to know whether the doc values data path is hard-coded to sit with the other data in Lucene, or in Elasticsearch.


(Mark Walkom) #8

Yep. You're right and I'm wrong! I thought it was purely an ES abstraction.

It may be defined in Lucene, but you won't be able to alter this in Elasticsearch without some hacking.

