Advantages of DocValue

linlma · July 24, 2015, 9:53pm

Hello Elasticsearch experts,

The performance advantage of doc values is that they can leverage the OS disk cache, which is separate from (and may be larger than) the JVM's memory. I am wondering whether a non-doc-values field can only use the JVM's own memory. If so, why can't it leverage the OS disk cache for a performance boost, which would save us from explicitly specifying physical storage when defining the logical index structure?

thanks in advance,
Lin

warkolm · July 25, 2015, 12:35am

Non-doc-values data, i.e. fielddata, is held in the heap, so it'll be pretty fast.

linlma · July 25, 2015, 1:02am

Thanks Mark,

So non-doc-values fields can only use the Java heap? They cannot leverage the OS-level buffer cache?

regards,
Lin

warkolm · July 25, 2015, 2:10am

They aren't on disk, so no.

linlma · July 25, 2015, 2:34am

Thanks Mark for clarifying,

Then how do doc values leverage memory outside of the JVM? Aren't doc values still JVM (Java) data structures? If so, how do doc values fields use memory outside the JVM?

regards,
Lin

warkolm · July 25, 2015, 5:04am

They use the OS cache, which uses any free system memory to hold files. This is why we recommend setting the heap to 50% of total system RAM, so the other half stays available for the OS cache.

Doc values are not JVM structures though.

linlma · July 25, 2015, 5:51am

Thanks Mark,

If doc values are not JVM data structures, what are they using and where do they live? I think even JNI still uses JVM memory? Please feel free to correct me if I am wrong.

Have a good weekend.

regards,
Lin

magnusbaeck · July 25, 2015, 8:46am

Doc values are mapped into the process's address space instead of (like field data) being read into memory allocated from the JVM heap.
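
For anyone who wants to see the idea in isolation, here is a minimal sketch in Python rather than Java. The file name `price.col` and the layout of one fixed-width 8-byte value per document are assumptions made up for illustration; this is not Lucene's actual doc values format. The file is mapped into the process's address space, and one document's value is read at a computed offset without first reading the whole file into memory owned by the program.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: one little-endian signed 64-bit value per document,
# stored in doc ID order. An illustration only, not Lucene's real format.
VALUE = struct.Struct("<q")

def write_column(path, values):
    """Write one fixed-width value per document to a file on disk."""
    with open(path, "wb") as f:
        for v in values:
            f.write(VALUE.pack(v))

def read_value(mapped, doc_id):
    """Read the value for one document straight from the mapped region."""
    return VALUE.unpack_from(mapped, doc_id * VALUE.size)[0]

path = os.path.join(tempfile.mkdtemp(), "price.col")
write_column(path, [1999, 250, 1200, 75])

with open(path, "rb") as f:
    # Map the file read-only. The OS pages data in from its cache on access.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        price_of_doc_2 = read_value(mapped, 2)
        mapped_size = len(mapped)
```

The lookup is plain offset arithmetic on the mapped region, which is why fixed-width, per-document columns suit this access pattern.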

linlma · July 26, 2015, 5:01am

Thanks Mark,

Are there any more details on how "doc values are mapped into the process's address space"? I tried doc values and they work great; I am just curious to learn a bit more.

Have a great weekend.

regards,
Lin

magnusbaeck · July 26, 2015, 8:28am

Not sure exactly what you're looking for, but Wikipedia's article on memory-mapped files seems to explain things pretty well. Any book on operating systems (e.g. Tanenbaum) also covers this.

upayavira · July 26, 2015, 3:53pm

Lucene's main data structure is an inverted index. That is, terms point
to documents.

For things like sorting and faceting, this doesn't work, because you
need to be able to point from a document to a term (if you want to sort
by the price field, you need to identify the value of the price field
for a specific document).
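
A toy sketch of that asymmetry, in Python for brevity. The documents and the `price` values are invented for illustration, and a real Lucene index is far more elaborate than a dictionary. Going from a term to its documents is a single lookup, while going from a document back to its value means scanning the postings lists.

```python
from collections import defaultdict

# Hypothetical documents: doc ID -> value of the "price" field.
docs = {0: 30, 1: 10, 2: 20, 3: 10}

def build_inverted_index(docs):
    """Map each term (here, a price) to the list of doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        index[docs[doc_id]].append(doc_id)
    return dict(index)

def value_for_doc(index, doc_id):
    """Find a document's term by scanning every postings list."""
    for term, postings in index.items():
        if doc_id in postings:
            return term
    return None

inverted = build_inverted_index(docs)
docs_with_price_10 = inverted[10]             # term -> documents: one lookup
price_of_doc_2 = value_for_doc(inverted, 2)   # document -> term: a scan
```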

We solve this with an uninverted index, such as Lucene's FieldCache. The
field cache is built in the background by Lucene by reading through the
inverted index on each commit, and "uninverting" it. This takes an on-
disk data structure, which (on certain OSes) can be accessed via a memory-
mapped file system, and creates an in-heap data structure. This can be
really fast, but suffers from the need to build the data structure
entirely on each commit, which can take some time.

DocValues provide a solution. They are an uninverted, column-based store
that is built at index time as an on-disk data structure.
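
A rough sketch of what "uninverting" produces, again in Python with invented data. The function names are hypothetical and only the shape of the result matters: a column indexed by document ID. As described above, the field cache derives such a column from the inverted index after each commit and keeps it in the heap, whereas doc values write the equivalent column to disk at index time.

```python
# Hypothetical inverted index for a "price" field: term -> doc IDs.
inverted = {10: [1, 3], 20: [2], 30: [0]}

def uninvert(inverted, num_docs):
    """Build a column indexed by doc ID from a term -> doc IDs mapping."""
    column = [None] * num_docs
    for term, postings in inverted.items():
        for doc_id in postings:
            column[doc_id] = term
    return column

def sort_docs_by_value(column):
    """Sort doc IDs by their value, using one array lookup per document."""
    return sorted(range(len(column)), key=lambda doc_id: column[doc_id])

price_column = uninvert(inverted, num_docs=4)
sorted_doc_ids = sort_docs_by_value(price_column)
```

With the column in hand, sorting by price needs only `column[doc_id]` per document instead of a scan of the postings lists.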

The point of the memory-mapped filesystem is quite simple. Lucene
developers noticed that when they were loading indexes into memory,
they were loading them from disk into the OS disk cache, and then from
there into the Java heap. As a result, there were two copies of the
data in memory, which was overkill. The solution was to switch to using
memory-mapped files in which Java can access the files in the OS disk
cache as if they were simply in memory, thus halving memory
requirements for Lucene.
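
The two access styles can be put side by side in a small sketch. It is in Python, not Lucene's Java, and the file name `segment.bin` and its contents are made up. The first approach reads the file into memory owned by the program, so the bytes exist both there and in the OS disk cache. The second maps the file and materializes only the bytes that are actually requested.

```python
import mmap
import os
import tempfile

# Hypothetical file standing in for an index file on disk.
path = os.path.join(tempfile.mkdtemp(), "segment.bin")
with open(path, "wb") as f:
    f.write(b"lucene-like index bytes" * 1000)

# Approach 1: read the whole file into the program's own memory.
with open(path, "rb") as f:
    heap_copy = f.read()

# Approach 2: map the file and touch only the bytes that are needed.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first_bytes = mapped[:6]   # copies just these six bytes
        mapped_size = len(mapped)
```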

Upayavira

linlma · July 26, 2015, 9:53pm

Thanks Upayavira for the details,

Regarding your comments, I want to confirm that my understanding is correct. Java still accesses the data through the Java file I/O interface as if it were a normal disk file, but underneath the files are already in memory (OS cache memory rather than JVM memory) because they are memory-mapped?

Regarding your comment "We solve this with an uninverted index", I am confused about what an uninverted index means. In your example, I think it could be implemented simply with the inverted index (price field => document ID): suppose we want to find documents whose price is more than 10 USD; we just scan the inverted index for the price field and keep only the entries above 10 USD? Please feel free to correct me if I am wrong.

BTW, when you say "faceting", do you mean filtering by a field?

regards,
Lin

linlma · July 26, 2015, 10:00pm

Thanks Magnus,

In the scenario of filtering student information (suppose each student is a document) by the student ID field, I think Elasticsearch will build an inverted index from student ID to the actual student information document. In this use case, is the memory-mapped file used for the inverted index (student ID => student information document), or for the student information document itself?

regards,
Lin

warkolm · July 26, 2015, 10:02pm

He's Magnus, I'm Mark.

linlma · July 26, 2015, 10:05pm

Corrected,

Both of you are super experts.

regards,
Lin

linlma · August 6, 2015, 3:44am

@upayavira, it would be great if you could comment on my questions.
