Advantages of Doc Values

Hello Elasticsearch experts,

The performance advantage of doc values is that they can leverage the OS disk cache, which is memory separate from (and potentially larger than) the JVM heap. Am I right that a non-doc-values field can only use the JVM's own memory? If so, why can't it also leverage the OS disk cache for a performance boost, which would save us from explicitly specifying physical storage when defining the logical index structure?

thanks in advance,
Lin

Non-doc-values fields, i.e. fielddata, are held on the heap, so they'll be pretty fast.

Thanks Mark,

So non-doc-values fields can only use the Java heap? They cannot leverage the OS-level buffer cache?

regards,
Lin

They aren't on disk, so no.

Thanks for clarifying, Mark,

Then how do doc values leverage memory outside of the JVM? Aren't doc values still JVM (Java) data structures? If so, how can doc-values fields use memory outside the JVM?

regards,
Lin

They use the OS cache, which uses any free system memory to hold files. This is why we recommend setting heap to 50% of total system RAM.

Doc values are not JVM structures though.
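As a concrete illustration of the 50% recommendation (sizes are hypothetical; the file location depends on how Elasticsearch was installed), on a machine with 32 GB of RAM the heap settings in `config/jvm.options` would look like:

```
# Give the JVM half of the 32 GB of RAM; the rest stays available
# to the OS page cache, which is what doc values read through.
-Xms16g
-Xmx16g
```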

Thanks Mark,

If doc values are not JVM data structures, what do they use and where do they live? I think JNI still uses JVM memory? Please feel free to correct me if I am wrong.

Have a good weekend. :smile:

regards,
Lin

Doc values are mapped into the process's address space instead of (like field data) being read into memory allocated from the JVM heap.

Thanks Mark,

Are there any more details on how "doc values are mapped into the process's address space"? I tried doc values and they work great; I'm just curious to learn a bit more. :smile:

Have a great weekend.

regards,
Lin

Not sure exactly what you're looking for, but Wikipedia's article on memory-mapped files seems to explain things pretty well. Any book on operating systems (e.g. Tanenbaum) also covers this.

Lucene's main data structure is an inverted index. That is, terms point
to documents.

For things like sorting and faceting, this doesn't work, because you
need to be able to point from a document to a term (if you want to sort
by the price field, you need to identify the value of the price field
for a specific document).

We solve this with an uninverted index, such as Lucene's FieldCache. The
field cache is built in the background by Lucene by reading through the
inverted index on each commit, and "uninverting" it. This takes an on-
disk data structure, which (on certain OSes) can be accessed via a memory-
mapped file system, and creates an in-heap data structure. This can be
really fast, but suffers from the need to build the data structure
entirely on each commit, which can take some time.
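The two directions can be sketched in a few lines of Java (a toy example; the `price` field, values, and doc IDs are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexDirections {
    // Inverted index: term -> documents containing it (good for search).
    static Map<Integer, List<Integer>> invertedPrice = new TreeMap<>(Map.of(
            5, List.of(0, 2),   // docs 0 and 2 have price 5
            10, List.of(1)));   // doc 1 has price 10

    // "Uninverting" produces the doc -> value direction,
    // which is what sorting and faceting need.
    static int[] uninvert(int numDocs) {
        int[] docToPrice = new int[numDocs];
        invertedPrice.forEach((price, docs) ->
                docs.forEach(doc -> docToPrice[doc] = price));
        return docToPrice;
    }

    public static void main(String[] args) {
        // To sort docs by price we need the doc -> value lookup:
        System.out.println(Arrays.toString(uninvert(3))); // [5, 10, 5]
    }
}
```

The FieldCache does roughly this uninverting pass over the on-disk index on each commit and keeps the result on the heap; doc values instead write the doc -> value column to disk at index time.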

DocValues provide a solution. They are an uninverted, column-based store
that is built at index time as an on-disk data structure.

The point of the memory-mapped filesystem is quite simple. Lucene
developers noticed that when they were loading indexes into memory,
they were loading them from disk into the OS disk cache, and then from
there into the Java heap. As a result, there were two copies of the
data in memory, which was overkill. The solution was to switch to using
memory-mapped files in which Java can access the files in the OS disk
cache as if they were simply in memory, thus halving memory
requirements for Lucene.
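A minimal Java sketch of the mechanism, using `FileChannel.map` (the class and helper names are hypothetical; this is the same API Lucene's MMapDirectory builds on):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    // Map a file into the process's address space. No heap copy is made:
    // reads fault pages in from the OS cache (the same cache that holds
    // the file after normal I/O), so the data lives in memory only once.
    static byte readByte(Path file, int offset) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(offset);
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("docvalues", ".bin");
        Files.write(p, new byte[] {10, 20, 30});
        System.out.println(readByte(p, 1)); // 20
        Files.delete(p);
    }
}
```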

Upayavira

Thanks Upayavira for the details,

  1. For your comments below, I want to confirm my understanding is correct: Java still accesses the data through the normal Java file I/O interface, as if reading an ordinary disk file, but underneath, because the file is memory-mapped, the data is already in memory (OS cache memory, not JVM memory)?
  2. Regarding your comment "We solve this with an uninverted index": I'm confused about what an uninverted index means. In your example, I think it could be implemented simply with the inverted index (price field => document ID): suppose we want to find documents whose price is more than 10 USD, we just scan the price terms in the inverted index and keep only those above 10 USD? Please feel free to correct me if I am wrong. :smile:

BTW, when you say "faceting", do you mean filtering by a field?

regards,
Lin

Thanks Magnus,

Consider the scenario of filtering student information (suppose each student is a document) by a student ID field. I think Elasticsearch will build an inverted index from student ID to the student-information document. In this use case, is the memory-mapped file used for the inverted index (student ID => student document), or for the student document itself?

regards,
Lin

He's Magnus, I'm Mark.

Corrected, :wink:

Thanks to you both; you are super experts. :smile:

regards,
Lin

@upayavira, it would be great if you could comment on my questions. :smile: