The performance advantage of DocValues is that they can leverage the OS disk cache, which is separate from (and may be larger than) the JVM's memory. Is a non-DocValues field only able to use the JVM's own memory? If so, why can't it leverage the OS disk cache to boost performance, which would save us from explicitly specifying physical storage when defining the logical index structure?
Then for doc values, how do they leverage memory outside of the JVM? Aren't doc values still JVM (Java) data structures? If so, how do doc values fields use memory outside the JVM?
If doc values are not JVM data structures, what are they and where do they live? I think JNI still uses JVM memory? Please feel free to correct me if I am wrong.
Are there any more details on how "Doc values are mapped into the process's address"? I tried doc values and they work great; I'm just curious to learn a bit more.
Not sure exactly what you're looking for, but Wikipedia's article on memory-mapped files seems to explain things pretty well. Any book on operating systems (e.g. Tanenbaum) also covers this.
Lucene's main data structure is an inverted index. That is, terms point
to documents.
For things like sorting and faceting, this doesn't work, because you
need to be able to point from a document to a term (if you want to sort
by the price field, you need to identify the value of the price field
for a specific document).
We solve this with an uninverted index, such as Lucene's FieldCache. The
field cache is built in the background by Lucene by reading through the
inverted index on each commit and "uninverting" it. This takes an
on-disk data structure, which (on certain OSes) can be accessed via
memory-mapped files, and creates an in-heap data structure. This can be
really fast, but suffers from the need to rebuild the data structure
entirely on each commit, which can take some time.
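To make "uninverting" concrete, here is a minimal sketch in plain Java collections. It is not Lucene's FieldCache code; the class name, the method name, and the one-value-per-document assumption are all made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only, not Lucene's FieldCache implementation.
class UninvertSketch {
    // Input (inverted): term -> IDs of the documents that contain it.
    // Output (uninverted): array indexed by doc ID holding that doc's term,
    // or null when the document has no value for the field.
    static String[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                // Assumes a single value per document; a later term would overwrite.
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        postings.put("10", Arrays.asList(0, 2)); // price "10" appears in docs 0 and 2
        postings.put("25", Arrays.asList(1));    // price "25" appears in doc 1
        String[] priceByDoc = uninvert(postings, 3);
        System.out.println(String.join(",", priceByDoc)); // prints 10,25,10
    }
}
```

The whole postings map has to be walked before any document's value can be looked up, which is the rebuild cost described above.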
DocValues provide a solution. They are an uninverted, column-based store
that is built at index time as an on-disk data structure.
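As a rough picture of what "column-based, on disk, built at index time" can mean, here is a sketch that writes one fixed-width value per document to a file and later looks a document's value up by offset. The file layout and all names are assumptions for illustration; the real DocValues formats are more elaborate.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical layout: one 8-byte long per document, stored in doc ID order.
class ColumnFileSketch {
    // "Index time": write the column once, as documents are indexed.
    static void writeColumn(Path file, long[] valueByDoc) {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            for (long value : valueByDoc) {
                out.writeLong(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Search time": doc ID -> value is a seek to docId * 8 bytes,
    // with nothing to uninvert and nothing to build on the heap first.
    static long readValue(Path file, int docId) {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek((long) docId * Long.BYTES);
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("price-column", ".bin");
        writeColumn(file, new long[] {1000, 2500, 1000}); // prices in cents, docs 0..2
        System.out.println(readValue(file, 1)); // prints 2500
        Files.delete(file);
    }
}
```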
The point of memory-mapped files is quite simple. Lucene
developers noticed that when they were loading indexes into memory,
they were loading them from disk into the OS disk cache, and then from
there into the Java heap. As a result, there were two copies of the
data in memory, which was overkill. The solution was to switch to
memory-mapped files, through which Java can access the files in the OS
disk cache as if they were simply in memory, thus halving memory
requirements for Lucene.
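For anyone who wants to see the mechanism from the Java side, here is a small sketch using the standard java.nio API (FileChannel.map and MappedByteBuffer). It is not how Lucene's MMapDirectory is written; the class, the helper names, and the file contents are invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of memory-mapped reads with java.nio.
class MmapSketch {
    // Helper that creates a file of big-endian longs to read back.
    static void writeLongs(Path file, long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long value : values) {
            buf.putLong(value);
        }
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the file read-only and reads the long stored at the given slot.
    // The mapped bytes live outside the Java heap: the OS pages them in from
    // its disk cache on access, so no second copy is made in heap memory.
    static long readLongAt(Path file, int slot) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return mapped.getLong(slot * Long.BYTES); // absolute read, no explicit read() call
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-sketch", ".bin");
        file.toFile().deleteOnExit();
        writeLongs(file, new long[] {7, 42, 99});
        System.out.println(readLongAt(file, 1)); // prints 42
    }
}
```

Note that the code reads from the buffer as if it were an in-memory array; whether a given access hits RAM or goes to disk is decided by the OS, not by the JVM.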
Regarding your comments below, I want to confirm that my understanding is correct: Java still accesses the files through the Java file I/O interface, as if reading a normal disk file, but underneath, thanks to memory mapping, the files are already in memory (OS cache memory rather than JVM memory)?
Regarding your comment "We solve this with an uninverted index", I'm confused about what an uninverted index means. In your example, I think it could be implemented simply with the inverted index (price field => document ID): suppose we want to find documents whose price is more than 10 USD; we just scan the inverted index for the price field and keep only the entries above 10 USD? Please feel free to correct me if I am wrong.
BTW, when you say "faceting", do you mean filtering by a field?
Consider filtering student information (with each student as a document) by a student ID field. I think Elasticsearch will build an inverted index from student ID to the actual student information document. In this use case, is the memory-mapped file used for the inverted index (student ID => student information document), or for the student information document itself?