Hopefully someone from the Elasticsearch team, or anyone else who really
knows the internals, can respond, but in the meantime here is a general
overview.
Elasticsearch is essentially a distributed version of Lucene. Ultimately,
all the queried content resides in the indices and segments managed by
Lucene. Index data is held in off-heap memory, that is, memory not
allocated to the JVM. This is why it is suggested to run the JVM with 50%
of the system memory: Lucene uses the rest. You can tune Lucene's memory
management by changing the underlying filesystem (store) implementation.
[1] This memory is managed by the operating system, so data that exceeds
the available system memory is paged out and read back from disk when
needed. If you want to learn more, look into Lucene.
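To make the "managed by the OS" part concrete, here is a minimal sketch of
reading a file through a memory mapping, plus the 50% rule of thumb as
arithmetic. It only illustrates the general mechanism; it is not how
Lucene itself is implemented. The file contents and the names
read_via_mmap and suggested_heap_bytes are made up for illustration.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, offset, length):
    """Read `length` bytes at `offset` through a memory mapping of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. The OS loads pages lazily and keeps
        # them in its own cache, outside the heap of the running process.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def suggested_heap_bytes(system_ram_bytes, fraction=0.5):
    """Rule of thumb from the text: give the JVM about half of system RAM."""
    return int(system_ram_bytes * fraction)


# Hypothetical stand-in for an index segment file on disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"term-dictionary|postings|stored-fields")
    segment_path = tmp.name

chunk = read_via_mmap(segment_path, 16, 8)  # the 8 bytes starting at offset 16
os.remove(segment_path)

heap = suggested_heap_bytes(16 * 1024**3)  # a machine with 16 GiB of RAM
```

The process never copies the whole file into its own heap; the OS decides
which pages stay in memory and which are dropped and re-read later.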
The field data (used for facets and sorting) is allocated on the JVM's
heap. Elasticsearch uses Google Guava [2] to manage its caches. The
settings are tunable and basically adjust the Google Guava settings. Field
data that is too large will result in long garbage collections by the JVM.
Cache entries can expire depending on the settings used and the amount of
data in the cache.
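As a rough illustration of what "expire depending on the settings used and
the amount of data" can mean, here is a toy cache with a size limit and an
optional time limit. This is not Guava's API or Elasticsearch's code;
ToyFieldCache, max_entries, ttl_seconds and load_field are names invented
for this sketch.

```python
import time
from collections import OrderedDict


class ToyFieldCache:
    """Toy cache: evicts the least recently used entry when over capacity
    and reloads entries not accessed within ttl_seconds."""

    def __init__(self, max_entries, ttl_seconds=None, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # key -> (value, last_access_time)

    def get(self, key, loader):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None:
            value, last_access = hit
            if self.ttl_seconds is None or now - last_access <= self.ttl_seconds:
                self._entries[key] = (value, now)
                self._entries.move_to_end(key)  # mark as most recently used
                return value
            del self._entries[key]  # entry expired, fall through and reload
        value = loader(key)  # the expensive part: building the field data
        self._entries[key] = (value, now)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop least recently used
        return value

    def __contains__(self, key):
        return key in self._entries


loads = []


def load_field(name):
    """Hypothetical loader standing in for building field data from the index."""
    loads.append(name)
    return [name + "-value"]


cache = ToyFieldCache(max_entries=2)
cache.get("price", load_field)
cache.get("color", load_field)
cache.get("price", load_field)  # hit, no reload
cache.get("size", load_field)   # over capacity, evicts "color"
```

Every eviction means the next request for that field pays the build cost
again, which is why the usual advice is to have enough heap to keep the
field data loaded.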
The last piece of the puzzle is the transaction log. Documents indexed in
Elasticsearch are first placed into the transaction log. Once the write
consistency requirement is met, the documents are written to the Lucene
segments. The Lucene segments are not distributed between shards/nodes;
the transaction log is. Lucene queries work from the indices, but Get
requests in Elasticsearch will use the transaction log for real-time
lookups. The transaction log is perhaps the least documented and discussed
part. There are no major configuration settings (AFAIK), so not much is
exposed. You can trace the transaction log code (probably starting from a
Get request) to learn more.
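The difference between a real-time Get and a search, as described above,
can be modeled with a small toy. It follows the description in this mail,
not the actual Elasticsearch code: ToyShard and its methods are invented
names, and the single flush step here glosses over how and when
Elasticsearch really moves documents into segments.

```python
class ToyShard:
    """Toy model: writes land in a transaction log first and only become
    searchable once they are moved into the 'segments'."""

    def __init__(self):
        self.translog = []   # append-only list of (doc_id, document)
        self.segments = {}   # doc_id -> document, what searches can see

    def index(self, doc_id, document):
        # A new write only goes to the transaction log.
        self.translog.append((doc_id, document))

    def get(self, doc_id):
        # Real-time get: check the log first, newest entry wins.
        for logged_id, document in reversed(self.translog):
            if logged_id == doc_id:
                return document
        return self.segments.get(doc_id)

    def search(self, predicate):
        # Searches only see what has reached the segments.
        return [doc for doc in self.segments.values() if predicate(doc)]

    def flush(self):
        # Move logged documents into the segments and empty the log.
        for doc_id, document in self.translog:
            self.segments[doc_id] = document
        self.translog.clear()


shard = ToyShard()
shard.index("1", {"title": "hello"})

found_by_get = shard.get("1")                     # visible immediately
found_before = shard.search(lambda doc: True)     # not searchable yet
shard.flush()
found_after = shard.search(lambda doc: True)      # searchable now
```

So a Get by id can return a document that a search issued at the same
moment would not find yet.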
Essentially, Elasticsearch loads as much as possible into memory until it
can't. Lucene data is managed by the OS, the field cache by the JVM. I have
probably already written too much, and yet also too little.
Cheers,
Ivan
[1] Elasticsearch Platform — Find real-time answers at scale | Elastic
[2] GitHub - google/guava: Google core libraries for Java
On Sat, Apr 13, 2013 at 11:42 AM, Ahmet Akyol liqusha@gmail.com wrote:
OK, my bad... Here is a better example documentation:
http://www.datastax.com/docs/1.2/dml/about_reads
This page explains how reads work with Cassandra at a high level: Bloom
filters, caches, memtables, SSTables, etc. But for this topic, what matters
is:
"Finally, Cassandra performs a single seek and a sequential read of
columns (a range read) in the SSTable if the columns are contiguous, and
returns the result set."
With HBase, sequential row access, and with Cassandra, contiguous column
access (wide rows), are the key part of schema design because of sequential
reads on disk (SSTable, HFile).
Back to the question: let's look at field caches, as you pointed out:
Elasticsearch Platform — Find real-time answers at scale | Elastic
"The field data cache can be expensive to build for a field, so its
recommended to have enough memory to allocate it, and to keep it loaded."
This explains the cost in memory usage but not the disk access part.
So the question now is where and how this field data comes from. In
other words, what is the SSTable of Elasticsearch, meaning what happens to
index information (analyzed or not) on disk? What is the structure?
Best regards,
P.S.: Ivan, you're right, it's a desperate expectation, a "documentation
on the internals". Maybe a clever (indeed a clairvoyant) guy can say, "Too
much talk. You can't explain it but I see what you really mean. The simple
answer to your question is ..." in a few sentences.
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.