Better understanding Lucene/Shard overheads

Hi,

I just came across this blog post: http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html

It seems like a lot of work has been done on Lucene to reduce its memory requirements, with even more coming in Lucene 5.0. This is especially interesting to me since I’m working on a project that uses Elasticsearch, and we are planning on using a one-index-per-customer model (each index with 1 or maybe 2 shards and no replicas) plus shard allocation, mainly because:

  1. We are going to have a few thousand customers at most

  2. Each customer will only need access to their own data (no global queries)

  3. The indices are going to be relatively large (each with millions of small docs)

  4. We are going to need to do a lot of parent/child type queries (and ES doesn’t support cross-shard parent/child relationships, and the parent id cache seems not that efficient; see http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/parent-child.html and https://github.com/elasticsearch/elasticsearch/issues/3516#issuecomment-23081662). This is the main reason we feel we can’t use time-based (daily, monthly, …) indices.

  5. Being able to easily “drop” an index if a customer leaves the initial trial.
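For context, a parent/child query of the kind described in point 4 looks roughly like this in the ES 1.x query DSL — the type and field names here are illustrative, not from our actual mapping:

```json
{
  "query": {
    "has_child": {
      "type": "order",
      "query": { "term": { "status": "shipped" } }
    }
  }
}
```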

I wanted to better understand the overheads of an Elasticsearch shard. Is it just memory or CPU/threads too? Where can I find more information about this?

Thanks,

Drew

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/F59813A2-904C-4B29-BBC9-6174DD3C8DAF%40venarc.com.
For more options, visit https://groups.google.com/d/optout.

There is definitely a non-trivial per-index cost.

From Lucene's standpoint, ES holds an IndexReader (for searching) and
IndexWriter (for indexing) open.

IndexReader requires some RAM for each segment to hold structures like live
docs, terms index, index data structures for doc values fields, and holds
open a number of file descriptors in proportion to how many segments are in
the index.

IndexWriter has a RAM buffer (indices.memory.index_buffer_size in ES) to
hold recently indexed/deleted documents, and periodically opens readers (10
at a time by default) to do merging, which bumps up RAM usage and file
descriptors while the merge runs.

There is also a per-indexed-field cost in Lucene; if you have a great many
unique indexed fields, that may matter.

If you use field data, it's entirely RAM resident (doc values is a better
choice since it uses much less RAM).
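As a concrete illustration, in ES 1.x doc values are opted into per field in the mapping — the type and field names below are made up:

```json
{
  "mappings": {
    "event": {
      "properties": {
        "timestamp": { "type": "date", "doc_values": true },
        "status":    { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}
```

Doc values are written to disk at index time and are mostly kept off-heap, which is why they use far less RAM than field data.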

ES has common thread pools on the node which are shared for all ops across
all shards on that node, so I don't think more indices translates to more
threads.

Net/net, you really should just conduct your own tests to get a feel for
resource consumption in your use case...

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jan 22, 2015 at 4:07 PM, Drew Kutcharian <drew@venarc.com> wrote:



Thanks Mike. I’m still a bit unclear on these comments:

IndexReader requires some RAM for each segment to hold structures like live docs, terms index, index data structures for doc values fields, and holds open a number of file descriptors in proportion to how many segments are in the index.
There is also a per-indexed-field cost in Lucene; if you have a great many unique indexed fields that may matter.

Aren’t these structures dependent on the size of the “Lucene index”? Say I have 1 large Lucene index vs. 10 small Lucene indices (assuming not much duplicated data across indices); wouldn’t the total memory used be about the same? I understand that there will be more file descriptors because there will be more segments.

IndexWriter has a RAM buffer (indices.memory.index_buffer_size in ES) to hold recently indexed/deleted documents, and periodically opens readers (10 at a time by default) to do merging, which bumps up RAM usage and file descriptors while the merge runs.

According to the doc at https://github.com/elasticsearch/elasticsearch/blob/master/docs/reference/modules/indices.asciidoc, it seems like indices.memory.index_buffer_size is the “total” size of the buffer for all the shards on a node, so I’m not sure how this would matter in the case of having too many shards. I understand that there will be more file descriptors and a lot more “smaller” merge jobs running.

I’m going to test this myself, but I just wanted to understand the model better first so I have more accurate tests.

Thanks again,

Drew

On Jan 23, 2015, at 2:18 AM, Michael McCandless <mike@elasticsearch.com> wrote:



On Fri, Jan 23, 2015 at 8:42 PM, Drew Kutcharian <drew@venarc.com> wrote:

Thanks Mike. I’m still a bit unclear on these comments:

IndexReader requires some RAM for each segment to hold structures like
live docs, terms index, index data structures for doc values fields, and
holds open a number of file descriptors in proportion to how many segments
are in the index.

There is also a per-indexed-field cost in Lucene; if you have a great many
unique indexed fields that may matter.

Aren’t these structures dependent on the size of the “Lucene index”? Say
if I have 1 large lucene index vs 10 small lucene indices (considering not
much duplicated data across indices) wouldn’t the total memory used be the
same? I understand that there will be more file descriptors because there
will be more segments.

Yes and no.

There are parts of the RAM usage that are just fixed cost per index, per
segment, per field, and then there are parts (usually dominating) that
correlate to how large the index is.

So if you split the same index into 10 indices, those 10 will use somewhat
more RAM.
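A back-of-envelope way to see this: model total RAM as a fixed per-index/per-segment component plus a part that scales with index size. All constants below are made-up illustration values, not real Lucene measurements:

```python
# Hypothetical cost model: fixed overhead per index and per segment,
# plus a size-dependent cost proportional to document count.
# The constants are illustrative only.
FIXED_PER_INDEX_MB = 5.0           # assumed fixed cost per open index
FIXED_PER_SEGMENT_MB = 0.5         # assumed fixed cost per segment
SIZE_MB_PER_MILLION_DOCS = 20.0    # assumed size-dependent cost

def estimated_ram_mb(num_indices, segments_per_index, total_docs_millions):
    """Estimate total RAM for the same data split across num_indices indices."""
    fixed = num_indices * (FIXED_PER_INDEX_MB
                           + segments_per_index * FIXED_PER_SEGMENT_MB)
    variable = total_docs_millions * SIZE_MB_PER_MILLION_DOCS
    return fixed + variable

one_big = estimated_ram_mb(1, 30, 100)     # one index holding 100M docs
ten_small = estimated_ram_mb(10, 30, 100)  # same 100M docs split 10 ways
print(one_big, ten_small)  # → 2020.0 2200.0
```

The size-dependent part is identical in both layouts; the 10-index layout only pays the fixed per-index/per-segment overhead ten times instead of once, which is exactly the "somewhat more RAM" above.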

IndexWriter has a RAM buffer (indices.memory.index_buffer_size in ES) to
hold recently indexed/deleted documents, and periodically opens readers (10
at a time by default) to do merging, which bumps up RAM usage and file
descriptors while the merge runs.

According to the doc at
https://github.com/elasticsearch/elasticsearch/blob/master/docs/reference/modules/indices.asciidoc seems
like indices.memory.index_buffer_size is the “total” size of the buffer for
all the shards on a node, so not sure how this would matter in case of
having too many shards. I understand that there will be more file
descriptors and a lot more “smaller” merge jobs running.

That's true, so the indexing RAM buffer won't be different if you use 1 vs
10 indices, so it's just the fixed overhead of holding an IndexWriter open
for indexing, and IndexReader(s) for searching.

Mike McCandless

http://blog.mikemccandless.com
