Cluster metadata

I'm building a very large number of ES indices. I'm trying to decide
whether to federate the data into ES indices stored one per AWS EBS
volume, or whether to use an S3 gateway.

When I run a query, I plan to use an index alias to name the indices. If I
use one EBS volume per index, I have the problem of ensuring that those
volumes are attached to AWS instances when I need to run a query. If they
are not, then I need to programmatically attach the needed volumes and then
start ES nodes to join the cluster.

I'd like to have ES maintain as much of the state as possible.
Specifically, I'd like to be able to ask ES which EBS volumes need to be
attached to access a given set of ES indices.

One way to do this is to maintain a separate ES metaindex. However, I know
that ES maintains cluster-wide metadata. My question is, where does this
cluster metadata live? If I maintain a single ES node that does NOT store
index data, will it maintain the cluster metadata? Can I query the cluster
metadata when ALL of the indices that make up the cluster are not only
closed, but actually offline? What do I get back? Do I get back the
mapping meta-data? If so, could I simply add the EBS volume ID to the
index mapping meta-data?
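
If the _meta field in type mappings is available in your ES version, one
way to try that last idea is to stash the volume ID there. A minimal sketch
in Python, assuming a node on localhost:9200, an index "tweets-2012-01"
with a "tweet" type, and a made-up volume ID:

    # Sketch: tag an index's mapping with the EBS volume that backs it.
    # The _meta object is opaque to ES and is simply stored alongside the
    # rest of the mapping metadata.
    import json
    import requests

    mapping = {
        "tweet": {
            "_meta": {"ebs_volume_id": "vol-0123abcd"},  # hypothetical ID
            "properties": {
                "text": {"type": "string"},
                "created_at": {"type": "date"},
            },
        }
    }
    r = requests.put("http://localhost:9200/tweets-2012-01/tweet/_mapping",
                     data=json.dumps(mapping))
    print(r.json())

    # Reading it back comes from the cluster metadata, not the shard files.
    print(requests.get("http://localhost:9200/tweets-2012-01/_mapping").json())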

Also, where are index alias definitions stored? May I query an ES non-data
node to get the alias definitions for the associated cluster? If I could
do that, then I could query the cluster, get the current alias definition,
then for each index in the alias, get the required EBS volume and make sure
that it is mounted on some AWS instance. Make sense?
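
A rough sketch of that lookup, assuming the cluster state API is reachable
(any node in the cluster, including a non-data node, holds a copy of the
cluster state) and that each index's mapping carries the hypothetical
ebs_volume_id tag from the sketch above. The exact response layout differs
between ES versions, so treat this as a starting point:

    # Sketch: alias -> indices -> EBS volumes, using only cluster metadata.
    import requests

    BASE = "http://localhost:9200"

    def indices_for_alias(alias):
        # The cluster state lists every index and its aliases.
        state = requests.get(BASE + "/_cluster/state").json()
        return [name for name, idx in state["metadata"]["indices"].items()
                if alias in idx.get("aliases", [])]

    def volume_for_index(index):
        # Pull the hypothetical ebs_volume_id back out of the mapping _meta.
        # The nesting of the _mapping response varies across ES versions,
        # so adjust the lookup for yours.
        mappings = requests.get("%s/%s/_mapping" % (BASE, index)).json()
        for type_mapping in mappings.get(index, mappings).values():
            if isinstance(type_mapping, dict) and "_meta" in type_mapping:
                return type_mapping["_meta"].get("ebs_volume_id")
        return None

    for idx in indices_for_alias("recent"):
        print(idx, volume_for_index(idx))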

If I let ES manage the index state by using the S3 gateway, then I would
need to know how many AWS instances I need to start up in order to load all
the required indices. This is a simpler problem, and would obviate the
need to maintain a mapping from index name to EBS volume id. However, I've
been warned that the startup time may be significant. Should I simply use
the S3 gateway?
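
For reference, the 0.x S3 gateway is configured in elasticsearch.yml
roughly like this (it needs the cloud-aws plugin; the bucket name and
credentials below are placeholders):

    # elasticsearch.yml sketch for the old shared S3 gateway.
    gateway:
      type: s3
      s3:
        bucket: my-es-gateway-bucket   # placeholder bucket name
    cloud:
      aws:
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY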

How can I find out which ES clusters are available within a given AWS
availability zone?

To be honest, I'm a bit lost with your question... An elasticsearch cluster
will allocate shards across nodes; you don't need to specify a specific
location for each index, or make sure of anything like that. A node has a
"data" location, and as shards are allocated to it, they are stored within it.

Thanks Shay. elasticsearch is very elegant. I am trying to figure out how
to use it efficiently when I cannot afford to keep all the data online all
the time!

Think about the Twitter data. The number of tweets grows each day. One
cannot create a single index to hold the data for all time, since a) the
number of shards per index is fixed and b) a shard cannot grow without
bound. Even if I could, I wouldn't want to, because my compute budget would
grow without bound. So, I make a decision to maintain quick access to more
recent data, and effectively archive older data.

So, how can I maintain access to (query) recent data and archive (take
offline) older data? It seems that I have two choices:

  1. manually "pre-shard"
  2. create indices bounded by time/space
    2. use the index aliasing feature to define "recent" indices OR
  3. manually "post-shard"
  4. place the data into a single index
    2. regularly extract old data using queries
    3. reindex that old data into archival indices
    4. delete the extracted data from ES (or use the TTL feature to have
    ES delete the data)
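
For option 1, a rough sketch of the create-and-rotate step, assuming daily
indices named tweets-YYYY-MM-DD and a "recent" alias over the last seven
days; the naming scheme, shard count, and alias name are my own choices,
not anything ES prescribes:

    # Sketch: create today's index and rotate a "recent" alias over the
    # last N daily indices, using one atomic _aliases request.
    import datetime
    import json
    import requests

    BASE = "http://localhost:9200"
    DAYS_KEPT_RECENT = 7

    def index_name(day):
        return "tweets-%s" % day.strftime("%Y-%m-%d")

    today = datetime.date.today()
    requests.put("%s/%s" % (BASE, index_name(today)),
                 data=json.dumps({"settings": {"number_of_shards": 5}}))

    # Add today's index to the alias and drop the one that just fell out
    # of the window.
    expired = today - datetime.timedelta(days=DAYS_KEPT_RECENT)
    actions = {"actions": [
        {"add":    {"index": index_name(today),   "alias": "recent"}},
        {"remove": {"index": index_name(expired), "alias": "recent"}},
    ]}
    requests.post(BASE + "/_aliases", data=json.dumps(actions))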

In the latter case I have a problem. A single shard cannot grow beyond a
certain size, else it cannot be served within a single JVM of a given heap
size without threat of OOM problems, right? To solve that problem, I would
have to introduce JVMs with more memory (or store indices outside of heap
memory and make sure that I can allocate more memory from the host
computer). To do that, I need to monitor index size and introduce JVMs of
larger size over time to accommodate the (slowly) growing shard size. But
then, I am growing index size without growing CPU. At some point, I would
need to reindex!

The core issue is that if the number of shards per index is fixed and the
amount of data per shard is bounded, then I have a de facto limit on the
amount of data that a single ES index can hold.
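
Just to put illustrative numbers on it: with 5 shards per index and, say, a
practical ceiling of 30 GB per shard (whatever your own heap and latency
testing suggests), a single index tops out around 5 x 30 GB = 150 GB.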

So, it appears that for twitter-like indexing applications, where both the
first and second derivatives of data volume over time are positive, I need
to manually "pre-shard" the data based on time AND index size.

Am I missing something?

We've dealt with designing something similar (indexing time-based data), so
here's my 1.2 cents...

  • Option 1 is the way to go. It's simpler and more flexible. From a
    resource usage perspective, it'd use the same amount of disk space,
    less IO, and less CPU, and memory usage can be controlled by closing
    old indices.
  • You can combine the two approaches and use auto sharding as well as
    creating new indices, depending on how many nodes you'll have. For
    example, create an index with 5 shards per day/week/month, etc.
  • Currently ES balances data among nodes based on the number of shards,
    hence having indices with significantly different sizes can cause an
    unbalanced data distribution among nodes, which leads to problems.
  • Keep metadata about which index contains which data in your
    application or in another data store. This way you can limit searches
    to the relevant indices only, close old indices, and only reopen them
    when a search needs to access them (see the sketch after this list).
  • If you don't mind the latency (if older indices are rarely searched,
    etc.), you can move the closed indices to S3 and copy back when they are
    needed.
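
As a sketch of the close/open part (the index name is hypothetical): a
closed index keeps its data on disk but frees its in-memory structures, and
it cannot be searched until it is opened again.

    # Sketch: close an old index to free resources, reopen it only when a
    # search actually needs it.
    import requests

    BASE = "http://localhost:9200"

    def close_index(name):
        return requests.post("%s/%s/_close" % (BASE, name)).json()

    def open_index(name):
        # Reopening triggers shard recovery, so expect a delay before the
        # index is searchable again.
        return requests.post("%s/%s/_open" % (BASE, name)).json()

    close_index("tweets-2011-12")
    # ... later, a query that needs December 2011 comes in ...
    open_index("tweets-2011-12")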

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Berkay, your advice is much appreciated. This is the direction that I am
headed. Have you found a sweet spot in terms of shard size (bytes)? Any
suggestions on how to size a JVM heap relative to a shard size? I'm
thinking that I need to run "typical" queries and watch how much heap is
used? Is there a better way? An automated way? A way to use ES to create
the estimate?

Hi Derrick,

We've done this a number of times now for a number of customers with
both Solr and now Elasticsearch and our experience matches Berkay's.
Index aliasing comes in very handy. :)

Regarding your JVM heap question, I don't think anyone can answer that;
indeed, it requires some testing and experimenting. For what it is worth,
and if it helps, we've just made our SPM for Solr work on an Elasticsearch
cluster for one of our clients, precisely so we can use it to watch JVM/GC
information, as well as various OS performance metrics, while doing load
testing. This sort of stuff helps you figure out hardware limits, how much
hardware you need, etc. You can get SPM from our site, see the signature
below.

Otis

Sematext is hiring Elasticsearch / Solr developers --

Derrick,

Unfortunately I don't have any numbers handy. There really are quite a few
factors that impact the amount of resources required, and we did not do
thorough testing of each factor individually. As you guessed, you'll have
to run some tests to see what it looks like.
To determine heap usage, you can first look at memory usage with indexing
only (no queries), and then add the queries. You'll want to make your
queries as representative of the real usage as possible. Beware of query vs
filter, and whether the results are cached or not. Facets also seem to have
a significant impact on memory usage. Returning large sets of data may also
require large amounts of memory. In short, I think if you have control over
what the queries will be, you can determine resource requirements. If not,
it's a lot more difficult.
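
A sketch of the watch-the-heap part: poll the node stats API while the
indexing and query load runs. The stats URL and response layout have moved
around between ES versions (older releases used /_cluster/nodes/stats), so
adjust for yours:

    # Sketch: sample JVM heap usage per node during a load test.
    import time
    import requests

    STATS_URL = "http://localhost:9200/_nodes/stats/jvm"

    def heap_snapshot():
        nodes = requests.get(STATS_URL).json().get("nodes", {})
        return dict((info.get("name", node_id),
                     info["jvm"]["mem"]["heap_used_in_bytes"])
                    for node_id, info in nodes.items())

    for _ in range(10):        # ten samples, one per minute
        print(heap_snapshot())
        time.sleep(60)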

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

To be clear, are you suggesting one EBS volume per elasticsearch index? I
could end up with hundreds of small volumes, correct? Wouldn't I need to
run one ES node per volume?

Derrick, elasticsearch does not require you to have storage per index; you
are actually abstracted from it and don't need to care about it. Each node
is a "container" for shards (which belong to an index), and shards get
allocated and moved around between nodes. All you need to do is point the
data location of an elasticsearch node wherever you want.
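
In other words, something like this in elasticsearch.yml, with one EBS
volume mounted per node and a placeholder mount point; which shards end up
on which node is then elasticsearch's problem rather than yours:

    # elasticsearch.yml sketch: the node's data directory lives on an EBS
    # volume mounted at a placeholder path.
    path:
      data: /mnt/ebs0/elasticsearch/data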
