Suggestions on indexing LARGE, GROWING data sets


(derrickburns) #1

What are the policies surrounding local storage management when using
an S3 gateway? I notice that the local store IS used (temporarily?) when
S3 is used as a storage gateway. When does that data get deleted?
Does the ES node using that local storage watch to see if the local
disk space is filling up? Does it discard the data when the index
associated with that data is closed? How can I use one AWS instance
to index data (sequentially) into N different ES indices, where N is
unbounded? Do I have to clear the local store on close? Should I use
an in-memory index for this purpose?

Background

I am indexing a large data set that grows by over 10GB (index space)
per day. I need to be able to search the last N (N>100) days of data,
but I need to maintain the last N + M (M>500) days of data for archival
purposes.

My strategy is to put each day into a separate ES index. When I need
that index, I fire up an AWS EC2 instance, launch an ES node on that
instance, and open the index.
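The per-day rotation described above can be sketched in a few lines. The `logs-YYYY-MM-DD` naming and the exact N/M cutoffs are my assumptions for illustration, not anything ES imposes:

```python
from datetime import date, timedelta

def partition_daily_indices(today, n_search=100, m_archive=500):
    """Split per-day indices into a searchable window (kept open) and an
    archival tail (kept closed, restorable from the gateway on demand)."""
    name = lambda i: f"logs-{today - timedelta(days=i):%Y-%m-%d}"
    searchable = [name(i) for i in range(n_search)]
    archived = [name(i) for i in range(n_search, n_search + m_archive)]
    return searchable, archived

searchable, archived = partition_daily_indices(date(2012, 1, 13))
```

A scheduled job could then close the index that ages out of the searchable window each day and open (or create) the newest one.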

This works well if my EC2 instance uses a local ES gateway and the
data for the index is already stored in the local store.

Instead, I'd like to use an S3 gateway so that I don't have to worry
about storage.

In the S3 gateway scenario, each ES node is assigned (shards of) a
newly opened index and presumably the shard data is loaded from the S3
gateway.

However, the data is also stored locally on disk during the indexing
process. So, I must provision local disk space even though an S3
gateway is used for the backing store. This is fine if I can bound
the size of the local space needed, but if I am simply using the ES
node to index data into a number of different ES indices, then I want
the local space associated with an ES index to be freed when the ES
index is closed.
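For reference, an era-appropriate sketch of what the S3 gateway setup in elasticsearch.yml looked like in the 0.x gateway module; the bucket name and credentials are placeholders:

```yaml
# elasticsearch.yml -- illustrative S3 gateway settings (0.x-era gateway
# module); bucket name and credentials are placeholders
gateway:
  type: s3
  s3:
    bucket: my-es-gateway-bucket
cloud:
  aws:
    access_key: <aws-access-key>
    secret_key: <aws-secret-key>
```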


(Shay Banon) #2

Data is stored locally in the S3 gateway case. Moreover, if you do things
like a full restart, it will make sure to recover shards on nodes that have
the most data in common with S3. You can delete the local storage of an
index and it will be recovered from S3 once you need it. When you close an
index, its data will not be deleted. Note, be wary of the time it takes to
actually download the data from S3 to the relevant nodes.
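Since closing an index does not free its local copy, one option, assuming (per the answer above) that S3 holds the authoritative copy and ES re-downloads on open, is to delete the closed index's local shard directory yourself. The layout below mimics the default per-node data path of that era; the demo works in a temp directory rather than against a live node:

```python
import shutil, tempfile
from pathlib import Path

# Simulated default per-node layout: <data>/<cluster>/nodes/0/indices/<index>
data_dir = Path(tempfile.mkdtemp()) / "mycluster" / "nodes" / "0" / "indices"
for name in ("logs-2012-01-10", "logs-2012-01-11"):
    (data_dir / name).mkdir(parents=True)

def free_local_space(index_name):
    """After closing an index backed by the S3 gateway, drop its local
    shard files; ES recovers them from S3 the next time it is opened."""
    shutil.rmtree(data_dir / index_name)

free_local_space("logs-2012-01-10")   # archived day: reclaim local disk
```

The trade-off, as noted above, is the S3 download time the next time that index is opened.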

On Fri, Jan 13, 2012 at 9:13 PM, Derrick derrickrburns@gmail.com wrote:


(derrickburns) #3

Do you have a guideline for the maximum number of ES data nodes for a
single cluster? If I have

  • 200 data nodes in a single cluster
  • with each data node on a separate AWS instance
  • with the ES cluster backed by an S3 gateway
  • with each data node starting with NO local state (MEMORY is primary
    storage for index);

How long will it take to start the cluster? Will the loading from S3 to
each data node occur in parallel, so the time is the same as loading a
single data node from S3? Or does the time to restart grow in proportion
to the number of data nodes?

I am trying to avoid having to manage cluster state outside of ES, i.e.
tracking which AWS instance/volume has which ES index. However, if the
time to restart the cluster is prohibitive, then I will resort to launching
specific AWS instances/volumes.


(Shay Banon) #4

You don't need to track which machine is running which index; elasticsearch
abstracts that away from you. Indices get allocated across the
cluster.

When recovering from S3, yes, it will be done in parallel across the nodes.
An index shard gets allocated to a node, and then it reads its data
from S3.
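A rough way to quantify the parallel-recovery point above; the bandwidth, shard count, and data sizes here are made-up illustration values, not measurements:

```python
def restart_time_estimate(total_bytes, shard_count, node_count, node_bw_bytes_s):
    """Recovery from the S3 gateway runs in parallel: each node downloads
    only the shards allocated to it, so wall-clock restart time is driven
    by the most heavily loaded node, not by the total node count."""
    shards_per_node = -(-shard_count // node_count)   # ceiling division
    bytes_per_shard = total_bytes / shard_count
    return shards_per_node * bytes_per_shard / node_bw_bytes_s

# ~6 TB across 600 shards on 200 nodes, ~50 MB/s of S3 throughput per node:
t = restart_time_estimate(6e12, 600, 200, 50e6)   # → 600.0 seconds
```

Under this model, restart time stays flat as nodes are added (until there is one shard per node), which matches the "parallel across the nodes" answer.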

On Mon, Jan 16, 2012 at 9:39 PM, Derrick derrickrburns@gmail.com wrote:

