Questions on the scan search

I am using an ES cluster with thousands of documents spread across several
types in a single index. Rather small compared to the size of most ES
instances I see on the list.

I am deploying to EC2 using a local index and the S3 gateway. ES is the only
data store, so if I have to reindex my data because of a mapping change or
corruption of the S3 gateway, I would have no way to recover my original
documents.

I have a long-term solution to persist data to another data store for
safekeeping as it is written to ES. In the meantime, I have a job which
performs a scan search of all records in my ES index and writes them to S3.
It writes about 5,000 records in about 30 seconds, and most of that time is
spent writing the records one-by-one to S3 over HTTP. Not very efficient,
but it is working for now.
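For reference, that export job can be sketched as a scroll-and-write loop. This is a minimal sketch only: `fetch_page` and `write_record` are hypothetical callables standing in for the real Elasticsearch scroll request and the per-record S3 PUT, not actual client APIs.

```python
def scan_and_export(fetch_page, write_record):
    """Drain a scan/scroll-style cursor and persist every document.

    fetch_page(scroll_id) -> (next_scroll_id, docs); an empty docs list
    ends the scroll. write_record(doc) stores one document (e.g. an HTTP
    PUT to S3). Both callables are assumptions, not real client calls.
    """
    scroll_id, exported = None, 0
    while True:
        scroll_id, docs = fetch_page(scroll_id)
        if not docs:
            break
        for doc in docs:
            write_record(doc)  # one-by-one, as in the job described above
            exported += 1
    return exported

# Illustrative fake pages standing in for two scroll round-trips:
_pages = [("cursor-1", [{"_id": "1"}, {"_id": "2"}]),
          ("cursor-1", [{"_id": "3"}]),
          ("cursor-1", [])]
stored = []
count = scan_and_export(lambda sid: _pages.pop(0), stored.append)
# count → 3; stored now holds all three documents
```

The per-record `write_record` call is where the 30 seconds go; batching several records into one S3 object would cut the HTTP round-trips considerably.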

I cannot shut down the ES server for 30 seconds while I write these records,
so I have a couple of questions.

  1. When scan executes, does it cache all of the ids of the documents
    which match the query?
  2. As I fetch documents, does scan return me the version of the document
    which existed at the time of the initial scan, or at the time of the
    subsequent scrollId request?

Neither S3 nor SimpleDB seems to have a snapshot capability. Does anyone else
have any thoughts on how to back up ES? I imagine that most people are using
ES as a secondary store for search purposes only, but I think more and more
people want to ditch their primary storage in favor of ES.

Thanks.


Hi,

Scan is a snapshot of the state of the index when it was first executed, so you won't see changes that happen after you execute the first scan request.

My recommendation, if you are using ES this way (and in general on AWS), is to move to the local gateway. I assume you back things up now; one way is to use EBS and snapshot it to S3, or do the backup yourself. Note, though, that you can't move from an S3 gateway to a local gateway without reindexing the data, which you can do by scanning one cluster and indexing into the other.
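The scan-one-cluster, index-into-the-other approach amounts to a batched copy. A minimal sketch, assuming a hypothetical `scan_pages` iterable (pages of documents from a scan search on the source) and a `bulk_index` callable (one bulk indexing request to the target); neither is a real client API:

```python
def reindex(scan_pages, bulk_index, batch_size=2):
    """Copy documents from a source cluster into a target cluster.

    scan_pages: iterable of document lists, as returned page by page
    from a scan search on the source cluster (an assumption).
    bulk_index(docs): issues one bulk indexing request to the target
    (also an assumption). Returns the number of documents copied.
    """
    batch, copied = [], 0
    for page in scan_pages:
        for doc in page:
            batch.append(doc)
            if len(batch) >= batch_size:
                bulk_index(batch)
                copied += len(batch)
                batch = []
    if batch:  # flush the final partial batch
        bulk_index(batch)
        copied += len(batch)
    return copied

# Illustrative run with fake pages and a recording bulk_index:
calls = []
n = reindex([[{"_id": "a"}, {"_id": "b"}], [{"_id": "c"}]], calls.append)
# n → 3; two bulk requests were issued
```

Batching the writes is the main point: one bulk request per `batch_size` documents instead of one request per document.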

-shay.banon

On Thursday, June 9, 2011 at 3:29 AM, James Cook wrote:


The problem (as I understand it) with using a local gateway is that our
architecture scales up and down to meet demand. We are using Elastic
Beanstalk from Amazon.

When Beanstalk determines that an instance should be spun up based on demand,
it creates a new instance with an EBS volume. When Beanstalk determines that
an instance should be spun down, it simply terminates the instance. There is
no reuse or control of EBS.

Is there some issue with using S3 as a gateway?

-- jim

On Thu, Jun 9, 2011 at 2:57 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:


There isn't an issue with S3 itself; it just requires extra work to move the data to S3.

On Thursday, June 9, 2011 at 11:56 PM, James Cook wrote:
