Questions on the scan search

I am using an ES cluster with thousands of documents spread across several
types in a single index. Rather small compared to the size of most ES
instances I see on the list.

I am deploying to EC2 using a local index and the S3 gateway. ES is the only
data store, so if I have to reindex my data because of a mapping change or
corruption of the S3 gateway, I would have no way to recover my original
documents.

I have a long-term solution to persist data to another data store for
safekeeping as it is written to ES. In the meantime, I have a job which
performs a scan search of all records in my ES index and writes them to S3.
It writes about 5,000 records in about 30 seconds, and most of that time is
spent writing the records one-by-one to S3 over HTTP. Not very efficient,
but it is working for now.
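For reference, that export job can be sketched as a scroll-and-write loop. This is a minimal sketch only: `fetch_page` and `write_record` are hypothetical callables standing in for the real Elasticsearch scroll request and the per-record S3 PUT, not actual client APIs.

```python
def scan_and_export(fetch_page, write_record):
    """Drain a scan/scroll-style cursor and persist every document.

    fetch_page(scroll_id) -> (next_scroll_id, docs); an empty docs list
    ends the scroll. write_record(doc) stores one document (e.g. an HTTP
    PUT to S3). Both callables are assumptions, not real client calls.
    """
    scroll_id, exported = None, 0
    while True:
        scroll_id, docs = fetch_page(scroll_id)
        if not docs:
            break
        for doc in docs:
            write_record(doc)  # one-by-one, as in the job described above
            exported += 1
    return exported

# Illustrative fake pages standing in for two scroll round-trips:
_pages = [("cursor-1", [{"_id": "1"}, {"_id": "2"}]),
          ("cursor-1", [{"_id": "3"}]),
          ("cursor-1", [])]
stored = []
count = scan_and_export(lambda sid: _pages.pop(0), stored.append)
# count → 3; stored now holds all three documents
```

The per-record `write_record` call is where the 30 seconds go; batching several records into one S3 object would cut the HTTP round-trips considerably.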

I cannot shut down the ES server for 30 seconds while I write these records,
so I have a couple of questions.

  1. When scan executes, does it cache all of the ids of the documents
    which match the query?
  2. As I fetch documents, does scan return me the version of the document
    which existed at the time of the initial scan, or at the time of the
    subsequent scrollId request?

Neither S3 nor SimpleDB seems to have a snapshot capability. Does anyone else
have any thoughts on how to back up ES? I imagine that most people are using
ES as a secondary store for search purposes only, but I think more and more
people want to ditch their primary storage in favor of ES.

Thanks.


Hi,

Scan is a snapshot of the state of the index when it was first executed, so you won't see changes that happen after you execute the first scan request.

My recommendation, if you are using ES this way (and in general on AWS), is to move to the local gateway. I assume you back things up now; one way is to use EBS and snapshot it to S3, or do the backup yourself. Note, though, that you can't move from an S3 gateway to a local gateway without reindexing the data, which you can do by scanning one cluster and indexing into the other.
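The scan-one-cluster, index-into-the-other approach amounts to a batched copy. A minimal sketch, assuming a hypothetical `scan_pages` iterable (pages of documents from a scan search on the source) and a `bulk_index` callable (one bulk indexing request to the target); neither is a real client API:

```python
def reindex(scan_pages, bulk_index, batch_size=2):
    """Copy documents from a source cluster into a target cluster.

    scan_pages: iterable of document lists, as returned page by page
    from a scan search on the source cluster (an assumption).
    bulk_index(docs): issues one bulk indexing request to the target
    (also an assumption). Returns the number of documents copied.
    """
    batch, copied = [], 0
    for page in scan_pages:
        for doc in page:
            batch.append(doc)
            if len(batch) >= batch_size:
                bulk_index(batch)
                copied += len(batch)
                batch = []
    if batch:  # flush the final partial batch
        bulk_index(batch)
        copied += len(batch)
    return copied

# Illustrative run with fake pages and a recording bulk_index:
calls = []
n = reindex([[{"_id": "a"}, {"_id": "b"}], [{"_id": "c"}]], calls.append)
# n → 3; two bulk requests were issued
```

Batching the writes is the main point: one bulk request per `batch_size` documents instead of one request per document.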

-shay.banon

On Thursday, June 9, 2011 at 3:29 AM, James Cook wrote:


The problem (as I understand it) with using a local gateway is that our
architecture scales up and down to meet demand. We are using Elastic
Beanstalk from Amazon.

When Beanstalk determines that an instance should be spun up based on demand,
it creates a new instance with an EBS volume. When Beanstalk determines that
an instance should be spun down, it simply terminates the instance. There is
no reuse or control of EBS.

Is there some issue with using S3 as a gateway?

-- jim

On Thu, Jun 9, 2011 at 2:57 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:


There isn't an issue with S3 itself; it just requires extra work to move the data to S3.

On Thursday, June 9, 2011 at 11:56 PM, James Cook wrote:
