Loading data on ec2 micro instances

Hi,

I've a file of several million documents that I'd like to load into an
elasticsearch cluster, hosted on 4 ec2 micro instances using the S3 gateway.

What's the best way to load the index up? I'm using another machine and
the java client, but it's really taking a long time and I'm finding the
instances in the cluster keep falling over.

Any help?

Thanks,

Ranjan

Don't use micro instances

On Mon, Apr 16, 2012 at 5:23 AM, Ranjan Bagchi ranjan.bagchi@gmail.com wrote:

Hi,

I've a file of several million documents that I'd like to load into an
elasticsearch cluster, hosted on 4 ec2 micro instances using the S3 gateway.

What's the best way to load the index up? I'm using another machine and
the java client, but it's really taking a long time and I'm finding the
instances in the cluster keep falling over.

Any help?

Thanks,

Ranjan

So I ended up solving the problem by doing the following on each node:
* Shrinking the max heap to 450M
* Adding swap on the EBS volume as follows:
  * /bin/dd if=/dev/zero of=/ebs/swap.1 bs=1M count=1024
  * mkswap /ebs/swap.1
  * swapon /ebs/swap.1

I was able to add 12M small documents in several hours smoothly after that.

Ranjan
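
The swap steps Ranjan lists could be wrapped in a small script along these lines. This is only a sketch: the function names, the `DRY_RUN` switch and the default arguments are hypothetical, and the `chmod 600` step is an addition that is not in the post (recent versions of `mkswap` warn about world-readable swap files). The post does not say how the 450M heap limit was set, so that part is left out. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: print (dry run) or execute the swap-file steps from the post.
# swap_cmds, add_swap and DRY_RUN are hypothetical names for this example.

# Emit the commands for creating and enabling a swap file of $2 MiB at $1.
swap_cmds() {
    swap_file=$1
    swap_mb=$2
    # dd, mkswap and swapon are the steps from the post;
    # chmod 600 is an extra step added here (assumption, not in the post).
    printf '%s\n' \
        "dd if=/dev/zero of=$swap_file bs=1M count=$swap_mb" \
        "chmod 600 $swap_file" \
        "mkswap $swap_file" \
        "swapon $swap_file"
}

# Print the commands when DRY_RUN=1 (the default), otherwise run them.
# Running for real needs root and a path without spaces.
add_swap() {
    swap_file=${1:-/ebs/swap.1}
    swap_mb=${2:-1024}
    if [ "${DRY_RUN:-1}" = "1" ]; then
        swap_cmds "$swap_file" "$swap_mb"
    else
        swap_cmds "$swap_file" "$swap_mb" | sh -e
    fi
}
```

Calling `add_swap` with no arguments prints the commands for a 1024 MiB file at /ebs/swap.1, matching the post; `DRY_RUN=0 add_swap /ebs/swap.1 1024` would actually execute them.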

On Tuesday, April 17, 2012 6:06:37 AM UTC-7, kimchy wrote:

Don't use micro instances

It seems counterintuitive to me that Elasticsearch is not recommended for
a micro profile. It seems more a question of tuning and shard planning to me.

Why is it counterintuitive? Amazon states that micro instances are "well
suited for lower throughput applications and web sites that require
additional compute cycles periodically". The Amazon documentation shows the
types of workloads that are not suitable for micro instances, and most
ES uses (if not all) don't have behavior that matches the behavior
pattern described by Amazon.
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html

It may be possible to run ES on micro instances with proper tuning, but
it's likely not a good idea to use them in production.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Wed, Apr 18, 2012 at 12:29 PM, Clay Graham clay@welocally.com wrote:

It seems counterintuitive to me that Elasticsearch is not recommended
for a micro profile. It seems more a question of tuning and shard planning to
me.

Well, saying "don't use micro instances" is way different from "you may have
to tune them". Please explain "likely not a good idea". Based on what
values? This is not me arguing, this is honest curiosity. What makes me
think it's counterintuitive is that the whole point of horizontal scaling
is that it's, well... horizontal.

We are a lean startup; we have millions of docs, not millions of users,
and running at the next instance tier up is way more expensive.
Right now we are running a four-node cluster, so it costs about
$40/month to run our cluster as micros, and it performs pretty well as long
as we add swap. I am also pretty sure it runs as well as, if not better than,
the same cluster with small instances and two nodes (more expensive). We
would love to be able to afford a better cluster, but unless there is a need
we would prefer to run lean. We also want to do lean geospatial search
projects for non-profits like Code for Oakland that may just not be willing
to do projects that require major overhead.

So if this is a nonstarter we would love to know.

Respectfully,

Clay

On Wednesday, April 18, 2012 10:28:17 AM UTC-7, Berkay Mollamustafaoglu
wrote:

Why is it counterintuitive? Amazon states that micro instances are "well
suited for lower throughput applications and web sites that require
additional compute cycles periodically". The Amazon documentation shows the
types of workloads that are not suitable for micro instances, and most
ES uses (if not all) don't have behavior that matches the behavior
pattern described by Amazon.
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html

It may be possible to run ES on micro instances with proper tuning, but
it's likely not a good idea to use them in production.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

I have run test instances of ES on micro instances successfully. I had
some spares around (used for low-intensity processes), so I started 1-3
node clusters. Is your issue with the initial indexing? If you are using
bulk indexing, I would keep the batch size very low (how low depends on
your document size). That said, I would not recommend using
micro instances.

For searching, I would avoid any type of faceted search. Disable
indexing, term vectors, and norms on as many fields as possible.

--
Ivan
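
One way to follow Ivan's advice on small bulk requests is to split the input into small chunks before sending it. The sketch below assumes the documents are already in Elasticsearch bulk format (one action line followed by one source line per document) and are posted to the `_bulk` HTTP endpoint instead of through the Java client. The function names, the default batch size of 100 documents and the one-second pause are illustrative assumptions, not values from the thread.

```shell
#!/bin/sh
# Sketch only: split a bulk file into small batches and post them one by one.
# split_bulk, post_chunks and all default values are hypothetical.

# Split a bulk-format file into chunk files of at most $3 documents each.
# Each document takes two lines (action + source), hence the factor of 2.
split_bulk() {
    bulk_file=$1
    out_prefix=$2
    batch_docs=${3:-100}
    split -l $((batch_docs * 2)) "$bulk_file" "$out_prefix"
}

# Post every chunk in sequence and stop at the first request-level failure,
# so a struggling node is not sent further requests.
post_chunks() {
    es_url=$1
    out_prefix=$2
    for chunk in "$out_prefix"*; do
        curl -sS --fail -XPOST "$es_url/_bulk" \
            -H 'Content-Type: application/x-ndjson' \
            --data-binary "@$chunk" >/dev/null || return 1
        sleep 1   # assumed pause to let a small instance catch up
    done
}
```

A run might look like `split_bulk docs.bulk /tmp/chunk. 50` followed by `post_chunks http://localhost:9200 /tmp/chunk.`. Note that the bulk API can return HTTP 200 even when individual items fail, so `--fail` only catches request-level errors; checking the response body would be needed for per-document failures.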

On Wed, Apr 25, 2012 at 1:27 PM, Clay Graham clay@welocally.com wrote:

Well, saying "don't use micro instances" is way different from "you may have
to tune them". Please explain "likely not a good idea". Based on what
values? This is not me arguing, this is honest curiosity. What makes me
think it's counterintuitive is that the whole point of horizontal scaling is
that it's, well... horizontal.

We are a lean startup; we have millions of docs, not millions of users, and
running at the next instance tier up is way more expensive.
Right now we are running a four-node cluster, so it costs about
$40/month to run our cluster as micros, and it performs pretty well as long
as we add swap. I am also pretty sure it runs as well as, if not better than,
the same cluster with small instances and two nodes (more expensive). We would
love to be able to afford a better cluster, but unless there is a need we
would prefer to run lean. We also want to do lean geospatial search projects
for non-profits like Code for Oakland that may just not be willing to do
projects that require major overhead.

So if this is a nonstarter we would love to know.

Respectfully,

Clay

The shortcomings of micro instances are explained in the Amazon doc
I referenced in my previous email. These are essentially highly
oversubscribed VMs. Performance is not only a matter of configuration
parameters on your part; it is also heavily affected by the activity of the
other instances on the same server. This leads me to believe that performance
would be very unpredictable and that user experience can suffer, especially
when usage is not uniform.

If it works for your use cases, great.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
