I've a file of several million documents that I'd like to load into an
elasticsearch cluster, hosted on 4 ec2 micro instances using the S3 gateway.
What's the best way to load the index up? I'm using another machine and
the java client, but it's really taking a long time and I'm finding the
instances in the cluster keep falling over.
I ended up solving the problem by doing the following on each node:
* Shrinking the max heap to 450M
* Adding swap on the EBS node as follows:
  * /bin/dd if=/dev/zero of=/ebs/swap.1 bs=1M count=1024
  * mkswap /ebs/swap.1
  * swapon /ebs/swap.1
After that, I was able to add 12M small documents smoothly in several hours.
Ranjan
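For anyone following along, here is a rough sketch of the arithmetic behind those two changes. The 613 MiB RAM figure for a micro instance is my assumption (it is not stated in this thread), and the function names are made up for illustration.

```python
MIB = 1024 * 1024


def swap_file_bytes(block_size_bytes: int, count: int) -> int:
    """Size of the file written by `dd bs=<block_size> count=<count>`."""
    return block_size_bytes * count


def headroom_mib(ram_mib: int, heap_mib: int) -> int:
    """RAM left over for the OS, file cache, and non-heap JVM memory."""
    return ram_mib - heap_mib


# Assumption: RAM of a micro instance; not given in the thread.
ASSUMED_MICRO_RAM_MIB = 613

# dd with bs=1M count=1024 writes 1024 blocks of 1 MiB each.
swap_mib = swap_file_bytes(1 * MIB, 1024) // MIB
print("swap file size (MiB):", swap_mib)
print("RAM outside a 450M heap (MiB):", headroom_mib(ASSUMED_MICRO_RAM_MIB, 450))
```

Under that assumption, a 450M heap leaves only a small amount of physical memory for everything else, which may be why the added swap made the nodes stop falling over.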
Why is it counterintuitive? Amazon states that micro instances are "well
suited for lower throughput applications and web sites that require
additional compute cycles periodically". The Amazon documentation also
describes the types of workloads that are not suitable for micro instances,
and most ES uses (if not all) don't match the usage pattern that Amazon
describes as suitable.
It may be possible to run ES on micro instances with proper tuning, but
it's likely not a good idea to use them in production.
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
Well, saying "don't use micro instances" is very different from "you may have
to tune them". Please explain "likely not a good idea". Based on what
values? This is not me arguing; this is honest curiosity. What makes me
think it's counterintuitive is that the whole point of horizontal scaling
is that it's, well... horizontal.
We are a lean startup. We have millions of docs, not millions of users,
and running at the next instance tier up is far more expensive. Right now
we are running a four-node cluster, so it costs about $40/month to run our
cluster as micros, and it performs pretty well as long as we add swap. I am
also pretty sure it runs as well as, if not better than, the same cluster
with small instances and two nodes (more expensive). We would love to be
able to afford a better cluster, but unless there is a need we would prefer
to run lean. We also want to do lean geospatial search projects for
non-profits like Code for Oakland, which may just not be willing to take on
projects that require major overhead.
So if this is a nonstarter, we would love to know.
Respectfully,
Clay
On Wednesday, April 18, 2012 10:28:17 AM UTC-7, Berkay Mollamustafaoglu
wrote:
Why is it counterintuitive? Amazon states that micro instances are "well
suited for lower throughput applications and web sites that require
additional compute cycles periodically". The Amazon documentation also
describes the types of workloads that are not suitable for micro instances,
and most ES uses (if not all) don't match the usage pattern that Amazon
describes as suitable.
It seems counter intuitive to me that Elasticsearch is not recommended
for a micro profile. Seems more a question of tuning and shard planning to
me.
On Tuesday, April 17, 2012 6:06:37 AM UTC-7, kimchy wrote:
Don't use micro instances
On Mon, Apr 16, 2012 at 5:23 AM, Ranjan Bagchi wrote:
Hi,
I've a file of several million documents that I'd like to load into an
elasticsearch cluster, hosted on 4 ec2 micro instances using the S3 gateway.
What's the best way to load the index up? I'm using another machine
and the java client, but it's really taking a long time and I'm finding the
instances in the cluster keep falling over.
I have run test instances of ES on micro instances successfully. I had
some spares around (used for low-intensity processes), so I started 1-3
node clusters. Is your issue with the initial indexing? If you are using
bulk indexing, I would keep the number of documents per bulk request very
low (the right number depends on your document size). That said, I would
not recommend using micro instances.
For searching, I would avoid any type of faceted search. Disable as
many indexed fields, term vectors, and norms as possible.
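To illustrate the "keep the bulk size low" advice, here is a minimal sketch that splits documents into small batches and formats each batch as a newline-delimited bulk request body (an action line followed by the document source, with a trailing newline). It only builds the request bodies and does not talk to a cluster. The index name "docs", the type name "doc", and the helper names are hypothetical, and the `_type` field reflects ES versions of this era.

```python
import json
from typing import Dict, Iterable, Iterator, List


def batches(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `size` documents, preserving order."""
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover documents that did not fill a whole batch
        yield batch


def bulk_body(batch: List[Dict], index: str, doc_type: str) -> str:
    """Format one batch as a newline-delimited bulk body."""
    lines = []
    for doc in batch:
        # Action line: tells the bulk endpoint to index the next line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_docs = [{"n": i} for i in range(5)]  # stand-in for the real file
    for small_batch in batches(sample_docs, size=2):
        print(bulk_body(small_batch, index="docs", doc_type="doc"), end="")
```

Each body would then be sent as one bulk request. Starting with a small batch size and raising it only while the nodes stay up seems like the safer way to find a workable number on memory-constrained instances.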
The shortcomings of the micro instances are explained in the Amazon doc
I've referenced in the previous email. These are essentially highly
oversubscribed VMs. It's not only a matter of configuration parameters on
your part; it's also highly impacted by the activities of the other
instances on the same server. This leads me to believe that performance
would be very unpredictable and user experience can suffer, especially when
usage is not uniform.
If it works for your use cases, great.
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
Amazon documentation on micro instances:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html