Elasticsearch on EC2: load-average problem

Hi everybody,

I have been using Elasticsearch for some months now, and I have a problem that I
don't really know how to solve.
I have an iPhone application that sends notifications. Given the number
of users we have, we may have to send 50k+ notifications per hour.
The notifications to send are stored in an index in Elasticsearch, and once
a notification is sent, it is logged in another index. To send the
notifications, each one is put on a queue as soon as it is created, and a
worker picks it up and sends it.
The problem is that when the notifications are sent, the load average of
the Elasticsearch instances becomes very high, and I keep getting Nagios
alerts (sometimes the load average is > 8). And because we use
Elasticsearch for other parts of the app (search, ...), this slows them
down.

In our architecture on AWS, we have 2 Elasticsearch instances behind a
load balancer (ELB). The instances are m1.large (I started with m1.small,
then upgraded to c1.medium, then went to m1.large). Elasticsearch is given
6GB of memory (out of 7.2GB for the instance). The index configuration is
the default one: 5 shards and 1 replica.
And because I can't afford to lose data, I set up the S3 "backup".
I am using Elasticsearch 0.19.8.
The main index size is about 30GB. The logs index size is much smaller.

So, I'd like to know what the problem is here. Are m1.large instances
not big enough for my usage? (That would be bad, because I just bought 2
reserved instances last month...) Do I have to change something in the
configuration?

If you need more data on my configuration, feel free to ask!

Thanks in advance for your help,


Hello,

I'll take a shot in the dark here and assume that you've allocated too much
memory to ES (usually 50% of the system RAM is a good starting point).
This leaves very little room for OS caches (and of the 1.2GB left, some of
it will be used by the OS itself). Indexing is CPU- and I/O-intensive, so
with little OS cache, you probably hit the disks more often, causing I/O
waits, which might explain your load figures.

But that's just a shot in the dark. You can easily confirm or rule it out by
lowering the amount of memory you allocate to ES, restarting, and seeing if
anything changes.
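
For what it's worth, on 0.19.x the heap size is normally set through an
environment variable read by the startup scripts, so trying a smaller
allocation is cheap. A minimal sketch, assuming the stock bin/elasticsearch
and elasticsearch.in.sh scripts (older scripts use ES_MIN_MEM/ES_MAX_MEM
instead):

    # give ES roughly half of the instance's RAM, leaving the rest to the OS page cache
    export ES_HEAP_SIZE=3500m
    bin/elasticsearch        # restart the node so the new heap size takes effect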

To get more information, I'd suggest you use a monitoring solution to take
a deeper look at what's happening:

  • what's the bottleneck? Is it really CPU? Or is it I/O, or too little
    memory and a lot of garbage collection that's causing the load? (a quick
    way to check this from the shell is sketched right after this list)
  • you can see whether you allocated too much memory to ES, or whether you
    actually need more, which would mean upgrading your instances
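
For example, a rough way to tell raw CPU work, I/O wait and swapping apart
from the shell (assuming standard Linux tools, with iostat coming from the
sysstat package):

    top -b -n 1 | head -n 5    # "%us" is real CPU work, "%wa" is time spent waiting on disk
    vmstat 1 5                 # a high "wa" column or non-zero si/so points at I/O or swapping
    iostat -x 1 5              # per-device stats; %util close to 100 means the disk is the limit

If %us dominates, it really is CPU; if %wa dominates, it's I/O (or too
little OS cache).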

If you don't already have a preferred monitoring solution for ES, I'd
suggest you have a look at our SPM:
http://sematext.com/spm/elasticsearch-performance-monitoring/index.html

I think high load during indexing can be caused by one of the following:

  • you have too little memory allocated to ES, garbage collection eats a lot
    of CPU, and CPU becomes the bottleneck (the node stats check sketched
    after this list shows whether GC is the culprit)
  • you have too much memory allocated to ES, there's too little OS cache to
    help with I/O, and the extra stress on I/O causes the high load
  • neither of the above is a problem, and you simply need machines with more
    I/O throughput and/or CPU power
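
On the ES side, the node stats API reports heap usage and GC times per node,
which helps with the first case. A minimal sketch, assuming a node reachable
on localhost:9200 (on 0.19.x the node stats should live under
/_cluster/nodes/stats; newer releases expose them as /_nodes/stats):

    # look at the "jvm" section: heap used vs. committed and GC collection counts/times;
    # GC time climbing during the notification bursts points at heap pressure
    curl -s 'http://localhost:9200/_cluster/nodes/stats?pretty=true'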

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Hi,

In fact, I have already tried changing the memory allocated to ES both up
and down; it's now lowered to 4GB to see if the problem continues.
I already use a monitoring solution for Elasticsearch, which is bigdesk.
From what I saw in bigdesk, the bottleneck is the CPU, because it is maxed
out during those peaks. I will post the screenshots this afternoon when the
problem happens (because unfortunately, I'm sure that it will).

Regards,


Hi,

Here is a screenshot during the problem.
Maybe there is something obvious here that needs optimizing, but I can't
see what it is.

Thanks,


Hello,

I don't see anything obvious. One question, though: does your indexing and
search performance drop below acceptable levels during that time? Or is it
just the alerts from Nagios that are bugging you? Because if it's the
latter, you can change the alert thresholds in Nagios.

Assuming that's not the case, there are a few things that might help:

  • change translog settings to commit less often
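
Purely as an illustration of that suggestion, changing the flush thresholds
through the index settings API could look roughly like this, assuming an
index named "notifications" on localhost:9200 (the threshold values are made
up, and if these settings turn out not to be updatable at runtime on 0.19.x,
they can go into elasticsearch.yml and be applied with a restart instead):

    # flush (commit) the translog less often: only after ~50k operations
    # or ~500MB of translog, whichever threshold is hit first
    curl -XPUT 'http://localhost:9200/notifications/_settings' -d '{
      "index.translog.flush_threshold_ops": 50000,
      "index.translog.flush_threshold_size": "500mb"
    }'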

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


On Thu, Mar 28, 2013 at 1:52 PM, Radu Gheorghe
radu.gheorghe@sematext.com wrote:

  • change translog settings to commit less often

Sorry for hijacking the thread, but I have a question.

I see two options: index.translog.flush_threshold_ops and
index.translog.flush_threshold_size.

I'm not sure how these settings coexist. Is it whichever comes first?

I have a setup where docs are added at a rate of 9k per second, and with the
default settings it seems that my setup flushes every second. I want to
optimize it for maximum indexing speed. What values should I set? Also, the
system is low on memory, so does increasing these values have a significant
effect on memory usage?

--
Regards,
Abhijeet Rastogi (shadyabhi)
http://blog.abhijeetr.com


On Thu, 2013-03-28 at 14:19 +0530, Abhijeet Rastogi wrote:

On Thu, Mar 28, 2013 at 1:52 PM, Radu Gheorghe
radu.gheorghe@sematext.com wrote:

  • change translog settings to commit less often

Sorry for hijacking the thread, but I have a question.

I see two options: index.translog.flush_threshold_ops and
index.translog.flush_threshold_size.

I'm not sure how these settings coexist. Is it whichever comes first?

Yes. You could have just 5 ops, but if each op is indexing a 100MB
document, then the size threshold would trigger before the ops threshold.

I have a setup where docs are added at a rate of 9k per second, and with the
default settings it seems that my setup flushes every second. I want to
optimize it for maximum indexing speed. What values should I set? Also, the
system is low on memory, so does increasing these values have a significant
effect on memory usage?

I don't think it will affect memory usage, but I'm not absolutely sure.
Presumably your 9k docs are all small? You probably want to increase the
ops value.
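
As a purely hypothetical illustration, for an index named "docs" receiving
~9k small documents per second, raising the ops threshold so it flushes
roughly every half minute instead of every second might look like:

    # a bigger translog between flushes lives mostly on disk; per the above,
    # it is not expected to change heap usage much, though that isn't guaranteed
    curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
      "index.translog.flush_threshold_ops": 300000
    }'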

clint


Thanks, Clint. That's exactly what I wanted to know.


--
Regards,
Abhijeet Rastogi (shadyabhi)
http://blog.abhijeetr.com
