Cluster not able to keep up?


(tdjb) #1

We have recently built out our Elasticsearch cluster and are now running
close to our true volume of data through it. Unfortunately it seems like
our cluster basically runs out of steam after a bit and can't keep up.

The cluster consists of four physical machines, each with 32 CPUs and 252gb of
memory. We are currently running three ES instances on each box: a search
instance, a master instance, and a data instance. We are currently inserting
about 34k-40k documents a second that vary in size but are usually in the
1kb-6kb range (log entries). We run what we call our consumers on other
hardware; these are what pull the messages in, format them, and then send
them to Elasticsearch. They are written in Java and use the transport
client + bulk API to send the documents. We are using version 0.90.5 on
Java 7u25.
We currently keep one index per day with the following settings:

"index.number_of_replicas" : "1",
"index.number_of_shards" : "8",
"index.indexing.slowlog.threshold.index.warn": "10s",
"index.refresh_interval" : "10s"
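For scale, here is a back-of-the-envelope sketch of what one day's index looks like under these settings. The rates and document size are just the midpoints of the ranges above, not measurements:

```python
# Back-of-the-envelope sizing for one daily index. DOCS_PER_SEC and
# AVG_DOC_KB are assumed midpoints of the quoted ranges, not measured values.
DOCS_PER_SEC = 37_000   # midpoint of 34k-40k
AVG_DOC_KB = 3          # midpoint of 1kb-6kb
SHARDS = 8
REPLICAS = 1

docs_per_day = DOCS_PER_SEC * 86_400
primary_gb = docs_per_day * AVG_DOC_KB / 1024 / 1024
total_gb = primary_gb * (1 + REPLICAS)  # replica copies are indexed too
gb_per_shard_copy = total_gb / (SHARDS * (1 + REPLICAS))

print(f"docs/day:       {docs_per_day:,}")
print(f"primary data:   {primary_gb:,.0f} gb/day")
print(f"with replicas:  {total_gb:,.0f} gb/day")
print(f"per shard copy: {gb_per_shard_copy:,.0f} gb/day")
```

If the documents really average ~3kb, each of the 16 shard copies absorbs on the order of a terabyte per day, which gives a feel for the merge pressure involved.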

The issue we seem to be having is that the cluster does great for a while,
sometimes up to 30-40 minutes, but then it starts acting like it is unable
to keep up. Our insert rate becomes erratic (it should be pretty steady)
and the results take longer and longer to show up. We had seen some queueing
in the bulk thread pool, so we thought we'd try increasing it to:

"threadpool.bulk.type" : "fixed",
"threadpool.bulk.size" : "1024",
"threadpool.bulk.queue_size" : "2000"

That seemed to get rid of the queue issue but didn't change the fact that
the cluster just stops keeping up. We also noticed that our merge times
seem to be all over the place:

[2014-01-03 08:06:43,486][DEBUG][index.merge.scheduler ] [data]
[2014.01.03][21] merge [_22u] done, took [29.2s]
[2014-01-03 08:06:44,520][DEBUG][index.merge.scheduler ] [data]
[2014.01.03][13] merge [_1pv] done, took [4.3m]
[2014-01-03 08:06:44,683][DEBUG][index.merge.scheduler ] [data]
[2014.01.03][29] merge [_yf] done, took [22.5m]
[2014-01-03 08:06:47,908][DEBUG][index.merge.scheduler ] [data]
[2014.01.03][8] merge [_1y0] done, took [1.2m]
[2014-01-03 08:06:48,685][DEBUG][index.merge.scheduler ] [data]
[2014.01.03][4] merge [_1w3] done, took [3.2m]
[2014-01-03 08:06:48,785][DEBUG][index.merge.scheduler ] [data]
[2014.01.03][12] merge [_1vf] done, took [31.1s]

We are new to Elasticsearch but have to assume that merges taking that long
are bad, right?

Is this just a case of our cluster not being able to support our volume, or
are there some settings we can tune to get this working right?

TIA

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/788ece22-516e-4938-945c-842eeaaf43ee%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

I have a similar setup.

The merge times are no problem at all, they look good. Yes, it may take
20min to merge gigabyte-sized segments (you can streamline that for lower
numbers). This is a background process and does not halt any other part of
ES.

A threadpool size of 1024 is definitely way too high. You should set it to
at most the number of cores; otherwise you will starve the JVM thread
manager.
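That rule of thumb can be expressed as a small sketch. The helper and the queue_size default are hypothetical illustrations, not ES internals:

```python
# Hypothetical helper illustrating the rule of thumb above: size a fixed
# bulk pool by core count rather than an arbitrary large number.
def bulk_pool_settings(cores: int, queue_size: int = 200) -> dict:
    # Threads beyond the core count just compete for the same CPUs and
    # add scheduling overhead; they buy no extra indexing throughput.
    return {
        "threadpool.bulk.type": "fixed",
        "threadpool.bulk.size": cores,
        "threadpool.bulk.queue_size": queue_size,
    }

# For the 32-core boxes in the original post: 32 bulk threads, not 1024.
print(bulk_pool_settings(32))
```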

Raising the queue_size might help on small systems, but not on a large one
like yours. And that is just a symptom, not the cause of the trouble.

So do you monitor the OS? Does it swap when the insert rate starts to become
erratic (whatever that means; some numbers would help)? What about disk I/O
rates? How large are the segment files in the data folder?

What do you mean by 252g of memory? On its own it is not very relevant to
ES... what are the ES heap size and the ES process size?

Jörg



(tdjb) #3

Jörg, we are running all 4 data nodes with 20gb of heap, and our monitoring
shows that we are not maxing that out. At most we hit about 60% total heap
usage, and our GCs look OK (no long ones, mostly new gen, etc.). The I/O
looks good via iostat, no waiting or anything. I'll have to get concrete
numbers during our next test, but in our last test I remember only seeing
about 5mb-10mb of writes a second. I'm not sure what filesystem is in use,
but looking at the segment sizes via the API they appear to range from 4mb
all the way to 1.5gb, with most of them somewhere in the middle. The system
never swaps during any of this.

As for what erratic means: we'll see a steady insert rate that pairs up
with the rate we are pulling messages from Kafka. This will be anywhere
from 34k to 40k a second and can stay that way for up to 30 minutes. Then
at some point something happens and the ES insert rate drops, say from 34k
down to 20k. Then the rate shoots back up to 30k, then back down to 25k,
and so on. Once this up and down starts, it never goes back to a steady
rate again. This whole time the incoming rate from Kafka stays flat, as it
should. This happens to both of our insert apps at the same time; these two
apps live on their own hardware outside of the ES cluster. Once the insert
rate starts going up and down instead of staying flat, the search results
get more and more delayed (which makes sense).

It's almost like ES can keep up with our indexing rate for a while, then at
some point it just can't; it starts to fall behind and never recovers.



(tdjb) #4

Just to add, it appears that each of the 4 ES nodes is only taking in about
8mb-10mb a second, so it doesn't seem like we're overloading the network
either (these are on gigabit links).
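Those figures can be cross-checked against the document rate. This is my own arithmetic, using the midpoints of the ranges quoted in this thread:

```python
# Cross-check observed network-in per node against the document rate.
# All inputs are midpoints of ranges quoted in the thread, not exact values.
mb_in_per_node = 9        # observed 8mb-10mb/s per node
data_nodes = 4
docs_per_sec = 37_000     # 34k-40k docs/s

total_mb_per_sec = mb_in_per_node * data_nodes
implied_kb_per_doc = total_mb_per_sec * 1024 / docs_per_sec
link_utilization = mb_in_per_node / 125   # gigabit is ~125 mb/s theoretical

print(f"implied doc size: ~{implied_kb_per_doc:.1f} kb")
print(f"link utilization: ~{link_utilization:.0%} per node")
```

The implied average document size lands near the low end of the stated 1kb-6kb range, and each link sits far below gigabit capacity, consistent with the network not being the bottleneck.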



(Jörg Prante) #5

This all sounds good from a distance... a wild guess: since 8-10mb a sec is
pretty good (I have never seen more on the systems I have here, also with
gigabit), there may be some internal limits active which need tuning (maybe
index throttling or thread pools?), or adding nodes could be the next
step...

ES does not create CPU load in a constant manner; it is more cyclic. It has
to start the Lucene merges in the background, and when they kick in, the
indexing and searching performance is affected for seconds or minutes,
mostly depending on the I/O throughput capacity of the disk subsystem. So
without seeing live monitoring data, it is hard to make an educated guess
whether it's the merging effect or something else going on.

Maybe this is the point where you should ask the ES core team for
professional advice on how to stabilize maximum performance over time.

Jörg



(Mark Walkom) #6

Try using monitoring plugins like
elastichq/kopf/bigdesk/elasticsearch-monitoring; these might give you a
better idea of what is happening within ES.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(tdjb) #7

So we were able to secure some temporary machines to increase the cluster
and that seemed to fix the issue. We set the bulk threads back to the
default, added more machines and now seem to be able to handle ~40k a
second in a stable manner for a long period of time.



(Otis Gospodnetić) #8

Hi,

I bet it's Lucene segment merges. You have more machines so you can
sustain high input rates longer, but I bet you'll hit the moment when the
indexing rate drops again.

Check this graph:
https://apps.sematext.com/spm-reports/s/eUgWhPqZrg
(just look at the last big "tooth")
Or instead of looking at the # of docs growing slower and slower (while the
input rate remains the same, like in our case), look at the indexing rate
graph:
https://apps.sematext.com/spm-reports/s/MZbHLRt4qY
(again, just look at the last big "tooth")

Does your indexing rate look the same?

If so, look at your disk IO. Here is the disk IO for the same cluster as
above:
https://apps.sematext.com/spm-reports/s/sHahBnvoUw

Those reads you see there.... that, I believe, is due to Lucene segment
merges.

Hm, in your case you said there is no waiting.... in the system above there
is.

Btw, you have 32 CPU cores on each server and only 10 shards and 1 replica?
You could try more shards to keep all your CPUs busy.
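A sketch of the arithmetic behind that suggestion, using the 8-primary / 1-replica settings and node count from the original post (whether shard count should track core count is a judgment call):

```python
# Shard copies per data node vs. cores per node, using the figures
# from the original post (8 primaries, 1 replica, 4 data nodes, 32 cores).
primaries, replicas, data_nodes, cores = 8, 1, 4, 32

total_copies = primaries * (1 + replicas)
copies_per_node = total_copies / data_nodes

print(f"{copies_per_node:.0f} shard copies per node vs {cores} cores per node")
```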

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



(James Richardson) #9

When you have a Linux machine with a large amount of RAM, you will need to tune page cache writeback (pdflush). Google for dirty_background_ratio. Possibly you are writing everything to the page cache for some time, then a limit is hit, and suddenly all your I/O rates drop. It is usually possible to correlate this with a large rise in system CPU time.
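A sketch of why a large page cache produces exactly this burst pattern. The 252gb figure comes from the original post; the 10%/20% ratios are common kernel defaults, so check your own with `sysctl -a | grep dirty`:

```python
# Rough writeback thresholds on a 252 GB box. The ratios are common
# kernel defaults (assumed here, not read from the poster's systems).
ram_gb = 252
dirty_background_ratio = 10   # % of RAM before background writeback starts
dirty_ratio = 20              # % of RAM before writers block

background_gb = ram_gb * dirty_background_ratio / 100
blocking_gb = ram_gb * dirty_ratio / 100

print(f"background writeback starts at ~{background_gb:.0f} GB of dirty pages")
print(f"writers begin blocking at      ~{blocking_gb:.0f} GB of dirty pages")
```

Tens of gigabytes of buffered writes flushed at once can stall indexing; lowering the ratios (or using the absolute vm.dirty_*_bytes knobs) trades burstiness for steadier I/O.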
Good luck!

James



(tdjb) #10

Otis, our index rate actually stays very flat until the issue occurs; after
that, the index rate just goes up and down. It never settles back into a
steady rate, it just jumps all over the place.

James, thank you for the suggestion. I've been gathering info on the
dirty_background_ratio topic and will start looking at our systems to see
if I can find anything indicating that might be the issue.



(Steve Mayzak) #11

I'd like to help on this if I can but as you noted, adding more nodes
smoothed things out.

First, a couple of clarifications based on what others have said.

  1. I'll reiterate that 1024 is too high for the thread pool; Elasticsearch
    comes OOTB with sensible defaults that, in our experience, shouldn't be
    changed in most cases.
  2. The reason merge times vary is that we throttle merging as part of our
    default settings. Elasticsearch is designed to be a good citizen, and no
    one part of the system should overload/starve another. See here for
    details: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html
  3. A shard is built to take advantage of multiple cores out of the box for
    indexing and querying and is highly concurrent, so I wouldn't worry about
    trying to correlate the number of cores with the number of shards in an
    index.
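On point 2, throttling alone goes some way toward explaining the long merge times in the original post. The 20mb/s figure below is my recollection of the 0.90.x default store throttle for merges; verify it against the index-modules-store docs linked above:

```python
# Time just to write a merged segment at a throttled rate. The throttle
# value is an assumed default for the 0.90.x line, not a confirmed figure.
throttle_mb_per_sec = 20
segment_mb = 1.5 * 1024       # the 1.5gb segments mentioned earlier

seconds = segment_mb / throttle_mb_per_sec
print(f"~{seconds:.0f}s (~{seconds / 60:.1f} min) for one 1.5gb segment")
```

Several concurrent merges sharing the same per-node budget stretch this much further, which is consistent with the 22-minute merges in the logs.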

Without seeing the monitoring metrics while your cluster is steady and then
sporadic, it's not as easy to troubleshoot. If you have time to share some
of those stats over time (CPU, memory, disk I/O, network, and JVM-related
metrics) we could help you further.



(tdjb) #12

Hi Steve, I've put together some screenshots from our monitoring system
that I hope will help. Sadly we did not have the more in-depth disk
monitoring enabled for these machines yet, so I was not able to gather much
more than basic file I/O details. There are two zips here: one with a set
of images from when we are running as expected (goodEsData.zip) and another
with a set of images from when we are having issues (badEsData.zip). Each
zip has CMS GC details, ParNew GC details, CPU, network in, network out,
file out, and Elasticsearch inserts per second. Each file name should
explain what it is.

You can see in the "bad" Elasticsearch insert image where things are quite
bumpy compared to the "good" one. Looking at all these images together, you
can see a drastic file I/O jump in the "bad" images, which I guess matches
up with what James was talking about. We have been playing a bit with the
vm.dirty_*_ratio settings but haven't been able to run many tests to see if
we've improved anything.

The more I searched around, the more I saw people suggesting to just stick
with the default settings, which we have done aside from:

indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000

because right now we are doing mainly inserts with not much searching.
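For context on the flush_threshold_ops value, a rough estimate of how often each shard reaches 50k operations. This is my own arithmetic and assumes writes spread evenly across the 8 primaries:

```python
# How often a 50k-op translog flush triggers per shard (rough estimate,
# assuming an even spread of writes across primaries).
docs_per_sec = 37_000          # midpoint of the 34k-40k range
primaries = 8
flush_threshold_ops = 50_000

ops_per_shard_per_sec = docs_per_sec / primaries
seconds_between_flushes = flush_threshold_ops / ops_per_shard_per_sec
print(f"~{seconds_between_flushes:.0f}s between flushes per shard")
```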

I realize this question is probably hard to answer without knowing exactly
what our data looks like, but should our original four nodes be able to
handle a load like this? If it takes more hardware, then it takes more
hardware; we just want to make sure we are adding it because we actually
need it and not because we've missed something on the setup side. I did
forget to mention in the original post that all of the nodes use local,
non-SSD disks rather than SAN storage.

Thanks in advance.


