How to diagnose slow queries every 10 minutes exactly?


(Dave Reed) #1

I have a 2-node cluster running on some beefy machines. 12g and 16g of heap
space. About 2.1 million documents, each relatively small in size, spread
across 200 or so indexes. The refresh interval is 0.5s (while I don't need
realtime I do need relatively quick refreshes). Documents are continuously
modified by the app, so reindex requests trickle in constantly. By trickle
I mean maybe a dozen a minute. All index requests are made with _bulk,
although a lot of the time there's only 1 in the list.
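For reference, a one-item `_bulk` request like the ones described might look like this (the index name, type, and document fields here are placeholders for illustration, not from the actual cluster):

```shell
# Build an NDJSON body: one action line plus one source line per document.
# The _bulk endpoint requires a trailing newline, which the heredoc provides.
cat > bulk.ndjson <<'EOF'
{"index":{"_index":"project-42","_type":"doc","_id":"1001"}}
{"title":"Example document","status":"open"}
EOF

curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.ndjson
```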

Searches are very fast -- normally taking 50ms or less.

But oddly, exactly every 10 minutes, searches slow down for a moment. The
exact same query that normally takes <50ms takes 9000ms, for example. Any
other queries regardless of what they are also take multiple seconds to
complete. Once this moment passes, search queries return to normal.

I have a tester I wrote that continuously posts the same query and logs the
results, which is how I discovered this pattern.
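A minimal version of such a tester can be sketched in shell. The cluster URL, index name, and query file are assumptions for illustration:

```shell
# Post the same search once a second and log a timestamp plus the
# "took" field (milliseconds) from the ES response.
while true; do
  took=$(curl -s -XPOST 'http://localhost:9200/myindex/_search' \
              --data-binary @query.json \
         | grep -o '"took":[0-9]*' | cut -d: -f2)
  echo "$(date '+%H:%M:%S'), $took" >> search-times.log
  sleep 1
done
```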

Here's an excerpt. Notice that query time is great at 3:49:10, then at :11
things stop for 10 seconds. At :21 the queued up searches finally come
through. The numbers reported are the "took" field from the ES search
response. Then things resume as normal. This is true no matter which node I
run the search against.

This pattern repeats like this every 10 minutes, to the second, for days
now.

3:49:09, 47
3:49:09, 63
3:49:10, 31
3:49:10, 47
3:49:11, 47
3:49:11, 62
3:49:21, 8456
3:49:21, 5460
3:49:21, 7457
3:49:21, 4509
3:49:21, 3510
3:49:21, 515
3:49:21, 1498
3:49:21, 2496
3:49:22, 2465
3:49:22, 2636
3:49:22, 6506
3:49:22, 7504
3:49:22, 9501
3:49:22, 4509
3:49:22, 1638
3:49:22, 6646
3:49:22, 9641
3:49:22, 655
3:49:22, 3667
3:49:22, 31
3:49:22, 78
3:49:22, 47
3:49:23, 47
3:49:23, 47
3:49:24, 93

I've ruled out any obvious periodic process running on either node in the
cluster. It's pretty clean. Disk I/O, CPU, RAM, etc. all stay pretty
consistent and flat even during one of these blips. These are beefy machines
like I said, so there's plenty of CPU and RAM available.

Any advice on how I can figure out what ES is waiting for would be helpful.
Is there any process ES runs every 10 minutes by default?

Thanks!
-Dave

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0c715e1e-f45b-46fb-ae39-52318e126203%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(David Pilato) #2

Could you run a hot_threads API call when this happens?
Anything in logs about GC?

BTW 200 indices is a lot for 2 nodes. And how many shards/replicas do you have?
Why do you need so many indices for 2m docs?
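The checks suggested here would look something like this (the host and the log path are assumptions; adjust to your install):

```shell
# Capture hot threads from the cluster while the slowdown is happening.
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3' > hot_threads.txt

# Look for recent garbage-collection messages in the node's log.
grep -i 'gc' /var/log/elasticsearch/elasticsearch.log | tail -20
```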

David



(Dave Reed) #3

The logs are basically empty. There's activity from when I am creating the
indexes, but that's about it. Is there a logging level that could be
increased? I will run hot_threads as soon as I can in the morning and post
the results.

I have each index set to 1 replica. If it matters, I first import them with
0 replicas then set it to 1 when they are done. Shard count I will have to
check and get back to you, but it's whatever the default would be, we
haven't tweaked that.
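Rather than guessing, the shard and replica counts can be read straight from the cluster (the index name below is illustrative):

```shell
# One row per shard, showing index, shard number, and primary/replica role.
curl -s 'http://localhost:9200/_cat/shards?v' | head

# Per-index settings, including number_of_shards and number_of_replicas.
curl -s 'http://localhost:9200/project-42/_settings?pretty'
```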

I have 200+ indexes because I have 200+ different "projects" represented,
and each one has its own set of mappings, which could collide on name. I
originally tried having a single index with each project represented by a
Type instead, but a conversation I had on this forum led me away from that,
because two different projects may sometimes have the same field names but
with different types. Most searches (99.9%) are done on a per-project basis,
and changes to a project can require reindexing it, so having the
segregation is nice, lest I have to reimport the entire thing.

In case this is related, I also found that a document was changed and
reindexed, but searches would sometimes include it and sometimes not. I
could literally just refresh the search over and over again, and it would
appear in the results roughly 50% of the time: 0 results, then 0, 0, then 1
result, 1, 1, then 0 again, etc. I was running the search against one of
the nodes via _head. The cluster health was green during all of this. That
was surprising and something I was going to investigate more on my own, but
perhaps these two problems are related.

I'm really hitting the limit of what I know how to troubleshoot with ES,
hence I am really hoping for help here :slight_smile:



(David Pilato) #4

Some notes. You are using the defaults, so each index has 5 shards: 200 indices × 5 = 1,000 primary shards.
With one replica each, that's 2,000 shards in the cluster, i.e. about 1,000 shards per node, and each shard is a separate Lucene instance.

First thing to do, IMO, is to use only one shard per index unless you need more for "big" indices.

Then, run 3 nodes and set minimum_master_nodes to 2. You probably ran into a split-brain issue, which could explain the inconsistent results you are seeing.
I would probably set replicas to 0 and then back to 1. But if you follow my first advice you have to reindex anyway, so reindex with 1 shard and 0 replicas, then set replicas to 1.
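The reindex-then-add-replica sequence described above might look like this (the index name is hypothetical):

```shell
# Create the index with 1 primary shard and no replicas, for a fast import.
curl -s -XPUT 'http://localhost:9200/project-42' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}'

# ... bulk-index the documents ...

# Enable the replica only once the import is done.
curl -s -XPUT 'http://localhost:9200/project-42/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'
```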

My 2 cents

David



(Dave Reed) #5

Ok,

I have deleted everything and restarted the ES service, on both nodes, and
rebuilt the indexes. I have changed my shard count to 1 -- thank you for
explaining that, I didn't realize the overhead was as large as it was. I do
have some indexes that are much larger than others. The largest one has
around 400k issues. Search performance seems fine with the 1 shard though,
so I'll leave it at that. I can introduce a per-project shard count setting
if needed later on.

Results:

My search count issue appears to be resolved, it was probably related to my
large shard count.

However, the slow queries once every 10 minutes remain. They're even still
on the same schedule -- always at a wall-clock minute ending in 9, e.g.
12:29pm. Here is a hot_threads call I ran while the slow searches were
happening (I replaced machine names and IPs with tokens for obscurity):

::: [][TbKES0BpQ---dAP1J9kKKQ][][inet[/:9300]]
Hot threads at 2015-04-21T19:29:27.841Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: [][aPof4TmzQlm7Ck7frfGeuA][][inet[/:9300]]
Hot threads at 2015-04-21T19:29:26.250Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (0s out of 500ms) cpu usage by thread 'Attach Listener'
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot

Here is the slow search log for the same timeframe (first # is the time to
get the response, second # is the .took field from the ES response json)

4/21/2015, 12:29:27 PM 8586 8361
4/21/2015, 12:29:27 PM 9681 9454
4/21/2015, 12:29:27 PM 7685 7457

There's nothing in ES logs after the indexes were initially created, and
it's just info type things like this:
[2015-04-21 10:48:05,029][INFO ][cluster.metadata ] []
updating number_of_replicas to [1] for indices []

It's so interesting that it happens every 10 minutes, always at minutes
ending in 9, even though I've restarted ES. There could still be an
external reason for this, but like I said, there's no CPU usage or disk I/O
going on during these periods, so it's very strange. I have no leads other
than these slow search times.

Thanks so much for your help. Any other troubleshooting ideas would be
greatly appreciated.
-Dave



(Alex Roytman) #6

It could be entirely unrelated, but if I recall correctly, someone reported similar regular-interval slowness here before, and it turned out to be the load balancer they were using.



(Dave Reed) #7

Thanks for the info, but there's no load balancer involved here. No VMs
either; nothing fancy.



(Dave Reed) #8

Ok, I've identified the problem, and it has nothing to do with ES :slight_smile: It's
something funky with the network card or its driver. I isolated the problem
by testing each of my two nodes independently and identified that it was
only a problem on one of the two. Then I spun up a simple Node HTTP
listener and was able to reproduce the problem with that, eliminating ES
from being involved at all.
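This kind of ES-free isolation test can be reproduced with any trivial HTTP server. Here is one way, using Python's built-in server as the stand-in listener (the port is arbitrary):

```shell
# Start a trivial HTTP server that has nothing to do with Elasticsearch.
python3 -m http.server 8123 --bind 127.0.0.1 >/dev/null 2>&1 &
SRV=$!
sleep 1

# Time repeated requests against it; a periodic multi-second stall here
# implicates the host or network stack, not Elasticsearch.
for i in 1 2 3; do
  curl -s -o /dev/null -w '%{time_total}s\n' http://127.0.0.1:8123/
done

kill $SRV
```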

Still, this thread was very helpful, as I had a few other issues that
reducing my shard count seems to have resolved.

Thank you so much for your help. ES rocks.



#9

Are you sure this was a network interface issue?

I'm seeing the same problem in my system. ES is hosted on cloud.elastic.co, a node client is running on AWS (in the same region). Every 10 minutes (starting at xx:x9) the queries slow down (for about 5 minutes) significantly. 500ms average to about 15,000ms average.

If it matters, these are mostly aggregate queries.

