Certain REST requests time out

Hi,

We are running a two-node cluster where one of the nodes has stopped
responding reliably to REST requests.

Here is what is going on:

  1. On the problematic node: curl http://localhost:9200/_cluster/health returns
     a green state, but about one request in five times out.
  2. On the problematic node: curl http://localhost:9200/_search times out
     every time.
  3. On the problematic node: curl http://localhost:9200/ works every time.
  4. On the good node, all of the above work, including search (!). It is
     important to note that both nodes are needed for searching, as each
     contains part of the data.

I think this means that this is not a networking issue (running out of
sockets, etc.) because of #3. It's also not something structural with ES
(like garbage collection, etc.) because searches through the healthy node
work just fine.
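One way to quantify these symptoms is to probe each endpoint with a hard
timeout and record the outcome. Below is a minimal sketch (Python; the
endpoints come from the list above, and the 5-second cutoff is an arbitrary
illustrative choice):

```python
import socket
import time
import urllib.error
import urllib.request

def probe(url, timeout=5.0):
    """GET a URL with a hard timeout and report how the request went."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok (HTTP %d in %.2fs)" % (resp.status, time.monotonic() - start)
    except socket.timeout:
        return "timed out after %.1fs" % timeout
    except (urllib.error.URLError, OSError) as exc:
        return "error: %s" % exc

if __name__ == "__main__":
    # The three endpoints from the report; run this on each node, repeatedly,
    # to catch the intermittent (one-in-five) health-check timeouts.
    for path in ("/", "/_cluster/health", "/_search"):
        print(path, "->", probe("http://localhost:9200" + path, timeout=5.0))
```

Running this in a loop on both nodes would show whether the failures are
per-endpoint (pointing at specific handlers or thread pools) or node-wide.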

Does anyone recognise this pattern, or have ideas on how to proceed with
debugging?

Thanks,
Boaz

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Forgot to mention our version number, which is 0.20.6.

Cheers,
Boaz

On Mon, Apr 29, 2013 at 5:11 PM, Boaz Leskes b.leskes@gmail.com wrote:



Did you restart the faulty node? Does it change after this?

Jörg

On 29.04.13 17:29, Boaz Leskes wrote:


Yes. Afterwards everything is OK for about 4 hours, and then it happens again...

On Mon, Apr 29, 2013 at 7:11 PM, Jörg Prante joergprante@gmail.com
wrote:


Are you sure you have enabled GC monitoring and there is nothing in the log?

It looks like your node has resource shortage issues. I recommend firing up
BigDesk for an impression of the resource usage, or jvisualvm for direct
JVM examination, to monitor the resource usage and see what is going on.

Jörg
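As a lighter-weight complement to BigDesk or jvisualvm, the node stats can
also be polled over HTTP and logged. A minimal sketch follows; the stats
path is an assumption for the 0.20-era API (newer versions expose
/_nodes/stats instead), so adjust it to your version:

```python
import json
import urllib.request

# Assumed 0.20-era node stats path; newer ES versions use /_nodes/stats.
STATS_URL = "http://localhost:9200/_cluster/nodes/stats"

def fetch_stats(url, timeout=3.0):
    """Return the parsed node stats, or None if the node doesn't answer in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    stats = fetch_stats(STATS_URL)
    # A stalled node shows up as None here even while '/' still answers.
    print("stats:", "unavailable" if stats is None else "ok")
```

Polling this every few seconds and diffing the output around the time the
node stalls can reveal which resource (heap, threads, file handles) moves.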

On 29.04.13 19:42, Boaz Leskes wrote:


I had it under visualvm and GC looks good. Also, intra-node searches coming from the other node work fine. What other resources should I look at?

On Mon, Apr 29, 2013 at 7:48 PM, Jörg Prante joergprante@gmail.com
wrote:


No good ideas. Just shooting in the dark. Maybe an issue with certain
queries? Cache config? Shard distribution? Thread pools?

Jörg

On 29.04.13 20:01, Boaz Leskes wrote:


We are seeing similar behavior: the following curl command will time out
when executed twice in a row, and pretty much sporadically otherwise:

$ curl -XPOST http://localhost:9200/some-index/mydocs/_search -d '
{
  "size": 100,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "and": [
          { "term": { "user": "itamar" } },
          { "missing": { "field": "last_edited", "null_value": true } }
        ]
      }
    }
  }
}'

The index has 2 replicas, and we observe this happening for non-REST
clients (native Java clients) as well, so this is probably something
related to threading on the server?
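The threading suspicion can be illustrated with a toy model (a sketch of the
general starvation mechanism, not ES code): when a fixed-size worker pool,
the analogue of a search thread pool, is saturated by slow tasks, additional
requests queue past their deadline, while work that bypasses the pool still
answers:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

POOL_SIZE = 2  # a tiny "search thread pool" for the demo

def slow_search():
    time.sleep(1.0)   # stands in for an expensive query holding a worker thread
    return "hits"

def ping():
    return "pong"     # stands in for '/', which needs no search worker

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# Saturate the pool with slow searches.
blockers = [pool.submit(slow_search) for _ in range(POOL_SIZE)]

# A new search must queue behind them and misses a 0.2s client deadline...
queued = pool.submit(slow_search)
try:
    queued.result(timeout=0.2)
    outcome = "completed"
except FutureTimeout:
    outcome = "timed out"
print("search while pool saturated:", outcome)  # -> timed out

# ...while a request that doesn't go through the pool still answers.
print("ping outside the pool:", ping())         # -> pong

for f in blockers:
    f.result()  # let the demo finish cleanly
queued.result()
pool.shutdown()
```

This matches the reported shape of the failure: '/' keeps working while
'_search' stalls, which is consistent with one busy pool rather than a dead
node.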

Boaz, were you able to get to any resolution on this?

On Mon, Apr 29, 2013 at 9:19 PM, Jörg Prante joergprante@gmail.com wrote:


Hi Itamar,

We ended up giving the nodes some more memory, and it stopped happening. I
was blindly guessing memory might help because it started happening after
we rolled out a memory-intensive feature. Can't say I'm happy with that
kind of "solution".

You mention the query is a reliable way to reproduce the issue. Can you
also supply some sample data? Do you need to run in a multi-node scenario,
or is one node enough?

Boaz

On Tuesday, May 21, 2013 12:42:00 PM UTC+2, Itamar Syn-Hershko wrote:


What do you mean by "giving"? Are you running on a VM, or was it a setting
you changed?

It's any query, as a matter of fact. We have a multi-node setup and that
specific index has 2 replicas, so it should be spread between them (though
I admit I trusted the implementation and haven't checked that).

I have never seen this happen in our tests, only on our live cluster where
the setup is multi-node. I believe any such setup, with any data and
multiple clients accessing it concurrently, will eventually get to that
faulty state. I'll be happy to work with you and the ES team on reproducing
and killing this issue. I believe it should be escalated and resolved
quickly - it's definitely a show stopper.

Another big issue I'm seeing at this point is ES not respecting timeouts -
I recall Shay saying it's on a best-effort basis. If ES reliably timed out
lengthy requests, this starvation scenario wouldn't happen.
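Since the server-side search "timeout" parameter is best-effort, a client
can enforce its own hard deadline as a backstop. A minimal sketch follows
(the URL and values are illustrative; urlopen's timeout applies per socket
operation, which is usually enough to bound a stalled response):

```python
import json
import urllib.request

def bounded_search(url, query, server_timeout="500ms", client_deadline=2.0):
    """Issue a search with both a server-side 'timeout' (best-effort in ES)
    and a hard client-side socket timeout as a backstop."""
    body = dict(query)
    body["timeout"] = server_timeout  # ES may still exceed this
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=client_deadline) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except OSError:
        return None  # connection failure or client deadline exceeded
```

With a guard like this, a stalled node costs a bounded two seconds instead
of hanging the caller, even when the server never enforces its own timeout.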

On Tue, May 21, 2013 at 1:56 PM, Boaz Leskes b.leskes@gmail.com wrote:


We gave the VM more memory (thus leaving less to the file caches).

It seems you guys are experiencing something slightly different from what
we had - the number of connected clients in our case stayed the same. It
might still be a resource problem, but of a very specific nature, as we
couldn't reproduce it in tests either.

I'd love to try to help, but I need some more info. It would be awesome if
you could give a recipe that consistently (or at least with a high
percentage) gets it to block.

On Tue, May 21, 2013 at 1:12 PM, Itamar Syn-Hershko itamar@code972.com wrote:


The number of connected clients is the same - but it's larger than one, and
they are actively working against the cluster, mostly issuing searches.
Concurrent requests to the cluster seem to cause thread starvation. And we
haven't been able to reproduce this in tests yet.

I'll try to work on something reproducible. In the meantime - were you able
to find anything in the logs?

On Tue, May 21, 2013 at 2:21 PM, Boaz Leskes b.leskes@gmail.com wrote:


No, sadly nothing in the logs.

Hopefully you can make it more concrete. I'll try to find some time to look
at the code again later. What version are you running?

On Tue, May 21, 2013 at 1:28 PM, Itamar Syn-Hershko itamar@code972.com wrote:


0.90 with some additional recent commits (we are compiling a fork, with no
relevant changes to the core).

On Tue, May 21, 2013 at 2:33 PM, Boaz Leskes b.leskes@gmail.com wrote:

No, sadly nothing in the logs.

Hopefully you can get it more concrete. I'll try to find some time to look
at the code again later. What version are you running?

On Tue, May 21, 2013 at 1:28 PM, Itamar Syn-Hershko itamar@code972.comwrote:

The number of connected clients is the same - but its larger than one and
they are actively working against the cluster, mostly issuing searches.
Concurrent requests to the cluster seem to cause thread starvation. And we
weren't able to reproduce this in tests yet.

I'll try to work on something reproducible. In the meantime - were you
able to find anything in the logs?

On Tue, May 21, 2013 at 2:21 PM, Boaz Leskes b.leskes@gmail.com wrote:

We gave the VM more memory (thus leaving less to the file caches).

It seems you guys are experiencing something slightly different than
what we had - the number connected clients in our case stayed the same. It
might still be a resources problem but of a very specific nature as we
couldn't reproduce it in tests either.

I'd love to try and help but I need some more info. It would be awesome
if you can give a recipe that consistently (/with high percentage) gets it
to block.

On Tue, May 21, 2013 at 1:12 PM, Itamar Syn-Hershko itamar@code972.comwrote:

What do you mean "giving"? are you running on a VM, or was it a setting
you changed?

It's any query, as a matter of a fact. We have multi-node setup and
that specific index has 2 replicas so it should be spread between them
(though I admit believed the impl and haven't checked that).

I've never seen this happen in our tests, only on our live cluster, where
the setup is multi-node. I believe any such setup, with any data and
multiple clients accessing it concurrently, will eventually get to that
faulty state. I'll be happy to work with you and the ES team on reproducing
and killing this issue. I believe it should be escalated and resolved
quickly - it's definitely a show stopper.

Another big issue I'm seeing at this point is ES not respecting
timeouts - I recall Shay saying it's on a best-effort basis. If ES
reliably timed out lengthy requests, this starvation scenario wouldn't happen.
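Given that the server-side timeout is best-effort, one defensive option (my own sketch, not something from the thread) is to enforce a hard deadline on the client; `do_search` here is a hypothetical stand-in for any blocking Elasticsearch call (an HTTP request or a native-client search):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

# Run blocking "search" calls through an executor so the caller can
# abandon them after a hard deadline, whatever the server does.
_executor = ThreadPoolExecutor(max_workers=4)

def search_with_deadline(do_search, deadline_seconds):
    """Run do_search() but give up after deadline_seconds on the client side."""
    future = _executor.submit(do_search)
    try:
        return future.result(timeout=deadline_seconds)
    except TimeoutError:
        future.cancel()  # best effort; the request may keep running server-side
        raise
```

Note this only protects the caller: a search that has already started keeps consuming server resources after the client gives up, so it doesn't fix the starvation itself.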

On Tue, May 21, 2013 at 1:56 PM, Boaz Leskes b.leskes@gmail.com wrote:

Hi Itamar,

We ended up giving the nodes some more memory and it stopped
happening. I was blindly guessing that memory might help, because it started
happening after we rolled out a memory-intensive feature. Can't say I'm
happy with that kind of "solution".

You mention the query is a reliable way to reproduce the issue. Can
you also supply some sample data? Do you need to run in a multi-node scenario,
or is one node also enough?

Boaz

On Tuesday, May 21, 2013 12:42:00 PM UTC+2, Itamar Syn-Hershko wrote:

We are seeing similar behavior; the following curl command will
time out when executed twice in a row, and pretty much sporadically
otherwise:

$ curl -XPOST http://localhost:9200/some-index/mydocs/_search -d '
{
  "size": 100,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "and": [{
          "term": {
            "user": "itamar"
          }
        }, {
          "missing": {
            "field": "last_edited",
            "null_value": true
          }
        }]
      }
    }
  }
}'

The index has 2 replicas, and we observe this happening for
non-REST clients (native Java clients) as well, so this is probably something
related to threading on the server?

Boaz, were you able to get to any resolution on this?

On Mon, Apr 29, 2013 at 9:19 PM, Jörg Prante joerg...@gmail.com wrote:

No good ideas. Just shooting in the dark. Maybe an issue with
certain queries? Cache config? Shard distribution? Thread pools?

Jörg

On 29.04.13 at 20:01, Boaz Leskes wrote:

I had it under VisualVM and GC looks good. Also, intra-node searches
coming from the other node work fine. What other resources should I look at?


Just to make sure - what do the HTTP stats return on the problematic
node? http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-stats/
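For reference, the per-node `http` section of that stats response is where open-connection counts live; a small sketch of pulling them out of a hand-written sample response (the field names `current_open` / `total_opened` are, to the best of my knowledge, what the nodes stats API reports, but treat them as an assumption and check against your actual response):

```python
import json

# Hand-written sample fragment of a nodes-stats response; the node IDs
# and numbers are made up for illustration.
sample = json.loads("""
{
  "nodes": {
    "abc123": {
      "name": "node-1",
      "http": {"current_open": 42, "total_opened": 1071}
    },
    "def456": {
      "name": "node-2",
      "http": {"current_open": 3, "total_opened": 980}
    }
  }
}
""")

def http_open_counts(stats):
    """Map node name -> currently open HTTP connections."""
    return {
        node["name"]: node["http"]["current_open"]
        for node in stats["nodes"].values()
    }

counts = http_open_counts(sample)
```

A node whose `current_open` keeps climbing while requests hang would suggest connections piling up behind stuck handlers rather than being refused outright.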
