Intermittent 505's

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching backend.
We are pretty happy with the switch, except there are a few things that we
are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches, around 2K
milliseconds or so. However, we have some has_child queries, and they seem
to take up a lot of resources.

The main problem though is that the ES will get suddenly extremely slow and
occasionally throw 505's. This will last for around 30 -60 seconds, and
then it goes back to normal. Even with the exact query.

Does anyone have any ideas on what might cause this, or any tools that we
might use to track this down?

Currently, we're considering adding a fifth box that does not store any
data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10 records
per second.

Thank you for any incite that you may have.

--

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds eire1130@gmail.com a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching backend. We are pretty happy with the switch, except there are a few things that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches, around 2K milliseconds or so. However, we have some has_child queries, and they seem to take up a lot of resources.

The main problem though is that the ES will get suddenly extremely slow and occasionally throw 505's. This will last for around 30 -60 seconds, and then it goes back to normal. Even with the exact query.

Does anyone have any ideas on what might cause this, or any tools that we might use to track this down?

Currently, we're considering adding a fifth box that does not store any data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10 records per second.

Thank you for any incite that you may have.

--

Are you sure these errors are 505 and not 500 or 503? Do you have a load
balancer between your client and elasticsearch? Did you change default
thread pool configuration in any way?

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds <eire...@gmail.com <javascript:>>
a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches, around
2K milliseconds or so. However, we have some has_child queries, and they
seem to take up a lot of resources.

The main problem though is that the ES will get suddenly extremely slow
and occasionally throw 505's. This will last for around 30 -60 seconds, and
then it goes back to normal. Even with the exact query.

Does anyone have any ideas on what might cause this, or any tools that we
might use to track this down?

Currently, we're considering adding a fifth box that does not store any
data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--

I will take a look at the GC effect. I have noticed this seems to happen
right before a GC collection.

Ram is:

Large: 3G - Master = True
Large: 3G - Master = False
Xlarge: 10G - Master = True
Xlarge: 10G - Master = False

(no storage ES instance also Xlarge): 10G

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds <eire...@gmail.com <javascript:>>
a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches, around
2K milliseconds or so. However, we have some has_child queries, and they
seem to take up a lot of resources.

The main problem though is that the ES will get suddenly extremely slow
and occasionally throw 505's. This will last for around 30 -60 seconds, and
then it goes back to normal. Even with the exact query.

Does anyone have any ideas on what might cause this, or any tools that we
might use to track this down?

Currently, we're considering adding a fifth box that does not store any
data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--

Are you sure these errors are 505 and not 500 or 503?

Positive. Looking at a screen right now that says "505". Haven't gone into
the apache logs though.

Do you have a load balancer between your client and elasticsearch?

Other than an Xlarge that is set to act as a router (storage=False), no.

Did you change default thread pool configuration in any way?

No, this is default. Most settings are default other than memory, file
handlers, and discovery

On Tuesday, November 27, 2012 7:56:26 AM UTC-5, Igor Motov wrote:

Are you sure these errors are 505 and not 500 or 503? Do you have a load
balancer between your client and elasticsearch? Did you change default
thread pool configuration in any way?

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds eire...@gmail.com a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches, around
2K milliseconds or so. However, we have some has_child queries, and they
seem to take up a lot of resources.

The main problem though is that the ES will get suddenly extremely slow
and occasionally throw 505's. This will last for around 30 -60 seconds, and
then it goes back to normal. Even with the exact query.

Does anyone have any ideas on what might cause this, or any tools that we
might use to track this down?

Currently, we're considering adding a fifth box that does not store any
data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--

I asked because I don't really see how elasticsearch can possibly return
HTTP 505. Do you see any error messages in elasticsearch logs?

On Tuesday, November 27, 2012 10:46:08 AM UTC-5, James Reynolds wrote:

Are you sure these errors are 505 and not 500 or 503?

Positive. Looking at a screen right now that says "505". Haven't gone into
the apache logs though.

Do you have a load balancer between your client and elasticsearch?

Other than an Xlarge that is set to act as a router (storage=False), no.

Did you change default thread pool configuration in any way?

No, this is default. Most settings are default other than memory, file
handlers, and discovery

On Tuesday, November 27, 2012 7:56:26 AM UTC-5, Igor Motov wrote:

Are you sure these errors are 505 and not 500 or 503? Do you have a load
balancer between your client and elasticsearch? Did you change default
thread pool configuration in any way?

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds eire...@gmail.com a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches, around
2K milliseconds or so. However, we have some has_child queries, and they
seem to take up a lot of resources.

The main problem though is that the ES will get suddenly extremely slow
and occasionally throw 505's. This will last for around 30 -60 seconds, and
then it goes back to normal. Even with the exact query.

Does anyone have any ideas on what might cause this, or any tools that
we might use to track this down?

Currently, we're considering adding a fifth box that does not store any
data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--

Hey Igor:

I think that is something that Django throws.

This appears to be the exception in elasticsearch:

[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]

Received response for a request that has timed out, sent [84129ms] ago,
timed out [54129ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234750]
[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]
Received response for a request that has timed out, sent [53924ms] ago,
timed out [23924ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234823]
[2012-11-26 14:35:33,675][DEBUG][action.search.type ] [Master QA]
[766977] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [Slave-1
QA][inet[/10.72.243.186:9300]][search/phase/fetch/id]
Caused by: org.elasticsearch.search.SearchContextMissingException: No
search context found for id [766977]
at
org.elasticsearch.search.SearchService.findContext(SearchService.java:459)
at
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:432)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:634)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:623)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Tuesday, November 27, 2012 12:09:55 PM UTC-5, Igor Motov wrote:

I asked because I don't really see how elasticsearch can possibly return
HTTP 505. Do you see any error messages in elasticsearch logs?

On Tuesday, November 27, 2012 10:46:08 AM UTC-5, James Reynolds wrote:

Are you sure these errors are 505 and not 500 or 503?

Positive. Looking at a screen right now that says "505". Haven't gone
into the apache logs though.

Do you have a load balancer between your client and elasticsearch?

Other than an Xlarge that is set to act as a router (storage=False), no.

Did you change default thread pool configuration in any way?

No, this is default. Most settings are default other than memory, file
handlers, and discovery

On Tuesday, November 27, 2012 7:56:26 AM UTC-5, Igor Motov wrote:

Are you sure these errors are 505 and not 500 or 503? Do you have a load
balancer between your client and elasticsearch? Did you change default
thread pool configuration in any way?

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds eire...@gmail.com a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches,
around 2K milliseconds or so. However, we have some has_child queries, and
they seem to take up a lot of resources.

The main problem though is that the ES will get suddenly extremely slow
and occasionally throw 505's. This will last for around 30 -60 seconds, and
then it goes back to normal. Even with the exact query.

Does anyone have any ideas on what might cause this, or any tools that
we might use to track this down?

Currently, we're considering adding a fifth box that does not store any
data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--

It looks like some of the nodes are getting unresponsive. First thing to
check would be to make sure that memory of your nodes is not getting
swapped out. Are you using mlockall flag? What type of xlarge are these
nodes? It's typically a good idea to allocate no more then half of memory
to es heap. If these are m3.xlarge with 15GB, allocating 10GB to
elasticsearch might be too much.

On Tuesday, November 27, 2012 2:23:23 PM UTC-5, James Reynolds wrote:

Hey Igor:

I think that is something that Django throws.

This appears to be the exception in elasticsearch:

[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]

Received response for a request that has timed out, sent [84129ms] ago,
timed out [54129ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234750]
[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]
Received response for a request that has timed out, sent [53924ms] ago,
timed out [23924ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234823]
[2012-11-26 14:35:33,675][DEBUG][action.search.type ] [Master QA]
[766977] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [Slave-1
QA][inet[/10.72.243.186:9300]][search/phase/fetch/id]
Caused by: org.elasticsearch.search.SearchContextMissingException: No
search context found for id [766977]
at
org.elasticsearch.search.SearchService.findContext(SearchService.java:459)
at
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:432)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:634)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:623)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Tuesday, November 27, 2012 12:09:55 PM UTC-5, Igor Motov wrote:

I asked because I don't really see how elasticsearch can possibly return
HTTP 505. Do you see any error messages in elasticsearch logs?

On Tuesday, November 27, 2012 10:46:08 AM UTC-5, James Reynolds wrote:

Are you sure these errors are 505 and not 500 or 503?

Positive. Looking at a screen right now that says "505". Haven't gone
into the apache logs though.

Do you have a load balancer between your client and elasticsearch?

Other than an Xlarge that is set to act as a router (storage=False), no.

Did you change default thread pool configuration in any way?

No, this is default. Most settings are default other than memory, file
handlers, and discovery

On Tuesday, November 27, 2012 7:56:26 AM UTC-5, Igor Motov wrote:

Are you sure these errors are 505 and not 500 or 503? Do you have a
load balancer between your client and elasticsearch? Did you change default
thread pool configuration in any way?

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds eire...@gmail.com a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches,
around 2K milliseconds or so. However, we have some has_child queries, and
they seem to take up a lot of resources.

The main problem though is that the ES will get
suddenly extremely slow and occasionally throw 505's. This will last for
around 30 -60 seconds, and then it goes back to normal. Even with the exact
query.

Does anyone have any ideas on what might cause this, or any tools that
we might use to track this down?

Currently, we're considering adding a fifth box that does not store
any data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--

Hey,

Thank you very much for the feedback.

These are 15GB, we'll look into bringing this down. However, we are trying
to avoid restarts on the clusters, it seems the shards sometimes panic upon
restart.

As far as mlockall, We haven't changed that setting, so it's whatever
default is.

We'll look into swap, but the only thing running on each box is
elasticsearch.

On one of the boxes last night, we got this traceback:

[2012-11-29 02:44:00,891][INFO ][discovery.zen ] [Slave-2 QA]

master_left [[Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true}], reason
[transport disconnected (with verified connect)]
[2012-11-29 02:44:00,893][INFO ][cluster.service ] [Slave-2 QA]
master {new [Slave-3
QA][Ij9w3kcdTnyuzEt-Xp5urQ][inet[/10.104.7.233:9300]]{master=true},
previous [Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true}}, removed
{[Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true},},
reason: zen-disco-master_failed ([Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true})
[2012-11-29 02:44:22,448][INFO ][discovery.zen ] [Slave-2 QA]
master_left [[Slave-3
QA][Ij9w3kcdTnyuzEt-Xp5urQ][inet[/10.104.7.233:9300]]{master=true}], reason
[no longer master]
[2012-11-29 02:44:22,449][WARN ][discovery.zen ] [Slave-2 QA]
not enough master nodes after master left (reason = no longer master),
current nodes: {[Router
QA][0MFh31WJTz65AjsxYpwGhA][inet[/10.142.134.211:9300]]{data=false,
master=false},[Slave-2
QA][bdPAglqUSRSscuO8HUxoyA][inet[/10.142.148.45:9300]]{master=flase},[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase},}

(two of the slaves have a typo in their definition for master:
master=flase; not sure if this will cause other issues)

I have this below, but for clarity:

Master QA = Master=True
Slave1 = Master=False (flase)
Slave2 = Master=False(flase)
Slave3 = Master=True

It looks like what happened above, Master Qa left for some reason. Slave 3
became master for 22 seconds. "Master" then went back to Master QA. At that
time, we had some kind of panic. This lasted for about five minutes where
we couldn't hit search at all. CPU was at 200% (on bigdesk) loads were like
25/15/10 as I recall.

Moreover, this morning I'm looking at the nodes in bigdesk, and slave2
shows no activity at all. I tried reconnecting, it took a long time and
still not seeing any activity.

I really appreciate your feedback on this one.

On Wednesday, November 28, 2012 6:46:04 PM UTC-5, Igor Motov wrote:

It looks like some of the nodes are getting unresponsive. First thing to
check would be to make sure that memory of your nodes is not getting
swapped out. Are you using mlockall flag? What type of xlarge are these
nodes? It's typically a good idea to allocate no more then half of memory
to es heap. If these are m3.xlarge with 15GB, allocating 10GB to
elasticsearch might be too much.

On Tuesday, November 27, 2012 2:23:23 PM UTC-5, James Reynolds wrote:

Hey Igor:

I think that is something that Django throws.

This appears to be the exception in elasticsearch:

[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]

Received response for a request that has timed out, sent [84129ms] ago,
timed out [54129ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234750]
[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]
Received response for a request that has timed out, sent [53924ms] ago,
timed out [23924ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234823]
[2012-11-26 14:35:33,675][DEBUG][action.search.type ] [Master QA]
[766977] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [Slave-1
QA][inet[/10.72.243.186:9300]][search/phase/fetch/id]
Caused by: org.elasticsearch.search.SearchContextMissingException: No
search context found for id [766977]
at
org.elasticsearch.search.SearchService.findContext(SearchService.java:459)
at
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:432)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:634)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:623)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Tuesday, November 27, 2012 12:09:55 PM UTC-5, Igor Motov wrote:

I asked because I don't really see how elasticsearch can possibly return
HTTP 505. Do you see any error messages in elasticsearch logs?

On Tuesday, November 27, 2012 10:46:08 AM UTC-5, James Reynolds wrote:

Are you sure these errors are 505 and not 500 or 503?

Positive. Looking at a screen right now that says "505". Haven't gone
into the apache logs though.

Do you have a load balancer between your client and elasticsearch?

Other than an Xlarge that is set to act as a router (storage=False),
no.

Did you change default thread pool configuration in any way?

No, this is default. Most settings are default other than memory, file
handlers, and discovery

On Tuesday, November 27, 2012 7:56:26 AM UTC-5, Igor Motov wrote:

Are you sure these errors are 505 and not 500 or 503? Do you have a
load balancer between your client and elasticsearch? Did you change default
thread pool configuration in any way?

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds eire...@gmail.com a écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches,
around 2K milliseconds or so. However, we have some has_child queries, and
they seem to take up a lot of resources.

The main problem though is that the ES will get
suddenly extremely slow and occasionally throw 505's. This will last for
around 30 -60 seconds, and then it goes back to normal. Even with the exact
query.

Does anyone have any ideas on what might cause this, or any tools
that we might use to track this down?

Currently, we're considering adding a fifth box that does not store
any data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--

After much tinkering, we thought we had this solved, but it seems to come
back and worse then ever. This seems to be the traceback that is thrown
whenever we this problem:

[2012-12-03 00:01:52,576][DEBUG][action.search.type ] [Router QA]

[390172] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [Slave-4
QA][inet[/10.104.37.180:9300]][search/phase/fetch/id]
Caused by: org.elasticsearch.search.SearchContextMissingException: No
search context found for id [390172]
at
org.elasticsearch.search.SearchService.findContext(SearchService.java:459)
at
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:432)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:634)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:623)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

It only seems to do it when we perform a haschild query.

On Thursday, November 29, 2012 10:12:47 AM UTC-5, James Reynolds wrote:

Hey,

Thank you very much for the feedback.

These are 15GB, we'll look into bringing this down. However, we are trying
to avoid restarts on the clusters, it seems the shards sometimes panic upon
restart.

As far as mlockall, We haven't changed that setting, so it's whatever
default is.

We'll look into swap, but the only thing running on each box is
elasticsearch.

On one of the boxes last night, we got this traceback:

[2012-11-29 02:44:00,891][INFO ][discovery.zen ] [Slave-2 QA]

master_left [[Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true}], reason
[transport disconnected (with verified connect)]
[2012-11-29 02:44:00,893][INFO ][cluster.service ] [Slave-2 QA]
master {new [Slave-3
QA][Ij9w3kcdTnyuzEt-Xp5urQ][inet[/10.104.7.233:9300]]{master=true},
previous [Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true}}, removed
{[Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true},},
reason: zen-disco-master_failed ([Master
QA][eY5WeSRSR4mn9WQUQ8ceLw][inet[/10.214.39.5:9300]]{master=true})
[2012-11-29 02:44:22,448][INFO ][discovery.zen ] [Slave-2 QA]
master_left [[Slave-3
QA][Ij9w3kcdTnyuzEt-Xp5urQ][inet[/10.104.7.233:9300]]{master=true}], reason
[no longer master]
[2012-11-29 02:44:22,449][WARN ][discovery.zen ] [Slave-2 QA]
not enough master nodes after master left (reason = no longer master),
current nodes: {[Router
QA][0MFh31WJTz65AjsxYpwGhA][inet[/10.142.134.211:9300]]{data=false,
master=false},[Slave-2
QA][bdPAglqUSRSscuO8HUxoyA][inet[/10.142.148.45:9300]]{master=flase},[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase},}

(two of the slaves have a typo in their definition for master:
master=flase; not sure if this will cause other issues)

I have this below, but for clarity:

Master QA = Master=True
Slave1 = Master=False (flase)
Slave2 = Master=False(flase)
Slave3 = Master=True

It looks like what happened above, Master Qa left for some reason. Slave 3
became master for 22 seconds. "Master" then went back to Master QA. At that
time, we had some kind of panic. This lasted for about five minutes where
we couldn't hit search at all. CPU was at 200% (on bigdesk) loads were like
25/15/10 as I recall.

Moreover, this morning I'm looking at the nodes in bigdesk, and slave2
shows no activity at all. I tried reconnecting, it took a long time and
still not seeing any activity.

I really appreciate your feedback on this one.

On Wednesday, November 28, 2012 6:46:04 PM UTC-5, Igor Motov wrote:

It looks like some of the nodes are getting unresponsive. First thing to
check would be to make sure that memory of your nodes is not getting
swapped out. Are you using mlockall flag? What type of xlarge are these
nodes? It's typically a good idea to allocate no more then half of memory
to es heap. If these are m3.xlarge with 15GB, allocating 10GB to
elasticsearch might be too much.

On Tuesday, November 27, 2012 2:23:23 PM UTC-5, James Reynolds wrote:

Hey Igor:

I think that is something that Django throws.

This appears to be the exception in elasticsearch:

[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]

Received response for a request that has timed out, sent [84129ms] ago,
timed out [54129ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234750]
[2012-11-26 03:38:49,915][WARN ][transport ] [Master QA]
Received response for a request that has timed out, sent [53924ms] ago,
timed out [23924ms] ago, action [discovery/zen/fd/ping], node [[Slave-1
QA][cupA5ulMReaIli2B4IkR5Q][inet[/10.72.243.186:9300]]{master=flase}], id
[21234823]
[2012-11-26 14:35:33,675][DEBUG][action.search.type ] [Master QA]
[766977] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [Slave-1
QA][inet[/10.72.243.186:9300]][search/phase/fetch/id]
Caused by: org.elasticsearch.search.SearchContextMissingException: No
search context found for id [766977]
at
org.elasticsearch.search.SearchService.findContext(SearchService.java:459)
at
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:432)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:634)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:623)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Tuesday, November 27, 2012 12:09:55 PM UTC-5, Igor Motov wrote:

I asked because I don't really see how elasticsearch can possibly
return HTTP 505. Do you see any error messages in elasticsearch logs?

On Tuesday, November 27, 2012 10:46:08 AM UTC-5, James Reynolds wrote:

Are you sure these errors are 505 and not 500 or 503?

Positive. Looking at a screen right now that says "505". Haven't gone
into the apache logs though.

Do you have a load balancer between your client and elasticsearch?

Other than an Xlarge that is set to act as a router (storage=False),
no.

Did you change default thread pool configuration in any way?

No, this is default. Most settings are default other than memory, file
handlers, and discovery

On Tuesday, November 27, 2012 7:56:26 AM UTC-5, Igor Motov wrote:

Are you sure these errors are 505 and not 500 or 503? Do you have a
load balancer between your client and elasticsearch? Did you change default
thread pool configuration in any way?

On Tuesday, November 27, 2012 1:46:56 AM UTC-5, David Pilato wrote:

Sounds like GC effect, isn't it?
I think that when GC is taking too much time, you can see it in logs.

How much RAM did you give to ES?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 27 nov. 2012 à 04:04, James Reynolds eire...@gmail.com a
écrit :

Hello all:

Recently, we switched from Solr to ElasticSearch for our searching
backend. We are pretty happy with the switch, except there are a few things
that we are having trouble with.

But first, some specs:

  • The index is around 200gb
  • Nine Shards
  • Four nodes on four machines - EC2 instances. Two XL and two Large
  • Three indexes
  • 8 Million docs between the three indexes.

Speed seems to be a bit slow on returning basic keyword searches,
around 2K milliseconds or so. However, we have some has_child queries, and
they seem to take up a lot of resources.

The main problem though is that the ES will get
suddenly extremely slow and occasionally throw 505's. This will last for
around 30 -60 seconds, and then it goes back to normal. Even with the exact
query.

Does anyone have any ideas on what might cause this, or any tools
that we might use to track this down?

Currently, we're considering adding a fifth box that does not store
any data, to act as a router.

Also, we are actively indexing, but the amount is trivial. Maybe 10
records per second.

Thank you for any incite that you may have.

--

--