Cluster hangs when closing index and shows strange behavior


(Frederic) #1

Hi there,

This may be quite a long story, but I think the strangeness of the issue
deserves the post :)

I'm running an ES cluster in production (6 nodes, 1 index, 20 shards, 1
replica, 50GB of docs), and needed to add a new analyzer and convert
a field of an existing type to a multi_field data type, in order to
define the new analyzer in one of the core types of that field.
I successfully tested all necessary steps on my local machine first
(same index structure), and created scripts for each one of them, in
order to avoid typing mistakes.

I knew that I should close the index before adding a new analyzer.
So, after shutting down all running indexing and searching
processes, I executed the following command from a non-ES server (a
hub for prod servers):

curl -XPOST "http://{SERVER}:9200/items/_close"

... and no answer was received at all. The 'curl' hung indefinitely,
waiting for a response.

To check whether the index was still open, I ran a cluster_health
command, which reported all shard counts as 0 (zero), as if the index
were closed. The 'close' request was still hanging.
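
A note for readers in the same situation: the health output can be checked from a script rather than by eye. This is only a rough sketch, not something taken from the original cluster. It pulls the `status` field out of a `_cluster/health` response with sed. The helper names `health_status`, `is_green` and `fetch_health` are made up for this example, and the host passed to `fetch_health` is a placeholder.

```shell
# Extract the "status" value (green, yellow or red) from a _cluster/health
# JSON response passed as the first argument.
health_status() {
  printf '%s\n' "$1" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Succeeds only when the reported status is green.
is_green() {
  [ "$(health_status "$1")" = "green" ]
}

# Fetch the health document. --max-time keeps the call from blocking forever.
# The host is an assumption: pass your own node address.
fetch_health() {
  curl -s --max-time 10 "http://${1:-localhost}:9200/_cluster/health"
}
```

Something like `is_green "$(fetch_health myhost)" || echo "not green, skipping close"` could then guard the close request.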

Things got even weirder when I ran a delete query I use for
purging old docs and got results, but with 5 failed shards. Even more,
I then ran a search query and got results with all shards OK.

So at this point I decided to kill the 'close' curl request and start
shutting down nodes in order to see the cluster response. A
simple 'kill' command didn't work on some nodes, so I used
'kill -9'. After killing one node and checking the health, I got a red
cluster status. Cluster info and different cluster_health outputs are
gisted here https://gist.github.com/1602222

The cluster is configured to work with at least 4 servers, so I shut
down 3 of the 6 servers and started just 2 again. Only at that point
did the cluster start to recover properly.

IMPORTANT: I tried to do this task a week ago, with the very same
results. I ended up restarting all nodes one by one until the status was
green, but was not able to add the new analyzer. So I guess this is not
an occasional issue.

Finally, when the status was green, I tried to add the analyzer again
and everything worked: closing the index, adding the analyzer,
opening the index and setting the new field mapping.
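
For readers who want to script those four steps, a minimal sketch follows. It assumes the 2012-era REST API discussed in this thread (later Elasticsearch versions changed mappings and removed `multi_field`). The type `item`, field `title`, sub-field `folded` and analyzer `my_analyzer` are hypothetical placeholders, as are the `es` and `add_analyzer` helpers. Every request gets a client-side timeout, so a stuck call returns instead of hanging the shell.

```shell
# Placeholders: point ES_URL at one of your nodes and adjust the names below.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="${INDEX:-items}"

# Hypothetical analyzer definition and multi_field mapping.
ANALYSIS='{"analysis":{"analyzer":{"my_analyzer":{"type":"custom","tokenizer":"standard","filter":["lowercase"]}}}}'
MAPPING='{"item":{"properties":{"title":{"type":"multi_field","fields":{"title":{"type":"string"},"folded":{"type":"string","analyzer":"my_analyzer"}}}}}}'

# es METHOD PATH [BODY]: send one request with a 60 second client timeout.
# -f makes curl exit non-zero on HTTP errors. With DRY_RUN=1 the request
# line is only printed and nothing is sent.
es() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$1 $ES_URL$2"
    return 0
  fi
  if [ -n "${3:-}" ]; then
    curl -sSf --max-time 60 -X "$1" -H 'Content-Type: application/json' \
      "$ES_URL$2" -d "$3"
  else
    curl -sSf --max-time 60 -X "$1" "$ES_URL$2"
  fi
}

# Close, update settings, reopen, then put the mapping. Stops at the first
# request that times out, cannot connect, or returns an HTTP error.
add_analyzer() {
  es POST "/$INDEX/_close"                || return 1
  es PUT  "/$INDEX/_settings" "$ANALYSIS" || return 1
  es POST "/$INDEX/_open"                 || return 1
  es PUT  "/$INDEX/item/_mapping" "$MAPPING"
}
```

Running `DRY_RUN=1 add_analyzer` first prints the four requests without touching the cluster. Since the real run only succeeded here once the cluster was green, checking health before calling it seems prudent.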

No idea what could be the reason for this, but I found this post that
may refer to a similar issue, as it mentions some DELETE executions
and we run a job every night that executes a Delete_By_Query request
which deletes hundreds or thousands of records:
http://groups.google.com/group/elasticsearch/browse_frm/thread/5b83f8ad5d02fb84/49c7ab8c080c4241?lnk=gst&q=hang#49c7ab8c080c4241

Hope this helps to find the root of the issue.

Thanks for your patience if you've read till this point :)

Frederic


(Shay Banon) #2

Strange…, was there anything in the logs while the request was blocked? Any failures?

On Tuesday, January 31, 2012 at 3:07 PM, Frederic wrote:


(Frederic) #3

I've updated the gist with the output of one of the nodes. It's cut as
some exceptions (especially 'UnresolvedAddressException') repeat a
lot. This may be expected as the cluster discovery module is
configured in 'unicast' mode and all 6 servers are defined in the list
of reachable servers, so I assume those were logged when I started to
stop servers.

If there is anything else I can help with, please let me know. Apart
from these issues the system is doing great.

Thanks,

On 31 Jan, 14:46, Shay Banon kim...@gmail.com wrote:

Strange…, was there anything in the logs while the request was blocked? Any failures?


(Shay Banon) #4

It seems like there is a problem in resolving an address (probably from host name to IP) in your system...
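
One quick way to test this theory is to try resolving every host listed in the unicast discovery setting, from each node in turn. The following is a rough sketch under a few assumptions: the host names in the usage line are placeholders, `check_hosts` and `RESOLVE_CMD` are made up for this example, and the default lookup relies on `getent` being installed (it ships with glibc on most Linux systems).

```shell
# check_hosts HOST...: try to resolve each host name and report the result.
# Returns non-zero if at least one name could not be resolved.
# RESOLVE_CMD can override the lookup command (default: getent hosts).
check_hosts() {
  failed=0
  for host in "$@"; do
    if ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
      echo "ok $host"
    else
      echo "UNRESOLVED $host"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `check_hosts es-node1 es-node2 es-node3` run on every node should print `ok` for each entry. Any `UNRESOLVED` line would point at the kind of name resolution problem suspected here.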


(Frederic) #5

Strange, but it is probably because we are hosting our apps in a brand
new private cloud. I'll check it out with Tech Ops.
Thanks a lot Kimchy,

Frederic

On 1 Feb, 06:55, Shay Banon kim...@gmail.com wrote:

It seems like there is a problem in resolving an address (probably from host name to IP) in your system...

