Unexpected ES shut down - nothing there in logs to identify the problem


(jagdeep) #1

I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.

Server configuration

OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB

ES configuration

Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g

Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Data node state

{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}

This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are

[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started

I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.

Please suggest how can I hunt this down.


(Shay Banon) #2

I assume you look at the logs and oyu don't see any "shutdown" / closing
messages. This can happen for several reasons (I've see all of those):

  • an external process is calling "kill -9": Tricky to find, you need to
    find the process that does that...
  • the service wrapper (if you use it) kills the process. Check the service
    wrapper logs.
  • The JVM crashed. Check for jvm crash logs (hs_err files).

On Wed, Jun 20, 2012 at 12:53 PM, jagdeep reach.jagdeep@gmail.com wrote:

I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.

Server configuration

OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB

ES configuration

Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g

Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Data node state

{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}

This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are

[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started

I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.

Please suggest how can I hunt this down.


(jagdeep) #3

Thanks Shay for the response.

  1. I stopped using service wrapper to track this problem but its still
    persisting.
  2. Will definitely check JVM crash files.
  3. It dumps the memory before getting crashed, heap dump of size 21-26
    GB. I can understand the heap dump at the time of faceting or
    indexing but at this time there is no activity what so ever.

Any pointer towards this heap dump?

On Jun 21, 12:33 pm, Shay Banon kim...@gmail.com wrote:

I assume you look at the logs and oyu don't see any "shutdown" / closing
messages. This can happen for several reasons (I've see all of those):

  • an external process is calling "kill -9": Tricky to find, you need to
    find the process that does that...
  • the service wrapper (if you use it) kills the process. Check the service
    wrapper logs.
  • The JVM crashed. Check for jvm crash logs (hs_err files).

On Wed, Jun 20, 2012 at 12:53 PM, jagdeep reach.jagd...@gmail.com wrote:

I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.

Server configuration

OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB

ES configuration

Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g

Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Data node state

{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}

This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are

[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started

I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.

Please suggest how can I hunt this down.


(Otis Gospodnetić) #4

This smells like OOM (which doesn't always seem to get logged).

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Thursday, June 21, 2012 3:50:22 AM UTC-4, jagdeep wrote:

Thanks Shay for the response.

  1. I stopped using service wrapper to track this problem but its still
    persisting.
  2. Will definitely check JVM crash files.
  3. It dumps the memory before getting crashed, heap dump of size 21-26
    GB. I can understand the heap dump at the time of faceting or
    indexing but at this time there is no activity what so ever.

Any pointer towards this heap dump?

On Jun 21, 12:33 pm, Shay Banon kim...@gmail.com wrote:

I assume you look at the logs and oyu don't see any "shutdown" / closing
messages. This can happen for several reasons (I've see all of those):

  • an external process is calling "kill -9": Tricky to find, you need to
    find the process that does that...
  • the service wrapper (if you use it) kills the process. Check the
    service
    wrapper logs.
  • The JVM crashed. Check for jvm crash logs (hs_err files).

On Wed, Jun 20, 2012 at 12:53 PM, jagdeep reach.jagd...@gmail.com
wrote:

I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.

Server configuration

OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB

ES configuration

Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g

Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Data node state

{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}

This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are

[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started

I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.

Please suggest how can I hunt this down.


(jagdeep) #5

Hey Otis thanks.. I have dealt with OOM earlier, this does not seem
oom because cluster is completely ideal when this is happening. I am
keeping a track on JVM crash as suggested by Shay.

On Jun 23, 12:50 am, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:

This smells like OOM (which doesn't always seem to get logged).

Otis

Search Analytics -http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring -http://sematext.com/spm/index.html

On Thursday, June 21, 2012 3:50:22 AM UTC-4, jagdeep wrote:

Thanks Shay for the response.

  1. I stopped using service wrapper to track this problem but its still
    persisting.
  2. Will definitely check JVM crash files.
  3. It dumps the memory before getting crashed, heap dump of size 21-26
    GB. I can understand the heap dump at the time of faceting or
    indexing but at this time there is no activity what so ever.

Any pointer towards this heap dump?

On Jun 21, 12:33 pm, Shay Banon kim...@gmail.com wrote:

I assume you look at the logs and oyu don't see any "shutdown" / closing
messages. This can happen for several reasons (I've see all of those):

  • an external process is calling "kill -9": Tricky to find, you need to
    find the process that does that...
  • the service wrapper (if you use it) kills the process. Check the
    service
    wrapper logs.
  • The JVM crashed. Check for jvm crash logs (hs_err files).

On Wed, Jun 20, 2012 at 12:53 PM, jagdeep reach.jagd...@gmail.com
wrote:

I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.

Server configuration

OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB

ES configuration

Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g

Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Data node state

{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}

This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are

[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started

I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.

Please suggest how can I hunt this down.


(jagdeep) #6

Hi Shay,

I have observed one more thing.. Earlier this crash was happening
after 24-48 hours. 3 indexes had replica count of 1 whereas other
indexes had replica count of 0. I configured only one data node for
this cluster hence cluster state was yellow. I was ignoring this for a
while.
3 days before I changed replica count to 0 and cluster state turned
green. Its been 72 hours and system is still up and running. I am
keeping a close eye on this cluster.
As per my understanding this is normal and expected behavior of ES and
dont sense yellow state or replica count behind this crash. Please
share any pointer towards this observation.

Thanks and regards
Jagdeep

On Jun 21, 12:33 pm, Shay Banon kim...@gmail.com wrote:

I assume you look at the logs and oyu don't see any "shutdown" / closing
messages. This can happen for several reasons (I've see all of those):

  • an external process is calling "kill -9": Tricky to find, you need to
    find the process that does that...
  • the service wrapper (if you use it) kills the process. Check the service
    wrapper logs.
  • The JVM crashed. Check for jvm crash logs (hs_err files).

On Wed, Jun 20, 2012 at 12:53 PM, jagdeep reach.jagd...@gmail.com wrote:

I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.

Server configuration

OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB

ES configuration

Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g

Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Data node state

{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}

This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are

[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started

I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.

Please suggest how can I hunt this down.


(Shay Banon) #7

Heya, no, yellow state would not lead to a crash of elasticsearch under no
circumstances...

On Sun, Jun 24, 2012 at 6:17 PM, jagdeep reach.jagdeep@gmail.com wrote:

Hi Shay,

I have observed one more thing.. Earlier this crash was happening
after 24-48 hours. 3 indexes had replica count of 1 whereas other
indexes had replica count of 0. I configured only one data node for
this cluster hence cluster state was yellow. I was ignoring this for a
while.
3 days before I changed replica count to 0 and cluster state turned
green. Its been 72 hours and system is still up and running. I am
keeping a close eye on this cluster.
As per my understanding this is normal and expected behavior of ES and
dont sense yellow state or replica count behind this crash. Please
share any pointer towards this observation.

Thanks and regards
Jagdeep

On Jun 21, 12:33 pm, Shay Banon kim...@gmail.com wrote:

I assume you look at the logs and oyu don't see any "shutdown" / closing
messages. This can happen for several reasons (I've see all of those):

  • an external process is calling "kill -9": Tricky to find, you need to
    find the process that does that...
  • the service wrapper (if you use it) kills the process. Check the
    service
    wrapper logs.
  • The JVM crashed. Check for jvm crash logs (hs_err files).

On Wed, Jun 20, 2012 at 12:53 PM, jagdeep reach.jagd...@gmail.com
wrote:

I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.

Server configuration

OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB

ES configuration

Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g

Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}

Data node state

{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}

This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are

[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler

org.elasticsearch.action.support.master.TransportMasterNodeOperationAction

$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started

I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.

Please suggest how can I hunt this down.


(system) #8