I am facing a very unexpected shutdown of ES data node. It dumps the
memory and automatically getting shut(even console is getting closed).
I have already dealt with memory related issues but this one is pretty
weird because nothing is there in the logs.
Server configuration
OS - Windows server 2008 RT standard
CPU - 24 cores
RAM - 32 GB
ES configuration
Data node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=20g
MAster node 1 - ES_MIN_MEM = 1024m
ES_MAX_MEM=5g
Cluster state is {
cluster_name: elasticsearch_****
status: green
timed_out: false
number_of_nodes: 2
number_of_data_nodes: 1
active_primary_shards: 330
active_shards: 330
relocating_shards: 0
initializing_shards: 0
unassigned_shards: 0
}
Data node state
{
name: dmz14_data_node
transport_address: inet[/***.***.*.:9301]
hostname: **********
attributes: {
tag: ************
master: false
}
indices: {
store: {
size: 21.2gb
size_in_bytes: 22845246770
}
docs: {
count: 2525295
deleted: 551707
}
indexing: {
index_total: 1207877
index_time: 21.1m
index_time_in_millis: 1266718
index_current: 0
delete_total: 0
delete_time: 0s
delete_time_in_millis: 0
delete_current: 0
}
get: {
total: 0
time: 0s
time_in_millis: 0
exists_total: 0
exists_time: 0s
exists_time_in_millis: 0
missing_total: 0
missing_time: 0s
missing_time_in_millis: 0
current: 0
}
search: {
query_total: 8075
query_time: 31.7s
query_time_in_millis: 31702
query_current: 0
fetch_total: 1242
fetch_time: 907ms
fetch_time_in_millis: 907
fetch_current: 0
}
cache: {
field_evictions: 0
field_size: 73.9mb
field_size_in_bytes: 77579988
filter_count: 217
filter_evictions: 0
filter_size: 981.7kb
filter_size_in_bytes: 1005320
}
merges: {
current: 0
current_docs: 0
current_size: 0b
current_size_in_bytes: 0
total: 2073
total_time: 28m
total_time_in_millis: 1681573
total_docs: 5233813
total_size: 20.6gb
total_size_in_bytes: 22176964666
}
refresh: {
total: 18970
total_time: 22.8m
total_time_in_millis: 1373282
}
flush: {
total: 4072
total_time: 1.6m
total_time_in_millis: 98095
}
}
}
This cluster works fine with search, indexing, facteing etc but after
some irregular interval(between 24-48hrs) data node is getting shut.
Surprisingly there is no activity on the cluster when this happens, i
mean no search and indexing. Logs are also pretty much clear.
Logs when system is crashed and after manual restart of the cluster
are
[2012-06-20 08:00:04,678][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188842]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@75f34e16, error [true], resetting
[2012-06-20 08:00:12,681][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188852]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@799b0299, error [true], resetting
[2012-06-20 08:00:21,105][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188862]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@f75cb00, error [true], resetting
[2012-06-20 08:00:30,200][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188873]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@1ffcf6e7, error [true], resetting
[2012-06-20 08:00:39,544][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188884]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@5892644d, error [true], resetting
[2012-06-20 08:00:49,076][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188896]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@557e75f5, error [true], resetting
[2012-06-20 08:00:59,341][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188908]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@21dadca4, error [true], resetting
[2012-06-20 08:01:09,434][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188920]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@10dbd700, error [true], resetting
[2012-06-20 08:01:17,249][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188930]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@35cbd34b, error [true], resetting
[2012-06-20 08:01:23,942][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188938]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@a2a8fed, error [true], resetting
[2012-06-20 08:01:30,307][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188947]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@62411fb4, error [true], resetting
[2012-06-20 08:01:39,292][WARN ][transport.netty ]
[dmz14_data_node] Message not fully read (response) for [188957]
handler
org.elasticsearch.action.support.master.TransportMasterNodeOperationAction
$4@36ea5908, error [true], resetting
[2012-06-20 10:08:20,126][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initializing ...
[2012-06-20 10:08:20,194][INFO ][plugins ]
[dmz14_data_node] loaded [es-river-rss-oe, river-rss, lang-
javascript], sites [head]
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: initialized
[2012-06-20 10:08:25,750][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: starting ...
[2012-06-20 10:08:26,176][INFO ][transport ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/192.168.150.14:9301]}
[2012-06-20 10:08:56,190][WARN ][discovery ]
[dmz14_data_node] waited for 30s and no initial state was set by the
discovery
[2012-06-20 10:08:56,191][INFO ][discovery ]
[dmz14_data_node] elasticsearch_dmz/WXMk_K-VSaulKWfPYcAHXQ
[2012-06-20 10:08:56,403][INFO ][http ]
[dmz14_data_node] bound_address {inet[/0:0:0:0:0:0:0:0:8081]},
publish_address {inet[/192.168.150.14:8081]}
[2012-06-20 10:08:56,404][INFO ][node ]
[dmz14_data_node] {0.19.2}[16084]: started
I am working on ES for quite some time now and dealt with many issues
sometime by tuning my setting and analyzing logs to identify the
problem but this time I am clueless as there is no error in the logs.
Please suggest how can I hunt this down.