Hi,
My ES cluster works great for a couple of days and then becomes virtually unusable: it refuses to accept data from Logstash and queries become extremely slow. Optimizing the indices doesn't seem to help, but pruning old log entries (keeping only the last 3-4 days) makes it work again. The document count is usually around 50-60 million when it stops working. Any idea what's going on?
I have a classic ELK setup with Elasticsearch 1.5.2, Logstash 1.5 and Kibana 4, on 4 nodes: 3 ES data nodes and 1 non-data ES node that also runs Logstash and Kibana.
ES config (from ccdlog04, the non-data node):
node.name: ccdlog04
index.number_of_replicas: 2
discovery.zen.ping.unicast.hosts: ["ccdlog01","ccdlog02","ccdlog03","ccdlog04"]
node.master: true
node.data: false
http.cors.enabled: true
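For reference, here is a rough sketch of how many shard copies this config implies. It assumes the ES 1.x default of 5 primary shards per index and one Logstash index per day (the default logstash-YYYY.MM.dd pattern). I have not confirmed that either default applies here, and the function name is made up for illustration.

```python
def total_shard_copies(primaries, replicas, index_count):
    """Primary shards plus all replica copies across a number of indices."""
    return primaries * (1 + replicas) * index_count

# Assumed values, not taken from the cluster itself:
PRIMARIES_PER_INDEX = 5   # ES 1.x default for index.number_of_shards
REPLICAS = 2              # from index.number_of_replicas above
DATA_NODES = 3            # the three ES data nodes

per_day = total_shard_copies(PRIMARIES_PER_INDEX, REPLICAS, 1)
four_days = total_shard_copies(PRIMARIES_PER_INDEX, REPLICAS, 4)
per_node_four_days = four_days // DATA_NODES

print(per_day, four_days, per_node_four_days)
```

With 2 replicas and 3 data nodes, every data node ends up holding a copy of every shard, so under these assumptions the shard count per node grows by 5 for each day of logs kept.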
Excerpts from LS config (in case it's the timestamps):
grok {
  match => [
    "message", "%{YEAR:time}%{MONTHNUM:time}%{MONTHDAY:time} %{TIME:time}...GREEDYDATA:logMessage}"
...
mutate {
  add_field => [ "app_timestamp", "%{[time][0]}-%{[time][1]}-%{[time][2]}T%{[time][3]}Z" ]
}
date {
  match => ["app_timestamp","ISO8601"]
}
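To make the timestamp handling concrete, here is a minimal Python sketch of what the mutate/date pair above is meant to do. It assumes the four grok captures end up in a list in the order year, month, day, time. The function names and sample values are hypothetical, and it only handles whole seconds, whereas the TIME pattern can also match fractional seconds.

```python
from datetime import datetime, timezone

def build_app_timestamp(time_parts):
    """Join [year, month, day, time] the same way the add_field template does."""
    year, month, day, clock = time_parts
    return f"{year}-{month}-{day}T{clock}Z"

def parse_app_timestamp(value):
    """Parse the joined string as a UTC datetime (whole seconds only)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

# Hypothetical capture values, as if grok had matched "20150526 05:27:27 ..."
sample_parts = ["2015", "05", "26", "05:27:27"]
stamp = build_app_timestamp(sample_parts)
parsed = parse_app_timestamp(stamp)
print(stamp, parsed.isoformat())
```

If any of the four captures is missing or out of order, the joined string no longer parses as a date, which is the failure mode I wanted to rule out.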
LS log example of what happens after a couple of days:
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}