Cluster deteriorates after a couple of days

Hi,

My ES cluster works great for a couple of days and then becomes virtually unusable: it refuses to accept data from Logstash and queries become extremely slow. Optimizing the indices doesn't seem to help, but pruning old log entries (keeping 3-4 days) makes it work again. The document count is usually around 50-60 million when it stops working. Any idea what's going on?

I have a classic ELK setup with Elasticsearch 1.5.2, Logstash 1.5 and Kibana 4 on 4 nodes: 3 ES data nodes and 1 non-data ES node that also runs Logstash and Kibana.

ES config:
node.name: ccdlog04
index.number_of_replicas: 2
discovery.zen.ping.unicast.hosts: ["ccdlog01","ccdlog02","ccdlog03","ccdlog04"]
node.master: true
node.data: false
http.cors.enabled: true

Excerpts from LS config (in case it's the timestamps):

grok {
  match => [
    "message", "%{YEAR:time}%{MONTHNUM:time}%{MONTHDAY:time} %{TIME:time}...%{GREEDYDATA:logMessage}"
  ]
  ...
}
mutate {
  add_field => [ "app_timestamp", "%{[time][0]}-%{[time][1]}-%{[time][2]}T%{[time][3]}Z" ]
}
date {
  match => [ "app_timestamp", "ISO8601" ]
}
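
To illustrate what the filter chain does (assuming the application writes timestamps like 20150526 05:27:27; the full grok pattern is elided above):

time          => ["2015", "05", "26", "05:27:27"]   # the four grok captures all named "time" become an array
app_timestamp => "2015-05-26T05:27:27Z"             # assembled from that array by the mutate
@timestamp    => set by the date filter, which parses app_timestamp as ISO8601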

LS log example of what happens after a couple of days:

{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}

What's in the Elasticsearch logs? I suspect that you're running out of heap. How big is your heap?
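
You can check per-node heap usage with the nodes stats API, for example (assuming Elasticsearch is reachable on localhost:9200; adjust host and port for your setup):

curl 'http://localhost:9200/_nodes/stats/jvm?pretty'

Look at jvm.mem.heap_used_percent for each node; values that sit in the 90s, together with long GC collection times, usually mean the heap is exhausted.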

I am getting the same problem right now. Elasticsearch doesn't index any logs and I keep getting the same 503 response code as above.

I actually found the answer in https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html. ES will keep all the field data in memory until it chokes unless you control it. Configuring indices.fielddata.cache.size solved it for me.

Where do I find this and how do I control it?

indices.fielddata.cache.size is a configuration parameter that you can set in elasticsearch.yml.
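
For example, on each data node (a sketch; the 40% value is only an illustration, and the node has to be restarted for the setting to take effect):

# elasticsearch.yml
indices.fielddata.cache.size: 40%    # evict old fielddata entries instead of letting them fill the heap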

I have done this and it still isn't really working. I set it to 40%. Yesterday I managed to index a few old logs and I'm guessing that became too much. The last log entry was indexed into Elasticsearch yesterday at 17:06; after that it just stopped indexing.

Is there a way to offload some of the memory burden by storing on disk?

@steverobbins, unless this is directly related to this old thread, perhaps you could start a thread of your own? There's no silver bullet for reducing the memory pressure.
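
That said, it is worth confirming whether fielddata is actually what's filling the heap. Two quick checks, assuming Elasticsearch is reachable on localhost:9200:

curl 'http://localhost:9200/_cat/fielddata?v'
curl 'http://localhost:9200/_nodes/stats/indices/fielddata?fields=*&pretty'

If fielddata stays well under the configured limit but the nodes still reject bulk requests with 503s, the memory pressure is coming from somewhere else (indexing buffers, merges, segment memory), and the Elasticsearch logs should say more.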