Cluster deteriorates after a couple of days

helium · May 26, 2015, 7:49am

Hi,

My ES cluster works great for a couple of days and then becomes virtually unusable. It refuses to accept data from Logstash, and queries become extremely slow. Optimizing the indices doesn't seem to help, but pruning old log entries (keeping 3-4 days) makes it work again. The document count is usually around 50-60 million when it stops working. Any idea what's going on?

I have a classic ELK setup with Elasticsearch 1.5.2, Logstash 1.5 and Kibana 4. There are 4 nodes: 3 ES data nodes and 1 ES non-data node that also runs Logstash and Kibana.

ES config:
node.name: ccdlog04
index.number_of_replicas: 2
discovery.zen.ping.unicast.hosts: ["ccdlog01","ccdlog02","ccdlog03","ccdlog04"]
node.master: true
node.data: false
http.cors.enabled: true

Excerpts from LS config (in case it's the timestamps):
grok { match => [ "message", "%{YEAR:time}%{MONTHNUM:time}%{MONTHDAY:time} %{TIME:time}...GREEDYDATA:logMessage}" ...
mutate { add_field => [ "app_timestamp", "%{[time][0]}-%{[time][1]}-%{[time][2]}T%{[time][3]}Z" ] }
date { match => ["app_timestamp","ISO8601"] }

LS log example of what happens after a couple of days:
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.381000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.382000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-05-26T05:27:27.383000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn}

magnusbaeck · May 26, 2015, 8:20am

What's in the Elasticsearch logs? I suspect that you're running out of heap. How big is your heap?

simonrisberg · July 3, 2015, 7:26am

I am getting the same problem right now. Nothing gets indexed, and I keep getting the same 503 response code as the original poster.

helium · July 3, 2015, 7:53am

I actually found the answer in https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html. Unless you limit it, ES keeps all field data in memory until it chokes. Configuring indices.fielddata.cache.size solved it for me.

simonrisberg · July 3, 2015, 8:55am

Where do I find this and how do I control it?

magnusbaeck · July 3, 2015, 10:51am

indices.fielddata.cache.size is a configuration parameter that you can set in elasticsearch.yml.
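As a rough sketch, the entry could look like the fragment below. The 40% value is only an illustration (it is the figure mentioned later in this thread), not a recommendation, and I am assuming the setting is applied on each data node and picked up on a node restart.

```yaml
# elasticsearch.yml on each data node (assumed location; adjust to your install)
# Caps the fielddata cache so that old entries are evicted instead of
# growing until the heap is exhausted. 40% is an illustrative value only.
indices.fielddata.cache.size: 40%
```

The value can be a percentage of the heap, as here, or an absolute size such as 4gb.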

simonrisberg · July 3, 2015, 10:58am

I have done this, and it still isn't really working. I set it to 40%. Yesterday I managed to index a few old logs, and I'm guessing it became too much. The last log that was indexed into Elasticsearch was indexed yesterday at 17:06. After that it just stopped indexing.

steverobbins · November 16, 2015, 9:13pm

Is there a way to offload some of the memory burden by storing the data on disk?

magnusbaeck · November 16, 2015, 9:38pm

@steverobbins, unless this is directly related to this old thread, perhaps you can start a thread of your own? There's no silver bullet for reducing the memory pressure.
