Cluster going down because of garbage collector

dihpierrick · February 28, 2019, 11:24am

Hello,

We just had an issue on our cluster:
1 Coordinator Nodes
3 Master Eligible Nodes
Version 6.1.2

Yesterday, at around 4 PM, our cluster stopped working properly.
In the log files of our coordinator node we had:

[2019-02-27T17:46:51,620][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}

It looked as though Logstash was unable to connect to the Elasticsearch data nodes.

On the nodes we had this:

The garbage collector was taking a lot of time and resources, which is why the Logstash output wasn't working.
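For anyone hitting the same symptom, here is a minimal sketch of how GC pressure could be checked per node. It assumes you have already fetched and parsed the body of `GET _nodes/stats/jvm`; the helper name `gc_summary` and the sample values are made up for illustration.

```python
def gc_summary(node_stats):
    """Summarise heap usage and old-generation GC activity per node.

    `node_stats` is the parsed JSON body of GET _nodes/stats/jvm.
    """
    summary = {}
    for node_id, node in node_stats["nodes"].items():
        jvm = node["jvm"]
        old_gc = jvm["gc"]["collectors"]["old"]
        summary[node.get("name", node_id)] = {
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "old_gc_count": old_gc["collection_count"],
            "old_gc_seconds": old_gc["collection_time_in_millis"] / 1000.0,
        }
    return summary


# Invented sample, shaped like a real response; the numbers are NOT from this cluster.
sample = {
    "nodes": {
        "abc123": {
            "name": "data-1",
            "jvm": {
                "mem": {"heap_used_percent": 97},
                "gc": {"collectors": {
                    "young": {"collection_count": 5000,
                              "collection_time_in_millis": 90000},
                    "old": {"collection_count": 420,
                            "collection_time_in_millis": 1260000},
                }},
            },
        }
    }
}

for name, stats in gc_summary(sample).items():
    print(name, stats)
```

A node whose heap stays near 100% while the old-generation count and time keep climbing between two calls is spending most of its time collecting rather than serving requests.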

We rebooted all the data nodes one by one, and everything went back to normal after that.

At around 4 PM I was running queries in Kibana and Timelion, and I may have selected a large time range (many days). Do you think that could be the root cause?
It sounds like a Java memory leak. Was it?

Regards

DavidTurner · February 28, 2019, 11:46am

Did you take a heap dump before restarting any of the nodes? Without that, we've no real way of investigating where all the memory was going.
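As a rough sketch of what taking a heap dump could look like on a JDK 8 node: the JDK ships a `jmap` tool whose `-dump:format=b,file=...` option writes a binary heap dump for a given process ID. The helper name, PID, and path below are placeholders, and the command has to be run as the same user that owns the Elasticsearch process.

```python
import subprocess


def heap_dump_command(pid, dump_path):
    """Build the jmap invocation that writes a binary heap dump for `pid`."""
    return ["jmap", f"-dump:format=b,file={dump_path}", str(pid)]


# Placeholder values: substitute the real Elasticsearch PID and a path
# on a disk with enough free space (the dump is roughly the heap size).
cmd = heap_dump_command(1234, "/tmp/es-heap.hprof")
print(" ".join(cmd))

# Uncomment to actually run it on the node:
# subprocess.run(cmd, check=True)
```

HotSpot also has the `-XX:+HeapDumpOnOutOfMemoryError` flag, but it only fires on an actual OutOfMemoryError, not when the node is merely stuck in long GC pauses.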

dihpierrick · February 28, 2019, 12:56pm

No I didn't ...

dihpierrick · March 1, 2019, 3:34pm

Do you think it could be related to one of the parameters being undersized?
JVM memory? Cache size?

Here are the cluster stats:

Cluster Stats #1

{                                                              
  "_nodes" : {                                                 
    "total" : 5,                                               
    "successful" : 5,                                          
    "failed" : 0                                               
  },                                                           
  "cluster_name" : "CLUSTERELK",                          
  "timestamp" : 1551454097176,                                 
  "status" : "green",                                          
  "indices" : {                                                
    "count" : 88,                                              
    "shards" : {                                               
      "total" : 560,                                           
      "primaries" : 280,                                       
      "replication" : 1.0,                                     
      "index" : {                                              
        "shards" : {                                           
          "min" : 2,                                           
          "max" : 10,                                          
          "avg" : 6.363636363636363                            
        },                                                     
        "primaries" : {                                        
          "min" : 1,                                           
          "max" : 5,                                           
          "avg" : 3.1818181818181817                           
        },                                                     
        "replication" : {                                      
          "min" : 1.0,                                         
          "max" : 1.0,                                         
          "avg" : 1.0                                          
        }                                                      
      }                                                        
    },                                                         
    "docs" : {                                                 
      "count" : 808754070,                                     
      "deleted" : 48908                                        
    },                                                         
    "store" : {                                                
      "size" : "1.3tb",                                        
      "size_in_bytes" : 1504137708555                          
    },                                                         
    "fielddata" : {                                            
      "memory_size" : "104.8kb",                               
      "memory_size_in_bytes" : 107384,                         
      "evictions" : 0                                          
    },                                                         
    "query_cache" : {                                          
      "memory_size" : "4.9mb",                                 
      "memory_size_in_bytes" : 5225918,                        
      "total_count" : 1511704,                                 
      "hit_count" : 1133855,                                   
      "miss_count" : 377849,                                   
      "cache_size" : 2857,                                     
      "cache_count" : 8658,                                    
      "evictions" : 5801                                       
    },                                                         
    "completion" : {                                           
      "size" : "0b",                                           
      "size_in_bytes" : 0                                      
    },                                                         
    "segments" : {                                             
      "count" : 6995,                                          
      "memory" : "1.9gb",                                      
      "memory_in_bytes" : 2114440236,                          
      "terms_memory" : "1.1gb",                                
      "terms_memory_in_bytes" : 1240524681,                    
      "stored_fields_memory" : "625.5mb",                      
      "stored_fields_memory_in_bytes" : 655913392,             
      "term_vectors_memory" : "0b",                            
      "term_vectors_memory_in_bytes" : 0,                      
      "norms_memory" : "41.3mb",                               
      "norms_memory_in_bytes" : 43348544,                      
      "points_memory" : "133.1mb",                             
      "points_memory_in_bytes" : 139569895,                    
      "doc_values_memory" : "33.4mb",                          
      "doc_values_memory_in_bytes" : 35083724,                 
      "index_writer_memory" : "79mb",                          
      "index_writer_memory_in_bytes" : 82875984,               
      "version_map_memory" : "226.2kb",                        
      "version_map_memory_in_bytes" : 231675,                  
      "fixed_bit_set" : "2.9mb",                               
      "fixed_bit_set_memory_in_bytes" : 3093264,               
      "max_unsafe_auto_id_timestamp" : 1551398503583,          
      "file_sizes" : { }                                       
    }                                                          
  },

dihpierrick · March 1, 2019, 3:34pm

The rest:

Cluster Stats #2

  "nodes" : {
    "count" : {
      "total" : 5,
      "data" : 3,
      "coordinating_only" : 2,
      "master" : 3,
      "ingest" : 3
    },
    "versions" : [
      "6.1.2"
    ],
    "os" : {
      "available_processors" : 20,
      "allocated_processors" : 20,
      "names" : [
        {
          "name" : "Linux",
          "count" : 5
        }
      ],
      "mem" : {
        "total" : "38.9gb",
        "total_in_bytes" : 41856786432,
        "free" : "2.6gb",
        "free_in_bytes" : 2846814208,
        "used" : "36.3gb",
        "used_in_bytes" : 39009972224,
        "free_percent" : 7,
        "used_percent" : 93
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 33
      },
      "open_file_descriptors" : {
        "min" : 307,
        "max" : 896,
        "avg" : 614
      }
    },
    "jvm" : {
      "max_uptime" : "1.3d",
      "max_uptime_in_millis" : 113352480,
      "versions" : [
        {
          "version" : "1.8.0_191",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.191-b12",
          "vm_vendor" : "Oracle Corporation",
          "count" : 2
        },
        {
          "version" : "1.8.0_151",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.151-b12",
          "vm_vendor" : "Oracle Corporation",
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used" : "7.8gb",
        "heap_used_in_bytes" : 8387071480,
        "heap_max" : "19.8gb",
        "heap_max_in_bytes" : 21300510720
      },
      "threads" : 309
    },
    "fs" : {
      "total" : "2tb",
      "total_in_bytes" : 2289633509376,
      "free" : "709.6gb",
      "free_in_bytes" : 761997557760,
      "available" : "601.2gb",
      "available_in_bytes" : 645572706304
    },
    "plugins" : [
      {
        "name" : "x-pack",
        "version" : "6.1.2",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin",
        "has_native_controller" : true,
        "requires_keystore" : true
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 5
      },
      "http_types" : {
        "netty4" : 5
      }
    }
  }
}

We have since increased the memory on the data nodes, so it was lower when the problem occurred.
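To put the JVM numbers from the stats in perspective, here is a small sketch of the arithmetic. The byte values are copied from the `jvm.mem` and `nodes.count` sections of the output; the variable names are mine, and note that the per-node figure is only an average over all 5 nodes, so it says nothing about how heap is split between data and coordinating nodes.

```python
# Values copied from the cluster stats output (post-upgrade snapshot).
heap_used_in_bytes = 8387071480
heap_max_in_bytes = 21300510720
node_count = 5  # 3 data/master nodes + 2 coordinating-only nodes

GIB = 1024 ** 3

# Share of the configured heap in use across the whole cluster.
heap_used_pct = 100.0 * heap_used_in_bytes / heap_max_in_bytes

# Average configured heap per node (an average only; roles may differ).
avg_heap_per_node_gib = heap_max_in_bytes / node_count / GIB

print(f"cluster heap in use: {heap_used_pct:.1f}%")
print(f"average heap per node: {avg_heap_per_node_gib:.2f} GiB")
```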

Thx

DavidTurner · March 1, 2019, 3:52pm

It's certainly possible that you executed a query that overloaded the cluster, and also possible that giving your nodes more heap will let them cope with the situation better.

How many shards do you have in this cluster?

dihpierrick · March 1, 2019, 3:56pm

560 (primary + replica)

DavidTurner · March 1, 2019, 4:06pm

I did wonder if there was perhaps an excess of shards, but 560 seems reasonable for a cluster with 38GB of heap.

Sorry, without something like a heap dump we can only really speculate on what was consuming all the memory. If it happens again, grab one before restarting the nodes.

dihpierrick · March 1, 2019, 4:14pm

Well, we didn't have that much heap memory when it happened, and I guess that figure takes into account the 2 coordinator nodes, doesn't it?

We have only three data nodes, and each one had only 2 GB of heap, so 6 GB across the 3 nodes when the problem occurred.

We have now upgraded to 4 GB per data node.

Our initial cluster sizing was probably too small.

I will for sure!
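For reference, a quick sketch of the shards-per-heap arithmetic using the numbers from this thread (560 shards, 3 data nodes, 2 GB then 4 GB of heap each). The threshold of roughly 20 shards per GB of heap is a commonly quoted Elastic sizing rule of thumb, not something stated in this thread, so treat it as an assumption; the function name is made up.

```python
def shards_per_gb_heap(total_shards, data_nodes, heap_gb_per_node):
    """Shards per GB of data-node heap, assuming shards are spread evenly."""
    return total_shards / (data_nodes * heap_gb_per_node)


# Assumed guideline (rule of thumb, not a hard limit).
RULE_OF_THUMB = 20

before = shards_per_gb_heap(560, 3, 2)  # when the problem occurred
after = shards_per_gb_heap(560, 3, 4)   # after the upgrade

for label, ratio in (("2 GB per data node", before), ("4 GB per data node", after)):
    status = "above" if ratio > RULE_OF_THUMB else "within"
    print(f"{label}: {ratio:.1f} shards per GB heap ({status} the guideline)")
```

By that yardstick the data nodes were carrying several times the suggested shard count per GB of heap at the time of the incident, and doubling the heap halves the ratio but still leaves it above the guideline.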

system · March 29, 2019, 4:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
