Cluster going down because of garbage collector

Hello,

We just had an issue on our cluster:
1 Coordinator Node
3 Master Eligible Nodes
Version 6.1.2

Yesterday, around 16:00, our cluster stopped working properly.
In the log files of our coordinator node we had:

[2019-02-27T17:46:51,620][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}

It looked like Logstash wasn't able to connect to the Elasticsearch data nodes.

On the nodes we had this:


The garbage collector was taking a lot of time and resources, which is why the Logstash output wasn't working.
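
One way to confirm this kind of heap pressure is to look at per-node JVM figures from `GET _nodes/stats/jvm`. The sketch below is only an illustration: it assumes the response has already been fetched and parsed into a dict, the node names in the sample are made up, and the 85% threshold is an arbitrary choice rather than anything taken from this cluster.

```python
from typing import Dict, List


def nodes_under_heap_pressure(nodes_stats: Dict, threshold_percent: int = 85) -> List[str]:
    """Return the names of nodes whose JVM heap usage is at or above the threshold."""
    flagged = []
    for node in nodes_stats.get("nodes", {}).values():
        heap_percent = node["jvm"]["mem"]["heap_used_percent"]
        if heap_percent >= threshold_percent:
            flagged.append(node["name"])
    return sorted(flagged)


# Hypothetical sample, shaped like a trimmed GET _nodes/stats/jvm response.
sample = {
    "nodes": {
        "id1": {"name": "data-1", "jvm": {"mem": {"heap_used_percent": 97}}},
        "id2": {"name": "data-2", "jvm": {"mem": {"heap_used_percent": 42}}},
    }
}

print(nodes_under_heap_pressure(sample))
```

A node that stays near the top of its heap while the garbage collector runs continuously would be consistent with the symptoms described here.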

We rebooted all the data nodes one by one, and everything returned to normal after that.

At 16:00 I was running queries in Kibana and Timelion. I might have chosen a large time range (many days). Do you think that could be the root cause?
It sounds like it was a Java memory leak. Was it?

Regards

Did you take a heap dump before restarting any of the nodes? Without that, we've no real way of investigating where all the memory was going.

No I didn't ...

Do you think it could be related to one of the parameters being sized too small?
JVM memory? Cache size?

Here are the cluster stats:

Cluster Stats #1
{                                                              
  "_nodes" : {                                                 
    "total" : 5,                                               
    "successful" : 5,                                          
    "failed" : 0                                               
  },                                                           
  "cluster_name" : "CLUSTERELK",                          
  "timestamp" : 1551454097176,                                 
  "status" : "green",                                          
  "indices" : {                                                
    "count" : 88,                                              
    "shards" : {                                               
      "total" : 560,                                           
      "primaries" : 280,                                       
      "replication" : 1.0,                                     
      "index" : {                                              
        "shards" : {                                           
          "min" : 2,                                           
          "max" : 10,                                          
          "avg" : 6.363636363636363                            
        },                                                     
        "primaries" : {                                        
          "min" : 1,                                           
          "max" : 5,                                           
          "avg" : 3.1818181818181817                           
        },                                                     
        "replication" : {                                      
          "min" : 1.0,                                         
          "max" : 1.0,                                         
          "avg" : 1.0                                          
        }                                                      
      }                                                        
    },                                                         
    "docs" : {                                                 
      "count" : 808754070,                                     
      "deleted" : 48908                                        
    },                                                         
    "store" : {                                                
      "size" : "1.3tb",                                        
      "size_in_bytes" : 1504137708555                          
    },                                                         
    "fielddata" : {                                            
      "memory_size" : "104.8kb",                               
      "memory_size_in_bytes" : 107384,                         
      "evictions" : 0                                          
    },                                                         
    "query_cache" : {                                          
      "memory_size" : "4.9mb",                                 
      "memory_size_in_bytes" : 5225918,                        
      "total_count" : 1511704,                                 
      "hit_count" : 1133855,                                   
      "miss_count" : 377849,                                   
      "cache_size" : 2857,                                     
      "cache_count" : 8658,                                    
      "evictions" : 5801                                       
    },                                                         
    "completion" : {                                           
      "size" : "0b",                                           
      "size_in_bytes" : 0                                      
    },                                                         
    "segments" : {                                             
      "count" : 6995,                                          
      "memory" : "1.9gb",                                      
      "memory_in_bytes" : 2114440236,                          
      "terms_memory" : "1.1gb",                                
      "terms_memory_in_bytes" : 1240524681,                    
      "stored_fields_memory" : "625.5mb",                      
      "stored_fields_memory_in_bytes" : 655913392,             
      "term_vectors_memory" : "0b",                            
      "term_vectors_memory_in_bytes" : 0,                      
      "norms_memory" : "41.3mb",                               
      "norms_memory_in_bytes" : 43348544,                      
      "points_memory" : "133.1mb",                             
      "points_memory_in_bytes" : 139569895,                    
      "doc_values_memory" : "33.4mb",                          
      "doc_values_memory_in_bytes" : 35083724,                 
      "index_writer_memory" : "79mb",                          
      "index_writer_memory_in_bytes" : 82875984,               
      "version_map_memory" : "226.2kb",                        
      "version_map_memory_in_bytes" : 231675,                  
      "fixed_bit_set" : "2.9mb",                               
      "fixed_bit_set_memory_in_bytes" : 3093264,               
      "max_unsafe_auto_id_timestamp" : 1551398503583,          
      "file_sizes" : { }                                       
    }                                                          
  },                                                           

The rest of the output:

Cluster Stats #2
  "nodes" : {
    "count" : {
      "total" : 5,
      "data" : 3,
      "coordinating_only" : 2,
      "master" : 3,
      "ingest" : 3
    },
    "versions" : [
      "6.1.2"
    ],
    "os" : {
      "available_processors" : 20,
      "allocated_processors" : 20,
      "names" : [
        {
          "name" : "Linux",
          "count" : 5
        }
      ],
      "mem" : {
        "total" : "38.9gb",
        "total_in_bytes" : 41856786432,
        "free" : "2.6gb",
        "free_in_bytes" : 2846814208,
        "used" : "36.3gb",
        "used_in_bytes" : 39009972224,
        "free_percent" : 7,
        "used_percent" : 93
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 33
      },
      "open_file_descriptors" : {
        "min" : 307,
        "max" : 896,
        "avg" : 614
      }
    },
    "jvm" : {
      "max_uptime" : "1.3d",
      "max_uptime_in_millis" : 113352480,
      "versions" : [
        {
          "version" : "1.8.0_191",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.191-b12",
          "vm_vendor" : "Oracle Corporation",
          "count" : 2
        },
        {
          "version" : "1.8.0_151",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.151-b12",
          "vm_vendor" : "Oracle Corporation",
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used" : "7.8gb",
        "heap_used_in_bytes" : 8387071480,
        "heap_max" : "19.8gb",
        "heap_max_in_bytes" : 21300510720
      },
      "threads" : 309
    },
    "fs" : {
      "total" : "2tb",
      "total_in_bytes" : 2289633509376,
      "free" : "709.6gb",
      "free_in_bytes" : 761997557760,
      "available" : "601.2gb",
      "available_in_bytes" : 645572706304
    },
    "plugins" : [
      {
        "name" : "x-pack",
        "version" : "6.1.2",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin",
        "has_native_controller" : true,
        "requires_keystore" : true
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 5
      },
      "http_types" : {
        "netty4" : 5
      }
    }
  }
}

We have since increased the memory on the data nodes, so the heap was smaller than this when the problem occurred.

Thx

It's certainly possible that you executed a query that overloaded the cluster, and also possible that giving your nodes more heap will let them cope with the situation better.

How many shards do you have in this cluster?

560 (primary + replica)

I did wonder if there was perhaps an excess of shards, but 560 seems reasonable for a cluster with 38GB of heap.

Sorry, without something like a heap dump we can only really speculate on what was consuming all the memory. If it happens again, grab one before restarting the nodes.
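
For reference, one common way to capture a heap dump from a running JVM is the JDK's `jmap` tool. The sketch below only builds the command line and does not run it. The process ID and output path are placeholders, and it assumes `jmap` from the same JDK is available on the node and is invoked as the user that owns the Elasticsearch process.

```python
from typing import List


def build_heap_dump_command(pid: int, dump_path: str, live_only: bool = True) -> List[str]:
    """Build a jmap invocation that writes a binary heap dump for the given JVM process."""
    # "live" restricts the dump to reachable objects; omit it to dump everything.
    option = "-dump:" + ("live," if live_only else "") + f"format=b,file={dump_path}"
    return ["jmap", option, str(pid)]


# Placeholder PID and path, for illustration only.
cmd = build_heap_dump_command(12345, "/tmp/es-heap.hprof")
print(" ".join(cmd))
```

The dump file can be about as large as the heap itself, so the target path needs enough free disk space.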


Well, we didn't have that much heap memory when it happened, and I guess that figure takes the 2 coordinator nodes into account, doesn't it?

We have only three data nodes, and each one had only 2 GB of heap, so 6 GB across the 3 nodes when the problem occurred.

We have now upgraded to 4 GB per data node.
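
To put those numbers in perspective, here is a rough back-of-the-envelope calculation using the shard count and segment memory from the cluster stats above. It is only a sketch: the segment memory was measured after the restart, not during the incident, and it assumes that segment memory sits on the data nodes' heap. A commonly cited rule of thumb (not from this thread) is to stay below roughly 20 shards per GB of heap.

```python
GIB = 1024 ** 3

DATA_NODES = 3
TOTAL_SHARDS = 560                  # "shards.total" from the cluster stats
SEGMENTS_MEMORY_BYTES = 2114440236  # "segments.memory_in_bytes" from the cluster stats


def summarize(heap_per_node_gb: int) -> dict:
    """Relate shard count and segment memory to the total data-node heap."""
    total_heap_gb = DATA_NODES * heap_per_node_gb
    return {
        "total_heap_gb": total_heap_gb,
        "shards_per_gb_heap": round(TOTAL_SHARDS / total_heap_gb, 1),
        "segments_share_of_heap": round(SEGMENTS_MEMORY_BYTES / (total_heap_gb * GIB), 2),
    }


print("2 GB per node:", summarize(2))
print("4 GB per node:", summarize(4))
```

With 2 GB per node this gives about 93 shards per GB of heap, with segment memory alone taking roughly a third of the total heap. Doubling the heap halves both figures.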

Our initial cluster sizing was probably too small :smiley:

I will for sure!
