Elasticsearch cluster 5.5 under heavy load

Hey,

I work on an Elasticsearch cluster running version 5.5.
About 16 physical servers hosting 60 Java instances (6 coordinators, 3 masters, and the rest data nodes).

Coordinating/data nodes: 30 GB heap per instance
Master nodes: 15 GB heap per instance

About 6 TB of data indexed per day.

Many timeouts when collecting data, some freezes (impossible to reach the cluster with curl), and many rejected searches.
Examples:

node_name     active rejected completed
node1                7165    621950
node2                7803    537168
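
(For reference, figures like these can be pulled from the cat thread pool API; the pool and column list below are just an example:

GET _cat/thread_pool/search?v&h=node_name,active,rejected,completed )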

How can we optimize this?

Thanks for help

Please upgrade as a matter of urgency, 5.X is EOL :slight_smile:

What is the output from the _cluster/stats?pretty&human endpoint?
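
For example, something like this (host, port and credentials are placeholders; drop -k and -u if you are not using TLS or authentication):

curl -sk -u admin:password 'https://localhost:9200/_cluster/stats?pretty&human'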

Ahhh yes.... But migrating from 5.5 to 7.x seems very hard (137 TB).

We see many rejections (about 40,000 on some nodes), many timeouts when collecting data and stats, and many timeouts between nodes...

Below is the output of cluster stats:

{
  "_nodes": {
    "total": 60,
    "successful": 60,
    "failed": 0
  },
  "cluster_name": "BIGDATAFRANCE_ES_5_PROD",
  "timestamp": 1601455721219,
  "status": "green",
  "indices": {
    "count": 2804,
    "shards": {
      "total": 48088,
      "primaries": 23780,
      "replication": 1.0222035323801515,
      "index": {
        "shards": {
          "min": 2,
          "max": 117,
          "avg": 17.14978601997147
        },
        "primaries": {
          "min": 1,
          "max": 39,
          "avg": 8.48074179743224
        },
        "replication": {
          "min": 0,
          "max": 14,
          "avg": 1.022111269614836
        }
      }
    },
    "docs": {
      "count": 23593672461,
      "deleted": 68470499
    },
    "store": {
      "size": "137tb",
      "size_in_bytes": 150706328929657,
      "throttle_time": "0s",
      "throttle_time_in_millis": 0
    },
    "fielddata": {
      "memory_size": "537mb",
      "memory_size_in_bytes": 563124512,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "53.8gb",
      "memory_size_in_bytes": 57868575183,
      "total_count": 1306220894,
      "hit_count": 289971069,
      "miss_count": 1016249825,
      "cache_size": 2515199,
      "cache_count": 5249389,
      "evictions": 2734190
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 734719,
      "memory": "276.7gb",
      "memory_in_bytes": 297170241934,
      "terms_memory": "225.3gb",
      "terms_memory_in_bytes": 241914757812,
      "stored_fields_memory": "33.3gb",
      "stored_fields_memory_in_bytes": 35838284328,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "3.5gb",
      "norms_memory_in_bytes": 3792822656,
      "points_memory": "6.3gb",
      "points_memory_in_bytes": 6848774010,
      "doc_values_memory": "8.1gb",
      "doc_values_memory_in_bytes": 8775603128,
      "index_writer_memory": "11.4gb",
      "index_writer_memory_in_bytes": 12295667531,
      "version_map_memory": "30.2mb",
      "version_map_memory_in_bytes": 31728195,
      "fixed_bit_set": "15.7mb",
      "fixed_bit_set_memory_in_bytes": 16558352,
      "max_unsafe_auto_id_timestamp": 1601424017942,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 60,
      "data": 50,
      "coordinating_only": 7,
      "master": 3,
      "ingest": 0
    },
    "versions": [
      "5.5.1"
    ],
    "os": {
      "available_processors": 1920,
      "allocated_processors": 1920,
      "names": [
        {
          "name": "Linux",
          "count": 60
        }
      ],
      "mem": {
        "total": "16.5tb",
        "total_in_bytes": 18240380289024,
        "free": "257.1gb",
        "free_in_bytes": 276165922816,
        "used": "16.3tb",
        "used_in_bytes": 17964214366208,
        "free_percent": 2,
        "used_percent": 98
      }
    },
    "process": {
      "cpu": {
        "percent": 192
      },
      "open_file_descriptors": {
        "min": 2007,
        "max": 4153,
        "avg": 3751
      }
    },
    "jvm": {
      "max_uptime": "6.6d",
      "max_uptime_in_millis": 573004466,
      "versions": [
        {
          "version": "1.8.0_161",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "25.161-b14",
          "vm_vendor": "Oracle Corporation",
          "count": 52
        },
        {
          "version": "1.8.0_181",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "25.181-b13",
          "vm_vendor": "Oracle Corporation",
          "count": 8
        }
      ],
      "mem": {
        "heap_used": "1.1tb",
        "heap_used_in_bytes": 1212767807704,
        "heap_max": "1.7tb",
        "heap_max_in_bytes": 1884416901120
      },
      "threads": 19016
    },
    "fs": {
      "total": "412.6tb",
      "total_in_bytes": 453696445579264,
      "free": "275.4tb",
      "free_in_bytes": 302888487100416,
      "available": "264.1tb",
      "available_in_bytes": 290421842284544,
      "spins": "true"
    },
    "plugins": [
      {
        "name": "search-guard-5",
        "version": "5.5.1-15",
        "description": "Provide access control related features for Elasticsearch 5",
        "classname": "com.floragunn.searchguard.SearchGuardPlugin",
        "has_native_controller": false
      },
      {
        "name": "x-pack",
        "version": "5.5.1",
        "description": "Elasticsearch Expanded Pack Plugin",
        "classname": "org.elasticsearch.xpack.XPackPlugin",
        "has_native_controller": true
      }
    ],
    "network_types": {
      "transport_types": {
        "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport": 60
      },
      "http_types": {
        "com.floragunn.searchguard.http.SearchGuardHttpServerTransport": 60
      }
    }
  }
}

That shard count (over 48,000 shards across 60 nodes) is probably causing quite a few issues. You need to reduce it to start.
Upgrading will also help quite a lot.

OK, I will plan a merge of indices. I hope that will help (for a few more months :slight_smile: )

48000 shards for 137TB of data gives an average shard size of less than 3GB, which is quite small in a cluster like this. I would recommend trying to reduce the shard count by a factor of 10 or so.
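
If most of these are time-based indices, one way to cut the shard count going forward is to lower the primary shard count in the index template that new indices pick up. A minimal sketch for 5.x (the template name and index pattern are made up):

PUT _template/daily_logs
{
  "template": "logs-*",
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}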

As you are on an old version, are you forcemerging old indices that are no longer written to (if any) down to a single segment per shard? This can reduce heap usage quite significantly.
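
For example, on an index that is no longer written to (the index name is a placeholder; a force merge is I/O heavy, so do one index at a time, off-peak):

POST logs-2020.08.01/_forcemerge?max_num_segments=1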

I also see you are using Search Guard. I have no experience with it, so I do not know whether it could be affecting the cluster as well.

Hello,

I will shrink some indices (subject to the maximum document count per shard for shrink).
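
A minimal sketch of that flow on 5.x (index names, the node name and the target shard count are placeholders):

# relocate all shards of the source index to one node and block writes
PUT logs-2020.08.01/_settings
{
  "index.routing.allocation.require._name": "data-node-01",
  "index.blocks.write": true
}

# shrink to fewer primaries (each resulting shard must stay under ~2 billion docs)
POST logs-2020.08.01/_shrink/logs-2020.08.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}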

Recently, we have been getting many messages like:

EsRejected : unexpected error while indexing monitoring document
    org.elasticsearch.xpack.monitoring.exporter.ExportException
EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@35d290a7[Running, pool size = 32, active threads = 32, queued tasks = 340, completed tasks = 3028101]]

The cluster's indexing rate drops drastically.
The APIs become unresponsive.
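
(Bulk rejections per node can be checked with something like the following; the column list is just an example:

GET _cat/thread_pool/bulk?v&h=node_name,active,queue,rejected,completed )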

Also, I use the following parameters in jvm.options (I know the risks of data corruption with G1GC on this JVM version; it has never occurred):

-XX:+UseG1GC
-XX:MaxGCPauseMillis=400
-XX:G1HeapWastePercent=15
-XX:ParallelGCThreads=20
-XX:ConcGCThreads=5
## optimizations
# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.