Elastic Unstable

Hi Team Elastic,

I have been stressed lately because my logs are arriving in Elasticsearch with a delay of about 10 hours.

I have 3 nodes:
Node 1: master, ingest, transform; resources: 16 vCPU, 16 GB RAM, 500 GB disk
Node 2: data_hot; resources: 16 vCPU, 64 GB RAM, 2 TB disk
Node 3: data_, data_warm; resources: 16 vCPU, 64 GB RAM, 10 TB disk

Please help, I have been troubleshooting for a month with no results.

Hi @Dea_Agra,

Welcome back! Which version of Elasticsearch are you using? Can you give us more information on the troubleshooting you've done? For example, have you checked the output of the _cluster/health API? (There are example requests below these questions.)

What is the full output of the cluster stats API?

What type of hardware is the cluster deployed on? What type of storage are you using?

How much data are you indexing per day (or how much should the cluster be indexing if it were keeping up)?

What are you using to index data into the cluster?

Do you have monitoring installed?
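
For reference, these can be run from Kibana Dev Tools or with curl; the _cat/nodes columns below are just one convenient selection, nothing specific to your setup:

# cluster-level health and stats, plus a quick per-node overview
GET _cluster/health
GET _cluster/stats?human
GET _cat/nodes?v&h=name,node.role,heap.percent,cpu,load_1m,disk.used_percent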


Continuing the discussion from Elastic Unstable:

{
"_nodes" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"cluster_name" : "xforce-cluster",
"cluster_uuid" : "aspcDuHYQ6qRfE8flD5RXQ",
"timestamp" : 1704796161773,
"status" : "green",
"indices" : {
"count" : 130,
"shards" : {
"total" : 164,
"primaries" : 130,
"replication" : 0.26153846153846155,
"index" : {
"shards" : {
"min" : 1,
"max" : 2,
"avg" : 1.2615384615384615
},
"primaries" : {
"min" : 1,
"max" : 1,
"avg" : 1.0
},
"replication" : {
"min" : 0.0,
"max" : 1.0,
"avg" : 0.26153846153846155
}
}
},
"docs" : {
"count" : 1386300702,
"deleted" : 1490912
},
"store" : {
"size_in_bytes" : 1845130506044,
"total_data_set_size_in_bytes" : 1845130506044,
"reserved_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 102296,
"evictions" : 0
},
"query_cache" : {
"memory_size_in_bytes" : 3533173,
"total_count" : 17121836,
"hit_count" : 4152,
"miss_count" : 17117684,
"cache_size" : 614,
"cache_count" : 614,
"evictions" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 2272,
"memory_in_bytes" : 265190208,
"terms_memory_in_bytes" : 213271552,
"stored_fields_memory_in_bytes" : 5241472,
"term_vectors_memory_in_bytes" : 0,
"norms_memory_in_bytes" : 26851968,
"points_memory_in_bytes" : 0,
"doc_values_memory_in_bytes" : 19825216,
"index_writer_memory_in_bytes" : 21342876,
"version_map_memory_in_bytes" : 5526,
"fixed_bit_set_memory_in_bytes" : 63466048,
"max_unsafe_auto_id_timestamp" : 1704795307981,
"file_sizes" : { }
},
"mappings" : {
"field_types" : [
{
"name" : "alias",
"count" : 920,
"index_count" : 32,
"script_count" : 0
},
{
"name" : "binary",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "boolean",
"count" : 2921,
"index_count" : 94,
"script_count" : 0
},
{
"name" : "byte",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "constant_keyword",
"count" : 10,
"index_count" : 4,
"script_count" : 0
},
{
"name" : "date",
"count" : 4229,
"index_count" : 110,
"script_count" : 0
},
{
"name" : "date_nanos",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "date_range",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "double",
"count" : 823,
"index_count" : 30,
"script_count" : 0
},
{
"name" : "double_range",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "flattened",
"count" : 289,
"index_count" : 23,
"script_count" : 0
},
{
"name" : "float",
"count" : 815,
"index_count" : 48,
"script_count" : 0
},
{
"name" : "float_range",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "geo_point",
"count" : 222,
"index_count" : 32,
"script_count" : 0
},
{
"name" : "geo_shape",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "half_float",
"count" : 57,
"index_count" : 15,
"script_count" : 0
},
{
"name" : "integer",
"count" : 206,
"index_count" : 32,
"script_count" : 0
},
{
"name" : "integer_range",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "ip",
"count" : 2879,
"index_count" : 36,
"script_count" : 0
},
{
"name" : "ip_range",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "keyword",
"count" : 118612,
"index_count" : 110,
"script_count" : 0
},
{
"name" : "long",
"count" : 28756,
"index_count" : 97,
"script_count" : 0
},
{
"name" : "long_range",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "nested",
"count" : 105,
"index_count" : 33,
"script_count" : 0
},
{
"name" : "object",
"count" : 22071,
"index_count" : 109,
"script_count" : 0
},
{
"name" : "scaled_float",
"count" : 2,
"index_count" : 2,
"script_count" : 0
},
{
"name" : "shape",
"count" : 1,
"index_count" : 1,
"script_count" : 0
},
{
"name" : "short",
"count" : 2122,
"index_count" : 22,
"script_count" : 0
},
{
"name" : "text",
"count" : 28353,
"index_count" : 102,
"script_count" : 0
},
{
"name" : "version",
"count" : 3,
"index_count" : 3,
"script_count" : 0
}
],
"runtime_field_types" :
},
"analysis" : {
"char_filter_types" : ,
"tokenizer_types" : ,
"filter_types" : ,
"analyzer_types" : ,
"built_in_char_filters" : ,
"built_in_tokenizers" : ,
"built_in_filters" : ,
"built_in_analyzers" :
},
"versions" : [
{
"version" : "7.17.12",
"index_count" : 130,
"primary_shard_count" : 130,
"total_primary_bytes" : 1839085416891
}
]
},
"nodes" : {
"count" : {
"total" : 3,
"coordinating_only" : 0,
"data" : 3,
"data_cold" : 0,
"data_content" : 0,
"data_frozen" : 0,
"data_hot" : 1,
"data_warm" : 1,
"ingest" : 3,
"master" : 1,
"ml" : 0,
"remote_cluster_client" : 1,
"transform" : 1,
"voting_only" : 0
},
"versions" : [
"7.17.12"
],
"os" : {
"available_processors" : 48,
"allocated_processors" : 48,
"names" : [
{
"name" : "Linux",
"count" : 3
}
],
"pretty_names" : [
{
"pretty_name" : "Ubuntu 22.04.3 LTS",
"count" : 3
}
],
"architectures" : [
{
"arch" : "amd64",
"count" : 3
}
],
"mem" : {
"total_in_bytes" : 151832961024,
"free_in_bytes" : 9252524032,
"used_in_bytes" : 142580436992,
"free_percent" : 6,
"used_percent" : 94
}
},
"process" : {
"cpu" : {
"percent" : 65
},
"open_file_descriptors" : {
"min" : 509,
"max" : 1543,
"avg" : 1105
}
},
"jvm" : {
"max_uptime_in_millis" : 322513325,
"versions" : [
{
"version" : "20.0.2",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "20.0.2+9-78",
"vm_vendor" : "Oracle Corporation",
"bundled_jdk" : true,
"using_bundled_jdk" : true,
"count" : 3
}
],
"mem" : {
"heap_used_in_bytes" : 42634670272,
"heap_max_in_bytes" : 73014444032
},
"threads" : 400
},
"fs" : {
"total_in_bytes" : 13591541923840,
"free_in_bytes" : 11706572861440,
"available_in_bytes" : 11020294971392
},
"plugins" : ,
"network_types" : {
"transport_types" : {
"security4" : 3
},
"http_types" : {
"security4" : 3
}
},
"discovery_types" : {
"zen" : 3
},
"packaging_types" : [
{
"flavor" : "default",
"type" : "deb",
"count" : 3
}
],
"ingest" : {
"number_of_pipelines" : 55,
"processor_stats" : {
"append" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"conditional" : {
"count" : 1366046566,
"failed" : 6983433,
"current" : 7,
"time_in_millis" : 454680900
},
"convert" : {
"count" : 112513696,
"failed" : 0,
"current" : 0,
"time_in_millis" : 1622979
},
"csv" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"date" : {
"count" : 224609970,
"failed" : 224609970,
"current" : 0,
"time_in_millis" : 6425730
},
"foreach" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"geoip" : {
"count" : 226514709,
"failed" : 502100,
"current" : 0,
"time_in_millis" : 5775351
},
"grok" : {
"count" : 16412993,
"failed" : 3369972,
"current" : 7,
"time_in_millis" : 1457124475
},
"gsub" : {
"count" : 13702480,
"failed" : 0,
"current" : 0,
"time_in_millis" : 109622
},
"join" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"kv" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"lowercase" : {
"count" : 449219942,
"failed" : 0,
"current" : 0,
"time_in_millis" : 7622456
},
"remove" : {
"count" : 224818681,
"failed" : 112304985,
"current" : 0,
"time_in_millis" : 2601647
},
"rename" : {
"count" : 4063909737,
"failed" : 0,
"current" : 0,
"time_in_millis" : 2750237
},
"script" : {
"count" : 450013516,
"failed" : 112305172,
"current" : 0,
"time_in_millis" : 11491545
},
"set" : {
"count" : 487857540,
"failed" : 0,
"current" : 0,
"time_in_millis" : 5728724
},
"set_security_user" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"split" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"uppercase" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
},
"uri_parts" : {
"count" : 208711,
"failed" : 0,
"current" : 0,
"time_in_millis" : 2203
},
"user_agent" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time_in_millis" : 0
}
}
}
}
}

This is the result of the cluster stats API.

  1. I am using SAS HDDs

  2. It's about 29,000,000 documents per day

  3. Logstash

  4. Yes, I have monitoring installed

Hi, is there any update? Please help 😭

What do I/O statistics, e.g. await and disk utilisation, look like on the nodes? You can get these through e.g. iostat -x if you are running Linux.
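
For example, something like the following (iostat is part of the sysstat package); the interval and count are arbitrary:

# extended device statistics (await, %util) every 5 seconds, 5 samples
iostat -x 5 5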

Do you see any high CPU usage or heap usage? Do you see anything in the Elasticsearch logs around long or frequent garbage collection on any of the nodes?

How have you determined this? That is a huge lag. Are you setting timestamp fields properly? Timestamps must be in UTC, and I have seen many users not account for this and therefore insert data that is future-dated as far as Elasticsearch is concerned. This can result in it only showing up in Kibana once the timestamp is no longer in the future.

Which timezone are you in?
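
Purely as an illustration (the pipeline name, field name, format and timezone below are placeholders, not taken from your setup), an ingest date processor can be told which timezone the incoming timestamps are in so they get converted to UTC:

# hypothetical pipeline; adjust field, formats and timezone to your data
PUT _ingest/pipeline/local-time-example
{
  "processors": [
    {
      "date": {
        "field": "log_time",
        "formats": ["yyyy-MM-dd HH:mm:ss"],
        "timezone": "Asia/Jakarta",
        "target_field": "@timestamp"
      }
    }
  ]
}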

Hi, I found out that it is because of the ingest pipeline. Do you have any idea how to keep the logs from being delayed while still using the ingest pipeline?

How did you identify this?

Ingest pipelines are often limited by CPU. If you are seeing high CPU usage on your ingest nodes you may need to increase this. If you do not see high CPU, you may want to ingest data into Elasticsearch with a higher level of parallelism.

Another way to address this would be to make your ingest pipeline(s) more efficient. It looks like your grok and conditional processors are taking up most of the processing time, so it may be worth starting there.
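
To see where that time is going, the node stats can be broken down per pipeline and per processor; the filter_path parameter just trims the response and is optional:

# per-node, per-pipeline and per-processor ingest timings
GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines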

I deleted the ingest pipeline configuration from the index template, and Elasticsearch could then write up to 15,000 events/s; before, it was a maximum of 2,000/s and an average of 500/s.

What if I use a Logstash pipeline instead?

What about Painless scripts? Do they need a lot of processing power?

How are you sending data to Elasticsearch? What is your batch size? How many concurrent connections are you sending data over?

Do you see high CPU on your ingest node?

I send data to Elasticsearch using Logstash. I have attached my Logstash configuration:

batch.size: 3000
batch.delay: 100
queue.type: memory
queue.max_bytes: 2048mb
queue.checkpoint.writes: 4096

And this is my CPU usage:

I have three Elasticsearch nodes, and the ingest role is enabled on every node.

How many CPU cores do the Logstash node(s) have?

This is quite a large batch size. It would be interesting to see what effect decreasing this to the default value (or maybe 500) would have. Can you comment it out or change the value and restart Logstash?

You can move the processing to Logstash, but be aware that it may require additional CPU resources. If your pipelines are very inefficient, this may not help much though.
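
For example, a grok that currently runs in an ingest pipeline would look roughly like this as a Logstash filter (the pattern here is a generic placeholder, not your actual pattern):

filter {
  grok {
    # replace with the pattern(s) from your ingest pipeline
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}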

What should I do to get the logs in near real time while still using the ingest pipeline? It takes time to convert the ingest pipelines to Logstash pipelines. Also, as we can see from the CPU usage I sent before, I think there is still enough CPU to process the logs, but I don't know why Elasticsearch isn't using it.

This is why I asked about the number of CPU cores allocated to Logstash. If I recall correctly, the number of processing threads is set based on the number of CPU cores available, and this determines the level of processing parallelism, which limits how many concurrent connections to Elasticsearch will be used.

Given that you have quite large batch sizes, it is possible you are sending relatively few batches to Elasticsearch concurrently. I believe batches are processed in a single thread, and if this is the case it could limit the amount of CPU used.

I would try increasing the number of threads Logstash uses and also decreasing the batch size, so that multiple smaller batches are sent to Elasticsearch in parallel, thereby increasing CPU usage. Given the extensive processing done in the ingest pipelines, it is quite likely that each bulk request takes a long time to process.

Once we know what impact this has (if any), we can look at further potential improvements.
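
As a rough sketch (the values are illustrative starting points, not tuned recommendations), the relevant settings live in logstash.yml, or per pipeline in pipelines.yml:

# logstash.yml
pipeline.workers: 8        # defaults to the number of CPU cores; can go higher if workers sit waiting on Elasticsearch
pipeline.batch.size: 500   # down from 3000; the Logstash default is 125
pipeline.batch.delay: 50   # milliseconds to wait while filling a partial batch (default 50)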

For Logstash, I am using 8 vCPU cores. I will try setting the batch size as you recommended.
