Tracking the cause of write queues

Hello,

I've been trying to find the cause of high load that happens randomly on one of my hot nodes, and this has been going on for a couple of months.

Currently I have a 25-node cluster with 3 dedicated masters, 18 warm nodes, and 4 hot nodes; all indexing is done on the hot nodes.

The version is 7.12.1, self-managed on VMs with a Platinum license. An upgrade is already planned for next year and cannot be done sooner.

All 4 hot nodes have the same specs: 12 vCPUs, 64 GB of RAM with a 30 GB heap (below the Java compressed oops threshold), and SSD-backed disks.

Ingestion is mostly done through Logstash, but some Filebeat instances send data directly to Elasticsearch. The Logstash outputs are configured with the four hot nodes as hosts, and the same applies to the Filebeat outputs, so requests should be balanced across the 4 nodes.

My indices are mostly daily (indexname-yyyy.MM.dd) and some are monthly (indexname-yyyy.MM). They are all sized so that shards stay under 50 GB, and I only use ILM to move indices from the hot nodes to the warm nodes and to delete them after some time.

Some indices, the larger ones with high events-per-second rates, have 4 primary shards and 1 or 0 replicas; other indices have 2 shards and 1 replica, or 1 shard and 1 replica, so in some cases I can end up with an index whose shards sit on only 2 of the hot nodes.

The issue is that from time to time one of the nodes will show a high load, sometimes double that of the others and higher than the number of vCPUs, which suggests some I/O wait.

The problem is that I still haven't been able to track down the cause. I suspect that one or more indices are concentrating their write load on a single node, but I couldn't find an easy way to confirm this.

For example, running GET _nodes/current-high-load-node/hot_threads?threads=5 sometimes returns [write] threads among the hot threads, but there is no further information about the cause.

Running:

GET /_cat/thread_pool/write?v=true&h=node_name,active,queue,rejected,completed&s=node_name

This shows me that one of the nodes has all its write threads active and some requests in the queue, but how can I find what is causing this queue? Does anyone have any idea?

node_name      active queue rejected completed
node-name-01      1     0        0  57832635
node-name-02     12    63        0  61280527
node-name-03      5     0        0  57961637
node-name-04      2     0        0  51770382

How many different indices are you actively indexing into? Are the primary shards of these indices evenly distributed across the hot nodes? Are you using different pipelines for different indices so that bulk requests target a specific index or at least a small set of indices? Is your data immutable or do you also perform updates?

Hello,

How many different indices are you actively indexing into?

I'm using Elastic as a SIEM, so I have many different types of logs, each in its own index; I have around 30 different daily indices.

Are the primary shards of these indices evenly distributed across the hot nodes?

Not for all indices. Some are evenly distributed, some have just 2 primaries, and a few have only 1 primary. The biggest ones, both in events per second and in size, are evenly distributed across the hot nodes.

Making every index evenly distributed across the 4 hot nodes could lead to many small shards, because I use daily indices and cannot change that now.

Are you using different pipelines for different indices so that bulk requests target a specific index or at least a small set of indices?

I have a different pipeline for each index. For example, logs from a network device are stored in network-device-yyyy.MM.dd, and the ingestion is done by 2 Logstash servers with an index => "network-device-%{+yyyy.MM.dd}" configuration.

So, the target of each ingestion pipeline from Logstash or Filebeat is always one specific index pattern; sometimes I have 2 Logstash/Filebeat instances working in parallel sending data to the same index. Is that what you mean?
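
For reference, the Logstash output looks roughly like this (a simplified sketch; the hostnames are placeholders for my four hot nodes):

output {
  elasticsearch {
    hosts => ["hot-node-01:9200", "hot-node-02:9200", "hot-node-03:9200", "hot-node-04:9200"]
    index => "network-device-%{+yyyy.MM.dd}"
  }
}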

Is your data immutable or do you also perform updates?

The data is immutable; it is time-series data and no updates are being made.

If the threads are blocked on IO then they might not be in the hottest 5. Pass a larger ?threads= parameter to the hot threads API to see what all the threads are doing. Try 9999, that should be big enough. Share the output here if you need help understanding it.
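
For example, against the busy node from your thread pool output above:

GET _nodes/node-name-02/hot_threads?threads=9999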

Also you can use GET _tasks?detailed to get more details about what tasks are running on each node, which should include information about which indices they're writing to.

Hello @DavidTurner and @Christian_Dahlqvist

I was finally able to track the cause using the following request:

GET _tasks?nodes=node-name&actions=*write*&detailed

Then I wrote a small Python script to parse the JSON response and extract only the task description (which includes the index name) and the task's running_time_in_nanos. This let me easily see which index was spending the most time in the indices:data/write/bulk action.
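
The script is just a quick sketch along these lines (simplified; it assumes the response from the request above was saved to tasks.json):

import json

# Load the saved response from GET _tasks?nodes=node-name&actions=*write*&detailed
with open("tasks.json") as f:
    response = json.load(f)

# Collect (running_time_in_nanos, description) for every write task;
# the description contains the target index of the bulk request
tasks = []
for node in response["nodes"].values():
    for task in node["tasks"].values():
        tasks.append((task.get("running_time_in_nanos", 0), task.get("description", "")))

# Print the slowest tasks first
for running_time, description in sorted(tasks, reverse=True):
    print(f"{running_time / 1_000_000:.1f} ms  {description}")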

The culprit was the only pipeline that sends data directly from Filebeat to Elasticsearch. This pipeline uses a third-party module with its own ingest pipeline (Wazuh), and we also created a final_pipeline to do some extra processing. Recently we added another ingest pipeline to do some enrichment; it is called from the final_pipeline using the pipeline processor.

The enrich pipeline is composed of around a hundred set processors with the following format:

{
    "set": {
        "field": "event.metadata",
        "value": "authentication;start;logged-in",
        "override": false,
        "if": "ctx.event?.code == '4624'",
        "ignore_failure": true
    }
},
{
    "set": {
        "field": "event.metadata",
        "value": "authentication;start;logon-failed",
        "override": false,
        "if": "ctx.event?.code == '4625'",
        "ignore_failure": true
    }
}

We created this enrich pipeline with set processors to replace an enrich processor we had tried before, whose performance was even worse: in that case the load on all hot nodes doubled as soon as the enrich processor was enabled. I've made a topic about it if you want to read more.

After I removed this pipeline with its hundred or so set processors, the load on the node returned to a normal value, similar to that of the other hot nodes.

What I do not understand now is why only one node had performance issues. Shouldn't the ingest work be balanced across all four nodes? Would dedicated ingest nodes help solve this, and is it possible to make just one pipeline use a specific ingest node?

Also, what improvements have been made to ingest pipelines in newer versions? We have an upgrade planned for next month, and I'm wondering whether I should give enrichment in Elasticsearch another shot or move it to Logstash, which can do what I need pretty quickly.

Thanks, you can close the topic.


If the coordinating node is an ingest node then IIRC it will just do all the ingest work itself, whereas if it's not then it will delegate the work across all the ingest nodes. So either you need to load-balance across multiple coordinating nodes or else point Filebeat at a node that is not an ingest node to get Elasticsearch to do the load balancing.
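
For example (a sketch, assuming you set node roles explicitly in elasticsearch.yml; adapt it to however your roles are currently defined), a hot data node without the ingest role would look like this:

node.roles: [ data_hot, data_content ]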

You didn't share the hot threads output but I think that would shed more light on why this particular pipeline is so resource-intensive.

I'm also pretty surprised that an enrich processor didn't work for you here. Looking up a value from a table is exactly what enrich processors are for. Either there's something else going on here or else this is a bug that deserves further investigation.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.