Enrich processor high CPU load

Hello,

I have a couple of Filebeat instances shipping data directly to Elasticsearch. When this data hits Elasticsearch, two ingest pipelines are executed: one from the module, and another one set through the final_pipeline setting. I created this final pipeline to apply some transformations and enrichments without changing the original module pipeline, and it works without any problem or performance impact.

Now I need to add some fields based on the value of another field, similar to the translate filter in Logstash. Looking at the documentation, I saw that the way to do this in an ingest pipeline is the enrich processor.

The enrich seems pretty simple; it is meant to emulate how Winlogbeat sets the event.* fields.

Something like this:

{ "code": "4624", "category": "authentication", "type": "start", "action": "logged-in" }
{ "code": "4625", "category": "authentication", "type": "start", "action": "logon-failed" }
{ "code": "4634", "category": "authentication", "type": "end", "action": "logged-out" }
{ "code": "4647", "category": "authentication", "type": "end", "action": "logged-out" }
{ "code": "4648", "category": "authentication", "type": "start", "action": "logged-in-explicit" }

So I created the index with the enrich data, the enrich policy, and the enrich processor. Everything worked almost as expected, but as soon as the enrich processor was enabled, the CPU load on the nodes more than doubled.

The ingestion is done on 4 nodes with 10 vCPUs, 64 GB of RAM, 30 GB of heap, and SSD-based disks. The average CPU load is 6 to 7; with the enrich processor enabled it goes up to 14 or 15, sometimes even more. These nodes are also the hot data nodes in a hot/warm architecture.

This ingest pipeline receives an average of 2500 e/s.

Is there any way to improve the performance of the enrich processor? What would be the best approach in this case? This is only one of the enrichments that I need to do.

Since each enrich source would have no more than 100 or maybe 200 entries, I thought of dropping the enrich processor and using a lot of set processors, each with one simple if condition, coupled with dissect processors to create multiple fields.

Does anyone have a benchmark of how an ingest pipeline with hundreds of set and dissect processors performs? Since they do not query any index, I'm assuming they are lighter than an equivalent enrich processor.

While this is my only type of data ingested directly from Filebeat to Elasticsearch, I'm trying to avoid the work of migrating the ingest pipeline to a Logstash pipeline and changing the entire ingestion strategy on a couple of servers.

Hmmm interesting

I usually have pretty good luck with enrich processors.

What version are you on...

I know you're pretty experienced, so I'm assuming the execute policy was successful and that you're matching on a single term / keyword.

My understanding is that the execute policy "compiles" the enrich data into a special compiled index that is then distributed to each of the nodes that needs to use it, so the lookup should be local.

Have you tried using ingest-only nodes? Enrich is a bit CPU driven. It's possible that the enrich is competing for CPU with the data reads and writes.

Curious if these data nodes are also your master nodes?

Hello Stephen,

Currently I'm on version 7.12.1. An upgrade is planned, but it could take a couple more months to be done.

The enrich is pretty simple, I think.

This is an example of a document in the source index for the enrich.

{ "code": "4624", "category": "authentication", "type": "start", "action": "logged-in" }

This is my enrich policy:

{
  "match": {
    "indices": "source-index",
    "match_field": "code",
    "enrich_fields": ["category", "type", "action"]
  }
}

After I execute this policy, the .enrich-policyName*-epoch-time index is created with the same number of documents as the source index, as expected.
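For readers following along, the two steps described here (creating the policy, then executing it) can be sketched as plain request descriptions. This is only a sketch: the helper name `enrich_policy_requests` is made up for illustration, the policy and index names are the placeholders used in this post, and nothing is actually sent to a cluster.

```python
import json


def enrich_policy_requests(policy_name, source_index):
    """Return (method, path, body) tuples for creating and executing a match policy.

    Hypothetical helper: it only describes the requests, it does not send them.
    """
    policy_body = {
        "match": {
            "indices": source_index,
            "match_field": "code",
            "enrich_fields": ["category", "type", "action"],
        }
    }
    return [
        # Step 1: register the policy definition.
        ("PUT", f"/_enrich/policy/{policy_name}", policy_body),
        # Step 2: execute it, which builds the .enrich-* index from the source index.
        ("POST", f"/_enrich/policy/{policy_name}/_execute", None),
    ]


for method, path, body in enrich_policy_requests("policyName", "source-index"):
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

The policy has to be executed again whenever the source index changes, since the processor reads from the generated .enrich-* index and not from the source index.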

This is my enrich processor.

{
  "processors" : [
    {
      "enrich" : {
        "description": "some description",
        "policy_name": "policyName",
        "field" : "event.code",
        "target_field": "event",
        "max_matches": "1"
      }
    }
  ]
}
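One way to check what this processor produces without indexing anything is the ingest simulate API. The sketch below only assembles the request; the helper name `simulate_request` and the sample document are assumptions for illustration, and the policy still has to exist and be executed on the cluster for the simulation to return enriched fields.

```python
import json


def simulate_request(sample_code):
    """Build a simulate request for the enrich processor shown above.

    Hypothetical helper: returns (method, path, body) and sends nothing.
    """
    pipeline = {
        "processors": [
            {
                "enrich": {
                    "description": "some description",
                    "policy_name": "policyName",
                    "field": "event.code",
                    "target_field": "event",
                    "max_matches": "1",
                }
            }
        ]
    }
    # A minimal test document carrying only the field the processor matches on.
    docs = [{"_source": {"event": {"code": sample_code}}}]
    return "POST", "/_ingest/pipeline/_simulate", {"pipeline": pipeline, "docs": docs}


method, path, body = simulate_request("4624")
print(method, path)
print(json.dumps(body, indent=2))
```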

The event.code field is created earlier in the ingest pipeline and is overwritten by the enrich processor; this is expected (I'm not sure whether this has any impact, I don't think so, but I can test it tomorrow using the original field).

In the final document the following fields are added:

{ 
    "event":  {
        "category": "authentication",
        "type": "start",
        "action": "logged-in",
        "code": "4624"
    }
}

This works; the only issue is the load on the nodes.

On the cluster I have 3 dedicated master nodes and 22 data nodes: 18 are set as warm nodes and 4 as hot nodes using custom attributes. I do not have dedicated ingest nodes; in fact, all the data nodes are also ingest nodes.

Currently we configured the indexing to only happen on the hot nodes, using the setting:
"index.routing.allocation.require.node_type" : "hot", where node_type is my custom attribute.

I noticed that the .enrich-* index has auto_expand_replicas set to 0-all and it created one replica on every node. I changed the allocation to hot to make sure that this index existed only on the hot nodes, but it didn't change anything.

The index where we need to do the enrich has 4 primary shards and 1 replica (the average daily size is ~ 200 GB), so it is balanced between the 4 hot nodes and when the enrich processor was active we could see that all 4 hot nodes had an increase in the CPU load.

Also, when the CPU load was high, I checked hot_threads on each of the nodes, and the write action was the one using the most CPU on every node. Normally when I check hot_threads, I only see Lucene merges or expensive queries.

One more piece of information: our cluster is mostly busy during work hours, from 9:00 to 18:00, and I tested the enrich processor after those busy hours because I didn't know how it would behave.

Do you think that dedicated ingest nodes could help? I have never used them, so I'm not sure how they work. Do I need to point my outputs from Logstash and Filebeat to the ingest nodes, or can the requests be sent to the hot nodes and the ingest tasks redirected to the ingest ones? Also, do ingest nodes count against the license, or are they 'free' like the coordinating nodes? We are on Platinum, on-premises.

Do you know of any benchmark related to the number of processors in an ingest pipeline?

My approach to solve this quickly will be to create an auxiliary ingest pipeline composed of set, dissect, and remove processors, and to call this pipeline from the final_pipeline that is set in the index template.

I will try something like this:

{
    "set": {
        "field": "tempEnrich",
        "value": "authentication;start;logged-in",
        "override": false,
        "if": "ctx.event?.code == '4624'",
        "ignore_failure": true
    }
},
{
    "set": {
        other set processors
    }
},
{
    "dissect": {
      "field": "tempEnrich",
      "pattern" : "%{event.category};%{event.type};%{event.action}",
      "ignore_failure" : true
    }
},
{
    "remove": {
        "field": "tempEnrich",
        "ignore_missing": true,
        "ignore_failure": true
    }
}

In this case I have 107 documents in the enrich index, so I would need 107 set processors. I think this could perform well, as the set processor with a static value and the dissect processor seem to be very fast. I will test this tomorrow.
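Writing 107 set processors by hand is error-prone, so the list could be generated from the same rows that feed the enrich index. A minimal sketch, assuming the rows are available as Python dictionaries; `LOOKUP_ROWS` holds only the five sample rows from the first post, and the function name `build_processors` is made up for illustration.

```python
import json

# Sample rows copied from the first post; the real table would have 107 entries.
LOOKUP_ROWS = [
    {"code": "4624", "category": "authentication", "type": "start", "action": "logged-in"},
    {"code": "4625", "category": "authentication", "type": "start", "action": "logon-failed"},
    {"code": "4634", "category": "authentication", "type": "end", "action": "logged-out"},
    {"code": "4647", "category": "authentication", "type": "end", "action": "logged-out"},
    {"code": "4648", "category": "authentication", "type": "start", "action": "logged-in-explicit"},
]


def build_processors(rows, temp_field="tempEnrich"):
    """Build one conditional set processor per row, then a dissect and a remove."""
    processors = []
    for row in rows:
        code = row["code"]
        processors.append(
            {
                "set": {
                    "field": temp_field,
                    # Same "category;type;action" layout as the hand-written example.
                    "value": ";".join([row["category"], row["type"], row["action"]]),
                    "override": False,
                    "if": f"ctx.event?.code == '{code}'",
                    "ignore_failure": True,
                }
            }
        )
    # Split the temporary value into the three event.* fields.
    processors.append(
        {
            "dissect": {
                "field": temp_field,
                "pattern": "%{event.category};%{event.type};%{event.action}",
                "ignore_failure": True,
            }
        }
    )
    # Drop the temporary field once it has been dissected.
    processors.append(
        {
            "remove": {
                "field": temp_field,
                "ignore_missing": True,
                "ignore_failure": True,
            }
        }
    )
    return processors


print(json.dumps({"processors": build_processors(LOOKUP_ROWS)}, indent=2))
```

The printed JSON can be used as the body of the auxiliary pipeline. Note that the values must not contain the `;` separator, otherwise the dissect pattern would split them in the wrong place.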

But any suggestion would help.

Hi @leandrojmp Thanks for the detail.

At this point I will just be an "Armchair Architect"; I am still not clear why enrich is using so much CPU.
Overwriting the field event.code does seem a bit weird to me.

I suggest what works for you ... if that is the set processor approach so be it.

On dedicated ingest nodes: perhaps, and yes, I have used them successfully when I had CPU-intensive ingest pipelines. What follows is just a perspective, not a prescription :)

So if you set up ingest nodes, yes, you will point Filebeat / Logstash etc. to the ingest nodes. The ingest pipeline, which is executed pre-write, runs on the ingest node, and then the ingest node sends the write to where you have the routing allocation set.

For self-managed licensing:
Ingest and coordinator nodes are not licensed.
Licensed nodes include: Master, Data, ML, CCS nodes.
Non-licensed: Ingest / Coordinator-only nodes, Kibana, Logstash, Beats, Agents, APM Server, etc.

Also, you should then remove the ingest role from your warm and hot nodes. (Technically you should remove ingest from your warm nodes anyway.)

Folks often use ingest nodes as coordinators as well (i.e. pointing Kibana and queries to them), so as to leave the data nodes to only read / write... maximize your licensed nodes.

You could try setting up 2 ingest nodes, say 4 CPU / 8 GB or so, and give it a try. You might find it helps (it probably will), BUT that is more to manage etc... etc..

Perhaps take a look here

Prerequisites

  • Nodes with the ingest node role handle pipeline processing. To use ingest pipelines, your cluster must have at least one node with the ingest role. For heavy ingest loads, we recommend creating dedicated ingest nodes.

Yeah, the high CPU load for a simple enrich processor still puzzles me too. I read the recommendation to use dedicated ingest nodes for heavy loads; the issue is that I don't consider 2500 e/s a heavy load.

I will test the set processors to see if they solve my problem for now. I will also open an issue with support to see if they can help me dig deeper into what the issue could be.

Thanks!

I agree... something does not seem right...
Ohh and YES, open a support ticket ... hehe, I should have said that first.

@leandrojmp Did you make sure your heap setting is below the Compressed Object Pointers threshold? If it is over, it is inefficient. In 7.12.1 you can let Elasticsearch size itself. Probably not an issue; I just noticed it and am not sure whether you manually set it to 30 GB.

Yeah, we checked that during the last upgrade from 7.9.1 to 7.12.1; some nodes were above the cutoff limit and we fixed it.

Checking with _nodes/nodeName/jvm?pretty I can confirm it with the following line in the response:

"using_compressed_ordinary_object_pointers" : "true"
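With 22 data nodes, the same check can be scripted over the JVM section of the nodes info response instead of reading it node by node. A rough sketch, assuming the response has already been fetched and parsed into a dictionary; the node IDs and names in `sample_response` are invented for the example, and the flag is compared as a string because that is how it appears in the line above.

```python
def nodes_without_compressed_oops(nodes_info):
    """Return the names of nodes whose JVM does not report compressed oops.

    Expects the parsed JSON of a nodes info JVM response ("nodes" keyed by node ID).
    """
    offenders = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        flag = node.get("jvm", {}).get("using_compressed_ordinary_object_pointers")
        # The flag is reported as the string "true"; anything else counts as a problem.
        if str(flag).lower() != "true":
            offenders.append(node.get("name", node_id))
    return sorted(offenders)


# Hypothetical response fragment: node IDs and names are made up for the example.
sample_response = {
    "nodes": {
        "id-1": {"name": "hot-1", "jvm": {"using_compressed_ordinary_object_pointers": "true"}},
        "id-2": {"name": "hot-2", "jvm": {"using_compressed_ordinary_object_pointers": "false"}},
    }
}

print(nodes_without_compressed_oops(sample_response))
```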