Are Elasticsearch ingest pipelines resource intensive?

Hi,
I am currently using ingest pipelines to enrich my documents before they are written to an index.
I am wondering whether this process could consume a lot of resources (heap RAM or CPU) on my Elasticsearch nodes behind the scenes.

Below are the processors that make up my ingest pipeline:

PUT _ingest/pipeline/traces-apm@custom
{
  "processors": [
    {
      "script": {
        "source": """
       
        // Null-safe checks (?.) so documents without a labels object do not throw.
        if (ctx.labels?.region == null || ctx.labels?.deviceId == null) {
          ctx.db_tag = null;
        } else {
          ctx.db_tag = ctx.labels.database_name + '-' + ctx.labels.region + '-' + ctx.labels.deviceId;
        }
       
       """
      }
    },
    {
      "enrich": {
        "description": "Add extra data based on 'unique_db_tag'",
        "policy_name": "acav_limit_v1_policy",
        "field": "db_tag",
        "target_field": "unique_tag_v2",
        "max_matches": "1"
      }
    },
    {
      "set": {
        "field": "unique_tag_v2.pa_key",
        "value": "{{{labels.pa_key}}}"
      }
    },
    {
      "set": {
        "field": "unique_tag_v2.project_id",
        "value": "{{{labels.project_id}}}"
      }
    },
    {
      "set": {
        "field": "unique_tag_v2.brand_id",
        "value": "{{{labels.brandId}}}"
      }
    },
    {
      "set": {
        "field": "unique_tag_v2.function_id",
        "value": "{{{labels.function_id}}}"
      }
    },
    {
      "set": {
        "field": "unique_tag_v2.codeNum",
        "value": "{{{labels.codeNum}}}"
      }
    }
  ]
}

The reason I ask is that Elasticsearch has to go through every document to see whether it matches my enrich policy, and then execute the script processor and the set processors. That sounds like a lot of work for Elasticsearch, since I am using this ingest pipeline for my data stream. So I am wondering whether this could consume a lot of resources and, if it does, whether there is any way I can optimize it.

What version of Elasticsearch are you on?

Ingest pipelines consume varying amounts of CPU and RAM depending on the complexity of the pipeline and what it does.

Things like poorly written regexes can be very expensive.

From a quick look, your pipeline is fairly low complexity.

Thinking of it as Elasticsearch looping through every document is not the correct way to look at it. Elasticsearch does not "loop" through the enrich index to enrich the source; it does a term lookup, which is extremely fast and efficient in Elasticsearch.

So I would say your pipeline is not very resource intensive. Run it and test it.
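To make the difference between "looping" and a term lookup concrete, here is a minimal Python sketch. It is a toy model, not how Elasticsearch is implemented: a plain dictionary stands in for the term index, and the record shape (`db_tag`, `limit`) and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical enrich source data: one record per db_tag (illustrative names only).
records = [
    {"db_tag": f"orders-eu-{i}", "limit": i % 100}
    for i in range(100_000)
]

def scan_lookup(records, value):
    """Linear scan: compare the value against each record until one matches."""
    comparisons = 0
    for record in records:
        comparisons += 1
        if record["db_tag"] == value:
            return record, comparisons
    return None, comparisons

def build_term_index(records, field):
    """Build a term -> list of record positions mapping once, up front."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index

def term_lookup(index, records, value):
    """Term lookup: go straight to the matching positions via the index."""
    positions = index.get(value, [])
    return records[positions[0]] if positions else None

term_index = build_term_index(records, "db_tag")

scanned, comparisons = scan_lookup(records, "orders-eu-99999")
looked_up = term_lookup(term_index, records, "orders-eu-99999")

print(comparisons)           # how many records the scan had to touch
print(scanned == looked_up)  # both approaches return the same record
```

The scan touches more records as the data grows, while the indexed lookup does roughly the same small amount of work per document. That is the intuition behind the reply above.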

You can run

GET /_nodes/hot_threads

to see what is taking up resources

Hi @stephenb ,

I am currently on version 8.8. Good to know that my ingest pipeline is low complexity, and thanks for pointing out that poor regexes can consume a lot of resources. I will pay attention to that when writing queries in the future.

I also ran GET /_nodes/hot_threads, and it looks very normal :smiley:
1.4% [cpu=1.4%, idle=98.6%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000004][transport_worker][T#1]'

I am a bit interested in the term lookup and how Elasticsearch enriches the document. I am not sure I understand the flow correctly.
I found this picture in the official documentation.

For example, my incoming document has a field called email and my source index also has an email field. As long as the email value in the incoming document matches an email value in the source index, this document will be written into an index called the "enrich index". I assume it would look something like the picture below inside the enrich index.
[Image: termlookup]

Then Elasticsearch performs a term lookup, which jumps straight to the document's location (for example, the document that was just indexed into the enrich index), and Elasticsearch takes it out for the rest of the processors to use. Because of this, it is extremely fast.

The first picture is correct.

The 2nd picture shows doc values, which are not really used for this part. :slight_smile:

The enrich index is compacted and optimized for the term lookup.
It pretty much depends on how deep you want to get, but a term lookup uses the inverted index to find the record extremely quickly. Doc values are used for aggregations and sorting, not really for lookups.
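As a rough illustration of the two data structures mentioned above, the sketch below builds both over the same three toy documents. The field names and values are invented for the example, and the Python dictionaries and lists are only stand-ins for the real on-disk formats.

```python
# Three toy documents keyed by doc ID (hypothetical data).
docs = {
    0: {"email": "a@example.com", "plan": "free"},
    1: {"email": "b@example.com", "plan": "pro"},
    2: {"email": "c@example.com", "plan": "free"},
}

# Inverted index: term -> doc IDs.
# Answers "which documents contain this value?" (search / lookup).
inverted = {}
for doc_id, doc in docs.items():
    inverted.setdefault(doc["plan"], []).append(doc_id)

# Doc values (column-oriented): doc ID -> value.
# Answers "what value does this document have?" (aggregations / sorting).
doc_values = [docs[doc_id]["plan"] for doc_id in sorted(docs)]

# A lookup goes through the inverted index.
free_docs = inverted["free"]

# An aggregation walks the column of values.
counts = {}
for value in doc_values:
    counts[value] = counts.get(value, 0) + 1

# Sorting reads each document's value from the column.
sorted_ids = sorted(docs, key=lambda doc_id: doc_values[doc_id])

print(free_docs, counts, sorted_ids)
```

The enrich lookup is a "which document has this value?" question, which is why it goes through the inverted index and not doc values.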

So yeah, it's fast and efficient, especially for reasonably sized enrich indices (let's say 100Ks to low millions).

There are lots of data structures in Elasticsearch; it may take a while to figure it all out.

@stephenb, thanks for the quick reply!

I see. If it is using an inverted index similar to the picture below, it makes so much sense why it can be so quick.
[Image: invertindex]

Pretty much: in the chart, the term could be the email field value, and the doc IDs can be the IDs of the documents with the matching email.

I appreciate you answering my question; it is very helpful!

@stephenb

Sorry, I have one more question that I am curious about. Let's say we have 500,000 different email addresses in the source index, and every incoming document contains an email address. Does Elasticsearch have to loop through the source index and compare the entries one by one with the incoming document to find the matching email, so that it can then add the data to the incoming document?

If this is the case, would it become resource intensive as my source index keeps growing?

I am not following. Elasticsearch does not loop through anything.

If by source index you mean the source of the enrich index, i.e. the lookup index, then no: it uses the term query / inverted index, as discussed, to find the matching enrich data very efficiently.

As the picture shows, the enrich lookup happens as each new document comes in. Very fast.
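The per-document flow described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `enrich` function, the `user` target field, and the sample records are hypothetical, and a dictionary stands in for the enrich index. The real enrich processor has more options, and how it handles a missing field depends on its `ignore_missing` setting.

```python
# Hypothetical enrich index keyed by the match field (email), built ahead of time.
enrich_index = {
    "a@example.com": {"email": "a@example.com", "first_name": "Ana", "city": "Lisbon"},
    "b@example.com": {"email": "b@example.com", "first_name": "Ben", "city": "Oslo"},
}

def enrich(doc, field, target_field, lookup):
    """Copy the matching enrich record into target_field.

    The document is returned unchanged when the field is absent or nothing matches.
    """
    value = doc.get(field)
    match = lookup.get(value) if value is not None else None
    if match is not None:
        doc[target_field] = dict(match)
    return doc

incoming = [
    {"email": "a@example.com", "message": "login"},
    {"email": "zzz@example.com", "message": "logout"},
    {"message": "no email field"},
]

# One lookup per incoming document; nothing iterates over the enrich data.
enriched = [enrich(dict(doc), "email", "user", enrich_index) for doc in incoming]

for doc in enriched:
    print(doc)
```

Each incoming document triggers a single keyed lookup, so the work per document stays roughly the same whether the lookup data holds two addresses or 500,000.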

@stephenb

Thank you! I think I got it now
