Ingest pipeline gsub processor back reference not working

I am having trouble getting a back reference to work in the gsub processor of an Elasticsearch ingest node pipeline. I am trying to get just the TLD from a dns.question.name field, using syntax very similar to what I used in Logstash for the same task. However, each character of the dns.question.name field is being replaced with a "1". Here are some sample values I was testing with:

 www.example.com
 test-example6.domain.com
 another.test-example.another-domain.com

What I was hoping for was the following:

 example.com
 domain.com
 another-domain.com

But what I got was:

11111111111
1111111111
111111111111111111

Here was my gsub processor:

      {
        "gsub" : {
          "description" : "Populate dns.question.registered_domain from dns.question.name",
          "field" : "dns.question.name",
          "target_field" : "dns.question.registered_domain",
          "pattern" : "[.*\\.([^.]+\\.[^.]+)]",
          "replacement" : "\\1",
          "ignore_missing" : true,
          "ignore_failure" : true
        }
      }

Does the ingest node gsub processor support back references?

If so, I am not sure what I am doing wrong.

Any help would be much appreciated! Thanks in advance.

Hi @kmfreder1

Are you trying to substitute, or just extract the TLD? I am confused.

gsub... substitutes....

Did you look at ....

Then you could set it to the field you want... and remove the temp data if you want...

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "registered_domain": {
          "field": "dns.question.name",
          "target_field": "dns_details"
        }
      },
      {
        "set": {
          "field": "dns.question.registered_domain",
          "value": "{{dns_details.top_level_domain}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dns": {
          "question": {
            "name": "www.example.ac.uk"
          }
        }
      }
    },
    {
      "_source": {
        "dns": {
          "question": {
            "name": "www.example.com"
          }
        }
      }
    },
    {
      "_source": {
        "dns": {
          "question": {
            "name": "test-example6.domain.com"
          }
        }
      }
    },
    {
      "_source": {
        "dns": {
          "question": {
            "name": "another.test-example.another-domain.com"
          }
        }
      }
    }
  ]
}
#Results
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "dns": {
            "question": {
              "registered_domain": "ac.uk",
              "name": "www.example.ac.uk"
            }
          },
          "dns_details": {
            "registered_domain": "example.ac.uk",
            "top_level_domain": "ac.uk",
            "domain": "www.example.ac.uk",
            "subdomain": "www"
          }
        },
        "_ingest": {
          "timestamp": "2023-04-01T01:48:11.925942306Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "dns": {
            "question": {
              "registered_domain": "com",
              "name": "www.example.com"
            }
          },
          "dns_details": {
            "registered_domain": "example.com",
            "top_level_domain": "com",
            "domain": "www.example.com",
            "subdomain": "www"
          }
        },
        "_ingest": {
          "timestamp": "2023-04-01T01:48:11.926014019Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "dns": {
            "question": {
              "registered_domain": "com",
              "name": "test-example6.domain.com"
            }
          },
          "dns_details": {
            "registered_domain": "domain.com",
            "top_level_domain": "com",
            "domain": "test-example6.domain.com",
            "subdomain": "test-example6"
          }
        },
        "_ingest": {
          "timestamp": "2023-04-01T01:48:11.926022263Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "dns": {
            "question": {
              "registered_domain": "com",
              "name": "another.test-example.another-domain.com"
            }
          },
          "dns_details": {
            "registered_domain": "another-domain.com",
            "top_level_domain": "com",
            "domain": "another.test-example.another-domain.com",
            "subdomain": "another.test-example"
          }
        },
        "_ingest": {
          "timestamp": "2023-04-01T01:48:11.926024921Z"
        }
      }
    }
  ]
}

Add the following if you want to clean up:

      ,
      {
        "remove": {
          "field": "dns_details"
        }
      }

@stephenb, thanks for your response. I saw that processor and though it looks awesome and would obviously do the trick in this particular case, we are stuck on an older version of Elasticsearch.

Our pipelines generally have to work on 7.17.x. Unfortunately, it will be a while before we can use the Registered domain processor.

That said, with the transforms we find ourselves doing, this is likely not going to be the only case where we want to substitute only a portion of a string while keeping the rest of it in place.

The gsub filter in Logstash allows the back reference syntax, which we had used a few times before moving the majority of our transforms to ingest pipelines. Does the gsub for ingest pipelines not have this capability, or is something just wrong with my syntax?

I would have expected gsub syntax and capabilities between those two things to be the same, other than the yaml vs. json differences, but there may be more at play than I realize.

Logstash and Ingest are two related but separate code bases... so they are not 100% aligned.

I am not a regex expert... I will ask internally...

Thanks for looking deeper. I really appreciate that!

Answer in progress. The engineer is away from keyboard, but the initial response is yes.

From Eng:

So backreferences are supported, IIUC, and there’s some kind of $ syntax to use them.
Matcher (Java Platform SE 7)
Maybe $1 rather than \1?
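For anyone comparing the two replacement syntaxes, here is a minimal Java sketch. It assumes the gsub processor hands the pattern and replacement to java.util.regex, as the Matcher link above suggests; the class name, method names, and the hyphen pattern are made up purely for illustration.

```java
// Sketch only: assumes gsub behaves like java.util.regex replaceAll.
class ReplacementSyntaxDemo {
    // Hypothetical pattern for illustration: two word groups joined by a hyphen.
    static final String PATTERN = "(\\w+)-(\\w+)";

    // "$n" in a Java replacement string refers to capture group n.
    static String withDollar(String input) {
        return input.replaceAll(PATTERN, "$2-$1");
    }

    // "\1" in a Java replacement string is an escaped literal '1',
    // not a back reference.
    static String withBackslash(String input) {
        return input.replaceAll(PATTERN, "\\1");
    }

    public static void main(String[] args) {
        System.out.println(withDollar("foo-bar"));    // bar-foo
        System.out.println(withBackslash("foo-bar")); // 1
    }
}
```

In Java replacement strings a backslash only escapes the next character, which would explain why a Logstash-style "\\1" ends up writing a literal "1".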

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "description": "Populate dns.question.registered_domain from dns.question.name",
          "field": "url",
          "target_field": "dns.question.registered_domain",
          "pattern": """[.*\.([^.]+\.[^.]+)]""",
          "replacement": "$1",
          "ignore_missing": true,
          "ignore_failure": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "url": "www.domain.com"
      }
    }
  ]
}

Result: it seems to be trying, but it is not working because the group is not identified.

{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "index_out_of_bounds_exception",
            "reason": "No group 1"
          }
        ],
        "type": "index_out_of_bounds_exception",
        "reason": "No group 1"
      }
    }
  ]
}

I am not a regex expert, but I tried and made this work. I don't think it does everything you want, but I think it shows that the substitution works.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "description": "Populate dns.question.registered_domain from dns.question.name",
          "field": "url",
          "target_field": "dns.question.registered_domain",
          "pattern": """^\w+\.(\w+\.\w+)*$""",
          "replacement": "$1",
          "ignore_missing": true,
          "ignore_failure": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "url": "www.domain.uk"
      }
    }
  ]
}

Result

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "dns": {
            "question": {
              "registered_domain": "domain.uk"
            }
          },
          "url": "www.domain.uk"
        },
        "_ingest": {
          "timestamp": "2023-04-04T20:02:36.700639172Z"
        }
      }
    }
  ]
}

So, my syntax was indeed wrong. The square brackets around my pattern were causing gsub to interpret the whole thing as a character class instead of as a sequence containing a capture group. It was therefore matching one character at a time and never saw a group for the back reference, hence the "No group 1" error in your first example. I'm glad gsub has the same functionality. I really appreciate your help!
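The character-class behaviour can be reproduced outside Elasticsearch. The following is a rough sketch in plain Java, again assuming gsub is backed by java.util.regex; the class and method names are hypothetical. It shows that the bracketed pattern compiles with zero capture groups, that a "\\1" replacement writes a literal 1 for every matched character, and that "$1" fails because no group exists.

```java
import java.util.regex.Pattern;

// Sketch only: reproduces the bracketed pattern from the original post.
class CharacterClassDemo {
    // The outer [ ... ] turns everything inside into one character class.
    static final Pattern BRACKETED = Pattern.compile("[.*\\.([^.]+\\.[^.]+)]");

    // Parentheses inside a character class are literals, so no group is defined.
    static int groupCount() {
        return BRACKETED.matcher("").groupCount();
    }

    // Each single character matches the class and becomes a literal '1'.
    static String replaceWithBackslashOne(String input) {
        return BRACKETED.matcher(input).replaceAll("\\1");
    }

    // "$1" refers to a group that does not exist, so replaceAll throws.
    static boolean dollarOneFails(String input) {
        try {
            BRACKETED.matcher(input).replaceAll("$1");
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(groupCount());
        System.out.println(replaceWithBackslashOne("www.example.com"));
        System.out.println(dollarOneFails("www.example.com"));
    }
}
```

Dropping the outer square brackets restores the capture group, after which "$1" resolves as expected.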

@kmfreder1 Thanks good to know... AND

Please post your solution so others can learn!!!

Absolutely! Here is the processor that is confirmed to work, including the back reference I was looking for.

      {
        "gsub": {
          "description": "Populate dns.question.registered_domain from dns.question.name",
          "field": "dns.question.name",
          "target_field": "dns.question.registered_domain",
          "pattern": ".*\\.([^.]+\\.[^.]+)",
          "replacement": "$1",
          "ignore_missing": true,
          "ignore_failure": false
        }
      }

And it transforms these:

radio.twc.weather.com
screenhub-builds.s3.amazonaws.com
mcs2-cloudstation-us-east-2.prod.hydra.sophos.com

to these:

weather.com
amazonaws.com
sophos.com
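As a quick way to try the pattern without a cluster, here is a small Java sketch of the same substitution. It is only an approximation of the processor, assuming gsub applies the pattern the way String.replaceAll does; the class and method names are invented for the example. It also shows two limits of the regex approach: a name with only two labels does not match and is left unchanged, and a multi-part suffix such as ac.uk is returned instead of the registered domain.

```java
// Sketch only: mirrors the working gsub pattern with String.replaceAll.
class RegisteredDomainDemo {
    // Keep the last two dot-separated labels; names that do not match
    // (fewer than three labels) are returned unchanged.
    static String lastTwoLabels(String name) {
        return name.replaceAll(".*\\.([^.]+\\.[^.]+)", "$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "radio.twc.weather.com",
            "screenhub-builds.s3.amazonaws.com",
            "mcs2-cloudstation-us-east-2.prod.hydra.sophos.com",
            "example.com",        // no match, stays as it is
            "www.example.ac.uk"   // yields ac.uk, not example.ac.uk
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + lastTwoLabels(sample));
        }
    }
}
```

For names under multi-part public suffixes, the registered domain processor mentioned earlier in the thread is the safer option once the cluster version allows it.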

Thanks again!
