Uri_parts Ingest Processor - domain field

I'm working with the Zscaler integration and noticed that, when the test logs from the repo are passed through the pipeline, the uri_parts ingest processor does not create url.domain from the URL that Zscaler sends.

I built a pipeline that only tests the uri_parts processor, and the missing scheme appears to be the cause: adding https:// to the URL populates url.domain. Is this expected behavior? Zscaler does not send the scheme with the URL, and while modifying the pipeline probably isn't a big deal, I'm trying to keep it as close to vanilla as possible so that upgrading the integration stays simple. Sample outputs below:

## Sample without scheme
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "test uri_parts",
    "processors": [
      {
        "uri_parts" : {
          "field" : "message"
        }
      }
    ]
  },
  "docs":[
    {
      "_source":{
        "message":"www.example.com/testpath/index.php?user=joe"
      }
    }
  ]
}

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_id" : "_id",
        "_source" : {
          "message" : "www.example.com/testpath/index.php?user=joe",
          "url" : {
            "path" : "www.example.com/testpath/index.php",
            "extension" : "php",
            "original" : "www.example.com/testpath/index.php?user=joe",
            "scheme" : null,
            "domain" : null,
            "query" : "user=joe"
          }
        },
        "_ingest" : {
          "timestamp" : "2022-03-20T18:35:01.534115808Z"
        }
      }
    }
  ]
}

## Sample with scheme
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "test uri_parts",
    "processors": [
      {
        "uri_parts" : {
          "field" : "message"
        }
      }
    ]
  },
  "docs":[
    {
      "_source":{
        "message":"https://www.example.com/testpath/index.php?user=joe"
      }
    }
  ]
}

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_id" : "_id",
        "_source" : {
          "message" : "https://www.example.com/testpath/index.php?user=joe",
          "url" : {
            "path" : "/testpath/index.php",
            "extension" : "php",
            "original" : "https://www.example.com/testpath/index.php?user=joe",
            "scheme" : "https",
            "domain" : "www.example.com",
            "query" : "user=joe"
          }
        },
        "_ingest" : {
          "timestamp" : "2022-03-20T18:35:13.970828081Z"
        }
      }
    }
  ]
}

Hey,

My assumption is that the value is not treated as a valid URI because the scheme/protocol, such as http:// or https://, is missing.
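
For illustration, generic URI parsers tend to behave the same way: without a scheme or a leading `//`, everything before the `?` is read as a relative path, so there is no host to extract. Below is a minimal sketch using Python's standard `urllib.parse` as a stand-in. This is an assumption made for illustration only: the uri_parts processor runs inside Elasticsearch and its parser may differ in details. The `split_url` helper is hypothetical.

```python
from urllib.parse import urlsplit

def split_url(raw):
    """Return the parts a generic URI parser sees (hypothetical helper)."""
    parts = urlsplit(raw)
    return {
        "scheme": parts.scheme or None,
        "domain": parts.hostname,  # None when there is no authority component
        "path": parts.path,
        "query": parts.query or None,
    }

# Same two inputs as in the samples above.
without_scheme = split_url("www.example.com/testpath/index.php?user=joe")
with_scheme = split_url("https://www.example.com/testpath/index.php?user=joe")

print(without_scheme)
print(with_scheme)
```

In this sketch the scheme-less input yields no domain and the host name ends up in the path, which mirrors the url.path and url.domain values in the first sample.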

Please open a feature request for this in the elastic/elasticsearch issue tracker on GitHub.

--Alex

Thanks, I created issue #85161 in elastic/elasticsearch: "Update uri_parts processor to extract domain without a prefix/scheme".
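
Until the processor handles this case, one option discussed above is to add a scheme before parsing. A rough sketch of that idea in Python follows, assuming the URL can be normalized before it is parsed. The `ensure_scheme` helper and the `https` default are hypothetical; since Zscaler does not send the scheme, the chosen default is a guess and not a fact about the original request.

```python
from urllib.parse import urlsplit

def ensure_scheme(raw, default_scheme="https"):
    """Prepend a default scheme when the URL has none (hypothetical helper)."""
    # Only inspect the part before the query string, so a URL passed as a
    # query parameter value is not mistaken for the scheme of the whole URL.
    head = raw.split("?", 1)[0]
    if "://" in head:
        return raw
    return f"{default_scheme}://{raw}"

normalized = ensure_scheme("www.example.com/testpath/index.php?user=joe")
host = urlsplit(normalized).hostname

print(normalized)
print(host)
```

Note that this changes the stored value as well: as the second sample shows, url.original then includes the added scheme.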
