Elasticsearch 5 pipeline grok pattern_definitions issue

I need to define an additional pattern_definition for my pipeline. Some of our virtual host names contain an invalid character, "_", and they appear in our Apache log files.
The standard definition of the grok pattern HOSTNAME is:
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)

With a small modification I get a pattern that is valid for my hostnames. I tested it with https://grokdebug.herokuapp.com/ and it seems OK.

HOSTNAME_BAD \b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)

When I try to save my new definition in the pipeline:

PUT _ingest/pipeline/apache-combined_01
{
  "description": "grok_apache_combined_01",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{COMBINEDAPACHELOG} %{HOSTNAME:virtual_host} %{NUMBER:response_time}"],
        "pattern_definitions" : {
          "HOSTNAME_BAD" : "\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)"
        }
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [ "dd/MMM/YYYY:HH:mm:ss Z" ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "inline": "ctx.response_time_segs = Float.parseFloat(ctx.response_time) / params.microstosecs",
        "params": {
          "microstosecs": 1000000
        }
      }
    }
  ]
}

I get an error:

{
  "error": {
    "root_cause": [
      {
        "type": "parse_exception",
        "reason": "Failed to parse content to map"
      }
    ],
    "type": "parse_exception",
    "reason": "Failed to parse content to map",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Unrecognized character escape '.' (code 46)\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@41e3e446; line: 9, column: 72]"
    }
  },
  "status": 400
}

I can get a pattern that is accepted as valid if I do not escape the ".":

        "patterns": ["%{COMBINEDAPACHELOG} %{HOSTNAME_BAD:virtual_host} %{NUMBER:response_time}"],
        "pattern_definitions" : {
          "HOSTNAME_BAD" : "\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(.?|\b)"
        }

With this change I can save the pipeline definition, but it does not work for parsing the Apache log.
The error in Filebeat is: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value:

I have also tested with a double escape, without success:
"HOSTNAME_BAD" : "\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(.\.?|\b)"

How can I solve that?
Is it possible to see the predefined patterns in Elasticsearch?

I have found a solution: an alternative pattern that uses no backslashes and matches both the classic HOSTNAME and hostnames containing the invalid character "_".
"HOSTNAME" : "(?:(?:(?:(?:[a-zA-Z0-9][-_a-zA-Z0-9]{0,61})?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]|[a-zA-Z])[.]?)"

...
    {
      "grok": {
        "field": "message",
        "patterns": ["%{COMBINEDAPACHELOG} %{HOSTNAME_BAD:virtual_host} %{NUMBER:response_time}"],
        "pattern_definitions" : {
          "HOSTNAME_BAD" : "(?:(?:(?:(?:[a-zA-Z0-9][-_a-zA-Z0-9]{0,61})?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]|[a-zA-Z])[.]?)"
        }
      }
    },
...

It seems impossible to use backslashes.
Perhaps it would be possible with a "patterns_dir" option, as in Logstash?

Hi there, which part of your pattern are you referring to when you say it cannot be escaped?

Here is an example call where I match a literal . using escaping:

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
  "description": "grok_apache_combined_01",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{MYPAT}"],
        "pattern_definitions" : {
          "MYPAT" : "\\."
        }
      }
	}]},
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "message": "."
      }
    }
  ]
}

Because the request body is JSON and the string is parsed as JSON before it reaches grok, a double backslash (\\) is required.
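To see the two parsing layers without a cluster, here is a minimal Python sketch. It only uses the standard json and re modules as stand-ins: Python's JSON parser is not the one Elasticsearch uses, and Python's regex engine is not the one behind grok, so treat it as an illustration of the escaping rule rather than of Elasticsearch itself.

```python
import json
import re

# Request text as typed in the console: the JSON string contains \\.
body_ok = r'{"MYPAT": "\\."}'
# Same definition with a single backslash: \. is not a legal JSON escape.
body_bad = r'{"MYPAT": "\."}'

# Layer 1: JSON decoding turns \\ into one backslash.
pattern = json.loads(body_ok)["MYPAT"]
print(pattern)        # \.
print(len(pattern))   # 2 (a backslash and a dot)

# Layer 2: the regex engine now sees \. and matches only a literal dot.
print(bool(re.fullmatch(pattern, ".")))   # True
print(bool(re.fullmatch(pattern, "a")))   # False

# The single-backslash body never gets as far as the regex engine.
try:
    json.loads(body_bad)
    rejected = False
except json.JSONDecodeError as err:
    rejected = True
    print("rejected by the JSON parser:", err.msg)
```

The rejected case corresponds to the "Unrecognized character escape '.'" error earlier in the thread: the failure happens while the request body is being parsed, before grok ever sees the pattern.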

Thanks Tal, you're right.
With your example I have also learned how to test patterns faster and more cleanly.
I have now tested the real HOSTNAME pattern on its own. My mistake was not escaping every backslash. When the pipeline is created, ES only reports an error for the backslash followed by a dot ("\."), because "\b" is itself a valid JSON escape (backspace) and is accepted silently. All the backslashes in the pattern must be escaped for it to work.

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
  "description": "grok_apache_combined_01",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{HOSTNAME_BAD:virtual_host}"],
        "pattern_definitions" : {
          "HOSTNAME_BAD" : "\\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\\.?|\\b)"
        }
      }
	}]},
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "message": "hrz-hostname-0008.domain.local"
      }
    }
  ]
}
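For anyone wondering why only the backslash before the dot raised an error while the pattern without it saved but never matched, here is a small Python sketch. It assumes Python's json and re modules behave closely enough to Elasticsearch's JSON parser and grok's regex engine for this pattern, which uses only common regex syntax. The variable names are just for illustration.

```python
import json
import re

host = "hrz-hostname-0008.domain.local"

# The definition that could be saved: single backslashes, dots unescaped.
saved = json.loads(
    r'"\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(.?|\b)"'
)
# The working definition: every backslash doubled.
fixed = json.loads(
    r'"\\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\\.?|\\b)"'
)

# In JSON, \b is the backspace character (U+0008), not a word boundary.
print(repr(saved[0]))    # '\x08'
print(repr(fixed[:2]))   # '\\b'

# The saved pattern demands a literal backspace before the hostname.
m_saved = re.search(saved, host)
m_fixed = re.search(fixed, host)
print(m_saved)            # None
print(m_fixed.group(0))   # hrz-hostname-0008.domain.local
```

So the single-backslash version is valid JSON and a valid regex, just not the regex that was intended, which fits the "Provided Grok expressions do not match field value" error.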

1.- Original expression for HOSTNAME (grok)

\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)

2.- "Identical" expression to insert in the pipeline processor "pattern_definitions", with escaped backslashes

\\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\\.?|\\b)

3.- Modified expression supporting "_" in hostnames, to insert in the pipeline processor "pattern_definitions", with escaped backslashes

\\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\\.?|\\b)
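One way to avoid doubling the backslashes by hand is to keep the regex in its plain form and let a JSON serializer produce the escaped text. The sketch below does this with Python's json.dumps. The hostname with "_" is a made-up sample, the variable names are hypothetical, and the result is only printed, not sent to a cluster.

```python
import json
import re

# Plain regex, exactly as it would be written in a grok patterns file.
HOSTNAME_BAD = r"\b(?:[0-9A-Za-z][0-9A-Za-z_-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)"

# Hypothetical sample hostname containing the "invalid" underscore.
sample = "hrz_hostname-0008.domain.local"

# Body for POST _ingest/pipeline/_simulate, built as a plain dict.
request_body = {
    "pipeline": {
        "description": "grok_apache_combined_01",
        "processors": [
            {
                "grok": {
                    "field": "message",
                    "patterns": ["%{HOSTNAME_BAD:virtual_host}"],
                    "pattern_definitions": {"HOSTNAME_BAD": HOSTNAME_BAD},
                }
            }
        ],
    },
    "docs": [
        {"_index": "index", "_type": "type", "_id": "id", "_source": {"message": sample}}
    ],
}

# json.dumps doubles every backslash, so the output can be pasted as-is.
encoded = json.dumps(request_body, indent=2)
print(encoded)

# Local sanity check with Python's regex engine (not the one grok uses).
match = re.match(HOSTNAME_BAD, sample)
print(match.group(0))   # hrz_hostname-0008.domain.local
```

Decoding the printed text again gives back the original single-backslash regex, which is what the grok processor should end up compiling.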

Thanks a lot.
