Scripted fields regex


(Pavel) #1

Hi there,

I collect data from ntopng in Elasticsearch.
I have an "HTTP_HOST.keyword" field that contains the FQDN.
I need to create a field containing only the first- and second-level domain labels.
Regex: [^.]+\.[^.]+$
For example:
7.tlu.dl.delivery.mp.microsoft.com -> microsoft.com

I created a script, as described in the example "Match a string and return that match":

def m = /[^.]+\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);
if (m.matches()) {
  return m.group(1)
} else {
  return "no match"
}

But I get the error:

Error: Request to Elasticsearch failed:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "java.util.regex.Matcher.getTextLength(Matcher.java:1283)",
          "java.util.regex.Matcher.reset(Matcher.java:309)",
          "java.util.regex.Matcher.<init>(Matcher.java:229)",
          "java.util.regex.Pattern.matcher(Pattern.java:1093)",
          "m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\n",
          "^---- HERE"
        ],
        "script": "def m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\nif (m.matches()) {\n return m.group(1)\n} else {\n return \"no match\"\n}",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "ntopng-2018.01.18",
        "node": "lHy542BoQ1-7g6ifEFLHcw",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "java.util.regex.Matcher.getTextLength(Matcher.java:1283)",
            "java.util.regex.Matcher.reset(Matcher.java:309)",
            "java.util.regex.Matcher.<init>(Matcher.java:229)",
            "java.util.regex.Pattern.matcher(Pattern.java:1093)",
            "m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\n",
            "^---- HERE"
          ],
          "script": "def m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\nif (m.matches()) {\n return m.group(1)\n} else {\n return \"no match\"\n}",
          "lang": "painless",
          "caused_by": {
            "type": "null_pointer_exception",
            "reason": null
          }
        }
      }
    ]
  },
  "status": 500
}

What am I doing wrong?


(Matt Bargar) #2

You're getting a null pointer exception, so I suspect HTTP_HOST.keyword is a sparse field. Try adding a null check on the value of that field before attempting to perform the match.


(Pavel) #3

Hi,
I added a null check as you suggested:

if (doc['HTTP_HOST.keyword'].value != null) {
  def m = /[^.]+\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);
  if (m.matches()) {
    return m.group(1)
  }
}

And now I see another error:

{
"error": {
"root_cause": [
{
"type": "null_pointer_exception",
"reason": null
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "ntopng-2018.01.18",
"node": "lHy542BoQ1-7g6ifEFLHcw",
"reason": {
"type": "null_pointer_exception",
"reason": null
}
}
]
},
"status": 500
}


(Matt Bargar) #4

In the case where the null check fails, add an additional return statement as a catch-all. Return a string that represents an empty value like "EMPTY" or "NULL" and then you can search on those values as well if need be. In general, you want to make sure your scripts always return the same data type.


(Pavel) #5

Thank you, that helped.
Now I see that the problem is with the match itself.
However, I can't figure out what's wrong yet. Every online regex debugger tells me that I wrote a correct expression, but Elasticsearch gives errors.
I tried different expressions that give the correct result in the online debugger:

[^.]+\.[^.]+$
(\w+\.\w+)$
([a-zA-Z_0-9]+\.[a-zA-Z_0-9]+)$

Where should I look next? How do you debug scripts in Elasticsearch?


(Matt Bargar) #6

What error are you seeing?


(Pavel) #7

Via Dev Tools I sent this request:

GET ntopng-*/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "test1": {
      "script": "if (doc['HTTP_HOST.keyword'].value != null){ def m = /(\w+\.\w+)$/.matcher(doc['HTTP_HOST.keyword'].value); if (m.matches ()) { return m.group (1) } else { return 'no match' } } else { return 'NULL' }"
    }
  }
}

I received this response:

{
  "error": {
    "root_cause": [
      {
        "type": "json_parse_exception",
        "reason": "Unrecognized character escape 'w' (code 119)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@684b1449; line: 7, column: 75]"
      }
    ],
    "type": "json_parse_exception",
    "reason": "Unrecognized character escape 'w' (code 119)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@684b1449; line: 7, column: 75]"
  },
  "status": 500
}

I understand that it doesn't like the character "w", but as far as I know the expression is written correctly.

If I change \w to [a-zA-Z_0-9], it starts complaining about the symbol "\." instead.


(Matt Bargar) #8

Ah, this is a bit tricky because there are two levels of escaping going on here. It's actually complaining because the JSON is invalid. Since \ is the escape character in JSON, you need a double \\ for every \ you want in your regex. So \w should become \\w and \. should become \\..
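To make the double-escaping concrete, here is a minimal standalone Java sketch (Painless shares Java's string semantics) that simulates what happens to the backslashes. The `replace` call is only a simplified stand-in for the JSON parser's unescaping step, shown for illustration:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // What you type in the Dev Tools request body (JSON source text):
        // the regex appears on the wire as /(\\w+\\.\\w+)$/
        String jsonSource = "/(\\\\w+\\\\.\\\\w+)$/";

        // The JSON parser consumes one backslash from each pair,
        // so the Painless compiler receives /(\w+\.\w+)$/
        String painlessSees = jsonSource.replace("\\\\", "\\");
        System.out.println(painlessSees); // prints /(\w+\.\w+)$/
    }
}
```

So the doubled backslash never reaches the regex engine; it exists only to survive the JSON layer.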

I think this may only apply in the DevTools app though. Kibana should automatically handle the JSON escaping for your scripted fields. If you put the same script in a Kibana scripted field, what error do you get when you run a query in Discover? (if the error format in the UI isn't very readable, you can also grab it from the network tab in your browser's devtools, look for the _msearch request).


(Pavel) #9

Hi Bargs,
That's strange behavior, but I'll take it as it is.
By the way, this rule applies not only in Dev Tools; Elasticsearch itself also requires the double backslash.

Request:

curl -XGET "http://localhost:9200/ntopng-*/_search" -H 'Content-Type: application/json' -d'
{
  "_source": ["HTTP_HOST"],
  "query" : {
        "term" : { "L7_PROTO_NAME.keyword" : "HTTP" }
    },
  "script_fields": {
    "test1": {
      "script": "if (doc[\"HTTP_HOST.keyword\"].value != null){ def m = /([a-zA-Z_0-9]+\.[a-zA-Z_0-9]+)$/.matcher(doc[\"HTTP_HOST.keyword\"].value); if (m.matches ()) { return m.group (1) } else { return \"no match\" } } else { return \"NULL\"}"
    }
  }
}'

Response:

{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Unrecognized character escape '.' (code 46)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7df29cde; line: 9, column: 90]"}],"type":"json_parse_exception","reason":"Unrecognized character escape '.' (code 46)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7df29cde; line: 9, column: 90]"},"status":500}

When using an expression with a double backslash, I don't get any errors, but the fields are not parsed either. For example:

curl -XGET "http://localhost:9200/ntopng-*/_search" -H 'Content-Type: application/json' -d'
{
  "_source": ["HTTP_HOST"],
  "query" : {
        "term" : { "L7_PROTO_NAME.keyword" : "HTTP" }
    },
  "script_fields": {
    "test1": {
      "script": "if (doc[\"HTTP_HOST.keyword\"].value != null){ def m = /([a-zA-Z_0-9]+\\.[a-zA-Z_0-9]+)$/.matcher(doc[\"HTTP_HOST.keyword\"].value); if (m.matches ()) { return m.group (1) } else { return \"no match\" } } else { return \"NULL\"}"
    }
  }
}'

Response:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 25,
    "successful": 25,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 230282,
    "max_score": 3.6334908,
    "hits": [
      {
        "_index": "ntopng-2018.01.23",
        "_type": "ntopng",
        "_id": "VF7GIGEBjdV4feqd0IhT",
        "_score": 3.6334908,
        "_source": {
          "HTTP_HOST": "hs.enforta.ru"
        },
        "fields": {
          "test1": [
            "no match"
          ]
        }
      },
      {
        "_index": "ntopng-2018.01.23",
        "_type": "ntopng",
        "_id": "IF7HIGEBjdV4feqdoo9H",
        "_score": 3.6334908,
        "_source": {
          "HTTP_HOST": "portal.domru.loc"
        },
        "fields": {
          "test1": [
            "no match"
          ]
        }
      },

That seems logical to me, because the expression with the double backslash doesn't look like a correct regex.

My brain is boiling.


(Matt Bargar) #10

@htechno I don't think the double backslash is the reason the regex doesn't match. When ES parses the JSON it unescapes the string so the script that Painless compiles doesn't contain double backslashes.

I played around with your script a bit and I think I see the problem. If you look at the Matcher docs, you'll see that the matches method attempts to match the entire string against the pattern. The find method, in contrast, can match a subsection of the string. So try creating a scripted field based on the following script, swapping matches out for find:

if (doc["HTTP_HOST.keyword"].value != null) { 
  def m = /([a-zA-Z_0-9]+\.[a-zA-Z_0-9]+)$/.matcher(doc["HTTP_HOST.keyword"].value); 
  if (m.find()) { return m.group(1) } 
  else { return "no match" } 
} 
else { return "NULL"}
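To see the matches-versus-find difference outside of Elasticsearch, here is a small standalone Java sketch (Painless regexes are backed by java.util.regex, so the behavior carries over) using the example host from the start of the thread:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchVsFind {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("([a-zA-Z_0-9]+\\.[a-zA-Z_0-9]+)$");
        String host = "7.tlu.dl.delivery.mp.microsoft.com";

        // matches() anchors at BOTH ends: the whole string must fit the
        // pattern, and "label.label" cannot cover an FQDN with many dots.
        System.out.println(p.matcher(host).matches()); // false

        // find() looks for any matching subsequence; the trailing $
        // makes it lock onto the last two labels.
        Matcher m = p.matcher(host);
        if (m.find()) {
            System.out.println(m.group(1)); // microsoft.com
        }
    }
}
```

This is exactly why the scripted field returned "no match" for every document: matches() always failed on hosts with more than one dot.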

(Pavel) #11

@Bargs You're right.
You are my savior!
+100 to Karma


(Matt Bargar) #12

Glad I could help!


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.