Scripted fields regex


(Pavel) #1

Hi there,

I collect data from ntopng in Elasticsearch.
I have an "HTTP_HOST.keyword" field that contains the FQDN.
I need to create a field containing only the first- and second-level domain labels.
Regex: [^.]+\.[^.]+$
For example:
7.tlu.dl.delivery.mp.microsoft.com -> microsoft.com

I created a script, as described in the example "Match a string and return that match":

def m = /[^.]+\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);
if (m.matches()) {
  return m.group(1)
} else {
  return "no match"
}

But I get the error:

Error: Request to Elasticsearch failed:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "java.util.regex.Matcher.getTextLength(Matcher.java:1283)",
          "java.util.regex.Matcher.reset(Matcher.java:309)",
          "java.util.regex.Matcher.<init>(Matcher.java:229)",
          "java.util.regex.Pattern.matcher(Pattern.java:1093)",
          "m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\n",
          "^---- HERE"
        ],
        "script": "def m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\nif (m.matches()) {\n return m.group(1)\n} else {\n return \"no match\"\n}",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "ntopng-2018.01.18",
        "node": "lHy542BoQ1-7g6ifEFLHcw",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "java.util.regex.Matcher.getTextLength(Matcher.java:1283)",
            "java.util.regex.Matcher.reset(Matcher.java:309)",
            "java.util.regex.Matcher.<init>(Matcher.java:229)",
            "java.util.regex.Pattern.matcher(Pattern.java:1093)",
            "m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\n",
            "^---- HERE"
          ],
          "script": "def m = /[^.]+\\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);\nif (m.matches()) {\n return m.group(1)\n} else {\n return \"no match\"\n}",
          "lang": "painless",
          "caused_by": {
            "type": "null_pointer_exception",
            "reason": null
          }
        }
      }
    ]
  },
  "status": 500
}

What am I doing wrong?


(Matt Bargar) #2

You're getting a null pointer exception, so I suspect HTTP_HOST.keyword is a sparse field. Try adding a null check on the value of that field before attempting to perform the match.


(Pavel) #3

Hi,
I added a null check as you suggested:

if (doc['HTTP_HOST.keyword'].value != null) {
  def m = /[^.]+\.[^.]+$/.matcher(doc['HTTP_HOST.keyword'].value);
  if (m.matches()) {
    return m.group(1)
  }
}

And now I see another error:

{
"error": {
"root_cause": [
{
"type": "null_pointer_exception",
"reason": null
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "ntopng-2018.01.18",
"node": "lHy542BoQ1-7g6ifEFLHcw",
"reason": {
"type": "null_pointer_exception",
"reason": null
}
}
]
},
"status": 500
}


(Matt Bargar) #4

In the case where the null check fails, add an additional return statement as a catch-all. Return a string that represents an empty value like "EMPTY" or "NULL" and then you can search on those values as well if need be. In general, you want to make sure your scripts always return the same data type.


(Pavel) #5

Thank you, that helped.
Now I see that the problem is with the match itself.
However, I can't figure out what's wrong yet. Every online regex debugger tells me that I wrote a correct expression, but Elasticsearch gives errors.
I tried different expressions that give the correct result in the online debugger:

[^.]+\.[^.]+$
(\w+\.\w+)$
([a-zA-Z_0-9]+\.[a-zA-Z_0-9]+)$

Where should I look next? How do you debug scripts in Elasticsearch?


(Matt Bargar) #6

What error are you seeing?


(Pavel) #7

Via Dev Tools I sent this request:

GET ntopng-*/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "test1": {
      "script": "if (doc['HTTP_HOST.keyword'].value != null){ def m = /(\w+\.\w+)$/.matcher(doc['HTTP_HOST.keyword'].value); if (m.matches ()) { return m.group (1) } else { return 'no match' } } else { return 'NULL' }"
    }
  }
}

I received this response:

{
  "error": {
    "root_cause": [
      {
        "type": "json_parse_exception",
        "reason": "Unrecognized character escape 'w' (code 119)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@684b1449; line: 7, column: 75]"
      }
    ],
    "type": "json_parse_exception",
    "reason": "Unrecognized character escape 'w' (code 119)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@684b1449; line: 7, column: 75]"
  },
  "status": 500
}

I understand that it doesn't like the character "w", but as far as I know the expression is written correctly.

If I change \w to [a-zA-Z_0-9], it starts complaining about the symbol "\." instead.


(Matt Bargar) #8

Ah, this is a bit tricky because there are two levels of escaping going on here. It's actually complaining because the JSON is invalid. Since \ is the escape character in JSON, you need a double \\ for every \ you want in your regex. So \w should become \\w and \. should become \\..
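To make the double-escaping concrete, here is a minimal standalone Java sketch (Painless shares Java's string semantics) that simulates what happens to the backslashes. The `replace` call is only a simplified stand-in for the JSON parser's unescaping step, shown for illustration:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // What you type in the Dev Tools request body (JSON source text):
        // the regex appears on the wire as /(\\w+\\.\\w+)$/
        String jsonSource = "/(\\\\w+\\\\.\\\\w+)$/";

        // The JSON parser consumes one backslash from each pair,
        // so the Painless compiler receives /(\w+\.\w+)$/
        String painlessSees = jsonSource.replace("\\\\", "\\");
        System.out.println(painlessSees); // prints /(\w+\.\w+)$/
    }
}
```

So the doubled backslash never reaches the regex engine; it exists only to survive the JSON layer.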

I think this may only apply in the DevTools app though. Kibana should automatically handle the JSON escaping for your scripted fields. If you put the same script in a Kibana scripted field, what error do you get when you run a query in Discover? (if the error format in the UI isn't very readable, you can also grab it from the network tab in your browser's devtools, look for the _msearch request).


(Pavel) #9

Hi Bargs,
That's strange behavior, but I'll take it as it is.
By the way, this rule applies not only in Dev Tools; Elasticsearch itself also requires the double backslash.

Request:

curl -XGET "http://localhost:9200/ntopng-*/_search" -H 'Content-Type: application/json' -d'
{
  "_source": ["HTTP_HOST"],
  "query" : {
        "term" : { "L7_PROTO_NAME.keyword" : "HTTP" }
    },
  "script_fields": {
    "test1": {
      "script": "if (doc[\"HTTP_HOST.keyword\"].value != null){ def m = /([a-zA-Z_0-9]+\.[a-zA-Z_0-9]+)$/.matcher(doc[\"HTTP_HOST.keyword\"].value); if (m.matches ()) { return m.group (1) } else { return \"no match\" } } else { return \"NULL\"}"
    }
  }
}'

Response:

{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Unrecognized character escape '.' (code 46)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7df29cde; line: 9, column: 90]"}],"type":"json_parse_exception","reason":"Unrecognized character escape '.' (code 46)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7df29cde; line: 9, column: 90]"},"status":500}

When using an expression with a double backslash, I don't get any errors, but the fields are not parsed either. For example:

curl -XGET "http://localhost:9200/ntopng-*/_search" -H 'Content-Type: application/json' -d'
{
  "_source": ["HTTP_HOST"],
  "query" : {
        "term" : { "L7_PROTO_NAME.keyword" : "HTTP" }
    },
  "script_fields": {
    "test1": {
      "script": "if (doc[\"HTTP_HOST.keyword\"].value != null){ def m = /([a-zA-Z_0-9]+\\.[a-zA-Z_0-9]+)$/.matcher(doc[\"HTTP_HOST.keyword\"].value); if (m.matches ()) { return m.group (1) } else { return \"no match\" } } else { return \"NULL\"}"
    }
  }
}'

Response:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 25,
    "successful": 25,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 230282,
    "max_score": 3.6334908,
    "hits": [
      {
        "_index": "ntopng-2018.01.23",
        "_type": "ntopng",
        "_id": "VF7GIGEBjdV4feqd0IhT",
        "_score": 3.6334908,
        "_source": {
          "HTTP_HOST": "hs.enforta.ru"
        },
        "fields": {
          "test1": [
            "no match"
          ]
        }
      },
      {
        "_index": "ntopng-2018.01.23",
        "_type": "ntopng",
        "_id": "IF7HIGEBjdV4feqdoo9H",
        "_score": 3.6334908,
        "_source": {
          "HTTP_HOST": "portal.domru.loc"
        },
        "fields": {
          "test1": [
            "no match"
          ]
        }
      },

That seems logical to me, because the expression with the double backslash doesn't look like a correct regex.

My brain is boiling.


(Matt Bargar) #10

@htechno I don't think the double backslash is the reason the regex doesn't match. When ES parses the JSON it unescapes the string so the script that Painless compiles doesn't contain double backslashes.

I played around with your script a bit and I think I see the problem. If you look at the Matcher docs, you'll see that the matches method attempts to match the entire string against the pattern. The find method, in contrast, can match a subsection of the string. So try creating a scripted field based on the following script, swapping matches out for find:

if (doc["HTTP_HOST.keyword"].value != null) { 
  def m = /([a-zA-Z_0-9]+\.[a-zA-Z_0-9]+)$/.matcher(doc["HTTP_HOST.keyword"].value); 
  if (m.find()) { return m.group(1) } 
  else { return "no match" } 
} 
else { return "NULL"}
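To see the matches-versus-find difference outside of Elasticsearch, here is a small standalone Java sketch (Painless regexes are backed by java.util.regex, so the behavior carries over) using the example host from the start of the thread:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchVsFind {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("([a-zA-Z_0-9]+\\.[a-zA-Z_0-9]+)$");
        String host = "7.tlu.dl.delivery.mp.microsoft.com";

        // matches() anchors at BOTH ends: the whole string must fit the
        // pattern, and "label.label" cannot cover an FQDN with many dots.
        System.out.println(p.matcher(host).matches()); // false

        // find() looks for any matching subsequence; the trailing $
        // makes it lock onto the last two labels.
        Matcher m = p.matcher(host);
        if (m.find()) {
            System.out.println(m.group(1)); // microsoft.com
        }
    }
}
```

This is exactly why the scripted field returned "no match" for every document: matches() always failed on hosts with more than one dot.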

(Pavel) #11

@Bargs You're right.
You are my savior!
+100 to Karma


(Matt Bargar) #12

Glad I could help!


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.