Need help with regular expression filtering


(United Marsupials) #1

I'm trying to filter by HTTP "Referer" header. The values are too "hairy" for a simple matching, requiring regular expressions. I start with a simple query:

{
  "query": {
    "regexp": {
      "http_referer": {
        "value": "http"
      }
    }
  }
}

and get a bunch of hits, as expected. So far so good. However, when I try to expand the expression to read "http:", I suddenly get none at all -- and there certainly are some, because not all referrers begin with https.

If I change the value to "https", I get a couple of hits, where the string "https" occurs in the middle of the referrer-field -- which is bizarre for two reasons:

  • The manual states, regular expressions are always anchored
  • We have a whole bunch of referrers that begin with "https" -- but those aren't shown

Something is seriously wrong -- I must not be understanding, how the regexps work in Kibana. Any advice? Our Kibana version 4.5.0 (build 9889). Thanks!

(Just double-checked -- the field http_referer is marked as both "Analyzed" and "Indexed" in Kibana's settings.)


(CJ Cenizal) #2

This is a bit of a long-shot but have you tried escaping the colon character?

{
  "query": {
    "regexp": {
      "http_referer": {
        "value": "http\:"
      }
    }
  }
}

Also, you might want to try asking this question in the Elasticsearch forum, since it is a little more relevant to that part of our stack.

Thanks,
CJ


(United Marsupials) #3

With a single backslash I can not even press the "Done" button in Kibana's interface, because it is invalid JSON. With two backslashes -- "http\\:" -- I get no results either...


(CJ Cenizal) #4

Bah, sorry, I've never hand-rolled a filter query before like this. Dumb suggestion on my part. I took a closer look at the docs and did some experimentation. Can we start off with a very basic regexp filter like this:

{
  "query": {
    "regexp": {
      "http_referer": "https.*"
    }
  }
}

This will match all http_referer values which begin with https followed by any characters. Can you also check if there's a http_referer.raw field? If so, you can try querying that instead:

{
  "query": {
    "regexp": {
      "http_referer.raw": "https.*"
    }
  }
}

Please let me know if this helps.

Thanks,
CJ


(United Marsupials) #5

Thanks. Both methods seem to work as quoted. However, if I try to add colon: "https:.*" to the simple "http_referer", things break. The "http_referer.raw" works better -- it finds hits. But the highlighting in Kibana's interface breaks -- nothing is highlighted.

But, I can live without the highlighting -- and will continue to use the .raw in my searches, however surprising it is, that this is necessary... Thank you!


(CJ Cenizal) #6

That's great! If you think the missing highlighting could be a bug, then I think the team would especially like to learn more about it in a GitHub issue if you have time.

Thanks,
CJ


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.