Regex not working for strings containing special characters


(Jamil Bou Kheir-2) #1

I've spent a number of hours trying to get a simple Regexp query to work.
I'm using Elasticsearch 1.0 with the defaults. Here's the data I've posted
to ES:

$ curl -XPOST 'elasticsearch:9200/regex_test/useragent' -d '
{
"@message": "\"userAgent\": \"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)\""
}'

Note the escaped double-quotes. Now I'm trying to match this document with
the following regexp filter:

$ curl -XGET 'elasticsearch:9200/regex_test/useragent/_search' -d '
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"regexp": {
"@message": "Mozilla.5.*"
}
}
}
}
}'

I get 0 hits. I expected it to match the sequence "Mozilla/5...".
I also tried ".Mozilla.", which doesn't work either. However, when I
match against a bare wildcard regexp I do get the result back (showing
that regexp filtering itself works):

$ curl -XGET 'elasticsearch:9200/regex_test/useragent/_search' -d '
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"regexp": {
"@message": ".*"
}
}
}
}
}'

I tried playing around with dynamic mapping templates and using the keyword
analyzer and no analyzer but that didn't seem to make a difference. How can
I go about optimizing the @message field across all my indexes for regexp
searches?

Thanks!

Jamil

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/efa9e596-7af2-4cfa-aea7-a2a072fca42f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

Assuming you have no prior mappings, your first example will put @message
through a standard analyzer - i.e. it will chop it up into pieces using
this analyzer:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

So a query like this will not match (the standard analyzer lowercases the
text and splits it into multiple terms such as ["useragent", "mozilla",
"5.0"], and a regexp is tested against each whole term, never across terms):

    "regexp": {
      "@message": "Mozilla.5.*"
    }

But something like this will (since it matches one of the terms: "mozilla"):

    "regexp": {
      "@message": "mozill."
    }
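
To make the term-level behaviour concrete, here is a minimal Python sketch. It is not Elasticsearch code: `rough_standard_terms` and `regexp_hits` are hypothetical helpers, the tokenizer only approximates the standard analyzer for this particular string, and Python's `re.fullmatch` stands in for the whole-term matching that regexp filters use (the two regex dialects differ, but not for patterns this simple).

```python
import re

# The @message value from the first post, with the inner quotes restored.
MESSAGE = '"userAgent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"'


def rough_standard_terms(text):
    """Hypothetical stand-in for the standard analyzer: lowercase the text,
    then keep runs of letters/digits, allowing dots between digits so that
    "5.0" survives as one term."""
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)*", text.lower())


def regexp_hits(pattern, terms):
    """Return the terms that the pattern matches in full (no partial matches)."""
    return [t for t in terms if re.fullmatch(pattern, t)]


terms = rough_standard_terms(MESSAGE)
print(terms[:3])                          # ['useragent', 'mozilla', '5.0']
print(regexp_hits("Mozilla.5.*", terms))  # [] : wrong case, and spans two terms
print(regexp_hits("mozill.", terms))      # ['mozilla']
```

The first pattern fails for two separate reasons: the indexed terms are lowercase, and no single term contains both "mozilla" and "5".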

If instead you use something like a keyword analyzer (or not_analyzed),
then the whole string is a single token (["\"userAgent\": \"Mozilla/5.0
(compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)\""]).

In this case a query like this will still not match, because the pattern
has to match the entire token and the token does not begin with "Mozilla":

    "regexp": {
      "@message": "Mozilla.5.*"
    }

But something like this will:

    "regexp": {
      "@message": ".*Mozilla.5.*"
    }
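
The keyword (or not_analyzed) case can be sketched the same way. Again this is only an illustration in Python, with a hypothetical helper `regexp_matches`; the single-element list models the fact that the whole field value is indexed as one term, and `re.fullmatch` models the requirement that the pattern cover that entire term.

```python
import re

# With a keyword analyzer (or not_analyzed), the whole value is the only term.
MESSAGE = '"userAgent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"'
keyword_terms = [MESSAGE]


def regexp_matches(pattern, terms):
    """True if the pattern matches at least one term in its entirety."""
    return any(re.fullmatch(pattern, t) is not None for t in terms)


# Anchored at the start of the term, which begins with a quote and "userAgent".
print(regexp_matches("Mozilla.5.*", keyword_terms))    # False
# The leading ".*" absorbs everything before "Mozilla".
print(regexp_matches(".*Mozilla.5.*", keyword_terms))  # True
```

Note that case is preserved here, so "Mozilla" must be written with a capital M, unlike in the analyzed case.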



(Jamil Bou Kheir-2) #3

Ahh ok. I'll have to give the keyword analyzer a try then!

Thanks,
Jamil

(In reply to Binh Ly's message of Friday, February 21, 2014, 2:23:06 PM UTC-8.)


