Analyzing URLs for regexp queries


(matt burton) #1

I have a field in my documents that consists of a URL.
{...
"url":"http://example.com/2014/04/15/foo-bar-baz/"
...}

I would like to use a regexp query/filter to find documents in my index
with urls matching a regex pattern.
For example: "http://example.com/\d{4}/\d{2}/\d{2}/([^/]+)/$"

I'm a bit stumped about how to configure an analyzer in the document
_mapping to enable a regexp search (like above) for the url field. I've
tried the standard and keyword analyzer, but they didn't work.

I'm not even sure this is possible. If not, I can do it outside
of ES, but I thought I'd ask here to see if y'all had any guidance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/62e05ecc-500f-474e-a5e6-220a9eb86eb3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Luiz Guilherme Santos) #2

Hi Matt,

If you mark your field as not_analyzed:
{
    "mappings": {
        "type1": {
            "properties": {
                "url": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}

You could use a regexp query:
POST _search
{
    "query": {
        "regexp": {
            "url": "http://example.com/\d{4}/\d{2}/\d{2}/([^/]+)/$"
        }
    }
}

--
Luiz Guilherme P. Santos


(mcburton) #3

Luiz, thanks for responding!

I had forgotten to mention that I tried not_analyzed as well. The analyzer,
it turns out, wasn't my problem.

I had two problems. First, the ES/Lucene regexp query/filter doesn't support
"\d" for indicating digits, so I had to replace each one with the [0-9]
character class. Once I changed my regex to
"http://example.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/[^/]+/" it worked!
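
The rewrite described above can be sketched as a small helper. This is only an illustration: the function name to_lucene_regexp is made up for this example, the string replacement is naive (it would mishandle an escaped backslash followed by "d", or a "\d" inside a character class), and dropping the leading "^" and trailing "$" assumes that Lucene regexp patterns always have to match the whole term, so anchors are not needed. Python's re module is used at the end purely as a stand-in to sanity-check the result; it is not the Lucene engine.

```python
import re


def to_lucene_regexp(pattern):
    """Rewrite a PCRE-style pattern into a form closer to Lucene regexp syntax.

    Hypothetical helper for illustration only; it handles just the two
    differences discussed in this thread.
    """
    # Replace the \d shorthand with an explicit character class.
    pattern = pattern.replace(r"\d", "[0-9]")
    # Drop anchors, on the assumption that the pattern must match the whole term.
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$") and not pattern.endswith(r"\$"):
        pattern = pattern[:-1]
    return pattern


lucene_pattern = to_lucene_regexp(r"http://example.com/\d{4}/\d{2}/\d{2}/([^/]+)/$")
print(lucene_pattern)

# Sanity check with Python's own regex engine (a stand-in, not Lucene).
sample_url = "http://example.com/2014/04/15/foo-bar-baz/"
matches_sample = re.fullmatch(lucene_pattern, sample_url) is not None
```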

My second problem is that the Python library appears to have a bug. When I
run the following Python using elasticsearch-py:

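from elasticsearch import Elasticsearch

# Assumes a node on localhost:9200, the same one the curl command below targets
es = Elasticsearch()

query = {
    "query": {
        "regexp": {
            "url": "http://example.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/[^/]+/"
        }
    }
}
es.search(index="regex-test", doc_type="test1", body=query)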

I get:
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, u'hits':
{u'hits': [], u'max_score': None, u'total': 0}, u'timed_out': False,
u'took': 11}

However, when I do this query on the command line:

curl -XPOST "http://localhost:9200/regex-test/type1/_search" -d '
{
    "query": {
        "regexp": {
            "url": "http://example.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/[^/]+/"
        }
    }
}'

{"took":6,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"regex-test","_type":"type1","_id":"doc1","_score":1.0,
"_source" : {"url":"http://example.com/2014/04/15/foo-bar-baz/"}

So I guess the issue lies with elasticsearch-py?


(Honza Král) #4

Hi Matt,

That is curious. Could you please try enabling trace logging for
elasticsearch-py and look at exactly what is being sent? My guess is that
something needs to be escaped in Python, though what that
might be eludes me for the time being.

To enable the logging, just do:

import logging
logging.basicConfig()
logging.getLogger('elasticsearch').setLevel(logging.DEBUG)

then you should see all the requests as they are being sent and verify
that the json is being serialized correctly.
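
Independently of the client, the serialization step can be checked in isolation. The sketch below assumes the client turns the body into JSON with something close to Python's standard json module (an assumption, not something confirmed in this thread) and simply prints what that would produce for the query above, so it can be compared with the body passed to curl.

```python
import json

# The same query body that was passed to es.search() earlier in the thread.
query = {
    "query": {
        "regexp": {
            "url": "http://example.com/([0-9]{4})/([0-9]{2})/([0-9]{2})/[^/]+/"
        }
    }
}

# Serialize with the standard library and show the exact JSON text.
serialized = json.dumps(query)
print(serialized)

# Parse it back to confirm nothing in the pattern was altered on the way.
round_tripped = json.loads(serialized)
```

If the printed pattern is identical to the one sent with curl, the regex itself is probably not being mangled, and the remaining differences between the two requests (index, type, host) are worth comparing in the trace output.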

Hope this helps,
Honza


(system) #5