URL analysis

Hi

I have a URL that contains an ID. I would like to extract the ID during
analysis so that the ID part of the URL is searchable. I would also like it
to have its own property in the index, 'productId', and to store it so it can
be returned.

The URL:

http://host:port/path/1234/path

All of the other parts of the URL are non-numeric.

Is there a way to do this in Elasticsearch?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

You may want to take a look at the pattern analyzer (if you know the
structure of your URLs):
http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-analyzer/

--Alex

(In reply to es newbie's message of Thu, May 23, 2013 at 11:44 AM.)

Thanks Alex.

Would the following analyzer definition split on a forward slash?

    "url_index": {
        "pattern": "/",
        "lowercase": false,
        "type": "pattern"
    }
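
For what it's worth, the effect of splitting on "/" can be previewed locally before touching the index. The following is only a rough Python sketch, not Elasticsearch itself: the function name split_on_pattern is hypothetical, and it assumes that empty tokens (such as the one between the two slashes in "http://") are dropped.

```python
import re


def split_on_pattern(text, pattern="/"):
    """Approximate a pattern analyzer: the regex marks the separators."""
    # Drop empty strings, e.g. the one between the two slashes in "http://".
    return [token for token in re.split(pattern, text) if token]


if __name__ == "__main__":
    print(split_on_pattern("http://host:port/path/1234/path"))
    # ['http:', 'host:port', 'path', '1234', 'path']
```

So a split on "/" makes every part of the URL a token, not only the numeric ID.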

(In reply to Alexander Reelsen's message of Thursday, May 23, 2013 11:04:36 AM UTC+1.)

Given that you want to extract just the ID from the URL, rather than all of
the parts, you can use the pattern tokenizer and have it emit only the first
capture group:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
    "settings" : {
        "analysis" : {
            "tokenizer" : {
                "extract_id" : {
                    "pattern" : "/([0-9]+)(/|$)",
                    "group" : 1,
                    "type" : "pattern"
                }
            },
            "analyzer" : {
                "extract_id" : {
                    "tokenizer" : "extract_id"
                }
            }
        }
    }
}
'

This pattern "/([0-9]+)(/|$)" matches:

  • a /
  • followed by one or more digits (captured as group 1)
  • followed by either a / or the end of the string ($)
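
The three parts of the pattern can also be tried out without a running cluster. Below is a small Python sketch that applies the same regular expression with Python's re module; it is only an approximation of what the tokenizer does (the name extract_id_tokens is hypothetical), returning group 1 of every match.

```python
import re

# Same pattern as in the tokenizer settings: "/", digits (group 1), then "/" or end.
ID_PATTERN = re.compile(r"/([0-9]+)(/|$)")


def extract_id_tokens(text):
    """Return group 1 of every match, roughly what the tokenizer would emit."""
    return [match.group(1) for match in ID_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(extract_id_tokens("http://foo:123/path/111/foo"))  # ['111']
    print(extract_id_tokens("http://foo:123/path/111aa/"))   # []
```

Note that the port (123) is not picked up, because it is preceded by ":" rather than "/".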

You can test this out as:

curl 'http://127.0.0.1:9200/test/_analyze?pretty=1&analyzer=extract_id' -d '
http://foo:123/path/111/
'

-> 111

curl 'http://127.0.0.1:9200/test/_analyze?pretty=1&analyzer=extract_id' -d '
http://foo:123/path/111
'

-> 111

curl 'http://127.0.0.1:9200/test/_analyze?pretty=1&analyzer=extract_id' -d '
http://foo:123/path/111/foo
'

-> 111

curl 'http://127.0.0.1:9200/test/_analyze?pretty=1&analyzer=extract_id' -d '
http://foo:123/path/111aa/
'

-> no match
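
One thing worth checking with this approach: as far as I know, a stored field returns the original value that was sent to Elasticsearch (here the whole URL), not the tokens produced by the analyzer. If the ID also needs to come back as its own 'productId' property, one option is to extract it on the client before indexing. A minimal Python sketch, assuming the document is a plain dict with a "url" key; the function name with_product_id is hypothetical, and the regex is the same one used for the tokenizer:

```python
import re

# Same pattern as the tokenizer: "/", digits (group 1), then "/" or end of string.
ID_PATTERN = re.compile(r"/([0-9]+)(/|$)")


def with_product_id(doc):
    """Return a copy of doc with a productId field taken from doc["url"]."""
    enriched = dict(doc)
    match = ID_PATTERN.search(doc["url"])
    if match:
        # Only set the field when the URL actually contains a numeric segment.
        enriched["productId"] = match.group(1)
    return enriched


if __name__ == "__main__":
    print(with_product_id({"url": "http://host:port/path/1234/path"}))
    # {'url': 'http://host:port/path/1234/path', 'productId': '1234'}
```

The enriched document can then be indexed as usual, with 'productId' mapped as an ordinary field.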


Thanks Clinton, that works.

(In reply to Clinton Gormley's message of Thursday, May 23, 2013 12:22:34 PM UTC+1.)