Can't get nGram indexing / querying to work as expected


(hukl) #1

Hey there,

I have already spent half the day in the IRC channel, and karmi as well
as clintongormley were really helpful, but in the end my example still
does not work as expected.

What I want to achieve:

I have docs which include a uri field like

{ uri : "http://www.foobar.com" }

And I want to be able to find that doc by searching for "foo"

{ uri : "http://www.mylatestwebsite.com" }

And I want to be able to find it by searching for "latest"

Now, with the standard analyzer / tokenizer, the domain is not split at
the dots: there is no whitespace, so the dots are treated as part of a
single token. It is also quite common for the words in a domain not to
be separated at all, as in the second example above.

Now collin, karmi and kimchy suggested using nGram for this, and collin
even provided an example, which unfortunately did not work. So I
produced a minimal example based on collin's which I would expect to
work but doesn't, and I'd love to hear any suggestions on how to make
it work.

My ES session for this problem looks like this:

https://gist.github.com/gists/961418
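For reference, here is a minimal sketch of index settings that could define the ascii_ngram and ascii_std analyzers used in the mappings later in this thread. The tokenizer name, filter choices, and gram sizes below are assumptions for illustration, not the actual contents of the gist:

```json
{
  "settings" : {
    "analysis" : {
      "tokenizer" : {
        "uri_ngram" : {
          "type" : "nGram",
          "min_gram" : 2,
          "max_gram" : 3
        }
      },
      "analyzer" : {
        "ascii_ngram" : {
          "tokenizer" : "uri_ngram",
          "filter" : [ "lowercase", "asciifolding" ]
        },
        "ascii_std" : {
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}
```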

Kind regards, John


(Paul Loy) #2

After a bit of playing I did this:

$ curl -XPUT 'http://127.0.0.1:9200/test/website4/_mapping?pretty=1' -d '
{
  "website" : {
    "properties" : {
      "uri" : {
        "type" : "string",
        "include_in_all" : 0,
        "index_analyzer" : "ascii_ngram",
        "search_analyzer" : "ascii_std"
      }
    }
  }
}
'
{
  "ok" : true,
  "acknowledged" : true
}

$ curl -XPOST 'http://127.0.0.1:9200/test/website4?pretty=1' -d '
{
  "uri" : "http://www.heise.de"
}
'
{
  "ok" : true,
  "_index" : "test",
  "_type" : "website4",
  "_id" : "j3KS0Py1TWC7JjnOWeeCIg",
  "_version" : 1
}

$ curl -XPOST 'http://127.0.0.1:9200/test/_refresh'
{"ok":true,"_shards":{"total":10,"successful":5,"failed":0}}

$ curl -XGET 'http://127.0.0.1:9200/test/website4/_search?pretty=1' -d '
{
  "query" : {
    "field" : {
      "uri" : "heis"
    }
  }
}
'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.28602687,
    "hits" : [ {
      "_index" : "test",
      "_type" : "website4",
      "_id" : "j3KS0Py1TWC7JjnOWeeCIg",
      "_score" : 0.28602687,
      "_source" : {
        "uri" : "http://www.heise.de"
      }
    } ]
  }
}

Putting a refresh in there works.


--

Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy


(Paul Loy) #3

Actually, scratch that. What worked was changing the search analyzer to
ascii_ngram. I think the defaults for nGram are min_gram 2 and max_gram 3,
so since "heis" is 4 characters it will not match any of the tokens in the
index: for "heise" the indexed grams are "he", "ei", "is", "se", "hei",
"eis" and "ise", and none of them equals "heis".



(hukl) #4

Thank you! Indeed this makes sense and returns results, but the problem
is that when I add more URLs it returns all of them on every query.

This was also mentioned by collin in this thread:
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/cd08abb8ba01b0d4/38c0ed74e263252a?lnk=gst&q=ngram#38c0ed74e263252a

where he suggested using a different search tokenizer.

So the question is: which search analyzer / tokenizer would fit?
Or is it a bug?

Kind regards, John



(Paul Loy) #5

Yup, you are correct. That's a shame, as I was going to use nGram for
username search :frowning:



(hukl) #6

Let's add a concrete example. I added two URLs:

{ uri : "http://www.heise.de" }
{ uri : "http://mylatestwebsite.com" }

Using the ascii_ngram analyzer, I can search for "hei", "heise" or
"latest" and always get both results.

At least they seem to be in the proper order: searching for "latest"
returns the mylatestwebsite doc in first position, and so on.

Still, I'd expect to get only one result when searching for a substring
that occurs in only one of the docs. Is nGram even the right thing
for that?
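One way out, in line with collin's suggestion of a different search tokenizer, would be to keep the gram-producing analyzer on the index side only, make the grams long enough to contain a whole query word, and search with a non-gram analyzer. A sketch of such a mapping, reusing the analyzer names from this thread (the setup is untested and the gram sizes mentioned below are illustrative):

```json
{
  "website" : {
    "properties" : {
      "uri" : {
        "type" : "string",
        "index_analyzer" : "ascii_ngram",
        "search_analyzer" : "ascii_std"
      }
    }
  }
}
```

Here ascii_ngram would be backed by an nGram tokenizer with, say, min_gram 2 and max_gram 10. A query for "latest" then stays a single 6-character token and matches only documents whose uri actually contains that substring, since every 2- to 10-character substring of the URL was indexed.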

As you probably guessed, I'm still confused.

Kind regards, John



(Paul Loy) #7

Hang on: looking at
http://www.elasticsearch.org/guide/reference/index-modules/analysis/ngram-tokenizer.html
the default min_gram is 1! With 1-grams of every URL in the index, almost
any query that is itself analyzed into grams will share a token with every
document, which would explain why all the URLs come back on each query.



(Paul Loy) #8

So yeah, if you search for "zzzz" it won't find anything. You need to set
min_gram and max_gram to values that cover the query lengths you expect.
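In the index settings that might look like the following. The tokenizer name here is made up, and the bounds are only illustrative; the point is that max_gram needs to be at least the length of the longest term you expect to search for:

```json
{
  "analysis" : {
    "tokenizer" : {
      "uri_ngram" : {
        "type" : "nGram",
        "min_gram" : 2,
        "max_gram" : 10
      }
    }
  }
}
```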



(hukl) #9

Any suggestions for proper values? :wink:



(Paul Loy) #10

I was going to go for 3 and 5 myself.


