Can't get nGram indexing / querying to work as expected

Hey there,

I have already spent half the day in the IRC channel, and karmi as well
as clintongormley were really helpful, but in the end my example still
does not work as expected.

What I want to achieve:

I have docs which include a uri field like

{ uri : "http://www.foobar.com" }

And I want to be able to find that doc by searching for "foo"

{ uri : "http://www.mylatestwebsite.com" }

And I want to be able to find it by searching for "latest"

Now, with the standard analyzer / tokenizer, the domain does not get
split at the dots: there is no whitespace, so the dots are considered
part of the token. Also, it's quite common for domain names not to be
separated into tokens at all, like the one above.

Now collin, karmi and kimchy suggested using nGram for this, and collin
even provided an example which unfortunately did not work. So I
produced a minimal example based on collin's which I would expect to
work but doesn't, and I'd love to hear any suggestions on how to
make this work.

My ES session for this problem looks like this:

https://gist.github.com/gists/961418
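(For readers who don't follow the gist: a custom analyzer pair like ascii_ngram / ascii_std would be declared in the index settings roughly as below. This is a sketch, not the gist's exact definitions; the tokenizer and filter choices here are assumptions.)

```
$ curl -XPUT 'http://127.0.0.1:9200/test' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "ascii_ngram" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "asciifolding", "lowercase", "ngram" ]
        },
        "ascii_std" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "asciifolding", "lowercase" ]
        }
      }
    }
  }
}
'
```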

Kind regards, John

After a bit of playing I did this:

$ curl -XPUT 'http://127.0.0.1:9200/test/website4/_mapping?pretty=1' -d '
{
  "website" : {
    "properties" : {
      "uri" : {
        "type" : "string",
        "include_in_all" : 0,
        "index_analyzer" : "ascii_ngram",
        "search_analyzer" : "ascii_std"
      }
    }
  }
}
'
{
  "ok" : true,
  "acknowledged" : true
}

$ curl -XPOST 'http://127.0.0.1:9200/test/website4?pretty=1' -d '
{
  "uri" : "http://www.heise.de"
}
'
{
  "ok" : true,
  "_index" : "test",
  "_type" : "website4",
  "_id" : "j3KS0Py1TWC7JjnOWeeCIg",
  "_version" : 1
}

$ curl -XPOST 'http://127.0.0.1:9200/test/_refresh'
{"ok":true,"_shards":{"total":10,"successful":5,"failed":0}}

$ curl -XGET 'http://127.0.0.1:9200/test/website4/_search?pretty=1' -d '
{
  "query" : {
    "field" : {
      "uri" : "heis"
    }
  }
}
'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.28602687,
    "hits" : [ {
      "_index" : "test",
      "_type" : "website4",
      "_id" : "j3KS0Py1TWC7JjnOWeeCIg",
      "_score" : 0.28602687,
      "_source" : {
        "uri" : "http://www.heise.de"
      }
    } ]
  }
}

Putting a refresh in there works.


--

Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy

Actually, scratch that. What worked was changing the search analyzer to
ascii_ngram. I think the default n for nGram is min 2, max 3, so since
"heis" is 4 characters it will not match any of the tokens in the index.
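That reasoning can be checked with a few lines of Python. The ngrams function below is a hypothetical stand-in for what an nGram tokenizer indexes for a single token, assuming min 2 / max 3 as guessed above:

```python
def ngrams(text, min_gram, max_gram):
    """Set of character n-grams of length min_gram..max_gram, roughly what
    an nGram tokenizer would emit for a single token (positions ignored)."""
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

index_terms = ngrams("heise", 2, 3)
print(sorted(index_terms))
# ['ei', 'eis', 'he', 'hei', 'is', 'ise', 'se']

# A 4-character query term can never equal a 2- or 3-character gram:
print("heis" in index_terms)  # False
```

So with the un-analyzed "heis" as the query term there is simply no term in the index it could equal; analyzing the query with the same nGram analyzer makes the grams line up again.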


Thank you! Indeed this makes sense and returns results, but the problem
is that when I add more URLs, each query returns all of them.

This was also mentioned by collin in this thread:
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/cd08abb8ba01b0d4/38c0ed74e263252a?lnk=gst&q=ngram#38c0ed74e263252a

where he suggested using a different search tokenizer.

So the question would be: which search analyzer / tokenizer would fit?
Or is it a bug?

Kind regards, John


Yup, you are correct. That's a shame, as I was going to use nGram for
username search :(


Let's add a concrete example:

I added two URLs:

{ uri : "http://www.heise.de" }
{ uri : "http://mylatestwebsite.com" }

and, using the ascii_ngram analyzer, I can search for "hei",
"heise", or "latest" and always get both results.

At least they seem to be in the proper order: searching for "latest"
returns the mylatestwebsite doc in first position, etc.

Still, I'd expect to get only one result when searching for a substring
that is only included in one of the docs. Is nGram even the right
thing for that?

As you can probably guess, I'm still confused.

Kind regards, John


Hang on, looking at the nGram tokenizer docs: the default min_gram is 1!


So yeah, if you search for zzzz it won't find anything. You need to set
min_gram and max_gram yourself instead of relying on the defaults.
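This explains both symptoms at once: with min_gram 1 the index contains every single character, so any query sharing even one letter with a document matches it, while "zzzz" shares none. A quick Python sketch (ngrams is a hypothetical re-implementation of the tokenizer's output, assuming min 1 / max 2):

```python
def ngrams(text, min_gram, max_gram):
    """Set of character n-grams of length min_gram..max_gram, roughly what
    an nGram tokenizer would emit (positions ignored)."""
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

heise    = ngrams("http://www.heise.de", 1, 2)
mylatest = ngrams("http://mylatestwebsite.com", 1, 2)

# "latest" shares unigrams like 't' and 'e' with BOTH urls -> both are hits
print(bool(ngrams("latest", 1, 2) & heise))     # True
print(bool(ngrams("latest", 1, 2) & mylatest))  # True
# "zzzz" only produces 'z' and 'zz', which occur in neither -> no hits
print(bool(ngrams("zzzz", 1, 2) & heise))       # False
```

The document with the larger overlap still scores higher, which is why the ordering looked sensible even though everything matched.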


Any suggestions for proper values? ;)


I was going to go for min 3 / max 5 myself.
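A custom tokenizer with those bounds would be configured roughly like this at index creation time (a sketch; the tokenizer and analyzer names here are made up for illustration):

```
$ curl -XPUT 'http://127.0.0.1:9200/test' -d '
{
  "settings" : {
    "analysis" : {
      "tokenizer" : {
        "my_ngram" : {
          "type" : "nGram",
          "min_gram" : 3,
          "max_gram" : 5
        }
      },
      "analyzer" : {
        "ngram_3_5" : {
          "type" : "custom",
          "tokenizer" : "my_ngram",
          "filter" : [ "lowercase" ]
        }
      }
    }
  }
}
'
```

With min 3, a single shared letter no longer matches every document, and queries of 3 characters or more still hit their substrings.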
