Possible bug with query_string, no source stored, German umlaut and wildcard

Before creating a new issue report I wanted to ask here if someone can
please confirm the following situation.

Searching for a term with a German umlaut, e.g. "Körbe", combined with the
"*" wildcard returns zero hits. This happens with 0.20.5 as well as 0.90RC1
and RC2. It only happens when the index is created without storing the
source. If the source is stored, the possible bug does not seem to be
triggered.

curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "query_string" : {
      "default_field" : "message",
      "query" : "körb*"
    }
  }
}'

Setting "analyze_wildcard" to true brings up results, but they still look
incomplete.

curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "query_string" : {
      "default_field" : "message",
      "analyze_wildcard" : true,
      "query" : "körb*"
    }
  }
}'

The index has been built with a per-field analyzer, mostly "german" or a
custom one without stopwords. A simple one could look like:

curl -XPOST 'http://localhost:9200/twitter' -d '{
  "mappings" : {
    "tweet" : {
      "_source" : { "enabled" : false },
      "properties" : {
        "message" : { "type" : "string", "analyzer" : "german" }
      }
    }
  }
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '{
  "message" : "We are testing with German umlauts. Körbe is a great example."
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '{
  "message" : "We are still testing with German umlauts. Körbe Made in Germany are available for worldwide delivery."
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '{
  "message" : "Here is a third example still with a German umlaut (ä)."
}'

Am I missing something? Can someone confirm it?

Thanks in advance

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

you wrote that your search results 'look incomplete'. Can you elaborate on
that? Executing your second query returns two hits with 'Körbe', which
looks as expected from my bird's-eye view.

--Alex

On Wed, Apr 10, 2013 at 11:18 AM, Wolf t.wolf@bike24.net wrote:

Setting "analyze_wildcard" to true brings up results, but they still look
incomplete.

Thanks for testing. IMHO the first _search query should return the same
number of hits, but it returns zero.

Your question is a bit harder to explain and test. We of course have more
than 3 documents, and we tested prefixes of a term of increasing length by
appending "*" to them, e.g.:

K* has 100 results
Kö* has 12 results
Kör* has 65 results

We were not able to find a reason for that behavior, e.g. with "_explain".
That's why I'm asking.

On Wednesday, April 10, 2013 at 11:32:23 UTC+2, Alexander Reelsen wrote:

you wrote that your search results 'look incomplete'. Can you elaborate on
that?

This is not specific to Elasticsearch. It depends on your analyzer setup.

In Lucene 3.x, you can't combine a wildcard query with an analyzed query:
the wildcard term itself is not passed through the analyzer. Effectively,
it means you can search for "Köln" (analyzed to the token "koln"), "Koln"
(also analyzed to "koln"), or "Ko*" (because "ko" is a prefix of "koln"),
but not "Kö*" (because no token in the index starts with "kö").

In Lucene 4, a "best effort" approach is used to try analyzing a
wildcard query. I haven't tried yet how successful this approach is, but
I understand it is not perfect. I wonder how it should be possible to
analyze "Kö" as a German word to "Ko"; there is no such language-based
stemming rule.
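
To see which terms actually end up in the index, the analyze API can be
asked to run a text through the analyzer of a field. The following is only
a sketch, assuming the "twitter" index and "message" field from the
original post and a node listening on localhost:9200. The exact terms
depend on the analyzer and the version, so no output is shown:

```shell
ES='http://localhost:9200'

# Print the terms that the analyzer of the "message" field produces
# for a full word and for the bare wildcard prefix.
for text in 'Körbe' 'Kör'; do
  curl -s -XGET "$ES/twitter/_analyze?field=message&pretty=true" -d "$text" \
    || echo "no response from $ES (is the node running?)" >&2
done
```

If the term printed for "Körbe" does not start with the literal prefix
typed in the wildcard query, an unanalyzed wildcard cannot match it.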

Workaround: also index your German words un-analyzed into another field,
and let wildcard queries search on this field as well.
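
A rough sketch of such a two-field setup, assuming the "multi_field"
mapping type of the 0.20/0.90 releases. The index name "twitter_raw" and
the sub-field name "raw" are made up for illustration. Note one deviation
from the workaround as stated: "not_analyzed" would index a whole message
as one single term, so this sketch uses the "simple" analyzer for the
second field instead, which splits on non-letters and lowercases but does
not stem:

```shell
ES='http://localhost:9200'

# "message" keeps the german analyzer; "message.raw" is only split and
# lowercased (no stemming). "twitter_raw" and "raw" are made-up names.
MAPPING='{
  "mappings" : {
    "tweet" : {
      "_source" : { "enabled" : false },
      "properties" : {
        "message" : {
          "type" : "multi_field",
          "fields" : {
            "message" : { "type" : "string", "analyzer" : "german" },
            "raw" : { "type" : "string", "analyzer" : "simple" }
          }
        }
      }
    }
  }
}'

# Wildcard query sent to the unstemmed sub-field.
QUERY='{
  "query" : {
    "query_string" : {
      "default_field" : "message.raw",
      "query" : "körb*"
    }
  }
}'

curl -s -XPOST "$ES/twitter_raw" -d "$MAPPING" \
  || echo "no response from $ES (is the node running?)" >&2

# Index the example tweets into twitter_raw before running the search.
curl -s -XGET "$ES/twitter_raw/tweet/_search?pretty=true" -d "$QUERY" \
  || echo "no response from $ES (is the node running?)" >&2
```

Non-wildcard queries can keep using "message", so stemmed matching stays
available alongside the literal prefix search.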

Jörg

On 10.04.13 11:18, Wolf wrote:

Searching for something with a German umlaut e.g. "Körbe" and using
the "*" wildcard results in zero hits.

Thanks for your answer as well. So basically the analyzed tokens contain
only ASCII characters, and that's why the wildcard search doesn't find
anything in the first search query.

I'm wondering if I should then post a feature request for implementing a
more internationalized tokenizer or something in this direction?

On Wednesday, April 10, 2013 at 11:53:40 UTC+2, Jörg Prante wrote:

In Lucene 3.x, you can't combine a wildcard query with an analyzed query.

I assume you use the German analyzer. It's the stemming that reduces words
to base forms (it has nothing to do with ASCII).

You can try other analyzers and you will get slightly different results,
I'm sure.
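
A quick way to compare is to send the same text through several built-in
analyzers. This is only a sketch: the analyzer list and the sample text
are arbitrary picks, and a node on localhost:9200 is assumed:

```shell
ES='http://localhost:9200'
ANALYZERS='german standard simple'

# Run the same sample text through each analyzer and print its terms.
for analyzer in $ANALYZERS; do
  echo "== $analyzer =="
  curl -s -XGET "$ES/_analyze?analyzer=$analyzer&pretty=true" -d 'Körbe Köln' \
    || echo "no response from $ES (is the node running?)" >&2
done
```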

There are internationalized tokenizers (see the ICU analyzer). It's not a
matter of a "more internationalized tokenizer"; they are all in place.

My recommendation is to think about how you want to combine wildcard
queries with analyzed / un-analyzed queries, and to design your fields to
match these use cases.

Jörg

On 10.04.13 12:03, Wolf wrote:

So basically the analyzed tokens contain only ASCII characters, and that's
why the wildcard search doesn't find anything in the first search query.

Got it. Thanks

On Wednesday, April 10, 2013 at 12:12:37 UTC+2, Jörg Prante wrote:

My recommendation is to think about how you want to combine wildcard
queries with analyzed / un-analyzed queries, and to design your fields to
match these use cases.
