Whitespace tokenizer does not tokenize character code 160

Hello All,

I'm using ES 0.20.6.

(all the curl commands below are available at this
gist: https://gist.github.com/imdhmd/cda6880e0cc770e80052)

ISSUE: Unable to tokenize character code 160 (which looks like a space) using
the whitespace tokenizer

Create the index with the following mappings and settings.

The settings contain the match_phrase analyzer definition:

curl -XPOST localhost:9200/newindex -d '{
  "mappings": {
    "newtype": {
      "properties": {
        "going": {
          "type": "string", "analyzer": "match_phrase"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "match_phrase": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

I have indexed the following document.

IMPORTANT NOTE: the space characters in the document below are
character code 160 (U+00A0, the no-break space).

curl -XPOST localhost:9200/newindex/newtype -d '
{
  "going": "Link ID ADV Router Age Seq# Checksum Link count"
}'

When I search using the following query I do NOT get any results.

The query below uses a normal space, but there is no result even if it
uses character code 160.

curl -XPOST localhost:9200/newindex/newtype/_search -d '{
  "query": {
    "query_string": {
      "query": "ADV Router",
      "fields": ["going"]
    }
  }
}'

Can you please suggest a fix to this? Is this a known issue?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi, Imdad,

One work-around is to define a char filter that converts code 160 (U+00A0,
the no-break space) to code 32 (U+0020, the ASCII space) before tokenizing.

I am not on my usual laptop and can't give you a full example. But I do this
for matching across various languages in which there are character
equivalences (for example, in Finnish, W tends to match V).
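For reference, such a filter can be defined with the mapping char filter type; a sketch of the index settings (the filter name nbsp_to_space is illustrative, and the analyzer reuses the match_phrase name from the original post):

```shell
curl -XPOST localhost:9200/newindex -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "nbsp_to_space": {
          "type": "mapping",
          "mappings": ["\u00A0=>\u0020"]
        }
      },
      "analyzer": {
        "match_phrase": {
          "type": "custom",
          "char_filter": ["nbsp_to_space"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'
```

The char filter runs before the tokenizer, so the whitespace tokenizer only ever sees ordinary U+0020 spaces.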

Regards,
Brian

On Friday, July 5, 2013 6:06:15 AM UTC-4, Imdad Ahmed wrote:

ISSUE: Unable to tokenize character code 160 (which looks like a space)
using the whitespace tokenizer

This is an issue with Lucene. The Lucene whitespace tokenizer only checks for
Java's notion of whitespace, as determined by Character.isWhitespace():
http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)

But Java's whitespace set unfortunately differs from the Unicode White_Space
property list in
http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
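The difference is easy to verify directly; a minimal Java check:

```java
// Character.isWhitespace() deliberately excludes the no-break spaces
// (U+00A0, U+2007, U+202F), so the Lucene whitespace tokenizer never
// splits on them, even though Unicode gives U+00A0 a space category.
public class NbspCheck {
    public static void main(String[] args) {
        System.out.println(Character.isWhitespace('\u0020')); // ordinary space: true
        System.out.println(Character.isWhitespace('\u00A0')); // no-break space: false
    }
}
```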

Jörg

On 05.07.13 12:06, Imdad Ahmed wrote:

ISSUE: Unable to tokenize character code 160 (which looks like a space)
using the whitespace tokenizer


I opened an improvement issue
https://issues.apache.org/jira/browse/LUCENE-5096

Jörg

On 08.07.13 01:28, Jörg Prante wrote:

This is an issue with Lucene. The Lucene whitespace tokenizer only checks for
Java's notion of whitespace, as determined by Character.isWhitespace().

Thanks Jörg

On Monday, July 8, 2013 5:12:31 AM UTC+5:30, Jörg Prante wrote:

I opened an improvement issue
https://issues.apache.org/jira/browse/LUCENE-5096

Jörg


For the benefit of others:

char_filter:
  whitespace_mapping:
    type: mapping
    mappings: ["\u00A0=>\u0020"]
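Once the filter is wired into the analyzer, the _analyze API can confirm that the no-break space is now split on (index and analyzer names as in the earlier messages; printf is used here to emit the raw UTF-8 bytes of U+00A0):

```shell
# Send "ADV<U+00A0>Router" to the analyzer; 0xC2 0xA0 is UTF-8 for U+00A0.
# With the char filter in place, the response should contain two tokens.
printf 'ADV\xc2\xa0Router' | \
  curl -XGET 'localhost:9200/newindex/_analyze?analyzer=match_phrase' -d @-
```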

On Monday, July 8, 2013 10:30:07 AM UTC+5:30, Imdad Ahmed wrote:

Thanks Jörg

On Monday, July 8, 2013 5:12:31 AM UTC+5:30, Jörg Prante wrote:

I opened an improvement issue
https://issues.apache.org/jira/browse/LUCENE-5096

Jörg

Am 08.07.13 01:28, schrieb Jörg Prante:

This is an issue with Lucene. Lucene whitespace tokenizer only checks
whitespace for Java, which is realized by Character.isWhiteSpace()

http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)

But the Java whitespaces are unfortunately different from Unicode
whitespace property list in
http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

Jörg

Am 05.07.13 12:06, schrieb Imdad Ahmed:

ISSUE: Unable to tokenize character code 160 (which looks like space)
using whitespace tokenizer

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.