Whitespace tokenizer does not split on character code 160

Hello All,

I'm using ES 0.20.6.

(all the curl commands below are available at this
gist: https://gist.github.com/imdhmd/cda6880e0cc770e80052)

ISSUE: Unable to tokenize character code 160 (which looks like space) using
whitespace tokenizer

Create the index with the following mappings and settings.

The settings contain the match_phrase analyzer definition:

curl -XPOST localhost:9200/newindex -d '{
  "mappings": {
    "newtype": {
      "properties": {
        "going": {
          "type": "string", "analyzer": "match_phrase"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "match_phrase": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'
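For reference, what the match_phrase analyzer produces for a given string can be inspected with the _analyze API (a sketch, assuming the index above was created successfully):

```shell
# Run a sample string through the index's match_phrase analyzer
# and inspect the tokens it emits.
curl 'localhost:9200/newindex/_analyze?analyzer=match_phrase' \
  -d 'Link ID ADV Router'
```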

I have indexed the following document.

IMPORTANT NOTE: the space characters in the document below have character
code 160 (no-break space).

curl -XPOST localhost:9200/newindex/newtype -d '{
  "going": "Link ID ADV Router Age Seq# Checksum Link count"
}'

When I search using the following query, I do NOT get any results. The
query below has a normal space, but there is no result even when it uses
character code 160:

curl -XPOST localhost:9200/newindex/newtype/_search -d '{
  "query": {
    "query_string": {
      "query": "ADV Router",
      "fields": ["going"]
    }
  }
}'

Can you please suggest a fix for this? Is this a known issue?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi, Imdad,

One work-around is to define a char filter that converts code 160 to code
32 (0x20, the ASCII space) before tokenizing.

I am not on my usual laptop and can't give you an example. But I do this
for matching across languages that have character equivalences (for
example, in Finnish, W is commonly matched as equivalent to V).

Regards,
Brian

On Friday, July 5, 2013 6:06:15 AM UTC-4, Imdad Ahmed wrote:

ISSUE: Unable to tokenize character code 160 (which looks like space)
using whitespace tokenizer

This is an issue with Lucene. The Lucene whitespace tokenizer only checks
for Java whitespace, as determined by Character.isWhitespace():
http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)

Unfortunately, Java's notion of whitespace differs from the Unicode
White_Space property list:
http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
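This can be observed directly with the _analyze API, a sketch assuming a local node on port 9200 (the bytes \xc2\xa0 are the UTF-8 encoding of U+00A0):

```shell
# Feed "ADV<U+00A0>Router" to the bare whitespace tokenizer.
# Since U+00A0 does not count as Java whitespace, the input comes
# back as a single token rather than being split in two.
curl 'localhost:9200/_analyze?tokenizer=whitespace' \
  --data-binary "$(printf 'ADV\xc2\xa0Router')"
```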

Jörg

On 05.07.2013 12:06, Imdad Ahmed wrote:

ISSUE: Unable to tokenize character code 160 (which looks like space)
using whitespace tokenizer


I opened an improvement issue
https://issues.apache.org/jira/browse/LUCENE-5096

Jörg

On 08.07.2013 01:28, Jörg Prante wrote:

This is an issue with Lucene. The Lucene whitespace tokenizer only checks
for Java whitespace, as determined by Character.isWhitespace().


Thanks Jörg

On Monday, July 8, 2013 5:12:31 AM UTC+5:30, Jörg Prante wrote:

I opened an improvement issue
[LUCENE-5096] WhitespaceTokenizer supports Java whitespace, should also support Unicode whitespace - ASF JIRA


For the benefit of others:

char_filter:
  whitespace_mapping:
    type: mapping
    mappings: ["\u00A0=>\u0020"]
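Wired into the analyzer from this thread, the REST form would look roughly like this (a sketch; the index name newindex2 is illustrative):

```shell
# Create an index whose analyzer first maps U+00A0 (no-break space)
# to a regular space, then tokenizes on whitespace.
curl -XPOST 'localhost:9200/newindex2' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "whitespace_mapping": {
          "type": "mapping",
          "mappings": ["\u00A0=>\u0020"]
        }
      },
      "analyzer": {
        "match_phrase": {
          "type": "custom",
          "char_filter": ["whitespace_mapping"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'
```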
