Whitespace tokenizer does not tokenize character code 160

Hello All,

I'm using ES 0.20.6.

(all the curl commands below are available at this
gist: https://gist.github.com/imdhmd/cda6880e0cc770e80052)

ISSUE: Unable to tokenize character code 160 (which looks like a space) using
the whitespace tokenizer

Create the index with the following mappings and settings.

The settings contain the match_phrase analyzer definition:

curl -XPOST localhost:9200/newindex -d '{
  "mappings": {
    "newtype": {
      "properties": {
        "going": {
          "type": "string", "analyzer": "match_phrase"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "match_phrase": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

I have indexed the following document.

IMPORTANT NOTE: the space characters in the document below are
character code 160 (U+00A0, the no-break space).

curl -XPOST localhost:9200/newindex/newtype -d '
{
  "going": "Link ID ADV Router Age Seq# Checksum Link count"
}'

When I search using the following query I do NOT get any results.

The query below uses a normal space, but there is no result even if it
uses character code 160.

curl -XPOST localhost:9200/newindex/newtype/_search -d '{
  "query": {
    "query_string": {
      "query": "ADV Router",
      "fields": ["going"]
    }
  }
}'

Can you please suggest a fix to this? Is this a known issue?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi, Imdad,

One work-around is to define a char filter that converts code 160 (U+00A0,
the no-break space) to code 32 (U+0020, the ASCII space) before tokenizing.

I am not on my usual laptop and can't give you a full example. But I do this
for matching across various languages in which there are character
equivalences (for example, in Finnish, W tends to match V).
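For reference, such a filter can be defined with the mapping char filter type; a sketch of the index settings (the filter name nbsp_to_space is illustrative, and the analyzer reuses the match_phrase name from the original post):

```shell
curl -XPOST localhost:9200/newindex -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "nbsp_to_space": {
          "type": "mapping",
          "mappings": ["\u00A0=>\u0020"]
        }
      },
      "analyzer": {
        "match_phrase": {
          "type": "custom",
          "char_filter": ["nbsp_to_space"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'
```

The char filter runs before the tokenizer, so the whitespace tokenizer only ever sees ordinary U+0020 spaces.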

Regards,
Brian

On Friday, July 5, 2013 6:06:15 AM UTC-4, Imdad Ahmed wrote:

ISSUE: Unable to tokenize character code 160 (which looks like a space)
using the whitespace tokenizer

This is an issue with Lucene. The Lucene whitespace tokenizer only checks for
Java's notion of whitespace, as determined by Character.isWhitespace():
http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)

But Java's whitespace set unfortunately differs from the Unicode White_Space
property list in
http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
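The difference is easy to verify directly; a minimal Java check:

```java
// Character.isWhitespace() deliberately excludes the no-break spaces
// (U+00A0, U+2007, U+202F), so the Lucene whitespace tokenizer never
// splits on them, even though Unicode gives U+00A0 a space category.
public class NbspCheck {
    public static void main(String[] args) {
        System.out.println(Character.isWhitespace('\u0020')); // ordinary space: true
        System.out.println(Character.isWhitespace('\u00A0')); // no-break space: false
    }
}
```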

Jörg

On 05.07.13 12:06, Imdad Ahmed wrote:

ISSUE: Unable to tokenize character code 160 (which looks like a space)
using the whitespace tokenizer


I opened an improvement issue
https://issues.apache.org/jira/browse/LUCENE-5096

Jörg

On 08.07.13 01:28, Jörg Prante wrote:

This is an issue with Lucene. The Lucene whitespace tokenizer only checks for
Java's notion of whitespace, as determined by Character.isWhitespace().

Thanks Jörg

On Monday, July 8, 2013 5:12:31 AM UTC+5:30, Jörg Prante wrote:

I opened an improvement issue
https://issues.apache.org/jira/browse/LUCENE-5096

Jörg


For the benefit of others:

char_filter:
  whitespace_mapping:
    type: mapping
    mappings: ["\u00A0=>\u0020"]
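Once the filter is wired into the analyzer, the _analyze API can confirm that the no-break space is now split on (index and analyzer names as in the earlier messages; printf is used here to emit the raw UTF-8 bytes of U+00A0):

```shell
# Send "ADV<U+00A0>Router" to the analyzer; 0xC2 0xA0 is UTF-8 for U+00A0.
# With the char filter in place, the response should contain two tokens.
printf 'ADV\xc2\xa0Router' | \
  curl -XGET 'localhost:9200/newindex/_analyze?analyzer=match_phrase' -d @-
```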

On Monday, July 8, 2013 10:30:07 AM UTC+5:30, Imdad Ahmed wrote:

Thanks Jörg

On Monday, July 8, 2013 5:12:31 AM UTC+5:30, Jörg Prante wrote:

I opened an improvement issue
https://issues.apache.org/jira/browse/LUCENE-5096

Jörg

Am 08.07.13 01:28, schrieb Jörg Prante:

This is an issue with Lucene. Lucene whitespace tokenizer only checks
whitespace for Java, which is realized by Character.isWhiteSpace()

http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)

But the Java whitespaces are unfortunately different from Unicode
whitespace property list in
http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

Jörg

Am 05.07.13 12:06, schrieb Imdad Ahmed:

ISSUE: Unable to tokenize character code 160 (which looks like space)
using whitespace tokenizer

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.