Do entries in a synonym list always get whitespace tokenized?

Glen_Smith · September 27, 2013, 1:14am

It appears to me they do.

(Apologies if this is a repost. Posted over an hour ago and it hasn't shown
up here.)

#!/bin/sh
echo "\nattempt to delete the index"
curl -XDELETE "http://localhost:9200/syndex/?pretty=false"
echo "\ncreate the index"
curl -XPUT "http://localhost:9200/syndex/?pretty=true" -d '{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"syn": {
"tokenizer": "keyword",
"filter": ["color_synonym"]
}
},
"filter": {
"color_synonym": {
"type" : "synonym",
"synonyms" : ["red, another shade"]
}
}
}
}
}'
echo "\n analyze red: I want a single token another shade"
echo "\n instead I get two tokens another & shade"
curl -XGET "localhost:9200/syndex/_analyze?analyzer=syn&pretty=true" -d "red"
echo "\n sanity check how keyword tokenizer handles another shade"
curl -XGET "localhost:9200/syndex/_analyze?analyzer=syn&pretty=true" -d "another shade"
echo "\n and of course it does not split them"

Generates the output

attempt to delete the index
{"ok":true,"acknowledged":true}
create the index
{
"ok" : true,
"acknowledged" : true
}
analyze red: I want a single token another shade

instead I get two tokens another & shade
{
"tokens" : [ {
"token" : "red",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "another",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "shade",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 2
} ]
}
sanity check how keyword tokenizer handles another shade
{
"tokens" : [ {
"token" : "another shade",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 1
} ]
}
and of course it does not split them

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ivan · September 27, 2013, 4:22pm

Correct, the terms are being tokenized with a WhitespaceTokenizer. You can
see the code in SynonymTokenFilterFactory:

https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/analysis/SynonymTokenFilterFactory.java#L85
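
To make the effect concrete, here is a rough shell sketch of what whitespace tokenizing does to the rule from the original post. It is only an illustration, not the Elasticsearch or Lucene code: tokenize_rule is a made-up helper, and shell word splitting stands in for the WhitespaceTokenizer.

```shell
#!/bin/sh
# Illustration only: mimic how a synonym rule is broken up when every
# comma-separated entry is split on whitespace. Not Elasticsearch code.

set -f  # disable globbing so unquoted expansion only does word splitting

# tokenize_rule RULE
# Prints one line per comma-separated entry of RULE, with that entry's
# whitespace-separated tokens joined by "|".
tokenize_rule() {
  printf '%s\n' "$1" | tr ',' '\n' | while read -r entry; do
    # Unquoted on purpose: word splitting plays the role of the tokenizer.
    set -- $entry
    out=""
    for tok in "$@"; do
      out="${out:+$out|}$tok"
    done
    printf '%s\n' "$out"
  done
}

tokenize_rule "red, another shade"
```

The entry "another shade" comes out as two tokens (another|shade), which matches the two SYNONYM tokens at positions 1 and 2 in the _analyze output of the original post.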

Cheers,

Ivan


Ivan · September 27, 2013, 4:23pm

Hit reply too soon. You can change the tokenizer the synonym filter uses by
passing the tokenizer setting in the filter definition.
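
A minimal sketch of what that could look like, reusing the index, analyzer and filter names from the original post. The only change to the settings is the added "tokenizer": "keyword" line inside the color_synonym filter. The ES_URL variable and the "run" argument are assumptions added for illustration, and this targets the 2013-era release discussed in this thread; other versions may treat the setting differently.

```shell
#!/bin/sh
# Sketch: the index from the original post, with the synonym filter told to
# use the keyword tokenizer for its own entries.
# Assumptions: ES_URL and the "run" argument are illustrative additions.

ES_URL="${ES_URL:-http://localhost:9200}"

SETTINGS='{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "syn": {
          "tokenizer": "keyword",
          "filter": ["color_synonym"]
        }
      },
      "filter": {
        "color_synonym": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": ["red, another shade"]
        }
      }
    }
  }
}'

if [ "${1:-}" = "run" ]; then
  # Recreate the index and analyze "red" against a live node.
  curl -XDELETE "$ES_URL/syndex/?pretty=false"
  curl -XPUT "$ES_URL/syndex/?pretty=true" -d "$SETTINGS"
  curl -XGET "$ES_URL/syndex/_analyze?analyzer=syn&pretty=true" -d "red"
else
  # Default: just print the settings so they can be inspected first.
  printf '%s\n' "$SETTINGS"
fi
```

Run it with the argument run against a test node and compare the _analyze result for red with the output in the original post; the intent is for "another shade" to stay a single token.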

--
Ivan


Glen_Smith · September 27, 2013, 5:00pm

Thanks, Ivan. Yeah, I finally did figure out the obvious - the filter gets
its own tokenizer.

Hopefully this reply doesn't take 15 hours to show up...


Ivan · September 27, 2013, 6:13pm

Nope, it showed up right away. BTW, I love your avatar.

--
Ivan

