Synonyms are always whitespace tokenized?


(Glen Smith) #1

I am trying to use a synonym list whose entries contain spaces.
For example, when "1" is passed in a match query against the configured field, I
want it expanded into a token stream that includes the single token "Phase 1".
However, as far as I can tell, each entry in the synonym list is itself
tokenized on whitespace, so a multi-word entry is split into separate tokens.

Is this really the only behavior available?

Running this:
#!/bin/sh
# Start from a clean slate: drop the index if it already exists.
printf '\nattempt to delete the index\n'
curl -XDELETE "http://localhost:9200/syndex/?pretty=false"

# Create the index: keyword tokenizer on the analyzer, plus a synonym filter.
printf '\ncreate the index\n'
curl -XPUT "http://localhost:9200/syndex/?pretty=true" -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "syn": {
          "tokenizer": "keyword",
          "filter": ["color_synonym"]
        }
      },
      "filter": {
        "color_synonym": {
          "type": "synonym",
          "synonyms": ["red, another shade"]
        }
      }
    }
  }
}'

# Analyze "red" to see how the synonym entry is expanded.
printf '\n analyze red: I want a single token another shade\n'
printf '\n instead I get two tokens another & shade\n'
curl -XGET "localhost:9200/syndex/_analyze?analyzer=syn&pretty=true" -d "red"

# Analyze the multi-word text directly, bypassing the synonym expansion.
printf '\n sanity check how keyword tokenizer handles another shade\n'
curl -XGET "localhost:9200/syndex/_analyze?analyzer=syn&pretty=true" -d "another shade"
printf '\n and of course it does not split them\n'

Generates the output:
attempt to delete the index
{"ok":true,"acknowledged":true}
create the index
{
"ok" : true,
"acknowledged" : true
}
analyze red: I want a single token another shade

instead I get two tokens another & shade
{
"tokens" : [ {
"token" : "red",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "another",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "shade",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 2
} ]
}
sanity check how keyword tokenizer handles another shade
{
"tokens" : [ {
"token" : "another shade",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 1
} ]
}
and of course it does not split them
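
One variation that may be worth trying, offered as an untested sketch rather than a confirmed fix: the synonym token filter documentation describes a "tokenizer" setting that controls how the entries of the synonym list themselves are tokenized, with whitespace as the default. Assuming that setting is honored by the version in use, setting it to "keyword" on the filter might keep "another shade" as one token. The index name syndex_kw and the ES_URL environment variable below are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Sketch only: same settings as the script above, plus a "tokenizer" entry on
# the synonym filter. Assumption: the synonym filter accepts this setting and
# uses it to tokenize the synonym entries (the default being whitespace).
SETTINGS='{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "syn": {
          "tokenizer": "keyword",
          "filter": ["color_synonym"]
        }
      },
      "filter": {
        "color_synonym": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": ["red, another shade"]
        }
      }
    }
  }
}'

# Print the request body so it can be inspected without a running node.
printf '%s\n' "$SETTINGS"

# Only talk to a server when ES_URL (hypothetical variable) is set,
# e.g. ES_URL=http://localhost:9200
if [ -n "${ES_URL:-}" ]; then
  # syndex_kw is a hypothetical index name, kept separate from syndex.
  curl -XPUT "$ES_URL/syndex_kw/?pretty=true" -d "$SETTINGS"
  curl -XGET "$ES_URL/syndex_kw/_analyze?analyzer=syn&pretty=true" -d "red"
fi
```

If the setting behaves as described, the _analyze call for "red" would be the place to check whether "another shade" now comes back as one token instead of two.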


