I'm reading this article about Patterns for Synonyms in Elasticsearch and I have some questions about the results I got. Here are the mappings and settings I used:
PUT wheat_syn
{
  "mappings": {
    "wheat": {
      "properties": {
        "description": {
          "type": "text",
          "analyzer": "syn_text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "autophrase_syn": {
          "type": "synonym",
          "synonyms": [
            "triticum aestivum => triticum_aestivum",
            "bread wheat => bread_wheat"
          ]
        },
        "wheat_syn": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": ["triticum_aestivum, bread_wheat, wheat"]
        }
      },
      "analyzer": {
        "syn_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "autophrase_syn", "wheat_syn"]
        }
      }
    }
  }
}
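In case it helps to see what the syn_text analyzer produces for a given text, the tokens can be listed with the _analyze API. This is only a sketch for checking the analysis chain; the sample text is copied from document 3 below, and I'm not asserting what the request returns:
GET wheat_syn/_analyze
{
  "analyzer": "syn_text",
  "text": "bread wheat is good for health."
}
Running it once per description should show which tokens end up in the index for each document, and how many of them there are.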
The documents:
PUT wheat_syn/wheat/_bulk
{ "index" : { "_id" : "1" } }
{ "description": "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food." }
{ "index" : { "_id" : "2" } }
{ "description": "The scientific name is Triticum aestivum." }
{ "index" : { "_id" : "3" } }
{ "description": "bread wheat is good for health." }
The query:
GET wheat_syn/wheat/_search
{
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}
After executing the query, I got the following result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.48155478,
    "hits": [
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "2",
        "_score": 0.48155478,
        "_source": {
          "description": "The scientific name is Triticum aestivum."
        }
      },
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "3",
        "_score": 0.48155478,
        "_source": {
          "description": "bread wheat is good for health."
        }
      },
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "1",
        "_score": 0.46197122,
        "_source": {
          "description": "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food."
        }
      }
    ]
  }
}
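To see how each of these scores is put together, the same query can be rerun with the explain option, which adds a per-hit breakdown of the scoring factors to the response. A minimal sketch, with the same index, type and field as above and nothing else changed:
GET wheat_syn/wheat/_search
{
  "explain": true,
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}
The _explanation section of each hit should show which factors differ between document 1 and the other two.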
Now, my questions are:
- I was expecting the sentence "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food." to come first, since the user was looking for this exact word. Why isn't that the case?
- Why is there a difference in the scores? If it depends on the position of the queried word in the description field, shouldn't the third result be ranked first? (This is a simple example; the more documents I add, the larger the difference in score becomes.)
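One possibility, which is only an assumption on my part, is that each of the 5 shards reported above computes its term statistics locally, and that this matters on an index with only three documents. As a sketch for checking that, the query can be sent with search_type=dfs_query_then_fetch, which collects term statistics from all shards before scoring:
GET wheat_syn/wheat/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}
If the scores change with this setting, shard-local statistics are at least part of the picture; if the order stays the same, something else explains the ranking.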
Thank you!