Short version: I have "Pentium 3" and "Pentium4" in the index. I want a search for "Pentium 4" to return only the record for "Pentium4", and a search for "Pentium3" to return only the record for "Pentium 3".
I want to use edge-ngram because this index will be used to auto-complete search terms. Right now, both records are being returned for both searches.
I used the Edge n-gram documentation page to set up the index as follows:
curl -X PUT "localhost:9200/searches?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": []
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "term": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
'
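To check my understanding of the two analyzers, here is a rough Python simulation of the tokens I believe they emit. It is only an approximation under my own assumptions, not Elasticsearch's actual implementation: `edge_ngrams`, `index_tokens`, `search_tokens` and `matches` are hypothetical helper names, an empty `token_chars` is assumed to keep the whole input as a single token, and the `lowercase` tokenizer is assumed to split on every non-letter character (digits included).

```python
from itertools import groupby


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate an edge_ngram tokenizer with empty token_chars:
    the whole input is one token, and prefixes of it are emitted."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def index_tokens(text):
    """Approximate the 'autocomplete' analyzer: edge n-grams, then lowercase."""
    return [gram.lower() for gram in edge_ngrams(text)]


def search_tokens(text):
    """Approximate the 'autocomplete_search' analyzer (lowercase tokenizer):
    keep runs of letters only, lowercased. Digits and spaces are dropped."""
    return [
        "".join(chars).lower()
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter
    ]


def matches(doc_text, query_text):
    """Approximate a match query with operator 'and':
    every search token must be present among the indexed tokens."""
    indexed = set(index_tokens(doc_text))
    return all(token in indexed for token in search_tokens(query_text))


print(index_tokens("Pentium3"))            # prefixes from 'p' up to 'pentium3'
print(search_tokens("pentium 4"))          # ['pentium']
print(matches("Pentium3", "pentium 4"))    # True
print(matches("Pentium 4", "pentium 4"))   # True
```

If this approximation holds, the digit never reaches the query side, so any document whose indexed prefixes include `pentium` would match. The `_analyze` API should show the real tokens for each analyzer.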
Then I added both documents:
curl -X PUT "localhost:9200/searches/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "term": "Pentium 4"
}
'
curl -X PUT "localhost:9200/searches/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
  "term": "Pentium3"
}
'
This is the search and its output:
curl -X GET "localhost:9200/searches/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "term": {
        "query": "pentium 4",
        "operator": "and"
      }
    }
  }
}
'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18681718,
    "hits" : [
      {
        "_index" : "searches",
        "_id" : "2",
        "_score" : 0.18681718,
        "_source" : {
          "term" : "Pentium3"
        }
      },
      {
        "_index" : "searches",
        "_id" : "1",
        "_score" : 0.17803724,
        "_source" : {
          "term" : "Pentium 4"
        }
      }
    ]
  }
}
I also tried this with min_gram = 2 and got the same result: both documents are returned.
I also tried keeping two copies of each term: the original (term) and a second one (term_no_space) with the spaces removed, so that incoming searches can have their spaces removed as well. Both records still come back:
curl -X GET "localhost:9200/searches/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "term_no_space": {
        "query": "pentium4",
        "operator": "and"
      }
    }
  }
}
'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "searches",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "term" : "Pentium 4",
          "term_no_space" : "pentium4"
        }
      },
      {
        "_index" : "searches",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "term" : "Pentium 3",
          "term_no_space" : "pentium3"
        }
      }
    ]
  }
}
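For clarity, this is roughly how the term_no_space copy and the matching query are built. It is a minimal sketch that assumes the spaces are stripped (and the text lowercased) on the client side before indexing and before searching, which is consistent with the `_source` values above; `strip_spaces`, `build_doc` and `build_query` are illustrative names, not part of any Elasticsearch client.

```python
def strip_spaces(term: str) -> str:
    """Remove all whitespace and lowercase, e.g. 'Pentium 4' -> 'pentium4'."""
    return "".join(term.split()).lower()


def build_doc(term: str) -> dict:
    """Document body with the original term plus the space-free copy."""
    return {"term": term, "term_no_space": strip_spaces(term)}


def build_query(user_input: str) -> dict:
    """Match query against term_no_space, normalizing the input the same way."""
    return {
        "query": {
            "match": {
                "term_no_space": {
                    "query": strip_spaces(user_input),
                    "operator": "and",
                }
            }
        }
    }


print(build_doc("Pentium 4"))   # {'term': 'Pentium 4', 'term_no_space': 'pentium4'}
print(build_query("Pentium 4")["query"]["match"]["term_no_space"]["query"])  # pentium4
```

Normalizing both sides the same way only makes "Pentium 4" and "Pentium4" look identical to the index. It does not by itself change which tokens the search analyzer keeps.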
Any hints or links to existing discussions are welcome! Unfortunately, I don't know the exact term for this scenario, so I haven't been able to find them.
Thank you!