I'm looking for a way to achieve what the percolate query currently does for me, but with less overhead.
I have an index where each document is a product (say Acme Anvils). I want to determine whether a piece of text, say a customer review, mentions a specific product. The way I currently have this working is that each product document has a field of type percolator whose value is a match_phrase query. Here are the index template and an example document.
Index Template:
PUT /_template/product-percolate-test
{
  "index_patterns": [
    "product-percolate-test"
  ],
  "settings": {
    "refresh_interval": "1s",
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_search": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "asciifolding",
            "english_stop",
            "english_stemmer"
          ]
        },
        "english_mention": {
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "asciifolding",
            "english_stop"
          ]
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "abv": {
        "type": "double"
      },
      "name": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "english_search"
          }
        }
      },
      "manufacturer": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "english_search"
          }
        }
      },
      "percolator-message": {
        "type": "text",
        "analyzer": "english_mention"
      },
      "percolator-query": {
        "type": "percolator"
      }
    }
  }
}
Example Product Document
{
  "_index" : "product-percolate-test",
  "_type" : "_doc",
  "_id" : "00c68132-19b5-488f-8ea9-ee4b5d97a996",
  "_score" : 1.0,
  "_source" : {
    "name" : "Anvil 3000 Deluxe",
    "manufacturer" : "Acme",
    "percolator-query" : {
      "match_phrase" : {
        "percolator-message" : {
          "query" : "Anvil 3000 Deluxe",
          "slop" : 1
        }
      }
    }
  }
}
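Because every product's stored query differs only in the phrase it matches, the document body can be generated from the product name. The following is a minimal Python sketch of that idea; `build_product_doc` is a hypothetical helper written for illustration (it is not part of any Elasticsearch client), and the field names are taken from the template above.

```python
def build_product_doc(name, manufacturer, slop=1):
    """Build a product document whose percolator query matches its own name.

    Hypothetical helper for illustration; it only assembles a dict and does
    not talk to Elasticsearch.
    """
    return {
        "name": name,
        "manufacturer": manufacturer,
        # Stored query: matches when the percolated text contains the product
        # name as a phrase, with the given slop applied to term positions.
        "percolator-query": {
            "match_phrase": {
                "percolator-message": {"query": name, "slop": slop}
            }
        },
    }


doc = build_product_doc("Anvil 3000 Deluxe", "Acme")
```

The resulting dict has the same shape as the `_source` of the example document, so it could be serialized to JSON and indexed as-is.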
With this, I can run the following percolate query and get the following results:
Percolate Query
GET /product-percolate-test/_search
{
  "query": {
    "percolate": {
      "field": "percolator-query",
      "document": {
        "percolator-message" : "I purchased the Anvil 3000 Deluxe from Acme about a week ago. I would not recommend this product. Although it is a good anvil, I have yet been able to successfully catch any road runners with it. It instead keeps falling on my head."
      }
    }
  }
}
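For reference, the same request body can be assembled programmatically for any review text. This is only a sketch: `build_percolate_request` is a hypothetical helper name, and the code builds the JSON body without sending it anywhere.

```python
import json


def build_percolate_request(review_text, field="percolator-query"):
    """Return a search body that percolates one review against stored queries.

    Hypothetical helper for illustration; the field names come from the
    index template in this question.
    """
    return {
        "query": {
            "percolate": {
                "field": field,
                # The review is wrapped in a document using the same field
                # that the stored match_phrase queries target.
                "document": {"percolator-message": review_text},
            }
        }
    }


body = build_percolate_request("I purchased the Anvil 3000 Deluxe from Acme.")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of `GET /product-percolate-test/_search`, as in the request above.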
Percolate Results
{
  "took" : 33,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6538229,
    "hits" : [
      {
        "_index" : "product-percolate-test",
        "_type" : "_doc",
        "_id" : "00c68132-19b5-488f-8ea9-ee4b5d97a996",
        "_score" : 0.6538229,
        "_source" : {
          "name" : "Anvil 3000 Deluxe",
          "manufacturer" : "Acme",
          "percolator-query" : {
            "match_phrase" : {
              "percolator-message" : {
                "query" : "Anvil 3000 Deluxe",
                "slop" : 1
              }
            }
          }
        },
        "fields" : {
          "_percolator_document_slot" : [
            0
          ]
        }
      }
    ]
  }
}
Here are the issues I see with this:
- There is a percolate query for each product document, and the number of documents is likely to reach the tens of thousands.
- The percolate query for each product document is exactly the same except for the match_phrase query value, which is always the product name.
- If I need to change the queries used for this, I would need to modify every document, likely through a reindex operation.
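To make the third point concrete, here is a rough sketch of what regenerating every stored query could involve. It assumes the products are re-sent through the `_bulk` API using `index` actions (one way to do it, not necessarily how a reindex would be run in practice); `bulk_index_lines` and the sample product list are hypothetical, and the code only builds the newline-delimited payload without contacting a cluster.

```python
import json

INDEX = "product-percolate-test"


def bulk_index_lines(products, slop=1):
    """Yield NDJSON lines that re-index each product with a rebuilt query.

    `products` is an iterable of (id, name, manufacturer) tuples. Each
    product produces two lines: the bulk action and the full new source.
    """
    for product_id, name, manufacturer in products:
        yield json.dumps({"index": {"_index": INDEX, "_id": product_id}})
        yield json.dumps({
            "name": name,
            "manufacturer": manufacturer,
            # Only the phrase differs between products; changing `slop` or
            # the query type means rewriting every one of these documents.
            "percolator-query": {
                "match_phrase": {
                    "percolator-message": {"query": name, "slop": slop}
                }
            },
        })


products = [("00c68132-19b5-488f-8ea9-ee4b5d97a996", "Anvil 3000 Deluxe", "Acme")]
# The bulk API expects the payload to end with a newline.
payload = "\n".join(bulk_index_lines(products, slop=2)) + "\n"
```

With tens of thousands of products, every change to the shared query template means producing and sending one such pair of lines per product, which is the overhead I would like to avoid.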
I am wondering if there is a more optimal way of achieving the same result.
Thank you