Hi,
Setup: Elasticsearch 6.3.0, 2 shards, 4 replicas on 5 dedicated data nodes.
I have a product feed index with 10,000,000 documents. Each document represents a product and looks like this:
{
  "id": 123,
  "active": true,
  "popularity": 51,
  "url": "http://mysite.com/new-iPhone-12-pro-max"
}
A typical query is to find all products whose "url" contains the substring "iPhone".
I know that regexp queries are considered slow, but this is what we use in production (with the lowercase tokenizer), and unfortunately it is slow (about 1.5s).
Now, there is a known fact that I want to use to improve this: half of the products have a "popularity" value that is lte 0.
My first assumption is that filtering by "range" will improve the "took" - is that true?
In addition, I have added a boolean "active" field that encodes the rule above (true if "popularity" > 0).
My second assumption is that filtering by "term" will perform better than "range" - is that true?
Here is the first query (from the slowlog):
{
  "from": 0,
  "size": 9,
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "query": {
              "match_all": {
                "boost": 1
              }
            },
            "functions": [
              {
                "filter": {
                  "match_all": {
                    "boost": 1
                  }
                },
                "field_value_factor": {
                  "field": "popularity",
                  "factor": 1,
                  "modifier": "none"
                }
              }
            ],
            "score_mode": "multiply",
            "max_boost": 3.4028235e+38,
            "boost": 1
          }
        }
      ],
      "filter": [
        {
          "term": {
            "active": {
              "value": true,
              "boost": 1
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "regexp": {
                  "url": {
                    "value": ".*iphone.*",
                    "flags_value": 65535,
                    "max_determinized_states": 10000,
                    "boost": 1
                  }
                }
              }
            ],
            "adjust_pure_negative": true,
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [],
    "excludes": [
      "active*",
      "popularity*"
    ]
  }
}
I will add the range query if you think that my second assumption is wrong (i.e. that "term" will perform the same as "range").
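For reference, here is a minimal sketch of what the range variant of the filter clause could look like. It is not taken from the slowlog. It assumes that "popularity" is mapped as a numeric field and that the only change is swapping the "term" clause on "active" for a "range" clause:

```json
{
  "range": {
    "popularity": {
      "gt": 0
    }
  }
}
```

This clause would take the place of the "term" filter on "active" inside the "filter" array, with the rest of the query left unchanged.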
I would appreciate any help!