Optimizing a query that matches a large number of documents


(David Smith-2) #1

Hi,

I'm trying to optimize a query that takes 5-10s to run. The query is
run repeatedly for different (pretty much all) users via an offline
process/script daily. The index it runs against has about 4 billion
documents; each query matches approximately 500k documents in that index,
but I only need the top 25 results.

Mapping:

        "_all": {
            "enabled": false
        },
        "_source" : {
            "enabled" : false
        },
        "properties": {
            "user_id": {
                "type": "long",
                "store": "no"
            },
            "item_id": {
                "type": "long",
                "store": "no"
            },
            "match_score": {
                "type": "integer",
                "store": "no"
            },
            .... other fields not used by this query ....
        }

Query:

{
    "size": 25,
    "min_score": 50.0,
    "query": {
        "function_score": {
            "filter": {
                "bool": {
                    "must": {
                        "term": {
                            "item_id": 8342743,
                            "_cache": false
                        }
                    },
                    "must_not": {
                        "term": {
                            "user_id": 10434531,
                            "_cache": false
                        }
                    }
                }
            },
            "functions": [
                {
                    "script_score": {
                        "script": "doc['match_score'].value"
                    }
                }
            ],
            "score_mode": "sum",
            "boost_mode": "replace"
        }
    }
}

The reasoning for the current format is given below:

  • match_score is used as the document score. It's a pre-calculated field
    kept in the index. I'm using function_score here to get these benefits:
    • I can return match_score as the document score
    • I can make use of the default sorting on document score and thereby
      avoid needing to sort on this field
    • I can make use of min_score to act as a range filter on this field
    • I found this approach to be faster than adding a range filter on this
      field to the bool clause, then sorting by this field and returning its
      value via fields: [ "match_score" ]
  • caching is turned off on item_id and user_id because this query is run
    repeatedly for different user / item combinations, but never for the
    same user / item within a given 24-hour period. I'm trying to avoid
    wasting precious cache space on data that won't be used for another day.
  • routing cannot be used here because this index is routed on user_id,
    while this particular query goes across users: we're trying to find the
    users who have this item_id as a match.
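
For reference, the slower range-filter + sort alternative mentioned above might look roughly like the sketch below (values copied from the query above; note that with _source disabled and store set to "no", returning the value through fields would additionally require storing match_score or enabling doc values):

```json
{
    "size": 25,
    "fields": [ "match_score" ],
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        { "term": { "item_id": 8342743, "_cache": false } },
                        { "range": { "match_score": { "gte": 50 } } }
                    ],
                    "must_not": { "term": { "user_id": 10434531, "_cache": false } }
                }
            }
        }
    },
    "sort": [ { "match_score": { "order": "desc" } } ]
}
```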

How can I optimize the query (or perhaps the mapping, with a re-index) so
that it runs much faster? It's a simple enough query that I should be able
to get it under 100ms without any problem. I assume the reason it is taking
so long is that it matches way too many documents. What's a good strategy
for this?

David

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4c1a2301-5c82-4233-9010-ca10b65bb4d0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

Since the script is executed against lots of matched documents, converting
it into a native Java script (not JavaScript) might provide a performance
boost.

Note that using fields in scripts will force their values to be loaded into
the field data cache.
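
A rough sketch of such a native script as a plugin (class and method names are from the Elasticsearch 1.x native script API: NativeScriptFactory, AbstractDoubleSearchScript; verify against the exact version in use):

```java
import java.util.Map;

import org.elasticsearch.common.Nullable;
import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.script.AbstractDoubleSearchScript;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.NativeScriptFactory;
import org.elasticsearch.script.ScriptModule;

public class MatchScoreScriptPlugin extends AbstractPlugin {

    @Override
    public String name() { return "match-score-native-script"; }

    @Override
    public String description() { return "Native replacement for doc['match_score'].value"; }

    // Registers the script under the name used from the query DSL.
    public void onModule(ScriptModule module) {
        module.registerScript("match_score_native", Factory.class);
    }

    public static class Factory implements NativeScriptFactory {
        @Override
        public ExecutableScript newScript(@Nullable Map<String, Object> params) {
            return new MatchScoreScript();
        }
    }

    public static class MatchScoreScript extends AbstractDoubleSearchScript {
        @Override
        public double runAsDouble() {
            // Same value the dynamic script reads, without interpreter overhead.
            return docFieldLongs("match_score").getValue();
        }
    }
}
```

The query would then reference it with "script": "match_score_native" and "lang": "native" in the script_score function.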

--
Ivan

On Sun, Jul 13, 2014 at 8:54 AM, David Smith davidksmith2k@gmail.com
wrote:



(David Smith-2) #3

Hi Ivan,

Thanks for your response. A couple of questions:

  1. Is it possible to use a field in a native script without loading it into cache?

  2. I've considered using a rescore query, moving the script_score to the rescore phase and leaving only the item_id term match in the query phase. With a short window size that would lose a fair bit of accuracy, right? Are there other ways of restructuring the query that I can take advantage of?

  3. (unrelated to the query) What's the most performant way to map fields that you will access from a native script to perform further calculations?

  • Analyzed but not stored
  • not analyzed but stored
  • not analyzed but stored as doc values
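
To make question 2 concrete, the rescore variant might look like the sketch below (the window_size of 1000 is an arbitrary illustration). Since the base query is a constant-score filter, the window is filled in effectively arbitrary order, which is exactly the accuracy concern: the window would have to be large relative to the ~500k matches to reliably contain the true top 25.

```json
{
    "size": 25,
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": { "term": { "item_id": 8342743, "_cache": false } },
                    "must_not": { "term": { "user_id": 10434531, "_cache": false } }
                }
            }
        }
    },
    "rescore": {
        "window_size": 1000,
        "query": {
            "rescore_query": {
                "function_score": {
                    "script_score": { "script": "doc['match_score'].value" },
                    "boost_mode": "replace"
                }
            },
            "query_weight": 0,
            "rescore_query_weight": 1
        }
    }
}
```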

Thanks
David

On Jul 14, 2014, at 1:14 PM, Ivan Brusic ivan@brusic.com wrote:



(system) #4