Optimizing a query that matches a large number of documents


(David Smith-2) #1

Hi,

I'm trying to optimize a query that takes 5-10s to run. The query is
run repeatedly for different (pretty much all) users via an offline
process/script daily. The index it runs against has about 4 billion
documents; each query matches approximately 500k documents in that index,
but I only need the top 25 results.

Mapping:

        "_all": {
            "enabled": false
        },
        "_source" : {
            "enabled" : false
        },
        "properties": {
            "user_id": {
                "type": "long",
                "store": "no"
            },
            "item_id": {
                "type": "long",
                "store": "no"
            },
            "match_score": {
                "type": "integer",
                "store": "no"
            },
            .... other fields not used by this query ....
        }

Query:

{
    "size": 25,
    "min_score": 50.0,
    "query": {
        "function_score": {
            "filter": {
                "bool": {
                    "must": {
                        "term": {
                            "item_id": 8342743,
                            "_cache": false
                        }
                    },
                    "must_not": {
                        "term": {
                            "user_id": 10434531,
                            "_cache": false
                        }
                    }
                }
            },
            "functions": [
                {
                    "script_score": {
                        "script": "doc['match_score'].value"
                    }
                }
            ],
            "score_mode": "sum",
            "boost_mode": "replace"
        }
    }
}

The reasoning for the current format is given below:

  • match_score is used as the document score. It's a pre-calculated field
    kept in the index. I'm using function_score here to get these benefits:
    • I can return match_score as the document score
    • I can make use of the default sorting on document score and thereby
      avoid needing to sort on this field
    • I can make use of min_score to act as a range filter on this field
    • I found this approach to be faster than adding a range filter on this
      field to the bool clause, then sorting by this field and returning its
      value via fields: [ "match_score" ]
  • caching is turned off on item_id and user_id because this query is run
    repeatedly for different user / item combinations, but never for the
    same user / item within a given 24-hour period. I'm trying to avoid
    wasting precious cache space on data that won't be used for another day.
  • routing cannot be used here because this index is routed on user_id,
    while this particular query goes across users: we're trying to find the
    users who have this item_id as a match.
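
For reference, the slower range-filter + sort alternative mentioned above might look roughly like the sketch below (values copied from the query above; note that with _source disabled and store set to "no", returning the value through fields would additionally require storing match_score or enabling doc values):

```json
{
    "size": 25,
    "fields": [ "match_score" ],
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        { "term": { "item_id": 8342743, "_cache": false } },
                        { "range": { "match_score": { "gte": 50 } } }
                    ],
                    "must_not": { "term": { "user_id": 10434531, "_cache": false } }
                }
            }
        }
    },
    "sort": [ { "match_score": { "order": "desc" } } ]
}
```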

How can I optimize the query (or perhaps the mapping, with a re-index) so
that it runs much faster? It's a simple enough query that I should be able
to get it under 100ms without any problem. I assume the reason it is taking
so long is that it matches way too many documents. What's a good strategy
for this?

David

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4c1a2301-5c82-4233-9010-ca10b65bb4d0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

Since the script is executed against lots of matched documents, converting
it into a native Java script (not JavaScript) might provide a performance
boost.

Note that using fields in scripts will force their values to be loaded into
the field data cache.
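
A rough sketch of such a native script as a plugin (class and method names are from the Elasticsearch 1.x native script API: NativeScriptFactory, AbstractDoubleSearchScript; verify against the exact version in use):

```java
import java.util.Map;

import org.elasticsearch.common.Nullable;
import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.script.AbstractDoubleSearchScript;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.NativeScriptFactory;
import org.elasticsearch.script.ScriptModule;

public class MatchScoreScriptPlugin extends AbstractPlugin {

    @Override
    public String name() { return "match-score-native-script"; }

    @Override
    public String description() { return "Native replacement for doc['match_score'].value"; }

    // Registers the script under the name used from the query DSL.
    public void onModule(ScriptModule module) {
        module.registerScript("match_score_native", Factory.class);
    }

    public static class Factory implements NativeScriptFactory {
        @Override
        public ExecutableScript newScript(@Nullable Map<String, Object> params) {
            return new MatchScoreScript();
        }
    }

    public static class MatchScoreScript extends AbstractDoubleSearchScript {
        @Override
        public double runAsDouble() {
            // Same value the dynamic script reads, without interpreter overhead.
            return docFieldLongs("match_score").getValue();
        }
    }
}
```

The query would then reference it with "script": "match_score_native" and "lang": "native" in the script_score function.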

--
Ivan

On Sun, Jul 13, 2014 at 8:54 AM, David Smith davidksmith2k@gmail.com
wrote:



(David Smith-2) #3

Hi Ivan,

Thanks for your response. A couple of questions:

  1. Is it possible to use a field in a native script without loading it into cache?

  2. I've considered using a rescore query, moving the script_score to the rescore phase and leaving only the item_id term match in the query phase. With a short window size that would lose a fair bit of accuracy, right? Are there other ways of restructuring the query that I can take advantage of?

  3. (unrelated to the query) What's the most performant way to map fields that you will access from a native script to perform further calculations?

  • Analyzed but not stored
  • not analyzed but stored
  • not analyzed but stored as doc values
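
To make question 2 concrete, the rescore variant might look like the sketch below (the window_size of 1000 is an arbitrary illustration). Since the base query is a constant-score filter, the window is filled in effectively arbitrary order, which is exactly the accuracy concern: the window would have to be large relative to the ~500k matches to reliably contain the true top 25.

```json
{
    "size": 25,
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": { "term": { "item_id": 8342743, "_cache": false } },
                    "must_not": { "term": { "user_id": 10434531, "_cache": false } }
                }
            }
        }
    },
    "rescore": {
        "window_size": 1000,
        "query": {
            "rescore_query": {
                "function_score": {
                    "script_score": { "script": "doc['match_score'].value" },
                    "boost_mode": "replace"
                }
            },
            "query_weight": 0,
            "rescore_query_weight": 1
        }
    }
}
```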

Thanks
David

On Jul 14, 2014, at 1:14 PM, Ivan Brusic ivan@brusic.com wrote:



(system) #4