I'm trying to optimize a query that takes 5-10s to run. This query is run repeatedly for different (pretty much all) users via an offline process/script daily. The index it runs against has about 4 billion documents; each query matches approximately 500k documents in that index, but I only need the top 25 results.
Mapping:
"_all": {
"enabled": false
},
"_source" : {
"enabled" : false
},
"properties": {
"user_id": {
"type": "long",
"store": "no"
},
"item_id": {
"type": "long",
"store": "no"
},
"match_score": {
"type": "integer",
"store": "no"
},
.... other fields not used by this query ....
}
The reasoning for the current format is given below.

match_score is used as the document score. It's a pre-calculated field kept in the index. I'm using function_score here to get these benefits (a rough sketch of the query shape follows these notes):
- I can return match_score as the document score.
- I can make use of the default sorting on document score and thereby avoid needing to sort on this field.
- I can make use of min_score to act as a range filter on this field.
- I did find this approach to be faster than adding a range filter on this field to the bool clause, then sorting by this field, and then returning the field value via "fields": [ "match_score" ].
Caching is turned off on item_id and user_id, as this query is run repeatedly for different user and item combinations but never for the same user/item within a given 24-hour period. I'm trying to avoid wasting precious cache space loading data that won't be used for another day.
Routing cannot be used here: the index is built using user_id as the routing value, but this particular query goes across users, since we're trying to find users who have this item_id as a match.
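
For reference, the query looks roughly like this (the item_id value, the min_score threshold, and the exact script body are simplified placeholders, and details such as boost_mode may differ slightly in the real query):

{
  "size": 25,
  "min_score": 100,
  "query": {
    "function_score": {
      "query": {
        "filtered": {
          "filter": {
            "term": {
              "item_id": 12345,
              "_cache": false
            }
          }
        }
      },
      "script_score": {
        "script": "doc['match_score'].value"
      },
      "boost_mode": "replace"
    }
  }
}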
How can I optimize the query (or perhaps the mapping, and re-index) so that it runs much faster? It's a simple enough query that I should be able to get it to run quite fast, under 100ms, without any problem. I assume the reason it's taking so long is that it matches far too many documents. What's a good strategy for this?
Since the script is executed against lots of matched documents, perhaps converting it into a native script written in Java (not JavaScript) would provide a performance boost.
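
For instance, once the Java script class is registered through a plugin, the script_score clause would reference it by its registered name (the name below is hypothetical):

"script_score": {
  "script": "match_score_native",
  "lang": "native"
}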
Note that using fields in scripts will force their values to be loaded into
the cache.
Is it possible to use a field in a native script without loading it into cache?
I've considered using a rescore query and moving the script_score to the rescore phase, leaving only the item_id term match in the query phase (sketched below). With a short window size that would lose a fair bit of accuracy, right? Are there other ways of restructuring the query I can take advantage of?
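
Something along these lines (the window_size and item_id values are just placeholders):

{
  "size": 25,
  "query": {
    "term": { "item_id": 12345 }
  },
  "rescore": {
    "window_size": 1000,
    "query": {
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": "doc['match_score'].value"
          },
          "boost_mode": "replace"
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}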
(Unrelated to this query.) What's the most performant way to map a field when you will access it from a native script to perform further calculations? A sample mapping for the third option is sketched after this list.
- analyzed but not stored
- not analyzed but stored
- not analyzed but stored as doc values
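
For the third option, I assume the mapping for such a field would look something like this (using match_score from the mapping above as an example, and assuming doc values are supported for the field type):

"match_score": {
  "type": "integer",
  "doc_values": true
}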
Thanks
David
On Jul 14, 2014, at 1:14 PM, Ivan Brusic ivan@brusic.com wrote:
Since the script is executed against lots of matched documents, perhaps converting it into a native script written in Java (not JavaScript) would provide a performance boost.
Note that using fields in scripts will force their values to be loaded into the cache.