Nested range query is slow, slower with _cache, how to debug?


(Damien Alexandre) #1

Hi everyone,

ES 0.90.3, 5 shards.

I have an index with a nested field and about 6 billion documents,
and I run a query like this:

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "nested": {
          "path": "rights.display",
          "query": {
            "bool": {
              "must": [
                {
                  "field": {
                    "rights.display.zones": {
                      "query": "FR"
                    }
                  }
                },
                {
                  "range": {
                    "rights.display.end_date": {
                      "gte": "2013-09-16"
                    }
                  }
                },
                {
                  "range": {
                    "rights.display.start_date": {
                      "lte": "2013-09-16"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
}

If I remove the two range clauses, the query runs really fast (5-6 ms),
but with them it takes about 1 second on average.

So I tried what the documentation says about the Nested Filter
(http://www.elasticsearch.org/guide/reference/query-dsl/nested-filter/):

"_cache" : true, "_name": "testing_FR"

With this _cache setting, the queries are slower: 2 to 3 seconds!
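
For readers unsure where those two keys go: a minimal Python sketch, assuming (as in the nested filter documentation linked above) that `_cache` and `_name` sit next to `path` and `query` inside the `nested` object. The helper name `nested_filter` is hypothetical and only builds the request body; it does not talk to a cluster.

```python
import json

def nested_filter(path, inner_query, cache=False, name=None):
    """Build a nested filter body (hypothetical helper, not an ES client call).

    Assumption: _cache and _name are siblings of "path" and "query".
    """
    body = {"path": path, "query": inner_query}
    if cache:
        body["_cache"] = True
    if name is not None:
        body["_name"] = name
    return {"nested": body}

# Only the zone clause is shown here to keep the example short.
inner = {"field": {"rights.display.zones": {"query": "FR"}}}
cached = nested_filter("rights.display", inner, cache=True, name="testing_FR")
print(json.dumps(cached, indent=2))
```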

I have no idea how to debug this.
Here is a quick gist: https://gist.github.com/damienalexandre/6581850
But without a massive amount of data, the difference between cached and
uncached is not as clear as what I get on my index.

I can see two issues here:

  • my range queries are slow; I guess this is the cost of doing a date range
    across billions of docs;
  • my nested filter is not cached, and trying to enable the cache makes the
    query slower.

I'm looking for advice and tips on how to debug this.
Maybe it's a bug, but before creating an issue on GitHub I think another
pair of eyes can't hurt.

PS: I have also tried setting the filter on an alias; same performance issue.

Thx a lot,
Damien

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Brian Yoder) #2

Hi Damien,

This is perhaps naive of me, but from what I've seen work well across 100M
documents (about 1/60 the number of documents you mentioned), the best
range performance comes when the range query is wrapped inside a bool query.
For example (with the actual gn and sn query values changed to protect the
innocent):

{
  "from" : 0,
  "size" : 20,
  "timeout" : 60000,
  "query" : {
    "bool" : {
      "must" : [ {
        "match" : {
          "gn" : {
            "query" : "aurelio",
            "type" : "boolean"
          }
        }
      }, {
        "match" : {
          "sn" : {
            "query" : "phzee",
            "type" : "boolean"
          }
        }
      }, {
        "range" : {
          "hn" : {
            "from" : 1000,
            "to" : 2000,
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "version" : true,
  "explain" : false,
  "fields" : [ "_ttl", "_source" ]
}

This query took 3.4 seconds to return 5 documents, out of the 34 that match
when the numeric range is omitted. But it did get much faster on subsequent
queries, down to 100ms or less.

I hope this helps!

P.S. My client actually builds the queries in Java, and then can emit them
as JSON for debugging and explanatory reasons.

Brian



(Martijn Van Groningen) #3

Hi Damien,

I would change the range queries into range filters; each range filter is
then cached on its own by default:
http://www.elasticsearch.org/guide/reference/query-dsl/range-filter/

The range query doesn't cache at all on its own. If you wrap a filtered
query as the inner query of the nested filter, and put the range filters in
the filter part and the field query in the query part, then I expect a faster
execution time:

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "nested": {
          "path": "rights.display",
          "query": {
            "filtered": {
              "query": {
                "field": {
                  "rights.display.zones": {
                    "query": "FR"
                  }
                }
              },
              "filter": {
                "bool": {
                  "must": [
                    {
                      "range": {
                        "rights.display.end_date": {
                          "gte": "2013-09-16"
                        }
                      }
                    },
                    {
                      "range": {
                        "rights.display.start_date": {
                          "lte": "2013-09-16"
                        }
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  }
}
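
If the request is built in client code, the same shape can be generated for any zone and day. A minimal Python sketch; the function name `rights_query` and its parameters are hypothetical, and it only produces the body shown above without sending it anywhere.

```python
import datetime
import json

def rights_query(zone, day):
    """Build the restructured request: field query + range filters inside nested."""
    day_str = day.isoformat()  # e.g. "2013-09-16"
    # Range filters go in the filter part so each one can be cached on its own.
    range_filters = [
        {"range": {"rights.display.end_date": {"gte": day_str}}},
        {"range": {"rights.display.start_date": {"lte": day_str}}},
    ]
    inner = {
        "filtered": {
            "query": {"field": {"rights.display.zones": {"query": zone}}},
            "filter": {"bool": {"must": range_filters}},
        }
    }
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"nested": {"path": "rights.display", "query": inner}},
            }
        }
    }

body = rights_query("FR", datetime.date(2013, 9, 16))
print(json.dumps(body, indent=2))
```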

The first time the range filters are executed, the execution time is
similar to that of the range query, but any subsequent search request should
be much faster.
Also, I see that you're filtering with day precision. Are you also indexing
the dates with the same precision? If not, then I expect the range filter
(and query) to execute better if you do this.
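
A minimal Python sketch of what day precision could look like on the client side, assuming the application truncates timestamps itself; `to_day` and `end_date_filter` are hypothetical helper names. The same truncation could be applied to values before they are indexed. Two requests made at different times on the same day then carry an identical filter body.

```python
import json
from datetime import datetime

def to_day(ts):
    """Truncate a datetime to a day-precision string such as '2013-09-16'."""
    return ts.date().isoformat()

def end_date_filter(ts):
    """Range filter on end_date with a day-precision lower bound."""
    return {"range": {"rights.display.end_date": {"gte": to_day(ts)}}}

morning = datetime(2013, 9, 16, 8, 30, 12)
evening = datetime(2013, 9, 16, 21, 5, 47)
next_day = datetime(2013, 9, 17, 0, 0, 1)

# Same day: identical filter bodies. Next day: a different body.
same = json.dumps(end_date_filter(morning)) == json.dumps(end_date_filter(evening))
different = json.dumps(end_date_filter(morning)) != json.dumps(end_date_filter(next_day))
print(same, different)  # True True
```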

Also, caching the nested filter doesn't really help: if one element in the
nested filter changes, then the cached entry can't be reused, and the nested
filter needs to be completely re-executed.
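
To illustrate the reuse argument, here is a toy Python model. It is not how Elasticsearch implements its filter cache; it simply assumes a cache keyed on the full content of whatever filter is marked as cached. The class `ToyFilterCache` and the zone list are made up for the example.

```python
import json

class ToyFilterCache:
    """Toy cache keyed on the serialized filter body (an assumption, not ES internals)."""
    def __init__(self):
        self.entries = set()
        self.misses = 0

    def lookup(self, filter_body):
        key = json.dumps(filter_body, sort_keys=True)
        if key not in self.entries:
            self.entries.add(key)
            self.misses += 1

def range_filters(day):
    return [
        {"range": {"rights.display.end_date": {"gte": day}}},
        {"range": {"rights.display.start_date": {"lte": day}}},
    ]

def whole_nested_filter(zone, day):
    zone_clause = {"field": {"rights.display.zones": {"query": zone}}}
    must = [zone_clause] + range_filters(day)
    return {"nested": {"path": "rights.display", "query": {"bool": {"must": must}}}}

zones = ["FR", "DE", "US"]  # hypothetical zone values
day = "2013-09-16"

# Caching the whole nested filter: every zone change produces a new entry.
whole = ToyFilterCache()
for zone in zones:
    whole.lookup(whole_nested_filter(zone, day))

# Caching each range filter on its own: entries are shared across zones.
parts = ToyFilterCache()
for zone in zones:
    for f in range_filters(day):
        parts.lookup(f)

print(whole.misses, parts.misses)  # 3 2
```

In this model the whole-filter cache misses once per zone, while the two range filters are computed once and reused for every zone queried that day.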

Let me know if these tips helped out.

--
Kind regards,

Martijn van Groningen



(Damien Alexandre) #4

Hi,

Using a range filter instead of a range query works very well! I dropped from
300ms to 10ms on a lot of my queries!
Still, I find it strange that caching the nested filter does not work as
well as caching the range filters; it looks odd to me, but anyway :]

About the dates: yes, they are indexed with day precision, like in my
queries, so it's quite fast now.
I apply sorts, filters, queries... on billions of documents with nested
fields and now I get my results in 10ms: that's awesome :heart:

Thanks Martijn & Brian!

http://gph.is/XL6HqD

Damien.



(Ivan Brusic) #5

Don't range filters work better with and/or/not filters rather than inside
bool filters, due to bitset caching? I've never profiled it myself.

--
Ivan



(Martijn Van Groningen) #6

The range filter should work best inside a bool filter. The
numeric_range filter should work best inside an and/or filter, but only when
it isn't cached (by default this filter is never cached).
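
The two combinations can be written side by side. A minimal Python sketch that only builds the filter bodies; the helper names are hypothetical, and it assumes that `numeric_range` accepts the same `gte`/`lte` bounds as `range` and can be applied to the date fields from this thread, which is worth checking against the documentation for your version.

```python
import json

# Bounds taken from the query discussed in this thread.
bounds = [
    ("rights.display.end_date", {"gte": "2013-09-16"}),
    ("rights.display.start_date", {"lte": "2013-09-16"}),
]

def range_in_bool(bounds):
    """range filters (cached by default) combined with a bool filter."""
    return {"bool": {"must": [{"range": {field: cond}} for field, cond in bounds]}}

def numeric_range_in_and(bounds):
    """numeric_range filters (not cached by default) combined with an and filter.

    Assumption: numeric_range takes the same gte/lte options as range.
    """
    return {"and": [{"numeric_range": {field: cond}} for field, cond in bounds]}

print(json.dumps(range_in_bool(bounds), indent=2))
print(json.dumps(numeric_range_in_and(bounds), indent=2))
```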


--
Kind regards,

Martijn van Groningen


