Does date histogram load entire field into memory, even with range filter?


(Andrew Clegg) #1

Hi,

I'm trying to run a date facet over a subset of a long time series (very
many values), and it keeps OOMing. But when I remove the facet clause from
the query, I get an overall result instantly.

This suggests to me that even with the filter in place, ES is trying to
load all the distinct values of the field. Is that correct? If so, is there
any way round it?

The query looks like this:

{
  "query" : {
    "filtered" : {
      "query" : {
        "range" : {
          "pubDate" : {
            "from" : "2010-10-01",
            "to" : "2011-01-01"
          }
        }
      },
      "filter" : {
        "and" : {
          "filters" : [
            { "exists" : { "field" : "foo" } },
            { "term" : { "bar" : "somestring" } },
            { "prefix" : { "baz" : "a" } }
          ]
        }
      }
    }
  },
  "facets" : {
    "published" : {
      "date_histogram" : {
        "field" : "pubDate",
        "interval" : "month"
      }
    }
  }
}

Like that, I get:

[2012-06-15 11:54:48,335][WARN ][index.cache.field.data.soft] [Centurion]
[search_criteria] loading field [event.pubDate] caused out of memory failure

But when I take out the facet, no problems.

This is with search_type=count by the way, as I don't care about the actual
hits.

Thanks,

Andrew.


(Andrew Clegg) #2

Sorry, I realise I got the terminology wrong here: what I meant was a range
query, not a range filter.



(David Pilato) #3

Hi Andrew,

You have to filter the facet with the same filters you are already using in your
query.
Putting your range filter in as a facet filter should help.

Facet Filter

All facets can be configured with an additional filter (explained in the Query
DSL http://www.elasticsearch.org/guide/reference/query-dsl section), which
will reduce the documents they use for computing results. An example with a term
filter:

{
  "facets" : {
    "<FACET NAME>" : {
      "<FACET TYPE>" : {
        ...
      },
      "facet_filter" : {
        "term" : { "user" : "kimchy" }
      }
    }
  }
}

Note that this is different from a facet of the filter type
(http://www.elasticsearch.org/guide/reference/api/search/facets/filter-facet.html).

Also see whether scope could help:
http://www.elasticsearch.org/guide/reference/api/search/facets/index.html

Scope

As we have already mentioned, facet computation is restricted to the scope of
the current query, called main, by default. Facets can be computed within the
global scope as well, in which case they will return values computed across all
documents in the index:

{
  "facets" : {
    "<FACET NAME>" : {
      "<FACET TYPE>" : { ... },
      "global" : true
    }
  }
}

There’s one important distinction to keep in mind. While search queries restrict
both the returned documents and facet counts, search filters restrict only
returned documents, but not facet counts.

If you need to restrict both the documents and facets, and you’re not willing or
able to use a query, you may use a facet filter.
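
A minimal Python sketch of the suggestion above, assuming the request body is built client-side as a dict. The helper name build_faceted_request and its parameters are made up for illustration; the field names and values are taken from the query in the original post. It reuses one filter for both the filtered query and the facet_filter, so the two cannot drift apart:

```python
import json


def build_faceted_request(filters, facet_name, field, interval):
    """Build a search body whose facet is restricted by the same filters as the query.

    `filters` is a list of filter clauses written in the query DSL used in this thread.
    This helper is hypothetical; it only assembles a dict and sends nothing.
    """
    combined = {"and": {"filters": list(filters)}}
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": combined,
            }
        },
        "facets": {
            facet_name: {
                "date_histogram": {"field": field, "interval": interval},
                # Same filter again, so the facet only counts matching documents.
                "facet_filter": combined,
            }
        },
    }


filters = [
    {"range": {"pubDate": {"from": "2010-10-01", "to": "2011-01-01"}}},
    {"exists": {"field": "foo"}},
    {"term": {"bar": "somestring"}},
    {"prefix": {"baz": "a"}},
]
body = build_faceted_request(filters, "published", "pubDate", "month")
print(json.dumps(body, indent=2))
```

Note that this only narrows which documents the facet counts. Whether it also reduces what is loaded into memory to compute the facet is a separate question.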

HTH
David.


--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet


(Andrew Clegg) #4

Hi David, I'm not sure that's helped, sadly.

I've tried making the query as restrictive as possible, by putting the same
filters in both the main query and the facet:

{
  "query" : {
    "filtered" : {
      "query" : {
        "match_all" : { }
      },
      "filter" : {
        "and" : {
          "filters" : [
            { "range" : {
                "pubDate" : {
                  "from" : "2010-12-31",
                  "to" : "2011-01-01"
                }
            } },
            { "exists" : { "field" : "foo" } },
            { "term" : { "bar" : "XXX" } },
            { "prefix" : { "baz" : "a" } }
          ]
        }
      }
    }
  },
  "facets" : {
    "published" : {
      "date_histogram" : {
        "field" : "pubDate",
        "interval" : "month"
      },
      "facet_filter" : {
        "and" : {
          "filters" : [
            { "range" : {
                "pubDate" : {
                  "from" : "2010-12-31",
                  "to" : "2011-01-01"
                }
            } },
            { "exists" : { "field" : "foo" } },
            { "term" : { "bar" : "XXX" } },
            { "prefix" : { "baz" : "a" } }
          ]
        }
      }
    }
  }
}

But even this, covering only a single day, runs out of memory in a JVM with a 1G heap.

And there are only 63388 documents in that day so there's no reason it
should. (I know this because a count query without a facet on that date
range is instant...)
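
One possible explanation, offered as an assumption rather than something confirmed in this thread: the warning comes from the field data cache, and if the facet loads a pubDate value for every document in the index (not just the 63388 that match), memory use scales with the index size rather than with the filter. A back-of-envelope sketch in Python, assuming 8 bytes per date value and a made-up index size:

```python
def estimate_field_data_bytes(num_docs, bytes_per_value=8):
    """Rough lower bound for holding one numeric value per document in memory.

    The 8-byte default assumes dates are held as 64-bit integers (an assumption).
    """
    return num_docs * bytes_per_value


HEAP_BYTES = 1024 ** 3                  # the 1G heap mentioned above
MATCHING_DOCS = 63388                   # documents in the filtered day, from this post
HYPOTHETICAL_INDEX_DOCS = 150_000_000   # made-up index size, NOT a figure from this thread

matching = estimate_field_data_bytes(MATCHING_DOCS)
whole_index = estimate_field_data_bytes(HYPOTHETICAL_INDEX_DOCS)

print(f"matching docs only: {matching / 1024:.0f} KiB")
print(f"whole field:        {whole_index / 1024 ** 2:.0f} MiB")
print(f"whole field exceeds heap: {whole_index > HEAP_BYTES}")
```

If that assumption holds, the matching documents alone would need well under a megabyte, so the out-of-memory failure would have to come from loading far more than the filtered subset.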


