Aggregation / Sort and CircuitBreakingException

lpoole · March 15, 2015, 10:11pm

Hey guys,

I have a question about the mechanics of aggregation and sorting w.r.t. the
fielddata cache. I know this has been covered in some detail previously,
and I'm caught up on the advice to use doc_values where possible, but we
have a use case where we do light analysis on a particular set of fields in
our document, but also allow sorting on those fields.

While we'll probably modify our schema to solve the issue, I was first
wondering whether it is possible to filter the set of documents that ES
aggregates / sorts over before pulling them into the fielddata cache? We
have extremely high cardinality fields, but very selective queries, and it
seems very inefficient to pull multiple gigabytes into the fielddata cache
to select relatively few matching documents.

Thanks,

Lindsey

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e32cf7c3-e2b3-48e9-bc7c-d7f2e0016835%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · March 15, 2015, 11:41pm

Have you considered doc values?

http://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html

Jörg

jprante · March 15, 2015, 11:43pm

To be clearer: I do not understand what you mean by "I'm caught up on the
advice to use doc_values where possible, but we have a use case where we do
light analysis on a particular set of fields in our document". What exactly
prevents you from using doc values?

Jörg

lpoole · March 16, 2015, 2:01am

Well, we have a field that is supporting a backward compatibility use case.
Clients are executing a partial match query on this field, so we used the
keyword tokenizer instead of not_analyzed. Since this is supporting legacy
functionality, the clients cannot be updated to change the expectation that
a partial match will return results.

I can modify the schema and re-index so that we aggregate and sort over a
not_analyzed subfield instead, while executing any queries on the parent
field, but I wanted to verify that there is no other way to filter out
terms prior to loading them into the fielddata cache.

The kind of filtering I'm looking for would be something like, "only
consider terms in field1 from documents where field2=valueA".

-Lindsey
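
The schema change described above (query the analyzed parent field, sort and aggregate on a not_analyzed subfield) could look roughly like the sketch below. It assumes ES 1.x mapping syntax. The index name `test`, the type `docs`, the field names `field1` and `field2`, the subfield name `raw`, the analyzer name `legacy_keyword`, and the `lowercase` filter standing in for the "light analysis" are all hypothetical and not taken from the thread.

```shell
# Assumed endpoint; set ES_URL to point at a real cluster.
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical names throughout (index, type, fields, analyzer).
# Parent field: analyzed with the keyword tokenizer, used for queries.
# Subfield "raw": not_analyzed with doc values, used for sort/aggregation.
MAPPING='{
  "settings": {
    "analysis": {
      "analyzer": {
        "legacy_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "docs": {
      "properties": {
        "field1": {
          "type": "string",
          "analyzer": "legacy_keyword",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        },
        "field2": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}'

# Creates the index with the mapping above. Nothing is sent until this
# function is called.
create_index() {
  curl -XPUT "$ES_URL/test" -d "$MAPPING"
}
```

Calling `create_index` sends the request. Existing documents would still need to be re-indexed before `field1.raw` has any values.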

lpoole · March 16, 2015, 2:17am

Also, if I understand correctly, there are negative implications when
sorting over a column that has been analyzed - in our case, to remove
stop-words.

Since the total cardinality of our sort field exceeds the available heap,
we can't sort even a single user's documents when using stop-word analysis,
because doc_values do not support analyzed fields.

It seems like we'll have to preprocess the field to remove stop-words?
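
One way to do that preprocessing is to compute the sort key on the client before indexing and store it in a separate not_analyzed field. The sketch below is only an illustration: the stop-word list and the function name `sort_key` are hypothetical, and a real list would have to match whatever the analyzer currently removes.

```shell
# Hypothetical stop-word list; replace with the list the analyzer uses.
STOPWORDS="a an and of the"

# Lowercase the input and drop stop words, printing a value that could be
# indexed into a not_analyzed, doc_values-enabled sort field.
sort_key() {
  printf '%s\n' "$1" | awk -v stop="$STOPWORDS" '
    BEGIN {
      n = split(stop, words, " ")
      for (i = 1; i <= n; i++) skip[words[i]] = 1
    }
    {
      out = ""
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        if (!(w in skip)) out = out (out == "" ? "" : " ") w
      }
      print out
    }'
}

sort_key "The Lord of the Rings"   # prints: lord rings
```

The original value stays in the analyzed field for queries; only the derived key is used for sorting.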

jprante · March 16, 2015, 8:59am

You should sort over doc values (recommended; it will be the default in the
next ES version). Sorting over not_analyzed / keyword-analyzed fields held
in fielddata is old school.

Doc values for analyzed strings do not make much sense in my opinion and
lead to unwanted results. If you use a multi-field, you do not have to worry,
because you can set up both a doc values field and an analyzed field.

Example:

https://gist.github.com/jprante/da2980446108b5c112a8

docvalues-multifield.sh


curl -XDELETE 'localhost:9200/test'

curl -XPUT 'localhost:9200/test' -d '
{
    "mappings" : {
        "docs" : {
            "properties" : {
                "content" : {
                    "type" : "string",

(This file has been truncated; the full script is at the gist link above.)

> The kind of filtering I'm looking for would be something like, "only
> consider terms in field1 from documents where field2=valueA".

With an inverted index, this always requires loading all values of the
field into the fielddata cache. There is no free lunch. That is why doc
values (columnar storage) were invented: to avoid this fielddata loading,
for example for high-cardinality fields.
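
For illustration, the "field1 where field2=valueA" case could be written as an ordinary filtered search that sorts and aggregates on a doc values subfield. This is a sketch under assumptions, not a tested recipe: it uses the ES 1.x `filtered` query, and the index `test`, the fields `field1.raw` and `field2`, and the aggregation name `field1_terms` are hypothetical. The filter narrows the documents, and the sort and aggregation should then read the subfield from on-disk doc values instead of building in-heap fielddata for the analyzed field.

```shell
# Assumed endpoint; set ES_URL to point at a real cluster.
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical field names. "field1.raw" is assumed to be a not_analyzed
# subfield with doc values enabled; "field2" is the selective filter field.
QUERY='{
  "query": {
    "filtered": {
      "filter": { "term": { "field2": "valueA" } }
    }
  },
  "sort": [ { "field1.raw": { "order": "asc" } } ],
  "aggs": {
    "field1_terms": { "terms": { "field": "field1.raw", "size": 10 } }
  }
}'

# Runs the search. Nothing is sent until this function is called.
search_filtered() {
  curl -XGET "$ES_URL/test/_search" -d "$QUERY"
}
```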

Jörg
