Term filter causes memory to spike drastically

The term filter that is used:

curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "filter": {
    "term": {"void": false}
  },
  "fields": ["user_id1", "user_name", "date", "status", "q1",
             "q1_unique_code", "q2", "q3"],
  "size": 50000,
  "sort": ["date_value"]
}'

  • The 'void' field is a boolean field.
  • The index store size is 504 MB.
  • The Elasticsearch setup consists of only a single node, and the index
    consists of a single shard with 0 replicas. The Elasticsearch version
    is 0.90.7.
  • The fields listed above are only the first 8 fields. The actual query
    we execute requests 350 fields.

We noticed memory spiking by about 2-3 GB even though the store size is only
504 MB.

Running the query multiple times seems to keep increasing memory usage.

Could someone explain why this memory spike occurs?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7c4ea660-9411-4d1d-a86c-84f1c43f4f7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

This is something that I just "discovered" as well.

Using a top-level filter is really a "post_filter" (it was renamed to that
in later versions of ES).

So this executes the query first (a default "match_all": {}) and then
applies the filter to that result set. That is not very efficient for your
query, since you probably expected the filter to act as a "pre-filter" and
exclude documents before the query executes.

To do that, you need to use a "filtered" query.

In your case, the resulting query would look like:

curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": {"void": false}
      }
    }
  },
  "fields": ["user_id1", "user_name", "date", "status", "q1",
             "q1_unique_code", "q2", "q3"],
  "size": 50000,
  "sort": ["date_value"]
}'

On Fri, Nov 21, 2014 at 7:07 AM, Ajay Divakaran
<ajay.divakaran86@gmail.com> wrote:


--
Nick Canzoneri
Developer, Wildbit http://wildbit.com/
Beanstalk http://beanstalkapp.com/, Postmark http://postmarkapp.com/,
dploy.io


So does that mean all the documents in the shard (since the index has only
1 shard) are pulled into memory and then the filter is applied?

By moving the filter into a filtered query, I see the response coming back
in around 15 s, compared with more than a minute previously. However the
memory still increases by around 2-3 GB.
Re-running the filtered query multiple times does not increase the memory
further.

Though the index store size is only 504 MB, could you explain why memory
spikes to 2-3 GB even with a filtered query?

With the filtered-query approach, does the filtering happen at the disk
level?
Could you also explain why I don't see memory increasing further across
multiple runs of the filtered query?

On Friday, November 21, 2014 6:15:27 PM UTC+5:30, Nick Canzoneri wrote:



Hi Ajay,

As Nick pointed out, this query is going to match documents on a doc-by-doc
basis, which is going to be very slow.

However, each iteration is not supposed to increase memory usage. Memory
usage might jump on the first request because elasticsearch will need to
load the date_value field into field data and potentially your term
filter into the filter cache, but that should be it, and subsequent
executions of this request should not add 2GB of garbage.
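One way to check whether field data and the filter cache account for the jump is the index stats API (a sketch; these stats flags exist in the 0.90-era API, though exact flag names and response layout vary across versions):

```shell
# Show field data and filter cache memory for the index. If these figures
# roughly match the observed jump, the spike is cache warm-up on the first
# request rather than a leak.
curl -XGET 'http://localhost:9200/my-index/_stats?fielddata=true&filter_cache=true&pretty'
```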

Something that is uncommonly high in your query is the size parameter. Do
you have an idea of how large your documents are? It could be that much of
the garbage is generated by building and then serializing the search
response.
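A rough way to estimate that is to compare document count with store size (a sketch; the same `_stats` endpoint reports both, and dividing store size by docs count gives a compressed lower bound per document):

```shell
# Docs count and store size for the index. store size / docs count gives a
# compressed per-document lower bound; multiplied by size=50000 that is a
# rough floor on the response the node must build, uncompressed, in memory.
curl -XGET 'http://localhost:9200/my-index/_stats?pretty'
```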

On Fri, Nov 21, 2014 at 1:07 PM, Ajay Divakaran
<ajay.divakaran86@gmail.com> wrote:


--
Adrien Grand


This query is used for exporting data from our application, hence the high
value of the size parameter.

Is the doc-by-doc comparison done by the term filter or by the filtered
query?


Each document has around 350 text fields.

But I'm still not able to relate the store size to the memory spike.


One difference is that the store is compressed while the structure of the
response that is built into memory is not (and potentially very wasteful).

The doc-by-doc comparison is done by the post-filter: for every document
that matches the query, the filter is evaluated in order to know whether it
matches or not. On the other hand, when a filter is in the query (either
under a constant_score or a filtered_query), it can efficiently jump to the
next matches using the inverted index.
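As a sketch, the constant_score form mentioned above would look like this against the same index (field list shortened from the original query; the DSL shown is the 0.90-era syntax):

```shell
# constant_score wraps the filter in the query, so matching documents are
# found by skipping through the inverted index rather than checked one by
# one after the fact; all matches get the same constant score.
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "query": {
    "constant_score": {
      "filter": {
        "term": {"void": false}
      }
    }
  },
  "fields": ["user_id1", "user_name", "date", "status"],
  "size": 50000,
  "sort": ["date_value"]
}'
```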

If you want to export your index, a more efficient way would be to use the
scan search type:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
It basically opens a cursor that you can iterate on, as opposed to trying
to get everything at once.
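A sketch of that flow with curl (the scroll id below is a placeholder; the real value comes back in the `_scroll_id` field of each response, and parameter details vary between versions):

```shell
# 1) Open the scan: the first response returns no hits, only a _scroll_id.
#    size here is per shard, per batch, not the total to export.
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search?search_type=scan&scroll=1m&size=500' -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": {"void": false}
      }
    }
  }
}'

# 2) Repeatedly fetch the next batch, sending the _scroll_id from the
#    previous response as the request body, until no hits come back.
curl -XGET 'http://localhost:9200/_search/scroll?scroll=1m' -d '<scroll_id from previous response>'
```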

On Fri, Nov 21, 2014 at 4:43 PM, Ajay Divakaran
<ajay.divakaran86@gmail.com> wrote:


--
Adrien Grand


Thanks Adrien for your reply.

I upgraded to ES 0.90.13. What I'm now noticing is that memory seems to
continuously increase when running the query again and again.
I also noticed that after upgrading it started using OpenJDK 1.6.0_33, so I
switched back to Oracle JDK 1.7.0_71, but the issue seems to persist.

On Friday, November 21, 2014 10:57:09 PM UTC+5:30, Adrien Grand wrote:


Seeing memory continuously increasing over a couple of requests is not
necessarily a bad sign. If you give X gigabytes of memory to a JVM, it
won't hesitate to use them if that helps decrease the frequency at which it
has to run costly garbage collections. What is more important to watch is
how memory usage behaves over a long period, e.g. whether the frequency at
which GCs run keeps increasing (which could indicate that the server is
encountering memory pressure, or that something is leaking memory
somewhere).
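One way to watch that over time is to poll the node stats (a sketch; `_nodes/stats` exists in this era, though flag names and the response layout changed in later versions):

```shell
# Poll JVM heap usage and GC counters. Steadily rising collection counts
# and collection times under a steady workload suggest memory pressure or
# a leak; a sawtooth heap that keeps returning to a stable baseline after
# GC is normal JVM behavior.
curl -XGET 'http://localhost:9200/_nodes/stats?jvm=true&pretty'
```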

On Fri, Nov 21, 2014 at 6:50 PM, Ajay Divakaran
<ajay.divakaran86@gmail.com> wrote:


--
Adrien Grand
