Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)


(Narendra Yadala) #1

I have a 5-node cluster holding 240 GB of data, including replicas.
I allocated 5 GB of RAM to each node (5*5 GB in total) and started the cluster.
When I fire queries at the cluster continuously, GC starts
kicking in and eventually a node goes down with an OutOfMemory exception.
I add up to 200k documents every day. Indexing works fine, but
querying is causing trouble. The cluster runs on EC2 and uses EC2
discovery.

What is the ideal RAM size, and are there any other parameters I need to tune to
get this cluster going?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5b659d11-d757-4f8e-b347-60b3807c2dfe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

How expensive are your queries? Are you using aggregations or sorting on
string fields that could use up your field data cache? Are you using the
defaults for the cache? Post the current usage.

If you post an example query and mapping, perhaps the community can help
optimize it.

Cheers,

Ivan



(Narendra Yadala) #3

Hi Ivan,

Thanks for the input about aggregating on strings. I do that, and those
queries take time, but they do not crash the node.

The queries that caused the problem were pretty straightforward (such
as a boolean query with two musts: one an exact match and the other a
range match on a long field), but the real problem was the size. When I set
size to Integer.MAX_VALUE, it caused all the problems. When I removed it,
everything started working fine. I think this behavior is worth documenting
somewhere (probably expected, but surprising).

I did double the RAM, though, and have now allocated 5*10 GB to the
cluster. Things look OK for now, except that the aggregations (on
strings) are quite slow. Maybe I will run these aggregations as a batch job,
cache the outputs in a different type, and move on for now.

Thanks
NY



(Jörg Prante) #4

Before firing queries, you should consider whether the index design and query
choice are optimal.

Numeric range queries are not straightforward. They were a major issue for
inverted index engines like Lucene/Elasticsearch, and it took some time
to introduce efficient implementations. See e.g.
https://issues.apache.org/jira/browse/LUCENE-1673

ES tries to compensate for the downsides of massive numeric range queries by
loading all the field values into memory. To get efficient queries, you
have to carefully discretize the values you index.

For example, a few hundred million distinct timestamps with
millisecond resolution are a real burden for searching on inverted
indices. A good discretization strategy for indexing is to reduce the total
number of distinct values in such a field to a few hundred or a few thousand. For
timestamps, this means that indexing time-series data in discrete
intervals of days, hours, minutes, or maybe seconds is much more efficient
than indexing at e.g. millisecond resolution.
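
As a rough illustration of the discretization idea, the sketch below rounds an epoch-millisecond timestamp down to the start of its hour (or any other bucket) before indexing. The field names `created_at` and `created_hour` and the helper names are assumptions made up for this example, not anything from the thread.

```python
from datetime import datetime, timezone


def truncate_epoch_millis(epoch_millis: int, bucket_seconds: int = 3600) -> int:
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    bucket_millis = bucket_seconds * 1000
    return epoch_millis - (epoch_millis % bucket_millis)


def to_iso_utc(epoch_millis: int) -> str:
    """Render a whole-second epoch-millisecond value as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_millis // 1000, tz=timezone.utc).isoformat()


# Hypothetical document: keep the exact value for display and
# add a coarse value that range queries can target instead.
doc = {"created_at": 1408782023123}
doc["created_hour"] = truncate_epoch_millis(doc["created_at"])
```

With hourly buckets, a year of data yields at most 8,784 distinct values in the coarse field, regardless of how many documents are indexed.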

Another topic is to use filters for boolean queries. They are much faster.
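
For reference, a minimal sketch of what "using a filter" looks like as a request body, assuming the Elasticsearch 1.x `filtered` query (later releases replaced it with a `filter` clause inside `bool`). The field names `body` and `timestamp` and the helper `filtered_query` are assumptions for illustration only.

```python
import json


def filtered_query(scoring_query: dict, non_scoring_filter: dict) -> dict:
    """Wrap a scoring query and a filter in an ES 1.x style `filtered` query."""
    return {
        "query": {
            "filtered": {
                "query": scoring_query,        # contributes to the relevance score
                "filter": non_scoring_filter,  # only includes or excludes documents
            }
        }
    }


# Hypothetical request: full-text match on `body`, restricted to one day.
request_body = filtered_query(
    {"match": {"body": "big"}},
    {"range": {"timestamp": {"gte": 1408752000000, "lt": 1408838400000}}},
)
print(json.dumps(request_body, indent=2))
```

The point of the split is that the filter part does not compute scores, so exact-match and range conditions that do not affect ranking belong there.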

Jörg



(Narendra Yadala) #5

Hi Jörg,

This query
{
  "query": {
    "bool": {
      "must": [
        { "match": { "body": "big" } },
        { "match": { "id": 521 } }
      ],
      "must_not": {
        "match": { "body": "data" }
      }
    }
  }
}

and this query perform exactly the same:
{
  "query": {
    "bool": {
      "must": {
        "match": { "body": "big" }
      },
      "must_not": {
        "match": { "body": "data" }
      }
    }
  },
  "filter": {
    "term": { "id": "521" }
  }
}

I am not able to understand what makes a filtered query fast. Is there any
place where I can find documentation on the internals of how different
queries are processed by Elasticsearch?

NY




(Ivan Brusic) #7

"When I kept size as Integer.MAX_VALUE, it caused all the problems"

Are you trying to return up to 2 billion documents at once? Even if that
number were only 1 million, you would face problems. Or did I perhaps
misunderstand you?

Are you sorting the documents by score (the default)?
Lucene/Elasticsearch would need to keep all the values in memory in order
to sort them, causing memory problems. In general, Lucene is not efficient
at deep pagination. Use scan/scroll:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
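
A minimal sketch of a scroll loop follows, assuming a client object that exposes `search(index=..., body=..., scroll=..., size=...)` and `scroll(scroll_id=..., scroll=...)` in the style of the official Python client; check the signatures against the client version you actually use. It uses plain scroll rather than `search_type=scan`. The `InMemoryClient` class is a hypothetical stand-in that only mimics the response shape so the loop can be tried offline.

```python
from typing import Any, Dict, Iterator, List


def scroll_hits(client: Any, index: str, body: Dict, page_size: int = 500,
                keep_alive: str = "1m") -> Iterator[Dict]:
    """Yield every hit for a query, one bounded page at a time."""
    resp = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break  # an empty page means the scroll is exhausted
        for hit in hits:
            yield hit
        resp = client.scroll(scroll_id=resp["_scroll_id"], scroll=keep_alive)


class InMemoryClient:
    """Hypothetical stand-in that serves a fixed list of hits page by page."""

    def __init__(self, docs: List[Dict]) -> None:
        self._docs = list(docs)
        self._pos = 0
        self._size = 0

    def _page(self) -> Dict:
        chunk = self._docs[self._pos:self._pos + self._size]
        self._pos += len(chunk)
        return {"_scroll_id": "demo", "hits": {"hits": chunk}}

    def search(self, index: str, body: Dict, scroll: str, size: int) -> Dict:
        self._pos, self._size = 0, size
        return self._page()

    def scroll(self, scroll_id: str, scroll: str) -> Dict:
        return self._page()
```

Because each request carries at most `page_size` hits, memory use on both the cluster and the caller stays bounded no matter how many documents match.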

--
Ivan



(Narendra Yadala) #8

I am not returning 2 billion documents :slight_smile:

I am returning all documents that match. The actual number can be anywhere
between 0 and 50k. I just fetch the documents within a given time interval,
such as one hour or one day, and then batch process them.

I fixed this by making two queries: one to fetch the count and another to
fetch the actual data, with size set to that count. Another thread mentioned
that the scroll API is performance intensive, so I did not go for it.
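
For readers who want to see the count-then-search pattern spelled out, here is a minimal sketch. The `count` and `search` callables are hypothetical stand-ins for whatever client is in use (they are not a real Elasticsearch client API), and the `timestamp` field name is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def fetch_all_in_window(
    count: Callable[[Dict], int],
    search: Callable[[Dict, int], List[Dict]],
    start_ms: int,
    end_ms: int,
) -> List[Dict]:
    """Fetch every document in [start_ms, end_ms) with a count, then a sized search.

    `count(query)` and `search(query, size)` are hypothetical callables supplied
    by the caller; "timestamp" is an assumed field name.
    """
    query = {"range": {"timestamp": {"gte": start_ms, "lt": end_ms}}}
    total = count(query)  # first round trip: how many documents match?
    if total == 0:
        return []  # nothing to fetch, skip the second round trip
    # Second round trip: ask for exactly `total` hits instead of Integer.MAX_VALUE.
    return search(query, total)
```

One caveat with this approach: documents indexed between the two calls are not covered by the count, so a batch job may want to use a window that is already closed.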

On Saturday, 23 August 2014 21:32:59 UTC+5:30, Ivan Brusic wrote:

"When I kept size as Integer.MAX_VALUE, it caused all the problems"

Are you trying to return up to 2 billion documents at once? Even if that
number was only 1 million, you will face problems. Or did I perhaps
misunderstand you?

Are you sorting the documents based on the score (the default)?
Lucene/Elasticsearch would need to keep all the values in memory in order
to sort them, causing memory problems. In general, Lucene is not effective
at deep pagination. Use scan/scroll:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
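
The general shape of a scroll loop can be sketched as follows. This is only an illustration: `start` and `next_page` are hypothetical callables wrapping the initial search and the follow-up scroll requests, each returning a `(scroll_id, hits)` pair, and nothing here is taken from a real client library.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Tuple[Optional[str], List[Dict]]


def scroll_all(
    start: Callable[[], Page],
    next_page: Callable[[str], Page],
) -> Iterator[Dict]:
    """Yield every hit by following scroll pages until an empty page arrives.

    `start()` issues the initial request and `next_page(scroll_id)` fetches the
    following page; both are hypothetical and supplied by the caller.
    """
    scroll_id, hits = start()
    yield from hits  # may be empty if the first response carries only an id
    while scroll_id is not None:
        scroll_id, hits = next_page(scroll_id)
        if not hits:
            break  # an empty page marks the end of the result set
        yield from hits
```

Each page is bounded in size, so the caller never asks a node to hold the whole result set in memory at once.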

--
Ivan

On Sat, Aug 23, 2014 at 6:46 AM, Narendra Yadala narendr...@gmail.com
wrote:

Hi Jörg,

This query

{
  "query" : {
    "bool": {
      "must": {
        "match" : { "body" : "big" }
      },
      "must_not": {
        "match" : { "body" : "data" }
      },
      "must": {
        "match" : {"id": 521}
      }
    }
  }
}

and this query perform exactly the same:

{
  "query" : {
    "bool": {
      "must": {
        "match" : { "body" : "big" }
      },
      "must_not": {
        "match" : { "body" : "data" }
      }
    }
  },
  "filter" : {
    "term" : { "id" : "521" }
  }
}

I am not able to understand what makes a filtered query fast. Is there any
place where I can find documentation on the internals of how different
queries are processed by Elasticsearch?

On Saturday, 23 August 2014 18:20:23 UTC+5:30, Jörg Prante wrote:

Before firing queries, you should consider whether the index design and
query choice are optimal.

Numeric range queries are not straightforward. They were a major issue
on inverted index engines like Lucene/Elasticsearch and it has taken some
time to introduce efficient implementations. See e.g.
https://issues.apache.org/jira/browse/LUCENE-1673

ES tries to compensate for the downsides of massive numeric range queries by
loading all the field values into memory. To achieve effective queries, you
have to carefully discretize the values you index.

For example, a few hundred million distinct timestamps with
millisecond resolution are a real burden for searching on inverted
indices. A good discretization strategy for indexing is to reduce the total
number of distinct values in such a field to a few hundred or a few thousand.
For timestamps, this means that indexing time-based series data in discrete
intervals of days, hours, minutes, or maybe seconds is much more efficient
than e.g. millisecond resolution.
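
As a small illustration of the discretization idea, the sketch below rounds an epoch-millisecond timestamp down to the start of its bucket before indexing. The function name and the one-hour default are assumptions chosen for the example, not something prescribed in this thread.

```python
HOUR_MS = 3_600_000  # assumed bucket width: one hour in milliseconds


def bucket_timestamp(epoch_ms: int, bucket_ms: int = HOUR_MS) -> int:
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    return epoch_ms - (epoch_ms % bucket_ms)


# One sample per minute over a single day: 1440 distinct raw values...
raw = list(range(0, 86_400_000, 60_000))
# ...collapse to one value per hour once bucketed.
bucketed = {bucket_timestamp(t) for t in raw}
print(len(raw), len(bucketed))
```

The bucketed value would go into its own field used for range filtering; the full-resolution timestamp can still be stored separately if it is needed for display.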

Another topic is to use filters for boolean queries. They are much
faster.

Jörg

On Sat, Aug 23, 2014 at 2:19 PM, Narendra Yadala narendr...@gmail.com
wrote:

Hi Ivan,

Thanks for the input about aggregating on strings. I do that, and those
queries take time, but they do not crash the node.

The queries that caused the problem were pretty straightforward (such as a
boolean query with two musts, one an exact match and the other a range match
on a long), but the real problem was with the size. When I kept
size as Integer.MAX_VALUE, it caused all the problems. When I removed it,
things started working fine. I think this behavior is worth documenting
somewhere (it is probably expected, but it is surprising).

I did double the RAM, though, and have now allocated 5*10G RAM to
the cluster. Things are looking OK as of now, except that the aggregations
(on strings) are quite slow. Maybe I will run these aggregations as a batch,
cache the outputs in a different type, and move on for now.

Thanks
NY




(Jonathan Foy) #9

I ran into the same issue when using Integer.MAX_VALUE as the size
parameter (migrating from a DB-based search). Perhaps someone can come up
with a proper reference, I cannot, but according to a comment on this SO
question, Elasticsearch/Lucene tries to allocate memory for that many
scores:

http://stackoverflow.com/questions/8829468/elasticsearch-query-to-return-all-records

When I switched those queries to a count/search duo, things
improved dramatically, as you've already noticed.
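
To get a feel for the scale involved, here is a back-of-envelope calculation. The 16 bytes per queue entry is purely an assumed figure for illustration; the real per-entry cost, and whether the engine caps the allocation per shard, depend on the Lucene/Elasticsearch version and are not established in this thread.

```python
INTEGER_MAX_VALUE = 2**31 - 1  # value of Java's Integer.MAX_VALUE
ASSUMED_BYTES_PER_ENTRY = 16   # assumption for illustration only


def queue_bytes(size: int, bytes_per_entry: int = ASSUMED_BYTES_PER_ENTRY) -> int:
    """Memory needed if one entry were reserved for each of `size` requested hits."""
    return size * bytes_per_entry


for label, size in [("Integer.MAX_VALUE", INTEGER_MAX_VALUE), ("50k", 50_000)]:
    print(f"size={label}: {queue_bytes(size) / 2**20:.1f} MiB")
```

Under that assumption, a request sized to the real upper bound of 50k hits needs less than a megabyte, while a request sized to Integer.MAX_VALUE would ask for far more than the 5 GB or 10 GB heaps mentioned in this thread.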



(Jörg Prante) #10

Exactly. Filters do not use scores. They also use bit sets, which makes them
reusable and fast.

I wasn't talking about a filter added to a query; I meant filtered queries.
That is a huge difference.

This query

{
  "query" : {
    "bool": {
      "must": {
        "match" : { "body" : "big" }
      },
      "must_not": {
        "match" : { "body" : "data" }
      },
      "must": {
        "match" : {"id": 521}
      }
    }
  }
}

can be turned into this filtered query

{
  "query" : {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "match" : { "body" : "big" } },
            { "match" : { "id": 521 } }
          ],
          "must_not": {
            "match" : { "body" : "data" }
          }
        }
      }
    }
  }
}

(plus fixing the double key "must" which is a potential source of errors)
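
To illustrate why the repeated "must" key is risky, the sketch below parses the original bool clause with Python's standard `json` module, which keeps only the last value for a repeated key. How Elasticsearch's own parser treats the duplicate is not shown here and may differ; the helper name `load_reporting_duplicates` is made up for this example.

```python
import json

RAW_BOOL = """
{"bool": {
  "must":     {"match": {"body": "big"}},
  "must_not": {"match": {"body": "data"}},
  "must":     {"match": {"id": 521}}
}}
"""


def load_reporting_duplicates(text):
    """Parse JSON and also return the keys that were repeated within one object."""
    duplicates = []

    def hook(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)  # dict() keeps the last value for a repeated key

    return json.loads(text, object_pairs_hook=hook), duplicates


parsed, duplicates = load_reporting_duplicates(RAW_BOOL)
print(duplicates)              # which keys were repeated
print(parsed["bool"]["must"])  # the clause that survived parsing
```

Any tool in the request path that round-trips the body through such a parser would silently drop the `body: big` clause, which is why putting both clauses in a single "must" array is the safer form.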

Jörg



(system) #11