Always get OutOfMemoryError while querying 3B documents in a 10-node cluster

I am trying to do some data analysis on the user action logs stored in our
cluster. However, I always get OutOfMemoryError while doing some simple
aggregation queries.

The cluster is built with 10 EC2 r3.large instances (2 cpus, 15GB memory),
8GB is allocated to JVM, the rest is for filesystem cache.

The following is a snippet of the {host}/_cat/indices?v API call:
green proj1-2014.11.12 5 1 45062362 0 14.3gb
7.1gb
green proj1-2014.11.13 5 1 49711374 0 15.6gb
7.8gb
green proj1-2014.11.14 5 1 48117585 0 15.4gb
7.7gb
green proj1-2014.11.15 5 1 49787532 0 15.9gb
7.9gb
green proj1-2014.11.16 5 1 49610956 0 15.9gb
7.9gb
green proj1-2014.11.17 5 1 45786570 0 14.4gb
7.2gb
green proj1-2014.11.18 5 1 50250179 0 15.7gb
7.8gb
green proj1-2014.11.19 5 1 51194044 0
16gb 8gb
green proj1-2014.11.20 5 1 49449391 0 15.5gb
7.7gb
green proj1-2014.11.21 5 1 49656731 0 15.3gb
7.6gb
green proj1-2014.11.22 5 1 53166199 0 16.4gb
8.2gb
green proj1-2014.11.23 5 1 52484206 0 16.3gb
8.1gb
green proj1-2014.11.24 5 1 48237162 0 14.9gb
7.4gb
green proj1-2014.11.25 5 1 50600654 0 15.6gb
7.8gb
green proj1-2014.11.26 5 1 53851289 0 16.6gb
8.3gb

As you can see, we use a per day basis to create the index for proj1.
Currently we have 3 months of data, which is about 3 billions of user logs.
The queries I used is 'terms aggregation':
curl '{host}:9200/proj1-2014.11.,proj1-2014.12./_search?pretty&size=0' -d
'{
"query": {
"filtered": {
"filter": {
"bool": {
"must_not": [
{
"terms": {
"serial_num": [
"1234567890", ...more test device serial numbers
]
}
}
],
"must": [
{
"range": {
"event_timestamp": {
"lt": "2014-12-03T00:00:00+00:00",
"gte": "2014-10-04T00:00:00+00:00"
}
}
}
]
}
}
}
},
"aggs": {
"by_locale": {
"terms": {
"field": "locale"
}
}
}
}'

The queries do not succeed even once. After this query is sent, no response
is back. Then I checked the status of the cluster, there are 2 nodes
missing from the cluster, obviously, the 2 nodes is OutOfMemory due to this
query (and I confirm this by login into the instance and check the log).

How can I avoid this? I think 10 nodes should be enough to do analysis on
large volume data.

Any suggestions?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/231b8f43-72b2-44d9-9c57-49736933e1a6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I can think of 3 potential causes for it:

  • fielddata: do you know how many unique values you have for the locale
    field?
  • norms: do you have "index: analyzed" fields in your mappings and don't
    care about length normalization? Then you might want to disable norms.
  • bloom filters: they take lots of memory and are disabled by default as
    of elasticsearch 1.4. On older versions you can disable them with the
    index.codec.bloom.load setting (see
    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html
    )

On Wed, Dec 3, 2014 at 11:54 AM, Shih-Peng Lin shihpeng.lin@gmail.com
wrote:

I am trying to do some data analysis on the user action logs stored in our
cluster. However, I always get OutOfMemoryError while doing some simple
aggregation queries.

The cluster is built with 10 EC2 r3.large instances (2 cpus, 15GB memory),
8GB is allocated to JVM, the rest is for filesystem cache.

The following is a snippet of the {host}/_cat/indices?v API call:
green proj1-2014.11.12 5 1 45062362 0
14.3gb 7.1gb
green proj1-2014.11.13 5 1 49711374 0
15.6gb 7.8gb
green proj1-2014.11.14 5 1 48117585 0
15.4gb 7.7gb
green proj1-2014.11.15 5 1 49787532 0
15.9gb 7.9gb
green proj1-2014.11.16 5 1 49610956 0
15.9gb 7.9gb
green proj1-2014.11.17 5 1 45786570 0
14.4gb 7.2gb
green proj1-2014.11.18 5 1 50250179 0
15.7gb 7.8gb
green proj1-2014.11.19 5 1 51194044 0
16gb 8gb
green proj1-2014.11.20 5 1 49449391 0
15.5gb 7.7gb
green proj1-2014.11.21 5 1 49656731 0
15.3gb 7.6gb
green proj1-2014.11.22 5 1 53166199 0
16.4gb 8.2gb
green proj1-2014.11.23 5 1 52484206 0
16.3gb 8.1gb
green proj1-2014.11.24 5 1 48237162 0
14.9gb 7.4gb
green proj1-2014.11.25 5 1 50600654 0
15.6gb 7.8gb
green proj1-2014.11.26 5 1 53851289 0
16.6gb 8.3gb

As you can see, we use a per day basis to create the index for proj1.
Currently we have 3 months of data, which is about 3 billions of user logs.
The queries I used is 'terms aggregation':
curl '{host}:9200/proj1-2014.11.,proj1-2014.12./_search?pretty&size=0'
-d '{
"query": {
"filtered": {
"filter": {
"bool": {
"must_not": [
{
"terms": {
"serial_num": [
"1234567890", ...more test device serial numbers
]
}
}
],
"must": [
{
"range": {
"event_timestamp": {
"lt": "2014-12-03T00:00:00+00:00",
"gte": "2014-10-04T00:00:00+00:00"
}
}
}
]
}
}
}
},
"aggs": {
"by_locale": {
"terms": {
"field": "locale"
}
}
}
}'

The queries do not succeed even once. After this query is sent, no
response is back. Then I checked the status of the cluster, there are 2
nodes missing from the cluster, obviously, the 2 nodes is OutOfMemory due
to this query (and I confirm this by login into the instance and check the
log).

How can I avoid this? I think 10 nodes should be enough to do analysis on
large volume data.

Any suggestions?

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/231b8f43-72b2-44d9-9c57-49736933e1a6%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/231b8f43-72b2-44d9-9c57-49736933e1a6%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6pa4Vq3hfhp_wk_y9Oz_Jx%2BOO1e-JD0OLuYYVDS5GZ0A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.