I am trying to do some data analysis on the user action logs stored in our
cluster, but I always get an OutOfMemoryError when running even simple
aggregation queries.
The cluster consists of 10 EC2 r3.large instances (2 vCPUs, 15 GB memory
each); 8 GB is allocated to the JVM heap, and the rest is left for the
filesystem cache.
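For context, 8 GB of heap on a 15 GB box follows the usual guidance of keeping the JVM heap at or below half of RAM so the remainder stays available to the filesystem cache. On Elasticsearch 1.x this is typically set through the ES_HEAP_SIZE environment variable (a sketch; the exact file path depends on how Elasticsearch was installed):

```shell
# /etc/default/elasticsearch (Debian/Ubuntu package) -- the path is an
# assumption; adjust for your installation method
ES_HEAP_SIZE=8g
```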
The following is a snippet of the output of the {host}/_cat/indices?v API
call:
health index            pri rep docs.count docs.deleted store.size pri.store.size
green  proj1-2014.11.12   5   1   45062362            0     14.3gb          7.1gb
green  proj1-2014.11.13   5   1   49711374            0     15.6gb          7.8gb
green  proj1-2014.11.14   5   1   48117585            0     15.4gb          7.7gb
green  proj1-2014.11.15   5   1   49787532            0     15.9gb          7.9gb
green  proj1-2014.11.16   5   1   49610956            0     15.9gb          7.9gb
green  proj1-2014.11.17   5   1   45786570            0     14.4gb          7.2gb
green  proj1-2014.11.18   5   1   50250179            0     15.7gb          7.8gb
green  proj1-2014.11.19   5   1   51194044            0       16gb            8gb
green  proj1-2014.11.20   5   1   49449391            0     15.5gb          7.7gb
green  proj1-2014.11.21   5   1   49656731            0     15.3gb          7.6gb
green  proj1-2014.11.22   5   1   53166199            0     16.4gb          8.2gb
green  proj1-2014.11.23   5   1   52484206            0     16.3gb          8.1gb
green  proj1-2014.11.24   5   1   48237162            0     14.9gb          7.4gb
green  proj1-2014.11.25   5   1   50600654            0     15.6gb          7.8gb
green  proj1-2014.11.26   5   1   53851289            0     16.6gb          8.3gb
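As a quick sanity check on the figures above, the per-day numbers can be totalled and averaged (values copied directly from the rows shown):

```python
# docs.count and pri.store.size (in GB) for the 15 daily proj1 indices
# listed in the _cat/indices snippet above
daily = [
    (45062362, 7.1), (49711374, 7.8), (48117585, 7.7), (49787532, 7.9),
    (49610956, 7.9), (45786570, 7.2), (50250179, 7.8), (51194044, 8.0),
    (49449391, 7.7), (49656731, 7.6), (53166199, 8.2), (52484206, 8.1),
    (48237162, 7.4), (50600654, 7.8), (53851289, 8.3),
]

total_docs = sum(docs for docs, _ in daily)
avg_docs = total_docs / len(daily)
avg_pri_gb = sum(size for _, size in daily) / len(daily)

print(f"{total_docs:,} docs over {len(daily)} days")
print(f"~{avg_docs / 1e6:.1f}M docs and ~{avg_pri_gb:.1f} GB primary per day")
```

So each daily index holds roughly 50 million documents and about 7.8 GB of primary data, which is consistent with the 3-billion-document total over 3 months mentioned below.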
As you can see, we create one index per day for proj1. We currently have 3
months of data, which is about 3 billion user log entries. The query I run
is a terms aggregation:
curl '{host}:9200/proj1-2014.11.*,proj1-2014.12.*/_search?pretty&size=0' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must_not": [
            {
              "terms": {
                "serial_num": [
                  "1234567890", ...more test device serial numbers
                ]
              }
            }
          ],
          "must": [
            {
              "range": {
                "event_timestamp": {
                  "gte": "2014-10-04T00:00:00+00:00",
                  "lt": "2014-12-03T00:00:00+00:00"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "by_locale": {
      "terms": {
        "field": "locale"
      }
    }
  }
}'
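For reference, the same request body can be built programmatically. This is a minimal Python sketch of the query above; `TEST_SERIALS` is a placeholder for the elided list of test device serial numbers, and the `filtered` query is Elasticsearch 1.x syntax:

```python
import json

# Placeholder for the elided list of test device serial numbers
TEST_SERIALS = ["1234567890"]

def build_locale_agg_query():
    """Build the request body for the by_locale terms aggregation,
    filtering out test devices and restricting the time range."""
    return {
        "query": {
            "filtered": {  # ES 1.x syntax; folded into bool queries in 2.x+
                "filter": {
                    "bool": {
                        "must_not": [
                            {"terms": {"serial_num": TEST_SERIALS}}
                        ],
                        "must": [
                            {"range": {"event_timestamp": {
                                "gte": "2014-10-04T00:00:00+00:00",
                                "lt": "2014-12-03T00:00:00+00:00",
                            }}}
                        ],
                    }
                }
            }
        },
        "aggs": {"by_locale": {"terms": {"field": "locale"}}},
    }

body = build_locale_agg_query()
print(json.dumps(body, indent=2))
```

The printed JSON can be POSTed to the `_search` endpoint exactly as in the curl call above.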
The query has never succeeded even once. After it is sent, no response ever
comes back. When I then check the cluster status, 2 nodes are missing from
the cluster; they went down with OutOfMemoryError because of this query (I
confirmed this by logging into the instances and checking the logs).
How can I avoid this? I would think 10 nodes should be enough to analyze
this volume of data.