Aggregation error (Java heap space)


(vir.candy) #1

I run an aggregation search on my index (6 nodes). It holds about 200
million lines of port-scanning data. Each line looks like this:
{"ip":"85.18.68.5", "banner":"cisco-IOS", "country":"IT", "_type":"port-80"}.
As you can imagine, the data is sorted into different types by the port
being scanned. Now I want to know which IPs have many ports open at the
same time. So I chose to aggregate on the ip field, and I got an OOM error.
That may be reasonable: most IPs have only one port open, so I guess there
are too many buckets?

Then I tried a filter aggregation:

{
  "aggs": {
    "just_name1": {
      "filter": {
        "prefix": {
          "ip": "100.1"
        }
      },
      "aggs": {
        "just_name2": {
          "terms": {
            "field": "ip",
            "execution_hint": "map"
          }
        }
      }
    }
  }
}

(Yes, my ip field is mapped as a string.)

I thought this would make ES narrow down the set of documents for the aggregation, but I still get an OOM error, while the same request works on a smaller index (another cluster, one node). Why would this happen? After filtering, the two clusters should be working on sets of equal size. Why does the bigger one fail?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d66bef21-b1e9-4538-b621-e93949b389cc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(vir.candy) #2

The smaller index has 1 million lines of data. They are the lines of the
bigger index that match "prefix":{"ip":"100.1"}.



(Adrien Grand) #3

Given your description of the problem, I think the issue is that your
Elasticsearch cluster doesn't have enough memory to load field data for the
ip field (which needs to be done for all documents, not only those that
match your query). So you either need to give more nodes to your cluster,
more memory to your nodes, or use doc values for your ip field[1] (the
latter option requires reindexing).

[1] http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/
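
As a minimal sketch of the doc values option mentioned above, a mapping for the ip field might look like the following. It assumes the 1.x-era mapping syntax in which doc values are selected through the field data format, and it uses the "port-80" type from the sample document; other types would need the same mapping. Setting "index" to "not_analyzed" is an assumption about how the field should be indexed. The exact syntax depends on the Elasticsearch version, so check the mapping documentation for the release in use. The mapping has to be in place on a new index before the data is reindexed into it.

```json
{
  "port-80": {
    "properties": {
      "ip": {
        "type": "string",
        "index": "not_analyzed",
        "fielddata": {
          "format": "doc_values"
        }
      }
    }
  }
}
```

With a mapping along these lines, the values of the ip field are written to disk at index time instead of being loaded onto the Java heap when the aggregation first runs.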

--
Adrien Grand



(vir.candy) #4

But I can run the aggregation on the 'banner' field on both clusters. Is
that because the values of 'banner' are less unique than those of the 'ip'
field?

2014-04-02 16:27 GMT+08:00 Adrien Grand adrien.grand@elasticsearch.com:

Given your description of the problem, I think the issue is that your
Elasticsearch cluster doesn't have enough memory to load field data for the
ip field (which needs to be done for all documents, not only those that
match your query). So you either need to give more nodes to your cluster,
more memory to your nodes, or use doc values for your ip field[1] (the
latter option requires reindexing).

[1]
http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/



(Adrien Grand) #5

On Wed, Apr 2, 2014 at 10:52 AM, 张阳 vir.candy@gmail.com wrote:

But I can run the aggregation on the 'banner' field on both clusters. Is
that because the values of 'banner' are less unique than those of the 'ip'
field?

Very likely, yes. Memory usage of field data is higher on high-cardinality
fields.

--
Adrien Grand


