Aggregation on big data


(vir.candy) #1

I have 200 million lines of data (from port scanning). I want ES to return
the "ip" values that have more than one port open at the same time, ordered
by count. However, given the volume of data and the fact that very few docs
share the same value in the "ip" field, I get an out-of-memory error. Is
there any way to complete this query?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c15fcb9d-14f0-4eac-ba33-4b46d21c75a0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

This kind of use-case requires memory for two main reasons:

  • field data,
  • counting values (aggregations).

Field data memory usage can be reduced by using doc values[1], which
effectively store the data on disk instead of in memory and rely on the
filesystem cache.
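
To illustrate the doc values suggestion, here is a minimal sketch of a mapping that turns them on for the "ip" field, written as a Python dict that can be serialized to JSON. The type name `scan`, the `not_analyzed` string field, and the `"doc_values": True` setting are assumptions for illustration; the exact mapping syntax depends on the Elasticsearch version, so check the reference for the version you run.

```python
import json


def build_ip_mapping(doc_type="scan", field="ip"):
    """Return a mapping body that asks for doc values on one field.

    Assumption: `doc_type` is a hypothetical type name, and the field is
    stored as a not_analyzed string, as in the question.
    """
    return {
        doc_type: {
            "properties": {
                field: {
                    "type": "string",
                    "index": "not_analyzed",
                    # Assumed syntax; keeps field data on disk, not on the heap.
                    "doc_values": True,
                }
            }
        }
    }


# Serialize the mapping so it can be sent as a request body.
mapping_body = json.dumps(build_ip_mapping(), indent=2)
print(mapping_body)
```

Doc values are written at index time, so a mapping change like this would only apply to newly indexed data; existing documents would need to be reindexed.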

Aggregation memory usage is harder to improve. If you are storing your IPs
as string fields, you might want to use the map execution hint[2], which
requires less memory than the ordinals execution hint (note, however, that
we are working on improving the efficiency of ordinals on high-cardinality
fields, so this might improve in future versions).
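
As a sketch of how the map execution hint could be combined with the original question, the request body below runs a terms aggregation on "ip" and keeps only buckets with at least two documents. It assumes one document per (ip, port) pair, so a document count of two or more means more than one open port; the aggregation name `ips_with_many_ports` and the bucket size of 100 are hypothetical choices.

```python
import json


def build_multi_port_query(field="ip", min_ports=2, top_n=100):
    """Return a search body listing values of `field` seen in >= min_ports docs."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            # Hypothetical aggregation name.
            "ips_with_many_ports": {
                "terms": {
                    "field": field,
                    "min_doc_count": min_ports,  # drop IPs seen in fewer docs
                    "size": top_n,  # number of buckets to return
                    "execution_hint": "map",  # the hint suggested above
                }
            }
        },
    }


# Serialize the query so it can be sent as a search request body.
query_body = json.dumps(build_multi_port_query(), indent=2)
print(query_body)
```

Terms buckets are sorted by descending document count by default, which matches the "order by count" requirement.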

[1]
http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/
[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-aggregations-bucket-terms-aggregation.html#_execution_hint

(In reply to the message from vir.candy@gmail.com of Mon, Mar 31, 2014 at 11:01 AM.)


(vir.candy) #3

Thank you!

(In reply to Adrien Grand's message of Monday, March 31, 2014, 5:26:09 PM UTC+8.)


(system) #4