ES java.lang.OutOfMemoryError during Indexing?

Hi all,
My cluster consists of 3 master nodes and 6 data nodes. Each node has 4
vCPUs and 16gb RAM (8gb for ES_HEAP). Each data node has 1TB of storage
(4 x 250gb, each connected to a different storage device). The main load
on the cluster is about 3K to 4K events per second (about 150 million
events per day, each event 200 to 300 bytes) coming from some network
devices. I use redis for caching and logstash for formatting. The
caching, formatting and indexing into ES are done on other machines
(8 vCPU, 32gb RAM). The cluster works fine, but not all the time.
Sometimes I'll get an OutOfMemoryError on one of the data nodes, and
then it ruins everything. I know I can disable the heap dump on
OutOfMemoryError, but I'd still like to know why these errors occur in
the first place, because I don't think ingest "speed" is the key factor.
The ES cluster had been working perfectly for 2 months with a load of
about 150 million events per day, but the last time an OutOfMemoryError
happened, the load was only about 80 million events per day (estimated).
And even if there were a burst of events, there are 2 redis servers with
32gb RAM for caching.
My cluster generates 4 indices per day, and each index has 12 shards.
The cluster started collecting log events about a year ago, the main
load started around 3 months ago, and there are around 14K shards in the
cluster. Each index has several types, but the main load goes into one
specific index/type. Mappings for each index/type are pre-defined. Some
are pretty complicated, but the mapping for the main load is rather
simple. I did notice that on a regular day, heap usage on the data nodes
is quite high (around 60~90%). So my questions/concerns are:
1. If I disable the heap dump, will the affected data node behave as
normally as it otherwise could? Currently, while the heap dump is being
written, the data node becomes unavailable and disconnects from the
cluster. This causes shard relocation and triggers a cascade of
OutOfMemoryErrors on the other data nodes.
2. What causes these OutOfMemoryErrors? Too many shards? Too many
segments? Shards/segments that are too big? I can reconfigure my
logstash so that indices are created every month or week rather than
every day. But if I want to keep storing the logs by day, with a
reasonable shards-per-index setting and overall index count, what kind
of hardware (heap size) do I need?
3. Are there any settings that should be tuned?
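
(For reference, per-node heap usage like the 60~90% I mentioned can be
read from the nodes stats API. A rough sketch in Python with the
requests library; the host name is just a placeholder:)

import requests

# Per-node JVM stats from any node in the cluster (placeholder host).
resp = requests.get("http://es-data-node-1:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    print("%-25s heap used: %s%%" % (node["name"], heap_pct))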


14 thousand shards is a lot! It's definitely going to be contributing to
your problem.

How much data is that in your cluster, how many indices?
What version of ES, what java version and release?
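
(Both are quick to pull from the APIs if you're not sure; an
illustrative sketch using Python's requests, host is a placeholder:)

import requests

host = "http://localhost:9200"  # placeholder, any node in the cluster

# Totals for indices, shards and on-disk data size
stats = requests.get(host + "/_cluster/stats").json()
print("indices:", stats["indices"]["count"])
print("shards: ", stats["indices"]["shards"]["total"])
print("bytes:  ", stats["indices"]["store"]["size_in_bytes"])

# ES and JVM version on each node
info = requests.get(host + "/_nodes").json()
for node in info["nodes"].values():
    print(node["name"], "ES", node["version"], "JVM", node["jvm"]["version"])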


Sorry for my late response.
The ES version is 1.4.2, and I'm planning to upgrade to 1.4.4.
The java version is Sun/Oracle Java 1.8.0_20.
About the data in the cluster, it is quite complicated.
Before the last OutOfMemoryError, there were 629 indices and 15096
shards in the cluster. The actual data is about 1510gb; with the replica
setting (1), the disk space used was about 3020gb.
The cluster has 4 types of log input (load), and 4 kinds of indices to
hold them. The indices are created on a daily basis. Each index has 12
primary + 12 replica shards, so every day 96 shards are initialized and
ready for data.
The 4 types of log input are not equal. The one from the network devices
is the main load; this one alone generates about 150 million log events
every day.
Something like this (one line of _cat/indices output):
green open device-log-2015.01.11 12 1 152642260 0 71.8gb 35.9gb
Before the last OutOfMemoryError, there were 68 such indices in the
cluster. (The cluster itself has been running for some time, but this
heavy load only started about 2 months ago.) All the other indices are
rather small, ranging from 100mb to 2gb each.

As for the number of shards in the cluster, well, there is very little I
can do: 4 indices / 96 shards every day, and this will go on for at
least 12 months. There will be 34560 shards in the cluster and, what's
worse, 1/4 of them will be those "huge" shards.
I'm planning to upgrade the hardware, adding more data nodes (planning
for 24) and more RAM (planning for 32gb). What concerns me is that, if
heap memory usage is related to shard count, what kind of heap size can
hold 15096 (current) / 34560 (foreseeable future) shards? What if I have
to keep the logs for 18 or 24 months?
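
(To make my own arithmetic explicit, a quick back-of-the-envelope sketch
of the shard counts under different rollover intervals; the figures just
restate the assumptions above:)

# 4 indices per rollover period, each with 12 primary + 12 replica shards.
indices_per_period = 4
shards_per_index = 12 * 2  # primaries + replicas

def total_shards(periods_retained):
    return indices_per_period * shards_per_index * periods_retained

print(total_shards(360))  # daily indices, ~12 months retained -> 34560
print(total_shards(12))   # monthly indices, 12 months retained -> 1152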


You do not need that level of sharding, it's excessive to the point of
being insane.


Well, I do suspect that too many shards caused my problem. I didn't
think it would be a problem when I created this cluster. I've already
closed 2/3 of the indices (they will be re-opened and re-indexed), and
I'm planning to reduce the index creation frequency from 1 index per day
to 1 index per month. But since the load stays the same, am I correct
that, choosing between "many indices/shards with relatively few docs in
each index" and "few indices/shards with relatively more docs in each
index", the latter is the right choice?
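
(For what it's worth, one way to do the monthly switch is to point
logstash's elasticsearch output at a monthly index name, e.g.
device-log-%{+YYYY.MM} instead of %{+YYYY.MM.dd}, and to let an index
template give the new indices fewer shards. A minimal sketch of applying
such a template over the HTTP API in Python; the template name and the
shard count are illustrative assumptions, not recommendations:)

import json
import requests

host = "http://localhost:9200"  # placeholder, any node will do

# Template matching all future monthly device-log indices.
# Shard/replica numbers below are examples only.
template = {
    "template": "device-log-*",
    "settings": {
        "number_of_shards": 6,
        "number_of_replicas": 1,
    },
}

resp = requests.put(host + "/_template/device_log_monthly",
                    data=json.dumps(template))
print(resp.json())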
