Large scale Elasticsearch / Logstash collection system


(Robert Gardam) #1

Hello

We have a 10 node Elasticsearch cluster which is receiving roughly 10k log
lines per second from our application.

Each Elasticsearch node has 132GB of memory with a 48GB heap. The disk
subsystem is not great, but it seems to be keeping up. (This could be an
issue, but I'm not sure that it is.)

The log path is:

app server -> redis (via logstash) -> logstash filters (3 dedicated boxes)
-> elasticsearch_http

We currently bulk import from Logstash at 5k documents per flush to keep up
with the volume of incoming data.
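Roughly, the output side of the Logstash config looks like this (host and index names are placeholders):

```
output {
  elasticsearch_http {
    host => "es-node-1"                 # placeholder host
    flush_size => 5000                  # the 5k documents per flush above
    index => "logstash-%{+YYYY.MM.dd}"
  }
}
```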

Here are the non-default ES configs.

indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000

Refresh tuning:

index.refresh_interval: 15s

Field data cache tuning:

indices.fielddata.cache.size: 24g
indices.fielddata.cache.expire: 10m

Segment merging tuning:

index.merge.policy.max_merged_segment: 15g

Thread tuning:

threadpool:
  bulk:
    type: fixed
    queue_size: -1

We have not had this cluster stay up for more than a week, and it seems to
crash for no obvious reason.

It seems like one node starts having issues and then it takes the entire
cluster down.

Does anyone from the community have any experience with this kind of setup?

Thanks in advance,
Rob

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d04a643e-990b-40b0-b230-2ba560f08eea%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

Because you set queue_size: -1 in the bulk thread pool, you explicitly
allowed the node to crash.

You should use reasonable resource limits. The default settings are
reasonable and sufficient in most cases.
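For example, a bounded bulk thread pool in elasticsearch.yml would look like this (the values are illustrative, not recommendations):

```
threadpool:
  bulk:
    type: fixed
    size: 32          # usually tied to the number of processors
    queue_size: 50    # reject further requests instead of queueing unbounded work
```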

Jörg

On Wed, Aug 13, 2014 at 5:18 PM, Robert Gardam robert.gardam@fyber.com
wrote:



(Robert Gardam) #3

Hi,
The reason this is set is that without it we reject messages and therefore
don't have all the log entries.

I'm happy to be told this isn't required, but I'm pretty sure it is. We are
constantly bulk indexing large numbers of events.
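A common alternative to an unbounded queue is to keep the bounded queue and retry rejected bulk actions on the client side. A minimal sketch, assuming a hypothetical `send` callable (this is not Logstash's actual behavior):

```python
import random
import time

def send_bulk_with_retry(send, batch, max_retries=5):
    """Retry a bulk batch when Elasticsearch rejects actions (e.g. a full
    bulk thread pool queue) instead of dropping them.

    `send` is a hypothetical callable that submits one bulk request and
    returns the list of rejected actions (empty list on full success).
    """
    for attempt in range(max_retries):
        rejected = send(batch)
        if not rejected:
            return True
        # back off before resending only the rejected actions
        time.sleep(min(0.1 * (2 ** attempt), 5.0) * random.random())
        batch = rejected
    return False
```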

On Wednesday, August 13, 2014 6:09:46 PM UTC+2, Jörg Prante wrote:



(Jörg Prante) #4

If Elasticsearch rejects bulk actions, that is serious and you should
examine the cluster to find out why. Slow disks, cluster health, or
capacity problems all come to mind. But if you skip diagnosing the problem
and merely disable bulk resource control instead, you open the gate wide to
unpredictable node crashes, and at some point you will no longer be able to
control the cluster.

To reduce the number of active bulk requests per timeframe, you could, for
example, increase the number of actions per bulk request. Or simply add
more nodes. Or think about the shard/replica organization while indexing:
it can be an advantage to bulk index with the replica count set to 0 and
increase the replica count afterwards.
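For example, via the index settings API (the index name is illustrative):

```
PUT /logs-2014.08.13/_settings
{ "index": { "number_of_replicas": 0 } }

# after the bulk load has caught up:
PUT /logs-2014.08.13/_settings
{ "index": { "number_of_replicas": 1 } }
```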

Jörg

On Wed, Aug 13, 2014 at 6:50 PM, Robert Gardam robert.gardam@fyber.com
wrote:



(Robert Gardam) #5

I appreciate your answers. I think IO could be a contributing factor. I'm
thinking of splitting the index into hourly indices with no replicas for
bulk importing and then switching replicas on afterwards.
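Roughly, that could be done with an index template along these lines (template name and pattern are illustrative):

```
PUT /_template/logstash_bulk
{
  "template": "logstash-*",
  "settings": {
    "number_of_replicas": 0,
    "refresh_interval": "15s"
  }
}
```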

I think the risk of losing data would be too high if it was any longer than
that. Also, does the async replication from the Logstash side of things
cause unknown issues?

On Wednesday, August 13, 2014 7:08:05 PM UTC+2, Jörg Prante wrote:



(Otis Gospodnetić) #6

Hi Robert,

Or maybe it's worth rethinking the architecture to avoid having to do
tricks like running without replicas for an hour. Kafka in front of ES
comes to mind. We use this setup for Logsene (http://sematext.com/logsene/)
and don't have a problem with log loss, so it may work well for you, too.

I think you could also replace Redis + 3 Logstash servers with one rsyslog
server with omelasticsearch, which has built-in buffering in memory and on
disk (see links below for config examples).
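A minimal omelasticsearch sketch with memory and disk queueing (the server name and queue settings are illustrative):

```
module(load="omelasticsearch")
action(type="omelasticsearch"
       server="es-node-1"              # placeholder host
       bulkmode="on"                   # use the bulk API
       queue.type="linkedlist"         # in-memory queue
       queue.filename="es_queue"       # spill to disk when memory fills
       queue.saveonshutdown="on"
       action.resumeretrycount="-1")   # retry forever instead of dropping
```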

Some pointers that may be helpful:

Otis

Elasticsearch Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Wednesday, August 13, 2014 7:24:09 PM UTC+2, Robert Gardam wrote:



(system) #7