Need help with configuration

I have the following setup for an ELK Platform:

  • logstash-forwarder on 20 machines, broken down as follows:
      • 4 machines (Nginx + sys logs)
      • 4 machines (PostgreSQL)
      • 12 machines (Ruby on Rails + sys logs)

And I have one machine that hosts the ELK platform with the following specs:

  • 16GB Ram
  • Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (4 cores)
  • SATA 6 Gb/s 7200 rpm

I followed the DigitalOcean tutorial (https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-4-on-ubuntu-14-04) to set all of this up, but I get the following errors:

/var/log/elasticsearch.log

[2015-10-14 13:56:22,743][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] create failure (index:[.marvel-2015.10.14] type: [cluster_stats]): EsRejectedExecutionException[rejected execution (queue capacity 500) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1@65fac6b3]
[2015-10-14 13:58:31,174][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] error sending data to [http://127.0.0.1:9200/.marvel-2015.10.14/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-10-14 14:00:57,588][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] create failure (index:[.marvel-2015.10.14] type: [node_stats]): EsRejectedExecutionException[rejected execution (queue capacity 500) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1@50605923]
[2015-10-14 14:02:26,153][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] error sending data to [http://127.0.0.1:9200/.marvel-2015.10.14/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-10-14 14:04:59,930][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] error sending data to [http://127.0.0.1:9200/.marvel-2015.10.14/_bulk]: SocketTimeoutException[Read timed out]
[2015-10-14 14:05:44,498][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] create failure (index:[.marvel-2015.10.14] type: [node_stats]): EsRejectedExecutionException[rejected execution (queue capacity 500) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1@5b840dba]
[2015-10-14 14:07:42,643][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] create failure (index:[.marvel-2015.10.14] type: [cluster_stats]): EsRejectedExecutionException[rejected execution (queue capacity 500) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1@6126efae]
[2015-10-14 14:13:39,190][ERROR][marvel.agent.exporter    ] [General Orwell Taylor] create failure (index:[.marvel-2015.10.14] type: [cluster_stats]): EsRejectedExecutionException[rejected execution (queue capacity 500) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1@1c09e397]

/var/log/logstash.log

{:timestamp=>"2015-10-14T14:18:58.756000+0200", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-14T14:18:58.784000+0200", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-14T14:18:58.826000+0200", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2015-10-14T14:18:58.826000+0200", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-14T14:19:02.786000+0200", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-14T14:19:03.754000+0200", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-10-14T14:19:03.755000+0200", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-10-14T14:19:03.755000+0200", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}

And thousands of

{:timestamp=>"2015-10-14T14:20:47.961000+0200", :message=>"retrying failed action with response code: 429", :level=>:warn}

So most of the data is being dropped and never indexed!

It means ES is overloaded.
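One way to confirm that (a suggestion, assuming ES is reachable on localhost:9200 as in the tutorial setup) is to watch the bulk thread pool with the cat API; a growing rejected count matches the EsRejectedExecutionException (queue capacity 500) errors above:

```shell
# Show bulk thread-pool activity and rejections on the local ES node.
# "bulk.rejected" climbing over time means ES cannot keep up with the
# indexing rate and is shedding bulk requests (hence the 429s in Logstash).
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'
```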

What is the config you used for ES?

Nothing special... Just the default configurations!

Then you're running with a maximum heap of only 1GB, which won't be helping.
You should increase that; I'd start with a minimum of 4GB.
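For a Debian/Ubuntu package install like the one in that tutorial, the heap is set in /etc/default/elasticsearch (file path and service name assumed from that setup):

```shell
# /etc/default/elasticsearch -- heap size for the packaged ES service.
# Rule of thumb: roughly half of RAM for the heap, leaving the rest for
# the OS filesystem cache. On a 16GB box, 4-8GB is a sensible range.
ES_HEAP_SIZE=4g
```

After editing, restart the service (e.g. `sudo service elasticsearch restart`) for the new heap to take effect.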

Thank you @warkolm, I edited /etc/default/elasticsearch and set ES_HEAP_SIZE=8g, and now I have a very high load (25-30) and can't run Kibana because it times out!

But I'm still getting thousands of the following errors in logstash.log:

{:timestamp=>"2015-10-15T01:35:03.842000+0200", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-10-15T01:35:35.969000+0200", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-10-15T01:35:35.969000+0200", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-15T01:35:03.895000+0200", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}

We had a similar problem where we installed everything on one machine. We started with 4GB and realised pretty soon we needed more...

So we are running with 25GB at the moment for the whole machine (I actually want to go up to 64GB). Heap sizes are as follows:

  • Logstash = 6GB
  • Elasticsearch = 8GB

But we have a problem where we run out of memory every 5 to 6 days and have to restart ES (v1.4.5).

You don't need that much heap for LS; you should be able to move some of that to ES.
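Assuming package installs on Ubuntu, rebalancing the two heaps above could look like this (a sketch; the exact split depends on your filter complexity and indexing volume):

```shell
# /etc/default/logstash -- Logstash rarely needs 6GB of heap;
# a couple of GB is usually plenty for filtering and output buffering.
LS_HEAP_SIZE=2g

# /etc/default/elasticsearch -- give the freed memory to ES instead,
# which does the heavy lifting (indexing, merging, field data).
ES_HEAP_SIZE=12g
```

Restart both services after changing these so the new JVM heap settings are picked up.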