Logstash/Elasticsearch 5.x Errors

Forgive me, for I am a n00b when it comes to Unix or even Logstash/Elasticsearch. I had several localized clusters online for almost a year when I began upgrading them to 5.x.

My environment is such that each location operates as an island. So I built each site's Logstash instance to inject into the local install of Elasticsearch as well as pipe the output to my two data centers. Each of my two data center clusters is composed of 8 servers: a Kibana client node, 3 master nodes, and 4 data nodes. Both data center clusters were completely wiped and rebuilt, not upgraded from ELK 2.x. Since the upgrade, things have run even more smoothly than they already were. But at my local sites, where the ELK 5.x stack is running, I am now getting error messages such as this:

<13>Jan 16 14:49:22.111396 ELKSTACK [2017-01-16T14:49:22,039][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$6@2e62fc7f on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5e766343[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 14778073]]"})

I have not been able to find which configuration file I need to modify to increase the bulk queue capacity. As it stands right now, there are roughly 7 million logs being stored in the cluster per hour.

I have also removed the output configuration options on the local site ELK stacks that piped to their local Elasticsearch instance. Now, Logstash just parses the logs and ships them to my two data center sites.

Looking through the logs in the data centers, I do not see any error messages or anything suggesting it is an Elasticsearch problem.

My cluster health shows this:

user@CLUSTER-MASTER-01:/var/log/elasticsearch$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "CLUSTER",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 9306,
  "active_shards" : 18611,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Can someone please advise what file I need to modify to increase the bulk queue capacity?

Two things.

1 - you have too many shards, way too many: 18,611 active shards across only 4 data nodes is over 4,600 shards per node. Reduce that and you should find that your problem reduces. (The cat commands below will show you the per-index breakdown.)

2 - have a read of "Any idea what these errors mean version 2.4.2".
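
If you want to see where all those shards are coming from, the cat APIs give a quick per-index view, e.g.:

curl -XGET 'http://localhost:9200/_cat/indices?v'
curl -XGET 'http://localhost:9200/_cat/shards?v'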

Where on Earth do I define those?

While your post/reply was helpful, the link did not include any detail as to how to adjust the number of shards.

Depends on how you load data into ES, but look into index templates for a start.

That link was around the bulk changes you were thinking of making.

I don't see it, and this is what is irritating.

There is really good documentation on the Elastic.co website. But nowhere can I find which file I should use to adjust/configure the thread pool settings, number of shards, etc.

If it is there, I am sorry, my brain just isn't processing it off the pages I read.

For example:

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html

What reference do I have to the file I need to update in order to change any of these?

Welp, one of the new master servers I brought online had a much different config file compared to the other 15 servers from both clusters.

# Set the number of shards (splits) of an index (5 by default):
#
#index.number_of_shards: 5
#
# Set the number of replicas (additional copies) of an index (1 by default):
#
#index.number_of_replicas: 1

I am guessing these settings here, which I couldn't find any documentation referencing, are the ones I need to change.

If it lists 5 by default, what would you suggest I make it and the replicas?

I just did a test and set the shards to 1 and replicas to 2 and the Elasticsearch service wouldn't start.

The point of the link is: don't. It goes into why it's a bad idea.

You can't set shards and replicas in there any more; take a look at "Settings changes | Elasticsearch Guide [5.1] | Elastic", and then "Index Templates | Elasticsearch Guide [5.1] | Elastic".

Please don't mistake my lack of understanding as something more than it is.

I get the "point" and it makes sense.

I am a guy, with zero experience in this area, struggling to get things working.

It was working (to the best of my knowledge in 2.x) and now I feel like I am losing logs.

I don't feel as if I am doing anything overly complicated and yet, something's wrong.

For example, I can mostly comprehend what you added in the first link, but it still does NOT tell me where, or in what file exactly, to make changes.

With the second link, I am so lost there are no words to describe it. But even it doesn't tell me where the current templates are.

Elasticsearch's configuration settings are defined in elasticsearch.yml, described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html
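
For example, the bulk queue capacity you asked about at the start would be a static setting in that file, set on every node and requiring a restart to take effect (the value below is just illustrative, and as noted earlier, raising it mostly papers over having too many shards):

# elasticsearch.yml (5.x syntax)
thread_pool.bulk.queue_size: 200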

Index templates aren't stored in files but as state in Elasticsearch itself. The REST APIs described in https://www.elastic.co/guide/en/elasticsearch/reference/5.1/indices-templates.html explain how to obtain current templates and update them.
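
For example, to list every template currently loaded in the cluster, and then pull just the logstash one:

curl -XGET 'http://localhost:9200/_template?pretty'
curl -XGET 'http://localhost:9200/_template/logstash?pretty'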

Ok, that makes sense.

What I am going to do today, since it appears all my fiddling merely added to the chaos, is start over and rebuild all my ELK stack instances as well as my clusters.

It's not that hard of an undertaking, given that all of them are VMs.

The main thing I want to accomplish is fresh installs and having all the software on the same distros and versions.

So, before I start piping into the cluster, I was going to add this to the template:

  "number_of_shards" : 2
},

Also, I pulled the "default template" from my existing cluster and this is what is in there now:

{
  "logstash" : {
    "order" : 0,
    "template" : "logstash-*",
    "settings" : {
      "index" : {
        "refresh_interval" : "5s"
      }
    },
    "mappings" : {
      "_default_" : {
        "dynamic_templates" : [
          {
            "message_field" : {
              "mapping" : {
                "fielddata" : {
                  "format" : "disabled"
                },
                "index" : "analyzed",
                "omit_norms" : true,
                "type" : "string"
              },
              "match_mapping_type" : "string",
              "match" : "message"
            }
          },
          {
            "string_fields" : {
              "mapping" : {
                "fielddata" : {
                  "format" : "disabled"
                },
                "index" : "analyzed",
                "omit_norms" : true,
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "ignore_above" : 256,
                    "index" : "not_analyzed",
                    "type" : "string"
                  }
                }
              },
              "match_mapping_type" : "string",
              "match" : "*"
            }
          }
        ],
        "_all" : {
          "omit_norms" : true,
          "enabled" : true
        },
        "properties" : {
          "@timestamp" : {
            "type" : "date"
          },
          "geoip" : {
            "dynamic" : true,
            "properties" : {
              "ip" : {
                "type" : "ip"
              },
              "latitude" : {
                "type" : "float"
              },
              "location" : {
                "type" : "geo_point"
              },
              "longitude" : {
                "type" : "float"
              }
            }
          },
          "@version" : {
            "index" : "not_analyzed",
            "type" : "string"
          }
        }
      }
    },
    "aliases" : { }
  }
}

Any suggestions on changes?

Also, as it stands, I have 1150 fields (which divided by two is 575 fields). I've done my level best to standardize my Logstash configs so that similar fields have similar names. My cluster is receiving logs from Windows servers (winlogbeat), Cisco devices [ e.g. routers, switches, firewalls, wireless ] (syslog), Security appliances (syslog), Unix servers (syslog), Riverbed devices (syslog), F5 devices (syslog), ESX hosts (syslog), Monitoring servers (syslog), DNS servers (syslog), etc.

I am sure there is much more I could be doing in terms of overall ELK performance, but I am just a guy, with zero internal funding and no one experienced in anything similar to this type of setup, doing what I can to capture logs in order to keep an eye on an environment that spans nearly 30 locations across the US. I say all that to say this: "Thank you!"

Everyone's help here has been immensely appreciated.

Do you really need two shards per index? How much data do you expect every day? Will you use daily indexes?

I'd consider making strings non-analyzed by default (keyword in ES 5). Fields extracted from logs are rarely useful to analyze. The existing mappings will certainly work, though.
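
As a rough sketch (not tested against your data), the catch-all string rule in the template could become something like this in 5.x, replacing the analyzed-string-plus-raw-subfield pattern; the ignore_above value is just carried over from your existing raw subfield:

"dynamic_templates" : [
  {
    "string_fields" : {
      "match_mapping_type" : "string",
      "match" : "*",
      "mapping" : {
        "type" : "keyword",
        "ignore_above" : 256
      }
    }
  }
]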

Honestly, I don't know how many shards I need. From what I have been able to deduce from various posts on the Internet trying to detail how to size a cluster, mostly I've seen 1 shard per data node. At the end of the day, each data center cluster will have about 25-30 ELK instances piping into 3 master servers and 4 data nodes (currently).

Mostly what is used is searching via the Kibana client node. So, 1 shard and 2 or 3 replicas?

Honestly, I don't know how many shards I need. From what I have been able to deduce from various posts on the Internet trying to detail how to size a cluster, mostly I've seen 1 shard per data node.

Do you mean one shard per index?

At the end of the day, each data center cluster will have about 25-30 ELK instances piping into 3 master servers and 4 data nodes (currently).

  • Will you use daily indexes?
  • How many days of logs will you keep?
  • How many events per day?
  • How many replicas?

Mostly what is used is searching via the Kibana client node. So, 1 shard and 2 or 3 replicas?

Unless you're very well funded, having more than one replica is probably wasteful. How many data node failures do you need to be able to cope with before shards start becoming unavailable? Having three replicas instead of one will basically reduce your cluster's capacity by 50%: with one replica every shard is stored in two copies, with three replicas in four.

Do you mean one shard per index?

That one I am not sure I am following the terminology correctly on. (See below) My understanding has been that more shards mean faster indexing.

Will you use daily indexes?

I believe YES. In terms of indexes, are you referring to the logstash-YYYY-MM-DD files?

How many days of logs will you keep?

Our long term goal (which will require funding) is 365 days of logs.

How many events per day?

Roughly 8 million per hour, so, 190ish million per day.

How many replicas?

My comprehension to this point has been more replicas mean faster searches.

Magnus,
I got one of my data center clusters online last night. Default installation so far. What should I try for shards and replicas?

Anyone? I'm at the point where I need to load a template.

Is there anyone who can help?

That one I am not sure I am following the terminology correctly on. (See below) My understanding has been that more shards mean faster indexing.

Yes, that might be true but it's something that needs to be balanced together with other concerns.

I believe YES. In terms of indexes, are you referring to the logstash-YYYY-MM-DD files?

Correct.

Roughly 8 million per hour, so, 190ish million per day.

Oh, that's quite a lot. Perhaps somewhere around 30-40 GB/day then? Obviously very dependent on the size of each event.
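
(As a rough sanity check, assuming an average event size of around 200 bytes: 190 million events × 200 bytes ≈ 38 GB, so that range seems plausible before replicas and indexing overhead.)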

My comprehension to this point has been more replicas mean faster searches.

That might be true under some circumstances, but I suspect that gain will be offset by the increased RAM needs, cache thrashing, and so on. Replicas are something you use for redundancy and resiliency, not performance.

I'm no expert in sizing large ES clusters, but here are a few things to consider:

  • You don't want to have too many shards per machine since shards have a fixed overhead. A couple of hundred shards is probably okay but you don't want thousands of shards.
  • Shards larger than, say, 50 GB are a bad idea because of the excessive recovery times. Aiming for a shard size of a few tens of GB is probably reasonable.
  • Spreading query load over multiple machines in the cluster is obviously good, but that doesn't necessarily mean that you need multiple shards per day. If queries frequently span over e.g. one week you'll likely spread the load even with a single daily shard. If most queries are for the past 24 hours things play out differently.

Ok, that all makes sense. And as another piece of information, my daily disk usage was about 240 GB per day. I calculated that my 8 TB cluster could retain logs for 28 days or so.

I am at the point with my first data center cluster where I will be sending logs to it, hopefully this evening.

I've built a very basic template to start with, just so that I get the shard count established.

PUT _template/logstash
{
  "order": 0,
  "template": "*",
  "settings": {
    "index": {
      "number_of_shards": "4",
      "number_of_replicas": "0",
      "refresh_interval": "5s"
    }
  }
}

I think with this, instead of the 5 and 1, it should give me a good idea of overall usage. I believe I need to go with 4 shards so that all 4 data nodes participate in the load balancing. Correct?
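
As a rough check against the shard-size guideline you mentioned, assuming daily indexes: 240 GB/day ÷ 4 primary shards = 60 GB per shard, which is at (or slightly above) the few-tens-of-GB range, so 4 shards per daily index certainly shouldn't be too many.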

I read that the replica count can be changed at any time, so for my proof of concept with the shard count change, I am going to set it to zero for the moment.
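
If I'm reading the docs right, bumping it back up later would be a live settings update along these lines (the index pattern here is just an example):

curl -XPUT 'http://localhost:9200/logstash-*/_settings' -d '
{
  "index" : {
    "number_of_replicas" : 1
  }
}'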

I have not seen where I can set the shard size configuration. Is that also done with a template?