Courier Fetch Errors

Right, but I am getting these error messages on all kinds of logs. This was just one of thousands I copied and pasted.

This message in particular came from Winlogbeat.

Assuming you made no changes on the logstash side, maybe we can learn something by comparing the mappings and index templates between the cluster that works normally and the faulty cluster (via http[s]://${YOURCLUSTER}/_mapping and http[s]://${YOURCLUSTER}/_template).
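
For example, something along these lines (the cluster hostnames are placeholders) would let you dump and diff both:

curl -XGET 'http://WORKINGCLUSTER:9200/_template?pretty' > working_templates.json
curl -XGET 'http://FAULTYCLUSTER:9200/_template?pretty'  > faulty_templates.json
diff working_templates.json faulty_templates.json

curl -XGET 'http://WORKINGCLUSTER:9200/logstash-*/_mapping?pretty' > working_mappings.json
curl -XGET 'http://FAULTYCLUSTER:9200/logstash-*/_mapping?pretty'  > faulty_mappings.json
diff working_mappings.json faulty_mappings.json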

Yeah, I have 30-ish Logstash shippers.

The working "normal" cluster, I do not have the shippers sending to. I only did it as a test this morning. It has none of the tweaks and config changes that I've made over the past couple weeks based on recommendations off of here. Like refresh interval, RAM increases, etc.

I don't know how valid that will be. It hasn't received data (aside from my testing today) since Jan 30th.

Ok, I can see how 30 shippers could produce quite a lot of load on the cluster. Do they all share the same config? Is it possible that they are sending documents with conflicting types to the same index?

I know that during normal operations I have seen errors about logs not being UTF-8 or whatever, but those were few and far between, not thousands per hour.

All the shippers are sending to the same index format: logstash-%{+YYYY.MM.dd}
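
The relevant part of each shipper's output config looks roughly like this (the hosts value is a placeholder):

output {
  elasticsearch {
    hosts => ["http://DATANODE-01:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}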

I think some of the problems may have been hardware related and I am working with our server team on them.

But the number of shippers has been the same, and it will grow. Each of our locations gets one, which does the parsing and then ships to ES.

In the end, the bulk queue default config was the issue. I have made significant strides and tweaks based on feedback from here. I am in a much better place now, even though I had to change the bulk queue capacity from the default of 50. I have since scaled it back from 1000 to 500, and I hope to scale it back more now that I have added more CPUs.
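
For reference, on these 5.x nodes that queue capacity is set in elasticsearch.yml on each data node, along the lines of:

# elasticsearch.yml (each data node): bulk thread pool queue capacity
thread_pool.bulk.queue_size: 500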

So, based on this, I have stopped the ES service on DATANODE-05.

I stopped all the Logstash shippers, cleared out all the logs, reset the database configuration on the cluster, and deleted all my old indexes. I also changed the shard count from 6 to 4 (giving me one less than the number of active data nodes).
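
An index template along these lines is one way to apply the new shard count (the template name here is just an example, and it only affects indices created after it is in place):

curl -XPUT 'http://localhost:9200/_template/logstash-shards' -d '
{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "index.number_of_shards": 4
  }
}'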

I have issued a service restart on all the Logstash shippers and I am now waiting for some results.

On a positive note, I've not had ANY rejected errors:

node_name       name size active queue rejected
DATANODE-01     bulk    8      0     0        0
DATANODE-02     bulk    8      1     0        0
DATANODE-03     bulk    8      1     0        0
DATANODE-04     bulk    8      2     0        0
DATANODE-06     bulk    8      1     0        0
KIBANANODE-01   bulk    0      0     0        0
MASTERNODE-01   bulk    0      0     0        0
MASTERNODE-02   bulk    0      0     0        0
MASTERNODE-03   bulk    0      0     0        0

My bulk queue config right now is 500. Is there any better way to see what the actual number is?

@weltenwort

So far, I have not seen any errors, since my restart. It appears it may have been a VM contention issue.

I am seeing errors like this:

<13>Feb  9 16:31:21.259538 LOGSTASH-01 [2017-02-09T16:31:20,609][WARN ][logstash.outputs.elasticsearch] Failed action. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2017.02.09", :_type=>"logs", :_routing=>nil},
###SYSLOG MESSAGE REMOVED####
], :response=>{"index"=>{"_index"=>"logstash-2017.02.09", "_type"=>"logs", "_id"=>"AVOKYATBQ5O42TMJTX6N", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [version]", "caused_by"=>{"type"=>"number_format_exception", "reason"=>"For input string

Now, I do not believe these messages to be new. I think they were hidden behind all the es_rejected messages I have been troubleshooting for the past few months.

I am also seeing some of these:

[2017-02-09T15:33:35,886][WARN ][logstash.codecs.line     ]
 Received an event that has a different character encoding than you configured.

{:text=>"
 <13>Feb  9 15:33:35.886077 x.x.x.x <181>Feb  9 16:33:35 ######## CISE_RADIUS_Accounting 0002603270 2 0 2017-02-09 16:33:35.861 -05:00 0048620316 3002 NOTICE Radius-Accounting: RADIUS Accounting watchdog update,
 ConfigVersionId=472,
 Device IP Address=x.x.x.x,
 RequestLatency=2,
 NetworkDeviceName=########,
 User-Name=########,
 NAS-IP-Address=x.x.x.x,
 NAS-Port=50022,
 Service-Type=Framed,
 Framed-IP-Address=x.x.x.x,
 Class=CACS:0000000000000CF8467ECD31:########/272246235/1309,
 Called-Station-ID=########,
 Calling-Station-ID=########,
 Acct-Status-Type=Interim-Update,
 Acct-Delay-Time=0,
 Acct-Input-Octets=2909701447,
 Acct-Output-Octets=4261835252,
 Acct-Session-Id=0000435D,
 Acct-Authentic=RADIUS,
 Acct-Session-Time=23991903,
 Acct-Input-Packets=1026492660,
 Acct-Output-Packets=1271093589,
 undefined-52=   \\xC6,
 undefined-53=  \\u0001\\u0005,
 NAS-Port-Type=Ethernet,
 NAS-Port-Id=GigabitEthernet0/22,
 undefined-151=86E2F888,
 cisco-av-pair=audit-session-id=########,
 cisco-av-pair=connect-progress=Auth Open,
", :expected_charset=>"UTF-8"}

Interestingly enough, it seems the error related to encoding did NOT prevent the message from getting stored into Elasticsearch. I can find the exact log message I posted above via Kibana.

When I look at logs from that host, it seems other messages are just fine. I am not sure how it is possible that some messages are encoded one way and some are encoded another way. Any suggestions on how to handle that kind of a situation?

In the last hour, there were 27 errors related to: Received an event that has a different character encoding than you configured. But from the host that generated that error, I had 45,000 logs, so that is a minuscule number of errors.

Good to hear you are making progress. Let me try to answer the topics individually:

bulk queue size
You should be able to see the current bulk queue size if you add the queue_size column as in _cat/thread_pool?v&h=node_name,name,queue,queue_size. queue should be the current number of tasks in the queue and queue_size the maximum size of the queue. See the Elasticsearch Documentation for other available columns.

NUMBER_FORMAT_EXCEPTION in Logstash
This seems to be the same problem as described before, where Logstash tries to send a document to Elasticsearch that contains a field version of type string even though the mapping defines it as a numeric type. Maybe you've run afoul of Elasticsearch's dynamic mapping feature, which guesses the type of a field that has no explicit type in the mapping when it first sees a document containing that field. If the first version field contained a value that looked like a number, Elasticsearch determined the field type to be numeric. From then on it tries to parse the version field of every document in that index as a numeric value, which apparently fails for some documents. In that case I would recommend explicitly setting that field's type in the mapping to keyword in order to keep Elasticsearch from guessing incorrectly. You can find more details under dynamic mapping in the Elasticsearch documentation.
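
For example, a template along these lines (the name and pattern are just examples, and it only takes effect for newly created indices) would pin that field to keyword:

curl -XPUT 'http://localhost:9200/_template/logstash-version-as-keyword' -d '
{
  "template": "logstash-*",
  "order": 1,
  "mappings": {
    "_default_": {
      "properties": {
        "version": { "type": "keyword" }
      }
    }
  }
}'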

character encoding in Logstash
The default character encoding assumed by Logstash is UTF-8. This message indicates that the ingested line contains a byte sequence that can not be correctly parsed using that encoding. If your input has a different encoding than UTF-8 you can change it using the charset setting of the codec.
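
For example (the input type, port, and charset here are placeholders for whatever your pipeline actually uses):

input {
  tcp {
    port  => 5140
    codec => plain { charset => "ISO-8859-1" }
  }
}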

Thank you sir. That does help.

In terms of the encoding errors, is there some way I can watch for messages that are not recognized as UTF-8 and use something else on the fly?

Automatic detection of the encoding is a hard problem which can only be "solved" heuristically, and I don't think Logstash contains such functionality. The best way to deal with it would be to track down which sources emit non-UTF-8 encoded messages and either configure them to emit UTF-8 or change the charset setting in the corresponding input to match the source's encoding. I would watch out for Logstash warnings like the one you quoted above and look for escape sequences like

undefined-52=   \\xC6,
undefined-53=  \\u0001\\u0005,

to figure out the correct encoding.
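
As a quick sanity check outside of Logstash, you could also verify whether a captured raw sample is valid UTF-8 at all, for instance (sample.log being a hypothetical capture of the raw messages):

# iconv exits non-zero and reports an illegal input sequence if the bytes are not valid UTF-8
iconv -f UTF-8 -t UTF-8 sample.log > /dev/null && echo "valid UTF-8" || echo "not valid UTF-8"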

My issue is that, based on yesterday, only 27 of the 45,000+ messages that came from that source were flagged with incorrect encoding.

QUICK UPDATE:

So far, no more Kibana errors, Elasticsearch errors, or Logstash errors. Quick numbers (just pulled):

root:~$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "CLUSTER",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 13,
  "active_shards" : 13,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

root:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v&s=name&h=master,name,version,node.role,disk.avail,heap.max,ram.max,ram.current,cpu,uptime,jdk'
master name            version node.role disk.avail heap.max ram.max ram.current cpu uptime jdk
-      DATANODE-01     5.2.0   d              1.8tb   29.9gb  39.2gb      38.3gb  27    22h 1.8.0_121
-      DATANODE-02     5.2.0   d              1.8tb   29.9gb  39.2gb      37.8gb  23    22h 1.8.0_121
-      DATANODE-03     5.2.0   d              1.8tb   29.9gb  39.2gb      35.7gb   1    22h 1.8.0_121
-      DATANODE-04     5.2.0   d              1.8tb   29.9gb  39.2gb        39gb  11    22h 1.8.0_121
-      DATANODE-06     5.2.0   d              1.8tb   29.9gb  39.2gb        38gb  24    22h 1.8.0_121
-      KIBANANODE-01   5.2.0   -             39.5gb    7.9gb  15.6gb       9.9gb   0    22h 1.8.0_121
-      MASTERNODE-01   5.2.0   m             39.9gb    7.9gb  15.6gb       9.3gb   0    22h 1.8.0_121
-      MASTERNODE-02   5.2.0   m             39.9gb    7.9gb  15.6gb       9.6gb   0    22h 1.8.0_121
*      MASTERNODE-03   5.2.0   m             39.9gb    7.9gb  15.6gb       9.6gb   0    22h 1.8.0_121

root:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v&s=name&h=master,name,indexing.index_total,indexing.index_current,indexing.index_failed,indexing.delete_total'
master name            indexing.index_total indexing.index_current indexing.index_failed indexing.delete_total
-      DATANODE-01               27906525                      0                     0                     0
-      DATANODE-02               35359641                      0                     0                     0
-      DATANODE-03                7743716                      0                     0                     0
-      DATANODE-04               35068850                      0                     0                     0
-      DATANODE-06               35371378                      0                     0                     0
-      KIBANANODE-01                    0                      0                     0                     0
-      MASTERNODE-01                    0                      0                     0                     0
-      MASTERNODE-02                    0                      0                     0                     0
*      MASTERNODE-03                    0                      0                     0                     0

root:~$ curl -XGET 'http://localhost:9200/_cat/thread_pool/bulk?v&s=node_name&h=node_name,name,size,active,queue,rejected'
node_name       name size active queue rejected
DATANODE-01     bulk    8      1     0        0
DATANODE-02     bulk    8      0     0        0
DATANODE-03     bulk    8      0     0        0
DATANODE-04     bulk    8      1     0        0
DATANODE-06     bulk    8      1     0        0
KIBANANODE-01   bulk    0      0     0        0
MASTERNODE-01   bulk    0      0     0        0
MASTERNODE-02   bulk    0      0     0        0
MASTERNODE-03   bulk    0      0     0        0

root:~$ curl -XGET 'http://localhost:9200/_cat/thread_pool/bulk?v&s=node_name&h=node_name,name,queue,queue_size'
node_name       name queue queue_size
DATANODE-01     bulk     0        500
DATANODE-02     bulk     0        500
DATANODE-03     bulk     0        500
DATANODE-04     bulk     0        500
DATANODE-06     bulk     0        500
KIBANANODE-01   bulk     0         50
MASTERNODE-01   bulk     0         50
MASTERNODE-02   bulk     0         50
MASTERNODE-03   bulk     0         50

I still have DATANODE-05 with the ES service stopped right now.

Awesome to hear! I'll go ahead and mark this as resolved then, to ease the reading for other users. Feel free to open a new thread if you encounter any other problems.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.