Courier Fetch Errors

I recently upgraded my 6 data nodes from 4 CPUs to 8 CPUs each.

Now I am seeing these errors in Kibana that I've never seen before:

Shard Failures
The following shard failures occurred:

Index: logstash-2017.02.08 Shard: 0 Reason: {"type":"transport_serialization_exception","reason":"Failed to deserialize response of type [org.elasticsearch.search.fetch.FetchSearchResult]","caused_by":{"type":"illegal_state_exception","reason":"unexpected byte [0x69]"}}

Any suggestions?

Seems like an Elasticsearch error... could you try asking in the Elasticsearch discuss group?

I did. Many posts I've found point to Kibana. I don't know which is at fault, nor do I entirely care at this point. It is just another issue, for another made-up reason, in the string of problems I continue to have.

All my API checks show the ES cluster healthy.

I've spent months running down issue after issue, upgrading, rebuilding, etc.

All I did was shut down my hosts, change from 4 CPUs to 8 CPUs, and now I get this stupid a$$ error randomly.

I was able to snag this:

Discover: Unable to parse/serialize body

Error: Unable to parse/serialize body
    at respond (http://KIBANA/bundles/kibana.bundle.js?v=14695:14:6482)
    at checkRespForFailure (http://KIBANA/bundles/kibana.bundle.js?v=14695:14:6156)
    at http://KIBANA/bundles/kibana.bundle.js?v=14695:1:24479
    at processQueue (http://KIBANA/bundles/commons.bundle.js?v=14695:38:23621)
    at http://KIBANA/bundles/commons.bundle.js?v=14695:38:23888
    at Scope.$eval (http://KIBANA/bundles/commons.bundle.js?v=14695:39:4619)
    at Scope.$digest (http://KIBANA/bundles/commons.bundle.js?v=14695:39:2359)
    at Scope.$apply (http://KIBANA/bundles/commons.bundle.js?v=14695:39:5037)
    at done (http://KIBANA/bundles/commons.bundle.js?v=14695:37:25027)
    at completeRequest (http://KIBANA/bundles/commons.bundle.js?v=14695:37:28702)

It's almost like I need to reinstall.

If that's the case, that's fine, I just need to know.

Hi @kopacko,

sorry to hear that you're encountering so many problems. The two messages you posted both look like Elasticsearch is not responding correctly. You said your health checks indicated that the cluster is healthy. Would you mind posting the responses of calling http[s]://${YOURCLUSTER}/_cat/nodes?v and http[s]://${YOURCLUSTER}/_cat/indices?v?

Sure, give me a moment.

Here you go:

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v'
ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
x.x.x.x           28          97  24    2.90    2.56     2.40 di        -      DATANODE-04
x.x.x.x           50          96  47    2.78    2.77     2.65 di        -      DATANODE-06
x.x.x.x            2          75   0    0.00    0.00     0.00 m         -      MASTERNODE-02
x.x.x.x            2          75   0    0.00    0.01     0.01 m         *      MASTERNODE-03
x.x.x.x           19          99  23    3.11    2.99     2.87 di        -      DATANODE-01
x.x.x.x            6          97  15    2.12    2.33     2.48 di        -      DATANODE-05
x.x.x.x           43          90   0    0.00    0.00     0.00 -         -      KIBANANODE-01
x.x.x.x            7          98  38    2.45    2.60     2.84 di        -      DATANODE-02
x.x.x.x           52          96  22    2.38    2.15     2.07 di        -      DATANODE-03
x.x.x.x            2          75   0    0.00    0.00     0.00 m         -      MASTERNODE-01

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cat/indices?v&s=index'
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana             FeTfCUGPQXalqH2_ukmEOA   1   0        152            3    393.8kb        393.8kb
green  open   logstash-2017.01.30 -aIieugMQ2e2C1w3W4R_fw   6   0       3716            0      4.1mb          4.1mb
green  open   logstash-2017.01.31 YaJbI3DbTHCyazLZ2mXsVw   6   0    2782506            0      2.1gb          2.1gb
green  open   logstash-2017.02.01 UXJXqMxlQeWPhjdlWImVhA   6   0  144779115            0    240.5gb        240.5gb
green  open   logstash-2017.02.02 MGbreY1wSHGW59nU2aHQqw   6   0  140247069            0    239.6gb        239.6gb
green  open   logstash-2017.02.03 MyfveMTcQXCOcevpYL-REw   6   0  158329442            0    293.3gb        293.3gb
green  open   logstash-2017.02.04 T0dv7nM8Ssecrr0W9RbKHw   6   0  112505237            0    202.7gb        202.7gb
green  open   logstash-2017.02.05 5f6mnrjnQmOqA2BmBrjraw   6   0  106129925            0    189.3gb        189.3gb
green  open   logstash-2017.02.06 uIuBbhYxSbO6OkTQfrBvQQ   6   0  160544435            0    291.5gb        291.5gb
green  open   logstash-2017.02.07 euU_74VhTHizqCV8qN94og   6   0  138924007            5    260.7gb        260.7gb
green  open   logstash-2017.02.08 o4g_UvovQ62c1oc5fTH1ew   6   0   96698802            1    187.6gb        187.6gb

Here are a few others:

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "CLUSTER",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 61,
  "active_shards" : 61,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v&s=name&h=master,name,version,node.role,disk.avail,heap.max,ram.max,ram.current,cpu,uptime,jdk'
master name            version node.role disk.avail heap.max ram.max ram.current cpu uptime jdk
-      DATANODE-01     5.2.0   di             1.5tb   29.9gb  39.2gb      38.4gb  20  12.4h 1.8.0_121
-      DATANODE-02     5.2.0   di             1.5tb   29.9gb  39.2gb      38.1gb  20  12.4h 1.8.0_121
-      DATANODE-03     5.2.0   di             1.5tb   29.9gb  39.2gb      37.9gb  11  12.4h 1.8.0_121
-      DATANODE-04     5.2.0   di             1.5tb   29.9gb  39.2gb      38.1gb  25  12.4h 1.8.0_121
-      DATANODE-05     5.2.0   di             1.5tb   29.9gb  39.2gb      37.9gb  22  12.4h 1.8.0_121
-      DATANODE-06     5.2.0   di             1.5tb   29.9gb  39.2gb      38.5gb  12  12.4h 1.8.0_121
-      KIBANANODE-01   5.2.0   -             38.6gb   11.9gb  15.6gb      14.1gb   0  12.4h 1.8.0_121
-      MASTERNODE-01   5.2.0   m             39.3gb   15.9gb  23.5gb      17.6gb   0  12.4h 1.8.0_121
-      MASTERNODE-02   5.2.0   m             39.3gb   15.9gb  23.5gb      17.6gb   0  12.4h 1.8.0_121
*      MASTERNODE-03   5.2.0   m             39.3gb   15.9gb  23.5gb      17.6gb   0  12.4h 1.8.0_121

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v&s=name&h=master,name,indexing.index_total,indexing.index_current,indexing.index_failed,indexing.delete_total'
master name            indexing.index_total indexing.index_current indexing.index_failed indexing.delete_total
-      DATANODE-01               12310817                      0                     0                     0
-      DATANODE-02               12138538                      0                     8                     0
-      DATANODE-03               12316266                      1                     0                     0
-      DATANODE-04               12318182                      1                     0                     0
-      DATANODE-05               12313275                      0                     0                     0
-      DATANODE-06               12134054                      2                     8                     0
-      KIBANANODE-01                    0                      0                     0                     0
-      MASTERNODE-01                    0                      0                     0                     0
-      MASTERNODE-02                    0                      0                     0                     0
*      MASTERNODE-03                    0                      0                     0                     0

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v&s=name&h=master,name,merges.total,merges.current,merges.total_size,merges.current_size'
master name            merges.total merges.current merges.total_size merges.current_size
-      DATANODE-01           3335              1            82.3gb             957.5mb
-      DATANODE-02           3198              0            82.1gb                  0b
-      DATANODE-03           3040              0            82.8gb                  0b
-      DATANODE-04           3106              2            81.3gb                 1gb
-      DATANODE-05           3107              0            82.9gb                  0b
-      DATANODE-06           3220              1              81gb             968.5mb
-      KIBANANODE-01            0              0                0b                  0b
-      MASTERNODE-01            0              0                0b                  0b
-      MASTERNODE-02            0              0                0b                  0b
*      MASTERNODE-03            0              0                0b                  0b

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v&s=name&h=master,name,search.fetch_total,search.fetch_current,search.query_total,search.query_current'
master name            search.fetch_total search.fetch_current search.query_total search.query_current
-      DATANODE-01                   63                    0              14149                    0
-      DATANODE-02                   63                    0              14149                    0
-      DATANODE-03                   60                    0              14149                    0
-      DATANODE-04                   60                    0              14149                    0
-      DATANODE-05                   61                    0              14118                    0
-      DATANODE-06                17962                    0              32048                    0
-      KIBANANODE-01                  0                    0                  0                    0
-      MASTERNODE-01                  0                    0                  0                    0
-      MASTERNODE-02                  0                    0                  0                    0
*      MASTERNODE-03                  0                    0                  0                    0

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/_cat/nodes?v&s=name&h=master,name,segments.count,segments.memory,suggest.total,suggest.current'
master name            segments.count segments.memory suggest.total suggest.current
-      DATANODE-01              318         413.3mb             0               0
-      DATANODE-02              312         406.6mb             0               0
-      DATANODE-03              318         413.8mb             0               0
-      DATANODE-04              302         410.3mb             0               0
-      DATANODE-05              296         406.4mb             0               0
-      DATANODE-06              299         407.3mb             0               0
-      KIBANANODE-01              0              0b             0               0
-      MASTERNODE-01              0              0b             0               0
-      MASTERNODE-02              0              0b             0               0
*      MASTERNODE-03              0              0b             0               0

Oh and it is Kibana version 5.2.0.

KIBANANODE-01:~$ sudo /usr/share/kibana/bin/kibana --version
5.2.0
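
For what it's worth, Kibana's own status endpoint can be used to double-check the version and its view of the Elasticsearch connection; this is just a sketch, assuming the default port 5601 and no custom base path:

KIBANANODE-01:~$ curl -XGET 'http://localhost:5601/api/status'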

The plot thickens:

Shard Failures
The following shard failures occurred:

Index: logstash-2017.02.08 Shard: 0 Reason: {"type":"node_not_connected_exception","reason":"[DATANODE-05][x.x.x.x:9300] Node not connected"}

As soon as I saw this, I checked the APIs and the node in question was green, all shards accounted for, etc.

EDIT UPDATE: I can repeat this error anytime I search for something in Kibana that does not exist.
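
To rule Kibana out, the same kind of request can be reproduced directly against Elasticsearch; a minimal sketch (the field and search term are just placeholders for something that returns no hits):

MASTERNODE-01:~$ curl -XGET 'http://localhost:9200/logstash-2017.02.08/_search?q=message:doesnotexist&pretty'

If the _shards section of that response reports the same failures, Kibana is only the messenger.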

I was able to pull this from a Logstash shipper's logs:

[2017-02-08T13:31:08,363][WARN ][logstash.outputs.elasticsearch] Failed action.
 {:status=>500, :action=>["index", {:_id=>nil, :_index=>"logstash-2017.02.08", :_type=>"logs", :_routing=>nil},
 ####<SYSLOG MESSAGE REMOVED>####],
 :response=>{"index"=>{"_index"=>"logstash-2017.02.08",
   "_type"=>"logs",
   "_id"=>"AVofNUFqw60d2OLj9AA8",
   "status"=>500,
   "error"=>{"type"=>"illegal_state_exception",
   "reason"=>"Message not fully read (request) for requestId [8882329],
   action [indices:data/write/bulk[s]],
   available [14377]; resetting"}}}}

@ppisljar

New topic created for the Logstash-related errors that I think are tied to the Kibana errors.

All those errors indicate that something is wrong with the communication within the Elasticsearch cluster. Does the log output from the Elasticsearch nodes (e.g. DATANODE-05) show anything that could be correlated with this? Where are you hosting your cluster? If it is AWS, there are known issues caused by the networking setup of the AWS Ubuntu images, as described in http://logz.io/blog/elasticsearch-cluster-disconnects/.
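
If it helps with narrowing that down, the per-node transport statistics can show whether inter-node connections are churning; a minimal sketch:

curl -XGET 'http[s]://${YOURCLUSTER}/_nodes/stats/transport?pretty'

A server_open count that fluctuates between calls could point at connections being dropped and re-established rather than at Kibana.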

On-site servers.

Well, there is more to the story.

Last night, I completely rebuilt these 10 servers. I deleted their disks and started from scratch.

As soon as I started the Logstash service on one of my shipping servers, I was immediately hit with these errors. On the ES side, I am getting errors on all 6 data nodes that are receiving the logs.

UPDATE: I had turned off shipping to one of my data center clusters while I finalized my setup with the 6-datanode cluster. Given that, the CPU upgrade is the apparent root cause of the issues. What I did this morning was turn off shipping to the 6-datanode cluster and point it back at my older 4-datanode cluster, the one whose CPUs I had not upgraded from 4 to 8. So far, I have not received a SINGLE error.

What are the errors you are seeing in the Elasticsearch output?

Here you go:

[2017-02-08T23:45:05,054][DEBUG][o.e.a.b.TransportShardBulkAction] [KNOX-LOGDB-01] [logstash-2017.02.08][5] failed to execute bulk item (index) index {
[logstash-2017.02.08][wineventlog][AVohMGiU11AUrPoYaON-],
<#### LOG MESSAGE REMOVED ####>
}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [process_id]
        at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:298) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:438) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:564) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:384) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:361) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:275) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:533) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:510) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:196) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:201) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:348) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.index(TransportShardBulkAction.java:155) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.handleItem(TransportShardBulkAction.java:134) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.onPrimaryShard(TransportShardBulkAction.java:120) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.onPrimaryShard(TransportShardBulkAction.java:73) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportWriteAction.shardOperationOnPrimary(TransportWriteAction.java:76) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportWriteAction.shardOperationOnPrimary(TransportWriteAction.java:49) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:914) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:884) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:327) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:262) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:864) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:861) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:142) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1652) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:873) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.access$400(TransportReplicationAction.java:92) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:279) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:258) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:250) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:610) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.NumberFormatException: For input string: "0x0000000000000c08"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_121]
        at java.lang.Long.parseLong(Long.java:589) ~[?:1.8.0_121]
        at java.lang.Long.parseLong(Long.java:631) ~[?:1.8.0_121]
        ... 40 more

What this exception indicates is that the field process_id is mapped as a long, but the document being indexed contains the string "0x0000000000000c08" in that field. So either the mapping is incorrect or Logstash is shipping documents with incorrect field types. I am not sure how that could result in the cluster returning errors like the ones shown before, but maybe it's a result of getting swamped with such indexing errors. Are there any errors regarding cluster membership in the logs?
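
To confirm the mapping side, the field mapping API shows how process_id is currently mapped in the affected index; a minimal sketch:

curl -XGET 'http[s]://${YOURCLUSTER}/logstash-2017.02.08/_mapping/field/process_id?pretty'

If it does come back as long, the options are roughly to map process_id as keyword in the Logstash index template, or to have Logstash normalize the hex string to a number before shipping.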