viniciof
(Vinicio Flores)
December 11, 2019, 9:20pm
1
Hi all,
I noticed something very strange today: our Logstash workers keep emitting data to two of our ES clusters at the expected rate. Disk usage on one of them grows as expected, but on the other it stays flat (and most of the data nodes show rejected events in their thread pools).
Problematic cluster
Good cluster
How can I find out what's wrong?
rgds,
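For reference, the rejections mentioned above can be inspected directly from Elasticsearch; a minimal check using the standard thread pool APIs (the column selection is just illustrative):

GET _cat/thread_pool/bulk,write?v&h=node_name,name,active,queue,rejected
GET _nodes/stats/thread_pool

A steadily growing rejected count on the write/bulk pool of the problematic cluster's data nodes points at Elasticsearch pushing back rather than at Logstash itself.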
What do your Logstash outputs look like?
viniciof
(Vinicio Flores)
December 12, 2019, 3:13pm
3
They send bulk requests, like the following:
bulk_requests.json
{
  "nodes": {
    "a5voC7oXSf-78LfPlJ45Yg": {
      "name": "arm-or-009_ingest",
      "transport_address": "**.**.**.**:9600",
      "host": "mytinynode1.server.com",
      "ip": "**.**.**.**:9600",
      "roles": [
        "ingest"
… (file truncated)
What does the configuration look like? Do you have a separate output plugin per cluster?
viniciof
(Vinicio Flores)
December 12, 2019, 9:23pm
5
I have a separate output plugin for each; here are the Logstash pipeline outputs:
Problematic one
output {
  elasticsearch {
    hosts => [
      "arm-or-006.myserver.com:9996",
      "arm-or-007.myserver.com:9996",
      "arm-or-008.myserver.com:9996",
      "arm-or-010.myserver.com:9996"
    ]
    ssl => true
    cacert => "/app/ssl/cert.pem"
    user => "myuser"
    password => "mypass"
    document_type => "arm"
    document_id => "%{ibi_id}"
    index => "%{ibi_target}-%{+YYYY-MM}"
    doc_as_upsert => true
    action => "update"
    retry_max_interval => 5
    retry_on_conflict => 5
    flush_size => 10000
    timeout => 1000000
  }
}
Good one
output {
  elasticsearch {
    hosts => [
      "arm-lc-001.myserver.com:9996",
      "arm-lc-003.myserver.com:9996",
      "arm-lc-004.myserver.com:9996",
      "arm-lc-005.myserver.com:9996"
    ]
    ssl => true
    cacert => "/app/ssl/cert.pem"
    ssl_certificate_verification => false
    user => "myuser"
    password => "mypass"
    document_type => "arm"
    document_id => "%{ibi_id}"
    index => "%{ibi_target}-%{+YYYY-MM}"
    doc_as_upsert => true
    action => "update"
    retry_max_interval => 5
    retry_on_conflict => 5
    flush_size => 10000
    timeout => 1000000
  }
}
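With action => "update" and doc_as_upsert => true, each event is sent as a bulk update action; roughly like the following sketch (index, type, id and document fields are illustrative, not taken from the thread):

POST _bulk
{ "update" : { "_index" : "daas-arm-prod-users-2019-12", "_type" : "arm", "_id" : "some-ibi-id", "retry_on_conflict" : 5 } }
{ "doc" : { "some_field" : "some_value" }, "doc_as_upsert" : true }

Failures on individual items (for example mapping conflicts, or rejections from a full write queue) come back as per-item errors in the bulk response, which is where the two clusters can behave differently even though Logstash feeds both at the same rate.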
viniciof
(Vinicio Flores)
December 12, 2019, 11:56pm
6
Also, this is what I see in hot threads:
gistfile1.txt
::: {arm-or-001_master}{vACvBzd5RiqvDDgWNvY6EQ}{TrPh5fRwRjKvxpYgV2h_oA}{plxcq8197.myserver.com}{x.x.x.x:9301}{ibi_site=pdx, box_type=hot}
   Hot threads at 2019-12-12T23:52:10.447Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

    0.0% (50.8micros out of 500ms) cpu usage by thread 'elasticsearch[arm-or-001_master][[timer]]'
     10/10 snapshots sharing following 2 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.threadpool.ThreadPool$CachedTimeThread.run(ThreadPool.java:541)

::: {arm-or-009_data}{f0aSRTkDQommICaE9nQRBg}{JMnusso4RfCBpFObfysMTA}{plxcq8205.myserver.com}{x.x.x.x:9300}{ibi_site=pdx, box_type=hot}
   Hot threads at 2019-12-12T23:52:10.410Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
… (file truncated)
@Christian_Dahlqvist, what does this mean?
I have never run bulk updates, so I am not sure whether errors here would cause the update to be retried from Logstash or simply dropped. You seem to have a lot of time spent on management. Do you have a very large number of shards in the cluster? Are you using dynamic mappings? Do the hardware profiles supporting the clusters differ, especially with respect to the type of storage used? Is there anything in the Elasticsearch logs?
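For the shard-count and mapping questions, the cat APIs give a quick overview; a sketch using standard endpoints (substitute your own index name in the last call):

GET _cat/indices?v&h=index,pri,rep,docs.count,store.size
GET _cat/shards?v
GET your-index-name/_mapping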
viniciof
(Vinicio Flores)
December 13, 2019, 5:10pm
8
Here is all the sharding info in my cluster:
Here's the mapping info for the most problematic index we have in the cluster (I'm using a dynamic mapping template for it).
We use SAN/LUN-based storage (on the order of TBs of space) and all servers have the same specs. How can I find out if it's due to bad disk I/O?
I didn't find anything in the ES logs, though.
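One way to approach the disk-I/O question (a sketch, not something from the thread): compare Elasticsearch's own filesystem stats with OS-level numbers on the data nodes.

GET _nodes/stats/fs?human
# per-node disk usage and, on Linux, io_stats; then on each data node at the OS level:
# iostat -x 5    (watch await/%util on the SAN LUNs while indexing is running)

High device utilization or long await times while the thread-pool queues fill up would point at the SAN rather than at Elasticsearch itself.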
viniciof
(Vinicio Flores)
December 13, 2019, 7:49pm
9
I also see that my thread pools are very full:
@Christian_Dahlqvist, how can I find out the details of those specific threads taking all the available slots in each queue?
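This is usually inspected with the hot threads and tasks APIs, which show what the busy threads and queued work actually are (standard endpoints, parameters illustrative):

GET _nodes/hot_threads?threads=10
GET _cat/thread_pool/bulk,write?v&h=node_name,name,active,queue,rejected,completed
GET _tasks?detailed=true&actions=*bulk*

The tasks call lists the in-flight bulk/update requests per node, which is about as close as you can get to seeing "what is sitting in the queue".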
viniciof
(Vinicio Flores)
December 13, 2019, 7:55pm
10
And here are my cluster's global settings:
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "awareness": {
            "attributes": ""
          }
        }
      }
    },
    "indices": {
      "breaker": {
        "fielddata": {
          "limit": "60%"
        },
        "request": {
          "limit": "30%"
        }
      }
    }
  },
  "transient": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "256mb"
      }
    }
  }
}
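These settings by themselves don't obviously explain the slow indexing; if the Elasticsearch version supports it, the full effective configuration (including defaults) can be dumped and compared between the two clusters, e.g.:

GET _cluster/settings?include_defaults=true&flat_settings=true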
viniciof
(Vinicio Flores)
December 17, 2019, 12:20am
11
@Christian_Dahlqvist, can you help? Let me know if any more information is needed.
viniciof
(Vinicio Flores)
December 17, 2019, 3:23pm
12
Hi @Christian_Dahlqvist,
Here are my node stats. I notice one of my nodes (arm-or-009_data) is at 99% memory utilization.
nodes stats
{
  "_nodes": {
    "total": 20,
    "successful": 20,
    "failed": 0
  },
  "cluster_name": "ibi.arm2.us",
  "nodes": {
    "a5voC7oXSf-78LfPlJ45Yg": {
      "timestamp": 1576595678727,
… (file truncated)
And these are the stats for the most problematic (slow-indexing) index in the cluster, "daas-arm-prod-users-2019-12-new":
gistfile1.txt
{
  "_shards": {
    "total": 80,
    "successful": 80,
    "failed": 0
  },
  "_all": {
    "primaries": {
      "docs": {
        "count": 6717777,
… (file truncated)
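The 99% figure for arm-or-009_data can be narrowed down from the same APIs; a quick way to separate JVM heap usage from OS file-system cache usage (a sketch, standard endpoints):

GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,node.role
GET daas-arm-prod-users-2019-12-new/_stats/indexing,merge,refresh?human

ram.percent near 100% is normal (it includes the page cache), whereas a consistently high heap.percent together with long GC pauses is what usually correlates with write-queue rejections.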
Are there any error messages in the Elasticsearch logs? Can you try enabling the dead-letter queue to see if this captures any errors that would otherwise be ignored/dropped?
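A minimal sketch of what enabling the dead-letter queue might look like, assuming a Logstash version with DLQ support (paths are illustrative):

# logstash.yml
dead_letter_queue.enable: true
path.dead_letter_queue: "/app/logstash/data/dead_letter_queue"

# events landing there can be read back with the dead_letter_queue input plugin:
input {
  dead_letter_queue {
    path => "/app/logstash/data/dead_letter_queue"
    commit_offsets => true
  }
}

Note that the DLQ only captures events the elasticsearch output gives up on (for example mapping errors returning 400/404); 429 rejections keep being retried and never land there.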
viniciof
(Vinicio Flores)
December 18, 2019, 10:42pm
14
I checked the logs and there don't seem to be any errors. There's only one message that the cluster complains about a lot:
[2019-12-18T14:39:44,682][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [arm-or-002_master] collector [cluster_stats] timed out when collecting data
and
[2019-12-18T14:29:24,586][ERROR][o.e.x.m.c.i.IndexStatsCollector] [arm-or-002_master] collector [index-stats] timed out when collecting data
Whenever this is logged, it causes a "blank" patch in the cluster overview section of Monitoring in Kibana (as if the cluster were unresponsive during that time).
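Those errors mean the X-Pack monitoring collectors could not gather cluster/index stats within their timeout (10s by default), which fits an overloaded master or data tier; the timeouts themselves can be raised if needed (a sketch, values illustrative):

# elasticsearch.yml on the nodes doing monitoring collection
xpack.monitoring.collection.cluster.stats.timeout: 30s
xpack.monitoring.collection.index.stats.timeout: 30s

Raising them only hides the gaps in the Kibana monitoring charts, though; the underlying slowness is still the thing to chase.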
system
(system)
Closed
January 15, 2020, 10:42pm
15
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.