Hi guys, thanks for reading, and sorry for my poor English.
I have a situation where Elasticsearch causes my Linux kernel to crash. However, the information I get from the Linux message log is garbled. What's worse, I found nothing in my Elasticsearch logs (info level). The crash happens at least once a day.
Linux message log:
(There is plenty of output like this. I asked Red Hat for help and they just said the I/O is too heavy. The cloud service provider says their software is OK.)
[12655.166750] sd 14:65535:11:0: [sdi] CDB: Write(10) 2a 00 40 87 42 d8 00 01 70 00
[12655.188749] SD100EP: [ERR][epfront_scmd_printk][1096]: scsi_cmnd retrying: serial_number[4192555] retries[1], allowed[5]
[12655.188756] sd 14:65535:11:0: [sdi] CDB: Write(10) 2a 00 40 87 45 78 00 00 50 00
[12655.211419] SD100EP: [ERR][epfront_scmd_printk][1096]: scsi_cmnd retrying: serial_number[4192614] retries[1], allowed[5]
[12655.211424] sd 14:65535:11:0: [sdi] CDB: Write(10) 2a 00 40 87 47 38 00 00 78 00
[12655.224413] SD100EP: [ERR][epfront_scmd_printk][1096]: scsi_cmnd retrying: serial_number[4192637] retries[1], allowed[5]
[12655.224418] sd 14:65535:11:0: [sdi] CDB: Write(10) 2a 00 40 87 47 a8 00 02 70 00
[12655.400653] SD100EP: [ERR][epfront_io_send][2198]: alloc_iod failed, no memory
[12655.402837] SD100EP: [ERR][epfront_io_send][2198]: alloc_iod failed, no memory
After that, the Linux kernel dies.
Here is my Linux environment:
Linux version 3.10.0-514.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11)
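For anyone reading the message log, the Write(10) CDB lines can be decoded to see how large each retried write is. This is only a minimal Python sketch: the function name `decode_write10` is made up for illustration, and the 512-byte block size is an assumption (the real logical block size of sdi is not shown in the log).

```python
import re

# Matches the ten hex bytes that follow "CDB: Write(10)" in a kernel message line.
CDB_RE = re.compile(r"CDB: Write\(10\)((?: [0-9a-fA-F]{2}){10})")

def decode_write10(line, block_size=512):
    """Return (lba, blocks, byte_count) for a Write(10) CDB log line, or None.

    block_size is an assumption; check the device's logical block size.
    """
    m = CDB_RE.search(line)
    if m is None:
        return None
    cdb = bytes.fromhex(m.group(1).replace(" ", ""))
    if cdb[0] != 0x2A:  # 0x2A is the WRITE(10) opcode
        return None
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks, blocks * block_size

line = "[12655.166750] sd 14:65535:11:0: [sdi] CDB: Write(10) 2a 00 40 87 42 d8 00 01 70 00"
print(decode_write10(line))  # (1082606296, 368, 188416)
```

Under that assumption, the first retried command in the log is a write of 368 blocks (184 KiB) to sdi.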
JVM version:
1.8.0_91-b14
30 GB for each instance; each machine runs four instances.
Hardware:
1556core, 256 GB RAM, 22 TB SSD, 2 x 6 TB SAS
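As a rough memory budget for one machine, here is a small Python sketch. It assumes the 30 GB per instance is the JVM heap size, which the numbers above do not state explicitly; the function name `memory_budget` is just for illustration.

```python
def memory_budget(ram_gb, instances, heap_gb):
    """Return (total_heap_gb, remaining_gb, heap_fraction) for one machine."""
    total_heap = instances * heap_gb
    remaining = ram_gb - total_heap  # left for OS, page cache, off-heap memory
    return total_heap, remaining, total_heap / ram_gb

# 256 GB RAM and four instances are from the post;
# treating 30 GB as the heap per instance is an assumption.
total_heap, remaining, fraction = memory_budget(256, 4, 30)
print(total_heap, remaining, fraction)  # 120 136 0.46875
```

So under this assumption the four heaps account for about 47% of RAM, and the rest is shared by the kernel, the page cache, and any off-heap memory the JVMs use.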
Logstash config:
http.port: 9600-9700
pipeline.workers: 80
pipeline.batch.size: 1500
#pipeline.batch.delay: 100
config.reload.automatic: true
config.reload.interval: 3s
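To get a feel for what these Logstash settings mean, the number of in-flight events per pipeline is the product of pipeline.workers and pipeline.batch.size. A minimal Python sketch follows; the average event size of 1 KiB is a purely hypothetical figure, not something measured on this system.

```python
def inflight_events(workers, batch_size):
    """Events held in memory at once by one pipeline: workers x batch size."""
    return workers * batch_size

def inflight_bytes(workers, batch_size, avg_event_bytes):
    """Rough payload size of the in-flight events (ignores object overhead)."""
    return inflight_events(workers, batch_size) * avg_event_bytes

events = inflight_events(80, 1500)  # values from the config above
print(events)  # 120000

# avg_event_bytes=1024 is a hypothetical value for illustration only.
print(inflight_bytes(80, 1500, 1024) / 2**20, "MiB")
```

With 80 workers and a batch size of 1500, one pipeline can have up to 120000 events in flight, each batch of which ends up as a bulk write against Elasticsearch.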
Elasticsearch config:
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
node.max_local_storage_nodes: 10
thread_pool.get.queue_size: 10000
thread_pool.write.queue_size: 10000
thread_pool.analyze.queue_size: 1000
thread_pool.search.queue_size: 10000
thread_pool.listener.queue_size: 10000
bootstrap.system_call_filter: false
node.attr.disk_type: ssd
node.attr.zone: data-3-90
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: data-2-52,data-3-90,data-3-222,data-3-19,data-2-126,data-2-89,data-2-153,data-2-211,data-2-4,data-2-146,data-3-220,data-2-38
#transport.compress: true
I checked the JVM GC and it looks OK. But ES takes nearly 99% of the memory on my server (ES JVM plus buffers). The crash happens at least once a day.
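Since the 99% figure includes buffers, it may help to separate process memory from buffers and page cache, which the kernel can normally reclaim. Below is a minimal Python sketch that parses /proc/meminfo-style text. The function names are made up for illustration, and the sample numbers are hypothetical, not taken from this server.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info

def memory_breakdown(info):
    """Split memory usage into cache and non-cache shares (percent of total)."""
    total = info["MemTotal"]
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    used = total - info["MemFree"]
    return {
        "used_pct": 100.0 * used / total,                 # includes cache
        "cache_pct": 100.0 * cache / total,               # buffers + page cache
        "used_without_cache_pct": 100.0 * (used - cache) / total,
    }

# Hypothetical sample values, for illustration only.
SAMPLE = """MemTotal:       1000000 kB
MemFree:          10000 kB
Buffers:          40000 kB
Cached:          450000 kB
"""

breakdown = memory_breakdown(parse_meminfo(SAMPLE))
print(breakdown)
```

On a real machine the same functions can be fed the contents of /proc/meminfo. In the hypothetical sample, 99% of memory looks used, but about half of that is buffers and cache rather than process memory.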