Hardware crashing because of Elasticsearch

Hi Folks,

The machine on which Elasticsearch is the only running process keeps crashing. Since it is the sole process, Elasticsearch is very likely the cause. I need help finding which part of Elasticsearch is causing the issue.

When the hardware crashes, I have to go to the lab and press the reboot button manually. Once the hardware is back up, system-level logs such as "messages" contain nothing from before the crash, only entries written after the reboot.

ELASTICSEARCH AND MY APPLICATION:
I receive about 1000 syslog messages per second. One of my applications processes them and sends them to Elasticsearch for indexing. Another of my applications then queries Elasticsearch over an HTTP connection to fetch the information it needs.

Below are the index and shard counts with a few of their details, the Elasticsearch details, the disk usage, the RAM usage (captured 1 minute before the cron job itself went down), and the exceptions that are continuously thrown in /var/log/elasticsearch/log-collector.log.

One interesting detail: the Elasticsearch process crashed on 20th Dec at 12:30. It was the last process to go down before the hardware crashed; I worked this out from the logs. The cron job had already died about an hour earlier. So Elasticsearch was the last process still running, and then the hardware crashed.

Elasticsearch related log:
2018-12-20 12:13:14,143 [WARN ]Log Receive Rate is 2698 and Sent to ES 2661 and Forward to 0

Last updated time in the cronjob script output:
Taking the server performance for time Thu Dec 20 11:13:01 UTC 2018

ELASTICSEARCH DETAILS:
[root@LogCollector ~]# curl -s --insecure --tlsv1.2 'https://127.0.0.1:9200/_cluster/stats'
{"_nodes":{"total":1,"successful":1,"failed":0},
"cluster_name":"log-collector","timestamp":1546939759196,"status":"green",

"indices":{"count":729,"shards":{"total":1457,"primaries":1457,"replication":0.0,"index":{"shards":{"min":1,"max":2,"avg":1.9986282578875172},"primaries":{"min":1,"max":2,"avg":1.9986282578875172},"replication":{"min":0.0,"max":0.0,"avg":0.0}}},"docs":{"count":6213465976,"deleted":0},"store":{"size_in_bytes":1219155608349,"throttle_time_in_millis":0},"fielddata":{"memory_size_in_bytes":3537760,"evictions":0},"query_cache":{"memory_size_in_bytes":1356991482,"total_count":16841806,"hit_count":16294115,"miss_count":547691,"cache_size":53971,"cache_count":60458,"evictions":6487},"completion":{"size_in_bytes":0},"segments":{"count":27841,"memory_in_bytes":5689037001,"terms_memory_in_bytes":4139017064,"stored_fields_memory_in_bytes":1202970960,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":640,"points_memory_in_bytes":336146733,"doc_values_memory_in_bytes":10901604,"index_writer_memory_in_bytes":11392772,"version_map_memory_in_bytes":1564290,"fixed_bit_set_memory_in_bytes":0,"max_unsafe_auto_id_timestamp":1546938000170,"file_sizes":{}}},

"nodes":{"count":{"total":1,"data":1,"coordinating_only":0,"master":1,"ingest":1},"versions":["5.5.1"],"os":{"available_processors":4,"allocated_processors":4,"names":[{"name":"Linux","count":1}],"mem":{"total_in_bytes":33007747072,"free_in_bytes":630726656,"used_in_bytes":32377020416,"free_percent":2,"used_percent":98}},"process":{"cpu":{"percent":19},"open_file_descriptors":{"min":4058,"max":4058,"avg":4058}},"jvm":{"max_uptime_in_millis":1283022363,"versions":[{"version":"1.8.0-jdk8u132-b00","vm_name":"OpenJDK 64-Bit Server VM","vm_version":"25.71-b20161115","vm_vendor":"Oracle Corporation","count":1}],"mem":{"heap_used_in_bytes":11916407512,"heap_max_in_bytes":16071262208},"threads":60},"fs":{"total_in_bytes":2615087370240,"free_in_bytes":1394096578560,"available_in_bytes":1394096578560,"spins":"true"},"plugins":[{"name":"search-guard-5","version":"5.5.1-15","description":"Provide access control related features for Elasticsearch 5","classname":"com.floragunn.searchguard.SearchGuardPlugin","has_native_controller":false}],"network_types":{"transport_types":{"com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport":1},"http_types":{"com.floragunn.searchguard.http.SearchGuardHttpServerTransport":1}}}}

FEW INDICES:
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   dummy12.13-23 hEbfgMgnTjOCyNcI6CU_ng   2   0   10446032            0      1.9gb          1.9gb
green  open   dummy12.15-05 x9Qfu1VZSI2Q2OD1GJncUg   2   0   10610653            0      1.9gb          1.9gb

INDICES COUNT:

curl -s --insecure --tlsv1.2 'https://localhost:9200/_cat/indices?v' | wc -l
730

FEW SHARDS:
dummy11.30-18 0 p STARTED  719487 134.3mb 172.30.157.10 vjtpKa1
dummy12.07-19 1 p STARTED 5339871 970.9mb 172.30.157.10 vjtpKa1

SHARDS COUNT:

curl -s --insecure --tlsv1.2 "https://localhost:9200/_cat/shards" | wc -l
1457
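
One way to read the cluster stats and shard count above is to relate them to the JVM heap. The following is a minimal sketch that only does arithmetic on figures copied from the `_cluster/stats` output; the helper name `heap_pressure` and the keys it returns are made up for illustration. The often-quoted guideline of roughly 20 shards per GB of heap is a rule of thumb from Elastic's sizing advice, not a hard limit.

```python
GIB = 1024 ** 3

# Figures copied verbatim from the _cluster/stats output above.
STATS = {
    "shards_total": 1457,
    "store_size_in_bytes": 1219155608349,
    "segments_memory_in_bytes": 5689037001,
    "query_cache_memory_in_bytes": 1356991482,
    "heap_used_in_bytes": 11916407512,
    "heap_max_in_bytes": 16071262208,
}


def heap_pressure(stats):
    """Relate shard count and in-heap structures to the JVM heap size.

    Hypothetical helper for illustration; not part of Elasticsearch.
    """
    heap_max = stats["heap_max_in_bytes"]
    return {
        "heap_max_gib": heap_max / GIB,
        "heap_used_fraction": stats["heap_used_in_bytes"] / heap_max,
        "segments_fraction_of_heap": stats["segments_memory_in_bytes"] / heap_max,
        "query_cache_fraction_of_heap": stats["query_cache_memory_in_bytes"] / heap_max,
        "shards_per_gib_heap": stats["shards_total"] / (heap_max / GIB),
        "avg_shard_size_gib": stats["store_size_in_bytes"] / stats["shards_total"] / GIB,
    }


if __name__ == "__main__":
    for name, value in heap_pressure(STATS).items():
        print(f"{name}: {value:.2f}")
```

With these figures the node carries close to 100 shards per GiB of heap, and segment metadata alone occupies about a third of the heap. The index names look like one index per hour; if that is the case, fewer and larger indices would probably ease the memory pressure.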

DISK USAGE
[root@LogCollector ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/dummy_vg-root
50G 3.3G 44G 8% /
tmpfs 16G 0 16G 0% /dev/shm
/dev/sda1 485M 33M 428M 8% /boot
/dev/mapper/dummy_dummy-dummy
300G 1.3G 299G 1% /var
/dev/mapper/dummy_vg-dummy
2.4T 1.2T 1.3T 47% /var/lib/elasticsearch

RAM USAGE:
total used free shared buffers cached
Mem: 31478 30546 931 0 4 12373
-/+ buffers/cache: 18167 13310
Swap: 4095 254 3841

EXCEPTIONS in /var/log/elasticsearch/log-collector.log:
[2019-01-08T10:15:03,943][DEBUG][o.e.a.s.TransportSearchAction] [vjtpKa1] [junoslogs-2019.01.07-22][1], node[vjtpKa1nQ3ecF2cOayhctA], [P], s[STARTED], a[id=XE_QsNMZRFCwBl-QSGtgJg]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=, indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=, routing='null', preference='null', requestCache=false, scroll=null, source={
..
..
..
org.elasticsearch.transport.RemoteTransportException: [vjtpKa1][172.30.157.10:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TransportService$7@6acd74b5 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@42a67388[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 3350278]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50) ~[elasticsearch-5.5.1.jar:5.5.1]
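
The EsRejectedExecutionException above says that the search thread pool was full: all 7 threads were active and the queue had reached its capacity of 1000, so new search requests were rejected. As a rough sketch (the function names `parse_rejection` and `is_saturated` are hypothetical, and the regular expressions assume the message format shown in this log), the counters can be pulled out of such a line to track how often the pool saturates:

```python
import re

# Fragment of the rejection message copied from the log above.
SAMPLE = (
    "EsThreadPoolExecutor[search, queue capacity = 1000, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@42a67388"
    "[Running, pool size = 7, active threads = 7, queued tasks = 1000, "
    "completed tasks = 3350278]]"
)

POOL_RE = re.compile(r"EsThreadPoolExecutor\[(\w+), queue capacity = (\d+)")
COUNTS_RE = re.compile(
    r"pool size = (\d+), active threads = (\d+), "
    r"queued tasks = (\d+), completed tasks = (\d+)"
)


def parse_rejection(line):
    """Return the thread pool counters found in a rejection line, or None."""
    pool = POOL_RE.search(line)
    counts = COUNTS_RE.search(line)
    if pool is None or counts is None:
        return None
    return {
        "pool": pool.group(1),
        "queue_capacity": int(pool.group(2)),
        "pool_size": int(counts.group(1)),
        "active": int(counts.group(2)),
        "queued": int(counts.group(3)),
        "completed": int(counts.group(4)),
    }


def is_saturated(info):
    """True when every thread is busy and the queue is at capacity."""
    return (
        info["active"] >= info["pool_size"]
        and info["queued"] >= info["queue_capacity"]
    )


if __name__ == "__main__":
    info = parse_rejection(SAMPLE)
    print(info, "saturated:", is_saturated(info))
```

The same counters are also exposed live by the `_cat/thread_pool` API, which may be easier than parsing logs if the node is still responsive.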

Thank you for reading!!

Hi All,

Here is the Elasticsearch version:
[root@LogCollector ~]# curl -s --insecure --tlsv1.2 'https://localhost:9200'
{
"name" : "vjtpKa1",
"cluster_name" : "log-collector",
"cluster_uuid" : "-nMvD4fVQMizwsZjYUEqLA",
"version" : {
"number" : "5.5.1",
"build_hash" : "19c13d0",
"build_date" : "2017-07-18T20:44:24.823Z",
"build_snapshot" : false,
"lucene_version" : "6.6.0"
},

Hi All,

Here is the Java version:
java -version
openjdk version "1.8.0-jdk8u132-b00"
OpenJDK Runtime Environment (build 1.8.0-jdk8u132-b00-20161115)
OpenJDK 64-Bit Server VM (build 25.71-b20161115, mixed mode)

I am going to guess that you effectively ran out of memory and started thrashing because you have swap enabled. Elasticsearch recommends disabling swap.
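
To check whether swap is actually in use, the "Swap:" row of the RAM usage output posted above can be read directly. This is a minimal sketch under two assumptions: the figures are in megabytes (as `free -m` would print them), and the helper names `swap_usage` and `swap_in_use` are made up for illustration. For disabling swap, the Elasticsearch documentation lists `swapoff -a`, lowering `vm.swappiness`, or setting `bootstrap.memory_lock: true` in elasticsearch.yml.

```python
# RAM usage output copied from the original post (assumed to be in MB).
FREE_OUTPUT = """\
total used free shared buffers cached
Mem: 31478 30546 931 0 4 12373
-/+ buffers/cache: 18167 13310
Swap: 4095 254 3841
"""


def swap_usage(free_output):
    """Return total/used/free swap parsed from the 'Swap:' row."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Swap:":
            total, used, free = (int(value) for value in fields[1:4])
            return {"total": total, "used": used, "free": free}
    raise ValueError("no 'Swap:' row found in the given output")


def swap_in_use(free_output):
    """True when swap is configured and some of it is occupied."""
    usage = swap_usage(free_output)
    return usage["total"] > 0 and usage["used"] > 0


if __name__ == "__main__":
    print(swap_usage(FREE_OUTPUT), "in use:", swap_in_use(FREE_OUTPUT))
```

On the posted figures, swap is configured (4095) and partly occupied (254), which is consistent with swap being enabled on the node.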

Thank you for your reply. I will disable swap.
