Elasticsearch Died on me

Last night during some document processing, a MASSIVE query was run that would have returned something like 30GB worth of data. As a result, the nodes started to die.

Each machine has a 30GB JVM heap and 60GB of RAM in total.

Is there a way to prevent Elasticsearch from killing itself? If it is working on a query that is going to cause an OOM exception, how can I get it to abandon the query instead of committing suicide?
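The only related knobs I have come across so far are the circuit breaker settings, which as far as I understand are meant to reject a request before it exhausts the heap. Something like the sketch below is what I would expect to tune, though the percentages are only illustrative and I am not sure they would actually have caught this:

PUT _cluster/settings
{
    "persistent": {
        "indices.breaker.total.limit": "65%",
        "indices.breaker.request.limit": "50%"
    }
}

If there is a better mechanism for aborting a runaway query before it takes the node down, I would love to hear about it.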

What is the version?
What was the query?

6.2

The query was effectively

GET index/doctype/_search
{
query on some field that returns 30GB worth of data
}

That "query on some field that returns 30GB worth of data" part is exactly what I'd like to know. Could you please tell me what it was?

GET index/doc_type/_search
{
    "query": {
        "range" : {
            "datefield" : {
                "gte" : "2018-05-15T15:02:54.197980"
            }
        }
    }
}

I don't think that kind of query can OOM a node. Are you totally sure it's caused by this?
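A plain search like that only returns as many hits as the size parameter allows (10 by default), so the response itself should be nowhere near 30GB. If the processing job really pulls every matching document, it is presumably doing something like a scrolled search; a rough sketch using the field from your query (the batch size and keep-alive are just illustrative):

GET index/doc_type/_search?scroll=1m
{
    "size": 1000,
    "query": {
        "range" : {
            "datefield" : {
                "gte" : "2018-05-15T15:02:54.197980"
            }
        }
    }
}

Each following page is then fetched with the returned scroll_id via POST /_search/scroll. Memory per request stays bounded either way, which is why I doubt this particular query is the culprit.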

I am not totally sure; I just saw that the node died at almost the same time that query was being run.

It's probably something else that is making your node die.
Do you monitor it with X-Pack monitoring?

I do. The odd thing is that our cluster (it's kind of new) had been running without any indication of failure or strain for 2 weeks, then all of a sudden it failed during that query and has been much less stable ever since.

We have added 2 more data nodes and have seen very little improvement. I thought it was the query that was causing the damage, but now I am just lost.

Could you share some details about your cluster?
Like:

GET /_cat/health?v
GET /_cat/nodes?v
GET /_cat/fielddata?v
GET /_cat/indices?v

Our cluster is under very heavy load right now and I am reallocating shards off a bad node, but I will dump the info here anyway.

epoch      timestamp cluster              status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1527882097 19:41:37  feathrelasticcluster yellow          8         5   4469 2269    4    0       72             0                  -                 98.4%

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.31.3.48             1          54   0    0.02    0.03     0.00 m         -      es-master-2
172.31.2.16            66          99  82    4.15    4.63     5.02 di        -      es-data-1
172.31.3.158           68          99  61    4.91    5.57     6.45 di        -      es-data-2
172.31.1.165           10          54  11    0.47    0.40     0.31 m         *      es-master-3
172.31.2.37             2          54   2    0.20    0.15     0.06 m         -      es-master-1
172.31.3.242           13          70   8    0.54    0.93     0.94 di        -      es-data-4
172.31.1.59            46          99  18    1.92    1.29     0.87 di        -      es-data-5
172.31.1.99            52          99  62    4.72    4.92     4.96 di        -      es-data-3


We have ~500 indexes, so that output is not going to fit here.

You probably have too many shards per node.

May I suggest you look at the following resource about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
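A quick way to see how the shards are actually spread across the data nodes:

GET /_cat/allocation?v

From your _cat/health output above, 4469 active shards over 5 data nodes is roughly 900 shards per data node, which is a lot for a 30GB heap.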


I will certainly consider it, but it seems odd that the cluster was fully operational and well within reasonable limits for several weeks in a pre-production state and for 2 weeks in full production. There haven't been any changes, but it suddenly started failing.

Also, one more question if I can snag you while you're still here :smiley:

Is there a "rebalance shards" API I can trigger? I have looked through quite a few of the shard allocation api endpoints / settings, but it doesnt seem that there is a "Balance" type command. I added a few nodes, and they did not take up and proportional amount of shards. so 90% of the shards are sitting on the first 3 data nodes (that were the original 3 nodes)

Are you sure? Did you create new indices or index new documents in the last 2 weeks?

Well, I would not touch the default settings. Unless you don't have enough disk space on a specific node or are using specific allocation filtering, everything should be nicely balanced.
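For reference, these are the settings involved in rebalancing; the values below are just the documented defaults, so there is normally nothing to change:

PUT _cluster/settings
{
    "transient": {
        "cluster.routing.rebalance.enable": "all",
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%"
    }
}

One thing to keep in mind: a data node above the low disk watermark will not have new shards allocated to it, and above the high watermark Elasticsearch starts moving shards off it, which can look like the cluster refusing to balance onto that node.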

Well... let me correct myself. We index a TON of documents: nearly 3 million a day.

But we have not created any new indexes. So I guess our document count has changed, but 3 million is a drop in the bucket compared to our overall dataset.

What kind of data is that?
3m per day might be enough to add some pressure on your nodes, especially if you are using fielddata instead of doc values and are aggregating tons of different values.
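To illustrate the fielddata point (the index, type and field names below are made up): aggregating on a text field requires fielddata, which is loaded onto the JVM heap, while a keyword sub-field uses doc values on disk:

PUT someindex
{
    "mappings": {
        "doc_type": {
            "properties": {
                "somefield": {
                    "type": "text",
                    "fields": {
                        "raw": { "type": "keyword" }
                    }
                }
            }
        }
    }
}

Aggregating on somefield.raw instead of enabling fielddata on somefield keeps that data off the heap.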