Elasticsearch Died on me

Last night during some document processing, a MASSIVE query was run that would have returned something like 30GB worth of data. As a result, the nodes started to die.

Each machine has a 30GB JVM heap and 60GB of RAM in total.

Is there a way to prevent Elasticsearch from killing itself? If it is working on a query that is going to cause an OOM exception, how can I get it to abandon the query instead of committing suicide?
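The only related knobs I have come across so far are the circuit breaker settings, which as far as I understand are meant to reject a request before it exhausts the heap. Something like the sketch below is what I would expect to tune, though the percentages are only illustrative and I am not sure they would actually have caught this:

PUT _cluster/settings
{
    "persistent": {
        "indices.breaker.total.limit": "65%",
        "indices.breaker.request.limit": "50%"
    }
}

If there is a better mechanism for aborting a runaway query before it takes the node down, I would love to hear about it.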

What is the version?
What was the query?

6.2

The query was effectively

GET index/doctype/_search
{
query on some field that returns 30GB worth of data
}

That "query on some field that returns 30GB worth of data" part is exactly what I'd like to know. Could you please tell me what it was?

GET index/doc_type/_search
{
    "query": {
        "range" : {
            "datefield" : {
                "gte" : "2018-05-15T15:02:54.197980"
            }
        }
    }
}

I don't think that kind of query can OOM a node. Are you totally sure it's caused by this?
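A plain search like that only returns as many hits as the size parameter allows (10 by default), so the response itself should be nowhere near 30GB. If the processing job really pulls every matching document, it is presumably doing something like a scrolled search; a rough sketch using the field from your query (the batch size and keep-alive are just illustrative):

GET index/doc_type/_search?scroll=1m
{
    "size": 1000,
    "query": {
        "range" : {
            "datefield" : {
                "gte" : "2018-05-15T15:02:54.197980"
            }
        }
    }
}

Each following page is then fetched with the returned scroll_id via POST /_search/scroll. Memory per request stays bounded either way, which is why I doubt this particular query is the culprit.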

I am not totally sure; I just saw that the node died at almost the same time that query was being run.

It's probably something else that is making your node die.
Do you monitor it with X-Pack monitoring?

I do. The odd thing is that our cluster (it's kind of new) had been running without any indication of failure or strain for 2 weeks, then all of a sudden it failed during that query and has been much less stable ever since.

We have added 2 more data nodes and have seen very little improvement. I thought it was the query that was causing the damage, but now I am just lost.

Could you share some details about your cluster?
Like:

GET /_cat/health?v
GET /_cat/nodes?v
GET /_cat/fielddata?v
GET /_cat/indices?v

Our cluster is under very heavy load right now and I am reallocating shards off a bad node, but I will dump the info here anyway.

epoch      timestamp cluster              status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1527882097 19:41:37  feathrelasticcluster yellow          8         5   4469 2269    4    0       72             0                  -                 98.4%

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.31.3.48             1          54   0    0.02    0.03     0.00 m         -      es-master-2
172.31.2.16            66          99  82    4.15    4.63     5.02 di        -      es-data-1
172.31.3.158           68          99  61    4.91    5.57     6.45 di        -      es-data-2
172.31.1.165           10          54  11    0.47    0.40     0.31 m         *      es-master-3
172.31.2.37             2          54   2    0.20    0.15     0.06 m         -      es-master-1
172.31.3.242           13          70   8    0.54    0.93     0.94 di        -      es-data-4
172.31.1.59            46          99  18    1.92    1.29     0.87 di        -      es-data-5
172.31.1.99            52          99  62    4.72    4.92     4.96 di        -      es-data-3


We have ~500 indexes, so that output is not going to fit here.

You probably have too many shards per node.

May I suggest you look at the following resource about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
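A quick way to see how the shards are actually spread across the data nodes:

GET /_cat/allocation?v

From your _cat/health output above, 4469 active shards over 5 data nodes is roughly 900 shards per data node, which is a lot for a 30GB heap.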


I will certainly consider it, but it seems odd that the cluster was fully operational and well within reasonable limits for several weeks in a pre-production state and for 2 weeks in full production. There haven't been any changes, but it suddenly started failing.

Also, one more question if I can snag you while you're still here :smiley:

Is there a "rebalance shards" API I can trigger? I have looked through quite a few of the shard allocation api endpoints / settings, but it doesnt seem that there is a "Balance" type command. I added a few nodes, and they did not take up and proportional amount of shards. so 90% of the shards are sitting on the first 3 data nodes (that were the original 3 nodes)

Are you sure? Did you create new indices or index new documents in the last 2 weeks?

Well, I would not touch the default settings. Unless you don't have enough disk space on a specific node or are using specific allocation filtering, everything should be nicely balanced.
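For reference, these are the settings involved in rebalancing; the values below are just the documented defaults, so there is normally nothing to change:

PUT _cluster/settings
{
    "transient": {
        "cluster.routing.rebalance.enable": "all",
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%"
    }
}

One thing to keep in mind: a data node above the low disk watermark will not have new shards allocated to it, and above the high watermark Elasticsearch starts moving shards off it, which can look like the cluster refusing to balance onto that node.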

Well... let me correct myself. We index a TON of documents: nearly 3 million a day.

But we have not created any new indexes. So I guess our document count has changed, but 3 million is a drop in the bucket compared to our overall dataset.

What kind of data is that?
3m per day might be enough to add some pressure on your nodes, especially if you are using fielddata instead of doc values and are aggregating tons of different values.
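To illustrate the fielddata point (the index, type and field names below are made up): aggregating on a text field requires fielddata, which is loaded onto the JVM heap, while a keyword sub-field uses doc values on disk:

PUT someindex
{
    "mappings": {
        "doc_type": {
            "properties": {
                "somefield": {
                    "type": "text",
                    "fields": {
                        "raw": { "type": "keyword" }
                    }
                }
            }
        }
    }
}

Aggregating on somefield.raw instead of enabling fielddata on somefield keeps that data off the heap.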