Date histogram agg crashes cluster

Hi, running Elasticsearch 2.3.0.

This issue is 100% reproducible.

We have a 12-node cluster with 20GB of RAM per node, so 240GB total.

There are 2 indexes: Index1 has about 35,000,000 records and Index2 about 6,000,000. Both indexes are "identical" except for the mapping difference noted below (we attempted a new mapping).

Index1 has a mapping of...

"myDate": {
        "format": "dateOptionalTime",
        "type": "date"
},

Index2 has a mapping of...

"myDate": {
        "type": "long"
 },

The document type (myType) is the same on both indexes. The documents are inserted with myDate as yyyyMMdd (no time, just the day).

Index1 has...
8,000,000 documents for 20160101
6,000,000 documents for 20160102
7,000,000 documents for 20160103
8,000,000 documents for 20160104

Index2 has
6,000,000 documents for 20160407
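A side note on what those yyyyMMdd longs mean to Elasticsearch: as far as I know, numeric date input is read as milliseconds since the epoch, so a value like 20160101 lands in early 1970 rather than in 2016. A quick sketch in plain Python, just to illustrate that assumption:

```python
from datetime import datetime, timedelta, timezone

# Assumption: Elasticsearch reads a numeric date value as epoch milliseconds.
value = 20160101  # intended to mean 2016-01-01 in yyyyMMdd form
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
as_date = epoch + timedelta(milliseconds=value)  # 20,160,101 ms after the epoch
print(as_date)  # 1970-01-01 05:36:00.101000+00:00
```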

When we run the below query the cluster crashes. We lose nodes...

GET index*/myType/_search
{
  "size" : 0,
  "aggregations" : {
    "Date" : {
      "date_histogram" : {
        "field" : "myDate",
        "interval" : "1d"
      },
      "aggregations" : {
        "Record Count" : {
          "value_count" : {
            "field" : "myId"
          }
        }
      }
    }
  },
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUser" : {
            "type" : "phrase",
            "query" : "user1"
          }
        }
      }
    }
  }
}

If we run the same agg individually on each index without the wildcard, it seems to work. Though we have noticed that on Index1 the agg returns a couple thousand records for each "day", while on Index2 the agg returns a single bucket with a doc count.

When we run the agg with a wildcard across both indexes, that's where the problem occurs and we lose the cluster. From an application standpoint we are trying to rectify the issue by revising the mapping and the data inserted.
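If my assumption is right that the long values are read as epoch milliseconds, then the cross-index histogram has to span from ~1970 (Index2's longs) to 2016 (Index1's real dates), building tens of thousands of day buckets instead of a handful. Rough arithmetic in plain Python (the exact values are hypothetical examples from our data):

```python
from datetime import datetime, timezone

MS_PER_DAY = 86_400_000

# Lowest value: an Index2 long like 20160101, which as epoch millis is ~1970-01-01.
low = 20160101
# Highest value: a real Index1 date, e.g. 2016-01-04, as epoch millis.
high = int(datetime(2016, 1, 4, tzinfo=timezone.utc).timestamp() * 1000)

# A day-interval histogram covering the full range needs one bucket per day.
buckets = (high - low) // MS_PER_DAY + 1
print(buckets)  # 16804 day buckets spanning 1970..2016
```

Whether ~17,000 buckets alone explains the GC pressure I can't say, but it is far more than either index would produce on its own.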

Just letting you know that the above combination wreaks havoc on Elasticsearch, and hopefully it's something you can reproduce and fix to avoid this kind of crash.

Thanks

Crashes it how?

We lose nodes. You see them disconnect and the cluster tries to rebalance itself. They come back eventually, I think, but the first time I manually bounced the nodes. I'll reproduce it again and see what happens...

Yeah but why are they lost - CPU, OOM, other?

I didn't see an OOM exception, but it looks like a memory issue.

It has something to do with the fact that one index is dateOptionalTime and the other is long, and it seems to try to load too much data into RAM?

It's GC. On the node that received the query, KOPF reported RAM usage at 100%.

Below query ok.

Where myDate is dateOptionalTime
GET index1-201601/myType/_search
{
  "size": 0,
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUserId" : {
            "type" : "phrase",
            "query" : 100000
          }
        }
      }
    }
  },
  "aggs": {
    "bydate": {
      "date_histogram": {
        "field": "myDate",
        "interval": "day",
        "format" : "yyyyMMdd"
      }
    }
  }
}

Below query ok. It returns just a single bucket with a doc count. Should date_histogram even work on a long?

Where myDate is long
GET index2-201604/myType/_search
{
  "size": 0,
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUserId" : {
            "type" : "phrase",
            "query" : 100000
          }
        }
      }
    }
  },
  "aggs": {
    "bydate": {
      "date_histogram": {
        "field": "myDate",
        "interval": "day",
        "format" : "yyyyMMdd"
      }
    }
  }
}
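On the "should date_histogram even work on a long" question: it does run, but as far as I can tell the values are treated as epoch milliseconds, so every yyyyMMdd long falls inside the same 1970-01-01 day bucket. That would explain getting back just one bucket with a doc count. A sketch of the bucket-key arithmetic (my assumption about the bucketing, not taken from the ES source):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms

# Day-interval bucketing: key = floor(value / ms_per_day) * ms_per_day.
values = [20160407, 20160101, 20160104]  # yyyyMMdd values stored as longs
keys = {v // MS_PER_DAY * MS_PER_DAY for v in values}
print(keys)  # {0} -> every value collapses into the 1970-01-01 bucket
```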

Below query is the culprit.

Query on wildcard.
GET index*/myType/_search
{
  "size": 0,
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUserId" : {
            "type" : "phrase",
            "query" : 100000
          }
        }
      }
    }
  },
  "aggs": {
    "bydate": {
      "date_histogram": {
        "field": "myDate",
        "interval": "day",
        "format" : "yyyyMMdd"
      }
    }
  }
}

Causes immediate GC thrashing on the node that received the query.

Logs here: http://pastebin.com/Dbczc8qK

The rest of the nodes seem ok, but that one node still hasn't recovered.

Ok, the node core dumped. The rest of the nodes are fine. So wherever that query is run, it causes chaos on that node.

Maybe check with the https://www.elastic.co/guide/en/elasticsearch/reference/current/search-validate.html API and see what is happening?

Though "one index is date optional and the other is long" doesn't sound good even if it isn't related.

Validate doesn't support aggs.

Anyways, I can reproduce this all the time. @warkolm is there anybody who can take a closer look at this?

@warkolm

Hello Mark, what should I do? File a bug? Any other thoughts on this?