Elasticsearch 7.8 worse heap management

Hello @Ignacio_Vera ,

To give you some context and do some summary I will explain deeply our situation.

This is our current cluster overview:

After updating from 7.6.2 to 7.8.0 we noticed a huge performance degradation in elasticsearch, even noticeable by the users in kibana.

For us were easy to measure it, we only need to take a look at the heaviest process that we do with elasticsearch, it is a nightly data extraction done by our Data Analytics team.

Before the update this process was completed in 3h 51min:

After the update it takes around 6h 5m:

Based on the metrics, our first suspicions were worse heap management because we see a great difference in max usage and usage variation.

@Xeraa pointed to us that this elasticsearch version changes the bundled JDK from 13 to 14 and we are using an old jvm.options configuration. After applying the new configuration we recovered the max heap usage saw before but not the usage variation.

And here we are, the nightly data extraction process is a process that we are executing exactly the same way every night for one year. We didn't notice a degradation in 6.8.0, neither in 7.6.2, our issue became with the update to 7.8.0.

The extraction usually gets around 58.325.154 hits and it's done with queries like this:

GET /varys/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "ts": {
              "gte": "2020-07-12T00:00:00",
              "lt": "2020-07-12T01:00:00"
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "type": "pageview"
          }
        },
        {
          "match": {
            "type": "user_event"
          }
        },
        {
          "match": {
            "type": "session_info"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

I've done a profile from kibana because the original process doesn't allow us to configure it, here it's limited to 10000 hits, the original process isn't. The result, removing the hits is:

EDIT: I reduced it because this message only allows 13.000 characters.

{
  "took" : 310,
  "timed_out" : false,
  "_shards" : {
    "total" : 120,
    "successful" : 120,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 4.8003182,
    "hits" : [...]
  },
  "profile" : {
    "shards" : [
      {
        "id" : "[-jHLObIoRYyFZTwhrBLNWA][varys-000446][10]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "(+ts:[1594512000000 TO 1594547999999] type:pageview type:user_event type:session_info)~1",
                "time_in_nanos" : 14264690,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 12,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 431132,
                  "match" : 6065,
                  "next_doc_count" : 12,
                  "score_count" : 12,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 12220986,
                  "advance_count" : 34,
                  "score" : 12446,
                  "build_scorer_count" : 68,
                  "create_weight" : 247447,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 1346475
                },
                "children" : [
                  {
                    "type" : "IndexOrDocValuesQuery",
                    "description" : "ts:[1594512000000 TO 1594547999999]",
                    "time_in_nanos" : 794246,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 3704,
                      "match" : 0,
                      "next_doc_count" : 12,
                      "score_count" : 12,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 5087,
                      "advance_count" : 56,
                      "score" : 1088,
                      "build_scorer_count" : 102,
                      "create_weight" : 433,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 783751
                    }
                  },
                  {
                    "type" : "TermQuery",
                    "description" : "type:pageview",
                    "time_in_nanos" : 372810,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 136,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 12,
                      "compute_max_score_count" : 136,
                      "compute_max_score" : 15084,
                      "advance" : 111806,
                      "advance_count" : 68,
                      "score" : 1353,
                      "build_scorer_count" : 102,
                      "create_weight" : 111162,
                      "shallow_advance" : 8404,
                      "create_weight_count" : 1,
                      "build_scorer" : 124546
                    }
                  },
                  {
                    "type" : "TermQuery",
                    "description" : "type:user_event",
                    "time_in_nanos" : 10098149,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 126,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 126,
                      "compute_max_score" : 11107,
                      "advance" : 9958321,
                      "advance_count" : 66,
                      "score" : 0,
                      "build_scorer_count" : 98,
                      "create_weight" : 60922,
                      "shallow_advance" : 6379,
                      "create_weight_count" : 1,
                      "build_scorer" : 61003
                    }
                  },
                  {
                    "type" : "TermQuery",
                    "description" : "type:session_info",
                    "time_in_nanos" : 2580168,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 132,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 132,
                      "compute_max_score" : 12018,
                      "advance" : 2441590,
                      "advance_count" : 68,
                      "score" : 0,
                      "build_scorer_count" : 102,
                      "create_weight" : 55690,
                      "shallow_advance" : 5968,
                      "create_weight_count" : 1,
                      "build_scorer" : 64467
                    }
                  }
                ]
              }
            ],
            "rewrite_time" : 9313,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 33375
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
  }
}

As you said, I've also requested the /_nodes/hot_threads while the process is running. I've got the next response (with erased IPs):

::: {fesn18}{PFBPEqxWQt6B8w7Tgocb5w}{G-dN6qhcTWeZLzgG2ZA_gw}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67456950272, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.861Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {fesn61}{KOrNRROwQhqi9HwhaYUXug}{tETgc45NTdyFznN-s0--9A}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67457081344, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.861Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {fesn28}{xRbiCMKUTFe84KNaq9p7eQ}{MLjzVmvCQvOqW-xccEA27g}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67456978944, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.863Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {fesn71}{mLS9UDsjQtSEwNgtVRjXhg}{mPG5R6UhRjykmRycy2c2Cw}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67457142784, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.863Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

    3.9% (19.4ms out of 500ms) cpu usage by thread 'elasticsearch[fesn71][write][T#6]'
     8/10 snapshots sharing following 2 elements
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

::: {fesn52}{KYqnV8x1R5mf-AJwD1wXJg}{NnI7_zJoTFaViDK5ShaGgw}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67456999424, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.862Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {fesn36}{a_BLI3CETt-t21_FV6XxFw}{uLz-XsN1SS-XAxwiEU8irg}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67456991232, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.862Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {master1}{LW5sVKysQVay2aiBo09txA}{jiT3nOL0QfuDC9XQkFl5AQ}{X.X.X.X}{X.X.X.X:9300}{lmr}{ml.machine_memory=67326771200, ml.max_open_jobs=20, xpack.installed=true, box_type=master, transform.node=false}
   Hot threads at 2020-07-16T12:57:23.864Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {fesn31}{9H4TJDvLQqa1RCNcQgXlDQ}{5FhpCYM-RvikgO9ZdVJW5Q}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67456978944, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.863Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   14.3% (71.6ms out of 500ms) cpu usage by thread 'elasticsearch[fesn31][management][T#3]'
     10/10 snapshots sharing following 2 elements
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

::: {fesn33}{3eS__3kIS6CMoy536JiUWw}{J51wh8aVR6C84IyyxNzHGw}{X.X.X.X}{X.X.X.X:9300}{dlrt}{ml.machine_memory=67456978944, ml.max_open_jobs=20, xpack.installed=true, box_type=defaultnodetype, transform.node=true}
   Hot threads at 2020-07-16T12:57:23.863Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

    7.3% (36.5ms out of 500ms) cpu usage by thread 'elasticsearch[fesn33][write][T#2]'
     8/10 snapshots sharing following 2 elements
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)

I want to point that the performance degradation is global, I'm speaking about this nightly process because is where it's easier to see it.

Thanks in advance,
Carlos