Performance issue running on ECK vs EC2 on AWS

Hello, I am new here and I am trying to understand performance issues.

I am running the same index on two different Elasticsearch deployments (I used snapshot/restore to make sure both sides have exactly the same contents).
One of them is running on EC2, with as much RAM (up to 64 GiB) and CPU as it can eat. The other is deployed with the ECK operator on a node that can have up to 16 GiB of RAM (I tried giving it 32, but it failed...).

The EC2 deployment has 0 dedicated masters and 1 data node (no replica), with a single shard.
The ECK deployment has 1 master and 1 data node (both scheduled on the same Kubernetes node). The restore operation that created the index produced the same single shard.

I am now shooting(*) /_search requests at both, and I observe a difference in processing time - the EC2 instance is a lot faster than the ECK one. I am trying to understand what is going on and how to get them to perform similarly. The difference I observe is around +200 ms for the ECK instance compared to the EC2 instance on _search requests (see below for an example).

(*) in order to compare results, I port-forward the EC2 port and the ECK HTTP service to my local machine.
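
Concretely, something like this (the ECK service name follows the operator's <cluster-name>-es-http convention, so pascal-es-http here; the EC2 host and the exact local ports are placeholders):

# forward the ECK-managed HTTP service to one local port
kubectl -n es-library-staging port-forward service/pascal-es-http 9200:9200
# tunnel the EC2 instance's port 9200 to another local port
ssh -N -L 9201:localhost:9200 <user>@<ec2-host>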

It is important to note that, right now, the goal is not to get the BEST EVER result, but to get, at least, a similar result. I know my index could benefit from sharding, but I want to compare two comparable things - EC2 vs ECK, 1 data node, no replica, 1 shard.

The index has 460 million entries, around 100 GiB of data (so, yes, I know I should definitely be using 4 shards).
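
To double-check that the restore produced identical data on both sides, the doc count and on-disk size of the single shard can be compared through each port-forward, e.g.:

curl -k "http://127.0.0.1:9200/_cat/shards/library?v&h=index,shard,prirep,state,docs,store,node"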

I don't know what to look for, and will be happy to post responses from ES to your suggested queries.

Here are a few important details:

▶ curl -k http://127.0.0.1:9200/library | jq  # this is the same index for both clusters
{
  "library": {
    "aliases": {},
    "mappings": {
      "properties": {
        "author": {
          "type": "wildcard"
        },
        "book_registered_title": {
          "type": "keyword"
        },
        "isbn": {
          "type": "keyword"
        },
        "internal_id": {
          "type": "keyword"
        },
        "book_title": {
          "type": "wildcard"
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "library",
        "creation_date": "1657643120070",
        "sort": {
          "field": "internal_id",
          "order": "asc"
        },
        "number_of_replicas": "1",
        "uuid": "QOTS5xzhSMe7KvGEUmifmQ",
        "version": {
          "created": "7090099"
        }
      }
    }
  }
}

Here is the Elasticsearch manifest I applied with the ECK operator:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: pascal
  namespace: es-library-staging
spec:
  version: 7.17.5
  nodeSets:
  - name: data
    count: 1
    config:
      node.roles: [data]
      node.store.allow_mmap: false
      cluster.initial_master_nodes:
        - pascal-es-library-master-0
      xpack.security.authc.realms:
        native:
          native1:
            order: 1
    podTemplate:
      metadata:
        labels:
          dedicatedLabel: "es-library"
      spec:
        tolerations:
          - key: "dedicatedTaint"
            operator: "Equal"
            value: "es-library"
            effect: "NoSchedule"
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    elasticsearch.k8s.elastic.co/cluster-name: pascal
                topologyKey: kubernetes.io/hostname
        containers:
        - name: elasticsearch
          resources:
            requests:
              cpu: "125m"
              memory: 2Gi
            limits:
              cpu: 4
              memory: 16Gi
    volumeClaimTemplates:
      - metadata:
          name: elasticsearch-data
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 250Gi
  - name: master
    count: 1
    config:
      node.roles: [master]
      node.store.allow_mmap: false
      cluster.initial_master_nodes:
        - pascal-es-library-master-0
      xpack.security.authc.realms:
        native:
          native1:
            order: 1
    podTemplate:
      metadata:
        labels:
          dedicatedLabel: "es-library"
      spec:
        tolerations:
          - key: "dedicatedTaint"
            operator: "Equal"
            value: "es-mlc"
            effect: "NoSchedule"
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    elasticsearch.k8s.elastic.co/cluster-name: pascal
                topologyKey: kubernetes.io/hostname
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: -Xms512m -Xmx512m
          resources:
            requests:
              cpu: "100m"
              memory: 1Gi
            limits:
              cpu: 2
              memory: 2Gi
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

Example of a query:

 curl -k http://127.0.0.1:9201/library/_search -H "content-type:application/json" -d'{
    "post_filter": {
        "bool": {
            "should": {
                "wildcard": {
                    "author": {
                        "value": "*VICTOR HUGO*"
                    }
                }
            }
        }
    },
    "query": {
        "bool": {
            "should": {
                "term": {
                    "book_registered_title": "NOTRE-DAME DE PARIS"
                }
            }
        }
    },
    "search_after": [
        ""
    ],
    "size": 10000,
    "sort": [
        {
            "internal_id": {
                "order": "asc"
            }
        }
    ]
}'

On ECK, I get the following results:

{
  "took": 626,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 74,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [...]
  }
}

and on EC2 I get this:

{
  "took": 143,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 74,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [...]
  }
}

Do you have a suggestion to help me investigate why one of them is so much slower than the other?

The difference could be the speed/performance of the storage, which is often the limiting factor. The node with more RAM has more space available for the operating system page cache, which means more data can be cached; even if the storage is equivalent, this alone could make a difference. If the type and performance of the storage differ, that could be the cause. You are not really comparing apples to apples, so I am not surprised to see different latencies.

As each query runs in a single thread against each shard, CPU speed might also be a factor. Given the size of the index, you would likely benefit from more primary shards.
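
To compare the two environments concretely, the nodes stats API shows OS memory, disk I/O counters and JVM heap on each node, and the search profiler breaks down where time is spent inside a query. A rough sketch of where to look (run against each cluster through its port-forward; the fio path assumes the official image's default data directory):

# OS memory, filesystem / disk I/O stats and JVM heap per node
curl -k "http://127.0.0.1:9200/_nodes/stats/os,fs,jvm?pretty"
# quick per-node overview: heap, RAM, CPU and load
curl -k "http://127.0.0.1:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m"
# adding "profile": true at the top level of the _search body returns a per-component timing breakdown
# optional: compare raw random-read latency of the two data volumes directly
fio --name=randread --directory=/usr/share/elasticsearch/data --rw=randread --bs=4k --size=1G --runtime=30 --time_based --ioengine=psync --direct=1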

Both are running on SSDs.

How can I check how much RAM is currently being used by the cache?
I tried to check how much RAM ES was using: the overall Docker container on EC2 uses 18 GiB, whereas it uses 12 GiB on ECK - but that's just the output of docker stats or kubectl.
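
In other words, roughly this (namespace taken from the manifest above):

docker stats --no-stream                # container-level view on the EC2 host
kubectl -n es-library-staging top pod   # pod-level view on the Kubernetes side (via metrics-server)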

Here's the CPU configuration:
ECK: 3.1 GHz Intel Xeon® Platinum 8000 series processors (Skylake 8175M or Cascade Lake 8259CL) with the Intel Advanced Vector Extensions (AVX-512) instruction set
EC2: Arm-based AWS Graviton2 processors

I agree I'm not comparing exactly the same things, but I would have expected the difference to be a lot smaller than 2 to 3 times the processing time per request.
