Re-evaluating shard setup

I'm running a 3-node ES cluster v7.7 on AWS (26 GB EBS volume, 4GB memory, c5.large per node) for 1 index with the following settings:

PUT /my-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

After initial research, I understood that I will not get the shard allocation "right" unless I know at least the size of the index ahead of time. I therefore believe now is the time to re-evaluate the setup.

This index currently holds ~37M searchable documents (deleted documents currently sit at ~4M) and occupies 5GB. It is not expected to grow rapidly; by the end of the year it might increase to 40M docs.

What I'd like to understand is what I got wrong, as the cluster is experiencing the following:

  1. One node not receiving search requests (screenshot attached)

  2. At random points in time, the cluster has spikes in search latency of over 6 seconds and HTTP 400 response codes are returned, without any indication of a CPU spike. The only way it gets "resolved" is by not sending any more requests to the ES cluster.
    [Screenshots: 2020-09-24 at 11.27.43 and 2020-09-24 at 11.27.50]

Hi Chris,

The first point is to be expected, I think: you have 3 nodes and configured only 1 primary and 1 replica shard, so one server holds the primary and another holds the replica. The third server does not hold anything. You might want to increase the number of replicas to 2 so that queries can use all available servers.
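
A minimal sketch of that change, assuming the index is still called my-index as in your original request. number_of_replicas is a dynamic setting, so it can be changed on the live index; the second request lists where each shard copy ended up.

```console
# Raise the replica count so that each of the 3 nodes can hold one copy of the shard
PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}

# List the shard copies and the node each one is allocated to
GET /_cat/shards/my-index?v&h=index,shard,prirep,state,docs,store,node
```

With 1 primary and 2 replicas there are 3 shard copies, and since two copies of the same shard are never placed on the same node, each node should end up with one. At the current ~5GB index size that is roughly 5GB per node.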

Unfortunately, I do not have an idea why you have such spikes in latency.

Best regards
Wolfram

What type of instances are you using? What type of storage?

Thank you for your response Wolfram

Hi Christian, I've updated my post. Each node is set up with 26GB EBS storage and 8GB memory.

Which instance type are you using? t2?

It is the compute optimised c5.large

There are a couple of things I can think of that could cause latency spikes. The first is GC. Could you check in the logs if there is any long GC reported around the time of the spike? It is also worth noting that EBS IOPS are proportional to the volume size unless you have PIOPS. Are you monitoring disk I/O so you can see if there is any correlation?
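
If it helps, here is a rough sketch of requests that could be sampled before and after a spike to look for a correlation. The filter_path expressions only trim the output; the fs.io_stats section is Linux-only, and I am assuming it is exposed on your cluster, which may not be the case on a managed service.

```console
# Per-node GC collection counts and cumulative collection times;
# a large jump in collection_time_in_millis between two samples points to long GCs
GET /_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc

# Search thread pool per node; a growing "rejected" count means requests were turned away
GET /_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected

# Disk I/O counters per node (Linux only)
GET /_nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.io_stats
```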

I'm using EBS IOPS. Thank you for mentioning that, I'll look further into it.
Sorry, I don't have visibility on disk I/O at the time of the spike.

The only screenshot I can provide is this one, which shows the 2 spikes in the threadpool and the GC metrics.

I am having a similar issue. For me a few things helped. I still have questions myself.

Context: I have three 8GB AWS nodes, each with a 100GB SSD attached.

  1. Limited shards to 15GB with ILM. Why not 8 or 3.5GB? 20GB is slower, I tried.
  2. I use 3 primaries and 1 replica. Wrong?
  3. Turned off atime on the SSD, so a read no longer triggers a write.
  4. Ensured memory lock and max open files were set in the systemd config.
  5. Limited the max JVM memory to 3.5GB. This allows the JVM to use compressed 32-bit object pointers and leaves about half the memory for the file cache. About 300MB is used by the OS.
  6. Ensured my JSON is sorted.
  7. No swap file or partition.
  8. Strict schema, almost no text fields that are indexed.
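
For items 4 and 5, a quick way to check that the settings actually took effect on every node might look like the sketch below. The filter_path expressions are only there to trim the output to the relevant fields.

```console
# Item 4: mlockall should be true on every node if memory locking worked
GET /_nodes?filter_path=**.mlockall

# Item 4: the open-file limit each node process actually received
GET /_nodes/stats/process?filter_path=**.max_file_descriptors

# Item 5: configured max heap and whether compressed object pointers are in use
GET /_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_max_in_bytes,nodes.*.jvm.using_compressed_ordinary_object_pointers
```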

My data is time based but my queries are not always time limited. No data is thrown away.
I have 400+ fields per document, mostly keywords and only a few text fields.

Limiting the shard size from 40GB to 15GB seemed to do the most for me. Performance went from dramatically slow to wow fast.
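
A rough sketch of what such a rollover policy could look like. The policy name is made up for illustration, and max_primary_shard_size needs Elasticsearch 7.13 or later. On earlier 7.x versions rollover only offers max_size, which applies to the total size of all primary shards in the index, so with 3 primaries a 15GB-per-shard cap would correspond to roughly 45gb there.

```console
# Hypothetical policy name; roll over to a new index once a primary shard reaches 15GB
PUT /_ilm/policy/cap-shard-size
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "15gb"
          }
        }
      }
    }
  }
}
```

The policy only takes effect on indices that reference it through index.lifecycle.name and are written to via a rollover alias or data stream.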

Having guidance from Elastic for several very different use cases would be nice. It is a bit of whack a mole now.

  • What size should my shards be?
  • What is the optimal AWS VM config?
  • Elastic config?
  • Schema tuning?
  • Use nested, flattened?
  • Turn off source?
  • Compress?

Leaving most settings at their defaults works! But limiting the max shard size helped a lot.