Term query by _id occasionally very slow (30s+)

Mapping:

{
  "dynamic": "strict",
  "_source": {
    "enabled": false
  },
  "properties": {
    "data": {
      "type": "binary",
      "doc_values": false,
      "store": true
    }
  }
}

Query:

{"query":{
  "bool" : {
    "filter" : [
      {
        "term" : {
          "_id" : {
            "value" : "a7884205-3d10-11ee-8fe7-0b299f77c536",
            "boost" : 1.0
          }
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}
}

3 ES nodes (3 shards, 1 replica)
Each node: 4 cores, 32GB RAM and a 14GB heap
1TB of data (100 * 10GB)
JVM stats as follows:

"jvm": {
                  "timestamp": 1695089656498,
                  "uptime_in_millis": 8293249814,
                  "mem": {
                      "heap_used_in_bytes": 5938777088,
                      "heap_used_percent": 39,
                      "heap_committed_in_bytes": 15032385536,
                      "heap_max_in_bytes": 15032385536,
                      "non_heap_used_in_bytes": 270824264,
                      "non_heap_committed_in_bytes": 277282816,
                      "pools": {
                          "young": {
                              "used_in_bytes": 3430940672,
                              "max_in_bytes": 0,
                              "peak_used_in_bytes": 9009364992,
                              "peak_max_in_bytes": 0
                          },
                          "old": {
                              "used_in_bytes": 2340064256,
                              "max_in_bytes": 15032385536,
                              "peak_used_in_bytes": 10826308096,
                              "peak_max_in_bytes": 15032385536
                          },
                          "survivor": {
                              "used_in_bytes": 167772160,
                              "max_in_bytes": 0,
                              "peak_used_in_bytes": 700494512,
                              "peak_max_in_bytes": 0
                          }
                      }
                  },
                  "threads": {
                      "count": 194,
                      "peak_count": 197
                  },
                  "gc": {
                      "collectors": {
                          "young": {
                              "collection_count": 27822,
                              "collection_time_in_millis": 1431077
                          },
                          "old": {
                              "collection_count": 0,
                              "collection_time_in_millis": 0
                          }
                      }
                  },
                  "buffer_pools": {
                      "mapped": {
                          "count": 9938,
                          "used_in_bytes": 511147464733,
                          "total_capacity_in_bytes": 511147464733
                      },
                      "direct": {
                          "count": 173,
                          "used_in_bytes": 11555891,
                          "total_capacity_in_bytes": 11555889
                      },
                      "mapped - 'non-volatile memory'": {
                          "count": 0,
                          "used_in_bytes": 0,
                          "total_capacity_in_bytes": 0
                      }
                  },
                  "classes": {
                      "current_loaded_count": 29654,
                      "total_loaded_count": 31313,
                      "total_unloaded_count": 1659
                  }
              }

Which version of Elasticsearch are you using?

Do you have a single index with 3 primary shards and 1 replica in the cluster? If so, what is the size of this index in terms of shard size in GB as well as document count? What is the average size of the data in this binary field?

What type of storage are you using? Local SSD?

What load is the cluster under (read and write) when you see these long latencies?

This usage pattern is quite unusual. It looks to me like you are trying to use Elasticsearch as a key-value store for some binary data. Why are you using Elasticsearch this way?

Yeah I'd be interested to know how big the blobs are in this doc.

Hi Christian,
Thanks for your reply.
Here are the details:
ES version: 8.4.2
Each index has 3 primary shards and 1 replica.
ILM rollover conditions:

{
  "max_size": "5gb",
  "max_primary_shard_size": "5gb",
  "max_age": "1d",
  "max_docs": 10000000
}

Average size of the binary field data: 8KB. The following is a sample of the indices:

health status index                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   messages-000370                  DneyyWoHQp-T17_kjn9kCg   3   1    1471584            0       12gb            6gb
green  open   messages-000371                  -h50Q3JiSrmsu0mmt7ZCMw   3   1    1394995            0     10.9gb          5.4gb
green  open   messages-000372                  XQFIvqAvSeCRMe_GCNYZ5A   3   1    1380346            0       11gb          5.5gb
green  open   messages-000373                  wTQC9L3dS_CiRMbC8rSJpQ   3   1    1655118            0     10.3gb          5.1gb
green  open   messages-000374                  Wuft_rovRDqs1i8mX_Ygyw   3   1    1182124            0       10gb            5gb

When indexing and searching at the same time we see the long latencies: CPU usage is less than 20% and OS memory usage is roughly 18GB. The hardware resources are reasonable, but the performance is not good. The worst case is 30s+; the normal case is still seconds (2-10s).

How many indices do you have in the cluster? What type of storage do you have?

Currently there are 258 indices in total, 185 of which are message indices.
At the entry point: 20-40 messages per second.

The storage is SSD + SATA (HCI: hyper-converged infrastructure).

You did not answer the following questions:

How many documents are indexed per second? What bulk size are you using? How many concurrent queries are you running when you see the large latencies?

Can you please describe the use case and why you are using Elasticsearch this way?

Why have you limited your shard size to 5GB? That seems quite small and will result in a large number of shards that need to be searched in every query.

How many documents are indexed per second? What bulk size are you using? How many concurrent queries are you running when you see the large latencies?
At least 40 messages/s at the entry points; the total must be more than that. Bulk actions: 5000, flush interval: 5s. Only one query is running when we see the large latencies.

Can you please describe the use case and why you are using Elasticsearch this way?
Sure. The system is a kind of trace: a message comes into the system and the system processes it via A->B->C->D->E. At each point we trace the event and keep the message content so that the message can be reprocessed. It may not be the perfect design, but it is our current design. We may improve it later by extracting the message content to another database, but not at this stage.

What is the retention period for this data in the cluster?

Is the amount of data indexed static or expected to grow in the future?

The retention period is 14 days. The system has been running for 7 days, which means the data volume will roughly double, i.e. about 2TB.

Why have you limited your shard size to 5GB?
Is there any formula we can follow to set the right shard size?

Given the number of indices in the cluster I will assume your retention period is reasonably long, potentially up to 6 months.

Although I am not sure Elasticsearch is the ideal choice for this use case, I think there are a few things you could try in order to improve performance.

1. Limit the number of shards queried
When Elasticsearch indexes a document it routes the document to one of the shards based on the document ID by default. Your query searches for one specific document ID but queries all shards. You can try adding the document ID as a routing value with each search request. This should allow Elasticsearch to only search one shard per index (the one that could hold the document based on the ID).

This should work even if you change the number of primary shards.
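
For example, a minimal sketch using the document ID from your query above (the messages-* index pattern and the exact request form are assumptions on my side):

GET messages-*/_search?routing=a7884205-3d10-11ee-8fe7-0b299f77c536
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "_id": "a7884205-3d10-11ee-8fe7-0b299f77c536"
          }
        }
      ]
    }
  }
}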

2. Increase shard size
You currently have a lot of very small shards, which can be slow to query. I would recommend increasing the shard size so you get fewer shards to query. At the same time I would also recommend you increase the number of primary shards as that will make the routing change described earlier more efficient. If you go to e.g. 6 primary shards per index only 1/6 of all shards would need to be searched instead of 1/3. This does not mean that you need to reindex - you can simply let the smaller indices with 3 primary shards age out over time.

This likely means that you will need to make each index cover a larger time period, but that should not be a problem if your retention period is quite long. Maybe change the rollover criteria to something like this:

{
    "max_primary_shard_size": "50gb",
    "max_age": "10d"
}

3. Forcemerge indices no longer written to
It may also be beneficial to add an ILM step that force merges indices down to a single segment once they have rolled over, if you do not already have this in place.
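
A sketch of what that force merge step could look like in the ILM policy (the policy name "messages-policy" and the min_age value are just placeholders):

PUT _ilm/policy/messages-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}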

Really appreciate your professional suggestions.

  1. We do include routing in the request.

  2. About the shard size and shard count, are there other considerations, such as node count, hard drive or memory? In most cases we currently keep data for 14 days, maybe shorter.

  3. We do some upserts to the data. I can add a force merge in the hot-to-warm transition. Will this affect the upserts?

  4. The performance issue only happens when messages come through, i.e. during indexing. I am wondering whether it would help to increase the refresh interval and the query cache size (see the sketch after this list). Any suggestions?
    Thanks.
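
For reference, this is the kind of refresh interval change I have in mind (just a sketch; the 30s value and the messages-* pattern are only examples):

PUT messages-*/_settings
{
  "index.refresh_interval": "30s"
}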

If you are already doing this, that is good.

5GB is quite a small shard size. Given that you search by ID, I would expect you to benefit from larger shards. A 50GB shard size is not uncommon and something I would try to achieve.

A larger number of primary shards would make your routing more efficient, so it would probably be a worthwhile change. I would increase this gradually, e.g. initially go from 3 to 6. If you achieve the larger shard size with this, you may later increase it to 9.
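
As a sketch, the primary shard count for newly created indices is set in the index template (the template name and index pattern here are assumptions, and your real template will contain more settings, e.g. the ILM policy):

PUT _index_template/messages-template
{
  "index_patterns": ["messages-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 6,
      "index.number_of_replicas": 1
    }
  }
}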

If you have a short retention period, e.g. 14 days, you need to make sure that you get a suitable number of indices to cover this period. In this case I would probably keep the rollover period at 1 day so you get at least 14 indices covering the retention period.

If you are updating existing data it does not make any sense to forcemerge down to a single segment. Forcemerging down to a single segment is I/O intensive and completely undone if you update or write to the index.

If that is the case I would look at storage I/O performance, especially at times when you are performing indexing and seeing slow queries. What does iostat -x show?

Thanks, Christian. I will try 6 shards and a 30GB shard size first.

Here is the iostat output from one of the nodes:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          28.18    0.00    8.49   43.96    0.00   19.37

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0             0.00      0.00     0.00   0.00    0.00     0.00    1.50      0.01     0.00   0.00   39.00     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.06   2.00
dm-1             0.00      0.00     0.00   0.00    0.00     0.00   46.50      0.18     0.00   0.00   66.11     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.07  19.25
dm-2           402.50    177.94     0.00   0.00  103.34   452.70  136.00     15.49     0.00   0.00   77.03   116.66    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   52.07  95.50 
sda            570.50    181.68     0.00   0.00  100.72   326.11   44.00     13.04   150.00  77.32   71.44   303.55    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   60.60  96.50
sr0              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

Those are very high (bad) values for r_await and w_await, which indicates that the performance of the storage you are using is likely the bottleneck.

I would concentrate on improving the storage you are using. If the storage is this slow, I do not think any of the other optimisations will make much difference.

Hi Christian,
Thanks for your help. We set up an environment with 30GB shards + 6 primary shards on SSD drives. The query performance while writing is still not ideal (20s+). I am thinking it might help if we could separate writes and reads. Does ES support separating writes and reads, e.g. one node set up for writing and another node set up for reading, with its data synced from the writing node?

No, that kind of separation is not possible.