When we're writing to the Elasticsearch cluster, reads are 10,000x slower (understanding a slow term query)

Hello,

I have recently indexed 25M docs into 2 data nodes (+1 master), 5 shards, across 3 Availability Zones. I'm running on AWS ES, on large boxes with plenty of EBS storage; resource usage is low and the average doc size is < 1 KiB. I'm performing the following term query with profiling (it's also very slow without profiling); subscriber_ids is mapped as a keyword.

curl -X GET "host/index/_search?pretty&human=true" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "size": 0,
  "query": {
    "term": {
      "subscriber_ids": {
        "value": 17224548,
        "boost": 1.0
      }
    }
  }
}
'

The total took is ~20 seconds. However, the profile comes back like the example below, with only hundreds of microseconds of time, which seems to show that the actual query execution on ES is very fast and there's some other bottleneck. I'm guessing it could be the network? What's the best way to dig deeper into where the performance bottleneck is?

{
        "id" : "[qgiqODOIQqOlWaprcK3KyQ][flink_groups][2]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "PointRangeQuery",
                "description" : "subscriber_ids:[17224548 TO 17224548]",
                "time" : "484.1micros",
                "time_in_nanos" : 484190,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 0,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 0,
                  "match" : 0,
                  "next_doc_count" : 0,
                  "score_count" : 0,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 9917,
                  "advance_count" : 9,
                  "score" : 0,
                  "build_scorer_count" : 19,
                  "create_weight" : 908,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 473336
                }
              }
            ],
            "rewrite_time" : 14461,
            "collector" : [
              {
                "name" : "EarlyTerminatingCollector",
                "reason" : "search_count",
                "time" : "13.9micros",
                "time_in_nanos" : 13992
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
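In case it helps anyone point me in the right direction, these are the stock diagnostics I can run against the domain (host/index below are placeholders, and some of these APIs may be restricted on AWS ES); happy to share their output:

# Per-node hot threads: shows where CPU time is going right now
curl -X GET "host/_nodes/hot_threads"

# Search and write thread pool queues and rejections
curl -X GET "host/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected"

# Index-level refresh, merge and search stats
curl -X GET "host/index/_stats/refresh,merge,search?pretty"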

Thanks!

Forgot to mention: there's simultaneously a lot of bulk indexing traffic updating these documents, thousands of docs per second.

Interesting progress: I started a new index, and when I hit the old index now (0 writes) my reads are sub-10 ms. So it would seem the writes are somehow slowing down all the reads.

Setting refresh_interval to 1s seems to have fixed the issue. If anyone has an explanation as to why, I'd appreciate it.

curl -X PUT "host/index/_settings?pretty&human=true" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval" : "1s"
  }
}
'

I guess you're using a 7.x version?

Maybe this explains it?

By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.

And

How often to perform a refresh operation, which makes recent changes to the index visible to search. Defaults to 1s. Can be set to -1 to disable refresh. If this setting is not explicitly set, shards that haven't seen search traffic for at least index.search.idle.after seconds will not receive background refreshes until they receive a search request. Searches that hit an idle shard where a refresh is pending will wait for the next background refresh (within 1s). This behavior aims to automatically optimize bulk indexing in the default case when no searches are performed. In order to opt out of this behavior an explicit value of 1s should be set as the refresh interval.
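If you want to check what an index is actually using, you can ask for its effective settings including the defaults (host/index are placeholders) and look for index.refresh_interval and index.search.idle.after in the output:

curl -X GET "host/index/_settings?include_defaults=true&pretty"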

I'm running in AWS ES

BTW, did you look at Cloud by Elastic, also available if needed from the AWS Marketplace?

Cloud by Elastic is one way to have access to all features, all managed by us. Think about what is already there, like Security, Monitoring, Reporting, SQL, Canvas, Maps UI, Alerting, and the built-in solutions named Observability, Security, Enterprise Search, and what is coming next 🙂 ...


You nailed it. This makes sense: the refresh rate changes dynamically based on read load unless it's explicitly set.

I'll take a look at your cloud suggestion, thanks!

I'm having a very similar problem, but I don't have the capacity to play around with the settings, let alone know how to do so on Elastic Cloud..!

How would I go about changing the refresh interval on Elastic Cloud? I'm thinking 30 seconds? I don't need the data there instantly, as it loads every 5 minutes or so, but searching is so slow, and don't get me started on trying to use Graph.

It's surprising: we're using Elastic Cloud and having massive wait times when searching, and no one has spotted this performance issue or suggested anything to us, so I think "managed by us" only covers "is it up?"..!

It depends.

If you are an e-commerce website doing a lot of searching, then the refresh period will be something like 1s.

If you have a logging use case, most likely you are running only a few searches, so the refresh delay will be higher. Which is OK, I think.

So the behavior looks good to me for the majority of use cases.
But if something is wrong, you can always restore the old behavior by refreshing every second. You'll just "pay" the normal price of much higher disk I/O, as more merges will be needed.
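To answer the "how": you can run the same settings update that worked above, either with curl against your deployment endpoint or from Kibana's Dev Tools console; the index name is a placeholder, and 30s is just your suggested value, not a recommendation:

curl -X PUT "host/index/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval" : "30s"
  }
}
'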

Very similar to the OP, we're ingesting about 300,000 documents an hour, but the main thing we use it for is searching the logs and providing insight into trends. When using Graph, the search crashes most of the time; with normal querying it normally takes about 30 seconds to search a week of data (any more and you could be waiting a significant time!).

In my opinion, search speed is far more important than indexing speed or having "current" data; a ~5 minute delay is fine!

What would be the best setting? 1s sounds like it would get hammered by the ingestion and never return the results...?

It sounds like your cluster is simply overloaded. Changing the refresh interval can make some difference, but I would not necessarily expect a dramatic one. What is the size of your cluster? How much data do you index per day? What is your retention period? How many indices and shards are you generating per day?
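If you don't have those numbers to hand, the _cat APIs will give you most of them (host is a placeholder):

# Nodes, their roles, and resource usage
curl -X GET "host/_cat/nodes?v&h=name,node.role,disk.used_percent,heap.percent,cpu"

# Per-index shard counts, doc counts, and size on disk
curl -X GET "host/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=index"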

It's currently 4 nodes (2 hot, 2 warm) with a tiebreaker node thrown in for free. We're ingesting about 17GB per day, about 1M documents; retention is 60 days, with ILM moving data from hot to warm after 7 days. We have a daily index with the default shard settings: 1 primary and 1 replica.

It probably is overloaded, but working out what I'd need is just a finger-in-the-air job for me, no clue really! In terms of the performance metrics, it's sitting at about 30% CPU most of the time, and JVM heap doesn't look crazy high either, tbh.
