Query on timeseries data very slow

kk123 · September 20, 2022, 9:01pm

Hi, this follows previous questions. I have merged all my indices into 1 ilm based index, with a added field that contains the previous "index" name...

Generally this is fine, with some performance hit. Queries are generally a but slower, but OK. I've done this for scalability as I had to many indices. However I am finding a particular query to be much slower...

I am trying to determine the min and max date of a range of documents. On the same test node (which has the new merged index and all the old indices) using the previous many index model, the query returns in 260ms but the same query on the newly merged index it takes 2500ms. What is going on?

I run two queries and sort by date in both direction, taking the top document.

{
      "query": {
        "bool": {
          "must": [
            {
              "term": {
                "dataseriesId": dataseriesId
              }
            },
            {
              "range": {
                "date": {
                  "gte": from.format(),
                  "lte": to.format()
                }
              }
            }
          ]
        }
      },
      "sort": {
        "date": sort
      },
      "size": 1
    }

which gives me the document with the min and max of the date field for docs with the given dataseriesId.

from and to are my chosen date range.

index template mapping is

mappings: {
          _source: {
            enabled: true
          },
          _routing: {
            required: false
          },
          properties: {
            type: { type: 'keyword', index: true },
            datasetId: { type: 'keyword', index: true },
            dataseriesId: { type: 'keyword', index: true },
            date: { type: 'date', format: 'date_time_no_millis' },
            floatValue: { type: 'float', index: false, ignore_malformed: true, coerce: false },
            stringValue: { type: 'keyword', index: false },
            dateValue: { type: 'date', index: false, format: 'date_time_no_millis' }
          }
        },

kk123 · September 21, 2022, 9:56am

I've managed to massively improve this query (now 40ms instead of 2500ms)... by using aggreagations... although it would be good to understand why it is so much slower/faster just because of merging the indices..

My new query is

"query": {
        "bool": {
          "must": [
            {
              "term": {
                "dataseriesId": dataseriesId
              }
            },
            {
              "range": {
                "date": {
                  "gte": from.format(),
                  "lte": to.format()
                }
              }
            }
          ]
        }
      },
      "aggs": {
        "min": { "min": { "field": "date" } },
        "max": { "max": { "field": "date" } }
      },
      "size": 0

Christian_Dahlqvist · September 23, 2022, 6:33am

How many indices and shards did you originally have?

How many do you have now?

Do the query target all indices in both scenarios?

kk123 · September 23, 2022, 7:01am

Hi Christian, thanks for the reply.

Originally there where 275 indices and now there is 1. This is a staging server, so not at full load. I am hosting both the 275 indices and the merged 1 index on the same ES node for testing purposes and comparing the same query going through the 275 indices vs the 1 index.

In both versions of the query (275 vs 1 index) the actual query would only target 1 index each. That is because it can choose the index based on an id, which has become a field in the merged index.

Thanks,

Christian_Dahlqvist · September 23, 2022, 7:37am

Querying a single small index should be faster than querying a single larger index, so that is not surprising. The significant different in timings does surprise me though, but would depend on how caching is applied. Do the latency stay at 2500ms even if you run a few queries in sequence?

When deciding how to shard data there is always a tradeoff. Indexing into a single index can be a lot more efficient than spreading the data over a lot of small indices. There is always a potentially significant difference in overhead.

kk123 · September 23, 2022, 9:23am

Yes indeed. I was expecting a slight hit, but not that much. Yes the latency stays at 2500ms in fact it starts at 3000ms then comes down after a few requests...

I did also find it interesting that then using the aggregation option on the query which would have been going through the same documents on the merged index was so much faster... from 2500ms down to 40ms? I don't understand what is happening behind the scene, that could cause that difference on a similar query.... That said it does take it from two queries down to one... but 1250ms to 50ms is still a big change...

Christian_Dahlqvist · September 23, 2022, 9:37am

An aggregation does not need to score the documents that match, so that saves some time. It also does not need to retrieve any document from disk. What is the specification of your cluster? How much data do you have in it?

system · October 21, 2022, 9:38am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Optimizing Elasticsearch for Timeseries data Elasticsearch	4	1648	July 5, 2017
Time range query performance [7.6] Elasticsearch	18	5189	May 5, 2020
Performance querying time-based indices in a date range Elasticsearch	3	2374	August 3, 2020
Slow Aggregation over Time based indices. Each index has around ~10M records Elasticsearch	6	427	October 14, 2019
Speed Up Date range query Elasticsearch	5	1214	August 21, 2020

Query on timeseries data very slow

Related topics