Query much faster against timestamp in seconds than milliseconds


(Pricj004) #1

I have a query which is significantly faster if I run it against a field containing a timestamp in seconds instead of milliseconds.

Query 1 - uses timestampMilliseconds, averages 500ms to execute:

{
  index: 'events',
  body: {
    query: {
      bool: {
        filter: {
          and: [
            { term: { userId: ... } },
            { range: { timestampMilliseconds: { lt: ... } } }
          ]
        }
      }
    }
  },
  sort: 'timestampMilliseconds:desc',
  size: 300,
  requestTimeout: Infinity
}

Query 2 - uses timestampSeconds, averages 30ms to execute:

{
  index: 'events',
  body: {
    query: {
      bool: {
        filter: {
          and: [
            { term: { userId: ... } },
            { range: { timestampSeconds: { lt: ... } } }
          ]
        }
      }
    }
  },
  sort: 'timestampSeconds:desc',
  size: 300,
  requestTimeout: Infinity
}

Those two fields are defined as:

  timestampMilliseconds: {
    type : 'date',
    format: 'epoch_millis'
  },
  timestampSeconds: {
    type: 'date',
    format: 'epoch_second'
  },
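For context, one hypothetical way to populate both fields from a single JavaScript timestamp (only the field names come from the mapping above; the document shape and values are illustrative):

```javascript
// Hypothetical document body for the mapping above: epoch_millis takes
// milliseconds since the epoch, epoch_second takes whole seconds.
const now = Date.now();
const doc = {
  userId: 'user-42',                        // illustrative value
  timestampMilliseconds: now,               // format: epoch_millis
  timestampSeconds: Math.floor(now / 1000)  // format: epoch_second
};
```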

Any ideas why this might be the case?

Removing the sort clauses doesn't seem to make any real difference to the execution time.


(Daniel Mitterdorfer) #2

Hi @pricj004,

both fields get stored as long internally, but timestampMilliseconds contains up to 1,000 times more distinct values than timestampSeconds, and that's where the difference comes from.
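(A minimal sketch of that cardinality difference, with made-up timestamps: every millisecond value within the same second collapses to a single epoch-second value.)

```javascript
// Three distinct millisecond timestamps that all fall within one second.
const millis = [1500000000000, 1500000000250, 1500000000999];

const distinctMillis = new Set(millis).size;                              // 3
const distinctSeconds = new Set(millis.map(m => Math.floor(m / 1000))).size; // 1
```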

Daniel


(Pricj004) #3

Hey @danielmitterdorfer,

Thanks, but I'm curious what's going on under the hood to cause that difference.

My data structures knowledge is a little rusty, but I thought indexes are usually built as either trees or hashes.

In a tree, more granular values might make a small difference to the depth of traversal, but probably wouldn't account for the improvement we saw?

And in a hash, if the values are ordered, the algorithm would simply be guessing the memory offset and 'hopping', narrowing down until it finds it. Again, I wouldn't think it would account for this kind of improvement?

Just trying to get my head around how it all works :slight_smile:

Cam.


(Adrien Grand) #4

Numbers are indexed in an inverted index, so each unique value is mapped to the list of documents that contain this value.

To have acceptable performance for range queries, numeric fields also index some prefix terms. For instance, long values (which dates are based on) index 4 terms per value by default: one that identifies all bits, one that identifies the first 48 bits, one that identifies the first 32 bits and one that identifies the first 16 bits. These prefix terms help query fewer terms at search time, which makes search faster: queries typically use the prefix terms, which each match many values, for the interior of the range, and only need to match exact values on the edges of the range. But since we use a precision step of 16 bits, there can still be up to 2^16 = 65536 values on each edge. However, if your dates are all multiples of 1000, there are suddenly only ~66 unique values at most on the edges, which makes querying faster.

Note that as of the next major release, querying will be based on a tree, so the performance characteristics of range queries will become completely different.


(Pricj004) #5

@jpountz This is really interesting. Thanks for the detailed explanation! :slight_smile:


(Jimferenczi) #6

Hi @pricj004,
I am writing this after reading https://mixmax.com/blog/30x-faster-elasticsearch-queries. It seems that the blog doesn't test the range query and gives the impression that sorting is faster on timestampSeconds. This is the opposite of what you found and shared here:

Removing the sort clauses doesn't seem to make any real difference to the execution time.

I would not expect sorting to be 30x faster with this change :wink:
