Timeline query on timestamped indices


(Srinath C) #1

Hi,

We write data into timestamped indices of the form {prefix}_{yyyyMMdd}. The "timestamp" field in each document determines which of these timestamped indices it is written to.

For a query that spans a longer interval of time, say 30 days, what is the best way to do a timeline query on these timestamped indices?

A search with wildcards, such as POST /{prefix}_*/_search, will run the query on all shards in the cluster.

Since the indices are timestamped can we specify a range of indices to run the query on?


(Zachary Tong) #2

I would include the date condition as a range query inside your search request. Elasticsearch will take care of the rest.

Elasticsearch has a "pre-search" phase which we call a CanMatch phase. This was introduced for your exact situation, which happens frequently with Kibana users too.

During the CanMatch phase, shards are checked at a high level to see if they can possibly match the query. In particular, things like ranges on dates are very easy to check since each shard knows the min/max value in date fields.

So the shard will report back to the coordinating node that it can't possibly match the range specified in the query, and that entire shard is skipped during query processing. So only the subset of shards that can actually answer the query are checked... the other shards/indices that are outside the range are ignored.
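Conceptually, the CanMatch check is just a min/max overlap test per shard. A toy sketch in Python (not Elasticsearch's actual implementation; the shard statistics here are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Shard:
    index: str
    min_ts: int  # minimum "timestamp" value stored in this shard (epoch millis)
    max_ts: int  # maximum "timestamp" value stored in this shard (epoch millis)

def can_match(shard: Shard, gte: int, lt: int) -> bool:
    # A shard can only contain matching docs if its [min_ts, max_ts] range
    # overlaps the query's [gte, lt) range; otherwise it can be skipped
    # without ever executing the query on it.
    return shard.max_ts >= gte and shard.min_ts < lt

shards = [
    Shard("tenant_20180425", 1524614400000, 1524700799000),
    Shard("tenant_20180426", 1524700800000, 1524787199000),
]

# One-hour query window: 2018-04-25 23:00-24:00 UTC
matching = [s.index for s in shards if can_match(s, 1524697200000, 1524700800000)]
```

Only the first shard overlaps the window, so the second is skipped entirely.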

Edit: alternatively, you could just enumerate out all the indices you want to search:

POST /foo_2018-01-01,foo_2018-01-02,foo_2018-01-03,.../_search
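If you go the enumeration route, the list is easy to generate rather than type by hand. A small sketch, assuming the {prefix}_{yyyyMMdd} naming from the opening post (the helper name is hypothetical):

```python
from datetime import date, timedelta

def indices_for_range(prefix: str, start: date, end: date) -> str:
    """Comma-separated index names covering [start, end] inclusive,
    for daily indices named {prefix}_{yyyyMMdd}."""
    days = (end - start).days + 1
    names = [f"{prefix}_{(start + timedelta(days=i)):%Y%m%d}" for i in range(days)]
    return ",".join(names)

path = indices_for_range("tenant", date(2018, 4, 20), date(2018, 4, 22))
# The result can then be used as POST /{path}/_search
```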


(Srinath C) #3

@polyfractal Thanks much for the reply.

Consider a 1 hour time range query as:
{
  "query": {
    "range": {
      "timestamp": { "gte": 1524697200000, "lt": 1524700800000 }
    }
  }
}
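For reference, those epoch-millisecond bounds decode to a one-hour UTC window; a quick Python check:

```python
from datetime import datetime, timezone

gte, lt = 1524697200000, 1524700800000  # epoch milliseconds from the query above
start = datetime.fromtimestamp(gte / 1000, tz=timezone.utc)
end = datetime.fromtimestamp(lt / 1000, tz=timezone.utc)
span_hours = (lt - gte) / 3_600_000  # milliseconds per hour
```

The window is 2018-04-25 23:00 to 2018-04-26 00:00 UTC, i.e. exactly one hour.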

On our systems, we see that:

A query with wild card pattern takes around 8 seconds

curl -XPOST -d@date-range.json http://localhost:9200/tenant*/my_type/_search?pretty\&size=0
{
  "took" : 8841,
  "timed_out" : false,
  "_shards" : {
    "total" : 855,
    "successful" : 855,
    "failed" : 0
  },
  "hits" : {
    "total" : 5361,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

Whereas the same query directed to the specific index takes around 83 milliseconds

curl -XPOST -d@date-range.json http://localhost:9200/tenant*2018042*/my_type/_search?pretty\&size=0
{
  "took" : 83,
  "timed_out" : false,
  "_shards" : {
    "total" : 50,
    "successful" : 50,
    "failed" : 0
  },
  "hits" : {
    "total" : 5361,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

Does the "pre-search" phase account for such latencies?

Regarding the comma-separated indices in the path: it works as long as the path does not exceed the maximum URL length. Is there a better way?

Thanks.


(Zachary Tong) #4

Hm, what version of ES are you using? The pre-search phase I mentioned was added in 5.6, I forgot it was so recent. Based on the format of your response JSON, I'm thinking you're on an older version (the new format includes a skipped counter to show how many shards didn't match).

In that case, the extra 8s is almost certainly caused by talking to all the shards, since there's no pre-search phase. You'll have to specify the indices you care about via comma-delimited list I'm afraid (or wildcards where appropriate). Or upgrade :slight_smile:

Correct. You can increase the max URL length if you wish... that limitation is mainly there for security reasons (to prevent a denial of service attack by a malicious third party sending multi-megabyte/gigabyte URLs which might fill up the heap quickly).

Alternatively, if it's a set of indices you search frequently, you could hide the long URL behind an alias.
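A sketch of building that alias request body in Python, assuming a rolling 30-day window over the {prefix}_{yyyyMMdd} indices (the helper and alias names are hypothetical; the resulting JSON would be POSTed to the real /_aliases endpoint):

```python
import json
from datetime import date, timedelta

def alias_actions(prefix: str, alias: str, end: date, days: int) -> str:
    """Build a POST /_aliases request body that adds the last `days`
    daily indices (ending at `end`) to a single alias."""
    actions = [
        {"add": {"index": f"{prefix}_{(end - timedelta(days=i)):%Y%m%d}",
                 "alias": alias}}
        for i in range(days)
    ]
    return json.dumps({"actions": actions})

body = alias_actions("tenant", "tenant_last30d", date(2018, 4, 29), 30)
# Searches can then hit POST /tenant_last30d/_search instead of a long URL.
```

The alias must be refreshed as new daily indices appear (adding the new day, removing the oldest), which is the maintenance burden discussed below.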

As an aside, I noticed the one index has 50 shards (and 8k total). That's probably too many shards for a happy cluster, unless you happen to have a very large cluster with many nodes.


(Srinath C) #5

@polyfractal appreciate your help in understanding things better.

We are on 5.5.2 and I think it would be worth upgrading to 5.6 to get this fixed.

Yes, this was one possible solution, but it's cumbersome. We need a 30-day view for each day, so each day's index would be part of 30 aliases. Not sure if there is a better way.

No, the query ran on 855 shards and each index has 5 shards, which comes to around 171 indices (we keep up to 6 months of data).


(Zachary Tong) #6

Yeah, the alias approach is definitely burdensome. Even if automated it's still a pain when you need fine granularity views. Upgrading is probably the easier route at this point, since you're already on the 5.x series.

Oops, you're right. Misread the took time as the shard count. :slight_smile:

Although the second query (against tenant*2018042*) is the one that I was referring to with 50 shards:

 "_shards" : {
    "total" : 50,

(Srinath C) #7

Oh, this was probably the pattern matching 6 indices: 20180420, 20180421, 20180422, 20180423, 20180424 and 20180425 (the ones that existed on that day). Each with 5 shards would be 30 primary shards though. I'm not sure how I got 50 shards in the response. As of today it does match 50 shards because we have indices for 20180420 - 20180429.

Thanks for checking anyway. I went back and inspected the cluster to ensure we did not have any indices with such a high shard count.


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.