Query optimization!


(Jurgis Babinskas) #1

Hello elastic gurus, I am pretty new to the whole stack and some help would be much appreciated. I have this query that runs pretty often and it eats A LOT of resources. It works as intended results-wise, but I would like to know if you have any advice on how to make it better!

GET cdr/_search
{
  "query":{
    "constant_score":{
      "filter":{
        "bool":{
          "should":[
            {
              "range":{
                "startTime":{
                  "gte":"2017-01-21T23:00+02:00",
                  "lte":"2017-01-21T23:50+02:00"
                }
              }
            },
            {
              "range":{
                "stopTime":{
                  "gte":"2017-01-21T23:00+02:00",
                  "lte":"2017-01-21T23:50+02:00"
                }
              }
            },
            {
              "bool":{
                "must":[
                  {
                    "range":{
                      "startTime":{
                        "lte":"2017-01-21T23:00+02:00"
                      }
                    }
                  },
                  {
                    "range":{
                      "stopTime":{
                        "gte":"2017-01-21T23:50+02:00"
                      }
                    }
                  }
                ]
              }
            }
          ],
          "must":[
            {
              "term":{
                "sourceIp":{
                  "value":"XX.XX.X.XX"
                }
              }
            }
          ]
        }
      }
    }
  }
}

Sorry for the bad formatting. The first part of the query is pretty simple: the two range clauses get all the documents whose startTime or stopTime falls within the provided date range. The second bool gets all the documents that cross the provided range without starting or stopping inside it (for example, documents that have 2017-01-21T22:59+02:00 as startTime and 2017-01-21T23:59+02:00 as stopTime). The last step keeps only the documents that are associated with a particular client (IP address). The data types are startTime DATE, stopTime DATE, sourceIp KEYWORD. Any ideas how I could make this better? The total data size on disk is around 50 TB, if that plays any role.


(Adrien Grand) #2

I could be wrong but it seems to me that the first two range queries are not necessary since they are covered by the inner bool query?

This will require reindexing but I suspect you might want to look into indexing your startTime and stopTime fields as a single date range field rather than two separate date fields. Then you can directly query the range field to find the intersection with a single range query, and it is usually faster than using two range queries. https://www.elastic.co/guide/en/elasticsearch/reference/current/range.html
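A minimal sketch of what the range-field approach could look like, written as Python dicts that serialize to the request bodies. The field name `timeRange` and the helper names are hypothetical (they are not from this thread); the `date_range` type and the `relation` parameter are described on the linked page, but verify them against your Elasticsearch version before relying on this.

```python
import json

FIELD = "timeRange"  # hypothetical field name, not from the thread


def range_mapping():
    """Mapping properties with one date_range field replacing startTime/stopTime."""
    return {
        "properties": {
            FIELD: {"type": "date_range"},
            "sourceIp": {"type": "keyword"},
        }
    }


def intersect_query(q_start, q_stop, source_ip):
    """Search body: documents whose stored range intersects [q_start, q_stop]."""
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "filter": [
                            {"term": {"sourceIp": source_ip}},
                            {"range": {FIELD: {
                                "gte": q_start,
                                "lte": q_stop,
                                "relation": "intersects",
                            }}},
                        ]
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    body = intersect_query(
        "2017-01-21T23:00+02:00", "2017-01-21T23:50+02:00", "XX.XX.X.XX"
    )
    print(json.dumps(body, indent=2))
```

Each document would then store its interval as an object such as `{"timeRange": {"gte": <startTime>, "lte": <stopTime>}}`, and the three `should` clauses collapse into the single `range` clause above.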


(Jurgis Babinskas) #3

Nah, the inner bool query only covers records whose startTime is before the start of the range and whose stopTime is after the end of it, so it does not take into account records that, for example, start or stop inside the given range; it only takes records that extend past the given range on both sides. I was actually thinking about the date_range type, but I kind of fear reindexing because I have around 10 TB of data running on one node :( I guess that could take some time... but I guess I have no choice. Thanks for the help.
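For reference, a rough sketch of a `_reindex` request body that copies the two date fields into a single range field, again built as a Python dict. The destination index `cdr_ranges` and the field `timeRange` are hypothetical, and the Painless script is an untested assumption. Older 5.x releases used `inline` instead of `source` as the script key, so check the documentation for your version. The destination index would need its `date_range` mapping created before the reindex starts.

```python
import json

# Painless script (an assumption, untested against a live cluster): copy the two
# date fields into one range object under the hypothetical field `timeRange`.
SCRIPT = (
    "ctx._source.timeRange = "
    "['gte': ctx._source.startTime, 'lte': ctx._source.stopTime]"
)


def reindex_body(source_index="cdr", dest_index="cdr_ranges"):
    """Body for POST _reindex; `cdr_ranges` is a hypothetical destination index."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {"lang": "painless", "source": SCRIPT},
    }


if __name__ == "__main__":
    print(json.dumps(reindex_body(), indent=2))
```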


(Adrien Grand) #4

I had misread your query indeed. But then could you do range matching with just this query:

            {
              "bool":{
                "must":[
                  {
                    "range":{
                      "startTime":{
                        "lte":"2017-01-21T23:50+02:00" // query stop time
                      }
                    }
                  },
                  {
                    "range":{
                      "stopTime":{
                        "gte":"2017-01-21T23:00+02:00" // query start time
                      }
                    }
                  }
                ]
              }
            }

It looks similar to your inner bool query except that I swapped the start / stop bounds. If all your indexed documents have a stop time that is greater than or equal to the start time, then this should match all intersecting ranges?
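The claim can be checked by brute force. The sketch below models timestamps as small integers (an assumption made purely for illustration) and compares the original three-clause condition with the swapped-bounds condition for every interval where the stop is not before the start. The function names are hypothetical.

```python
from itertools import product


def original_match(start, stop, q_start, q_stop):
    """The three `should` clauses from the first post."""
    starts_inside = q_start <= start <= q_stop
    stops_inside = q_start <= stop <= q_stop
    spans_window = start <= q_start and stop >= q_stop
    return starts_inside or stops_inside or spans_window


def simplified_match(start, stop, q_start, q_stop):
    """Swapped bounds: starts before the window ends and stops after it begins."""
    return start <= q_stop and stop >= q_start


def find_disagreements(limit=6):
    """List every (start, stop, q_start, q_stop) with start <= stop and
    q_start <= q_stop on which the two conditions give different answers."""
    values = range(limit)
    return [
        case
        for case in product(values, repeat=4)
        if case[0] <= case[1] and case[2] <= case[3]
        and original_match(*case) != simplified_match(*case)
    ]


if __name__ == "__main__":
    print(find_disagreements())  # prints []
```

The empty result means the two conditions agree whenever the stop time is not before the start time. They can differ for malformed documents: with start 3, stop 0 and a query window of 2 to 4, the original condition matches (the start falls inside the window) while the simplified one does not.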

