Optimizing access log query


#1

Hello all!

I'm looking for ways to optimize the current query for my access log. I'm trying to get all non-static requests to my dashboard as fast as possible. For now I just want to make the query itself as lean as possible.
Currently it looks like this:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "match_phrase": { "file": "/www/myoldwebsite/logs/access.log" }},
            { "match_phrase": { "verb": "GET" }},
            { "range": { "@timestamp": { "gte": 1477231099000, "lte": 1477317499000, "format": "epoch_millis" }}}
          ],
          "must_not": [
            { "query": {
              "query_string": {
                "query": "request.raw:/.*\\.(css|js|jpg|png|gif).*/"
              }
            }}
          ]
        }
      }
    }
  }
}

I'll have to expand the list of file types later, but at the moment this one-day query takes more than a second to finish. I also want to derive various other queries from this one. Simply caching after the first request isn't a solution, because the results have to be live data.

Thank you in advance,
YvorL


(Isabel Drost-Fromm) #2

One thing that stands out to me in the query above: Why are you using match_phrase queries for file and verb?

Wouldn't it be sufficient to store these two as not-analyzed/keyword fields, as explained here:

https://www.elastic.co/guide/en/elasticsearch/guide/master/mapping-intro.html

and then use a TermQuery like so:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

Hope this helps,
Isabel


#3

Thank you! I already had those fields indexed both analyzed and not analyzed. Switching to the non-analyzed versions sped up the query, and I learned that searching for a fixed term is faster whenever that's possible. Now it looks like this:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "query": { "term": { "file.raw": "/www/myoldwebsite/logs/access.log" }}},
            { "query": { "term": { "verb.raw": "GET" }}},
            { "range": { "@timestamp": { "gte": 1477231099000, "lte": 1477317499000, "format": "epoch_millis" }}}
          ],
          "must_not": [
            { "query": {
              "query_string": {
                "query": "request.raw:/.*\\.(css|js|jpg|png|gif).*/"
              }
            }}
          ]
        }
      }
    }
  }
}

Other suggestions?


(Isabel Drost-Fromm) #4

The other thing that looks odd is your wildcard usage here:

"query" : "request.raw:/.*\\.(css|js|jpg|png|gif).*/"

According to https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_wildcards, using wildcards can lead to performance penalties. Why not extract this information at ingest time and store it in a separate field?

Isabel
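
For readers following along, here is a minimal sketch of the ingest-time idea in Python. It is not from the thread and rests on assumptions: the event is a plain dict with a `request` field holding the raw request path, and the names `request_extension`, `is_static`, `enrich`, and the `extension` / `is_static` output fields are all hypothetical. It only looks at the extension of the path itself, so it is stricter than the regexp above, which matches the pattern anywhere in the request, query arguments included.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical list of static extensions, mirroring the regexp in the thread.
STATIC_EXTENSIONS = {"css", "js", "jpg", "png", "gif"}


def request_extension(request: str) -> str:
    """Return the lower-cased extension of the request path, or "" if none."""
    path = urlsplit(request).path        # drops "?query" and "#fragment"
    _, ext = posixpath.splitext(path)    # "/a/b.css" -> ".css"
    return ext[1:].lower()


def is_static(request: str) -> bool:
    """True if the request path ends in one of the static extensions."""
    return request_extension(request) in STATIC_EXTENSIONS


def enrich(event: dict) -> dict:
    """Return a copy of the event with the precomputed fields added."""
    request = event.get("request", "")
    return {
        **event,
        "extension": request_extension(request),
        "is_static": is_static(request),
    }
```

With a field like this stored at index time, the `must_not` regexp could in principle be replaced by a plain term filter on the precomputed field (for example on `is_static`), which avoids scanning terms with a leading `.*`. Whether this covers every request shape in a given log is something to verify against real data.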


#5

Yes, I know that this is the main problem with this query :(
Unfortunately, there are various cases where I can't extract the information I need: requests can have arguments, and there can be more than one file request at the same time. I'm working on a solution, but it will take time to finish and implement. Meanwhile, I need this query to be as efficient as it can be.
Thank you, Isabel!

