Optimizing access log query

YvorL · November 22, 2016, 2:16am

Hello all!

I'm looking for ways to optimize the current query for my access log. I'm trying to get all non-static requests to my dashboard as fast as possible. For now I just want to make the query itself as lean as possible.
At the moment it looks like this:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "match_phrase": { "file": "/www/myoldwebsite/logs/access.log" }},
            { "match_phrase": { "verb": "GET" }},
            { "range": { "@timestamp": { "gte": 1477231099000, "lte": 1477317499000, "format": "epoch_millis" }}}
          ],
          "must_not": [
            { "query": {
              "query_string": {
                "query": "request.raw:/.*\\.(css|js|jpg|png|gif).*/"
              }
            }}
          ]
        }
      }
    }
  }
}

I'll have to expand the list of file types later, but at the moment this one-day query takes more than a second to finish. I also want to derive several other queries from this one. Simply caching after the first request isn't a solution; the results have to be live data.
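
Since the extension list will grow and several variants are planned, the body could be generated from a template rather than edited by hand. The following is only a sketch in Python: `static_regex` and `build_query` are hypothetical helper names, not part of any Elasticsearch client, and the extensions are assumed to be plain alphanumeric strings that need no regex escaping.

```python
import json


def static_regex(extensions):
    """Build the regexp that matches requests for static files.

    Assumes every extension is plain alphanumeric (no escaping is done).
    """
    return "/.*\\.(" + "|".join(extensions) + ").*/"


def build_query(log_file, start_ms, end_ms, extensions):
    """Return the request body as a dict (hypothetical helper)."""
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": [
                            {"match_phrase": {"file": log_file}},
                            {"match_phrase": {"verb": "GET"}},
                            {"range": {"@timestamp": {
                                "gte": start_ms,
                                "lte": end_ms,
                                "format": "epoch_millis",
                            }}},
                        ],
                        "must_not": [
                            {"query_string": {
                                "query": "request.raw:" + static_regex(extensions),
                            }},
                        ],
                    }
                }
            }
        }
    }


body = build_query(
    "/www/myoldwebsite/logs/access.log",
    1477231099000,
    1477317499000,
    ["css", "js", "jpg", "png", "gif"],
)
print(json.dumps(body, indent=2))
```

The sketch places the `query_string` clause directly inside `must_not`, without the extra `query` wrapper; that assumes Elasticsearch 2.x or later, where any query can be used in filter context.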

Thank you in advance,
YvorL

mainec · November 22, 2016, 10:57am

One thing that stands out to me in the query above: Why are you using match_phrase queries for file and verb?

Wouldn't it be sufficient to store these two as not-analyzed (keyword) fields, as explained here:

https://www.elastic.co/guide/en/elasticsearch/guide/master/mapping-intro.html

and then use a TermQuery like so:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
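
As a rough illustration of that suggestion, the Python sketch below builds a field mapping with an exact-match sub-field and the matching term clauses. The sub-field name `raw` and the helper names are assumptions made for the example; the mapping syntax depends on the Elasticsearch major version (`keyword` from 5.x on, `string` with `"index": "not_analyzed"` before that).

```python
import json


def exact_match_mapping(es_major_version):
    """Field mapping with a non-analyzed `raw` sub-field (name is an assumption)."""
    if es_major_version >= 5:
        return {"type": "text", "fields": {"raw": {"type": "keyword"}}}
    return {
        "type": "string",
        "fields": {"raw": {"type": "string", "index": "not_analyzed"}},
    }


def term_clause(field, value):
    """Exact-match clause; the value is compared as-is, without analysis."""
    return {"term": {field: value}}


# Mapping fragment for the two fields discussed above.
properties = {
    "file": exact_match_mapping(5),
    "verb": exact_match_mapping(5),
}

# Clauses that would replace the two match_phrase entries.
must = [
    term_clause("file.raw", "/www/myoldwebsite/logs/access.log"),
    term_clause("verb.raw", "GET"),
]

print(json.dumps({"properties": properties}, indent=2))
print(json.dumps(must, indent=2))
```

A term clause skips analysis of the search text, so it only matches if the stored value is identical, including case.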

Hope this helps,
Isabel

YvorL · November 22, 2016, 11:29am

Thank you! I already had those fields indexed in both analyzed and non-analyzed form. Switching to the non-analyzed versions sped up the query, and I learned that searching for a fixed term is faster whenever it's possible. Now it looks like this:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "query": { "term": { "file.raw": "/www/myoldwebsite/logs/access.log" }}},
            { "query": { "term": { "verb.raw": "GET" }}},
            { "range": { "@timestamp": { "gte": 1477231099000, "lte": 1477317499000, "format": "epoch_millis" }}}
          ],
          "must_not": [
            { "query": {
              "query_string": {
                "query": "request.raw:/.*\\.(css|js|jpg|png|gif).*/"
              }
            }}
          ]
        }
      }
    }
  }
}

Other suggestions?

mainec · November 22, 2016, 11:36am

The other thing that looks odd is your wildcard usage here:

"query" : "request.raw:/.*\\.(css|js|jpg|png|gif).*/"

According to the query string query page of the Elasticsearch Guide, using wildcards can lead to performance penalties. Why not extract this information at ingest time and store it in a separate field?
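
To make the idea concrete, here is a small Python sketch of the kind of logic that could run before indexing. Where it would actually live (a Logstash filter, an ingest pipeline, a custom shipper) is left open, and the field name `is_static` and the function names are made up for the example. Note that it checks the extension of the URL path only, which is a stricter rule than the regexp above: query arguments are ignored.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions taken from the regexp in the query; extend as needed.
STATIC_EXTENSIONS = {"css", "js", "jpg", "png", "gif"}


def is_static_request(request):
    """Return True if the request path ends in a known static extension."""
    path = urlsplit(request).path  # drops ?arguments and #fragments
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in STATIC_EXTENSIONS


def enrich(event):
    """Return a copy of the log event with a boolean `is_static` field added."""
    enriched = dict(event)
    enriched["is_static"] = is_static_request(event["request"])
    return enriched


# With the flag indexed, the regexp clause could become a cheap term clause.
must_not = [{"term": {"is_static": True}}]

for sample in ["/css/site.css?v=3", "/index.php?file=a.css", "/api/users.json"]:
    print(sample, is_static_request(sample))
```

Because the decision is made once per log line at index time, the search side no longer has to run a regular expression over every term of `request.raw`.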

Isabel

YvorL · November 22, 2016, 11:51am

Yes, I know that this is the main problem with this query.
Unfortunately, there are several cases where I can't extract the information I need. Requests can have arguments, and there can be more than one file request at the same time. I'm working on a solution to make it work, but it'll take time to finish and implement it. Meanwhile, I need this query to be as efficient as it can be.
Thank you, Isabel!
