Query that finds duplicate log lines and their respective counts

Hi all,

I am doing a school project where I am using your product for a log management system.
I have lots of data, and I want to know which log lines are duplicated and how many duplicates there are of each particular log line.

I tried this query, in which I successfully extracted the duplicate counts.

GET /_all/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "beat.hostname": "server-x"
          }
        },
        {
          "match": {
            "log_level": "WARNING"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-48h",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "message_description.keyword",
        "min_doc_count": 2,
        "size": 10000
      }
    }
  }
}

This works for a log line that contains 'AuthToken not found', as you can see here:

"aggregations" : {
"duplicateNames" : {
  "doc_count_error_upper_bound" : 0,
  "sum_other_doc_count" : 0,
  "buckets" : [
    {
      "key" : "AuthToken not found [ ]",
      "doc_count" : 657
    }
  ]
}
  }

But for some weird reason it doesn't work for log lines that contain more characters. I tried the very same query, only with log_level: "CRITICAL". That should give me the other log lines from the CRITICAL level, but somehow the buckets array is empty.

I hope someone can help me with this weird problem.

Thanks,
Andries

What do you mean by "a log line that contains more characters" exactly? Can you give an example of those log lines?

Well, I have a feeling it doesn't work when the log line has more than x characters. For example, this log line:

Uncaught PHP Exception ErrorException: "Warning: include(/data/httpd/api/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory" at /data/httpd/api/xxx/vendor/composer/ClassLoader.php line 444

I have multiple log lines that are exactly like this one, but for some reason the query mentioned above doesn't give me a bucket for them.

Can it be that the .keyword messes it up? Or is my query incorrect?

You are absolutely right: by default a .keyword field will only contain values up to 256 characters. You can see that by looking at your index's mappings:

GET my_index/_mapping

You will see that the .keyword fields in the mapping have an ignore_above parameter with a value of 256.
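For example, the relevant part of that response will look something like this (abridged to just the message_description field from your query):

{
  "my_index" : {
    "mappings" : {
      "properties" : {
        "message_description" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Any value longer than that ignore_above limit is silently skipped when indexing the keyword sub-field, which is why your long CRITICAL log lines never show up in the terms aggregation.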

You can change the value of ignore_above. You would typically do that when creating the index, by providing an explicit mapping. You can also change it on existing indexes, but be aware that applying the new limit to documents that are already indexed is quite an expensive operation, as it requires Elasticsearch to rewrite all the data.
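When creating a new index, that explicit mapping could look something like this (the index name and the 1024 limit are placeholders; pick a limit that comfortably fits your longest log lines):

// my_new_index and the 1024 limit are placeholders for illustration
PUT my_new_index
{
  "mappings": {
    "properties": {
      "message_description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  }
}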

To update existing indexes, first update the mapping:

PUT my_index/_mapping
{
  "properties": {
    "message_description": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 1024
        }
      }
    }
  }
}

Next, you can reindex the existing data by executing an _update_by_query request:

POST my_index/_update_by_query?wait_for_completion=false

The last operation will run in the background and will take some time to complete, depending on how much data you have.
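Because wait_for_completion=false was passed, the request immediately returns a task ID (in the form node_id:task_number), which you can use to check on its progress through the tasks API:

// the task ID below is a placeholder; use the one returned by your _update_by_query request
GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345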
