Query that finds duplicate log lines and their respective counts

Hi all,

I am doing a school project where I am using your product for a log management system.
I have lots of data, and I want to know which log lines are duplicated and how many duplicates there are of each particular log line.

I tried this query, in which I successfully extracted the duplicate counts.

GET /_all/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "beat.hostname": "server-x"
          }
        },
        {
          "match": {
            "log_level": "WARNING"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-48h",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "message_description.keyword",
        "min_doc_count": 2,
        "size": 10000
      }
    }
  }
}

This works for a log line that contains 'AuthToken not found', as you can see here:

"aggregations" : {
"duplicateNames" : {
  "doc_count_error_upper_bound" : 0,
  "sum_other_doc_count" : 0,
  "buckets" : [
    {
      "key" : "AuthToken not found [ ]",
      "doc_count" : 657
    }
  ]
}
  }

But for some weird reason it doesn't work for log lines that contain more characters. I tried the very same query, only with log_level: "CRITICAL". That should give me the other log lines from the CRITICAL level, but somehow the buckets array is empty.

I hope someone can help me with this weird problem.

Thanks,
Andries

What do you mean by "a log line that contains more characters" exactly? Can you give an example of those log lines?

Well, I have a feeling it doesn't work when the log line has more than x characters. For example, this log line:

Uncaught PHP Exception ErrorException: "Warning: include(/data/httpd/api/xxx/var/cache/dev/overblog/graphql-bundle/__definitions__/QueryType.php): failed to open stream: No such file or directory" at /data/httpd/api/xxx/vendor/composer/ClassLoader.php line 444

I have multiple log lines that are exactly like this one, but for some reason the query mentioned above doesn't give me a bucket for them.

Can it be that the .keyword messes it up? Or is my query incorrect?

You are absolutely right: by default a .keyword field will only contain values up to 256 characters. You can see that by looking at your index's mappings:

GET my_index/_mapping

You will see that the .keyword fields in the mapping have an ignore_above parameter with a value of 256.
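For example, the relevant part of that response will look something like this (abridged to just the message_description field from your query):

{
  "my_index" : {
    "mappings" : {
      "properties" : {
        "message_description" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Any value longer than that ignore_above limit is silently skipped when indexing the keyword sub-field, which is why your long CRITICAL log lines never show up in the terms aggregation.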

You can change the value of ignore_above. You would typically do that when creating the index, by providing an explicit mapping. You can also change it on existing indexes, but be aware that applying the new limit to documents that are already indexed is quite an expensive operation, as it requires Elasticsearch to rewrite all the data.
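When creating a new index, that explicit mapping could look something like this (the index name and the 1024 limit are placeholders; pick a limit that comfortably fits your longest log lines):

// my_new_index and the 1024 limit are placeholders for illustration
PUT my_new_index
{
  "mappings": {
    "properties": {
      "message_description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  }
}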

To update existing indexes, first update the mapping:

PUT my_index/_mapping
{
  "properties": {
    "message_description": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 1024
        }
      }
    }
  }
}

Next, you can reindex the existing data by executing an _update_by_query request:

POST my_index/_update_by_query?wait_for_completion=false

The last operation will run in the background and will take some time to complete, depending on how much data you have.
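Because wait_for_completion=false was passed, the request immediately returns a task ID (in the form node_id:task_number), which you can use to check on its progress through the tasks API:

// the task ID below is a placeholder; use the one returned by your _update_by_query request
GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345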
