Find documents with multiple occurrences within a time span


(Adam Michalik) #1

I have documents with a following structure:

{
  "timestamp": "2018-01-01T12:34:56.000Z",
  "message": "Some message",
  "idNumber": 12345678
}

The message field holds various messages, there are multiple messages per idNumber.
I would like to find the idNumbers for which there are two or more documents with message equal to "Hello" within one hour. How can I achieve that (if it's possible at all)?


(Mark Harwood) #2

This feels a little hacky but works on a small set of data:

PUT test
{
  "settings": {
	"number_of_replicas": 0,
	"number_of_shards": 1
  },
  "mappings": {
	"doc":{
	  "properties": {
		"date":{
		  "type":"date",
		  "format":"yyyy-MM-dd HH:mm"
		},
		"id":{
		  "type":"keyword"
		},
		"message":{
		  "type":"text"
		}
	  }
	}
  }
}
POST test/doc/_bulk
{"index":{}}
{"date":"2018-01-01 01:01", "id":"bad","message":"hello"}
{"index":{}}
{"date":"2018-01-01 01:02", "id":"bad","message":"hello"}
{"index":{}}
{"date":"2018-01-01 01:02", "id":"good","message":"hello"}
{"index":{}}
{"date":"2018-01-01 03:02", "id":"good","message":"hello"}



GET test/_search?q=hello
{
  "size": 0,
  "aggs": {
	"terms": {
	  "terms": {
		"min_doc_count": 2,
		"shard_min_doc_count": 1,
		"script": {
		  "source": """
 doc['date'].value.getHourOfDay().toString() + "-"+
 doc['date'].value.getDayOfMonth().toString()+ "-"+ 
 doc['date'].value.getMonthOfYear().toString()+ "-"+
 doc['date'].value.getYear().toString()+ "-"+
 doc["id"].value
"""
		}
	  }
	}
  }
}  

I'm sure there's a better way of formatting the date but essentially I combine the ID and a string representation of an hour-level bucket. Obviously this doesn't spot 2 events close to an hour boundary e.g. 12:59 and 13:01 but might be good enough. If you have many unique IDs and they are spread across multiple shards then this will not scale and you need to think about indexing approaches that bring related data closer together.


(Adam Michalik) #3

Thanks, Mark. This is for reporting purposes, so speed is not so much of a requirement and the results can be approximate as long as I can get a good sample of occurrences. I'll try that out.


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.