Could I query all docs where there are multiple occurrences of my term?


(Thomas Chung) #1

I want to use Elasticsearch to index event data and as events come in, I want to append to an "events" array in the document. Say it ends up looking like this:

{
  "user":"abc",
  "events": [
    {
      "action": "foo",
      "date": "2015-01-01"
    },
    {
      "action": "foo",
      "date": "2016-01-01"
    },
    {
      "action": "bar",
      "date": "2017-01-01"
    }
  ]
}

And another doc looks like:

{
  "user":"def",
  "events": [
    {
      "action": "foo",
      "date": "2017-01-01"
    },
    {
      "action": "baz",
      "date": "2018-01-01"
    }
  ]
}

I get that I can query/filter for all users/docs with action=foo but could I:

  • find all the documents where action=foo has occurred, say, exactly 2 times?
  • find all the documents where action=foo has occurred 2 times between some start and end date.

Thanks in advance.


How to do a self-join type of query
(Mark Harwood) #2

To avoid "cross matching" of the dates for foo and bar you'll need to use the nested data type for the events.

I've not yet figured out how to trim results to those with 2 different events but this should at least avoid confusing matches on the dates of your "foo" events with those of "bar" events::

DELETE test
PUT test
{
  "mappings": {
	"_doc": {
	  "properties": {
		"user":{
		  "type":"keyword"
		},
		"events": {
		  "type": "nested",
		  "properties": {
			"action": {
			  "type": "keyword"
			},
			"date":{
			  "type":"date"
			}
		  }
		}
	  }
	}
  }
}
PUT test/_doc/1
{
  "user":"abc",
  "events": [
	{
	  "action": "foo",
	  "date": "2015-01-01"
	},
	{
	  "action": "foo",
	  "date": "2016-01-01"
	},
	{
	  "action": "bar",
	  "date": "2017-01-01"
	}
  ]
}
PUT test/_doc/2
{
  "user":"abc",
  "events": [
	{
	  "action": "foo",
	  "date": "2015-01-01"
	},
	{
	  "action": "bar",
	  "date": "2016-01-01"
	},
	{
	  "action": "foo",
	  "date": "2017-01-01"
	}
  ]
}
GET test/_search
{
  "query": {
	"bool": {
	  "must": [
		{
		  "nested": {
			"path": "events",
			"query": {
			  "bool": {
				"must": [
				  {
					"match":{
					  "events.action":"foo"
					}
				  },
				  {
					"range": {
					  "events.date": {
						"gte": "2013-01-01",
						"lte": "2016-12-31"
					  }
					}
				  }
				]
			  }
			},
			"inner_hits": {}
		  }
		}
	  ]
	}
  },
  "stored_fields": [
	"user"
  ]
}

The use of inner_hits prevents you from seeing non-matching events. Your client would need to trim results that only matched one event,

Having said that, what can be hard to match at query time with searches/aggregations can be trivial at index time using an update script - see entity centric indexing
Useful attributes like an "outage" range field can be derived by a script recognising the 2 event types of interest and storing a derived value when those events come in as an update.


(Thomas Chung) #3

Thanks @Mark_Harwood. I watched the video about entity centric indexing and it looks good and I will try that approach.

I think I can index individual events as they occur (the "doc" will be the event) into an "events" index. To find users that have done "foo exactly twice" (like in my original post), I can periodically update a secondary "user" index and a user doc, that has a "didFooTwice" boolean. Then a simple bool query can get me what I want.


(system) closed #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.