About intersection functionality

Hi all, I am running into a problem with intersection functionality.

I have the following mapping:
{
"message": {"type": "text"},
"created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"}
}

Now I want to implement a monitoring program that finds the messages that do NOT show up in the past N days (say N=3 or N=7) but show up today with count >= M. For now, I can find the messages that do NOT show up in the past few days, and the messages that show up today with count >= M, each independently, but HOW CAN I PERFORM AN INTERSECTION?

Is there any better solution than computing two separate sets and intersecting them?
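For what it's worth, the brute-force intersection I am doing now looks roughly like this on the client side (a minimal sketch; `find_new_frequent`, `recent_messages`, and `today_counts` are hypothetical names standing in for the results of my two separate queries):

```python
# Client-side intersection of the two result sets (sketch).
# recent_messages: messages seen in the past N days (first query).
# today_counts: message -> count for today (second query).

def find_new_frequent(recent_messages, today_counts, m):
    """Messages absent in the past N days but seen >= m times today."""
    recent = set(recent_messages)
    return {msg for msg, count in today_counts.items()
            if count >= m and msg not in recent}

# Example: "c" appears 7 times today and never in the past N days.
new_frequent = find_new_frequent(["a", "b"], {"a": 5, "c": 7, "d": 1}, 3)
```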

Hi,

I am not sure if this is what you want, but you can execute the following query and check the values of the aggregations. First you should aggregate by the message. Since message is a text field, it cannot be aggregated unless you enable fielddata (in which case the message would be broken into tokens); instead, you can add a keyword sub-field to the mapping and run an update_by_query. Then do a nested date_range aggregation on the dates and check whether the first range bucket is 0 and the second is more than M.

{
  "size": 0,
  "aggs": {
    "status_field": {
      "terms": {
        "field": "message.keyword",
        "size": 1000
      },
      "aggs": {
        "date_monitoring": {
          "date_range": {
            "field": "created_at",
            "ranges": [
              {
                "from": "now-3d/d",
                "to": "now/d"
              },
              {
                "from": "now/d",
                "to": "now"
              }
            ]
          }
        }
      }
    }
  }
}

Yes, it works the ES way! I knew ES could do such a thing, but I couldn't come up with that kind of logic myself.

It is very close to what I need. Thank you.

BTW, I added a bucket_selector right after the last aggregation, which helps me filter out the buckets where the range agg shows 0.

Hi,

Since you mentioned you were missing that logic, I think this could fit your needs.

{
  "size": 0,
  "aggs": {
    "status_field": {
      "terms": {
        "field": "message.keyword",
        "size": 1000
      },
      "aggs": {
        "date_3d": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-3d/d",
                "lt": "now/d"
              }
            }
          }
        },
        "date_today": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now/d",
                "lt": "now"
              }
            }
          }
        },
        "days_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "countDays": "date_3d._count",
              "countToday": "date_today._count"
            },
            "script": "params.countDays == 0 && params.countToday > 0"
          }
        }
      }
    }
  }
}

Cool, actually I've come up with a 95%-similar query by myself :slight_smile:
Thank you so much.

You are welcome. That's awesome. :slight_smile:

Hi Tamara, I've run into another problem here. Could you please help me with this?

Say I have a mapping like

{
    "transaction_id": {"type": "keyword"},
    "transaction_pattern": {"type": "keyword"},
    "transaction_cost": {"type": "integer"},
    "created_at": {"type": "date", "format": "......"}
}

There are about 500 distinct transaction_pattern values in my index, and the number of these patterns can be viewed as fixed.

Now I am asked to extract the documents that were created in the last 10 minutes (because I have to run this query every 10 minutes) and whose transaction_cost is 50% higher than the AVG cost of the same transaction_pattern over the last 7 days.

The following is what I've done so far, but I am still far from my goal.

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": {
          "created_at": {
            "gte": "now-7d/d",
            "lte": "now/d"
          }
        }
      }
    }
  },
  "aggs": {
    "groupby": {
      "terms": {
        "field": "transaction_pattern",
        "size": 1000
      },
      "aggs": {
        "last_ten_min": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-10m/m",
                "lte": "now-1m/m"
              }
            }
          }
        },
        "last_ten_min_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_ten_min._count"
            },
            "script": {
              "inline": "params.count > 0"
            }
          }
        },
        "last_seven_day": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-7d/d",
                "lte": "now-1d/d"
              }
            }
          },
          "aggs": {
            "avg_cost": {
              "avg": {
                "field": "transaction_cost"
              }
            }
          }
        },
        "last_seven_day_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_seven_day._count"
            },
            "script": {
              "inline": "params.count > 0"
            }
          }
        }
      }
    }
  }
}

Obviously, I can get the docs created in the last 10 minutes in the last_ten_min agg, and I can calculate the AVG cost of the same pattern over the last 7 days in the last_seven_day agg, but HOW CAN I MAKE USE OF the avg cost to filter the docs in the last_ten_min agg?

For example, say there is a transaction_pattern like "hello, world" that occurred many times in the last 7 days, and the AVG cost of this specific pattern is 100. In the last 10 minutes, "hello, world" occurred 5 times, with costs 180, 120, 90, 100, and 170 respectively. My goal is to find the docs with cost 180 and 170, because both 180 and 170 are 50% higher than the 7-day average cost of 100.
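In plain Python, the per-document selection rule I have in mind is just this (a sketch; the numbers come from the example above):

```python
# Selection rule from the "hello, world" example (sketch).
seven_day_avg = 100                      # 7-day average cost for this pattern
today_costs = [180, 120, 90, 100, 170]   # costs seen in the last 10 minutes

threshold = seven_day_avg * 1.5          # "50% higher" cutoff = 150.0
flagged = [c for c in today_costs if c > threshold]
# flagged keeps only 180 and 170
```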

Hi Acepcs,

Try something like this:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": {
          "created_at": {
            "gte": "now-7d/d",
            "lte": "now/d"
          }
        }
      }
    }
  },
  "aggs": {
    "groupby": {
      "terms": {
        "field": "transaction_pattern",
        "size": 1000
      },
      "aggs": {
        "last_ten_min": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-10m/m",
                "lte": "now-1m/m"
              }
            }
          },
          "aggs": {
            "avg_cost": {
              "avg": {
                "field": "transaction_cost"
              }
            }
          }
        },
        "last_ten_min_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_ten_min._count"
            },
            "script": "params.count > 0"
          }
        },
        "last_seven_day": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-7d/d",
                "lte": "now-1d/d"
              }
            }
          },
          "aggs": {
            "avg_cost": {
              "avg": {
                "field": "transaction_cost"
              }
            }
          }
        },
        "last_seven_day_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_seven_day._count"
            },
"script": "params.count > 0"
          }
        },
        "avg_filter": {
          "bucket_selector": {
            "gap_policy": "insert_zeros",
            "buckets_path": {
              "avgSevenD": "last_seven_day>avg_cost.avg",
              "avgTenM": "last_ten_min>avg_cost.avg"
            },
            "script": "params.avgSevenD * 1.5  <= params.avgTenM"
          }
        }
      }
    }
  }
}

I've tried the same logic before, but then I realized it's NOT exactly what I need.

In your solution, you calculate the AVG cost of the patterns created in the last 10 minutes, then compare that AVG cost with the seven_day_avg_cost. But I want to know exactly which docs have a cost 50% higher than the seven_day_avg_cost.

Let me use the "hello, world" example above again. The avg cost of (180, 120, 90, 100, 170) is 132, which is lower than 100 * 1.5 = 150. So with your query, I am NOT able to find the doc with cost 180 or the doc with cost 170.

Could you post some examples of documents you have?

Sure, here are the simplest examples I can give.

{
    "transaction_id": "abc001",
    "transaction_pattern": "hello",
    "transaction_cost": 150,
    "created_at": "2018-08-17 15:30:00"
}
{
    "transaction_id": "abc008",
    "transaction_pattern": "hello",
    "transaction_cost": 60,
    "created_at": "2018-08-17 19:00:00"
}
{
    "transaction_id": "abc030",
    "transaction_pattern": "hello",
    "transaction_cost": 100,
    "created_at": "2018-08-18 12:30:00"
}
{
    "transaction_id": "abc017",
    "transaction_pattern": "hello",
    "transaction_cost": 80,
    "created_at": "2018-08-17 22:30:00"
}
{
    "transaction_id": "abc087",
    "transaction_pattern": "hello",
    "transaction_cost": 135,
    "created_at": "2018-08-22 15:00:00"
}
{
    "transaction_id": "abc088",
    "transaction_pattern": "hello",
    "transaction_cost": 160,
    "created_at": "2018-08-22 15:30:00"
}

So you can clearly see there are 6 docs that share the same pattern "hello": 4 of them were created in the last 7 days and 2 of them were created today.

The AVG cost of the docs created in the last 7 days is (150+60+100+80) / 4 = 97.5. So I want to find the doc whose transaction_id is "abc088", because its transaction_cost is HIGHER than 97.5 * 1.5 = 146.25. But I DON'T want to include "abc087" in the returned results.

Hi Acepcs,

I have tried a lot of different solutions, and I could not find a way to filter a top_hits aggregation based on a parent metric aggregation. I would recommend that you filter the hits on the client side, or make a second query with the average in the filter.
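Filtering on the client side could look roughly like this (a hypothetical Python sketch; `filter_expensive`, `avg_by_pattern`, and `recent_hits` are made-up names for data pulled out of the two queries' responses):

```python
# Sketch of the client-side step: take the per-pattern 7-day averages
# from the aggregation response, then keep only the recent hits whose
# cost exceeds 1.5x their pattern's average. All names are hypothetical.

def filter_expensive(recent_hits, avg_by_pattern, factor=1.5):
    """Keep hits whose transaction_cost is > factor * 7-day pattern avg."""
    out = []
    for hit in recent_hits:
        avg = avg_by_pattern.get(hit["transaction_pattern"])
        if avg is not None and hit["transaction_cost"] > avg * factor:
            out.append(hit)
    return out

# Using the two recent "hello" documents from above (7-day avg = 97.5,
# so the cutoff is 146.25): only abc088 should survive.
hits = [
    {"transaction_id": "abc087", "transaction_pattern": "hello",
     "transaction_cost": 135},
    {"transaction_id": "abc088", "transaction_pattern": "hello",
     "transaction_cost": 160},
]
result = filter_expensive(hits, {"hello": 97.5})
```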

I appreciate your time and effort. Thank you, Tamara!
Yes, my current solution is filtering on the client side.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.