About intersection functionality

Hi all, I am running into a problem with intersection functionality.

I have the following mapping:
{
"message": {"type": "text"},
"created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"}
}

Now I want to implement a monitoring program that finds the messages that do NOT show up in the past N days (say N=3 or N=7) but show up today with count >= M. For now, I can find the messages that do NOT show up in the past few days, and the messages that show up today with count >= M, each independently, but HOW CAN I PERFORM AN INTERSECTION?

Is there any better solution than computing two separate sets and intersecting them?
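For what it's worth, the brute-force intersection I am doing now looks roughly like this on the client side (a minimal sketch; `find_new_frequent`, `recent_messages`, and `today_counts` are hypothetical names standing in for the results of my two separate queries):

```python
# Client-side intersection of the two result sets (sketch).
# recent_messages: messages seen in the past N days (first query).
# today_counts: message -> count for today (second query).

def find_new_frequent(recent_messages, today_counts, m):
    """Messages absent in the past N days but seen >= m times today."""
    recent = set(recent_messages)
    return {msg for msg, count in today_counts.items()
            if count >= m and msg not in recent}

# Example: "c" appears 7 times today and never in the past N days.
new_frequent = find_new_frequent(["a", "b"], {"a": 5, "c": 7, "d": 1}, 3)
```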

Hi,

I am not sure if this is what you want, but you can execute the following query and check the values of the aggregations. First you should aggregate by the message. Since message is a text field, it cannot be aggregated unless you enable fielddata (in which case the message would be broken into tokens); instead, you can add a keyword sub-field to the mapping and run an update_by_query. Then do a nested date_range aggregation on the dates and check whether the first range bucket is 0 and the second is more than M.

{
  "size": 0,
  "aggs": {
    "status_field": {
      "terms": {
        "field": "message.keyword",
        "size": 1000
      },
      "aggs": {
        "date_monitoring": {
          "date_range": {
            "field": "created_at",
            "ranges": [
              {
                "from": "now-3d/d",
                "to": "now/d"
              },
              {
                "from": "now/d",
                "to": "now"
              }
            ]
          }
        }
      }
    }
  }
}

Yes, it works the ES way! I knew ES could do such a thing, but I couldn't come up with that kind of logic myself.

It is very close to what I need. Thank you.

BTW, I added a bucket_selector right after the last aggregation, which helps me filter out the buckets where the range agg shows 0.

Hi,

Since you mentioned you were missing that logic, I think this could fit your needs.

{
  "size": 0,
  "aggs": {
    "status_field": {
      "terms": {
        "field": "message.keyword",
        "size": 1000
      },
      "aggs": {
        "date_3d": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-3d/d",
                "lt": "now/d"
              }
            }
          }
        },
        "date_today": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now/d",
                "lt": "now"
              }
            }
          }
        },
        "days_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "countDays": "date_3d._count",
              "countToday": "date_today._count"
            },
            "script": "params.countDays == 0 && params.countToday > 0"
          }
        }
      }
    }
  }
}

Cool, actually I've come up with a 95%-similar query by myself :slight_smile:
Thank you so much.

You are welcome. That's awesome. :slight_smile:

Hi Tamara, I've run into another problem here. Could you please help me with this?

Say I have a mapping like

{
    "transaction_id": {"type": "keyword"},
    "transaction_pattern": {"type": "keyword"},
    "transaction_cost": {"type": "integer"},
    "created_at": {"type": "date", "format": "......"}
}

There are about 500 distinct transaction_pattern values in my index, and the number of these patterns can be viewed as fixed.

Now I am asked to extract the documents that were created in the last 10 minutes (because I have to run this query every 10 minutes) and whose transaction_cost is 50% higher than the AVG cost of the same transaction_pattern over the last 7 days.

The following is what I've done so far, but I am still far from my goal.

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": {
          "created_at": {
            "gte": "now-7d/d",
            "lte": "now/d"
          }
        }
      }
    }
  },
  "aggs": {
    "groupby": {
      "terms": {
        "field": "transaction_pattern",
        "size": 1000
      },
      "aggs": {
        "last_ten_min": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-10m/m",
                "lte": "now-1m/m"
              }
            }
          }
        },
        "last_ten_min_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_ten_min._count"
            },
            "script": {
              "inline": "params.count > 0"
            }
          }
        },
        "last_seven_day": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-7d/d",
                "lte": "now-1d/d"
              }
            }
          },
          "aggs": {
            "avg_cost": {
              "avg": {
                "field": "transaction_cost"
              }
            }
          }
        },
        "last_seven_day_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_seven_day._count"
            },
            "script": {
              "inline": "params.count > 0"
            }
          }
        }
      }
    }
  }
}

Obviously, I can get the docs created in the last 10 minutes in the last_ten_min agg, and I can calculate the AVG cost of the same pattern over the last 7 days in the last_seven_day agg, but HOW CAN I MAKE USE OF the avg cost to filter the docs in the last_ten_min agg?

For example, say there is a transaction_pattern like "hello, world" that occurred many times in the last 7 days, and the AVG cost of this specific pattern is 100. In the last 10 minutes, "hello, world" occurred 5 times, with costs 180, 120, 90, 100, and 170 respectively. My goal is to find the docs with cost 180 and 170, because both 180 and 170 are 50% higher than the 7-day average cost of 100.
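In plain Python, the per-document selection rule I have in mind is just this (a sketch; the numbers come from the example above):

```python
# Selection rule from the "hello, world" example (sketch).
seven_day_avg = 100                      # 7-day average cost for this pattern
today_costs = [180, 120, 90, 100, 170]   # costs seen in the last 10 minutes

threshold = seven_day_avg * 1.5          # "50% higher" cutoff = 150.0
flagged = [c for c in today_costs if c > threshold]
# flagged keeps only 180 and 170
```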

Hi Acepcs,

Try something like this:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": {
          "created_at": {
            "gte": "now-7d/d",
            "lte": "now/d"
          }
        }
      }
    }
  },
  "aggs": {
    "groupby": {
      "terms": {
        "field": "transaction_pattern",
        "size": 1000
      },
      "aggs": {
        "last_ten_min": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-10m/m",
                "lte": "now-1m/m"
              }
            }
          },
          "aggs": {
            "avg_cost": {
              "avg": {
                "field": "transaction_cost"
              }
            }
          }
        },
        "last_ten_min_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_ten_min._count"
            },
            "script": "params.count > 0"
          }
        },
        "last_seven_day": {
          "filter": {
            "range": {
              "created_at": {
                "gte": "now-7d/d",
                "lte": "now-1d/d"
              }
            }
          },
          "aggs": {
            "avg_cost": {
              "avg": {
                "field": "transaction_cost"
              }
            }
          }
        },
        "last_seven_day_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "last_seven_day._count"
            },
"script": "params.count > 0"
          }
        },
        "avg_filter": {
          "bucket_selector": {
            "gap_policy": "insert_zeros",
            "buckets_path": {
              "avgSevenD": "last_seven_day>avg_cost.avg",
              "avgTenM": "last_ten_min>avg_cost.avg"
            },
            "script": "params.avgSevenD * 1.5  <= params.avgTenM"
          }
        }
      }
    }
  }
}

I've tried the same logic before, but then I realized it's NOT exactly what I need.

In your solution, you calculate the AVG cost of the patterns created in the last 10 minutes, then compare that AVG cost with the seven_day_avg_cost. But I want to know exactly which docs have a cost 50% higher than the seven_day_avg_cost.

Let me use the "hello, world" example above again. The avg cost of (180, 120, 90, 100, 170) is 132, which is lower than 100 * 1.5 = 150. So with your query, I am NOT able to find the doc with cost 180 or the doc with cost 170.

Could you post some examples of documents you have?

Sure, here are the simplest examples I can give.

{
    "transaction_id": "abc001",
    "transaction_pattern": "hello",
    "transaction_cost": 150,
    "created_at": "2018-08-17 15:30:00"
}
{
    "transaction_id": "abc008",
    "transaction_pattern": "hello",
    "transaction_cost": 60,
    "created_at": "2018-08-17 19:00:00"
}
{
    "transaction_id": "abc030",
    "transaction_pattern": "hello",
    "transaction_cost": 100,
    "created_at": "2018-08-18 12:30:00"
}
{
    "transaction_id": "abc017",
    "transaction_pattern": "hello",
    "transaction_cost": 80,
    "created_at": "2018-08-17 22:30:00"
}
{
    "transaction_id": "abc087",
    "transaction_pattern": "hello",
    "transaction_cost": 135,
    "created_at": "2018-08-22 15:00:00"
}
{
    "transaction_id": "abc088",
    "transaction_pattern": "hello",
    "transaction_cost": 160,
    "created_at": "2018-08-22 15:30:00"
}

So you can clearly see there are 6 docs that share the same pattern "hello": 4 of them were created in the last 7 days and 2 of them were created today.

The AVG cost of the docs created in the last 7 days is (150+60+100+80) / 4 = 97.5. So I want to find the doc whose transaction_id is "abc088", because its transaction_cost is HIGHER than 97.5 * 1.5 = 146.25. But I DON'T want to include "abc087" in the returned results.

Hi Acepcs,

I have tried a lot of different solutions, and I could not find a way to filter a top_hits aggregation based on a parent metric aggregation. I would recommend that you filter the hits on the client side, or make a second query with the average in the filter.
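Filtering on the client side could look roughly like this (a hypothetical Python sketch; `filter_expensive`, `avg_by_pattern`, and `recent_hits` are made-up names for data pulled out of the two queries' responses):

```python
# Sketch of the client-side step: take the per-pattern 7-day averages
# from the aggregation response, then keep only the recent hits whose
# cost exceeds 1.5x their pattern's average. All names are hypothetical.

def filter_expensive(recent_hits, avg_by_pattern, factor=1.5):
    """Keep hits whose transaction_cost is > factor * 7-day pattern avg."""
    out = []
    for hit in recent_hits:
        avg = avg_by_pattern.get(hit["transaction_pattern"])
        if avg is not None and hit["transaction_cost"] > avg * factor:
            out.append(hit)
    return out

# Using the two recent "hello" documents from above (7-day avg = 97.5,
# so the cutoff is 146.25): only abc088 should survive.
hits = [
    {"transaction_id": "abc087", "transaction_pattern": "hello",
     "transaction_cost": 135},
    {"transaction_id": "abc088", "transaction_pattern": "hello",
     "transaction_cost": 160},
]
result = filter_expensive(hits, {"hello": 97.5})
```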

I appreciate your time and effort. Thank you, Tamara!
Yes, my current solution is filtering on the client side.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.