[Aggregations] Average for document count per term field


(eliasah) #1

It might be a silly question, so excuse me.

I have the following index events with a type click and the simplified following mapping:

 {
    "events": {
       "mappings": {
          "click": {
             "properties": {
                "evenType": {
                   "type": "string"
                },
                "eventDate": {
                   "type": "date",
                   "format": "dateOptionalTime"
                },
                "userId": {
                   "type": "long"
                }
             }
          }
       }
    }
 }

I'm trying to compute the number of clicks for the top users compared to the number of unique users.

I am kind of stuck with the following, which returns the number of clicks for the top users:

 curl -XGET "http://localhost:9200/events/click/_search" -d'
 {
   "size": 0,
   "aggs": {
     "users with most clicks": {
       "terms": {
         "field": "userId",
         "size": 10
       }
     }
   }
 }'

and computing the number of unique users :

 curl -XGET "http://localhost:9200/events/click/_search" -d'
 {
   "size": 0,
   "aggs": {
     "uniques users": {
       "cardinality": {
         "field": "userId"
       }
     }
   }
 }' 

Is it possible to combine those two aggregations so that the result is the average number of clicks per user?

Thanks in advance.


(Mark Harwood) #2

I'm not 100% sure what you are after, but if you want the average number of clicks per user, that would be the total number of clicks recorded divided by the number of unique users. Both of these numbers are returned in the results of your last example query, in the JSON fields hits/total and aggregations/uniques users/value.
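The division Mark describes can be sketched client-side like this (a minimal Python sketch, not from the thread; the response body below is a made-up example shaped like the cardinality query's output):

```python
# Hypothetical response from the cardinality query above.
# With "size": 0, Elasticsearch still reports the total hit count.
response = {
    "hits": {"total": 113041},
    "aggregations": {"uniques users": {"value": 1234}},
}

total_clicks = response["hits"]["total"]                           # hits/total
unique_users = response["aggregations"]["uniques users"]["value"]  # aggregations/uniques users/value
avg_clicks_per_user = total_clicks / unique_users
```

Both numbers come back in a single response, so no second request is needed.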

Cheers
Mark


(eliasah) #3

I know how to process them outside of Elasticsearch if I send both queries mentioned above.

What I'm trying to do is the following :

Let's consider that a user i made a number of clicks Y_i, which means that we have Y_i observations with userId = i.
What I'm trying to compute is the density of user i's clicks Y_i over all the observations.

  • I compute Y_i for a specific user i with the terms aggregation described above.
  • U is the number of unique users, computed with the second (cardinality) aggregation.
  • I want to compute Y_i / U for the top 10 users.
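Client-side, the steps above can be sketched as follows (the bucket keys, doc_counts, and U below are hypothetical example values, not real query results):

```python
# Hypothetical terms-aggregation buckets: key = userId, doc_count = Y_i.
top_users = [
    {"key": 101, "doc_count": 500},
    {"key": 202, "doc_count": 300},
]
unique_users = 50  # U, from the cardinality aggregation

# Y_i / U for each of the top users.
scaled = {b["key"]: b["doc_count"] / unique_users for b in top_users}
```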

I don't know if that is clear enough.


(Mark Harwood) #4

Less clear I'm afraid.

U is the number of unique users
Y_i / U for the top 10 users.

U here is a constant. So you want to scale all of the doc_counts reported for each of the top 10 users by this constant?
I don't get why this would be useful. It's like finding the top 10 vehicles with the most miles on the clock and then dividing those numbers by the total number of car manufacturers. It does not appear to serve any purpose.

Can we start with stating the business problem you are trying to solve rather than how you intend to solve it?


(eliasah) #5

U can change over time; that's why I don't want to consider it a constant.

I'm not interested in the business value as much as the way to do this operation.

We can simplify the problem to the following :

"buckets": [
        {
           "key": 2096544,
           "doc_count": 48168
        },
        {
           "key": 68753,
           "doc_count": 33208
        },
        {
           "key": 34,
           "doc_count": 31665
        },
        [...]
]

As a result of the first terms aggregation.

I want to compute doc_count / number of hits, where the keys are the userIds.

How can I perform this operation?
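One way to perform this operation client-side, in a Python sketch (the buckets mirror the ones above; the hits total is a made-up example value). Since the terms query is run with "size": 0, the total hit count is still reported alongside the aggregation, so one request suffices:

```python
# Hypothetical response from the terms query; doc_counts as in the buckets above.
response = {
    "hits": {"total": 113041},
    "aggregations": {
        "users with most clicks": {
            "buckets": [
                {"key": 2096544, "doc_count": 48168},
                {"key": 68753, "doc_count": 33208},
                {"key": 34, "doc_count": 31665},
            ]
        }
    },
}

total_hits = response["hits"]["total"]
# doc_count / number of hits, keyed by userId.
densities = {
    b["key"]: b["doc_count"] / total_hits
    for b in response["aggregations"]["users with most clicks"]["buckets"]
}
```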


(Mark Harwood) #6

I'm not interested in the business value
I want to compute the doc_count / number of hits where the keys are the userId.

This changes the previous definition: Y_i / U is not the same as doc_count / number of hits

U can move over time, that's why I don't want to consider it as a constant.

I meant it can be considered as a constant for the purposes of your single request. It's equivalent to saying "I want to multiply all reported doc_counts by 0.234235" - it is an arbitrary fixed boost applied to rebase all doc_count values and does nothing to change the ranking order used to select the top 10 users.
We do see real-world examples of using Elasticsearch on click data, and there are powerful analysis techniques available, but unfortunately they do not extend to your example of re-basing doc_count numbers for display purposes.


(eliasah) #7

Ok. Thank you for trying to help!

