Aggregation Framework, possible to get distribution of requests per user


(Thomas) #1

Hi,

I wanted to ask whether the aggregation framework can return the
distribution of one specific type of document sent per user. I'm
interested in the number of documents per user, e.g.:

1000 users sent 1 document
500 users sent 2 documents
X unique users sent Y documents (each)
etc.

Each document is indexed with the user_id.

Is there a way to support such a query, fully or partially, for example
returning only the first 10 rows of this list rather than the exhaustive
list? Can you give me a hint?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c9e7e543-372c-4441-9cac-e7c0f259ed4e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(David Pilato) #2

Imagine that you have indexed users, and that each user has a numberOfDocs field.

You can build a range aggregation on top of that field, which gives back the count for buckets like:

numberOfDocs < 2
1 < numberOfDocs < 3

See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-range-aggregation.html#search-aggregations-bucket-range-aggregation
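
As a rough illustration of this suggestion, the request body for such a range aggregation could be built as below. This is only a sketch: it assumes a separate index of user documents with an integer field named numberOfDocs, and the aggregation name docs_per_user is made up. In a range aggregation, `from` is inclusive and `to` is exclusive, so for integer counts the two buckets cover "fewer than 2" and "exactly 2".

```python
import json


def range_request(field="numberOfDocs"):
    """Build a search body with one range aggregation over `field`."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "aggs": {
            "docs_per_user": {  # hypothetical aggregation name
                "range": {
                    "field": field,
                    "ranges": [
                        {"to": 2},             # numberOfDocs < 2
                        {"from": 2, "to": 3},  # 1 < numberOfDocs < 3
                    ],
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the search request body.
    print(json.dumps(range_request(), indent=2))
```

Each returned bucket would then carry a doc_count, which here is the number of user documents falling in that range.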

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr



(Thomas) #3

Hi David

Thank you for your reply. So, based on your suggestion, I should maintain a
document (e.g. user) with some aggregated values and update it as we index
our data, correct?

That would only give me totals, though; I cannot apply something like a
range. I also found a similar discussion here:
https://groups.google.com/forum/#!msg/elasticsearch/UsrCG2Abj-A/IDO9DX_PoQwJ
Maybe something along the lines of the terms and histogram aggregations
could support this logic, for example with a request like:

{
  "aggs" : {
    "requests_distribution" : {
      "distribution" : {
        "field" : "user_id",
        "interval" : 50
      }
    }
  }
}

and the result could be:

{
  "aggregations": {
    "requests_distribution" : {
      "buckets": [
        {
          "key": 0,
          "doc_count": 2
        },
        {
          "key": 50,
          "doc_count": 400
        },
        {
          "key": 150,
          "doc_count": 30
        }
      ]
    }
  }
}

Here the key would represent a range of unique user counts and doc_count
the number of documents per user; for example, the first bucket says that
0 to 50 users have 2 documents each, and so on.

Just an idea

Thanks
Thomas



(David Pilato) #4

I was only thinking out loud; I mean that I don't know what your model looks like.
Maybe you could illustrate your use case with some actual data and we can move forward from there?

What kind of documents are you actually indexing and searching for? What fields do you have?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr



(Thomas) #5

My mistake, sorry.

Here is an example.

This is the mapping of the request document:

"request": {
  "dynamic" : "strict",
  "properties" : {
    "time" : {
      "format" : "dateOptionalTime",
      "type" : "date"
    },
    "user_id" : {
      "index" : "not_analyzed",
      "type" : "string"
    },
    "country" : {
      "index" : "not_analyzed",
      "type" : "string"
    }
  }
}

I want to find the number of (unique) user_ids that have X documents,
e.g. for country US, and ideally I need the full list, e.g.:

1000 users have 43 documents
..
100 users have 234 documents
150 users have 500 documents
etc.

In other words, I want the distribution of documents (requests) per unique
user. I understand that this is a fairly heavy operation in terms of
memory, but we could limit it to the top 100 rows, for instance, or work
around it some other way.

Thanks again for your time
Thomas
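
One way to approximate this with the mapping above is a terms aggregation on user_id, restricted to one country, followed by counting the counts on the client. The sketch below is an illustration under stated assumptions, not an answer confirmed in this thread: the aggregation names (by_country, per_user), the cap of 10000 users and the sample buckets are all made up. A terms aggregation only returns the top `size` users, so the resulting distribution is partial when there are more users than that.

```python
from collections import Counter


def per_user_request(country="US", max_users=10000):
    """Search body: keep one country, then bucket documents by user_id."""
    return {
        "size": 0,  # skip the hits, return aggregations only
        "aggs": {
            "by_country": {  # hypothetical aggregation name
                "filter": {"term": {"country": country}},
                "aggs": {
                    "per_user": {  # hypothetical aggregation name
                        "terms": {"field": "user_id", "size": max_users}
                    }
                },
            }
        },
    }


def distribution(buckets, top=100):
    """Return (documents per user, number of users) pairs, most common first."""
    counts = Counter(bucket["doc_count"] for bucket in buckets)
    return counts.most_common(top)


# Made-up buckets in the shape a terms aggregation returns.
sample_buckets = [
    {"key": "u1", "doc_count": 2},
    {"key": "u2", "doc_count": 1},
    {"key": "u3", "doc_count": 2},
    {"key": "u4", "doc_count": 5},
    {"key": "u5", "doc_count": 2},
    {"key": "u6", "doc_count": 1},
]

for docs, users in distribution(sample_buckets):
    print(f"{users} users have {docs} documents")
```

The `top` argument mirrors the idea of limiting the output to the top 100 rows; the memory cost on the cluster side is still driven by the number of distinct user_ids that the terms aggregation has to track.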


