Something I am finding difficult, using Aggregations


#1

Having used Elasticsearch aggregations for a little while (and having used Mongo
aggregations previously), I have been finding a couple of things a bit
difficult/awkward.
I am not sure if it's because I don't know how to do it properly, or because we
are missing a feature/enhancement in Elasticsearch.

A common thing I want to do is aggregate on field x, but in the result, I
also want field y & z (which are unique for a given x) - there doesn't seem
to be an easy way to do that.

Let's say I have some data:
{
  "id" : "94538ef6-2998-4ddd-be00-1f5dc2654955",
  "quantity" : 1234567.2342,
  "commodityId" : "0e918fb8-6572-4663-a692-cbebe8aca7f2",
  "commodityName" : "Lead",
  "ownerId" : "53e0f816-8a0a-4659-b868-c48035676b25",
  "ownerName" : "Simon Chan",
  "locationId" : "1cdd4bc7-76d9-43fb-ac56-8f555164211a",
  "locationName" : "Shenyang - Shenyang Dongbei",
  "locationCode" : "W33",
  "locationCity" : "Shenyang",
  "locationCountry" : "China"
}

Let's say I want to do a terms aggregation on ownerId (because it's unique,
while ownerName obviously is not). I will get results where the bucket key
is the id. However, what I want to display to the user is the ownerName,
not the id. Looking up the name from the id could be very expensive, but
it's also unnecessary: the name is the same for every document in a given
bucket, and we already have the info to hand in the index. The same issue
arises if I want to aggregate by locationId or commodityId. We dereference
the data associated with an id at index time so that we can search on it, but
we also want to use this information to create a label for a bucket when we
aggregate.

Is there a simple way to retrieve ownerName while aggregating on ownerId?
The only way I know to do this is to:
a) make sure ownerName is not_analyzed, and
b) do a terms sub-aggregation on it, which will give only 1 result per bucket.
Is there an easier way I have missed?
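
For reference, the workaround in (a) and (b) might look something like the request body below, built here as a Python dict. This is only a sketch: the aggregation names `by_owner` and `owner_name` and the helper `owner_buckets_with_name` are made up for illustration, and it assumes ownerName is mapped as not_analyzed so that the whole name comes back as a single term.

```python
import json


def owner_buckets_with_name(size=10):
    """Build a search body: terms agg on ownerId, terms sub-agg on ownerName.

    The names "by_owner" and "owner_name" are hypothetical labels chosen
    for this example; they are not part of any mapping.
    """
    return {
        "size": 0,  # no search hits wanted, only the aggregation buckets
        "aggs": {
            "by_owner": {
                "terms": {"field": "ownerId", "size": size},
                "aggs": {
                    # ownerName is unique per ownerId, so one term is enough
                    "owner_name": {"terms": {"field": "ownerName", "size": 1}}
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(owner_buckets_with_name(), indent=2))
```

Each `by_owner` bucket in the response would then carry a single-entry `owner_name` bucket list holding the label.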

(FWIW, doing the same thing in, say, a Mongo aggregation is simply a matter
of adding the ownerName as a key field. Since it's unique for a given id,
it won't change the aggregation results; the ownerName info is simply
extracted from the key data in the result.)
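
To make the Mongo comparison concrete, here is a rough sketch of what such a compound key could look like, plus a pure-Python imitation of the grouping so the effect can be checked without a database. The output field name `totalQuantity` and the helper `group_by_compound_key` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical $group stage: ownerName rides along in the compound key.
# Because ownerName is unique per ownerId, the buckets are the same as
# when grouping on ownerId alone.
pipeline = [
    {
        "$group": {
            "_id": {"ownerId": "$ownerId", "ownerName": "$ownerName"},
            "totalQuantity": {"$sum": "$quantity"},
        }
    }
]


def group_by_compound_key(docs):
    """Imitate the $group stage above in plain Python."""
    totals = defaultdict(float)
    for doc in docs:
        totals[(doc["ownerId"], doc["ownerName"])] += doc["quantity"]
    return dict(totals)


if __name__ == "__main__":
    docs = [
        {"ownerId": "a1", "ownerName": "Simon Chan", "quantity": 1.5},
        {"ownerId": "a1", "ownerName": "Simon Chan", "quantity": 2.5},
        {"ownerId": "b2", "ownerName": "Simon Chan", "quantity": 1.0},
    ]
    for (owner_id, owner_name), total in group_by_compound_key(docs).items():
        print(owner_id, owner_name, total)
```

Note that two different owners who share a name still end up in separate buckets, because the id is part of the key.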

Cheers,
M

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cfcf8e74-06e7-4bf3-8cca-311dd14ccbe2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


#2

Am I on my own with this problem? Have I got it all wrong?

On Wednesday, 2 July 2014 13:21:26 UTC+1, mooky wrote:



(Mark Harwood-2) #3

There is a universal truth that computers want IDs and people prefer
looking at labels.
Almost every application has to handle this translation manually, and it
does feel as though built-in platform knowledge of an id->reference-data
mapping would be widely useful.

In the interim I guess one approach is to combine the unique ID and related
non-unique label into a single token, which would then satisfy the needs of
having unique tokens for aggregation and readable tokens for display
purposes (perhaps made more readable if you strip the ID from the token
before display).
Obviously this would add overhead compared with using a basic ID.
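
A minimal sketch of that combined-token idea, assuming a `|` separator that never occurs in an id (UUIDs like the ones in the example data are safe in that respect). The helper names `make_token` and `split_token` are invented for this example.

```python
SEPARATOR = "|"  # assumption: this character never appears in an id


def make_token(owner_id, owner_name):
    """Combine the unique id and the readable label into one token.

    The token would be indexed as a single not_analyzed value and used
    as the field for the terms aggregation.
    """
    return f"{owner_id}{SEPARATOR}{owner_name}"


def split_token(token):
    """Recover (id, label) from a bucket key; the label is what gets displayed."""
    # Split on the first separator only, so labels may contain the separator.
    owner_id, _, owner_name = token.partition(SEPARATOR)
    return owner_id, owner_name


if __name__ == "__main__":
    token = make_token("53e0f816-8a0a-4659-b868-c48035676b25", "Simon Chan")
    print(token)
    print(split_token(token))
```

The id stays in the key, so two owners with the same name still produce distinct buckets.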



#4

By the way, it appears that doing a terms sub-aggregation (as I suggested
in (b)) can be a bit of a performance murderer...
In my case I am already doing a terms aggregation (on the id), and the
terms sub-aggregation is turning a ~10ms response into a ~10000ms response
:-o

Sure, obviously there exists an id-reference data mapping in the system.
But having to dereference ids on read operations doesn't really scale.
Either:
a) it's a remote call, and making tens or hundreds of remote calls to serve a
single user request isn't going to perform or scale well, or
b) the reference data all has to be held in RAM, which doesn't scale well
either.

The thing is that we have the data in the index: we already dereferenced
it when we built the document to index it.

I can try to make a token, but as you can imagine, trying to encode/decode
all the location details into one token will make for a big token.

On Thursday, 3 July 2014 12:06:00 UTC+1, Mark Harwood wrote:



#5

bump.



(Adrien Grand) #6

On Thu, Jul 3, 2014 at 6:24 PM, mooky nick.minutello@gmail.com wrote:

> By the way, it appears that doing a terms sub-aggregation (as I suggested
> in (b)) can be a bit of a performance murderer...
> In my case I am already doing a terms aggregation (on the id), and the
> terms sub-aggregation is turning a ~10ms response into a ~10000ms response
> :-o
>
> Sure, obviously there exists an id-reference data mapping in the system.
> But having to dereference ids on read operations doesn't really scale.
> Either:
> a) it's a remote call, and making tens or hundreds of remote calls to serve a
> single user request isn't going to perform or scale well, or
> b) the reference data all has to be held in RAM, which doesn't scale well
> either.
>
> The thing is that we have the data in the index: we already dereferenced
> it when we built the document to index it.
>
> I can try to make a token, but as you can imagine, trying to encode/decode
> all the location details into one token will make for a big token.

There are no remote calls, but indeed aggregations are held in RAM. So if
the field that you are using for the first-level terms aggregation has a
high cardinality, adding a sub-aggregation certainly adds memory pressure
(and CPU overhead as well, but not enough to justify this slowdown).

Deferred aggregations might help with that issue:
https://github.com/elasticsearch/elasticsearch/pull/6128. They would allow
Elasticsearch to compute the top ownerIds first, take the top N, and only
then resolve their ownerName using a top_hits or a terms aggregation.
They will be available in Elasticsearch 1.3, which we expect to release soon.
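
As a rough illustration of the top_hits variant, the sketch below builds a request body and pulls the labels out of a response of the usual terms + top_hits shape. The names `by_owner` and `owner_label` and both helper functions are invented for this example. The optional `collect_mode` setting on the terms aggregation is my assumption about how the deferred behaviour would be requested; whether it combines with top_hits may depend on the Elasticsearch version, so treat it as something to verify against the documentation.

```python
def top_owners_request(top_n=10, collect_mode=None):
    """Terms agg on ownerId with one top hit per bucket to supply ownerName."""
    terms = {"field": "ownerId", "size": top_n}
    if collect_mode is not None:
        # Assumption: e.g. "breadth_first" defers sub-aggs until the
        # top buckets are known. Check this against your version.
        terms["collect_mode"] = collect_mode
    return {
        "size": 0,  # only aggregation results are needed
        "aggs": {
            "by_owner": {
                "terms": terms,
                "aggs": {
                    # One document per bucket is enough to read the label.
                    "owner_label": {
                        "top_hits": {"size": 1, "_source": ["ownerName"]}
                    }
                },
            }
        },
    }


def owner_labels(response):
    """Map each ownerId bucket key to the ownerName of its single top hit."""
    labels = {}
    for bucket in response["aggregations"]["by_owner"]["buckets"]:
        hits = bucket["owner_label"]["hits"]["hits"]
        if hits:
            labels[bucket["key"]] = hits[0]["_source"]["ownerName"]
    return labels
```

The parsing helper only relies on the response layout, so it can be exercised with a hand-written response dict before pointing it at a real cluster.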

--
Adrien Grand


