Counting distinct terms in a field

Daniel_E · July 30, 2011, 5:21pm

We have a field called signature.full which is unanalyzed. Every
document has a single string in it, which represents one possible term.

We are displaying a page that shows a table of unique signatures,
ordered by frequency. Unfortunately, we have not found a good way to
see how many unique signatures there are, other than requesting a
terms facet with an absurdly high size.

"facets" : {
    "signatures" : {
        "terms" : {
            "field" : "signature.full",
            "size" : MAXINT
        }
    }
}

MAXINT is the largest 32-bit signed integer, 2^31 - 1 (ES is running on
a 32-bit machine). It is just a trick to get all the results back so
that we can count them.

Is there a better way to do this?
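
For reference, the client-side counting described above could look roughly like the following sketch. It assumes the facet comes back under "facets" -> "signatures" -> "terms" as a list of term/count entries; the sample response and the helper name count_distinct_terms are made up for illustration. The count is only right if "size" was at least as large as the number of distinct terms.

```python
import json

# Hypothetical response body, shaped like a terms facet result.
# The signature values and counts are invented for this sketch.
sample_response = json.loads("""
{
  "facets": {
    "signatures": {
      "_type": "terms",
      "terms": [
        {"term": "sig_a", "count": 12},
        {"term": "sig_b", "count": 7},
        {"term": "sig_c", "count": 1}
      ]
    }
  }
}
""")


def count_distinct_terms(response, facet_name):
    """Return how many distinct terms the named terms facet returned."""
    return len(response["facets"][facet_name]["terms"])


distinct_signatures = count_distinct_terms(sample_response, "signatures")
print(distinct_signatures)
```

Every term still has to travel over the wire and be parsed, so this gets expensive as the number of unique signatures grows.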

kimchy · July 30, 2011, 6:01pm

No, there isn't another way to do it. Sadly, a distinct count is hard to
do in a distributed environment for large result sets. We can easily
compute the distinct count per shard, but returning a correct number across
shards means we need to send all the (distinct) values back to the
"coordinator" and compute the distinct count there.
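
A toy sketch of why the per-shard counts cannot simply be added up. The shard names and signature values below are invented, and this is not how ES implements anything internally; it only illustrates that the same term can live on several shards, so the coordinator has to merge the actual values.

```python
# Hypothetical per-shard contents: the signature values held by each shard.
shards = {
    "shard_0": ["sig_a", "sig_b", "sig_a"],
    "shard_1": ["sig_b", "sig_c"],
    "shard_2": ["sig_a", "sig_c", "sig_d"],
}


def per_shard_distinct_counts(shards):
    """Distinct count computed locally on each shard (the cheap part)."""
    return {name: len(set(values)) for name, values in shards.items()}


def coordinator_distinct_count(shards):
    """Exact distinct count: every shard ships its distinct values to the
    coordinator, which unions them and counts the merged set."""
    merged = set()
    for values in shards.values():
        merged.update(set(values))
    return len(merged)


local_counts = per_shard_distinct_counts(shards)
naive_total = sum(local_counts.values())  # double-counts terms shared by shards
exact_total = coordinator_distinct_count(shards)
print(local_counts, naive_total, exact_total)
```

Adding the local counts overstates the total whenever a term appears on more than one shard, which is why the values themselves have to be sent back.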

Liyu · September 20, 2011, 1:17am

If we want to implement a reduce function over a large result set,
do you have any suggestions for a clean and elegant solution
(something similar to what MongoDB does)? Or does ES have this feature
on the roadmap?

Thanks a bunch.

Liyu · September 20, 2011, 5:25am

BTW, I am from Symantec and we are in the process of evaluating ES.

Thanks,

-- Liyu

kimchy · September 20, 2011, 8:36am

Yes, something like that is planned. I'm not sure whether there is an
issue open for it or not.
