Counting distinct terms in a field

We have a field called signature.full which is unanalyzed; every
document has a single string which represents one possible term.

We are displaying a page that shows a table of unique signatures,
ordered by frequency. Unfortunately, we have found no good way to see
how many unique signatures there are other than running a terms facet
with an absurdly large size.

"facets" : {
    "signatures" : {
        "terms" : {
            "field" : "signature.full",
            "size" : MAXINT
        }
    }
}

MAXINT is the largest 32-bit signed integer (2^31 - 1), because ES runs
on a 32-bit machine. It is just a trick to get all the results back so
that we can count them.
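To make the trick concrete, here is a minimal Python sketch of it. It only builds the request body and counts the entries in a facet response; it does not talk to a cluster. The helper names (build_facet_request, count_distinct) and the sample response are hypothetical, and the response shape (a "terms" list of term/count entries under the facet name) is an assumption about what the terms facet returns.

```python
# Assumption: MAXINT is the largest signed 32-bit integer.
MAXINT = 2**31 - 1


def build_facet_request(field, size=MAXINT):
    """Build a search body with the terms facet from the post.

    The top-level "size": 0 asks for no hits, only the facet.
    """
    return {
        "size": 0,
        "query": {"match_all": {}},
        "facets": {
            "signatures": {
                "terms": {"field": field, "size": size}
            }
        },
    }


def count_distinct(response, facet_name="signatures"):
    """Count distinct terms by counting the entries the facet returned."""
    return len(response["facets"][facet_name]["terms"])


# Hand-written sample response (hypothetical data, assumed shape).
sample_response = {
    "facets": {
        "signatures": {
            "_type": "terms",
            "terms": [
                {"term": "sig_a", "count": 12},
                {"term": "sig_b", "count": 7},
                {"term": "sig_c", "count": 1},
            ],
        }
    }
}

request_body = build_facet_request("signature.full")
distinct = count_distinct(sample_response)
```

The cost is that every distinct term travels back to the client just so it can be counted, which is why this feels like a workaround.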

Is there a better way to do this?

No, there isn't another way to do it. Sadly, distinct is something that is
hard to do in a distributed environment for large result sets. We can easily
compute the distinct count per shard, but returning a correct number across
shards means we need to send all the (distinct) values back to the
"coordinator" and compute the distinct count there.
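A small Python sketch of why the per-shard counts cannot simply be added up. The shard contents are made-up data and the function names are hypothetical; this only illustrates the argument above, not how ES implements it.

```python
# Hypothetical distinct values held by three shards. The same
# signature can appear on more than one shard.
shards = [
    {"sig_a", "sig_b"},
    {"sig_b", "sig_c"},
    {"sig_c", "sig_d", "sig_a"},
]


def sum_of_shard_counts(shards):
    """Cheap but wrong: each shard reports only its own distinct count."""
    return sum(len(values) for values in shards)


def coordinator_distinct_count(shards):
    """Correct but expensive: every shard ships all of its distinct
    values to the coordinator, which unions them and counts."""
    return len(set().union(*shards))


naive = sum_of_shard_counts(shards)
exact = coordinator_distinct_count(shards)
```

With this data the naive sum is 7 while the true distinct count is 4, because values shared between shards are counted once per shard. Getting the exact number means moving every distinct value over the network, which is the cost described above.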

On Sat, Jul 30, 2011 at 8:21 PM, Daniel E deinspanjer@gmail.com wrote:

[...]

If we want to implement a reduce function on the large result set,
do you have any suggestions for a clean and elegant solution
(something similar to what MongoDB does), or does ES have this feature
on the roadmap?

Thanks a bunch.

On Jul 30, 11:01 am, Shay Banon kim...@gmail.com wrote:

[...]

BTW, I am from Symantec and we are in the process of evaluating ES.

Thanks,

-- Liyu

On Mon, Sep 19, 2011 at 6:17 PM, Liyu liyuyi@gmail.com wrote:

[...]

Yes, something like that is planned to be implemented. Not sure if there is
an issue for it or not.

On Tue, Sep 20, 2011 at 4:17 AM, Liyu liyuyi@gmail.com wrote:

[...]