Can I calculate the cardinality of the _id field?

(Dave) #1

I have multiple Elasticsearch 1.3.2 indices, and I'm using custom
document IDs. I want to find the number of distinct IDs across my
indices. Some documents have the same ID but are in different indices,
so this is different from just counting documents. I tried a
cardinality aggregation on the _id field by posting this to http://localhost:9200/my_indices/_search:

{ "from": 0, "size": 0, "aggregations": { "_count": { "cardinality": { "script": "doc['_id'].value", "lang": "groovy" } } } }

But Elasticsearch just sent back this:

{ "took": 60, "timed_out": false, "_shards": { "total": 175, "successful": 175, "failed": 0 }, "hits": { "total": 310714, "max_score": 0, "hits": [] }, "aggregations": { "_count": { "value": 0 } } }

I'm pretty sure there are more than 0 IDs in there! What happened, and is it possible to get what I want?

(Adrien Grand) #2

The issue here is that the _id field is neither indexed nor has doc values, so aggregations can't load fielddata for it. You can try the _uid field instead, which is indexed (it is a combination of the _type and the _id). Note, however, that this will require a lot of memory because of fielddata.
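
As a rough illustration of the suggestion above, here is a minimal sketch of a search body that aggregates on _uid instead of scripting against _id. It is written in Python only so the body can be printed and checked; the function name `build_uid_cardinality_body` and the aggregation name `distinct_uids` are made up for this example and do not come from the thread. Because _uid combines the type and the ID, the result counts distinct type/ID pairs, which equals the number of distinct IDs only if no ID is reused across types.

```python
import json


def build_uid_cardinality_body(agg_name="distinct_uids"):
    """Build a search body with a cardinality aggregation on the _uid field.

    agg_name is an illustrative label; any aggregation name would do.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggregations": {
            agg_name: {
                # Aggregate on the indexed _uid field rather than on _id.
                "cardinality": {"field": "_uid"}
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be posted to the _search endpoint.
    print(json.dumps(build_uid_cardinality_body(), indent=2))
```

The printed JSON could be posted to the same _search URL used in the question.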

(Camilo Sierra) #3

I think it is really important to say that this result is approximate. On a small index I got good results, but after one or two million docs the result is almost never exact (though, as the docs say, the error remains under 5%).
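
To make the accuracy trade-off concrete, here is a hedged sketch of two options. The cardinality aggregation accepts a `precision_threshold` setting that trades memory for accuracy; the value 40000 below is an assumption, taken to be the documented maximum for 1.x, so check it against your version. If an exact number is required, one alternative is to collect the IDs client side, for example from hits fetched with scan/scroll. The helper names and the sample hits are hypothetical, and how the hits are fetched is left out.

```python
def with_precision_threshold(cardinality_agg, threshold=40000):
    """Return a copy of a cardinality aggregation with precision_threshold set.

    The default of 40000 is an assumption, not a value from the thread.
    """
    tuned = dict(cardinality_agg)
    tuned["precision_threshold"] = threshold
    return tuned


def count_distinct_ids(hits):
    """Exact distinct count of _id over an iterable of hit dicts."""
    return len({hit["_id"] for hit in hits})


# Hypothetical hits from two indices that share one ID.
sample_hits = [
    {"_index": "index_a", "_id": "doc-1"},
    {"_index": "index_b", "_id": "doc-1"},
    {"_index": "index_b", "_id": "doc-2"},
]

if __name__ == "__main__":
    print(with_precision_threshold({"field": "_uid"}))
    print(count_distinct_ids(sample_hits))
```

The client-side count is exact but has to hold every distinct ID in memory, so it only suits ID sets that fit comfortably on one machine.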
