Aggregation count unique values


(Manuel Islas) #1

Hello,

I have a requirement to aggregate over multiple indexes, each independent of the others and each potentially containing millions of documents. Each index has its own IDs, but every document carries a hash property that can be used to identify duplicated items across indexes. The documents resemble the following:

On index 1

{ "id": 1, "hash": "abcdefg", "title": "the title 1", "category": "the category 1", "createdBy": "user", "modifiedDate": "2011-04-11T10:20:30Z" }
{ "id": 2, "hash": "asdfjklñ", "title": "the title 2", "category": "the category 1", "createdBy": "user2", "modifiedDate": "2011-04-11T10:20:30Z" }

On index 2

{ "id": 2, "hash": "fghijk", "title": "the title 2", "category": "the category 2", "createdBy": "user3", "modifiedDate": "2011-04-11T10:20:30Z" }
{ "id": 3, "hash": "abcdefg", "title": "the title 1", "category": "the category 1", "createdBy": "user", "modifiedDate": "2011-04-11T10:20:30Z" }
{ "id": 4, "hash": "lmnopq", "title": "the title 3", "category": "the category 3", "createdBy": "user2", "modifiedDate": "2011-04-11T10:20:30Z" }

I need to get the list of unique titles with the counts of unique documents across the indexes; having one big index is not an option for me. The result that I am looking for is something like this:

"category 1": 2
"category 2": 1
"category 3": 1

I am using Elasticsearch 2.4.

I tried to use an aggregation like the following

{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category"
      }
    }
  }
}

but this returns the counts including the duplicates. I also tried the following aggregation:

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "hash",
        "min_doc_count": 1
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          }
        }
      }
    }
  }
}

But this aggregation returns one bucket per hash, each containing only the category for that hash.

Any idea how I can achieve this? Is this even possible?

Thanks in advance,
Manuel


(Mark Harwood) #2

In later versions of Elasticsearch we introduced the composite aggregation and terms aggregation partitioning to help break big requests like this one into smaller pieces.
Using the 2.4 APIs, you could use the scroll API, sorting docs by hash and streaming them out to your client code, which can then look for duplicates in the sequence of docs.
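
The client-side pass suggested above can be sketched as follows. This is a minimal illustration, not tested against a 2.4 cluster: it assumes the documents have already been streamed out sorted by hash (for example via the scroll API), and that all documents sharing a hash also share a category. The function name `count_unique_per_category` and the in-memory sample list are hypothetical stand-ins for the real scroll results.

```python
from collections import Counter
from typing import Dict, Iterable


def count_unique_per_category(sorted_docs: Iterable[Dict[str, str]]) -> Counter:
    """Count one document per distinct hash, grouped by category.

    Assumes sorted_docs is ordered by "hash", so duplicates are adjacent,
    and that documents with the same hash have the same category.
    """
    counts: Counter = Counter()
    previous_hash = None
    for doc in sorted_docs:
        if doc["hash"] == previous_hash:
            continue  # adjacent doc with the same hash: a duplicate, skip it
        previous_hash = doc["hash"]
        counts[doc["category"]] += 1
    return counts


# Hypothetical stand-in for hits streamed from both indexes.
sample_docs = [
    {"hash": "abcdefg", "category": "the category 1"},
    {"hash": "asdfjklñ", "category": "the category 1"},
    {"hash": "fghijk", "category": "the category 2"},
    {"hash": "abcdefg", "category": "the category 1"},
    {"hash": "lmnopq", "category": "the category 3"},
]
# Sorting locally only simulates the sort that the search request would do.
stream = sorted(sample_docs, key=lambda doc: doc["hash"])
unique_counts = count_unique_per_category(stream)
print(dict(unique_counts))
```

Because only the previous hash is kept in memory, the pass stays cheap even when the stream contains millions of documents.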


(Manuel Islas) #3

Hi Mark,

First of all, thank you for your reply. I tried the composite aggregation:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "test_bucket": {
      "composite": {
        "sources": [
          { "category": { "terms": { "field": "category.keyword" } } },
          { "hash": { "terms": { "field": "hash" } } }
        ]
      }
    }
  }
}

but I am getting this as a result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 15,
        "successful": 15,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "test_bucket": {
            "buckets": [
                {
                    "key": {
                        "category": "the category 1",
                        "hash": "abcdefg"
                    },
                    "doc_count": 2
                },
                {
                    "key": {
                        "category": "the category 1",
                        "hash": "asdfjklñ"
                    },
                    "doc_count": 1
                },
                {
                    "key": {
                        "category": "the category 2",
                        "hash": "fghijk"
                    },
                    "doc_count": 1
                }
            ]
        }
    }
}

As can be seen, for the bucket with category "the category 1" and hash "abcdefg" I would expect a doc_count of 1 instead of 2, as I am trying to de-duplicate items (the two documents, one in index 1 and one in index 2, have the same hash).

Am I doing something wrong here or is this the expected behavior and what I want to achieve is not possible?

Thanks in advance
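
A note on reading the response above: a composite aggregation emits one bucket per distinct combination of its sources, so each (category, hash) pair appears exactly once and doc_count is the number of documents behind that pair. Counting buckets per category, rather than summing doc_count, is therefore one way to get de-duplicated counts on the client. A minimal sketch, with a hypothetical helper name and the buckets copied from the response above; a real run would also have to page through all buckets with the composite `after` parameter:

```python
from collections import Counter
from typing import Any, Dict, List


def unique_hashes_per_category(buckets: List[Dict[str, Any]]) -> Counter:
    """Count composite buckets per category (one bucket per category/hash pair)."""
    counts: Counter = Counter()
    for bucket in buckets:
        counts[bucket["key"]["category"]] += 1  # doc_count is ignored on purpose
    return counts


# Buckets as returned under aggregations.test_bucket.buckets in the response above.
buckets = [
    {"key": {"category": "the category 1", "hash": "abcdefg"}, "doc_count": 2},
    {"key": {"category": "the category 1", "hash": "asdfjklñ"}, "doc_count": 1},
    {"key": {"category": "the category 2", "hash": "fghijk"}, "doc_count": 1},
]
per_category = unique_hashes_per_category(buckets)
print(dict(per_category))
```

For the three buckets shown, this yields 2 for "the category 1" and 1 for "the category 2".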


(Mark Harwood) #4

I think I need to understand your problem first and what determines "unique".
You say you want a list of titles but your example ideal result looks like a list of categories.

I also probably need to know the cardinalities between the id, title and hash fields, e.g. how many titles per hash or ids per title.


(Manuel Islas) #5

Hi Mark,

Sorry, my bad: it is categories in the example, not titles. Each index is independent of the others, and all documents within an index are unique and have their own IDs. When aggregating or searching across multiple indexes, however, there might be up to n items with the same hash/title across the n indexes queried.

In my example, a category can contain many items (1:n). Hash values are unique within an index, but the same hash can occur once in each index I am querying (as I might potentially have the same item in another index). What I am trying to achieve is to get the count of unique items in a category (in my example, two of the items in category 1 have the same hash, so I want to count them as 1).

Thanks in advance for the help.
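
One option not raised in the thread, offered as an untested sketch: a terms aggregation on the category with a cardinality sub-aggregation on the hash asks Elasticsearch for the number of distinct hashes per category. The cardinality aggregation is approximate, so counts may be slightly off at high cardinalities, and on 2.4 both fields would presumably need to be not_analyzed so that whole values are used. The function and aggregation names (`build_unique_count_body`, `unique_items`) are made up for illustration; the body would be sent to a search across both indexes.

```python
import json
from typing import Any, Dict


def build_unique_count_body(category_field: str, hash_field: str) -> Dict[str, Any]:
    """Build a search body counting distinct hashes per category."""
    return {
        "size": 0,
        "aggs": {
            "categories": {
                "terms": {"field": category_field},
                "aggs": {
                    # approximate distinct count of hashes inside each category
                    "unique_items": {"cardinality": {"field": hash_field}}
                },
            }
        },
    }


body = build_unique_count_body("category", "hash")
print(json.dumps(body, indent=2))
```

Each bucket in the response would then carry its de-duplicated count under `unique_items.value`, alongside the usual doc_count that still includes duplicates.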

