Terms aggregation with partition not working


(Tushar Chevulkar) #1

When I run the query below, it executes, but the partition filter does not seem to work.

I really don't know where I am going wrong. Can anyone help me with this?

{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase_prefix": {
            "mentions": {
              "query": "ABC",
              "max_expansions": 10
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "genres": {
      "terms": {
        "field": "mentions",
        "size": 1
      },
      "aggs": {
        "user_part": {
          "terms": {
            "field": "user.username",
            "size": 4,
            "include": {
              "partition": 1,
              "num_partitions": 5
            }
          }
        }
      }
    }
  }
}

(David Pilato) #2

Could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.


(Tushar Chevulkar) #3

I use the above query on a Twitter data dump to see how many people have used a certain hashtag. For example, if I search for 'nike', it shows nike, nikewomen, etc., and who has used each hashtag. Below is a sample result.

{
	"took": 82,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"failed": 0
	},
	"hits": {
		"total": 915,
		"max_score": 0,
		"hits": []
	},
	"aggregations": {
		"genres": {
			"doc_count_error_upper_bound": 15,
			"sum_other_doc_count": 858,
			"buckets": [{
				"key": "nike",
				"doc_count": 72,
				"users_with_hastag": {
					"doc_count_error_upper_bound": 0,
					"sum_other_doc_count": 34,
					"buckets": [{
							"key": "A",
							"doc_count": 14
						},
						{
							"key": "B",
							"doc_count": 9
						},
						{
							"key": "C",
							"doc_count": 8
						},
						{
							"key": "D",
							"doc_count": 7
						}
					]
				}
			}]
		}
	}
}

In the result above you can see that nike is used by more people; I have displayed only 4 to keep the output light. Is there a way to paginate the results, so that when I pass pagination parameters I get the next set of users, from 4 to 8?


(Mark Harwood) #4

Partitions aren't designed with a global sort order in mind (i.e. the users returned in partition 2 aren't guaranteed to be any more or less popular than those returned in partition 1). Global sort orders on high-cardinality fields like UserIds are hard to reason about in a distributed system where each shard or index has only a small percentage of all the docs.
Partitions are a coping strategy for this problem. By examining arbitrary sub-groupings of terms independently of each other you can attempt to compute things like the top N of something within just that subgroup rather than attempting this analysis across the whole data.
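
To illustrate, here is a minimal Python sketch of walking the partitions one by one and merging them on the client. It only builds request bodies and merges responses; it does not talk to a cluster. The helper names (build_partition_request, merge_partition_buckets) are made up for this example, and the outer "genres" aggregation from the original query is left out to keep it short, so users are counted across every hashtag matching the prefix. Partition numbers are zero-based, and the size has to be large enough to return every term in a partition, otherwise users are silently dropped. Any global ordering exists only after the client-side merge, and the counts carry the usual terms-aggregation approximation.

```python
from typing import Any, Dict, List


def build_partition_request(prefix: str, partition: int, num_partitions: int,
                            size: int = 1000) -> Dict[str, Any]:
    """Build a search body asking for one partition of usernames.

    Hypothetical helper: field names are taken from the query in post #1.
    """
    # Partition numbers are zero-based: valid values are 0 .. num_partitions - 1.
    if not 0 <= partition < num_partitions:
        raise ValueError("partition must be in [0, num_partitions)")
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {
            "match_phrase_prefix": {
                "mentions": {"query": prefix, "max_expansions": 10}
            }
        },
        "aggs": {
            "user_part": {
                "terms": {
                    "field": "user.username",
                    # Assumption: large enough to hold every user in one partition.
                    "size": size,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                }
            }
        },
    }


def merge_partition_buckets(responses: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Combine the buckets of all partition responses and sort them globally."""
    buckets: List[Dict[str, Any]] = []
    for response in responses:
        buckets.extend(response["aggregations"]["user_part"]["buckets"])
    # Each username falls into exactly one partition, so no de-duplication is
    # needed; sort by descending count, then by key for a stable order.
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))
```

You would send one request per partition (0 to num_partitions - 1) with whatever client you use, pass the responses to merge_partition_buckets, and page through the merged list in your application.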


(Tushar Chevulkar) #5

Is there any specific example that can solve this?


(Mark Harwood) #6

What's the end goal and what business problem are you trying to solve?

I'm unsure what the use is of a sorted-by-popularity list of all users who have ever mentioned #nike.
We can discuss alternative approaches that would support this objective, but it's worth understanding whether that is really the requirement first.


(Tushar Chevulkar) #7

My end goal, as I mentioned earlier, is to scroll through aggregated results. I am writing a complex search query where the user types a hashtag and sees all the users who have posted with that hashtag, as in the #nike example. The above query shows me 2 results, #nike and #nikewomen, with 4 users each. On my website there is a "show all" button that lets the user see all the users in #nike, so I need to scroll through the aggregated results; it is a kind of pagination.


(Mark Harwood) #8

"Deep pagination" for an arbitrary query on a distributed system is expensive, which is why Google won't let you page beyond a certain number of results for a given query.

If you really need to provide exhaustive results to your end users then you may be forced to reconsider how you physically arrange the data to optimise access for this use case. You may need to pre-aggregate data to keep related information locally, e.g. maintain a single document per user with a list of all the hashtags they've ever used and how frequently they were used.
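
A rough Python sketch of that pre-aggregation step, assuming each tweet is a dict with a user.username value and a list of hashtags under mentions (the tweet shape, the output document layout and the function name build_user_documents are all assumptions for illustration, not anything Elasticsearch prescribes):

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def build_user_documents(tweets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse raw tweets into one document per user with hashtag counts."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tweet in tweets:
        username = tweet["user"]["username"]
        # Assumed shape: "mentions" is a list of hashtag strings per tweet.
        for tag in tweet.get("mentions", []):
            counts[username][tag.lower()] += 1
    # One document per user; hashtags sorted by name for a deterministic layout.
    return [
        {
            "username": username,
            "hashtags": [
                {"tag": tag, "count": n} for tag, n in sorted(tag_counts.items())
            ],
        }
        for username, tag_counts in sorted(counts.items())
    ]
```

With documents like these indexed, "all users who used #nike" becomes an ordinary search over user documents, which can be paged like any other result list, instead of a deep aggregation over every tweet.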


(system) closed #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.