Terms Aggregation Accuracy

Hello there,

For demonstration, let's say my data model has actors as root documents and movies as nested documents. Note that each actor document is unique and created once, whereas a movie document may be created more than once. I want to find the movies that involve the most actors. Therefore, my query is something like this:

{
	"aggs" : {
		"NestMovies" : {
			"nested" : {
				"path" : "Movies"
			},
			"aggs" : {
				"top10InvolvedMovies" : {
					"terms" : {
						"field" : "Movies.MovieId",
						"order" : {
							"backToActors>actorsCount" : "desc"
						}
					},
					"aggs" : {
						"backToActors" : {
							"reverse_nested" : {},
							"aggs" : {
								"actorsCount" : {
									"value_count" : {
										"field" : "ActorId"
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

My questions are as follows:

  1. Can you please confirm what exactly happens here? My guess is that Elasticsearch uses the depth_first collect mode (remark: it's actually not always the default collect mode in Elasticsearch 5.x, so it may need to be specified explicitly), meaning it really iterates through ALL available movie terms, calculates the number of involved actors for each, and then sorts the movie terms by that count. This sounds pretty bad performance-wise (2N buckets, no matter what the requested size parameter is!), but so far performance has actually been very good (less than 2 seconds). That's probably because the cardinality of the movie field is not high (roughly 10-400 unique values); the index size is about 50GB (by the way, there are 6 shards including 1 replica).
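For reference, the collect mode can be set explicitly on the terms aggregation via the collect_mode parameter instead of relying on the default heuristic. A minimal sketch, reusing the field and ordering from the query above:

```json
{
	"terms" : {
		"field" : "Movies.MovieId",
		"collect_mode" : "depth_first",
		"order" : {
			"backToActors>actorsCount" : "desc"
		}
	}
}
```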

  2. I have tested the query by specifying the maximum relevant number in the size parameter (for which sum_other_doc_count is 0), then ran the same query again with a size of only 10 - and verified that the results were the same in both cases. Is that a valid way to test my query?
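To make the check concrete: I run the oversized variant below, confirm sum_other_doc_count is 0 (meaning no terms fell outside the returned buckets), and then compare its top entries with the size-10 run. The value 10000 is an arbitrary upper bound I assume exceeds the movie cardinality:

```json
{
	"terms" : {
		"field" : "Movies.MovieId",
		"size" : 10000,
		"order" : {
			"backToActors>actorsCount" : "desc"
		}
	}
}
```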

  3. It's written that: "Sorting by ascending count or by sub aggregation is discouraged as it increases the error on document counts." What exactly does that mean, and how dangerous is it in terms of reliability and accuracy? What is the worst reliability problem that might happen in my system? For example, since I'm sorting the terms by a sub-aggregation, I'm obviously getting doc_count_error_upper_bound: -1. However, in my understanding there is nothing to worry about in my case. The query above loops through all the movie terms and their relevant actors. I don't mind how many times the movie terms themselves showed up (so doc_count_error_upper_bound isn't relevant to me at all). In addition, sum_other_doc_count isn't relevant either, since no matter what, a bucket must be created for each available term. The most important part is obviously the value_count aggregation. It's clearly not an approximate aggregation, but I'm afraid it might somehow lose accuracy when nested under a terms aggregation (although that would be very odd and definitely a serious bug).
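For clarity, the fields in question appear at the top level of the terms aggregation in the response; a shortened, hypothetical response shape from my query looks like this (bucket contents elided):

```json
"top10InvolvedMovies" : {
	"doc_count_error_upper_bound" : -1,
	"sum_other_doc_count" : 0,
	"buckets" : [ ]
}
```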

  4. By the way, how can doc_count_error_upper_bound be lowered on a regular terms aggregation? Only by increasing shard_size? Is that a bulletproof solution to such a problem (assuming performance remains acceptable)?
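As far as I understand, shard_size is the main knob here; the terms aggregation also supports show_term_doc_count_error, which reports a per-bucket worst-case error so one can judge whether the chosen shard_size is large enough. A sketch (the shard_size value is an assumption for illustration):

```json
{
	"terms" : {
		"field" : "Movies.MovieId",
		"size" : 10,
		"shard_size" : 500,
		"show_term_doc_count_error" : true
	}
}
```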
