I have a cluster with millions of documents. Each document represents a music transaction.
I want to calculate percentiles on the price field AND to sum prices by percentile bucket.
My document looks like this:

```json
"_source" : {
  "commercialModel" : "SubscriptionModel",
  "creationDate" : 1529307715833,
  "dspCode" : "MUSIC-ONLINE",
  "duration" : 78,
  "price" : 0.57370562886557345,
  "nbTx" : 1,
  "releaseTitle" : {
    "title" : "Se Cacher Sous L'arbre Sous la Pluie (Hiding Under the Tree in the Rain)",
    "title_lower" : "se cacher sous l'arbre sous la pluie (hiding under the tree in the rain)"
  },
  "releaseType" : "TrackRelease"
}
```
So you'd like, say, the sum of the price field from all documents whose price was in the 95th percentile?
If that's the case, I'm afraid it's not currently possible in a single request. Elasticsearch would have to make a complete pass over the data to determine the global percentiles. Because data in Elasticsearch is sharded, the "local" percentile values can be very inaccurate. For example, one shard might hold values from 0 to 10 and another shard values from 10 to 100. Viewed locally, the first shard would put the 95th percentile near 10, but viewed globally the 95th percentile is closer to 100.
This means we can't "bucket" documents by percentile because we don't know where the bucket boundaries are until we've consumed all the data.
You can achieve it with two sequential requests, though. First calculate the percentiles, then issue a second query that uses a range aggregation whose boundaries are the percentile values you care about. Alternatively, use a simple filter query that restricts the documents to those with a price above the 95th percentile.
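The two sequential requests could look something like the following sketch. The index name `transactions` and the percentile value plugged into the second request are placeholders; use your own index name and the value actually returned by the first request.

```
# Step 1: compute the global 95th percentile of price
# ("transactions" is a placeholder index name)
POST /transactions/_search
{
  "size": 0,
  "aggs": {
    "price_p95": {
      "percentiles": { "field": "price", "percents": [ 95 ] }
    }
  }
}

# Step 2: plug the value returned above (0.95 here is a placeholder)
# into a range aggregation and sum prices on each side of the boundary
POST /transactions/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [ { "to": 0.95 }, { "from": 0.95 } ]
      },
      "aggs": {
        "total_price": { "sum": { "field": "price" } }
      }
    }
  }
}
```

If you only care about the documents above the 95th percentile, you can instead put a `range` filter on `price` in the query and attach a single `sum` aggregation.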