To give you more background…
Vehicles are owned by dealers (until they are sold to a customer)… In our case there are a few thousand car dealers. Each dealer has, on average, about 500 vehicles, but some have almost 10K or more. In this particular case, all of those 10K+ cars are getting indexed on the same node/shard. One shard resides on one data node. So we have a 22 data node cluster… 11 primaries and 1 replica each.
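For context, this is roughly what the indexing path looks like. A minimal sketch, assuming a 7.x-style elasticsearch-py client and assuming the routing key is the dealer id; the field names (vin, dealer_id) are hypothetical, the routing-by-dealer part is the point:

```python
# Minimal sketch, assuming a 7.x-style elasticsearch-py client.
# Field names (vin, dealer_id) are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def index_vehicle(vehicle: dict) -> None:
    """Index one vehicle, routed by its dealer so that all of that dealer's
    vehicles land on the same shard (and hence the same node)."""
    es.index(
        index="v-index",
        id=vehicle["vin"],             # hypothetical document id
        routing=vehicle["dealer_id"],  # same routing value -> same shard
        body=vehicle,
    )
```

Since the shard is essentially picked as hash(routing) % number_of_primary_shards, every document carrying the same dealer id goes to the same primary, which is exactly the hot spot I describe below.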
Car data changes multiple times a day. Imagine this happening across all the thousands of dealers… Net net, out of these 3 - 4 million active vehicles, 10 - 30% churn each day. The churn is exaggerated because we need all changes around a vehicle to be available ASAP. So the churn on the ES index is even higher, because the same document is re-indexed multiple times a day for different business reasons. On a high level, that is the story.
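In case it helps to picture the write path, here is a sketch of how that churn might be applied. This assumes bulk partial updates routed by dealer; it is illustrative only, not our actual pipeline, and the helper and field names are made up:

```python
# Sketch of applying churn as bulk partial updates routed by dealer.
# Illustrative only; not our actual pipeline.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])

def apply_changes(changes):
    """changes: iterable of (vin, dealer_id, partial_doc) tuples."""
    actions = (
        {
            "_op_type": "update",
            "_index": "v-index",
            "_id": vin,
            "_routing": dealer_id,  # must match the routing used at index time
            "doc": partial_doc,     # partial update of the existing document
        }
        for vin, dealer_id, partial_doc in changes
    )
    bulk(es, actions)
```

Keep in mind that each such update internally deletes the old version of the document and indexes a new one, so the effective write load is higher than the raw change count suggests.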
Total index size is about 20 GB. Each document can be anywhere from a few KB to a couple of megabytes or slightly more. Not much… but any lower number of shards does not scale for us. Perhaps there is something else going on, but this is what it is.
During our performance testing (indexing, deletes, searches, aggregations, and suggestions) we found that we need this many shards. Our 1X load is about 100 req/sec in total, distributed between the different actions (indexing, searching, etc.). And because all of the vehicles for this dealer are getting indexed on the same shard, CPU utilization on two nodes (the ones holding its primary and replica) constantly stays above 50 - 60%. Overall the cluster health looks good, with the CPU on the other nodes at around 25% or less… but I am sure there are many more dealers like this one whose entire data resides on a single shard, and if those dealers also start experiencing heavy indexing or search traffic, those nodes' CPU will spike as well.
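For anyone who wants to see that kind of skew, something like this works (a sketch assuming the elasticsearch-py client; it just prints per-shard doc counts and sizes so the hot shards stand out):

```python
# Sketch: print per-shard doc counts and sizes for the index,
# assuming the elasticsearch-py client.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

shards = es.cat.shards(index="v-index", format="json")
for s in sorted(shards, key=lambda row: int(row["docs"] or 0), reverse=True):
    print(f'shard {s["shard"]} ({s["prirep"]}) on {s["node"]}: '
          f'{s["docs"]} docs, {s["store"]}')
```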
Now, to answer your other question about _cat/indices, here are the details…
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open v-index 11 1 103,173,881 41,287,442 53.7gb 27.1gb
This is what Marvel reports… and yes, the document counts do not match what I reported: the 3 - 4 million does not match the 103 million you see here. I was told by one of your instructors (during the ES operations training) that the document count reported by Marvel will differ from what you actually index, because complex JSON docs are broken down by Lucene into multiple underlying documents, and that underlying count is what Marvel reports… not sure if I explained that correctly or not…
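If it is the same mechanism I am thinking of, the usual culprit is nested fields: each nested object is stored as its own hidden Lucene document, and _cat/indices / Marvel count at the Lucene level. A toy sketch, assuming a 7.x-style client; the demo index and the nested "price_history" field are hypothetical, not our real mapping:

```python
# Toy sketch, assuming a 7.x-style client. The demo index and the nested
# "price_history" field are hypothetical, not our real mapping.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(
    index="v-index-demo",
    body={
        "mappings": {
            "properties": {
                "vin": {"type": "keyword"},
                "price_history": {"type": "nested"},  # each nested object becomes a hidden Lucene doc
            }
        }
    },
)

# One JSON document with 3 nested entries shows up as 4 docs in docs.count.
es.index(index="v-index-demo", id="VIN123", body={
    "vin": "VIN123",
    "price_history": [{"price": 21000}, {"price": 20500}, {"price": 19900}],
})
```

The docs.deleted count (~41 million) presumably also reflects the churn described above, since every update marks the previous version of the document as deleted until the segments get merged.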
NOTE: So far, I have found that if I use murmur3 instead of djb2 for hashing, then the sharding happens evenly without passing in an extraneous routing key.
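To illustrate what I mean, here is a rough simulation (not the exact ES internals, which pick the shard roughly as hash(routing or _id) % number_of_primary_shards, but enough to compare the spread). It hashes a batch of made-up ids with djb2 and with murmur3 (via the third-party mmh3 package) and counts how many land on each of the 11 shards:

```python
# Rough simulation of shard spread under djb2 vs murmur3; not the exact
# ES internals. murmur3 comes from the third-party mmh3 package (pip install mmh3).
from collections import Counter
import mmh3

NUM_SHARDS = 11

def djb2(key: str) -> int:
    """Classic djb2 string hash, kept to 32 bits."""
    h = 5381
    for ch in key:
        h = ((h * 33) + ord(ch)) & 0xFFFFFFFF
    return h

def spread(keys, hash_fn):
    """Docs per shard, sorted, so an even spread is easy to eyeball."""
    return sorted(Counter(hash_fn(k) % NUM_SHARDS for k in keys).values())

keys = [f"vehicle-{i}" for i in range(5000)]  # made-up ids
print("djb2   :", spread(keys, djb2))
print("murmur3:", spread(keys, lambda k: mmh3.hash(k) & 0xFFFFFFFF))
```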
Let me know if you have more questions…