Documents not getting sharded evenly

I have an index with 11 primaries and 1 replica. Document IDs are sequential numbers, each incremented by 10, for example 1, 11, 21, 31...

Sample set: 2606113223, 2606113453, 2606113473, 2606113483, 2606113493, 2606113503, 2606113513, 2606113523, 2606113623, 2606113633

We are using ES 1.7.0. I understand it uses the DJB hashing algorithm underneath to determine which shard to route a document to. In my case I do not specify any special routing parameter, so ES defaults to using my document ID as the value it hashes to pick the shard, as per the following algorithm.

shard = hash(doc_id) % num_of_primaries // 11 in my case.

For the sample IDs provided above, the computed shard is always 2. There are 9,608 such documents and all of them get routed to the exact same shard.
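To make it concrete, here is a small self-contained Java sketch of the computation as I understand it. The djb2 function below is my own reading of how ES 1.x hashes the ID (I have not verified it line by line against the 1.7 source), so treat the exact shard numbers it prints as illustrative only:

```java
public class ShardDistributionCheck {

    // Classic djb2: hash = hash * 33 + c. My assumption is that ES 1.x's
    // DjbHashFunction does essentially this and then truncates to an int.
    static int djb2(String value) {
        long hash = 5381;
        for (int i = 0; i < value.length(); i++) {
            hash = ((hash << 5) + hash) + value.charAt(i);
        }
        return (int) hash;
    }

    public static void main(String[] args) {
        int numberOfPrimaries = 11;
        String[] sampleIds = {
            "2606113223", "2606113453", "2606113473", "2606113483", "2606113493",
            "2606113503", "2606113513", "2606113523", "2606113623", "2606113633"
        };
        for (String id : sampleIds) {
            // floorMod keeps the shard non-negative even when the int hash is negative
            int shard = Math.floorMod(djb2(id), numberOfPrimaries);
            System.out.println(id + " -> shard " + shard);
        }
    }
}
```

One thing I notice: 33 % 11 == 0, so before the truncation to int the hash modulo 11 depends only on the last character of the ID, and our IDs always end in the same digit because they increment by 10. That may well be why everything collapses onto so few shards with 11 primaries.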

What are my options to make ES index the documents more evenly?

You can provide a routing key for each doc.

I see... but how will that be different than using the document IDs? The document IDs are unique in my case.

Another alternative I thought about: could I plug in my own hash function somehow? If yes, can you point me to a guide that explains how to add one to ES?

I did read that cluster.routing.operation.hash.type can be provided in the YML file. But what is the value of this property? Is it the fully qualified name of an implementation of HashFunction.java? Assuming that is the case, where do I drop the jar? Inside the ES lib folder? And then restart the JVM?
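To show what I have in mind, here is a rough, untested sketch. I am assuming the ES 1.x HashFunction interface exposes hash(String routing) and hash(String type, String id) (that is my reading of the source, please correct me if the signatures are different), and that Guava, which ships with ES 1.x, can be used for murmur3. The package and class names are made up:

```java
package com.example.routing; // hypothetical package name

import org.elasticsearch.cluster.routing.operation.hash.HashFunction;

// Hypothetical custom routing hash delegating to Guava's murmur3_32.
// Guava's own HashFunction type is referenced fully qualified below to
// avoid clashing with the ES interface imported above.
public class Murmur3RoutingHashFunction implements HashFunction {

    @Override
    public int hash(String routing) {
        return com.google.common.hash.Hashing.murmur3_32()
                .hashUnencodedChars(routing)
                .asInt();
    }

    @Override
    public int hash(String type, String id) {
        // My own assumption about how type and id should be combined.
        return hash(type + "#" + id);
    }
}
```

And then, I am guessing, the YML value would be cluster.routing.operation.hash.type: com.example.routing.Murmur3RoutingHashFunction, which is the part I would like confirmed.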

Providing a routing key is like providing a hash function IMO.

Let's say routing key 1 sends data to shard 1, 2 to shard 2, and so on.
Then you can decide which shard you want to send your document to.
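For example, with the 1.x Java client it would look roughly like this (a sketch only; "my-index", "my-type" and the routing key are placeholders for your own values):

```java
import org.elasticsearch.client.Client;

// Sketch: index a document with an explicit routing key using the ES 1.x Java client.
// All documents indexed with the same routing value end up on the same shard.
public class IndexWithRouting {

    public static void indexDoc(Client client, String docId, String routingKey, String json) {
        client.prepareIndex("my-index", "my-type", docId)
              .setRouting(routingKey)
              .setSource(json)
              .execute()
              .actionGet();
    }
}
```

Keep in mind that GETs, updates, and deletes of that document then need the same routing value, otherwise ES will look on the wrong shard; searches without routing still work across all shards.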

Well. I would not use that! Read https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_20_crud_and_routing_changes.html#_routing_hash_function

How many documents have you indexed across these 11 shards? How many documents is the index expected to hold?

OK. Let's talk more about your suggestion…

Say the document ID is my routing key… then the djb2 hash function still kicks in, and in effect the result is the same… Help me understand: what should the value of the routing key be so that I can achieve what I need? Perhaps I am missing something in your suggestion.

I have about 3 - 4 million of these… They are basically car details, for example make, model, year, color etc… Composite JSON objects… I can provide a sample if needed. The index can grow to hold tenfold or more such JSON objects.

What does the distribution across shards look like if you run the '_cat/indices' API? Is the distribution of these 3-4 million documents so uneven that it is causing a problem? Why so many shards?

To give you more background…

Vehicles are owned by dealers (till they are sold to a customer)… In our case, there are a few thousand car dealers. Each dealer has, on average, about 500 vehicles, but some have close to 10K or more. In this particular case, all those 10K+ cars are getting indexed on the same node/shard. One shard resides on one data node. So we have a 22 data node cluster: 11 primaries and 1 replica each.

Car data changes multiple times a day. Imagine this happening across all the thousands of dealers… Net net, out of these 3 - 4 million active vehicles, 10 - 30% churn each day. The churn is amplified because we need all changes around a vehicle to be available ASAP, so the churn on the ES index is even higher, as the same document is updated multiple times a day for different business reasons. At a high level, that is the story.

Total index size is about 20 GB. Each document can be anywhere from a few KB to a couple of megabytes or slightly more. Not much… but any lower number of shards does not scale for us. Perhaps there could be something else going on, but this is what it is.

During our performance testing (indexing, deletes, searches, aggregations, and suggestions) we found that we need this many shards. Our 1X load is about 100 req/sec (total, distributed between different actions like indexing, searching etc.). And because all the vehicles for this dealer are getting indexed on the same shard, our CPU utilization on two nodes (its primary and replica) constantly remains above 50 - 60%. Overall the cluster health looks good, with the CPU on the other nodes being around 25% or less… but I am sure there are many more dealers like this one whose entire data resides on a single shard, and if for some reason those dealers too start experiencing heavy indexing or search traffic then that node’s CPU will spike up as well.

Now, to answer your other question about _cat/indices, here are the details…

health status index pri rep docs.count docs.deleted store.size pri.store.size
green open v-index 11 1 103,173,881 41,287,442 53.7gb 27.1gb

This is Marvel reporting… and yes, the document counts do not match what I reported: 3 - 4 million does not match the 103 million you may see here. I was told by one of your instructors (ES operations training) that the document count reported by Marvel will differ from what you actually index because complex JSON docs are broken down into multiple Lucene documents under the hood, and that is the count Marvel reports… not sure if I explained it correctly or not…

NOTE: So far, I have found that if I use murmur3 instead of djb2 for hashing, the sharding happens evenly without passing in an extra routing key.
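This is the quick, standalone check I used to convince myself (plain Java, Guava for murmur3). The djb2 variant is my own approximation of what ES does, so the absolute shard numbers may not match ES exactly; it is only the spread I was looking at:

```java
import java.util.Arrays;
import com.google.common.hash.Hashing;

// Compare per-shard spread for IDs that increment by 10, hashed with a
// djb2-style function vs Guava's murmur3_32, both taken modulo 11 shards.
public class HashSpreadCheck {

    static int djb2(String value) {
        long hash = 5381;
        for (int i = 0; i < value.length(); i++) {
            hash = ((hash << 5) + hash) + value.charAt(i);
        }
        return (int) hash;
    }

    public static void main(String[] args) {
        int shards = 11;
        int[] djbCounts = new int[shards];
        int[] murmurCounts = new int[shards];

        // 10,000 IDs starting at one of our real IDs, incrementing by 10
        long id = 2606113223L;
        for (int n = 0; n < 10_000; n++, id += 10) {
            String key = Long.toString(id);
            djbCounts[Math.floorMod(djb2(key), shards)]++;
            murmurCounts[Math.floorMod(
                    Hashing.murmur3_32().hashUnencodedChars(key).asInt(), shards)]++;
        }

        System.out.println("djb2    per-shard counts: " + Arrays.toString(djbCounts));
        System.out.println("murmur3 per-shard counts: " + Arrays.toString(murmurCounts));
    }
}
```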

Let me know if you have more questions…

What is the ratio between indexing and search requests? If there are considerably more searches than indexing requests, you may spread out the load by increasing the number of replicas, especially since your total index size is reasonably small. This will make indexing and update requests slower, as more shards need to be updated, but will spread query load better.
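Since index.number_of_replicas is a dynamic setting, you can change it on the live index without any reindexing. With the 1.x Java client it is roughly this (a sketch; the index name is taken from your _cat output):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

// Sketch: change the replica count on a live index with the ES 1.x Java client.
public class IncreaseReplicas {

    public static void setReplicas(Client client, String index, int replicas) {
        client.admin().indices()
              .prepareUpdateSettings(index)
              .setSettings(ImmutableSettings.settingsBuilder()
                      .put("index.number_of_replicas", replicas)
                      .build())
              .execute()
              .actionGet();
    }
}
```

Something like setReplicas(client, "v-index", 2) would take you from 1 replica to 2, and you can drop it back down just as easily if it does not help.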

Right… that is a separate exercise on my plate… However, this cluster is live in production at the moment and this new customer is going to be launched in my system soon… I need to find a relatively quick way to get the sharding issue resolved. At the moment changing the hash algorithm seems like a good bet. And, yes, I understand that in ES 2.0 all this is not supported, but I am not going there tomorrow… so I guess I can take my chances here.

Just curious… how would doing what you suggest fix the issue I am seeing?

According to your description it seems there is an imbalance between shards, but is it really causing problems as the CPU usage is still reasonably low? As the number of replicas can be changed at any time, I would consider trying to increase it from 1 to 2 and see what effect that has. It is a lot easier than changing the hash function and reindexing everything.

That's true… I will give that a try.

While doing perf testing for a customer who has 10K+ vehicles, every one of his vehicles went to shard #2. When I executed search, aggregation, and indexing queries, I saw via Marvel that the CPU was very high on the nodes where shard 2 resided. For the other nodes this was not the case; their CPU remained at or under 25%. The load being applied was the same.

What API can I use to see which shard has how many documents? Does Marvel show it? If yes, can you point me to how to see it for all nodes at once?

I tried this with 7 primaries and 2 replicas. We do see better CPU performance, but I won’t necessarily attribute this to the extra replicas alone. I notice that because we reduced the number of primaries, the documents are more evenly spread across all the shards. Load was 1X of production traffic. We will apply 2X load and see whether the performance scales linearly.