Movement of document from one shard to another shard

Hi all,

We are using ES 1.7.3 with Java 1.8. We index documents into the es_item index using the bulk API, and it currently holds nearly 23 million documents. I can see which shard a particular document is located on by running a search with "explain": true.
Below is the command I use to see which shard a document is on:
GET /es_item/_search
{
"explain": true,
"query": {
"match": {
"ITEM_ID": "9180373"
}
}
}

Sample output:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 18,
"successful": 18,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_shard": 16,
"_node": "8NXY4FVxRk-JlcmTStqcew",
"_index": "es_item",
"_type": "item",
"_id": "9180373",
"_score": 1,
"_source": {
"ITEM_ID": 9180373,
"ITEM_CODE": "9180373",
"ITEM_DSCR": "UNCODEABLE PRODUCTS",
"ITEM_SPECIFICITY_REF_ID": 188,
"ITEM_TYPE": "UNCODEABLE ITEM",
"IS_SHARED_IND": "Y"}}]}}

My questions are:
1. Is it possible to move a document from one shard to another in Elasticsearch?
(or)
2. Is there any mechanism to do this?

Please help me with this.

Thanks,
Ganeshbabu R

What is the reason you want to do this?

Hi @Christian_Dahlqvist

Okay, let me explain my use case a little.

Our ES cluster has 3 master nodes, 3 data nodes, and 1 client node.
The es_item index is allocated 18 shards and 2 replicas.

Recently we indexed 23 million items (docs) into the es_item index, and some of the items have cross codes numbering nearly 10 to 20 million docs. We are planning to add the cross codes to their corresponding items using Groovy scripts. If several of these items sit on the same shard, all of their cross codes will be added to that shard, where the parent documents are located, so the shards might become unbalanced and we might face performance issues (in searching and indexing).
So we plan to take the top 10 items with the most cross codes and move each one to a separate shard. Then we can update the cross codes on each item and the shards should stay more balanced across the ES cluster.

Is there any workaround to do this?

Please let us know

Thanks,
Ganeshbabu R

Right, so the default hash(routing_key)%num_shards routing policy introduces some undesirable bunching of content on certain shards.
Two options:

  1. Multiple indices
  2. Single index, custom routing.

So with option 1 you put just the "heavy" entities in dedicated indexes and keep all the lighter ones in a single index.
With option 2 (the subject of this ticket) your app needs to come up with the right routing key for heavy entities that ensures a more even spread across the shards in your index (i.e. no more than one heavy entity per shard). This requires you to decode the routing algo and formulate a key that balances as required. To do this you need to use the same hashing algorithm. A simple way of doing this might be to take an existing doc from each shard and use the ID of that doc as the routing key.
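A rough sketch of that "take an existing doc from each shard" idea, in Python. It assumes you already have search hits that carry "_shard" and "_id" (as in the "explain": true output earlier in the thread) and that those docs were indexed with no custom routing. The function names, the "heavy-A"/"heavy-B" labels and the IDs marked hypothetical are made up for illustration only.

```python
def first_id_per_shard(hits):
    """Return {shard_number: doc_id}, keeping the first ID seen for each shard."""
    ids = {}
    for hit in hits:
        ids.setdefault(hit["_shard"], hit["_id"])
    return ids


def plan_routing(heavy_item_ids, shard_to_key):
    """Pair each heavy item with the routing key of a different shard."""
    shards = sorted(shard_to_key)
    if len(heavy_item_ids) > len(shards):
        raise ValueError("more heavy items than shards with a known key")
    return {item: shard_to_key[shard] for item, shard in zip(heavy_item_ids, shards)}


# Hits trimmed to the two fields that matter here.
sample_hits = [
    {"_shard": 16, "_id": "9180373"},  # from the sample output in this thread
    {"_shard": 3, "_id": "1111111"},   # hypothetical ID and shard
    {"_shard": 16, "_id": "2222222"},  # hypothetical, same shard as the first hit
]
shard_to_key = first_id_per_shard(sample_hits)
routing_plan = plan_routing(["heavy-A", "heavy-B"], shard_to_key)
```

Each value in routing_plan would then be passed as ?routing=... when indexing (and later when fetching or updating) the corresponding heavy item.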

Hi @Mark_Harwood

I tried custom routing to index the document to a shard.

Below is the document indexed to the sample test_item index,

POST test_item/item/9180374?routing=9180374
{
"ITEM_ID": 9180374,
"ITEM_CODE": "9588049",
"ITEM_DSCR": "HOME FURNISHINGS & DECOR",
"ITEM_SPECIFICITY_REF_ID": 186,
"ITEM_TYPE": "SGI",
"IS_SHARED_IND": "Y",
"ITEM_EU_NAN_CODE": "8371994526",
"ITEM_MISUSED_GTIN_FLG": "N",
"CRT_DTTM": "2004-09-25 22:00:00",
"UPD_DTTM": "2014-06-01 04:01:30",
"DIST": [
{
"RGN_ID": 5,
"RGN_NM": "AT",
"DSTN_STRT_DT": "2011-03-03 21:33:51",
"DSTN_END_DT": "9999-12-31 00:00:00",
"ITEM_GLBL_CODE_ST_REF_ID": 721,
"FIRST_DATA_DT": null,
"LAST_DATA_DT": "2012-09-17 00:00:00",
"HAS_HIST_IND": "N",
"LAST_CHG_DT": "2004-09-26 10:02:29",
"IS_PREMVMNT_IND": "N",
"PRE_MOVEMENT_DT": null
}
]
}

https://www.elastic.co/guide/en/elasticsearch/guide/current/routing-value.html

Referring to the above link, I am trying to find the shard number using the formula,

shard = hash(routing) % number_of_primary_shards

I used the document ID 9180374 as the routing value. The hash value of 9180374 is aba59ff7ea51c9455b955b0fb853445320e5148b and the number of primary shards is 10.

When I use a GET command to view the shard allocation, I get the following output,

"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_shard": 3,
"_node": "APU6WEGzQjK2rhxKgjeNQA",
"_index": "test_item",
"_type": "item",
"_id": "9180374",
"_score": 1,
"_source": {
"ITEM_ID": 9180374,
"ITEM_CODE": "9588049",
"ITEM_DSCR": "HOME FURNISHINGS & DECOR"
}
}]}

I am confused about the hashing algorithm and how the shard value comes out as "3" in the output. Can you please help me with this?

Thanks,
Ganeshbabu R

The hash algorithm computes a large number given a routing key (the default key being the doc ID).
This is then taken modulo the number of shards, which should give an even spread of numbers between 0 and numShards-1.
You don't need to decode this algo though - just take one example value already stored on each shard.

For any given shard, take the ID of a document indexed with no routing key.
Use that id as the routing key for all documents you subsequently want to pin to that shard.

If the original doc with that ID naturally ended up on shard X then that ID is obviously a valid routing key for shard X.
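If you do want to see where the "3" comes from, here is a small Python sketch. It assumes the 1.x default routing is a DJB2 string hash cast to a signed 32-bit int, followed by abs(hash % shards) with Java's truncating remainder. That is my recollection of the 1.x behaviour rather than something taken from the docs, and I believe later major versions use a different hash, so treat it as an assumption. The function names are mine. Note that the 40-hex-digit value quoted above has the length of a SHA-1 digest, which is not what routing uses.

```python
def djb2_int32(value):
    """DJB2 string hash truncated to a signed 32-bit int, like a Java (int) cast.

    ord() matches Java's charAt() for ASCII keys such as numeric IDs.
    """
    h = 5381
    for ch in value:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h


def shard_for(routing, num_primary_shards):
    """abs(hash % shards), written so it matches Java's truncating remainder."""
    return abs(djb2_int32(routing)) % num_primary_shards


shard_test_item = shard_for("9180374", 10)  # test_item, 10 primary shards
shard_es_item = shard_for("9180373", 18)    # es_item, 18 primary shards
```

Under these assumptions the two values come out as 3 and 16, which agrees with the "_shard" values shown in both outputs in this thread.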

Thanks for your response @Mark_Harwood

It worked as per your suggestion.

You asked me to take the ID of an existing document indexed with no routing key and use it as the routing key for subsequent documents. But suppose I am indexing for the first time (let's say I want document 1 to be indexed to shard 1 and document 2 to shard 2).
How will Elasticsearch index them to particular shards the first time?
(or)
Is there a way to do this in Elasticsearch?

Please let us know your suggestions.

Thanks,
Ganeshbabu R

You could use a test index to figure the keys out with some dummy docs. If it has the same number of shards and elasticsearch version as your production index then the IDs that land on each shard in your test index would always be usable as routing keys in your production system (they are guaranteed to hash-modulo to the same shard number).
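As an offline alternative to indexing dummy docs, the same keys could be searched for in code. This is a swapped-in technique, not the test-index approach itself, and it rests on the assumption that 1.x routing is DJB2 (as a signed 32-bit int) followed by abs(hash % shards). The helper names and the max_candidates limit are made up for this sketch, and I would still confirm the resulting keys against a real test index before relying on them.

```python
def shard_for(routing, num_primary_shards):
    """Assumed 1.x routing: DJB2 as a signed 32-bit int, then abs(hash % shards)."""
    h = 5381
    for ch in routing:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    if h >= 0x80000000:
        h -= 0x100000000
    return abs(h) % num_primary_shards


def one_key_per_shard(num_primary_shards, max_candidates=100000):
    """Try "0", "1", "2", ... and keep the first key that lands on each shard."""
    keys = {}
    for n in range(max_candidates):
        candidate = str(n)
        keys.setdefault(shard_for(candidate, num_primary_shards), candidate)
        if len(keys) == num_primary_shards:
            return keys
    raise RuntimeError("not every shard was reached; raise max_candidates")


# 18 primary shards, as in the es_item index.
keys_for_es_item = one_key_per_shard(18)
```

keys_for_es_item maps each shard number to one routing key, so the first heavy item could be indexed with the key for shard 0, the next with the key for shard 1, and so on.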

Thanks for your response @Mark_Harwood

Sure, I will try it in dev. But first, the number of shards is the same in both the dev and production clusters, while the ES version is 1.7.2 in PROD and 1.7.3 in DEV.

I want each of the above items (that is, documents) to be indexed on its own shard, so that when I update the cross codes on the corresponding items the shards stay more balanced across the cluster.
You also mentioned multiple indices. With multiple indices I believe the scoring will be affected as well: scores from one index will differ from scores from the other index.

Please let us know your suggestions.

Thanks,
Ganeshbabu R