Primary shard size problem

It looks like, sometimes, a primary shard ends up too large - both before and after replication. Let me explain.

We create an index every hour. Each index has 13 primary shards, replication set to 0.

These shards are fairly large, up to about 25gb in size, depending on the time of day.
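For context, each hourly index is created with settings equivalent to this (the endpoint is shown as localhost:9200 just for illustration):

curl -XPUT 'http://localhost:9200/index-2016.11.02.00' -d '
{
  "settings": {
    "number_of_shards": 13,
    "number_of_replicas": 0
  }
}'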

Using the _cat/shards API, the primary shards for any one index are mostly reported to be about the same size - something like 22.8gb. However, sometimes one of them will be 29.3gb. How can that be?
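For anyone who wants to check the same thing, this is roughly the call I use (host is just an example; bytes=b makes the sizes easier to compare exactly):

curl -s 'http://localhost:9200/_cat/shards/index-2016.11.02.00?v&bytes=b'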

--

Our system flushes indexes sometime after the top of the hour (when indexing has moved on to a new index). It then enables replication on the index. Once replicated, the replicas all have the same "correct" size - even the replica of the primary that is "too large". In the example above, the replica of the 29.3gb primary shard is the expected 22.8gb in size.
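Concretely, the hourly job does the equivalent of this (host shown as localhost just for illustration):

curl -XPOST 'http://localhost:9200/index-2016.11.02.00/_flush'
curl -XPUT 'http://localhost:9200/index-2016.11.02.00/_settings' -d '
{
  "index": { "number_of_replicas": 1 }
}'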

The top lines of the _cat/shards output look like this:
index-2016.11.02.00 0 p STARTED 15479815 21.6gb 192.168.180.157 node1
index-2016.11.02.00 0 r STARTED 15479815 21.6gb 192.168.180.150 node2
index-2016.11.02.00 1 p STARTED 15493333 21.6gb 192.168.180.140 node3
index-2016.11.02.00 1 r STARTED 15493333 21.6gb 192.168.180.131 node4
index-2016.11.02.00 2 p STARTED 15489454 29.3gb 192.168.180.130 node5
index-2016.11.02.00 2 r STARTED 15489454 21.4gb 192.168.180.148 node6
index-2016.11.02.00 3 p STARTED 15496716 21.7gb 192.168.180.152 node7
index-2016.11.02.00 3 r STARTED 15496716 21.7gb 192.168.180.138 node8

The doc counts are the same, but the size for the one primary is not.


One guess is that compression is off or not working on node5. However, we have hundreds of shards on each node and not all primaries on this node are too large - and none of the replicas on the node are too large.

--

Any ideas?

Segment merging is not coordinated across shard copies, and will therefore happen at different times, which might explain the size differences you are seeing. You can issue a _forcemerge to force each shard down to e.g. a single segment, but be aware that this is an 'expensive' operation.
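For example, something along these lines (host and index name are just placeholders):

curl -XPOST 'http://localhost:9200/index-2016.11.02.00/_forcemerge?max_num_segments=1'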

Mind if I ask a few follow-up questions?

If I have a primary shard of a certain size, and it's not being written to, and it has been flushed so there is no translog, shouldn't the replica be an exact copy?

Under what conditions would the replica be smaller than the primary from which it was created? Could the replica be merged more than the primary?

Does merging segments make a shard significantly smaller? In the example above, the primary is about 37% larger than its replica.

Unless a replica has been created from a primary that is not being written to, they will have different segments. There is no coordination around merging, so the primary will not necessarily merge before a replica. I have seen merging have a significant impact on the shard size, but have no statistics on it. The easiest way is to issue a _forcemerge on the index where the primary and replica sizes differ, and make sure you specify that it should merge down to 1 segment.

Speaking of possible merge issues, I see other merge-related anomalies.

The node that tends to have the large primary shards also has the most Indexing Merges Total Time and the most Indexing Merges Total Stopped Time.
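I'm reading those numbers out of the node stats, roughly like this (host is an example); the relevant fields are under indices.merges, e.g. total_time_in_millis and total_stopped_time_in_millis:

curl -s 'http://localhost:9200/_nodes/stats/indices?pretty'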

What causes merges to be stopped? We are on Elasticsearch 2.1.1 and I don't know what settings are available here.

These nodes use five 1TB SSDs in a RAID 0 stripe.

Also, this version of ES doesn't log when a merge is stopped or throttled by default. What setting can I enable to see the merge stopped and throttled messages?
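One thing I was going to try is raising a logger level dynamically through the cluster settings API; I'm not sure which logger actually carries the merge stopped/throttled messages, so the logger name below is only a guess:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{
  "transient": { "logger.index.engine": "DEBUG" }
}'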
Thanks.

I found more information that may lead to an explanation.

Using the _cat/segments API, for an index where a primary shard is larger than its replica, the output shows several differences.
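For reference, this is roughly how I pull it (host is an example; the sort just lines up the primary and replica rows for each segment):

curl -s 'http://localhost:9200/_cat/segments/log-2016.11.14.17' | sort -k5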

The first difference is that the primary shard (the one that is "too big") has a segment that the replica does not. It's a huge segment. Here is a sorted snippet; you can see the extra one in the middle, named "_15h":
log-2016.11.14.17 4 p 192.168.180.142 _15f 1491 3650 0 4.6mb 0 TRUE FALSE 5.3.1 TRUE
log-2016.11.14.17 4 r 192.168.180.157 _15f 1491 3650 0 4.6mb 119344 TRUE TRUE 5.3.1 TRUE
log-2016.11.14.17 4 p 192.168.180.142 _15h 1493 3544638 0 5.1gb 7739079 FALSE TRUE 5.3.1 FALSE
log-2016.11.14.17 4 p 192.168.180.142 _15i 1494 13425 0 18mb 161871 TRUE TRUE 5.3.1 TRUE
log-2016.11.14.17 4 r 192.168.180.157 _15i 1494 13425 0 18mb 161871 TRUE TRUE 5.3.1 TRUE

Segment _15h is the only segment in the whole index that is not "committed", which is indicated by FALSE in the first TRUE/FALSE column.

This primary shard also has several segments which are not "searchable" (2nd TRUE/FALSE column):
log-2016.11.14.17 4 p 192.168.180.142 _119 1341 366 0 456.1kb 0 TRUE FALSE 5.3.1 TRUE
log-2016.11.14.17 4 r 192.168.180.157 _119 1341 366 0 456.1kb 62433 TRUE TRUE 5.3.1 TRUE
log-2016.11.14.17 4 p 192.168.180.142 _11y 1366 1717431 0 2.4gb 0 TRUE FALSE 5.3.1 TRUE
log-2016.11.14.17 4 r 192.168.180.157 _11y 1366 1717431 0 2.4gb 3949945 TRUE TRUE 5.3.1 TRUE
log-2016.11.14.17 4 p 192.168.180.142 _124 1372 152848 0 202.8mb 585655 TRUE TRUE 5.3.1 TRUE
log-2016.11.14.17 4 r 192.168.180.157 _124 1372 152848 0 202.8mb 585655 TRUE TRUE 5.3.1 TRUE

In this snippet, segments _119 and _11y are not searchable on the primary, but are on the replica. Segment _124 is searchable on both primary and replica.
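One thing I may try, to see whether these are just transient states, is to flush the index (which should commit any uncommitted segments) and refresh it (which should make segments searchable), then look at _cat/segments again (host is an example):

curl -XPOST 'http://localhost:9200/log-2016.11.14.17/_flush'
curl -XPOST 'http://localhost:9200/log-2016.11.14.17/_refresh'
curl -s 'http://localhost:9200/_cat/segments/log-2016.11.14.17'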

So, what causes the primary to have a segment, a huge one, that is not on the replica?

What causes some segments not to be searchable?

Thanks in advance.
