Shards getting bigger with updates (same number of documents)

Hi all

I have an ES 7 cluster running on a couple of EC2 instances. The index has 6 shards and each shard has 2 replicas (so (1+2) x 6 = 18 shards for the index). When I create the index, each shard is around 25-30gb and we hold around 3mln records in the database. We have a fair number of updates happening every day, roughly 1mln; an update means the record gets replaced by a new one but the ID stays the same, so the total document count stays pretty much constant. I've noticed that after a couple of weeks the shard size grows to 50gb, so nearly double. Could someone please explain why this is happening and how I can fix it? (Or should I fix it at all?) I've noticed search performance going down when we reach 50gb shards. Any comments/help would be highly appreciated.



Welcome to our community! :smiley:

What is the output from the _cat/indices?v API?

Thanks Mark

    index_name                    2     p      STARTED 46750788    50gb ip-
    index_name                    2     r      STARTED 46750788  49.3gb ip-
    index_name                    2     r      STARTED 46750788  44.4gb ip-
    index_name                    1     p      STARTED 46532522  47.9gb ip-
    index_name                    1     r      STARTED 46532522  52.7gb ip-
    index_name                    1     r      STARTED 46532522    49gb ip-
    index_name                    3     r      STARTED 46677577    52gb ip-
    index_name                    3     p      STARTED 46677577  47.5gb ip-
    index_name                    3     r      STARTED 46677577  44.4gb ip-
    index_name                    5     p      STARTED 46736104  50.8gb ip-
    index_name                    5     r      STARTED 46736104  52.8gb ip-
    index_name                    5     r      STARTED 46736104    48gb ip-
    index_name                    4     p      STARTED 46660338  45.7gb ip-
    index_name                    4     r      STARTED 46660338  49.6gb ip-
    index_name                    4     r      STARTED 46660338  46.8gb ip-
    index_name                    0     r      STARTED 46504385    43gb ip-
    index_name                    0     r      STARTED 46504385  53.3gb ip-
    index_name                    0     p      STARTED 46504385    51gb ip-

That doesn't look aligned with what you are suggesting - there are no deleted documents showing.

Sorry, maybe I explained it incorrectly - if I have a document with _id x and an update comes in, es.index is performed with _id x, which replaces the doc x that already exists in the ES db. This happens to around 1mln records per day, out of 3mln records in total.

You explained it correctly, but that index is not showing any deleted documents based on the output you provided.

Is that based on the same doc count for the shards? The updates happened in the morning and everything is up to date now across primaries/replicas.

It's based on the output of the _cat command you ran. By default it should show the number of deleted docs directly after the doc count, but there's nothing there?

What version are you on?


And you run _cat/indices?v, exactly that?
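For reference, a request along these lines should include the docs.deleted column explicitly (the endpoint and the `v`/`h` parameters are the standard _cat API; `localhost:9200` and `index_name` are placeholders for your own cluster and index):

```shell
# Index-level stats with the deleted-docs column requested explicitly.
# localhost:9200 and index_name are placeholders - adjust for your cluster.
curl -s 'localhost:9200/_cat/indices/index_name?v&h=health,index,pri,rep,docs.count,docs.deleted,store.size'
```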

Sorry I was looking at the wrong thing - it's getting late now, here you go:

health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index_name 23BMWdfBQKukF5AKjORnkA   6   2  279861774     99922075    833.8gb        268.3gb

Elasticsearch does not perform in-place updates. Data is stored in immutable segments, so updating a document writes a new copy into a new segment and marks the old copy as deleted; the old copy is not immediately removed from disk. It is only when segments are merged in the background that deleted documents are actually reclaimed, and merging is triggered when the proportion of deleted documents in a segment exceeds a threshold. An index growing in size while being updated is therefore expected.
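The numbers in the _cat output above bear this out. A quick back-of-the-envelope check (plain shell arithmetic, using the docs.count and docs.deleted values you posted):

```shell
# docs.count and docs.deleted taken from the _cat/indices output above
docs_count=279861774
docs_deleted=99922075

# Deleted-but-not-yet-merged documents as a share of everything on disk
awk -v c="$docs_count" -v d="$docs_deleted" \
    'BEGIN { printf "%.1f%% of stored documents are deleted\n", 100 * d / (c + d) }'
# -> 26.3% of stored documents are deleted
```

So roughly a quarter of what is sitting on disk is old, updated-away copies waiting to be merged out, which accounts for a sizeable part of the shard growth you are seeing.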

That makes sense. What should be done in this case? Should I increase the number of shards so that I don't get into a situation where a shard exceeds the recommended size? Is there a way to trigger a merge?

You can use the force merge API to trigger merges and it has a parameter named only_expunge_deletes that may help.
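For example (the endpoint and the only_expunge_deletes parameter are the real force merge API; the host and index name are placeholders):

```shell
# Reclaim space from deleted (updated-away) documents only, without
# merging everything down to a single large segment.
# localhost:9200 and index_name are placeholders - adjust for your cluster.
curl -s -X POST 'localhost:9200/index_name/_forcemerge?only_expunge_deletes=true'
```

Bear in mind that a force merge is I/O-intensive, so it is worth running off-peak and checking the force merge documentation before using it on an index that is still receiving heavy writes.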

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.