Elasticsearch version (bin/elasticsearch --version):
5.6.11
Plugins installed:
JVM version (java -version):
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux myserver 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Currently I'm running into a disk space issue on my server and seem to have found the culprit: this index is reportedly ~40 GB in size. That sounds ridiculous, because the average list of coords that I save is about 5 elements.
This index has 11k documents.
What I have tried:
Deleted the index and filled it up again; this resulted in a MUCH MUCH smaller index on Elasticsearch (40 GB -> 14 MB??), which is why I'm very skeptical as to whether this will work at all.
Tried to research through Google without finding any concrete solutions.
This setup is known to generate big indexes, in particular if you are indexing shapes that cover big areas (compared to the given precision).
This is one of the reasons we introduced a new indexing strategy in version 6.6. It still has a few limitations, but if they do not affect your use case I would recommend upgrading to the latest version of Elasticsearch and using that strategy instead.
If upgrading is not an option, I would recommend setting the parameter distance_error_pct or lowering the precision of the indexed shapes.
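For reference, on a 5.x geo_shape mapping the parameter is set alongside the precision. A sketch (the index name, the type name `doc`, and the exact values are assumptions based on this thread):

```
PUT my_polygons_1
{
  "mappings": {
    "doc": {
      "properties": {
        "polygon": {
          "type": "geo_shape",
          "precision": "1m",
          "distance_error_pct": 0.025
        }
      }
    }
  }
}
```

Note that both parameters can only be set at index-creation time, so changing them means creating a new index.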
You are indeed correct; somehow ES turned it into a "float" instead of a geo_shape. However, when I now create the "new" index under a different name and try to _reindex, it won't let me because of the following error: [polygon] is defined as an object in mapping [my_polygons_1] but this name is already used for a field in other types
So _reindex doesn't work at all in this case, and perhaps I need to find a different way to see IF filling this new index results in smaller disk usage?
I think you are not creating the mapping before trying to reindex and that is the reason it is not working. The reindex command does not copy the mapping of the source index so it is trying to generate a dynamic mapping instead.
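For example (a sketch; index and field names are placeholders based on this thread), create the destination index with the explicit geo_shape mapping first, then reindex into it:

```
PUT my_polygons_2
{
  "mappings": {
    "doc": {
      "properties": {
        "polygon": { "type": "geo_shape", "precision": "1m" }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "my_polygons_1" },
  "dest":   { "index": "my_polygons_2" }
}
```

With the mapping already in place, _reindex only copies documents and never has to guess field types dynamically.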
Also, I just did a test with your suggested distance_error_pct, which is (to put it mildly) a huge performance improvement already.
Which brings me to the question:
Say I add said parameter to my mapping, how much is a "reasonable" value to set it to? Currently I set it to 0.025 and saw a huge improvement. Does this mean it allows an error percentage of 2.5% based on my 1-meter precision?
The parameter is a bit more complex and it works as follows:
When the parameter distance_error_pct is set, the algorithm computes the length of the bounding box of the provided shape. This length is multiplied by the value of that parameter and the result is the precision used to index the shape.
If the length of the diagonal of the bounding box of the shape is 10 meters, then the computed precision for that shape is 0.25 meters. Because this is finer than the given precision of 1m, the final precision will be 1m.
If the length of the diagonal is 100 meters, then the precision will be 2.5m.
If the length of the diagonal is 1000 meters, then the precision will be 25m.
and so on...
In summary, you will end up with your shapes indexed at variable precision depending on the area covered by the bounding box of such shape, with a maximum precision given by the precision parameter.
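The rule above can be sketched as follows (an illustration of the behavior described here, not the actual Elasticsearch/Lucene code; the function name is mine):

```python
# Sketch: how distance_error_pct determines the per-shape indexing
# precision, clamped by the mapping's `precision` parameter.
def effective_precision(diagonal_m, distance_error_pct=0.025, precision_m=1.0):
    """Return the precision (in meters) used to index a shape whose
    bounding-box diagonal is diagonal_m meters long."""
    computed = diagonal_m * distance_error_pct
    # The shape is never indexed finer than the mapping's precision.
    return max(computed, precision_m)

# 10 m diagonal   -> 0.25 m computed, clamped to 1 m
# 100 m diagonal  -> 2.5 m
# 1000 m diagonal -> 25 m
```

So the relative error stays roughly constant at 2.5% of each shape's size, while small shapes bottom out at the 1m precision floor.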
I applied the mentioned distance_error_pct to an index where precision isn't all that important for now, and the result for a small subset of 100 documents already seems massive: we went from 280 MB to 1 MB.
I do wonder if this could be applied elsewhere. I also use large, very accurate (lots of coords) polygons, but after reading through the documentation I don't think it's a good idea there (I need accuracy, though I'm still determining how accurate it needs to be).
Again, this seems to be the fix, and I am very grateful for your help!
Great to hear that this solution is good enough for you.
Note that if you require precision you will have to upgrade to take advantage of the new indexing strategy. In addition, be aware that you are running an unmaintained version, as it has reached EOL.