Huge difference in index size when bulk indexing geoshapes across ES versions


(Stefano Bocconi) #1

I get huge differences in index size when bulk indexing large MultiPolygons across versions of Elasticsearch (1.5.2 and 5.5.0) with the following mapping:

    {
      "settings": {
          "number_of_shards": 5,
          "number_of_replicas": 0,
          "analysis": {
            "analyzer": {
              "lowercase": {
                "type": "custom",
                "filter": "lowercase",
                "tokenizer":  "keyword"
              }
            }
          }
        },
        "mappings": {
          "_default_": {
            "properties": {
              "geometry": {
                "precision": "1m",
                "tree": "quadtree",
                "type": "geo_shape"
              },
              "uri": {
                "index": "not_analyzed",
                "type": "string"
              },
              "id": {
                "index": "not_analyzed",
                "store": true,
                "type": "string"
              },
              "type": {
                "index": "not_analyzed",
                "type": "string"
              },
              "name": {
                "fields": {
                  "analyzed": {
                    "index": "analyzed",
                    "store": true,
                    "type": "string"
                  },
                  "exact": {
                    "analyzer": "lowercase",
                    "store": true,
                    "type": "string"
                  }
                },
                "type": "string"
              },
              "dataset": {
                "index": "not_analyzed",
                "type": "string"
              },
              "validSince": {
                "format": "date_optional_time",
                "type": "date"
              },
              "validUntil": {
                "format": "date_optional_time",
                "type": "date"
              }
            }
          }
        }
    }

What could be the reason for this?

The 1.5.2 version runs on an AWS cluster and the 5.5.0 version on a single machine with Elasticsearch getting 16 GB of RAM.

The size on the cluster is 7.4mb; on the single node it is 4.8gb (using precision 10m, as otherwise it runs out of heap space).

I get the same behaviour if I update the mapping to use text instead of string for 5.5.0.
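For reference, this is roughly how the 1.x string fields translate to 5.x types: a not_analyzed string becomes keyword and an analyzed string becomes text (a sketch of two of the fields, not the full mapping):

```json
{
  "uri": { "type": "keyword" },
  "name": {
    "type": "text",
    "fields": {
      "exact": { "type": "keyword", "store": true }
    }
  }
}
```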

Stefano


(Zachary Tong) #2

There have been an incredible number of changes from 1.x to 5.x, both to geo fields and to regular fields. The entire geo infrastructure has moved over to BKD trees, fields have doc values indexed by default now, etc.

And the comparison is between a cluster and a single machine, which means compression will be very different in each scenario (because data will be spread or colocated).

I'm not sure there is any reasonable way to compare these... it's apples vs oranges I'm afraid.

There are also a ton of other questions that need to be addressed:

  • How many documents?
  • How many shards?
  • How are you determining the size on the cluster? 7.4mb seems too small to be a real value even for a tiny index.
  • Have you verified that both clusters have exactly the same documents?

etc etc.

I'd suggest putting together a more rigorous test to ensure as many variables as possible are ruled out. But I'm honestly not sure it's worth the time; a lot has changed since 1.x. :slight_smile:
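One way to take rounding out of the size comparison is to ask the cat API for exact byte counts and explicit columns (`GET _cat/indices?h=index,docs.count,store.size&bytes=b`) on both clusters. A small hypothetical helper to parse that output into comparable numbers (the helper is a sketch, not part of Elasticsearch):

```python
def parse_cat_indices(cat_output: str) -> dict:
    """Parse `GET _cat/indices?h=index,docs.count,store.size&bytes=b` output.

    With h= the response has exactly three whitespace-separated columns per
    line; bytes=b makes store.size an exact byte count instead of a rounded
    human-readable string like "7.4mb". Returns {index: (docs, size_bytes)}.
    """
    result = {}
    for line in cat_output.strip().splitlines():
        name, docs, size = line.split()
        result[name] = (int(docs), int(size))
    return result
```

Running it against both clusters lets you compare document counts and exact store sizes per index side by side.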


(Stefano Bocconi) #3

Thank you very much for your reply. I am indexing 248 documents derived from the cshapes dataset (https://github.com/histograph/data/blob/master/cshapes/cshapes.pits.ndjson), with 5 shards, and the number of documents is the same when I do _cat/indices.

I also used the same script to index into both instances, but I will look into the issue further.

Stefano


(Mark Harwood) #4

7.1 mb looks to be the size of the original JSON data, and the explosion in size can be thought of as a cost similar to converting a vector graphic into a raster one. While the vector graphic may describe a diagonal line using only two coordinates, an equivalent high-resolution, massive-scale raster format would be much bigger. Like the pixels in an image, entries in the index represent the cells of a grid whose resolution is a sizing choice of your own.
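The raster analogy can be made roughly quantitative: each quadtree level halves the cell width, so the depth needed for a given precision, and with it the number of cells covering a shape, grows quickly as the precision shrinks. A back-of-the-envelope sketch (an approximation, not Elasticsearch's exact internal formula):

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_016.69  # equatorial circumference in metres

def quadtree_levels(precision_m: float) -> int:
    """Approximate quadtree depth so that a cell edge is <= precision_m.

    Each level halves the cell width, so we need 2**levels >= C / precision.
    Rough model only; Elasticsearch's internals differ in detail.
    """
    return max(1, math.ceil(math.log2(EARTH_CIRCUMFERENCE_M / precision_m)))

def cell_count_ratio(fine_m: float, coarse_m: float) -> int:
    """How many times more cells a 2-D coverage needs at the finer precision.

    The number of cells covering an area roughly quadruples per extra level.
    """
    return 4 ** (quadtree_levels(fine_m) - quadtree_levels(coarse_m))
```

Under this model, 1m precision needs about 26 levels and 10m about 22, so tightening precision from 10m to 1m multiplies the cell coverage of a large polygon by roughly 4**4 = 256, which is consistent with the heap pressure and index size reported above.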

This discussion has more of this detail:


70MB csv file becomes 32GB index


(Stefano Bocconi) #5

Today I installed version 1.5.2 on the same single machine, and indexing is instantaneous and the index size is small, as reported.
Whatever changed in version 5.x, the change is huge; I might have missed it, but it would have been good to document it more prominently.

Please note that I run my tests with exactly the same script, so the mapping and the data to index are the same.

My feeling is that for normal shapes things can be fast, but for complex shapes like countries indexing is very slow and the index is large. The vector-to-bitmap comparison makes sense; I am just not sure whether the gain on small shapes was worth the huge loss on bigger ones.

Thanks for your reply,

Stefano


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.