Huge difference in index size when bulk indexing geoshapes across ES versions


(Stefano Bocconi) #1

I get huge differences in index size when bulk indexing large MultiPolygons across versions of Elasticsearch (1.5.2 and 5.5.0) with the following mapping:

    {
      "settings": {
          "number_of_shards": 5,
          "number_of_replicas": 0,
          "analysis": {
            "analyzer": {
              "lowercase": {
                "type": "custom",
                "filter": "lowercase",
                "tokenizer":  "keyword"
              }
            }
          }
        },
        "mappings": {
          "_default_": {
            "properties": {
              "geometry": {
                "precision": "1m",
                "tree": "quadtree",
                "type": "geo_shape"
              },
              "uri": {
                "index": "not_analyzed",
                "type": "string"
              },
              "id": {
                "index": "not_analyzed",
                "store": true,
                "type": "string"
              },
              "type": {
                "index": "not_analyzed",
                "type": "string"
              },
              "name": {
                "fields": {
                  "analyzed": {
                    "index": "analyzed",
                    "store": true,
                    "type": "string"
                  },
                  "exact": {
                    "analyzer": "lowercase",
                    "store": true,
                    "type": "string"
                  }
                },
                "type": "string"
              },
              "dataset": {
                "index": "not_analyzed",
                "type": "string"
              },
              "validSince": {
                "format": "date_optional_time",
                "type": "date"
              },
              "validUntil": {
                "format": "date_optional_time",
                "type": "date"
              }
            }
          }
        }
    }

What could be the reason for this?

The 1.5.2 version runs on an AWS cluster and the 5.5.0 version on a single machine with Elasticsearch getting 16 GB of RAM.

The size on the cluster is 7.4mb; on the single node it is 4.8gb (using precision 10m, as otherwise it runs out of heap space).

I get the same behaviour if I update the mapping to use text instead of string for 5.5.0.
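For reference, this is roughly how the 1.x string fields translate to 5.x types: a not_analyzed string becomes keyword and an analyzed string becomes text (a sketch of two of the fields, not the full mapping):

```json
{
  "uri": { "type": "keyword" },
  "name": {
    "type": "text",
    "fields": {
      "exact": { "type": "keyword", "store": true }
    }
  }
}
```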

Stefano


(Zachary Tong) #2

There have been an incredible number of changes from 1.x to 5.x, both to geo fields and to regular fields. The entire geo infrastructure has moved over to BKD trees, fields have doc values indexed by default now, etc.

And the comparison is between a cluster and a single machine, which means compression will be very different in each scenario (because data will be spread or colocated).

I'm not sure there is any reasonable way to compare these... it's apples vs oranges I'm afraid.

There are also a ton of other questions that need to be addressed:

  • How many documents?
  • How many shards?
  • How are you determining the size on the cluster? 7.4mb seems too small to be a real value even for a tiny index.
  • Have you verified that both clusters have exactly the same documents?

etc etc.

I'd suggest putting together a more rigorous test to ensure as many variables as possible are ruled out. But I'm honestly not sure it's worth the time; a lot has changed since 1.x. :slight_smile:
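One way to take rounding out of the size comparison is to ask the cat API for exact byte counts and explicit columns (`GET _cat/indices?h=index,docs.count,store.size&bytes=b`) on both clusters. A small hypothetical helper to parse that output into comparable numbers (the helper is a sketch, not part of Elasticsearch):

```python
def parse_cat_indices(cat_output: str) -> dict:
    """Parse `GET _cat/indices?h=index,docs.count,store.size&bytes=b` output.

    With h= the response has exactly three whitespace-separated columns per
    line; bytes=b makes store.size an exact byte count instead of a rounded
    human-readable string like "7.4mb". Returns {index: (docs, size_bytes)}.
    """
    result = {}
    for line in cat_output.strip().splitlines():
        name, docs, size = line.split()
        result[name] = (int(docs), int(size))
    return result
```

Running it against both clusters lets you compare document counts and exact store sizes per index side by side.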


(Stefano Bocconi) #3

Thank you very much for your reply. I am indexing 248 documents derived from the cshapes dataset (https://github.com/histograph/data/blob/master/cshapes/cshapes.pits.ndjson), with 5 shards, and the number of documents is the same when I do _cat/indices.

I also used the same script to index into both instances, but I will look into the issue further.

Stefano


(Mark Harwood) #4

7.1 mb looks to be the size of the original JSON data, and the explosion in size can be thought of as a cost similar to converting a vector graphic into a raster one. While the vector graphic may describe a diagonal line using only two coordinates, an equivalent high-resolution, massive-scale raster format would be much bigger. Like the pixels in an image, entries in the index represent the cells of a grid whose resolution is a sizing choice of your own.
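The raster analogy can be made roughly quantitative: each quadtree level halves the cell width, so the depth needed for a given precision, and with it the number of cells covering a shape, grows quickly as the precision shrinks. A back-of-the-envelope sketch (an approximation, not Elasticsearch's exact internal formula):

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_016.69  # equatorial circumference in metres

def quadtree_levels(precision_m: float) -> int:
    """Approximate quadtree depth so that a cell edge is <= precision_m.

    Each level halves the cell width, so we need 2**levels >= C / precision.
    Rough model only; Elasticsearch's internals differ in detail.
    """
    return max(1, math.ceil(math.log2(EARTH_CIRCUMFERENCE_M / precision_m)))

def cell_count_ratio(fine_m: float, coarse_m: float) -> int:
    """How many times more cells a 2-D coverage needs at the finer precision.

    The number of cells covering an area roughly quadruples per extra level.
    """
    return 4 ** (quadtree_levels(fine_m) - quadtree_levels(coarse_m))
```

Under this model, 1m precision needs about 26 levels and 10m about 22, so tightening precision from 10m to 1m multiplies the cell coverage of a large polygon by roughly 4**4 = 256, which is consistent with the heap pressure and index size reported above.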

This discussion has more of this detail:


70MB csv file becomes 32GB index


(Stefano Bocconi) #5

Today I installed version 1.5.2 on the same single machine, and indexing is instantaneous and the index size is small, as reported.
Whatever changed in version 5.x, the change is huge; I might have missed it, but it would have been good to document it more prominently.

Please note that I run my tests with exactly the same script, so the mapping and the data to index are the same.

My feeling is that for normal shapes things can be fast, but for complex shapes like countries indexing is very slow and the index is large. The vector-to-bitmap comparison makes sense; I am just not sure whether the gain on small shapes was worth the huge loss on bigger ones.

Thanks for your reply,

Stefano


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.