Store size 1,000 times the document byte size

I am experiencing a roughly 1,000x increase in store.size over the document byte size. I've got a very simple mapping and I've compared my mapping to Elasticsearch's internal mapping and they are the same.

So far, I have ingested 60,437 documents and have a store.size of 19.6GB, but the average byte size (String.getBytes().length) of the JSON is only 300-400 bytes per document.
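As a quick sanity check on the "roughly 1,000x" figure, the arithmetic can be sketched in Python. This is only an illustration using the numbers quoted above; the constant and function names are made up for the example, 350 bytes is assumed as the midpoint of the 300-400 byte range, and the store size is assumed to be in binary units (1024-based).

```python
# Figures quoted in the post. Binary units (1 GB = 1024**3 bytes) are an assumption.
STORE_SIZE_BYTES = 19.6 * 1024**3
DOC_COUNT = 60_437
AVG_JSON_BYTES = 350  # assumed midpoint of the quoted 300-400 byte range


def bytes_per_doc(store_bytes: float, doc_count: int) -> float:
    """Average on-disk bytes per indexed document."""
    return store_bytes / doc_count


on_disk = bytes_per_doc(STORE_SIZE_BYTES, DOC_COUNT)
ratio = on_disk / AVG_JSON_BYTES
print(f"{on_disk / 1024:.0f} KiB per document on disk, about {ratio:.0f}x the JSON size")
```

Under those assumptions each document costs a few hundred kilobytes on disk, which is consistent with the three-orders-of-magnitude blow-up described.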

I'm using Elasticsearch 5.2 on an m4.2xlarge EC2 instance. Elasticsearch was installed with mostly default settings, except for what I needed to change in order to pass the bootstrap checks and bind to a non-local IP. I've allocated 16GB (half of my physical memory) to Elasticsearch.

I used to run Elasticsearch 2.x and was ingesting FAR more than just this handful of fields, and was only seeing about 20KB per document, which was still substantial, though manageable.

If anyone can point out anything that would fix this, I would appreciate it. Or is there an ES 5.x configuration I haven't seen that will resolve this?

Below is my mapping.

{
    "settings": {
        "index.query.default_field": "tweetText"
    },
    "mappings": {
        "tweet": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "tweetDate": {
                    "type": "date",
                    "format": "EEE MMM dd HH:mm:ss Z YYYY||strict_date_optional_time||epoch_millis"
                },
                "userId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "screenName": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "tweetText": {
                    "type": "text"
                },
                "cleanedText": {
                    "type": "text"
                },
                "tweetId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "location": {
                    "type": "geo_point",
                    "ignore_malformed": true
                },
                "placeName": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": false
                },
                "placeCountry": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": true
                },
                "placeCountryCode": {
                    "type": "keyword",
                    "doc_values": false,
                    "eager_global_ordinals": false,
                    "index": false
                },
                "placeBoundingBox": {
                    "type": "geo_shape",
                    "tree": "quadtree",
                    "precision": "1m"
                },
                "resolvedUrls": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "hashtags": {
                    "type": "text"
                },
                "mentions": {
                    "type": "text"
                },
                "geoInferences": {
                    "properties": {
                        "matchedName": {
                            "type": "text"
                        },
                        "asciiName": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": false
                        },
                        "country": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": true
                        },
                        "county": {
                            "type": "text"
                        },
                        "countryCode": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "city": {
                            "type": "text"
                        },
                        "admin1Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin2Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin3Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin4Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "confidence": {
                            "type": "float",
                            "doc_values": false,
                            "ignore_malformed": false,
                            "index": false
                        },
                        "coordinates": {
                            "type": "geo_point",
                            "ignore_malformed": true
                        }
                    }
                },
                "temporalInferences": {
                    "type": "date",
                    "ignore_malformed": true
                }
            }
        }
    }
}

I created a gist with an example document. Some documents will obviously be smaller or larger than this one, but it should be on the larger side.

What is the output of the cat indices API and indices stats API for this index?

@Christian_Dahlqvist, see below. Thank you!

GET /_cat/indices/twitter?pri&v&h=health,index,pri,rep,docs.count,mt,pri,rep,docs.count,store.size,pri.store.size

health | index | pri | rep | docs.count | mt | pri.mt | store.size | pri.store.size | pri.store.size
yellow | twitter | 5 | 1 | 26860 | 74 | 74 | 10.1gb | 10.1gb | 10.1gb

Output of /_stats for just that index is below:

Hello @Christian_Dahlqvist,

I've discovered the source of this issue. It seems that it's the bounding box that is at fault, though I've no idea why.

Once I remove the bounding box from the data being ingested, the index is a normal size (600 documents --> 550kb), but as soon as I add the bounding box back in (with a brand new index), the size skyrockets (3,593 documents --> 1.6GB) with only 84 documents containing a bounding box.
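To put a number on how much each bounding box costs, here is a rough Python estimate based only on the figures in the paragraph above. It assumes binary units and that documents without a bounding box cost the same per document in both indices; the variable names are purely illustrative.

```python
KB, GB = 1024, 1024**3  # binary units are an assumption

# Index without bounding boxes: 600 documents in about 550kb.
baseline_bytes_per_doc = 550 * KB / 600

# Index with bounding boxes: 3,593 documents in about 1.6GB, 84 of them with a box.
bloated_index_bytes = 1.6 * GB
docs_total, docs_with_box = 3_593, 84

# What the index would weigh if no document carried a bounding box.
expected_without_boxes = baseline_bytes_per_doc * docs_total

# Attribute all remaining bytes to the 84 shapes.
extra_per_box = (bloated_index_bytes - expected_without_boxes) / docs_with_box
print(f"~{extra_per_box / 1024**2:.1f} MiB of index per bounding box")
```

By this estimate each indexed bounding box accounts for roughly 20 MB, while the other fields contribute only about 1 KB per document.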

Below is the JSON of the bounding box:

"placeBoundingBox": {
    "type": "polygon",
    "coordinates": [
      [
        [
          -71.191421,
          42.227797
        ],
        [
          -71.191421,
          42.399542
        ],
        [
          -70.986004,
          42.399542
        ],
        [
          -70.986004,
          42.227797
        ],
        [
          -71.191421,
          42.227797
        ]
      ]
    ]
  }

The mapping associated with the bounding box (from calling GET /INDEX_NAME):

"placeBoundingBox": {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "1.0m"
  }

To demonstrate that the mapping does in fact work and is creating a proper geo_shape (even though Kibana doesn't recognize it as a geo_shape), I ran the following query and got back a successful hit:

GET /_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {
          
        }
      },
      "filter": {
        "geo_shape": {
          "placeBoundingBox": {
            "shape": {
              "type": "polygon",
              "coordinates": [
                [
                  [
                    -71.191421,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.227797
                  ]
                ]
              ]
            },
            "relation": "within"
          }
        }
      }
    }
  }
}

I'd like to keep the bounding box in. Is there something wrong with either the mapping or the data? Is 1.0m too fine-grained?

Thank you.

The problem was the precision in the mapping, which was simply a typo (our index for Elasticsearch 2.x had the precision set to 1km). One tiny letter made all the difference...

A 1 meter ("1m") precision creates an extremely bloated index.
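For a rough sense of why one letter matters so much, here is a back-of-the-envelope sketch in Python. It is not Elasticsearch's actual implementation: it assumes each quadtree level halves the cell width starting from the Earth's equatorial circumference, and it only counts the leaf cells needed to trace the outline of the bounding box posted earlier. The function names and constants are hypothetical.

```python
import math

EARTH_EQUATOR_M = 40_075_017    # approximate equatorial circumference, in meters
METERS_PER_DEG_LAT = 111_320.0  # approximate length of one degree of latitude


def approx_quadtree_levels(precision_m):
    """Levels needed so that a cell is no wider than precision_m (simplified model)."""
    return math.ceil(math.log2(EARTH_EQUATOR_M / precision_m))


def approx_outline_cells(min_lon, min_lat, max_lon, max_lat, precision_m):
    """Leaf cells of width precision_m needed just to trace the box outline."""
    mid_lat = math.radians((min_lat + max_lat) / 2)
    height_m = (max_lat - min_lat) * METERS_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    width_m = (max_lon - min_lon) * METERS_PER_DEG_LAT * math.cos(mid_lat)
    return math.ceil(2 * (height_m + width_m) / precision_m)


# Corner coordinates taken from the placeBoundingBox example in this thread.
BOX = (-71.191421, 42.227797, -70.986004, 42.399542)

for label, meters in (("1m", 1), ("50m", 50), ("1km", 1000)):
    levels = approx_quadtree_levels(meters)
    cells = approx_outline_cells(*BOX, meters)
    print(f"{label:>4}: ~{levels} levels, ~{cells} outline cells")
```

Under these simplified assumptions, 1m needs about ten more tree levels than 1km, and the outline alone needs roughly a thousand times as many leaf cells. The real numbers depend on how Elasticsearch covers the interior of the shape, so treat this only as an order-of-magnitude illustration.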

Removing the "precision" field from the mapping altogether makes it default to 50m, which also results in a reasonably sized index.
