Store size 1,000 times the document byte size

bisoldi · February 7, 2017, 5:27pm

I am experiencing a roughly 1,000x increase in store.size over the document byte size. I've got a very simple mapping and I've compared my mapping to Elasticsearch's internal mapping and they are the same.

So far, I have ingested 60,437 documents and have a store.size of 19.6GB, but the average byte size (String.getBytes().length) of the JSON is only 300-400 bytes per document.
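
To make the scale of the problem concrete, here is a rough back-of-the-envelope check of the numbers above. It is only a sketch: it assumes the reported "GB" means 1024^3 bytes and that the store size is spread evenly across documents, neither of which the post confirms.

```python
# Figures quoted in the post above.
docs = 60_437
store_bytes = 19.6 * 1024**3  # assumption: "GB" is 1024^3 bytes

# Average on-disk footprint per document.
per_doc = store_bytes / docs
print(f"on disk: ~{per_doc / 1024:.0f} KiB per document")

# Ratio of on-disk size to the raw JSON size for both ends of the range.
for json_bytes in (300, 400):
    print(f"{json_bytes} B of JSON -> ~{per_doc / json_bytes:.0f}x amplification")
```

Under those assumptions the footprint works out to a few hundred KiB per document, which is consistent with the roughly 1,000x figure in the title.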

I'm using Elasticsearch 5.2 on an M4.2xlarge EC2 instance. Elasticsearch was installed with almost all defaults, except for what I needed to change in order to pass the bootstrap checks and bind to a non-local IP. I've allocated 16GB (half of my physical memory) to Elasticsearch.

I used to run Elasticsearch 2.x and was ingesting FAR more than just this handful of fields, and was only seeing about 20k per document, which was still substantial, though manageable.

If anyone can point out anything that would fix this, I would appreciate it. Or is there an ES 5.x configuration I haven't seen that will resolve this?

Below is my mapping.

{
    "settings": {
        "index.query.default_field": "tweetText"
    },
    "mappings": {
        "tweet": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "tweetDate": {
                    "type": "date",
                    "format": "EEE MMM dd HH:mm:ss Z YYYY||strict_date_optional_time||epoch_millis"
                },
                "userId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "screenName": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "tweetText": {
                    "type": "text"
                },
                "cleanedText": {
                    "type": "text"
                },
                "tweetId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "location": {
                    "type": "geo_point",
                    "ignore_malformed": true
                },
                "placeName": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": false
                },
                "placeCountry": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": true
                },
                "placeCountryCode": {
                    "type": "keyword",
                    "doc_values": false,
                    "eager_global_ordinals": false,
                    "index": false
                },
                "placeBoundingBox": {
                    "type": "geo_shape",
                    "tree": "quadtree",
                    "precision": "1m"
                },
                "resolvedUrls": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "hashtags": {
                    "type": "text"
                },
                "mentions": {
                    "type": "text"
                },
                "geoInferences": {
                    "properties": {
                        "matchedName": {
                            "type": "text"
                        },
                        "asciiName": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": false
                        },
                        "country": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": true
                        },
                        "county": {
                            "type": "text"
                        },
                        "countryCode": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "city": {
                            "type": "text"
                        },
                        "admin1Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin2Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin3Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin4Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "confidence": {
                            "type": "float",
                            "doc_values": false,
                            "ignore_malformed": false,
                            "index": false
                        },
                        "coordinates": {
                            "type": "geo_point",
                            "ignore_malformed": true
                        }
                    }
                },
                "temporalInferences": {
                    "type": "date",
                    "ignore_malformed": true
                }
            }
        }
    }
}

I created a gist with an example document. There will obviously be some documents smaller and some larger than this, but this one should be on the larger side.

https://gist.github.com/bisoldi/e2cd61863fb3878d13684c4767271567

    {
      "_index": "twitter",
      "_type": "tweet",
      "_id": "AVoZivLca9LOhnR10_ll",
      "_score": null,
      "_source": {
        "tweetDate": 1486487211000,
        "userId": "123456789",
        "screenName": "removed",
        "tweetText": "RT @wef: America’s dominance is over. By 2030, we'll have a handful of global powers https://t.co/vWb0yD3bbK #wef17 https://t.co/KYOjNCmXNi",

(This file has been truncated; the full document is in the gist.)

Christian_Dahlqvist · February 7, 2017, 5:58pm

What is the output of the cat indices API and indices stats API for this index?

bisoldi · February 7, 2017, 6:36pm

@Christian_Dahlqvist, see below. Thank you!

GET /_cat/indices/twitter?pri&v&h=health,index,pri,rep,docs.count,mt,pri,rep,docs.count,store.size,pri.store.size

health | index | pri | rep | docs.count | mt | pri.mt | store.size | pri.store.size | pri.store.size
yellow | twitter | 5 | 1 | 26860 | 74 | 74 | 10.1gb | 10.1gb | 10.1gb

Output of /_stats for just that index is below:

https://gist.github.com/bisoldi/8060947789aa9f35cf5945c5f21878a4

{
  "_shards": {
    "total": 10,
    "successful": 5,
    "failed": 0
  },
  "_all": {
    "primaries": {

(This file has been truncated; the full output is in the gist.)

bisoldi · February 9, 2017, 6:15pm

Hello @Christian_Dahlqvist,

I've discovered the source of this issue. It seems that it's the bounding box that is at fault, though I've no idea why.

Once I remove the bounding box from the data being ingested, the index is a normal size (600 documents --> 550KB), but as soon as I add the bounding box back in (with a brand new index), the size skyrockets (3,593 documents --> 1.6GB), even though only 84 of those documents contain a bounding box.
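
As a rough illustration of how much each bounding box appears to cost, the figures above can be combined as follows. This is only a sketch: it assumes KB and GB are binary units, that documents without a bounding box cost the same in both indices, and that all of the extra space is attributable to the 84 shapes.

```python
# Index without bounding boxes: 600 documents in 550 KB.
baseline_per_doc = 550 * 1024 / 600

# Index with bounding boxes: 3,593 documents in 1.6 GB, 84 of them with a shape.
total_bytes = 1.6 * 1024**3
expected_without_shapes = baseline_per_doc * 3_593

# Attribute everything above the baseline to the 84 shapes (an assumption).
per_shape = (total_bytes - expected_without_shapes) / 84
per_shape_mib = per_shape / 1024**2

print(f"baseline: ~{baseline_per_doc:.0f} bytes per document")
print(f"extra space per bounding box: ~{per_shape_mib:.1f} MiB")
```

If those assumptions hold, each five-point polygon accounts for tens of megabytes on disk, which points at how the shape is indexed rather than at the document text.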

Below is the JSON of the bounding box:

"placeBoundingBox": {
    "type": "polygon",
    "coordinates": [
      [
        [
          -71.191421,
          42.227797
        ],
        [
          -71.191421,
          42.399542
        ],
        [
          -70.986004,
          42.399542
        ],
        [
          -70.986004,
          42.227797
        ],
        [
          -71.191421,
          42.227797
        ]
      ]
    ]
  }

The mapping associated with the bounding box (from calling GET /INDEX_NAME):

"placeBoundingBox": {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "1.0m"
  }

To demonstrate that the mapping does in fact work and is creating a proper geo_shape (even though Kibana doesn't recognize it as a geo_shape), I ran the following query and got back a successful hit:

GET /_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {
          
        }
      },
      "filter": {
        "geo_shape": {
          "placeBoundingBox": {
            "shape": {
              "type": "polygon",
              "coordinates": [
                [
                  [
                    -71.191421,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.227797
                  ]
                ]
              ]
            },
            "relation": "within"
          }
        }
      }
    }
  }
}

I'd like to keep the bounding box in. Is there something wrong with either the mapping or the data? Is 1.0m too fine-grained?

Thank you.

bisoldi · February 9, 2017, 6:55pm

The problem was the precision in the mapping, which was simply a typo: "1m" instead of "1km" (our index for Elasticsearch 2.x had the precision as 1km). One tiny letter made all the difference...

A 1 meter ("1m") precision creates an extremely bloated index.
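
A rough way to see why is to estimate how deep a quadtree has to go before its cells are as small as the requested precision. The sketch below is a simplified approximation, not Elasticsearch's actual formula: it assumes the top-level cell spans the equator (about 40,075 km) and that each level halves the cell width. The function name `quadtree_levels` is hypothetical.

```python
import math

EQUATOR_M = 40_075_017  # approximate equatorial circumference in meters


def quadtree_levels(precision_m: float) -> int:
    """Smallest depth at which the cell width is at most precision_m.

    Simplified model: width at depth d is EQUATOR_M / 2**d.
    """
    return math.ceil(math.log2(EQUATOR_M / precision_m))


for label, meters in [("1m", 1), ("50m", 50), ("1km", 1000)]:
    print(f"{label:>4}: ~{quadtree_levels(meters)} levels")

# Cells along a shape's outline roughly double with every extra level,
# so the gap between 1km and 1m is about 2**10 in boundary cells.
growth = 2 ** (quadtree_levels(1) - quadtree_levels(1000))
print(f"boundary cells, 1m vs 1km: ~{growth}x")
```

In this model "1m" needs about ten more levels than "1km". The exact on-disk cost depends on other settings as well, so treat the numbers as an order-of-magnitude illustration only.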

Removing the "precision" field from the mapping altogether makes it default to 50m, which yields a well-sized index.
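
Since the root cause was a one-letter typo, a small check run against the mapping before creating an index could catch it early. The following is a minimal sketch, not an Elasticsearch feature: the helper names, the 50-meter threshold, and the limited set of supported units are all assumptions made for illustration.

```python
import re

# Assumption: only these metric units are handled; Elasticsearch accepts more.
UNIT_TO_METERS = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}


def precision_in_meters(value: str) -> float:
    """Convert a precision string such as "1m" or "1.0km" to meters."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(mm|cm|m|km)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized precision: {value!r}")
    number, unit = match.groups()
    return float(number) * UNIT_TO_METERS[unit]


def fine_geo_shapes(properties: dict, min_meters: float = 50.0, prefix: str = "") -> list:
    """Return (field path, precision) for geo_shape fields finer than min_meters."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "geo_shape" and "precision" in spec:
            if precision_in_meters(spec["precision"]) < min_meters:
                found.append((path, spec["precision"]))
        # Recurse into object fields such as geoInferences.
        found.extend(fine_geo_shapes(spec.get("properties", {}), min_meters, f"{path}."))
    return found


# Trimmed-down version of the "properties" section from the mapping in this thread.
mapping_properties = {
    "tweetText": {"type": "text"},
    "placeBoundingBox": {"type": "geo_shape", "tree": "quadtree", "precision": "1m"},
}
print(fine_geo_shapes(mapping_properties))
```

With the mapping from this thread the check flags placeBoundingBox; with "1km", or with no "precision" at all, it returns an empty list.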

system · March 9, 2017, 6:55pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
