Index data on disk between versions

Hello

I have set up two Elasticsearch instances on two different servers. The first runs Elasticsearch 7.10 and the other 7.12.

I create exactly the same index and index exactly the same data on both; the two are 100% identical. One is an exact replica of the other.

First I delete, close, and then create the index again on both.
Then I index about 100K documents from a database (same replica on both servers).

As soon as the indexing ends (takes about 5 minutes):
On the first server the data on disk (Lucene data files) is about 65 MB.
On the second server it is about 650 MB, 10 times bigger.

The number of documents is of course the same.

Why is there a 10x difference? I repeated the procedure 3-4 times with the same results.

I noticed that on the second server the .tim files are 10 times the size of those on the first. So something in the way Lucene builds the term dictionary seems to be different.
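
For anyone who wants to repeat this per-file-type comparison, here is a minimal sketch that totals the on-disk size of a shard's Lucene files by extension. The function names and the example path are assumptions for illustration and not something from my setup; point it at the shard's `index` folder under the node's data path.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(index_dir):
    """Return {extension: total size in bytes} for all files under index_dir."""
    totals = defaultdict(int)
    for path in Path(index_dir).rglob("*"):
        if path.is_file():
            # The file suffix (e.g. ".tim") identifies the Lucene file type;
            # files without a suffix are grouped under "(none)".
            totals[path.suffix or "(none)"] += path.stat().st_size
    return dict(totals)


def print_report(index_dir):
    """Print one line per extension, largest first."""
    totals = size_by_extension(index_dir)
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{ext:>8}  {size / (1024 * 1024):10.2f} MB")


# Hypothetical usage; replace the path with the real shard directory:
# print_report(r"D:\elasticsearch\data\nodes\0\indices\<index-uuid>\0\index")
```

Running it against the same shard on both servers and comparing the two reports shows which file types account for the difference.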

Has anything changed between v7.10 and v7.12?

Welcome to our community! :smiley:

That's a little hard to say without being able to replicate it ourselves unfortunately.
But, what is the output from the _cluster/stats?pretty&human API from each cluster?

Cluster1

    {
      "_nodes" : {
        "total" : 1,
        "successful" : 1,
        "failed" : 0
      },
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "517m_VFcSie5_rV3Ty-Etg",
      "timestamp" : 1620015784712,
      "status" : "yellow",
      "indices" : {
        "count" : 20,
        "shards" : {
          "total" : 20,
          "primaries" : 20,
          "replication" : 0.0,
          "index" : {
            "shards" : {
              "min" : 1,
              "max" : 1,
              "avg" : 1.0
            },
            "primaries" : {
              "min" : 1,
              "max" : 1,
              "avg" : 1.0
            },
            "replication" : {
              "min" : 0.0,
              "max" : 0.0,
              "avg" : 0.0
            }
          }
        },
        "docs" : {
          "count" : 381415,
          "deleted" : 270510
        },
        "store" : {
          "size" : "202.2mb",
          "size_in_bytes" : 212045424,
          "reserved" : "0b",
          "reserved_in_bytes" : 0
        },
        "fielddata" : {
          "memory_size" : "4.9mb",
          "memory_size_in_bytes" : 5167152,
          "evictions" : 0
        },
        "query_cache" : {
          "memory_size" : "11.3mb",
          "memory_size_in_bytes" : 11856720,
          "total_count" : 1038718,
          "hit_count" : 260203,
          "miss_count" : 778515,
          "cache_size" : 8767,
          "cache_count" : 56046,
          "evictions" : 47279
        },
        "completion" : {
          "size" : "45.7kb",
          "size_in_bytes" : 46841
        },
        "segments" : {
          "count" : 78,
          "memory" : "1mb",
          "memory_in_bytes" : 1070817,
          "terms_memory" : "413.2kb",
          "terms_memory_in_bytes" : 423161,
          "stored_fields_memory" : "38kb",
          "stored_fields_memory_in_bytes" : 38992,
          "term_vectors_memory" : "0b",
          "term_vectors_memory_in_bytes" : 0,
          "norms_memory" : "15.6kb",
          "norms_memory_in_bytes" : 16064,
          "points_memory" : "0b",
          "points_memory_in_bytes" : 0,
          "doc_values_memory" : "578.7kb",
          "doc_values_memory_in_bytes" : 592600,
          "index_writer_memory" : "2.4mb",
          "index_writer_memory_in_bytes" : 2562456,
          "version_map_memory" : "0b",
          "version_map_memory_in_bytes" : 0,
          "fixed_bit_set" : "77.8kb",
          "fixed_bit_set_memory_in_bytes" : 79704,
          "max_unsafe_auto_id_timestamp" : -1,
          "file_sizes" : { }
        },
        "mappings" : {
          "field_types" : [
            {
              "name" : "binary",
              "count" : 9,
              "index_count" : 1
            },
            {
              "name" : "boolean",
              "count" : 48,
              "index_count" : 11
            },
            {
              "name" : "completion",
              "count" : 2,
              "index_count" : 2
            },
            {
              "name" : "date",
              "count" : 147,
              "index_count" : 20
            },
            {
              "name" : "double",
              "count" : 54,
              "index_count" : 3
            },
            {
              "name" : "flattened",
              "count" : 9,
              "index_count" : 1
            },
            {
              "name" : "float",
              "count" : 33,
              "index_count" : 4
            },
            {
              "name" : "half_float",
              "count" : 19,
              "index_count" : 5
            },
            {
              "name" : "integer",
              "count" : 71,
              "index_count" : 4
            },
            {
              "name" : "keyword",
              "count" : 643,
              "index_count" : 20
            },
            {
              "name" : "long",
              "count" : 526,
              "index_count" : 17
            },
            {
              "name" : "nested",
              "count" : 77,
              "index_count" : 11
            },
            {
              "name" : "object",
              "count" : 412,
              "index_count" : 17
            },
            {
              "name" : "search_as_you_type",
              "count" : 2,
              "index_count" : 1
            },
            {
              "name" : "text",
              "count" : 208,
              "index_count" : 16
            }
          ]
        },
        "analysis" : {
          "char_filter_types" : [ ],
          "tokenizer_types" : [ ],
          "filter_types" : [
            {
              "name" : "edge_ngram",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "icu_folding",
              "count" : 3,
              "index_count" : 3
            },
            {
              "name" : "lowercase",
              "count" : 3,
              "index_count" : 3
            },
            {
              "name" : "shingle",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "stemmer",
              "count" : 3,
              "index_count" : 3
            },
            {
              "name" : "stemmer_override",
              "count" : 3,
              "index_count" : 3
            },
            {
              "name" : "stop",
              "count" : 3,
              "index_count" : 3
            }
          ],
          "analyzer_types" : [
            {
              "name" : "custom",
              "count" : 9,
              "index_count" : 3
            }
          ],
          "built_in_char_filters" : [ ],
          "built_in_tokenizers" : [
            {
              "name" : "icu_tokenizer",
              "count" : 9,
              "index_count" : 3
            }
          ],
          "built_in_filters" : [
            {
              "name" : "lowercase",
              "count" : 5,
              "index_count" : 1
            }
          ],
          "built_in_analyzers" : [ ]
        }
      },
      "nodes" : {
        "count" : {
          "total" : 1,
          "coordinating_only" : 0,
          "data" : 1,
          "data_cold" : 1,
          "data_content" : 1,
          "data_hot" : 1,
          "data_warm" : 1,
          "ingest" : 1,
          "master" : 1,
          "ml" : 1,
          "remote_cluster_client" : 1,
          "transform" : 1,
          "voting_only" : 0
        },
        "versions" : [
          "7.10.0"
        ],
        "os" : {
          "available_processors" : 12,
          "allocated_processors" : 12,
          "names" : [
            {
              "name" : "Windows Server 2019",
              "count" : 1
            }
          ],
          "pretty_names" : [
            {
              "pretty_name" : "Windows Server 2019",
              "count" : 1
            }
          ],
          "mem" : {
            "total" : "63.8gb",
            "total_in_bytes" : 68567556096,
            "free" : "10.4gb",
            "free_in_bytes" : 11247824896,
            "used" : "53.3gb",
            "used_in_bytes" : 57319731200,
            "free_percent" : 16,
            "used_percent" : 84
          }
        },
        "process" : {
          "cpu" : {
            "percent" : -1
          },
          "open_file_descriptors" : {
            "min" : -1,
            "max" : -1,
            "avg" : 0
          }
        },
        "jvm" : {
          "max_uptime" : "3.3d",
          "max_uptime_in_millis" : 288202498,
          "versions" : [
            {
              "version" : "11.0.9",
              "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
              "vm_version" : "11.0.9+7-LTS",
              "vm_vendor" : "Oracle Corporation",
              "bundled_jdk" : false,
              "using_bundled_jdk" : null,
              "count" : 1
            }
          ],
          "mem" : {
            "heap_used" : "6.7gb",
            "heap_used_in_bytes" : 7211837112,
            "heap_max" : "15.9gb",
            "heap_max_in_bytes" : 17092640768
          },
          "threads" : 123
        },
        "fs" : {
          "total" : "894.2gb",
          "total_in_bytes" : 960194670592,
          "free" : "588.1gb",
          "free_in_bytes" : 631523094528,
          "available" : "588.1gb",
          "available_in_bytes" : 631523094528
        },
        "plugins" : [
          {
            "name" : "analysis-icu",
            "version" : "7.10.0",
            "elasticsearch_version" : "7.10.0",
            "java_version" : "1.8",
            "description" : "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
            "classname" : "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
            "extended_plugins" : [ ],
            "has_native_controller" : false
          }
        ],
        "network_types" : {
          "transport_types" : {
            "netty4" : 1
          },
          "http_types" : {
            "netty4" : 1
          }
        },
        "discovery_types" : {
          "zen" : 1
        },
        "packaging_types" : [
          {
            "flavor" : "unknown",
            "type" : "unknown",
            "count" : 1
          }
        ],
        "ingest" : {
          "number_of_pipelines" : 1,
          "processor_stats" : {
            "gsub" : {
              "count" : 0,
              "failed" : 0,
              "current" : 0,
              "time" : "0s",
              "time_in_millis" : 0
            },
            "script" : {
              "count" : 0,
              "failed" : 0,
              "current" : 0,
              "time" : "0s",
              "time_in_millis" : 0
            }
          }
        }
      }
    }

Cluster2

    {
      "_nodes" : {
        "total" : 1,
        "successful" : 1,
        "failed" : 0
      },
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "CFZkROkyS5eGq26EkLydnQ",
      "timestamp" : 1620015937232,
      "status" : "yellow",
      "indices" : {
        "count" : 14,
        "shards" : {
          "total" : 14,
          "primaries" : 14,
          "replication" : 0.0,
          "index" : {
            "shards" : {
              "min" : 1,
              "max" : 1,
              "avg" : 1.0
            },
            "primaries" : {
              "min" : 1,
              "max" : 1,
              "avg" : 1.0
            },
            "replication" : {
              "min" : 0.0,
              "max" : 0.0,
              "avg" : 0.0
            }
          }
        },
        "docs" : {
          "count" : 400652,
          "deleted" : 169618
        },
        "store" : {
          "size" : "530.6mb",
          "size_in_bytes" : 556395062,
          "reserved" : "0b",
          "reserved_in_bytes" : 0
        },
        "fielddata" : {
          "memory_size" : "4.8mb",
          "memory_size_in_bytes" : 5072864,
          "evictions" : 0
        },
        "query_cache" : {
          "memory_size" : "11.3mb",
          "memory_size_in_bytes" : 11857719,
          "total_count" : 2974483,
          "hit_count" : 1430299,
          "miss_count" : 1544184,
          "cache_size" : 2708,
          "cache_count" : 73066,
          "evictions" : 70358
        },
        "completion" : {
          "size" : "0b",
          "size_in_bytes" : 0
        },
        "segments" : {
          "count" : 67,
          "memory" : "715.1kb",
          "memory_in_bytes" : 732364,
          "terms_memory" : "260.3kb",
          "terms_memory_in_bytes" : 266592,
          "stored_fields_memory" : "32.6kb",
          "stored_fields_memory_in_bytes" : 33400,
          "term_vectors_memory" : "0b",
          "term_vectors_memory_in_bytes" : 0,
          "norms_memory" : "8.1kb",
          "norms_memory_in_bytes" : 8320,
          "points_memory" : "0b",
          "points_memory_in_bytes" : 0,
          "doc_values_memory" : "414.1kb",
          "doc_values_memory_in_bytes" : 424052,
          "index_writer_memory" : "2.5mb",
          "index_writer_memory_in_bytes" : 2716220,
          "version_map_memory" : "0b",
          "version_map_memory_in_bytes" : 0,
          "fixed_bit_set" : "67.3kb",
          "fixed_bit_set_memory_in_bytes" : 68944,
          "max_unsafe_auto_id_timestamp" : 1619937094844,
          "file_sizes" : { }
        },
        "mappings" : {
          "field_types" : [
            {
              "name" : "boolean",
              "count" : 19,
              "index_count" : 7
            },
            {
              "name" : "date",
              "count" : 43,
              "index_count" : 9
            },
            {
              "name" : "double",
              "count" : 18,
              "index_count" : 1
            },
            {
              "name" : "float",
              "count" : 30,
              "index_count" : 3
            },
            {
              "name" : "half_float",
              "count" : 24,
              "index_count" : 6
            },
            {
              "name" : "integer",
              "count" : 66,
              "index_count" : 3
            },
            {
              "name" : "keyword",
              "count" : 250,
              "index_count" : 9
            },
            {
              "name" : "long",
              "count" : 591,
              "index_count" : 9
            },
            {
              "name" : "nested",
              "count" : 42,
              "index_count" : 5
            },
            {
              "name" : "object",
              "count" : 346,
              "index_count" : 8
            },
            {
              "name" : "search_as_you_type",
              "count" : 2,
              "index_count" : 1
            },
            {
              "name" : "text",
              "count" : 23,
              "index_count" : 6
            }
          ]
        },
        "analysis" : {
          "char_filter_types" : [ ],
          "tokenizer_types" : [ ],
          "filter_types" : [
            {
              "name" : "edge_ngram",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "icu_folding",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "lowercase",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "shingle",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "stemmer",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "stemmer_override",
              "count" : 1,
              "index_count" : 1
            },
            {
              "name" : "stop",
              "count" : 1,
              "index_count" : 1
            }
          ],
          "analyzer_types" : [
            {
              "name" : "custom",
              "count" : 5,
              "index_count" : 1
            }
          ],
          "built_in_char_filters" : [ ],
          "built_in_tokenizers" : [
            {
              "name" : "icu_tokenizer",
              "count" : 5,
              "index_count" : 1
            }
          ],
          "built_in_filters" : [
            {
              "name" : "lowercase",
              "count" : 5,
              "index_count" : 1
            }
          ],
          "built_in_analyzers" : [ ]
        },
        "versions" : [
          {
            "version" : "7.12.1",
            "index_count" : 14,
            "primary_shard_count" : 14,
            "total_primary_size" : "530.6mb",
            "total_primary_bytes" : 556395062
          }
        ]
      },
      "nodes" : {
        "count" : {
          "total" : 1,
          "coordinating_only" : 0,
          "data" : 1,
          "data_cold" : 1,
          "data_content" : 1,
          "data_frozen" : 1,
          "data_hot" : 1,
          "data_warm" : 1,
          "ingest" : 1,
          "master" : 1,
          "ml" : 1,
          "remote_cluster_client" : 1,
          "transform" : 1,
          "voting_only" : 0
        },
        "versions" : [
          "7.12.1"
        ],
        "os" : {
          "available_processors" : 64,
          "allocated_processors" : 64,
          "names" : [
            {
              "name" : "Windows Server 2019",
              "count" : 1
            }
          ],
          "pretty_names" : [
            {
              "pretty_name" : "Windows Server 2019",
              "count" : 1
            }
          ],
          "architectures" : [
            {
              "arch" : "amd64",
              "count" : 1
            }
          ],
          "mem" : {
            "total" : "255.8gb",
            "total_in_bytes" : 274737979392,
            "free" : "173.4gb",
            "free_in_bytes" : 186189275136,
            "used" : "82.4gb",
            "used_in_bytes" : 88548704256,
            "free_percent" : 68,
            "used_percent" : 32
          }
        },
        "process" : {
          "cpu" : {
            "percent" : 0
          },
          "open_file_descriptors" : {
            "min" : -1,
            "max" : -1,
            "avg" : 0
          }
        },
        "jvm" : {
          "max_uptime" : "21.4h",
          "max_uptime_in_millis" : 77156559,
          "versions" : [
            {
              "version" : "16.0.1",
              "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
              "vm_version" : "16.0.1+9-24",
              "vm_vendor" : "Oracle Corporation",
              "bundled_jdk" : false,
              "using_bundled_jdk" : null,
              "count" : 1
            }
          ],
          "mem" : {
            "heap_used" : "9.3gb",
            "heap_used_in_bytes" : 10084310512,
            "heap_max" : "29.5gb",
            "heap_max_in_bytes" : 31675383808
          },
          "threads" : 352
        },
        "fs" : {
          "total" : "3.4tb",
          "total_in_bytes" : 3840410644480,
          "free" : "3.3tb",
          "free_in_bytes" : 3638142078976,
          "available" : "3.3tb",
          "available_in_bytes" : 3638142078976
        },
        "plugins" : [
          {
            "name" : "analysis-icu",
            "version" : "7.12.1",
            "elasticsearch_version" : "7.12.1",
            "java_version" : "1.8",
            "description" : "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
            "classname" : "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
            "extended_plugins" : [ ],
            "has_native_controller" : false,
            "licensed" : false,
            "type" : "isolated"
          }
        ],
        "network_types" : {
          "transport_types" : {
            "netty4" : 1
          },
          "http_types" : {
            "netty4" : 1
          }
        },
        "discovery_types" : {
          "zen" : 1
        },
        "packaging_types" : [
          {
            "flavor" : "unknown",
            "type" : "unknown",
            "count" : 1
          }
        ],
        "ingest" : {
          "number_of_pipelines" : 1,
          "processor_stats" : {
            "gsub" : {
              "count" : 0,
              "failed" : 0,
              "current" : 0,
              "time" : "0s",
              "time_in_millis" : 0
            },
            "script" : {
              "count" : 0,
              "failed" : 0,
              "current" : 0,
              "time" : "0s",
              "time_in_millis" : 0
            }
          }
        }
      }
    }
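
As a coarse sanity check on the two outputs above, the store size can be divided by the live document count. This is only a sketch: the figures are cluster-wide totals over 20 and 14 indices, the store size still includes deleted documents that have not been merged away, and the helper name is made up for illustration. It does not isolate the one index in question.

```python
def store_bytes_per_doc(stats):
    """Store size in bytes divided by live document count, from a _cluster/stats response."""
    indices = stats["indices"]
    return indices["store"]["size_in_bytes"] / indices["docs"]["count"]


# Only the relevant fields, copied from the two _cluster/stats outputs above.
cluster1 = {"indices": {"docs": {"count": 381415},
                        "store": {"size_in_bytes": 212045424}}}
cluster2 = {"indices": {"docs": {"count": 400652},
                        "store": {"size_in_bytes": 556395062}}}

per_doc_1 = store_bytes_per_doc(cluster1)
per_doc_2 = store_bytes_per_doc(cluster2)
ratio = per_doc_2 / per_doc_1

print(f"cluster1: {per_doc_1:.1f} bytes per live doc")
print(f"cluster2: {per_doc_2:.1f} bytes per live doc")
print(f"ratio:    {ratio:.2f}")
```

Because other indices are mixed in, the cluster-wide ratio is expected to be smaller than the 10x seen on the single index.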

I forgot to say that a Greek stemmer is used (on both). Maybe something changed in how Lucene calculates the terms in Greek?

The most accurate way to compare index/shard sizes is generally to first force merge them down to a single segment. As they are now, they could be in different stages of merging. The size difference in specific file types does seem interesting, though, and might very well persist.
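
A minimal sketch of that comparison, assuming an unsecured node reachable over plain HTTP and a hypothetical index name `my-index`: it calls the force merge API with `max_num_segments=1` and then reads the primary store size from the index stats API. The helper names are made up for illustration.

```python
import json
import urllib.request


def build_forcemerge_request(base_url, index):
    """Build a POST request that merges the index down to a single segment."""
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge?max_num_segments=1"
    return urllib.request.Request(url, method="POST")


def primary_store_bytes(stats, index):
    """Extract the primary store size from a parsed GET /<index>/_stats/store response."""
    return stats["indices"][index]["primaries"]["store"]["size_in_bytes"]


def merged_size(base_url, index):
    """Force merge the index, then return its primary store size in bytes."""
    # Assumption: no authentication or TLS is needed to reach the node.
    with urllib.request.urlopen(build_forcemerge_request(base_url, index)):
        pass
    stats_url = f"{base_url.rstrip('/')}/{index}/_stats/store"
    with urllib.request.urlopen(stats_url) as resp:
        return primary_store_bytes(json.loads(resp.read().decode("utf-8")), index)


# Hypothetical usage, run once against each server:
# print(merged_size("http://localhost:9200", "my-index"))
```

Comparing the two numbers after the merge removes differences that come only from how far segment merging has progressed on each node.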

Hello

Thanks for the reply. My questions:

  1. Does that explain the 10x size difference when I create the index for the first time and do the initial indexing? As soon as I have indexed, for example, 1000 documents (1 minute after index creation), the difference is already 10x, and it stays the same until the end of indexing. So, as I see it, the problem is there from the start.

  2. Wouldn't the sizes become similar after 1-2 days, once merges have been done?

A 10x difference is a big difference. Is there a way (e.g. a tool) to see what Lucene has created (decode the .tim files) in order to compare them?

Finally, does this difference affect memory consumption? If so, I have a problem.