Weird inconsistencies regarding storage size in Elasticsearch 6

Hi everyone!
I'm seeing weird behavior when measuring Elasticsearch 6's disk usage.
Here is how I set up my test:

This is the mapping:

"transaction": {
        "properties": {
          "changedFieldsList": {
            "type": "nested",
            "properties": {
              "fieldChanged": {
                "type": "text"
              },
              "fieldId": {
                "type": "text"
              },
              "fieldType": {
                "type": "text"
              },
              "fieldValue": {
                "type": "text"
              }
            }
          },
          "commandCommitScn": {
            "type": "text"
          },
          "commandId": {
            "type": "text"
          },
          "commandScn": {
            "type": "text"
          },
          "commandSequence": {
            "type": "text"
          },
          "commandTimestamp": {
            "type": "text"
          },
          "commandType": {
            "type": "text"
          },
          "conditionFieldsList": {
            "type": "nested",
            "properties": {
              "fieldId": {
                "type": "text"
              },
              "fieldType": {
                "type": "text"
              },
              "fieldValue": {
                "type": "text"
              }
            }
          },
          "objectDBName": {
            "type": "text"
          },
          "objectId": {
            "type": "text"
          },
          "objectSchemaName": {
            "type": "text"
          }
        }
      }

I didn't change anything from the default configuration.

Then, I also store the data sent to Elasticsearch in a separate folder (let's call it sep_data for short), and after each batch I check the size of the sep_data folder and of the Elasticsearch data folder using du -shk on CentOS 7.
While the sep_data folder grows by the same amount every time I add another 100K transactions, the growth of the Elasticsearch data folder varies greatly.
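As a cross-check of the du numbers, Elasticsearch can also report its own view of the index size; a minimal sketch (the index name transactions and the address localhost:9200 are just placeholders for my setup):

    # Human-readable doc count and store size per index
    curl -s 'localhost:9200/_cat/indices/transactions?v&h=index,docs.count,pri.store.size,store.size'

    # The same information in JSON form
    curl -s 'localhost:9200/transactions/_stats/store?pretty'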
Here are the test results:

| Test # | Parsed transactions | sep_data start [MB] | sep_data end [MB] | sep_data growth [MB] | ES data start [MB] | ES data end [MB] | ES data growth [MB] | ES growth / sep_data growth |
|---|---|---|---|---|---|---|---|---|
| 1 | 100,000 | 0 | 389 | 389 | 0 | 718 | 718 | 185% |
| 2 | 100,000 | 389 | 778 | 389 | 718 | 936 | 218 | 56% |
| 3 | 100,000 | 778 | 1167 | 389 | 936 | 1719 | 783 | 201% |
| 4 | 100,000 | 1167 | 1556 | 389 | 1719 | 2020 | 301 | 77% |
| 5 | 100,000 | 1556 | 1945 | 389 | 2020 | 2840 | 820 | 211% |
| 6 | 100,000 | 1945 | 2334 | 389 | 2840 | 3333 | 493 | 127% |
| 7 | 100,000 | 2334 | 2723 | 389 | 3333 | 3642 | 309 | 79% |
| 8 | 100,000 | 2723 | 3112 | 389 | 3642 | 3747 | 105 | 27% |

I find these results very weird, and I'm having trouble explaining why Elasticsearch behaves this way.

I need a reliable way to predict how much disk storage will be used when I insert X transactions, and I need to be able to explain why Elasticsearch does this.
So, can anyone help me?

There are several factors at play here. First of all, besides storing your data, Elasticsearch also builds indices for that data. The size of the index depends on the type and diversity of the data you are storing. For example, if your test data is something like "Test 1", "Test 2" and so on, the index will be much smaller than with real data, and if your test data consists of completely random strings, the results will be worse than with real data. Real data typically follows something close to Zipf's law, so indexing the same string over and over, using completely random strings, or reusing the same dataset multiple times is likely to produce odd results that have nothing to do with real data. Do you send the same 100,000 transactions over and over, by any chance?
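To get a feel for what actually ends up in the inverted index for a text field, the _analyze API shows how a string is broken into terms; very repetitive data yields only a handful of distinct terms, while random strings add a new term every time. A small sketch (assuming a node on localhost:9200):

    # Shows the terms the standard analyzer produces for a sample string;
    # the inverted index only stores each distinct term once
    curl -s -X POST 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
    {
      "analyzer": "standard",
      "text": "Test 1 Test 2 Test 3"
    }'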

Both indices and data are stored in segments, which are immutable and merged over time, and the data inside segments is compressed. So the folder size depends on the compression ratio of different portions of your data and on the point in the merge process at which you happen to measure it.
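One way to take the merge timing out of the measurement is to flush and force-merge before checking the size; a rough sketch, again with transactions as a placeholder index name:

    # Make sure in-memory data is written to disk
    curl -s -X POST 'localhost:9200/transactions/_flush'

    # Merge down to a single segment so the on-disk size is stable and comparable
    curl -s -X POST 'localhost:9200/transactions/_forcemerge?max_num_segments=1'

    # List the remaining segments with their on-disk sizes
    curl -s 'localhost:9200/_cat/segments/transactions?v'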

Hi, thanks for the answer.
I don't send the same data every time, but the changes are small. If the changes are small, does Elasticsearch recognize this and store the data differently?
Where can I read about the default compression behavior?
Anyway, I need to know what the worst-case scenario is, storage-wise, so I need an estimate of how much disk Elasticsearch will use for X amount of data. In my case X could be anything from 100K to 1TB, so it's very important that I can predict how Elasticsearch will use the storage.

Elasticsearch doesn't recognize that explicitly, but it definitely affects the compression rate and the size of the inverted index. Typically, the smaller the variation, the more compact the inverted index you get and the better the source can be compressed.
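For what it's worth, the codec used to compress the stored _source is configurable when an index is created: the default favors speed (LZ4), while best_compression (DEFLATE) trades some indexing speed for a smaller store. A minimal sketch, with transactions_v2 as a placeholder index name:

    # Create an index that uses the more aggressive compression codec
    curl -s -X PUT 'localhost:9200/transactions_v2' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "index.codec": "best_compression"
      }
    }'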

I linked the blog post in my previous reply.

I think you just need to index more data and plot a graph. I am sure that after a while the fluctuations will stabilize. I would guess that you will probably need to index about 50 GB of data to see that. Elasticsearch stops merging segments once they reach 5 GB, so after you get a few segments of that size, the fluctuations caused by the small segments should become smaller and smaller.
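A simple way to build that graph is to record the reported store size after every bulk batch, for example (transactions is again just a placeholder index name):

    # Make recent writes visible, then append "doc count, store size" to a log file
    curl -s -X POST 'localhost:9200/transactions/_refresh'
    curl -s 'localhost:9200/_cat/indices/transactions?h=docs.count,store.size' >> growth.log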
