Weird inconsistencies regarding storage size in Elasticsearch 6

Hi everyone!
I'm seeing weird behavior when measuring Elasticsearch 6's disk usage.
Here is how I set up my test:

This is the mapping:

"transaction": {
        "properties": {
          "changedFieldsList": {
            "type": "nested",
            "properties": {
              "fieldChanged": {
                "type": "text"
              },
              "fieldId": {
                "type": "text"
              },
              "fieldType": {
                "type": "text"
              },
              "fieldValue": {
                "type": "text"
              }
            }
          },
          "commandCommitScn": {
            "type": "text"
          },
          "commandId": {
            "type": "text"
          },
          "commandScn": {
            "type": "text"
          },
          "commandSequence": {
            "type": "text"
          },
          "commandTimestamp": {
            "type": "text"
          },
          "commandType": {
            "type": "text"
          },
          "conditionFieldsList": {
            "type": "nested",
            "properties": {
              "fieldId": {
                "type": "text"
              },
              "fieldType": {
                "type": "text"
              },
              "fieldValue": {
                "type": "text"
              }
            }
          },
          "objectDBName": {
            "type": "text"
          },
          "objectId": {
            "type": "text"
          },
          "objectSchemaName": {
            "type": "text"
          }
        }
      }

I didn't change anything from the default configuration.

Then, I also store the data sent to Elasticsearch in a separate folder (let's call it sep_data for short), and after each batch I check the size of the sep_data folder and of the Elasticsearch data folder using du -shk on CentOS 7.
While the sep_data folder grows by the same amount every time I add another 100K transactions, the growth of the Elasticsearch data folder varies greatly.
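As a cross-check of the du numbers, Elasticsearch can also report its own view of the index size; a minimal sketch (the index name transactions and the address localhost:9200 are just placeholders for my setup):

    # Human-readable doc count and store size per index
    curl -s 'localhost:9200/_cat/indices/transactions?v&h=index,docs.count,pri.store.size,store.size'

    # The same information in JSON form
    curl -s 'localhost:9200/transactions/_stats/store?pretty'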
Here are the test results:

| Test # | Parsed transactions | sep_data start [MB] | sep_data end [MB] | sep_data growth [MB] | ES data start [MB] | ES data end [MB] | ES data growth [MB] | ES growth / sep_data growth |
|---|---|---|---|---|---|---|---|---|
| 1 | 100,000 | 0 | 389 | 389 | 0 | 718 | 718 | 185% |
| 2 | 100,000 | 389 | 778 | 389 | 718 | 936 | 218 | 56% |
| 3 | 100,000 | 778 | 1167 | 389 | 936 | 1719 | 783 | 201% |
| 4 | 100,000 | 1167 | 1556 | 389 | 1719 | 2020 | 301 | 77% |
| 5 | 100,000 | 1556 | 1945 | 389 | 2020 | 2840 | 820 | 211% |
| 6 | 100,000 | 1945 | 2334 | 389 | 2840 | 3333 | 493 | 127% |
| 7 | 100,000 | 2334 | 2723 | 389 | 3333 | 3642 | 309 | 79% |
| 8 | 100,000 | 2723 | 3112 | 389 | 3642 | 3747 | 105 | 27% |

I find these results very weird, and I'm having trouble explaining why Elasticsearch behaves this way.

I need a reliable way to predict how much disk storage will be used when I insert X transactions, and I need to be able to explain why Elasticsearch does this.
So, can anyone help me?

There are several factors at play here. First of all, besides storing your data, Elasticsearch also builds indices for that data. The size of the index depends on the type and diversity of the data you are storing. For example, if your test data is something like "Test 1", "Test 2" and so on, the index will be much smaller than with real data, and if your test data consists of completely random strings, the results will be worse than with real data. Real data typically follows something close to Zipf's law, so indexing the same string over and over, using completely random strings, or reusing the same dataset multiple times is likely to produce odd results that have nothing to do with real data. Do you send the same 100,000 transactions over and over, by any chance?
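To get a feel for what actually ends up in the inverted index for a text field, the _analyze API shows how a string is broken into terms; very repetitive data yields only a handful of distinct terms, while random strings add a new term every time. A small sketch (assuming a node on localhost:9200):

    # Shows the terms the standard analyzer produces for a sample string;
    # the inverted index only stores each distinct term once
    curl -s -X POST 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
    {
      "analyzer": "standard",
      "text": "Test 1 Test 2 Test 3"
    }'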

Both indices and data are stored in segments, which are immutable and merged over time, and the data inside segments is compressed. So the folder size depends on the compression ratio of different portions of your data and on the point in the merge process at which you happen to measure it.
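One way to take the merge timing out of the measurement is to flush and force-merge before checking the size; a rough sketch, again with transactions as a placeholder index name:

    # Make sure in-memory data is written to disk
    curl -s -X POST 'localhost:9200/transactions/_flush'

    # Merge down to a single segment so the on-disk size is stable and comparable
    curl -s -X POST 'localhost:9200/transactions/_forcemerge?max_num_segments=1'

    # List the remaining segments with their on-disk sizes
    curl -s 'localhost:9200/_cat/segments/transactions?v'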

Hi, thanks for the answer.
I don't send the same data every time, but the changes are small. If the changes are small, does Elasticsearch recognize this and store the data differently?
Where can I read about the default compression behavior?
Anyway, I need to know what the worst-case scenario is, storage-wise, so I need an estimate of how much disk Elasticsearch will use for X amount of data. In my case X could be anything from 100K to 1TB, so it's very important that I can predict how Elasticsearch will use the storage.

Elasticsearch doesn't recognize that explicitly, but it definitely affects the compression rate and the size of the inverted index. Typically, the smaller the variation, the more compact the inverted index you get and the better the source can be compressed.
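For what it's worth, the codec used to compress the stored _source is configurable when an index is created: the default favors speed (LZ4), while best_compression (DEFLATE) trades some indexing speed for a smaller store. A minimal sketch, with transactions_v2 as a placeholder index name:

    # Create an index that uses the more aggressive compression codec
    curl -s -X PUT 'localhost:9200/transactions_v2' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "index.codec": "best_compression"
      }
    }'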

I linked the blog post in my previous reply.

I think you just need to index more data and plot a graph. I am sure that after a while the fluctuations will stabilize. I would guess that you will probably need to index about 50 GB of data to see that. Elasticsearch stops merging segments once they reach 5 GB, so after you get a few segments of that size, the fluctuations caused by the small segments should become smaller and smaller.
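A simple way to build that graph is to record the reported store size after every bulk batch, for example (transactions is again just a placeholder index name):

    # Make recent writes visible, then append "doc count, store size" to a log file
    curl -s -X POST 'localhost:9200/transactions/_refresh'
    curl -s 'localhost:9200/_cat/indices/transactions?h=docs.count,store.size' >> growth.log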
