Problem: index disk usage (too large)

Beuhlet_Reseau · May 4, 2017, 9:00am

Hello,

I have a problem with my Elasticsearch when I upload data to ES.

Input log file:

26 GB
174 871 772 lines in log file

Output (index in ES):

48 GB (increase of 12 GB)
same number of lines (that's OK)

Here is an example of log lines:

90|30768|3080368563|161E|1E4|059|04/2017 07:49|A0|LEGACYNT|WEBMAI|1010|3802|01|0000141|06|3538814700|00|WEBMAIL|WEBMAI|46524631
90|27638|2075468563|1E|2E|859|04/2017 17:49|A0|LEGACYNT|WEBMAI|1010|3802|01|0000141|06|3538174700|00|WEBMAIL|WEBMAI|46524631
90|30238|3035468563|1AE|2A14|085|04/2017 15:49|A0|LEGACYNT|WEBMAI|1010|3802|01|0000141|06|3564414700|00|WEBMAIL|WEBMAI|4654631

As you can see, some fields are the same.

Also, I use the csv filter to split the lines, and a template to index my data. It looks like this:

"templatelog1": {
    "order": 0,
    "version": 50001,
    "template": "log1-*",
    "settings": {
      "index": {
        "refresh_interval": "300s",
        "number_of_shards": "2",
        "translog": {
          "sync_interval": "5s",
          "durability": "async"
        },
        "number_of_replicas": "0"
      }
    },
    "mappings": {
      "_default_": {
        "dynamic_templates": [
          {
            "string_fields": {
              "mapping": {
                "norms": false,
                "type": "text",
                "fields": {
                  "keyword": {
                    "index": "not_analyzed",
                    "type": "keyword"
                  }
                }
              },
              "match_mapping_type": "string",
              "match": "*"
            }
          }
        ],

... After that I define the mappings of the numeric fields (byte, short ...)

What can you tell me about this problem?

warkolm · May 4, 2017, 9:10am

You are storing every string field twice, as a text and as a keyword.
Have you run a force merge on the index?
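For reference, a force merge is requested through the `_forcemerge` endpoint. The snippet below is only a sketch: it assumes Elasticsearch is reachable at `http://localhost:9200` and uses a hypothetical index name, `log1-2017.05.04` (neither is given in this thread). It builds the request without sending it, so nothing happens until `urlopen` is called.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_forcemerge_request(base_url, index, max_num_segments=1):
    """Build (but do not send) a POST <index>/_forcemerge request."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge?{query}"
    return Request(url, method="POST")


# Assumed host and hypothetical index name, for illustration only.
req = build_forcemerge_request("http://localhost:9200", "log1-2017.05.04")
print(req.get_method(), req.full_url)
# To actually run it: urllib.request.urlopen(req)
```

A force merge is best run on an index that is no longer being written to.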

You don't need the not_analyzed; if you set it as a keyword it does that automatically.

[quote="Beuhlet_Reseau, post:1, topic:84513"]
"translog": {
"sync_interval": "5s",
"durability": "async"
}
[/quote]

Be very, very, very careful changing this. If you are happy with potential data loss then you don't need to change anything.

Christian_Dahlqvist · May 4, 2017, 9:29am

This blog post may be a bit old, but is still quite useful. Enabling best_compression is an easy way to save some space, but does result in slightly higher CPU usage. Not having a lot of very small shards is also likely to help improve compression ratios.
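A minimal sketch of where that setting would go, assuming the same `log1-*` template from the first post: `index.codec` is set to `best_compression` in the template's index settings. The helper name `with_best_compression` is made up for illustration. Since the codec is a static index setting, it would take effect for indices created after the template is updated.

```python
import json


def with_best_compression(index_settings):
    """Return a copy of the index settings with the codec set to best_compression."""
    updated = dict(index_settings)
    updated["codec"] = "best_compression"
    return updated


# Settings copied from the template in the first post (translog block omitted).
settings = {
    "refresh_interval": "300s",
    "number_of_shards": "2",
    "number_of_replicas": "0",
}
template_settings = {"settings": {"index": with_best_compression(settings)}}
print(json.dumps(template_settings, indent=2))
```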

Beuhlet_Reseau · May 9, 2017, 12:58pm

Is it possible to disable it, @warkolm? We only use the keyword fields in Visualize.

Yes, but this way there is no commit on each write; it happens every 5 seconds. Indexing is faster.

About your link, @Christian_Dahlqvist:

Use case no. 3: strings not_analyzed and _all disabled.
=> How do I make all strings not analyzed?

To disable _all, it's just:

"_all": {
  "norms": false,
  "enabled": false
}

Thank you !

Christian_Dahlqvist · May 9, 2017, 1:36pm

In Elasticsearch 5.x, not_analyzed corresponds to the keyword mapping for string fields.
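In other words, for the template in the first post, something like the following could replace the `string_fields` dynamic template so that each string is indexed once, as a keyword only. This is a sketch that assumes no full-text search is needed on these fields; the Python wrapper is only there to print valid JSON.

```python
import json

# Dynamic template that maps every detected string to a single keyword field:
# no analyzed "text" copy, and no "index": "not_analyzed" (keyword implies it).
keyword_only_template = {
    "string_fields": {
        "match": "*",
        "match_mapping_type": "string",
        "mapping": {"type": "keyword"},
    }
}

mappings = {"_default_": {"dynamic_templates": [keyword_only_template]}}
print(json.dumps(mappings, indent=2))
```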

Beuhlet_Reseau · May 10, 2017, 9:23am

@Christian_Dahlqvist Do you mean that not analyzed is the default? I don't understand.

=> Is storing each field twice (string + keyword) mandatory?
=> Is a string always not analyzed if I have a keyword?
=> If I don't want to analyze a few keywords, is that possible?

I have disabled _all and disk usage has decreased by about 30%.
=> I hope the _all field is useless (I do not understand the explanation in the documentation very well...)

Christian_Dahlqvist · May 10, 2017, 1:34pm

This is controlled through mappings and can be defined based on how you need to be able to query the data.

Yes, the string field mappings were updated in Elasticsearch 5.x and the old not_analyzed setting was replaced by the keyword mapping.

I am not sure I understand what you mean. Through the mappings you can control how you want to index and analyze the fields and you can even configure fields to not be indexed at all.

The _all field is not useless. If you use Kibana and add filtering through the search bar without specifying a specific field, it is the _all field that is used behind the scenes. For some use cases, disabling it is a perfectly acceptable trade-off, but it is not the right thing for everyone, so check if/how it affects the user experience.

Beuhlet_Reseau · May 10, 2017, 2:44pm

OK, thank you @Christian_Dahlqvist. With the link I have saved about 46% of disk space.

Do you know if it's possible to:

=> Remove indices that are 3 days old, for example
=> Or move indices into a zip archive to reuse them later in another ELK Stack.

The solution is very demanding in disk space, so I want to clear indices automatically once they are 3 days old. (Just a test)
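One way to sketch the first point, assuming daily indices named `log1-YYYY.MM.DD` (the date suffix is an assumption; the thread only shows the `log1-*` pattern): select the indices whose date is at least 3 days old, then delete each one with `DELETE /<index>`. The function name and the sample index names are hypothetical. Elasticsearch Curator can automate this kind of cleanup, and the snapshot and restore API is the usual route for archiving an index and loading it into another cluster.

```python
from datetime import date, datetime, timedelta


def indices_older_than(index_names, days, today, prefix="log1-"):
    """Return index names whose date suffix is at least `days` days before `today`."""
    cutoff = today - timedelta(days=days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # skip names that do not end in a date
        if day <= cutoff:
            expired.append(name)
    return expired


# Hypothetical index names, for illustration only.
names = ["log1-2017.05.06", "log1-2017.05.07", "log1-2017.05.09", "log1-test"]
expired = indices_older_than(names, days=3, today=date(2017, 5, 10))
print(expired)
```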
