I want to know which files are present in the indices directory.

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open fuselog-2018.05.03 Ivek1mr5S9q1KiRDu-2NUg 5 1 30271526 0 111gb 111gb
yellow open fuselog-2018.05.07 UQvSXBKlSjeJM_bngM01wQ 5 1 24105034 0 84.8gb 84.8gb
yellow open fuselog-2018.05.05 pGGh2mwASGmHET_YKSgldA 5 1 28867041 0 103.4gb 103.4gb
yellow open fuselog-2018.05.06 GqrrGFuvT56cWfCdEjcO8g 5 1 30833617 0 112.2gb 112.2gb
yellow open fuselog-2018.05.04 Cq1Cluf8SEK6F2wv3NFNGw 5 1 45942884 0 164.8gb 164.8gb
yellow open .kibana MtbGHB3TSz6KivnuRarZQQ 1 1 32 1 71.8kb 71.8kb

[fuseadmin@a0110pcsgmon04 indices]$ du -sh --time *
165G 2018-05-06 16:41 Cq1Cluf8SEK6F2wv3NFNGw
113G 2018-05-07 08:05 GqrrGFuvT56cWfCdEjcO8g
112G 2018-05-06 16:41 Ivek1mr5S9q1KiRDu-2NUg
172K 2018-05-06 16:41 MtbGHB3TSz6KivnuRarZQQ
104G 2018-05-06 17:19 pGGh2mwASGmHET_YKSgldA
90G 2018-05-07 17:34 UQvSXBKlSjeJM_bngM01wQ
[fuseadmin@a0110pcsgmon04 indices]$


Kindly let me know what files are stored in the indices directory and why they are taking up so much space.

Note: we store verbose logs, covering around 90% of our traffic. Can we reduce the payloads?

The size indexed data takes up on disk depends on how much information you have added through enrichment and how you have mapped your data. If you are using default dynamic mappings, a good amount of space can be saved by optimising the mappings. Read this blog post for an example.
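
As a rough illustration (the field names below are made up, and this is only a sketch of the idea from that blog post), a string field that you only filter or aggregate on can be mapped as a plain keyword instead of the default dual text plus keyword mapping, which indexes every value twice:

"properties": {
  "service_default": {
    "type": "text",
    "fields": {
      "keyword": { "type": "keyword", "ignore_above": 256 }
    }
  },
  "service_keyword_only": {
    "type": "keyword",
    "ignore_above": 256
  }
}

The second form avoids storing an analysed text copy of every value, which adds up quickly when most fields are identifiers rather than free text.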

How can we determine that?

shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
26 608.7gb 685.1gb 791.1gb 1.4tb 46 localhost 127.0.0.1 DiGiNode
26 UNASSIGNED

GET _cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1 59 97 34 0.75 0.83 0.93 mdi * DiGiNode
GET _cat/indices?v

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open fuselog-2018.05.04 Cq1Cluf8SEK6F2wv3NFNGw 5 1 45942884 0 164.8gb 164.8gb
yellow open fuselog-2018.05.07 UQvSXBKlSjeJM_bngM01wQ 5 1 31323147 0 109.5gb 109.5gb
yellow open fuselog-2018.05.06 GqrrGFuvT56cWfCdEjcO8g 5 1 30833617 0 112.2gb 112.2gb
yellow open fuselog-2018.05.03 Ivek1mr5S9q1KiRDu-2NUg 5 1 30271526 0 111gb 111gb
yellow open fuselog-2018.05.05 pGGh2mwASGmHET_YKSgldA 5 1 28867041 0 103.4gb 103.4gb
yellow open .kibana MtbGHB3TSz6KivnuRarZQQ 1 1 32 1 71.8kb 71.8kb

Is it possible to arrange a remote session?

No, I do not do remote sessions. You can get the mappings used for a specific index using the get mapping API. This should show you what mappings you are using. This documentation page is also a great resource.
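
For example, for one of the indices listed above the request would look like this (any index name works):

GET fuselog-2018.05.04/_mapping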

Okay, no issue. Can I paste the output?

GET /_all/_mapping

GET /_mapping

Please let me know if there is a way I can attach the file or send it to you by email.

Unfortunately, I do not have time to go through your mappings for you. If you read the blog post and documentation I linked to and learn about mappings, you should be able to optimise your mappings yourself and create an index template for newly created indices.

Okay. So is there any way we can reduce the payload?

Any logs from your side that I can examine would be really helpful.

Are you looking to ingest the same data but have it take up less space on disk, or are you looking to filter out data before indexing it? I am not sure I understand exactly what you are looking for.

We are not ingesting the same data; the payload is what is consuming the space.

Is it possible to filter the data out before indexing? I believe our payload is so large that it is consuming a lot of space.

{
  "fuselog-2018.05.04": {
    "aliases": {},
    "mappings": {
      "log": {
        "properties": {
          "@timestamp": { "type": "date" },
          "@version": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "GUID": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "ReferenceID": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "TargetService": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "beat": {
            "properties": {
              "hostname": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
              "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
              "version": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }
            }
          },
          "bundle": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "elapsed_time": { "type": "float" },
          "elapsed_timestamp_start": { "type": "date" },
          "errorCode": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "first_word": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "host": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "input_type": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "level": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "logPoint": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "logTimestamp": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "logdetails": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "managedServer": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "message": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "offset": { "type": "long" },
          "serviceName": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "serviceNameOld": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "source": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "sourceSystemID": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "tags": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "thread": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
          "type": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1525392001724",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "Cq1Cluf8SEK6F2wv3NFNGw",
        "version": { "created": "5050099" },
        "provided_name": "fuselog-2018.05.04"
      }
    }
  }
}

You are using default mappings, so optimising the mappings as per the resources I linked to should help. If you want to drop events before indexing, you can do so either in Filebeat or Logstash (a sketch follows). What does your ingest architecture look like?
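
As a sketch only (the condition here is an assumption; replace it with whatever identifies the events you do not want to keep), dropping events in the Logstash filter section looks like this:

filter {
  # Hypothetical condition: discard DEBUG-level events before they are sent to Elasticsearch
  if [level] == "DEBUG" {
    drop {}
  }
}

Filebeat also has a drop_event processor that can do the same filtering before the data is even shipped.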

[fuseadmin@a0110pcsgmon02 logstash-5.5.0]$ cat logstash.conf
input {
beats {
type => beats
port => 5044
}
}

filter {
if [type] == "log" {
#Grok to get SourcesystemID
grok {
match => {
"message" => "(?<=SourceSystemID:)%{WORD:sourceSystemID}"
}
}

    if ![sourceSystemID] {
     grok {
           match => {
                   "message" => "(?<=ChannelID:)%{DATA:sourceSystemID}(?>\|)"
           }
       }
    if ![sourceSystemID] {
            drop {}
    }
    }

    #Grok to get Container Name
     grok {
            match => {
                    "message" => "(?<=ContainerName:)%{GREEDYDATA:containerName}"
            }
    }

#Grok to get InvocationPoint
grok {
match => {
"message" => "(?<=LogPoint:)%{WORD:logPoint}"
}
}
if ![logPoint] {
grok {
match => {
"message" => "(?<=InvocationPoint:)%{DATA:logPoint}(?>|)"
}
}
}
#Grok to get LogTimestamp
grok {
match => {
"message" => "(?<=LogTimestamp:)%{TIMESTAMP_ISO8601:logTimestamp}"
}
}
if ![logTimestamp] {
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:logTimestamp}%{SPACE}|%{SPACE}%{LOGLEVEL:level}%{SPACE}|%{SPACE}%{DATA:thread}%{SPACE}|%{SPACE}%{DATA:serviceNameOld}%{SPACE}|%{SPACE}%{DATA:bundle}%{SPACE}|%{SPACE}%{GREEDYDATA:logdetails}"
}
}
#hardcoded to get if the log is first or last entry
grok {
match => {"logdetails" => "%{WORD:first_word}"}
}
}

    #Grok to get GUID

     grok {
            match => {
                    "message" => "(?<=GUID:)%{DATA:GUID}(?>\|)"
            }
    }

    #Grok to get ServiceName

     grok {
            match => {
                    "message" => "(?<=ServiceName:)%{DATA:serviceName}(?>\|)"
            }
    }


    #Grok to get ServerName

     grok {
            match => {
                    "message" => "(?<=ManagedServer:)%{IP:managedServer}"
            }
    }

    #Grok to get ErrorCode

     grok {
            match => {
                    "message" => "(?<=ErrorCode:)%{DATA:errorCode}(?>\|)"
            }
    }

    date {
            match => ["logTimestamp" , "ISO8601"]
    }


    #Grok to get ReferenceID added on 16th Apr 2018 by Rudrajit

     grok {
              match => {
                      "message" => "(?<=ReferenceID:)%{DATA:ReferenceID}(?>\|)"
              }
      }

    #Grok to get TargetService added on 16th Apr 2018 by Rudrajit

     grok {
               match => {
                       "message" => "(?<=TargetService:)%{DATA:TargetService}(?>\|)"
               }
       }







    #tag the log entry with first or last, drop other entry
    if [logTimestamp] != "" {

if [errorCode] != "" {

mutate {

add_tag => ["error_log"]

}

} else {

     if [first_word] == "Incoming_Request" {
            mutate {
                    add_tag => ["start_log"]
            }
     } else if [first_word] == "Outbound" {
            mutate {
                    add_tag => ["end_log"]
            }
    } else if [logPoint] == "InboundReq" {
            mutate {
                    add_tag => ["start_log"]
            }
    } else if [logPoint] == "InboundResp" {
            mutate {
                    add_tag => ["end_log"]
            }
    } else {
    #        drop {}
    }

    } else {
   #         drop {}
    }

    #start logstash processing to get response time
    elapsed {
            start_tag => "start_log"
            end_tag => "end_log"
            unique_id_field => "GUID"
            new_event_on_match => false
    }

}
}

    output {

elasticsearch {
hosts =>["10.89.13.28:9200"]
manage_template => true
index => "fuselog-%{+YYYY.MM.dd}"
#index => filebeat
#document_type => "%{[@metadata][type]}"
}
}

Please format your configuration using the tools available in the UI so it is easier to read.

[fuseadmin@a0110pcsgmon02 logstash-5.5.0]$ cat logstash.conf
input {
  beats {
     type => beats
     port => 5044
  }
}

filter {
  if [type] == "log" {
         #Grok to get SourcesystemID
         grok {
                match => {
                        "message" => "(?<=SourceSystemID:)%{WORD:sourceSystemID}"
                }
        }

        if ![sourceSystemID] {
         grok {
               match => {
                       "message" => "(?<=ChannelID:)%{DATA:sourceSystemID}(?>\|)"
               }
           }
        if ![sourceSystemID] {
                drop {}
        }
        }

        #Grok to get Container Name
         grok {
                match => {
                        "message" => "(?<=ContainerName:)%{GREEDYDATA:containerName}"
                }
        }

 #Grok to get InvocationPoint
        grok {
                match => {
                        "message" => "(?<=LogPoint:)%{WORD:logPoint}"
                }
        }
        if ![logPoint] {
        grok {
                match => {
                        "message" => "(?<=InvocationPoint:)%{DATA:logPoint}(?>\|)"
                }
        }
        }
        #Grok to get LogTimestamp
        grok {
                match => {
                        "message" => "(?<=LogTimestamp:)%{TIMESTAMP_ISO8601:logTimestamp}"
                }
        }
        if ![logTimestamp] {
        grok {
                match => {
                       "message" => "%{TIMESTAMP_ISO8601:logTimestamp}%{SPACE}\|%{SPACE}%{LOGLEVEL:level}%{SPACE}\|%{SPACE}%{DATA:thread}%{SPACE}\|%{SPACE}%{DATA:serviceNameOld}%{SPACE}\|%{SPACE}%{DATA:bundle}%{SPACE}\|%{SPACE}%{GREEDYDATA:logdetails}"
                }
        }
          #hardcoded to get if the log is first or last entry
        grok {
                match => {"logdetails" => "%{WORD:first_word}"}
        }
        }

        #Grok to get GUID

         grok {
                match => {
                        "message" => "(?<=GUID:)%{DATA:GUID}(?>\|)"
                }
        }

        #Grok to get ServiceName

         grok {
                match => {
                        "message" => "(?<=ServiceName:)%{DATA:serviceName}(?>\|)"
                }
        }


        #Grok to get ServerName

         grok {
                match => {
                        "message" => "(?<=ManagedServer:)%{IP:managedServer}"
                }
        }

        #Grok to get ErrorCode

         grok {
                match => {
                        "message" => "(?<=ErrorCode:)%{DATA:errorCode}(?>\|)"
                }
        }

        date {
                match => ["logTimestamp" , "ISO8601"]
        }


        #Grok to get ReferenceID added on 16th Apr 2018 by Rudrajit

         grok {
                  match => {
                          "message" => "(?<=ReferenceID:)%{DATA:ReferenceID}(?>\|)"
                  }
          }

        #Grok to get TargetService added on 16th Apr 2018 by Rudrajit

         grok {
                   match => {
                           "message" => "(?<=TargetService:)%{DATA:TargetService}(?>\|)"
                   }
           }







        #tag the log entry with first or last, drop other entry
        if [logTimestamp] != "" {
#        if [errorCode] != "" {
#                mutate {
#                        add_tag => ["error_log"]
#                }
#        } else {

         if [first_word] == "Incoming_Request" {
                mutate {
                        add_tag => ["start_log"]
                }
         } else if [first_word] == "Outbound" {
                mutate {
                        add_tag => ["end_log"]
                }
        } else if [logPoint] == "InboundReq" {
                mutate {
                        add_tag => ["start_log"]
                }
        } else if [logPoint] == "InboundResp" {
                mutate {
                        add_tag => ["end_log"]
                }
        } else {
        #        drop {}
        }

        } else {
       #         drop {}
        }

        #start logstash processing to get response time
        elapsed {
                start_tag => "start_log"
                end_tag => "end_log"
                unique_id_field => "GUID"
                new_event_on_match => false
        }
}
}

        output {
elasticsearch {
    hosts =>["10.89.13.28:9200"]
    manage_template => true
    index => "fuselog-%{+YYYY.MM.dd}"
    #index => filebeat
    #document_type => "%{[@metadata][type]}"
  }
}

As mentioned earlier, I would recommend you do the following:

  • Go through the fields in your mappings. For each field that has the default dual text/keyword mapping, determine whether you need free-text search on parts of the field or only need to aggregate on it. If you only need to aggregate on it, change the mapping to just keyword, as shown in the docs I linked to. Store these mappings in an index template that applies to the fuselog-* index pattern (a sketch follows this list).
  • Enable best_compression in your new index template.
  • Upload this template to make it take effect on any new indices.
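
A minimal sketch of such a template, assuming Elasticsearch 5.x template syntax and using a few fields from your mapping as examples (which fields should become keyword-only depends on how you actually search them):

PUT _template/fuselog
{
  "template": "fuselog-*",
  "settings": {
    "index.codec": "best_compression"
  },
  "mappings": {
    "log": {
      "properties": {
        "sourceSystemID": { "type": "keyword", "ignore_above": 256 },
        "serviceName": { "type": "keyword", "ignore_above": 256 },
        "logPoint": { "type": "keyword", "ignore_above": 256 },
        "message": { "type": "text" }
      }
    }
  }
}

The template only affects indices created after it has been uploaded, so the next day's fuselog index will pick it up automatically.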

If you want to reduce the space taken up by cold indices, you may need to reindex based on the new index template. You may also want to run a force merge on indices no longer being written to with max_num_segments set to 1.
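
For example (the destination index name is just an illustration), a reindex into an index covered by the new template, followed by a force merge, could look like:

POST _reindex
{
  "source": { "index": "fuselog-2018.05.03" },
  "dest": { "index": "fuselog-2018.05.03-reindexed" }
}

POST fuselog-2018.05.03-reindexed/_forcemerge?max_num_segments=1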
