Parsing SHA-1 values in ELK

Hi,

I have a 27 GB text file that contains a SHA-1 value and a count on each line. The file has almost 55 crore (550 million) lines. I have one Elasticsearch node with one shard and zero replicas. After indexing, the total size of the index is about 130 GB. How is this possible? Am I missing something here, or is the indexed size correct?

Please clarify

Thanks,
Vinothine

It depends on your mapping and several other settings.
Did you run a forcemerge call after indexing all the data BTW?
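If not, it is worth trying once the bulk load is finished, since merging down to a single segment often reclaims space. A rough sketch, with a placeholder index name:

```
POST /your-index-name/_forcemerge?max_num_segments=1
```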

Thanks for the reply, David.
Actually, I have not done much with mappings. I just created a template for the index with 1 shard and 0 replicas, because Elasticsearch used 5 shards and 1 replica by default:

```
PUT _template/default
{
  "index_patterns": ["*"],
  "order": -1,
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "0"
  }
}
```
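For reference, the stored template can be checked afterwards with something like this (using the name it was created under above):

```
GET _template/default
```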

And I didn't run a force merge after indexing.

What is the mapping then?

```
{
  "mapping": {
    "_default_": {
      "dynamic_templates": [
        {
          "message_field": {
            "path_match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "norms": false,
              "type": "text"
            }
          }
        },
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "norms": false,
              "type": "text"
            }
          }
        }
      ],
      "properties": {
        "@timestamp": { "type": "date" },
        "@version": { "type": "keyword" },
        "geoip": {
          "dynamic": "true",
          "properties": {
            "ip": { "type": "ip" },
            "latitude": { "type": "half_float" },
            "location": { "type": "geo_point" },
            "longitude": { "type": "half_float" }
          }
        }
      }
    },
    "doc": {
      "dynamic_templates": [
        {
          "message_field": {
            "path_match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "norms": false,
              "type": "text"
            }
          }
        },
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "norms": false,
              "type": "text"
            }
          }
        }
      ],
      "properties": {
        "@timestamp": { "type": "date" },
        "@version": { "type": "keyword" },
        "count": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "geoip": {
          "dynamic": "true",
          "properties": {
            "ip": { "type": "ip" },
            "latitude": { "type": "half_float" },
            "location": { "type": "geo_point" },
            "longitude": { "type": "half_float" }
          }
        },
        "host": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "message": {
          "type": "text",
          "norms": false
        },
        "path": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "shavalue": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
```

My primary shard count is actually 1, but the routing shard default in Elasticsearch is 5, so is that why this is happening?
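For reference, the settings that actually apply to the index can be inspected with something like this (index name as reported by Elasticsearch for this data):

```
GET logstash-sha_logs_1/_settings?include_defaults=true
```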

The 27 GB size is before any transformation like geoip, grok... right?
What is the total size after all transformations, just before it's sent to Elasticsearch?

Otherwise it's hard to tell.

What is a typical line in your source file, and what does a typical document that has been indexed in Elasticsearch look like?

Please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.

Yes, the 27 GB is the raw file, which contains a SHA-1 value and a count per line, e.g. [D657187D9C9C1AD04FDA5132338D495FDB112FD1:1]. After indexing I have 12 fields like id, source, doc, etc.

Can you share an example?

But anyway, you can't compare apples and oranges. The data you are sending to Elasticsearch is not the same as the raw data.
So we can't really tell if the ratio is good or not.

What size does the cat indices API report?
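For example, something along these lines (index pattern assumed):

```
GET _cat/indices/logstash-*?v&h=index,pri,rep,docs.count,store.size
```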


Raw data: C86BCF24575AA17349527262420B5D7418EA7888:1

Indexed data (Elasticsearch result):

```
{
  "_index": "logstash-sha_logs_1",
  "_type": "doc",
  "_id": "H5bdfGgBcKiZCqWYxTuG",
  "_version": 1,
  "_score": null,
  "_source": {
    "shavalue": "C86BCF24575AA17349527262420B5D7418EA7888",
    "@version": "1",
    "count": "1",
    "@timestamp": "2019-01-23T22:39:04.378Z",
    "message": "C86BCF24575AA17349527262420B5D7418EA7888:1",
    "host": "0.0.0.0",
    "path": "/root/example.txt"
  },
  "fields": {
    "@timestamp": [
      "2019-01-23T22:39:04.378Z"
    ]
  },
  "sort": [
    1548283144378
  ]
}
```

Grok parser:

```
filter {
  grok {
    match => [ "message", "%{BASE16NUM:shavalue}:%{NUMBER:count}" ]
  }
}
```

It's about 136 GB.

Please tell me whether I should use the grok parser or the fingerprint filter for parsing SHA-1 values.

It looks like you have already used grok to parse out the SHA1 hash from the message so I am not sure I understand your question. The fingerprint plugin calculates hashes, so I do not see how it is relevant here as you already have a hash.

Yeah... I just want to parse the hash field, so I used the grok parser, but I wonder why so much space is occupied by the index: it is almost 5 times the raw size (27 GB × 5). This problem happens only when I parse SHA values; for other parsers (BIND, Apache logs) the sizes come out correctly.

Is there any special function for handling SHA values, or can you suggest how to parse a SHA value using Logstash?

I have provided my parser above. Please check it and let me know.

Thanks in advance

From Logstash and Elasticsearch perspective it is just a string. When you use dynamic mappings in Elasticsearch, it will map it as both text and keyword, which adds flexibility but also requires more storage space. If you provide explicit mappings as described here you can save a lot of space. If you are using known types and modules, it is likely mappings have already been optimised, which is why it takes up less space.
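As an illustration only, not a definitive recipe, an explicit template for this data could look roughly like the following; the index pattern and field names are taken from the example document above, everything else is an assumption:

```
PUT _template/sha_logs
{
  "index_patterns": ["logstash-sha_logs_*"],
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "@timestamp": { "type": "date" },
        "@version":   { "type": "keyword" },
        "shavalue":   { "type": "keyword" },
        "count":      { "type": "integer" },
        "host":       { "type": "keyword" },
        "path":       { "type": "keyword" }
      }
    }
  }
}
```

This maps the hash once as a single keyword field instead of as text plus a keyword sub-field, which is where much of the duplication comes from; the order is set higher than the catch-all template shown earlier so it takes precedence.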

Another thing that drives disk usage is cardinality. As your SHA1 hash field is likely to be unique and is highly random in nature, it will take up a lot more space than fields containing low cardinality values that compress better.
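On the Logstash side, a small tweak to the filter shown earlier (a sketch, not tested against this setup) avoids indexing the raw line a second time in the message field:

```
filter {
  grok {
    match => { "message" => "%{BASE16NUM:shavalue}:%{NUMBER:count:int}" }
    # Drop the original line once it has been parsed so it is not stored again.
    remove_field => [ "message" ]
  }
}
```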

Thanks... Will check it and let you know.

Can you tell me approximately how much space it should take for a 27 GB file of SHA values?

I suspect you will save quite a bit by optimising mappings, as each hash is currently stored multiple times, but I do not know if there are other things that could be affecting this as well, as it seems like a lot of overhead.