Parsing SHA-1 values in ELK

Hi,

I have a 27 GB text file that contains a SHA-1 value and a count on each line. The file has almost 55 crore (550 million) lines. I have one Elasticsearch node with one shard and zero replicas. After indexing, the total size of the index is about 130 GB. How is this possible? Am I missing something here, or is the indexed size correct?

Please clarify

Thanks,
Vinothine

It depends on your mapping and several other settings.
Did you run a forcemerge call after indexing all the data BTW?
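If not, it is worth trying once the bulk load is finished, since merging down to a single segment often reclaims space. A rough sketch, with a placeholder index name:

```
POST /your-index-name/_forcemerge?max_num_segments=1
```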

Thanks for the reply, David.
Actually, I have not done much with mappings. I just created a template for the index with 1 shard and 0 replicas, because Elasticsearch used 5 shards and 1 replica by default:

```
PUT _template/default
{
  "index_patterns": ["*"],
  "order": -1,
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "0"
  }
}
```
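For reference, the stored template can be checked afterwards with something like this (using the name it was created under above):

```
GET _template/default
```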

And I didn't run a force merge after indexing.

What is the mapping then?

```
{
  "mapping": {
    "_default_": {
      "dynamic_templates": [
        {
          "message_field": {
            "path_match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "norms": false,
              "type": "text"
            }
          }
        },
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "norms": false,
              "type": "text"
            }
          }
        }
      ],
      "properties": {
        "@timestamp": { "type": "date" },
        "@version": { "type": "keyword" },
        "geoip": {
          "dynamic": "true",
          "properties": {
            "ip": { "type": "ip" },
            "latitude": { "type": "half_float" },
            "location": { "type": "geo_point" },
            "longitude": { "type": "half_float" }
          }
        }
      }
    },
    "doc": {
      "dynamic_templates": [
        {
          "message_field": {
            "path_match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "norms": false,
              "type": "text"
            }
          }
        },
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "norms": false,
              "type": "text"
            }
          }
        }
      ],
      "properties": {
        "@timestamp": { "type": "date" },
        "@version": { "type": "keyword" },
        "count": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "geoip": {
          "dynamic": "true",
          "properties": {
            "ip": { "type": "ip" },
            "latitude": { "type": "half_float" },
            "location": { "type": "geo_point" },
            "longitude": { "type": "half_float" }
          }
        },
        "host": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "message": {
          "type": "text",
          "norms": false
        },
        "path": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "shavalue": {
          "type": "text",
          "norms": false,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
```

My primary shard count is actually 1, but the routing shard default in Elasticsearch is 5, so is that why this is happening?
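For reference, the settings that actually apply to the index can be inspected with something like this (index name as reported by Elasticsearch for this data):

```
GET logstash-sha_logs_1/_settings?include_defaults=true
```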

The 27 GB size is before any transformation like geoip, grok... right?
What is the total size after all transformations, just before it's sent to Elasticsearch?

Otherwise it's hard to tell.

What is a typical line in your source file, and what does a typical document that has been indexed in Elasticsearch look like?

Please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.

Yes, the 27 GB is the raw file, which contains a SHA-1 value and a count per line, e.g. [D657187D9C9C1AD04FDA5132338D495FDB112FD1:1]. After indexing I have 12 fields like id, source, doc, etc.

Can you share an example?

But anyway, you can't compare apples and oranges. The data you are sending to Elasticsearch is not the same as the raw data.
So we can't really tell if the ratio is good or not.

What size does the cat indices API report?
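For example, something along these lines (index pattern assumed):

```
GET _cat/indices/logstash-*?v&h=index,pri,rep,docs.count,store.size
```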


Raw data: C86BCF24575AA17349527262420B5D7418EA7888:1

Indexed data (Elasticsearch result):

```
{
  "_index": "logstash-sha_logs_1",
  "_type": "doc",
  "_id": "H5bdfGgBcKiZCqWYxTuG",
  "_version": 1,
  "_score": null,
  "_source": {
    "shavalue": "C86BCF24575AA17349527262420B5D7418EA7888",
    "@version": "1",
    "count": "1",
    "@timestamp": "2019-01-23T22:39:04.378Z",
    "message": "C86BCF24575AA17349527262420B5D7418EA7888:1",
    "host": "0.0.0.0",
    "path": "/root/example.txt"
  },
  "fields": {
    "@timestamp": [
      "2019-01-23T22:39:04.378Z"
    ]
  },
  "sort": [
    1548283144378
  ]
}
```

Grok parser:

```
filter {
  grok {
    match => [ "message", "%{BASE16NUM:shavalue}:%{NUMBER:count}" ]
  }
}
```

It's about 136 GB.

Please tell me whether I should use the grok parser or the fingerprint filter for parsing SHA-1 values.

It looks like you have already used grok to parse out the SHA1 hash from the message so I am not sure I understand your question. The fingerprint plugin calculates hashes, so I do not see how it is relevant here as you already have a hash.

Yeah... I just want to parse the hash field, so I used the grok parser, but I wonder why so much space is occupied by the index: it is almost 5 times the raw size (27 GB × 5). This problem happens only when I parse SHA values; for other parsers (BIND, Apache logs) the sizes come out correctly.

Is there any special function for handling SHA values, or can you suggest how to parse a SHA value using Logstash?

I have provided my parser above. Please check it and let me know.

Thanks in advance

From Logstash and Elasticsearch perspective it is just a string. When you use dynamic mappings in Elasticsearch, it will map it as both text and keyword, which adds flexibility but also requires more storage space. If you provide explicit mappings as described here you can save a lot of space. If you are using known types and modules, it is likely mappings have already been optimised, which is why it takes up less space.
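As an illustration only, not a definitive recipe, an explicit template for this data could look roughly like the following; the index pattern and field names are taken from the example document above, everything else is an assumption:

```
PUT _template/sha_logs
{
  "index_patterns": ["logstash-sha_logs_*"],
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "@timestamp": { "type": "date" },
        "@version":   { "type": "keyword" },
        "shavalue":   { "type": "keyword" },
        "count":      { "type": "integer" },
        "host":       { "type": "keyword" },
        "path":       { "type": "keyword" }
      }
    }
  }
}
```

This maps the hash once as a single keyword field instead of as text plus a keyword sub-field, which is where much of the duplication comes from; the order is set higher than the catch-all template shown earlier so it takes precedence.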

Another thing that drives disk usage is cardinality. As your SHA1 hash field is likely to be unique and is highly random in nature, it will take up a lot more space than fields containing low cardinality values that compress better.
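On the Logstash side, a small tweak to the filter shown earlier (a sketch, not tested against this setup) avoids indexing the raw line a second time in the message field:

```
filter {
  grok {
    match => { "message" => "%{BASE16NUM:shavalue}:%{NUMBER:count:int}" }
    # Drop the original line once it has been parsed so it is not stored again.
    remove_field => [ "message" ]
  }
}
```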

Thanks... Will check it and let you know.

Can you tell me approximately how much space it should take for a 27 GB file of SHA values?

I suspect you will save quite a bit by optimising mappings, as each hash is currently stored multiple times, but I do not know if there are other things that could be affecting this as well, as it seems like a lot of overhead.