Load data into Elasticsearch using Logstash

This question may already exist and there may be plenty of solutions available, but my requirement here is different.

I have a huge log file, around 44 GB, and it is growing day by day.

Q1: Can we load such a huge file into Elasticsearch?

Here is some sample data from the log file.

    Jun 1 17:12:18 10.10.125.148 2017-06-01T11:42:28Z 352019b8-0d2d-4397-446a-98fabeddf3bf doppler[19]: {
    "cf_app_id": "a4d311b3-f756-4d5e-bc3d-03690d461443",
    "cf_app_name": "parkingapp",
    "cf_ignored_app": false,
    "cf_org_id": "c5803a97-d696-497e-a0a4-112117eefab1",
    "cf_org_name": "KPIT",
    "cf_origin": "firehose",
    "cf_space_id": "886f2158-6b8a-4079-a6e1-7aa52034400d",
    "cf_space_name": "Development",
    "cpu_percentage": 0.022689683212221975,
    "deployment": "cf",
    "disk_bytes": 86257664,
    "disk_bytes_quota": 1073741824,
    "event_type": "ContainerMetric",
    "instance_index": 0,
    "ip": "10.10.125.113",
    "job": "diego_cell",
    "job_index": "356614b9-b079-4cc7-bcf9-4f61ab7924d0",
    "level": "info",
    "memory_bytes": 89395200,
    "memory_bytes_quota": 536870912,
    "msg": "",
    "origin": "rep",
    "time": "2017-06-01T11:42:28Z"
    }
    Jun 1 17:12:18 10.10.125.148 2017-06-01T11:42:28Z 352019b8-0d2d-4397-446a-98fabeddf3bf doppler[19]: {
    "cf_app_id": "3a83fdf4-a69a-45ca-8537-f7916c79dbbb",
    "cf_app_name": "spring-cloud-broker",
    "cf_ignored_app": false,
    "cf_org_id": "13233503-5430-4372-942c-02147ac34c38",
    "cf_org_name": "system",
    "cf_origin": "firehose",
    "cf_space_id": "1f40ca9a-ca34-434b-aa17-82ed87657a6e",
    "cf_space_name": "p-spring-cloud-services",
    "cpu_percentage": 0.0955028907326772,
    "deployment": "cf",
    "disk_bytes": 188231680,
    "disk_bytes_quota": 1073741824,
    "event_type": "ContainerMetric",
    "instance_index": 0,
    "ip": "10.10.125.113",
    "job": "diego_cell",
    "job_index": "356614b9-b079-4cc7-bcf9-4f61ab7924d0",
    "level": "info",
    "memory_bytes": 641343488,
    "memory_bytes_quota": 1073741824,
    "msg": "",
    "origin": "rep",
    "time": "2017-06-01T11:42:28Z"
    }
    Jun 1 17:12:18 10.10.125.148 2017-06-01T11:42:28Z 352019b8-0d2d-4397-446a-98fabeddf3bf doppler[19]: {
    "cf_app_id": "37acc229-844a-4ed3-ab54-5149ffab5b5b",
    "cf_app_name": "apps-manager-js",
    "cf_ignored_app": false,
    "cf_org_id": "13233503-5430-4372-942c-02147ac34c38",
    "cf_org_name": "system",
    "cf_origin": "firehose",
    "cf_space_id": "0ba61523-6a76-4d37-a0cd-a0117454a6eb",
    "cf_space_name": "system",
    "cpu_percentage": 0.04955433122879798,
    "deployment": "cf",
    "disk_bytes": 10235904,
    "disk_bytes_quota": 107374182 4,
    "event_type": "ContainerMetric",
    "instance_index": 5,
    "ip": "10.10.125.113",
    "job": "diego_cell",
    "job_index": "356614b9-b079-4cc7-bcf9-4f61ab7924d0",
    "level": "info",
    "memory_bytes": 6307840,
    "memory_bytes_quota": 67108864,
    "msg": "",
    "origin": "rep",
    "time": "2017-06-01T11:42:28Z"
    }

As you can see, the log lines are not entirely in JSON format.

Each log entry contains a cf_app_name. With the command below I extracted only the cf_app_name values from the logs and stored the output in another file.

grep -Po '"cf_app_name":.*?[^\\]"' /var/log/messages | cut -d ':' -f2 > applications.txt
Then I created indexes in Elasticsearch named after each cf_app_name by reading that applications.txt file with the script below.

# Lowercase the app names (Elasticsearch index names must be lowercase).
tr '[A-Z]' '[a-z]' < applications.txt > apps_name.txt

# Split each line on double quotes and create one index per extracted name.
while IFS='"' read -ra arr; do
    for i in "${arr[@]}"; do
        name="$i"
        CURL_COMMAND=$(curl -XPUT "localhost:9200/${name}?pretty")
        echo "$CURL_COMMAND"
    done
done < /root/apps_name.txt

I successfully created thousands of indexes in Elasticsearch.

Now what I would like to do is load these logs into the Elasticsearch indexes according to cf_app_name.

That means each log entry should be stored in the index that corresponds to its cf_app_name.

Q2: Is Logstash the best solution for this? If it is, please share your suggestions on how to achieve it.

Thank you, Bunny.

Can we load such a huge file into Elasticsearch?

Yes, of course.

I successfully created thousands of indexes in Elasticsearch.

Why do you want to have separate indexes? Indexes have a fixed cost so having too many of them in relation to the size of your cluster typically isn't a good idea.

Is Logstash the best solution for this? If it is, please share your suggestions on how to achieve it.

I suggest you

  • use a multiline codec to join the lines of each logical event,
  • use a grok filter to extract fields for the timestamp(s) and whatever else you've got in addition to the JSON string at the end,
  • use a json filter to parse the JSON string,
  • reference the cf_app_name field in your elasticsearch output configuration (e.g. index => "%{cf_app_name}").
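
To make those suggestions concrete, here is a rough, untested sketch of what such a pipeline could look like. The multiline pattern and the grok expression are guesses based on the sample lines above (syslog-style timestamp, source IP, ISO8601 timestamp, an ID, and doppler[19]: before the JSON), and the path and hosts values are placeholders, so adjust everything to your actual data:

input {
    file {
        path => ["/var/log/messages"]
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => multiline {
            # A new logical event starts with the syslog-style timestamp;
            # any other line is appended to the previous event.
            pattern => "^%{SYSLOGTIMESTAMP} "
            negate => true
            what => "previous"
        }
    }
}

filter {
    # Split the prefix from the JSON payload. (?m) lets GREEDYDATA span the
    # newlines introduced by the multiline codec.
    grok {
        match => {
            "message" => "(?m)^%{SYSLOGTIMESTAMP:syslog_timestamp} %{IP:source_ip} %{TIMESTAMP_ISO8601:event_timestamp} %{NOTSPACE:event_id} %{DATA:process}: %{GREEDYDATA:json_payload}"
        }
    }

    # Parse the JSON payload into top-level fields (cf_app_name, cpu_percentage, ...).
    json {
        source => "json_payload"
        remove_field => ["json_payload"]
    }

    # Use the timestamp from the JSON document as the event's @timestamp.
    date {
        match => ["time", "ISO8601"]
    }

    # Elasticsearch index names must be lowercase.
    mutate {
        lowercase => ["cf_app_name"]
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "%{cf_app_name}"
    }
    stdout { codec => rubydebug }
}

One caveat with index => "%{cf_app_name}": if an event has no cf_app_name (for example because the JSON parse failed), its documents end up in an index literally named %{cf_app_name}, so you may want a conditional or a fallback index for such events.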

Actually, I am new to Logstash.
I tried the script below, but I am getting errors.

input
{
    file
    {
        path => ["/root/test123.txt"]
        start_position => "beginning"
        sincedb_path => "/dev/null"
        exclude => "*.gz"
    }
}

filter
{
    grok {
        pattern => ["%{cf_app_name}"]
        named_captures_only => true
        }
    grep {
        match  => ["addition-server"]
        drop => false
        add_tag => json
        }
    json {
        tags => json
        message => data
        }
output
{
  elasticsearch {
    hosts => "localhost"
    index => "%{cf_app_name}"
}

    stdout { codec => rubydebug }
}

Error

fetched an invalid config {:config=>"input \n{\n    file \n    {\n        path => [\"/root/test123.txt\"]\n        start_position => \"beginning\"\n        sincedb_path => \"/dev/null\"\n        exclude => \"*.gz\"\n    }\n}\n\nfilter \n{\n    grok {\n\tpattern => [\"%{cf_app_name}\"]\n\tnamed_captures_only => true\n\t}\n    grep {\n\tmatch  => [\"addition-server\"]\n\tdrop => false\n\tadd_tag => json\n\t}\n    json {\n\ttags => json\n\tmessage => data\n\t}\t\noutput\n{ \n  elasticsearch {\n    hosts => \"10.10.236.61\"\n    index => \"%{cf_app_name}\"\n}\n\n    stdout { codec => rubydebug }\n}\n\n", :reason=>"Expected one of #, => at line 29, column 17 (byte 406) after filter \n{\n    grok {\n\tpattern => [\"%{cf_app_name}\"]\n\tnamed_captures_only => true\n\t}\n    grep {\n\tmatch  => [\"addition-server\"]\n\tdrop => false\n\tadd_tag => json\n\t}\n    json {\n\ttags => json\n\tmessage => data\n\t}\t\noutput\n{ \n  elasticsearch ", :level=>:error}

Can you please help me with this?

It would be very helpful for me.

There are multiple problems here:

  • You're not closing the filter block. There's a } missing. This is what's preventing Logstash from starting up.
  • You're referencing a cf_app_name field in your first grok filter, but that field won't exist until after the json filter has run. Filters are processed in order.
  • I don't understand what you're trying to do with the grok filter.
  • The grep filter is deprecated. What are you trying to do?
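
For what it's worth, here is a minimal, untested sketch of just the structural fixes (filter block closed, the deprecated grep dropped, and the JSON parsed before anything references cf_app_name); the source field is an assumption and depends on what your events actually contain:

filter {
    # Parse the JSON first so that cf_app_name exists as a field afterwards.
    json {
        source => "message"    # assumes the event is only the JSON text
    }
}    # <-- this closing brace was missing from the filter block

output {
    elasticsearch {
        hosts => "localhost"
        index => "%{cf_app_name}"
    }
    stdout { codec => rubydebug }
}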

I tried the script below but still no luck.

input
{
    file
    {
        path => ["/root/test123.txt"]
        start_position => "beginning"
        sincedb_path => "/dev/null"
        exclude => "*.gz"
    }
}

filter
{
    grok {
        match => {
            "cf_app_name" => "sparkle"
        }
    }
}
output
{
  elasticsearch {
    hosts => "localhost"
    index => "sparkle"
}

    stdout { codec => rubydebug }
}

Can you please suggest something?

I can't find anything obviously wrong with your configuration (except that the grok filter doesn't do anything useful) so you need to be more specific about the problems you're having.

Actually, I would like to store the logs into indexes like this:
if cf_app_name is xxx,
then those xxx logs should be stored in the xxx index in Elasticsearch.

Please suggest.

Can you please help me with that grok filter?

If you're not familiar with grok expressions then perhaps the grok constructor web site can help you get started.
