Logstash: Parse Mongo Collection Name To Add As 'Type' in ElasticSearch

I could use some help. I am using logstash with the logstash-input-mongodb plugin and the logstash-output-elasticsearch plugin.

Problem: Indexing a document into Elasticsearch takes a "type" and an "id" in the URI, e.g. PUT <host>/<index>/<type>/<id>. Without a filter, the configuration always reports "type" = "logs" and generates a new "id" for every document. This is problematic because restarting Logstash sends everything again with new ids.

Goal: I need to parse the collection name and the id from the input and use them as type=collection_name and id=mongo_id.

Here is my config:

input {
  mongodb {
   uri => '<connectionString>'
   collection => '(collection1|collection2|collection3)'
   batch_size => 300
  }
}

filter {
  grok {
    match => [
      ????
    ]
  }
}

output {
  elasticsearch {
    hosts => ["<es-host>"]
    index => "<index>"
    flush_size => 50
    document_type => "<COLLECTIONTOPARSE>"
    document_id => "<IDTOPARSE>"
  }
}

I think I need the grok match to parse the collection and id in mongo, but I could use some help.

When I run this without the filter and search in ES, it shows the doc structure below. A lot has been omitted, but I've included what is needed to parse the collection name and mongo_id out of the document; I'm just not sure how to do it. Do I need a multiline filter? How do I parse the variables and reference them in the output plugin?

{
    "took": 17,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 300,
        "max_score": 1,
        "hits": [
          {
            "_index": "index-name1",
            "_type": "logs",
            "_id": "AV6Wy_ltr_CCklAgCDn1",
            "_score": 1,
            "_source": {
                "host": "<some host value>",
                "@version": "1",
                "@timestamp": "2017-09-18T21:01:43.291Z",
                "logdate": "2015-09-30T18:25:02+00:00",
                "mongo_id": "IDTOPARSE",
                "_class": "word.word1.word2.word3.COLLECTIONTOPARSE.subcollection",
            }
          },
          ...
        ]
     }
  }

My regex expressions that parse the COLLECTIONTOPARSE and IDTOPARSE are below, but I don't know if "match" will filter out everything but the match, or will actually create variables I can use in the output plugin?

.*\.word3\.([a-zA-Z]+)\..*",
OR .*?\.word3\.([a-zA-Z]+)\..*?",

"mongo_id": "([a-zA-Z]+)",

I don't know if "match" will filter out everything but the match, or will actually create variables I can use in the output plugin?

The latter, i.e. the grok filter extracts new fields from existing fields.

It seems the mongo_id field is ready to be used right away without parsing.

As for the _class field from which you want to extract the collection name, how is the collection name identified in the sequence of period-separated words? Is it always the fifth word? Or always the second to last? Or something else?
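
Depending on the answer, something like this could work. It's only a sketch: it assumes the collection name is always the second-to-last period-separated token, and collection_name is an arbitrary field name, not something the plugin creates for you:

filter {
  grok {
    # capture the period-separated token just before the last one in _class,
    # e.g. "word.word1.word2.word3.COLLECTIONTOPARSE.subcollection" -> "COLLECTIONTOPARSE"
    match => { "_class" => "\.(?<collection_name>[^.]+)\.[^.]+$" }
  }
}

The captured field can then be referenced in the elasticsearch output with the sprintf syntax, e.g. document_type => "%{collection_name}".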

Thanks Magnus! For some reason I didn't realize that I could reference parsed fields. I solved the _class part by using the gsub option:

filter {
  mutate {
    gsub => [
      # remove the beginning part of the class name so we can use the rest for the ES 'type'
      "_class", "word1.word2.word3.word4.", ""
    ]
  }
}
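
One thing to keep in mind with that pattern: gsub treats it as a regular expression, so the unescaped dots match any character. It still works on this data, but escaping the dots is slightly stricter; a minimal variation (same placeholder prefix) would be:

filter {
  mutate {
    gsub => [
      # same prefix strip, but with the dots escaped so they only match literal periods
      "_class", "word1\.word2\.word3\.word4\.", ""
    ]
  }
}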

Regarding outputting the correct stuff into ES, I used this:

output {
  elasticsearch {
    hosts => [""]
    index => "test"

    # mongo_id is in the input, so we can reference its value as the document id in the ES PUT command
    document_id => "%{[mongo_id]}"

    document_type => "%{[_class]}"
  }
}
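
As a side note, a common variation (sketched below with hypothetical [@metadata] field names) is to copy these values into @metadata, since @metadata fields can be used in sprintf references but are never sent to Elasticsearch as part of the document:

filter {
  mutate {
    # copy the routing values under @metadata; @metadata itself is never indexed
    add_field => {
      "[@metadata][es_type]" => "%{_class}"
      "[@metadata][es_id]"   => "%{mongo_id}"
    }
  }
}

output {
  elasticsearch {
    hosts => [""]
    index => "test"
    document_type => "%{[@metadata][es_type]}"
    document_id   => "%{[@metadata][es_id]}"
  }
}

If the original _class and mongo_id fields shouldn't be indexed at all, a remove_field on them can follow in a later mutate.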

Thanks!
