Ingest pipeline to extract data

If I want to extract the numeric duration value from a message like this:
words::words::words::words (duration=432, words)

would this grok pipeline work?
PUT _ingest/pipeline/parse
{
  "description": "parses the duration field",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{NUMBER:duration}"]
      }
    }
  ]
}

Would this parse the value and store it in a new field, or would I have to create the new field first? Are there any issues that you notice with it?

Is there anything else I have to do to run this pipeline now that it is created?

I know the ingest node page says to run the pipeline like this:
PUT my-index/_doc/my-id?pipeline=my_pipeline_id
{
  "foo": "bar"
}

But I don't understand the need for "foo": "bar". My pipeline should just extract the duration from the message field and store it in a new field, shouldn't it?

It may be helpful to use the Simulate Pipeline API to try out your pipeline to see what the result will be, without actually indexing any documents, like so:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "this is an example pipeline",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{NUMBER:duration}"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "words::words::words::words (duration=432, words)"
      }
    }
  ]
}

In this case, this should return a response like:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "duration" : "432",
          "message" : "words::words::words::words (duration=432, words)"
        },
        "_ingest" : {
          "timestamp" : "2019-05-31T16:21:53.147Z"
        }
      }
    }
  ]
}

We can see that the number is added to the document as duration, because that's the name we used in the grok pattern. However, note that the given grok pattern will just extract the first number in the message field, so if you used a document with "message": "words::words::87::words (duration=432, words)", the extracted duration would be 87.

To make this a little more resilient, you can use regular expressions in your grok pattern, like this: ".*duration=%{NUMBER:duration}.*". That pattern would extract 432 rather than 87 from the second example message.
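
As a quick check, the simulate request from earlier can be rerun with this anchored pattern against both example messages. This is only a sketch: the :int suffix is an optional addition that is not part of the pipeline above, and it asks grok to store the value as an integer rather than the string "432" seen in the earlier response.

```
# Sketch only: ":int" is an optional type conversion, not part of the original pipeline
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "extracts the number that follows duration=",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [".*duration=%{NUMBER:duration:int}.*"]
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "words::words::words::words (duration=432, words)" } },
    { "_source": { "message": "words::words::87::words (duration=432, words)" } }
  ]
}
```

Both documents should come back with duration set to 432. Drop the :int suffix if you would rather keep the value as a string.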

To use the pipeline, just index documents as normal, but with the pipeline=my_pipeline_id parameter on the request (pipeline=parse for the pipeline you created), like in your second post.

The "foo": "bar" is just an example document; it's not required. Replace it with the document you want to run through the pipeline and index.
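
For the pipeline created earlier in this thread, such a request might look like the sketch below. The index name my-index and the document ID 1 are placeholders; parse is the pipeline ID from the original PUT _ingest/pipeline/parse request.

```
# my-index and the document ID 1 are placeholders
PUT my-index/_doc/1?pipeline=parse
{
  "message": "words::words::words::words (duration=432, words)"
}

# Fetch the stored document to check the extracted field
GET my-index/_doc/1
```

If the pipeline ran, the returned _source should contain a duration field alongside the original message.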


Nice, thanks. I was actually trying to write a better expression but was getting parsing errors, so I guess I need to brush up on my regex.
As for using the pipeline, I believe I'll have to add the pipeline name to fluentd, since I don't ingest documents manually here.
