Ingest pipeline to extract data

nicks1993 · May 30, 2019, 11:47am

If I want to extract the numeric duration value from a message like
words::words::words::words (duration=432, words)

would this grok pipeline work?
PUT _ingest/pipeline/parse
{
  "description": "parses the duration field",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{NUMBER:duration}"]
      }
    }
  ]
}

Would this parse the value and store it in a new field, or would I have to create the new field first? Are there any issues that you notice with it?

Is there anything else I have to do to run this pipeline now that it is created?

nicks1993 · May 30, 2019, 2:40pm

I know the ingest node page says to run the pipeline like this:
PUT my-index/_doc/my-id?pipeline=my_pipeline_id
{
  "foo": "bar"
}

but I don't understand the need for "foo": "bar". My pipeline should just extract the duration out of that message field and store it in a new field, shouldn't it?

gbrown · May 31, 2019, 4:29pm

It may be helpful to use the Simulate Pipeline API to try out your pipeline to see what the result will be, without actually indexing any documents, like so:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "this is an example pipeline",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{NUMBER:duration}"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "words::words::words::words (duration=432, words)"
      }
    }
  ]
}

In this case, this should return a response like:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "duration" : "432",
          "message" : "words::words::words::words (duration=432, words)"
        },
        "_ingest" : {
          "timestamp" : "2019-05-31T16:21:53.147Z"
        }
      }
    }
  ]
}

We can see that the number is added to the document as duration, because that's the name we used in the grok pattern. However, note that the given grok pattern will just extract the first number in the message field, so if you used a document with "message": "words::words::87::words (duration=432, words)", the extracted duration would be 87.

To make this a little more resilient, you can use regular expressions in your grok pattern, like this: ".*duration=%{NUMBER:duration}.*". That pattern would extract 432 rather than 87 from the second example message.
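As a rough sketch, the same Simulate request can be rerun with that pattern against both example messages to check the difference. The description string below is only a placeholder, and I have left out the response rather than guess at its exact shape:

```console
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "placeholder: anchors the number to the duration= prefix",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [".*duration=%{NUMBER:duration}.*"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "words::words::words::words (duration=432, words)"
      }
    },
    {
      "_source": {
        "message": "words::words::87::words (duration=432, words)"
      }
    }
  ]
}
```

Both simulated documents should come back with duration set to 432. If you want the value stored as a number rather than the string "432", grok also accepts a type suffix in the capture, for example %{NUMBER:duration:int}.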

To use the pipeline, just index documents as normal, but with the pipeline=my_pipeline_id parameter on the request, like in your second post.

The "foo": "bar" body is just an example document; it isn't required. Replace it with the document you want to run through the pipeline and index.
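For instance, assuming the pipeline was saved under the name parse as in the first post, and using my-index and my-id purely as placeholder names, indexing one of the sample messages and then fetching it back might look like this:

```console
PUT my-index/_doc/my-id?pipeline=parse
{
  "message": "words::words::words::words (duration=432, words)"
}

GET my-index/_doc/my-id
```

The document returned by the GET should contain both the original message field and the extracted duration field, since the pipeline runs before the document is stored.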

nicks1993 · May 31, 2019, 4:46pm

Nice, thanks. I was actually trying to write a better expression, but I was getting parsing errors. I guess I need to brush up on my regex.
As for using the pipeline, I believe I'll have to add the pipeline name to my fluentd configuration, as I don't ingest documents manually here.
