Ingest: How processing works when foreach and grok are combined

tsgkdt · September 19, 2017, 1:36pm

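What I want to know

How to apply grok inside foreach in an Ingest Pipeline and store the results as arrays.

Background

I have text made up of comma-separated entries, each consisting of a log code in square brackets followed by a log message.

(Example)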

[INFO.0001]あいうえおかきくけこ, [INFO.0002]さしすせそ

I would like to parse this with an Ingest Pipeline as follows.

Expected result

{
  "hoge": ["INFO.0001", "INFO.0002"],
  "fuga": ["あいうえおかきくけこ", "さしすせそ"]
}

try

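First, I tried turning the string into an array with split and then extracting the code and message with grok inside foreach.
With this approach, the hoge and fuga fields only end up holding the values from the last iteration of the loop.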

    {
      "split": {
        "field": "message",
        "separator": ","
      }
    },
    {
      "foreach": {
        "field": "message",
        "processor": {
          "grok": {
            "field": "_ingest._value",
            "patterns": ["\\[%{NOTSPACE:hoge}\\]%{GREEDYDATA:fuga}"]
          }
        }
      }
    }

actual result

{
  "hoge": "INFO.0002",
  "fuga": "さしすせそ"
}
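
One way to picture why only the last values survive is the rough Python model below of the split, foreach, and grok sequence. It is not Elasticsearch code: `simulate_foreach_grok` is a hypothetical helper, and the regular expression is only an approximation of the grok pattern used above.

```python
import re

# Approximation of the grok pattern \[%{NOTSPACE:hoge}\]%{GREEDYDATA:fuga}.
# The named groups stand in for the fields grok would write to the document.
PATTERN = re.compile(r"\[(?P<hoge>\S+)\](?P<fuga>.*)")


def simulate_foreach_grok(message: str) -> dict:
    """Hypothetical model of split + foreach(grok) applied to one document."""
    doc = {"message": message.split(",")}  # split processor
    for value in doc["message"]:  # foreach: one pass per array element
        match = PATTERN.search(value)
        if match:
            # Every pass writes to the same top-level fields, so each
            # iteration replaces the values captured by the previous one.
            doc.update(match.groupdict())
    return doc


doc = simulate_foreach_grok("[INFO.0001]あいうえおかきくけこ, [INFO.0002]さしすせそ")
print(doc["hoge"], doc["fuga"])
```

In this model, hoge and fuga end up holding the values from the second element, which matches the actual result above.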

try2

Next, to keep the values from being overwritten, I specified append after grok inside foreach:

      "foreach": {
        "field": "biz.temp.exception",
        "processor": {
          "grok": {
            "field": "_ingest._value",
            "patterns": ["\\[%{NOTSPACE:hoge}\\]%{GREEDYDATA:fuga}"]
          },
          "append": {
            "field": "hoge_array",
            "value": "%{hoge}"
          }
        }
      }

result

"type":"parse_exception","reason":"[processor] Must specify exactly one processor type","header":{"processor_type":"foreach","property_name":"processor"}}

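I get this error, which says that foreach accepts only a single processor.

What I looked into

There was already a similar topic, but it received no answers and was closed automatically.

I found that the foreach processor itself can be specified more than once, but nothing I found mentioned combining it with grok.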

github.com/elastic/elasticsearch

Modify foreach processor to accept a single processor instead of collection

opened 09:53PM - 08 Jul 16 UTC

closed 08:13PM - 13 Jul 16 UTC

BigFunger

discuss :Data Management/Ingest Node

@BigFunger, @Bargs, and @talevy had a discussion on Zoom earlier today, and this… is one of the issues that we discussed.

### Summary

I would like to see the foreach processor reworked so that it only accepts a single processor instead of an array of processors as it does now.

### Background

I have been working on the UI for ingest pipelines, and specifically I have been trying to implement the foreach processor. The UI uses the verbose setting on the simulate API to report back to the user. This way, when the user add or edits a processor, I can use the output from the parent processor as the input of the next and provide them with the data necessary to build out their processors. This is a problem in the context of the foreach processor because I can't provide the user with the input and output of each of the processors defined within the foreach processor.

Per our discussion, it also make sense to structure the foreach processor in this way because it follows the patterns that have been established in the other processors. For example, if you want to apply an `uppercase` processor to more than one field in the document, you need to create one `uppercase` processor for each field you want to act on.

### Example

With a pipeline definition of the following:

``` json
{
  "pipeline": {
    "description": "",
    "processors": [
      { "split": { "tag": "processor_1", "field": "message", "separator": " " } },
      {
        "foreach": {
          "tag": "processor_2",
          "field": "message",
          "processors": [
            {
              "uppercase": { "tag": "processor_3", "field": "_value" },
              "lowercase": { "tag": "processor_4", "field": "_value" },
              "uppercase": { "tag": "processor_5", "field": "_value" }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "these are the words of a sentence" } }
  ]
}
```

I would expect the following output:

``` json
{
  "docs": [
    {
      "processor_results": [
        {
          "tag": "processor_1",
          "doc": {
            "_type": "_type", "_id": "_id", "_index": "_index",
            "_source": { "message": [ "these", "are", "the", "words", "of", "a", "sentence" ] },
            "_ingest": { "timestamp": "2016-07-06T21:27:14.585+0000" }
          }
        },
        {
          "tag": "processor_3",
          "doc": {
            "_type": "_type", "_id": "_id", "_index": "_index",
            "_source": { "message": [ "THESE", "ARE", "THE", "WORDS", "OF", "A", "SENTENCE" ] },
            "_ingest": { "timestamp": "2016-07-06T21:27:14.585+0000" }
          }
        },
        {
          "tag": "processor_4",
          "doc": {
            "_type": "_type", "_id": "_id", "_index": "_index",
            "_source": { "message": [ "these", "are", "the", "words", "of", "a", "sentence" ] },
            "_ingest": { "timestamp": "2016-07-06T21:27:14.585+0000" }
          }
        },
        {
          "tag": "processor_5",
          "doc": {
            "_type": "_type", "_id": "_id", "_index": "_index",
            "_source": { "message": [ "THESE", "ARE", "THE", "WORDS", "OF", "A", "SENTENCE" ] },
            "_ingest": { "timestamp": "2016-07-06T21:27:14.585+0000" }
          }
        },
        {
          "tag": "processor_2",
          "doc": {
            "_type": "_type", "_id": "_id", "_index": "_index",
            "_source": { "message": [ "THESE", "ARE", "THE", "WORDS", "OF", "A", "SENTENCE" ] },
            "_ingest": { "timestamp": "2016-07-06T21:27:14.585+0000" }
          }
        }
      ]
    }
  ]
}
```

Instead, I get this back:

``` json
{
  "docs": [
    {
      "processor_results": [
        {
          "tag": "processor_1",
          "doc": {
            "_id": "_id", "_type": "_type", "_index": "_index",
            "_source": { "message": [ "these", "are", "the", "words", "of", "a", "sentence" ] },
            "_ingest": { "timestamp": "2016-07-08T21:36:14.400+0000" }
          }
        },
        {
          "tag": "processor_2",
          "doc": {
            "_id": "_id", "_type": "_type", "_index": "_index",
            "_source": { "message": [ "these", "are", "the", "words", "of", "a", "sentence" ] },
            "_ingest": { "timestamp": "2016-07-08T21:36:14.400+0000" }
          }
        }
      ]
    }
  ]
}
```

Environment

ES, Kibana, 6.0.0-beta2

johtani · September 26, 2017, 7:45am

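Unfortunately, at a glance it doesn't look like this can be done with Grok.
The situation may be different with Logstash's Grok, though.

Otherwise, handling it in Painless with the Script processor seems like a reasonable approach.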

tsgkdt · October 3, 2017, 12:42am

Thank you for the advice.

I tried the Script processor and confirmed that it behaves as expected.
Here is the simulate request and its result.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": ","
        }
      },
      {
        "script": {
          "lang": "painless",
          "source": """
            ctx.logcode = [];
            ctx.logmessage = [];
            for (line in ctx.message) {
              def matcher = /\[(.*?)\](.*)/.matcher(line);
              if (matcher.find()) {
                ctx.logcode.add(matcher.group(1).trim());
                ctx.logmessage.add(matcher.group(2).trim());
              }
            }
          """
        }
      },
      {
        "remove": {
          "field": "message"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "[INFO.0001]あいうえおかきくけこ, [INFO.0002]さしすせそ"
      }
    }
  ]
}

Result

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_type": "_type",
        "_id": "_id",
        "_source": {
          "logmessage": [
            "あいうえおかきくけこ",
            "さしすせそ"
          ],
          "logcode": [
            "INFO.0001",
            "INFO.0002"
          ]
        },
        "_ingest": {
          "timestamp": "2017-10-03T00:38:19.594Z"
        }
      }
    }
  ]
}
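
For checking the parsing logic outside Elasticsearch, the Painless script above can be mirrored in a few lines of Python. This is only a sketch under assumptions: `parse_log_line` is a hypothetical name, Python's `re.search` stands in for the Java `Matcher.find()` call, and `str.strip()` strips slightly more whitespace characters than Java's `trim()`.

```python
import re

# Same regular expression as in the Painless script: the code inside the
# brackets (non-greedy), then the rest of the element as the message.
LINE = re.compile(r"\[(.*?)\](.*)")


def parse_log_line(message: str) -> dict:
    """Hypothetical offline mirror of the split + script processors."""
    logcode, logmessage = [], []
    for part in message.split(","):  # split processor with separator ","
        m = LINE.search(part)  # search() tolerates the leading space
        if m:
            # Append to lists instead of overwriting a single field.
            logcode.append(m.group(1).strip())
            logmessage.append(m.group(2).strip())
    return {"logcode": logcode, "logmessage": logmessage}


parsed = parse_log_line("[INFO.0001]あいうえおかきくけこ, [INFO.0002]さしすせそ")
print(parsed)
```

Note that, like the pipeline itself, this sketch splits on every comma, so a log message that contains a comma would be cut into separate elements.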

Verified on

ES, Kibana 6.0.0-RC1

system · October 31, 2017, 12:46am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
