Help me figure out the grok filter

pavuk · March 12, 2018, 3:38pm

Yes, I'm keeping your earlier recommendations in mind; I just haven't gotten to them yet. I'll try to work through this one first, if I can.
What confused me was that %{[text][0]} and %{[created_at][0]} show up literally in the output.

pavuk · March 13, 2018, 6:33am

I tried following your recommendation and ended up with this config:

input {
  http_poller {
    urls => {
      test => {
        method => get
        url => "https://plesk.uservoice.com/api/v1/forums/184549/suggestions.json?sort=newest"
        headers => {
          Accept => "application/json"
          Authorization => "Bearer  xxxxxxxxxx"
        }
      }
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    # schedule => { cron => "* * * * * UTC"}
    schedule => { every => "1h"}
    codec => "plain"
  }
}


filter {
  xml {
    source => "message"
    target => "xmldata"
    #store_xml => "false"
    xpath => ["/response/suggestion","suggestion"]
  }

  mutate {
    remove_field => [ "message", "xmldata" ]
  }

  split {
    field => "[suggestion]"
  }

  xml {
    source => "suggestion"
    xpath => ["suggestion/supporters_count/text()", "supporters_count"]
    xpath => ["suggestion/created_at/text()", "created_at"]
    xpath => ["suggestion/text/text()", "text"]
    force_content => true
    force_array => false
    remove_field => [ "message" ]
    store_xml => false
  }
  mutate {
    convert => {
      "supporters_count" => "integer"
      "created_at" => "string"
      "text" => "string"
    }
    replace => {
      "supporters_count" => "%{[supporters_count][0]}"
      "created_at" => "%{[created_at][0]}"
      "text" => "%{[text][0]}"
    }
  }
}

The result: https://screenshots.firefox.com/lNz170RX6DjAI44F/talkkib.plesk.com
And in the log:

[2018-03-13T06:25:17,353][WARN ][logstash.filters.split ] Only String and Array types are splittable. field:[suggestion] is of type = NilClass

For some reason it doesn't split on suggestion...

Igor_Motov · March 13, 2018, 9:46pm

Have you tried putting a stdout output right after the first xml filter, removing everything else, and looking at what it prints? Had you done that, you would have noticed that suggestion is never selected in the first filter because the xpath is wrong.

filter {
  # First extract individual suggestions
  xml {
    source => "message"
    xpath => ["/response/suggestions/suggestion", "suggestion"]
    force_content => false
    force_array => true
    remove_field => [ "message" ]
    store_xml => false
  }
  # Split suggestions into records
  split {
    field => "suggestion"
  }
  # Now parse each xml in each record
  xml {
    source => "suggestion"
    xpath => ["/suggestion/supporters_count/text()", "supporters_count"]
    xpath => ["/suggestion/created_at/text()", "created_at"]
    xpath => ["/suggestion/text/text()", "text"]
    force_content => true
    force_array => false
    remove_field => [ "suggestion" ]
    store_xml => false
  }
  # The filter above may produce multiple entries for each xpath element.
  # We need to adjust the type and are interested only in the first one.
  mutate {
    convert => {
      "supporters_count" => "integer"
      "created_at" => "string"
      "text" => "string"
    }
    replace => {
      "supporters_count" => "%{[supporters_count][0]}"
      "created_at" => "%{[created_at][0]}"
      "text" => "%{[text][0]}"
    }
  }
}

pavuk · March 14, 2018, 4:36am

I tried it with the simplest possible config:

input {
  http_poller {
    urls => {
      test => {
        # Supports all options supported by ruby's Manticore HTTP client
        method => get
        url => "https://plesk.uservoice.com/api/v1/forums/184549/suggestions?sort=newest"
        headers => {
          Accept => "application/json"
          Authorization => "Bearer  xxxxxxxx"
        }
      }
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    # schedule => { cron => "* * * * * UTC"}
    schedule => { every => "1h"}
    codec => "plain"
  }
}


filter {
  # First extract individual suggestions
  xml {
    source => "message"
    xpath => ["/response/suggestions/suggestion", "suggestion"]
    force_content => false
    force_array => true
    remove_field => [ "message" ]
    store_xml => false
  }
  # Split suggestions into records
  split {
    field => "suggestion"
  }
}

output {
  stdout { codec => rubydebug }
}

It still doesn't parse. Same error:

[2018-03-14T04:20:28,514][WARN ][logstash.filters.split ] Only String and Array types are splittable. field:suggestion is of type = NilClass

I'm at a loss; apparently the problem is not the xpath but something else. How else could it be written, given that the output is:

 curl -XGET "https://plesk.uservoice.com/api/v1/forums/184549/suggestions?sort=newest" -H "Authorization: Bearer xxxxxxxxx"

<?xml version="1.0" encoding="UTF-8"?>
<response>
<response_data>
  <page type="integer">1</page>
  <per_page type="integer">10</per_page>
  <total_records type="integer">1911</total_records>
  <filter>public</filter>
  <sort>newest</sort>
</response_data>
  <suggestions type="array">
    <suggestion>
<url>http://plesk.uservoice.com/forums/184549-feature-suggestions/suggestions/33620917-option-to-not-select-alias-domains-in-let-s-encryp</url>
<path>/forums/184549-feature-suggestions/suggestions/33620917-option-to-not-select-alias-domains-in-let-s-encryp</path>
<id type="integer">33620917</id>
<state>published</state>
<title>Option to not select alias domains in Let's Encrypt by default</title>
<text>By default, the Let's Encrypt extension pre-selects all alias domains of a domain to be included in the Let's Encrypt certificate. This can lead to some unwanted results, for example when one of the alias domains is removed from the DNS at a later stage (because it was only used for testing) which would result in Let's Encrypt being unable to renew the certificate.

It would be nice if there was an option that allows us to configure that alias domains should _not_ be selected by default in the Let's Encrypt extension.
</text>

It simply doesn't see suggestion.

pavuk · March 14, 2018, 6:05am

I think I see what the problem is. <response> occurs many more times inside the output, besides the very first <response>. That is apparently why the xpath comes out wrong.
I tried

xpath => ["/suggestions/suggestion", "suggestion"]

but it didn't help. How do I write it correctly so that all those response elements are ignored?

Igor_Motov · March 14, 2018, 11:55am

And what does it output if you remove split?

pavuk · March 14, 2018, 1:18pm

There's no split error, of course. And the output looks something like this:

https://gist.githubusercontent.com/IronButterfly/99356b3f12995aa4e4c0e3657a9b6e7b/raw/9606eeafee782e03d05dbe1f0670c59ca8fd0940/gistfile1.txt

Igor_Motov · March 14, 2018, 2:25pm

Well, you are receiving the data as JSON, not XML:

 "message": "{\"response_data\":{\"page\":1,\"per_page\":10,\"total_records\":1911,\"filter\":\"public\",\"sort\":\"newest\"},\"suggestions\":[{\"url\":\"http://plesk.uservoice.com/forums/184549-feature-suggestions/suggestions/33620917-option-to-not-select-alias-domains-in-let-s-

Naturally, the xml filter parses nothing and leaves everything in message. In that case you really do need to use the json codec. Remove the xml filter, change the codec in http_poller to json, and let's see what it produces.
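A minimal sketch of that change, assuming the same http_poller input as in the config above with only the codec swapped and the xml filter dropped. The stdout output is there just to inspect the resulting event; the server presumably returns JSON because the request sends Accept: application/json, which the curl call did not.

```
input {
  http_poller {
    urls => {
      test => {
        method => get
        url => "https://plesk.uservoice.com/api/v1/forums/184549/suggestions?sort=newest"
        headers => {
          Accept => "application/json"
          Authorization => "Bearer  xxxxxxxx"
        }
      }
    }
    request_timeout => 60
    schedule => { every => "1h"}
    # Decode the response body as JSON instead of storing it as a plain string in "message"
    codec => "json"
  }
}

output {
  # Print the decoded event to check which top-level fields it actually has
  stdout { codec => rubydebug }
}
```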

pavuk · March 14, 2018, 3:07pm

I removed all the filters, changed the codec in http_poller from plain to json, and here is the result:

https://gist.github.com/IronButterfly/ff9d501ac46d5052572e16d203370e48

gistfile1.txt

{
  "_index": "uservoice",
  "_type": "doc",
  "_id": "kXkHJWIB0KkY9AqqrFpu",
  "_version": 1,
  "_score": 2,
  "_source": {
    "@timestamp": "2018-03-14T15:01:15.038Z",
    "response_data": {
      "per_page": 10,

This file has been truncated.

Igor_Motov · March 15, 2018, 12:46am

filter {
  # Split suggestions into records
  split {
    field => "[_source][suggestions]"
  }
  # Extract fields that we care about and delete the rest
  mutate {
    add_field => {"formatted_text" => "%{[_source][suggestions][formatted_text]}"}
    add_field => {"created_at" => "%{[_source][suggestions][created_at]}"}
    add_field => {"supporters_count" => "%{[_source][suggestions][supporters_count]}"}
    remove_field => "_source"
  }
}

pavuk · March 15, 2018, 5:09am

It still doesn't split.

[2018-03-15T04:56:00,116][WARN ][logstash.filters.split ] Only String and Array types are splittable. field:[_source][suggestions] is of type = NilClass

Igor_Motov · March 15, 2018, 11:45am

I tried it with the JSON you sent me, reading it from a file, and everything works. You need to make sure that the path _source/suggestions actually exists in the record you receive from http_poller.

pavuk · March 16, 2018, 7:00am

Phew... I finally figured it out. It now splits and outputs everything as expected with this config:

input {
  http_poller {
    urls => {
      test => {
        method => get
        url => "https://plesk.uservoice.com/api/v1/forums/184549/suggestions?sort=newest"
        headers => {
          Accept => "application/json"
          Authorization => "Bearer  xxxxxxxxx"
        }
      }
    }
    request_timeout => 60
    schedule => { every => "1h"}
    codec => "json"
  }
}

filter {
  # Split suggestions into records
  split {
    field => "[suggestions]"
  }
  # Extract fields that we care about and delete the rest
  mutate {
    add_field => {"id" => "%{[suggestions][id]}"}
    add_field => {"title" => "%{[suggestions][title]}"}
    add_field => {"text" => "%{[suggestions][text]}"}
    add_field => {"created_at" => "%{[suggestions][created_at]}"}
    add_field => {"status" => "%{[suggestions][status][name]}"}
    add_field => {"category" => "%{[suggestions][category][name]}"}
    add_field => {"supporters_count" => "%{[suggestions][supporters_count]}"}
    remove_field => "suggestions"
    remove_field => "response_data"
  }
}

But a few more questions have come up. Would you help, if you aren't tired of me yet?

The output looks like this:

{
"status" => "open discussion",
"@timestamp" => 2018-03-16T06:39:43.150Z,
"created_at" => "2018/03/11 16:50:13 +0000",
"category" => "Backup / Restore",
"id" => "33599668",
"supporters_count" => "1",
"@version" => "1",
"text" => "Please enter the percentage of database restore in progress ... it is very frustrating to restore without knowing either the time left or the totality of kbytes transferred and remaining, with progress a bar \nThank you",
"title" => "backup"
}

1. The problem is that I need to use "created_at" instead of "@timestamp". But that field goes into the index as text, not as a date.
2. According to the schedule setting, the fetch runs every hour. I would like records that already exist in the Elasticsearch index to have their changed fields updated, and new records that appeared during the past hour to be added.
3. I would like to somehow strip all URLs and emails from the "text" field and, in addition, convert all of the text to lowercase.

pavuk · March 16, 2018, 8:34am

I sorted out item 1 and part of item 3 with this filter:

filter {
  # Split suggestions into records
  split {
    field => "[suggestions]"
  }
  # Extract fields that we care about and delete the rest
  mutate {
    add_field => {"id" => "%{[suggestions][id]}"}
    add_field => {"title" => "%{[suggestions][title]}"}
    add_field => {"text" => "%{[suggestions][text]}"}
    add_field => {"created_at" => "%{[suggestions][created_at]}"}
    add_field => {"status" => "%{[suggestions][status][name]}"}
    add_field => {"category" => "%{[suggestions][category][name]}"}
    add_field => {"supporters_count" => "%{[suggestions][supporters_count]}"}
    remove_field => "suggestions"
    remove_field => "response_data"
    #remove_field => "@timestamp"
  }
  mutate {
    lowercase => ["text"]
  }
  mutate {
    gsub => ["created_at", "/", "-"]
  }

  date {
    match => [ "created_at", "YYYY-MM-dd HH:mm:ss Z" ]
    target => "created_at"
  }
}
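
As a possible simplification (an untested sketch, assuming created_at always has the form 2018/03/11 16:50:13 +0000 shown in the output above), the date filter can match the slashes directly, since non-letter characters in the pattern are treated as literals. That would make the gsub step unnecessary:

```
filter {
  date {
    # The slashes are literal characters in the pattern, so no gsub is needed beforehand
    match => [ "created_at", "yyyy/MM/dd HH:mm:ss Z" ]
    # Write the parsed date back into created_at
    target => "created_at"
  }
}
```

Leaving out target makes the date filter write to "@timestamp" instead, which is another way to have the event timestamp reflect the creation time.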

What remains is item 2 and stripping the URLs and emails.

Igor_Motov · March 16, 2018, 2:39pm

For item 2: you just need to index these records into elasticsearch with the same id every time. You can do that by specifying the field that holds the id in the document_id parameter.

For item 3: this can be done with gsub in mutate.

pavuk · March 16, 2018, 3:26pm

I can't quite work out how to write document_id correctly.
I tried:

document_id => "id"

But that way only one record gets added.
If I write it like this:

document_id => "%{[suggestions][id]}"

then I get an error.

The gsub part is clear. All that's left is to write the right regexps.

Igor_Motov · March 16, 2018, 8:22pm

document_id => "%{id}"
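
A short sketch of where that line goes, with the reasoning as comments. The host below is a placeholder assumption; the index name is the one that appears in the gist earlier in the thread.

```
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # placeholder host, an assumption
    index => "uservoice"
    # "id" would be a literal string, so every event would overwrite the single document with _id "id".
    # "%{[suggestions][id]}" cannot resolve because the suggestions field was removed by the mutate filter.
    # "%{id}" references the top-level id field that add_field created on each event.
    document_id => "%{id}"
  }
}
```

With the default index action, an event with an existing id replaces the whole stored document. If merging only the changed fields is preferred, the plugin's action => "update" together with doc_as_upsert => true may be worth a look.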

pavuk · March 19, 2018, 3:30am

Many thanks for the help and the lessons, @Igor_Motov! In the end I got practically what I wanted from the start.

Someone may find it interesting and useful to see how to fetch data from UserVoice through the API. My almost-final config turned out like this:

input {
  http_poller {
    urls => {
      test => {
        method => get
        url => "https://plesk.uservoice.com/api/v1/forums/184549/suggestions?per_page=300&sort=newest"
        headers => {
          Accept => "application/json"
          Authorization => "Bearer  xxxxxxxxxxxx"
        }
      }
    }
    request_timeout => 60
    schedule => { every => "1h"}
    codec => "json"
  }
}

filter {
  
  split {
    field => "[suggestions]"
  }
  
  mutate {
    add_field => {"id" => "%{[suggestions][id]}"}
    add_field => {"title" => "%{[suggestions][title]}"}
    add_field => {"text" => "%{[suggestions][text]}"}
    add_field => {"created_at" => "%{[suggestions][created_at]}"}
    add_field => {"status" => "%{[suggestions][status][name]}"}
    add_field => {"category" => "%{[suggestions][category][name]}"}
    add_field => {"supporters_count" => "%{[suggestions][supporters_count]}"}
    remove_field => "suggestions"
    remove_field => "response_data"
    remove_field => "@timestamp"
  }

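  # If [suggestions][status][name] was missing, add_field left the literal
  # "%{...}" reference in the field; replace it with a default value
  if [status] == "%{[suggestions][status][name]}" {
    mutate {
      replace => [ "status", "No status" ]
    }
  }

  # Same for a missing text: fall back to the title
  if [text] == "%{[suggestions][text]}" {
    mutate {
      replace => [ "text", "%{title}" ]
    }
  }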

  mutate {
    strip => ["text"]
  }
  
  mutate {
    lowercase => ["text"]
  }
  
  mutate {
    gsub => ["text", "http\S+", ""]
  }

  mutate {
    convert => { "supporters_count" => "integer" }
  }

  mutate {
    gsub => ["created_at", "/", "-"]
  }

  date {
    match => [ "created_at", "YYYY-MM-dd HH:mm:ss Z" ]
    target => "created_at"
  }

}

output {
  elasticsearch {
    hosts => ["http://xxxxxxxxx.com:9200/"]
    user => xxxxx
    password => xxxxx
    index => "uservoice"
    document_id => "%{id}"
  }
}
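
The config above strips URLs but not email addresses, which were also part of item 3. A rough sketch of one more gsub for that, assuming a deliberately loose pattern is acceptable (the regexp is an illustration, not a full email validator, and has not been tested against the UserVoice data):

```
filter {
  # Remove anything that looks like an email address, e.g. user.name@example.com
  mutate {
    gsub => ["text", "[\w.+-]+@[\w-]+(\.[\w-]+)+", ""]
  }
}
```

It would sit next to the existing "http\S+" gsub in the filter section.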

system · April 16, 2018, 3:30am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
