Help me figure out the grok filter

pavuk · March 1, 2018, 6:48am

Hello,

Here is my question. I'm using the http_poller input plugin for Logstash to pull JSON data from UserVoice through its API. The responses are fairly large, and all of it goes into Elasticsearch. I can work with that in principle, but I would like to filter the stream so that only the fields I need reach ES. The data looks roughly like this:

<suggestion>
       <url>http://plesk.uservoice.com/forums/184549-feature-suggestions/suggestions/33450685-easy-change-the-main-website-on-a-subscription</url>
       <path>/forums/184549-feature-suggestions/suggestions/33450685-easy-change-the-main-website-on-a-subscription</path>
       <id type="integer">33450685</id>
       <state>published</state>
       <title>Easy change the main website on a subscription</title>
       <text>Currently the main domain must be renamed to change the main domain. This is cumbersome, especially if e-mail addresses are still assigned to this domain, because they will be also renamed. It would be more convenient if another already existing domain could be set as the main domain, e. g. by clicking on a button "Set as main domain". </text>
       <formatted_text>&lt;div class="typeset"&gt;&lt;p&gt;Currently the main domain must be renamed to change the main domain. This is cumbersome, especially if e-mail addresses are still assigned to this domain, because they will be also renamed. It would be more convenient if another already existing domain could be set as the main domain, e. g. by clicking on a button "Set as main domain". &lt;/p&gt;&lt;/div&gt;</formatted_text>
       <referrer>https://plesk.uservoice.com/forums/xxxxx</referrer>
       <vote_count type="integer">1</vote_count>
       <subscriber_count type="integer">1</subscriber_count>
       <comments_count type="integer">0</comments_count>
       <supporters_count type="integer">1</supporters_count>
       <merged_into_suggestion_id type="integer"/>
       <topic>
         <id type="integer">184549</id>
         <prompt>I suggest you ...</prompt>
         <example>Enter your idea... (in English)</example>
         <votes_allowed type="integer">20</votes_allowed>
         <suggestions_count type="integer">3909</suggestions_count>
         <open_suggestions_count type="integer">1898</open_suggestions_count>
         <closed type="boolean">false</closed>
         <anonymous_access type="boolean">false</anonymous_access>
         <unlimited_votes type="boolean">true</unlimited_votes>
         <classic_voting>false</classic_voting>
         <closed_at type="datetime"/>
         <created_at type="datetime">2012-11-20T07:23:47Z</created_at>
         <updated_at type="datetime">2018-01-25T05:54:41Z</updated_at>
         <forum>
           <id>184549</id>
           <name>Feature Suggestions</name>
         </forum>
       </topic>
       <category>
         <id type="integer">69397</id>
         <name>Plesk (general)</name>
       </category>
       <closed_at nil="true"/>
       <status nil="true"/>
       <creator>
       <id type="integer">185785938</id>
       <name>mai</name>
       <title></title>
       <url>http://plesk.uservoice.com/users/xxxxx</url>
       <avatar_url>https://secure.gravatar.com/xxxxx.png</avatar_url>
       <created_at type="datetime">2016-07-04T12:16:58Z</created_at>
       <updated_at type="datetime">2018-02-26T07:22:42Z</updated_at>
       </creator>
       <response nil="true"/>
       <attachments type="array">
       </attachments>
       <created_at type="datetime">2018-02-26T07:22:43Z</created_at>
       <updated_at type="datetime">2018-02-26T07:22:54Z</updated_at>
           </suggestion>

I'd like to understand how to filter this stream with the grok filter so that only the few fields I need are picked out. I tried several approaches, but none of them worked. Could you point me to some examples?
Thanks.

Igor_Motov · March 2, 2018, 3:51am

Why do you want to use grok specifically rather than the xml filter?

pavuk · March 2, 2018, 4:05am

Probably because I assumed, for some reason, that grok was the tool for this job. I didn't know about the xml filter.

pavuk · March 2, 2018, 4:10am

So it should be something like

filter {
  xml {
    source => "text"
    source => "supporters_count"
    source => "created_at"
  }
}

Is that right?

Igor_Motov · March 2, 2018, 4:18am

I think only the first one should be source. The rest belongs in xpath.

pavuk · March 2, 2018, 4:37am

Nothing changed. Everything is still the same.

Maybe instead of

source => "text"

I should use

source => "suggestion.text"

?

pavuk · March 2, 2018, 4:40am

No, that doesn't work either.
The filter has no effect at all.
Or I'm misunderstanding something...

Igor_Motov · March 2, 2018, 4:50pm

source is the field that contains the XML. That is, source => "text" means the XML is in the text field. If the XML is the entire incoming event, it should be source => "message".

pavuk · March 2, 2018, 5:21pm

OK, but how does that solve my original problem? I don't need everything from this message, only certain fields.

Igor_Motov · March 2, 2018, 8:05pm

filter {
  xml {
    source => "message"
    xpath => ["suggestion/supporters_count/text()", "supporters_count"]
    xpath => ["suggestion/created_at/text()", "created_at"]
    xpath => ["suggestion/text/text()", "text"]
    force_content => true
    force_array => false
    remove_field => [ "message" ]
    store_xml => false
  }
  mutate {
    convert => {
      "supporters_count" => "integer"
      "created_at" => "string"
      "text" => "string"
    }
    replace => {
      "supporters_count" => "%{[supporters_count][0]}"
      "created_at" => "%{[created_at][0]}"
      "text" => "%{[text][0]}"
    }
  }
}

gives me this output:

{
  "@timestamp": "2018-03-02T20:03:09.169Z",
  "@version": "1",
  "supporters_count": 1,
  "host": "my_laptop",
  "created_at": "2018-02-26T07:22:43Z",
  "text": "Currently the main domain must be renamed to change the main domain. This is cumbersome, especially if e-mail addresses are still assigned to this domain, because they will be also renamed. It would be more convenient if another already existing domain could be set as the main domain, e. g. by clicking on a button \"Set as main domain\". "
}

pavuk · March 3, 2018, 3:34am

Thanks a lot for the example, it will be easier to figure things out now.
But it still doesn't work. In Kibana I see the whole array ending up in the suggestions field, while created_at shows the literal %{[created_at][0]}. The same goes for text: %{[text][0]}.

Could this be because the input has codec => "json"?

Also, can the text be lowercased in Logstash, or is that a job for ES? Same question about stripping URLs and e-mail addresses from the text.

Igor_Motov · March 6, 2018, 4:48pm

If the incoming event is XML, the input codec should be plain. I need more information about how your events arrive to work out what is failing. I tested my example on XML contained in a single line.

It depends on how you are going to use the cleaned text. If it's only for search, you can do it in Elasticsearch. If it's for display, it's better done in Logstash.
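
For the Logstash route, a minimal sketch using the mutate filter could look like the following. It assumes the text field already holds a plain string (not an array), and the two patterns are rough illustrations of "URL" and "e-mail address", not complete matchers:

```
filter {
  mutate {
    # gsub takes triples: field name, regular expression, replacement.
    # Both patterns are simplified assumptions; adjust them to your data.
    gsub => [
      # drop anything that looks like an http(s) URL
      "text", "https?://[^ ]+", "",
      # drop anything that looks like user@host.tld
      "text", "[^ @]+@[^ @]+[.][^ @]+", ""
    ]
    # within a single mutate block, gsub is applied before lowercase
    lowercase => [ "text" ]
  }
}
```

Character classes are used instead of backslash escapes so the patterns do not depend on how the config file handles escaping.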

pavuk · March 7, 2018, 5:48am

You can see the API request and its output here: UserVoice API Request · GitHub

I'd like the text to arrive in ES from Logstash already lowercased and cleaned.
Thanks.

Igor_Motov · March 7, 2018, 4:13pm

How are you working with this API?

In that case it makes sense to run everything through the mutate filter in Logstash.

pavuk · March 7, 2018, 4:45pm

What do you mean by "how"?
For now I just want to pick the fields I need out of the output.

Igor_Motov · March 7, 2018, 4:58pm

I meant your http_poller settings.

pavuk · March 8, 2018, 2:49am

Nothing special there:

input {
  http_poller {
    urls => {
      test => {
        # Supports all options supported by ruby's Manticore HTTP client
        method => get
        url => "https://plesk.uservoice.com/api/v1/forums/184549/suggestions.json?sort=newest"
        headers => {
          Accept => "application/json"
          Authorization => "Bearer  xxxxxxxxxx"
        }
      }
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    # schedule => { cron => "* * * * * UTC"}
    schedule => { every => "1h"}
    codec => "json"
  }
}


filter {
  xml {
    ......

 output {
           stdout {
           codec => rubydebug
# elasticsearch {
#   hosts => ["http://xxxxxxxx.com:9200/"]
#   user => elastic
#   password => xxxxxx
##   user => logstash_internal
##   password => xxxxx
#   index => "uservoice"
 }
             }

Igor_Motov · March 8, 2018, 3:53pm

I think codec => "json" is getting in the way here. Also, a single server response may contain several records, so you will most likely need to break it up in two stages: first into individual records with an xml filter and a split filter, and then into fields with a second xml filter.
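
A rough sketch of that two-stage approach is shown here. It assumes the whole response arrives as XML in the message field and that each record is a <suggestion> element somewhere in it; the intermediate field name suggestion_xml is made up for illustration, and the exact paths depend on the real response:

```
filter {
  # Stage 1a: collect every <suggestion> element as its own XML string.
  # "//suggestion" matches the element at any depth, so the (unknown)
  # wrapper elements of the response do not need to be spelled out.
  xml {
    source => "message"
    store_xml => false
    xpath => ["//suggestion", "suggestion_xml"]
  }
  # Stage 1b: turn the array of XML strings into one event per suggestion.
  split {
    field => "suggestion_xml"
  }
  # Stage 2: extract the wanted fields from the single-suggestion XML.
  xml {
    source => "suggestion_xml"
    store_xml => false
    xpath => ["/suggestion/supporters_count/text()", "supporters_count"]
    xpath => ["/suggestion/created_at/text()", "created_at"]
    xpath => ["/suggestion/text/text()", "text"]
    # applied only when this filter succeeds
    remove_field => ["message", "suggestion_xml"]
  }
}
```

The xpath results are still arrays at this point, so the mutate block from the earlier example would follow to flatten and convert them.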

pavuk · March 12, 2018, 7:16am

Well, nothing works. I set codec => "plain".
The config now looks like this:

filter {
  xml {
    source => "message"
    xpath => ["response/suggestion/supporters_count/text()", "supporters_count"]
    xpath => ["response/suggestion/created_at/text()", "created_at"]
    xpath => ["response/suggestion/text/text()", "text"]
    force_content => true
    force_array => false
    remove_field => [ "message" ]
    store_xml => false
  }
  mutate {
    convert => {
      "supporters_count" => "integer"
      "created_at" => "string"
      "text" => "string"
    }
    replace => {
      "supporters_count" => "%{[supporters_count][0]}"
      "created_at" => "%{[created_at][0]}"
      "text" => "%{[text][0]}"
    }
  }
}

As a result, what I see in Kibana is something unreadable: https://screenshots.firefox.com/1AKSZPP6YiozCbWQ/talkkib.plesk.com

Igor_Motov · March 12, 2018, 2:05pm

You completely ignored my advice in the previous post to use two xml filters, and the paths in your xpath expressions are wrong, so the xml filter simply does not match anything and everything stays in the message field.

When I work on a Logstash configuration, I don't usually write a huge config and then look at the result in Kibana. I debug in small pieces, replacing the output with stdout and looking at the input and output of each filter.
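
As an illustration of that workflow, a throwaway pipeline for testing one filter at a time might look like this. The stdin input and the single xpath expression are placeholders; the idea is to paste one record (as a single line) into the terminal and read the resulting event straight back:

```
input {
  # read test records typed or pasted into the terminal, one per line
  stdin { }
}

filter {
  # keep exactly one filter (or one new setting) under test at a time
  xml {
    source => "message"
    store_xml => false
    xpath => ["suggestion/created_at/text()", "created_at"]
  }
}

output {
  # print the full event structure so every field is visible
  stdout { codec => rubydebug }
}
```

Saved under a hypothetical name such as debug.conf, it can be started with bin/logstash -f debug.conf. Once the printed event looks right, the next filter is added and the check repeated.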
