Tokenizing incoming stream

I'd like to tokenize a field as soon as I receive it in Logstash. I understand I can make it lowercase but it is possible to tokenize it, too?

What do you mean by tokenize? What does the input field look like, and what result do you want?

A typical message containing, entity_name, source_name, headline, looks like this:

"Thomson Reuters Corp.","Japan Today","Trump claims victory after forcing NATO crisis talks"

I'd like to tokenize the headline:"Trump ...talks" to " Trump, claims, victory, after, ..., talks" or similar to it. I understand there are various tokenization methods; and Elasticsearch offers many options. My goal is to get this done in Logstash, so I don't have to do it in Elasticsearch. My questions:

  1. Is any full text tokenization feasible in Logstash?
  2. Is doing this in Logstash a good idea?

My ultimate goal is to create parent child relationships between this news piece and comments on it which only share the headline. There is no other relationship between the news pieces and comments except for headline. So I need to tokenize the headline and use the results to find the relevant comments in an index in Elasticsearch; and establish the parent-child relationship in Elasticsearch.

I would use a dissect filter to split the entity_name and source_name off. Then mutate+split to chop the words of headline up into an array. Something like

mutate { split => { "headline" => " " } }

If you want anything more sophisticated than that then do it in elasticsearch.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.