Tokenizing incoming stream

emanresu · July 25, 2018, 3:54pm

I'd like to tokenize a field as soon as I receive it in Logstash. I understand I can make it lowercase, but is it possible to tokenize it, too?

Badger · July 25, 2018, 4:01pm

What do you mean by tokenize? What does the input field look like, and what result do you want?

emanresu · July 25, 2018, 4:40pm

A typical message, containing entity_name, source_name, and headline, looks like this:

"Thomson Reuters Corp.","Japan Today","Trump claims victory after forcing NATO crisis talks"

I'd like to tokenize the headline "Trump ... talks" into "Trump, claims, victory, after, ..., talks" or something similar. I understand there are various tokenization methods, and Elasticsearch offers many options. My goal is to get this done in Logstash, so I don't have to do it in Elasticsearch. My questions:

Is any full text tokenization feasible in Logstash?
Is doing this in Logstash a good idea?

My ultimate goal is to create parent-child relationships between this news piece and the comments on it, which share only the headline. There is no other relationship between the news pieces and the comments. So I need to tokenize the headline, use the resulting tokens to find the relevant comments in an Elasticsearch index, and establish the parent-child relationship there.

Badger · July 25, 2018, 5:45pm

I would use a dissect filter to split the entity_name and source_name off, then mutate+split to chop the headline up into an array of words. Something like

mutate { split => { "headline" => " " } }

If you want anything more sophisticated than that, do it in Elasticsearch.
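
Putting the two steps together, a minimal sketch of the filter section might look like the following. It assumes the raw line arrives in the default "message" field and that every message has exactly the three double-quoted, comma-separated fields shown in the sample; the field names entity_name, source_name, and headline are taken from the question, not from any existing pipeline.

```
filter {
  # Assumption: the whole line is in "message" and always has the form
  # "entity_name","source_name","headline" with no embedded quotes.
  # The mapping is wrapped in single quotes so the double quotes can be
  # used as literal delimiters.
  dissect {
    mapping => {
      "message" => '"%{entity_name}","%{source_name}","%{headline}"'
    }
  }

  # Split the headline on single spaces, turning the string into an array.
  # For the sample message this should give
  # ["Trump", "claims", "victory", "after", "forcing", "NATO", "crisis", "talks"].
  mutate {
    split => { "headline" => " " }
  }
}
```

This is only whitespace splitting: punctuation stays attached to the words, case is unchanged, and nothing is stemmed or filtered. The split also replaces the original headline string, so if the full text is needed later it would have to be copied into another field in an earlier mutate block.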

system · August 22, 2018, 5:45pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
