Noise Word handling on ingest

TWhitehouse · January 21, 2020, 11:30am

Hi All

One of my current use cases is to take input from a SQL DB of service desk ticket subjects and descriptions, and be able to show words that are trending (meaning we have a potential problem).

So far I've imported using Logstash, and used some mutate filters to lowercase the data and concatenate it into a single field, then remove special characters, copy the text into another field and then split that into an array.
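For readers following along, the preprocessing described above (lowercase, concatenate, strip special characters, split into an array) amounts to roughly the following. This is a Python illustration of the transformation only, not the poster's Logstash config; the output field names `ticket_text` and `ticket_words`, and the rule that anything other than letters, digits and whitespace counts as a special character, are assumptions.

```python
import re


def prepare_ticket(subject, description):
    """Mimic the described pipeline: concatenate, lowercase, strip, split."""
    # Concatenate both inputs into one string and lowercase it.
    text = f"{subject} {description}".lower()
    # Replace anything that is not a letter, digit or whitespace with a space
    # (assumed definition of "special characters").
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = text.split()
    return {
        "ticket_text": " ".join(words),  # full text, destined for a text field
        "ticket_words": words,           # array copy, destined for a keyword field
    }
```

Note that stripping punctuation this way also breaks up contractions ("can't" becomes "can" and "t"), which adds to the noise words discussed below.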

However, where I'm falling down is indexing that data into ES and getting my required output.

My ideal output is a count of terms with all the noise words filtered out so that, over time, we can track which words are trending on our tickets. I'm also looking to visualise this information within Kibana (which I think limits the APIs I can use in ES).

What I've got so far is indexing the full text into a text field, and indexing the same text (but split into an array via Logstash) as a keyword field. However, I end up with a load of noise words in there.

I'm looking for the best way to get rid of those noise words. I tried using an analyzer and a normalizer, but I'm not sure I've done it right, and normalizers don't allow you to use the stop word filter.

I appreciate this is a bit rambly, but any advice anyone can give would be well received.

Many Thanks

Mark_Harwood · January 21, 2020, 11:41am

A static set of noise words is possible to configure but can be lengthy and won't adapt to changes in content over time.
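As a rough sketch of what a static noise-word configuration could look like, the function below builds a create-index request body with a custom analyzer that chains the built-in `stop` filter (English stop words by default) with a second, site-specific stop list. The names `ticket_analyzer`, `ticket_stop` and `ticket_text` are hypothetical, and the extra words are placeholders, not a recommended list.

```python
def build_index_body(extra_stopwords):
    """Return a create-index body whose text field drops noise words."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Site-specific noise words on top of the built-in English list.
                    "ticket_stop": {
                        "type": "stop",
                        "stopwords": list(extra_stopwords),
                    }
                },
                "analyzer": {
                    "ticket_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so stop word matching is case-insensitive.
                        "filter": ["lowercase", "stop", "ticket_stop"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "ticket_text": {"type": "text", "analyzer": "ticket_analyzer"}
            }
        },
    }
```

Analyzers like this apply to `text` fields only; a `keyword` field takes a normalizer, which works on the whole value rather than on individual tokens and so cannot drop stop words.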

We do have an aggregation designed to spot anomalies when compared to a background set of documents (e.g. today's docs vs last week's). This thread might be of interest, but note the major caveat - this diffing analysis is not possible if you use time-based indices and the content for "today" is on a machine remote from the rest-of-time content which you want to compare against.
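A minimal sketch of such a foreground-vs-background comparison, assuming the `significant_text` aggregation is the one meant here: the query selects the recent documents, and `background_filter` narrows the comparison set to a longer window that includes them. The field names passed in, the time windows and the sampler size are assumptions to adjust for your own data.

```python
def build_trend_query(text_field, time_field, recent="now-1d", background="now-8d"):
    """Build a search body comparing recent docs against a longer background."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"range": {time_field: {"gte": recent}}},
        "aggs": {
            "sample": {
                # Wrapping in a sampler limits how much text is re-analysed.
                "sampler": {"shard_size": 500},
                "aggs": {
                    "trending": {
                        "significant_text": {
                            "field": text_field,
                            "filter_duplicate_text": True,
                            # Background set: the wider window, which also
                            # contains the foreground documents.
                            "background_filter": {
                                "range": {time_field: {"gte": background}}
                            },
                        }
                    }
                },
            }
        },
    }
```

Because both sets are read from the same index in one request, this only works when the recent and historical documents are searchable together, which is the caveat raised above.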

TWhitehouse · January 21, 2020, 1:38pm

Thanks for your reply Mark
I've looked into this and managed to get the shingles analyzer working; however, I still fall short of being able to visualise any of this within Kibana, which is where the usefulness is going to be.

Is significant_text the only way to go? Or should I go back to my original idea of splitting the event into an array of single words? (The only issue is that I'm then back to my original problem of how to remove noise words, either on ingest or when querying a keyword field.)

Mark_Harwood · January 21, 2020, 1:58pm

There's an outstanding issue for that but no progress sadly.

You can use the significant_terms aggregation in Kibana, but it is expensive to use on text fields in large data sets (it requires the fielddata setting, which loads all of the field's text into RAM). The significant_text aggregation is more efficient but lacks Kibana support. How many docs do you have (per day and as historical backlog)?
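For reference, a sketch of the fielddata route, under the assumption of a single analyzed text field (called `ticket_text` in the usage note, a hypothetical name): the first function builds the mapping that enables `fielddata`, the second builds a `significant_terms` request over recent documents. The time field and window are likewise assumptions.

```python
def build_fielddata_mapping(text_field):
    """Mapping body enabling fielddata on a text field (memory-hungry)."""
    return {"properties": {text_field: {"type": "text", "fielddata": True}}}


def build_significant_terms_query(text_field, time_field, recent="now-1d", size=20):
    """Search body asking for terms unusually common in recent documents."""
    return {
        "size": 0,  # aggregation only
        "query": {"range": {time_field: {"gte": recent}}},
        "aggs": {
            "trending": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }
```

For example, `build_fielddata_mapping("ticket_text")` gives a body that could be sent to the index's mapping endpoint, after which the same aggregation can be selected in a Kibana visualisation.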

TWhitehouse · January 21, 2020, 2:01pm

Hi Mark
So it's around 400-500 a day.
I could look at the last year, last 90 days etc. I've not really decided yet.

Mark_Harwood · January 21, 2020, 2:19pm

That's probably a manageable number for using significant_terms with fielddata configured.
Caution is advised - try it in a test environment first to make sure memory overheads are not too high, and also assume that future versions of Elasticsearch might not support this configuration style now that we have the significant_text agg.

One other approach may be to set up a cron job running the significant_text aggregation every day and outputting the results into a new "trends" index for the visualisations in Kibana. I used to have something similar looking for trending topics on Twitter and publishing the results as events on a calendar or as clusters of related topics.
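The core of such a daily job could be as small as the sketch below, which turns an aggregation response into one flat document per trending term, ready to be indexed into the "trends" index. Sending the search and indexing the results are left out; the aggregation name `trending` and the output field names are assumptions, while `key`, `doc_count` and `score` are the bucket fields these aggregations return.

```python
from datetime import date


def to_trend_docs(response, run_date, agg_path=("trending",)):
    """Flatten significant-term buckets into documents for a trends index."""
    node = response["aggregations"]
    # Walk down nested aggregations, e.g. ("sample", "trending") when the
    # aggregation was wrapped in a sampler.
    for name in agg_path:
        node = node[name]
    return [
        {
            "date": run_date.isoformat(),
            "term": bucket["key"],
            "doc_count": bucket["doc_count"],
            "score": bucket["score"],
        }
        for bucket in node["buckets"]
    ]
```

Each resulting document is an ordinary record with a date and a keyword-like term, so Kibana can chart it with standard date histogram and terms aggregations.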

TWhitehouse · January 21, 2020, 2:21pm

I'll give that a go for sure.
Thanks very much for your help and time Mark. It's appreciated.

Thanks

Mark_Harwood · January 21, 2020, 2:35pm

No worries.
As an example of up-to-date text analysis here's what's trending in my RSS news feeds for the last 24 hours:

[Image: Kibana screenshot]

I use a single-shard index of news headlines indexed with shingles, and above are the significant_text results for a date range of the last 24 hours (750 new docs). The top terms are then used in a second query, passed as filters to the adjacency_matrix aggregation, to reveal the clusters of related topics rather than presenting them as a flat list. To use Kibana's Graph visualisation, your trend-spotting script can output these topic relationships as documents containing arrays of terms that are related.
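A rough sketch of that second step, assuming the top terms are already in hand: one named `term` filter per trending term is passed to `adjacency_matrix`, and the pairwise buckets in the response (keyed as `termA&termB` with the default separator) are turned into documents holding arrays of related terms. The aggregation name `related` and the output field names are assumptions.

```python
def build_adjacency_query(text_field, terms):
    """Search body with one named filter per trending term."""
    filters = {term: {"term": {text_field: term}} for term in terms}
    return {
        "size": 0,  # aggregation only
        "aggs": {"related": {"adjacency_matrix": {"filters": filters}}},
    }


def to_edges(buckets, separator="&"):
    """Keep only the pairwise buckets and emit them as related-term docs."""
    edges = []
    for bucket in buckets:
        parts = bucket["key"].split(separator)
        # Single-name buckets count one term on its own; skip those.
        if len(parts) == 2:
            edges.append({"terms": parts, "doc_count": bucket["doc_count"]})
    return edges
```

This assumes no term contains the separator character itself; shingled terms contain spaces, which is fine.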

system · February 18, 2020, 2:35pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
