Identifying Significant Words In a Field

Working with the Twitter plugin, there are a couple of fields that contain the tweet text. Is it possible to create a visualization that analyzes the text in such a field and then charts which words are mentioned most?

Working with the Twitter plugin

I assume you mean the Twitter plugin for Logstash.

Yes, this is very possible, and I happen to have an example that does this. The extra part needed is that the Logstash index needs custom mappings that store analyzed text versions of the text fields you're interested in.

See: https://github.com/tsullivan/avocado-pipeline

The custom mappings are specified in the avocado-tweets-wildcard.json file, referenced here: https://github.com/tsullivan/avocado-pipeline/blob/master/tweet-pipeline.conf#L32
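For a rough idea of what those custom mappings do, here is a minimal, self-contained sketch. The index name tweets-example, the field name message, and the truncated stopword list are my own placeholders; the real versions live in the files linked above. The analyzed sub-field gets the custom analyzer, and fielddata is enabled on it so Kibana can run terms aggregations over the individual words:

PUT /tweets-example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": [ "a", "an", "the", "https", "t", "co" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "my_stop_analyzer",
            "fielddata": true
          }
        }
      }
    }
  }
}

With that in place, a Tag Cloud or bar chart in Kibana pointed at message.analyzed will show the most frequent words.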

Thanks for the info. The stopwords configured at the top: I assume that's where you set the words to be excluded from analysis?

Finally got around to working on this, and it works perfectly. I have only had it running for a short time, but is there a need to mutate the field before it's indexed to homogenize the letter case, or is Elasticsearch smart enough to know that This = this = THIS?

Hm, you can add a lowercase filter to the custom analyzer. The documentation on custom analyzers has an example of adding a lowercase filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html#_example_configuration_5

Here's an example I came up with, using the stopword list from the my_stop_analyzer in the pipeline I shared with you:

Make an index with the custom analyzer in its settings:

PUT /cool_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "have", "i", "if", "in", "into", "is", "it", "my", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with", "what", "you", "https", "t", "co", "www", "http", "com"
          ]
        }
      },
      "analyzer": {
        "my_stop_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_stopwords" ]
        }
      }
    }
  }
}

Run a test:

POST cool_example/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "Be here on Sunday Sunday SUNDAY"
}

Elasticsearch returns only lowercased non-stopword tokens:

{
  "tokens": [
    {
      "token": "here",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "sunday",
      "start_offset": 11,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "sunday",
      "start_offset": 18,
      "end_offset": 24,
      "type": "word",
      "position": 4
    },
    {
      "token": "sunday",
      "start_offset": 25,
      "end_offset": 31,
      "type": "word",
      "position": 5
    }
  ]
}
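To get from tokens to a "most-mentioned words" chart, the query underneath a Kibana Tag Cloud or bar chart is a terms aggregation on the analyzed field. A sketch, assuming the hypothetical tweets-example index and message.analyzed sub-field from earlier (fielddata must be enabled on the field for this to work):

POST /tweets-example/_search
{
  "size": 0,
  "aggs": {
    "top_words": {
      "terms": {
        "field": "message.analyzed",
        "size": 20
      }
    }
  }
}

The response buckets are the distinct words with their document counts, which is exactly what the visualization plots.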

Sorry, I forgot to reply to this. Yes, the configured stopwords tell Elasticsearch which words to drop during analysis: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html

Huh, I thought Elasticsearch was just the index and search engine for the data, but it looks like it can do data manipulation as well. So when both Logstash and Elasticsearch can manipulate the data, in this case lowercasing, which one is better or more efficient to use, Logstash or Elasticsearch?
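In Logstash I assume that would be a mutate filter in the pipeline config, something like this (the field name message is just a placeholder):

filter {
  mutate {
    lowercase => [ "message" ]
  }
}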

Personally, I'm not really sure! I'm really more of a Kibana question-answerer.
