Identifying Significant Words In a Field

Working with the Twitter plugin, there are a couple of fields that contain the tweet text. Is it possible to create a visualization that analyzes the text in such a field and then charts which words are mentioned most?

Working with the Twitter plugin

I assume you mean the Twitter plugin for Logstash.

Yes, this is very possible, and I happen to have an example that does this. The extra part needed is that the Logstash index needs custom mappings that store analyzed text versions of the text fields you're interested in.

See: https://github.com/tsullivan/avocado-pipeline

The custom mappings are specified in the avocado-tweets-wildcard.json file, referenced here: https://github.com/tsullivan/avocado-pipeline/blob/master/tweet-pipeline.conf#L32
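For a rough idea of what those custom mappings do, here is a minimal, self-contained sketch. The index name tweets-example, the field name message, and the truncated stopword list are my own placeholders; the real versions live in the files linked above. The analyzed sub-field gets the custom analyzer, and fielddata is enabled on it so Kibana can run terms aggregations over the individual words:

PUT /tweets-example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": [ "a", "an", "the", "https", "t", "co" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "my_stop_analyzer",
            "fielddata": true
          }
        }
      }
    }
  }
}

With that in place, a Tag Cloud or bar chart in Kibana pointed at message.analyzed will show the most frequent words.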

Thanks for the info. The stopwords configured at the top: I assume that's where you set the words to be excluded from analysis?

Finally got around to working on this, and it works perfectly. I have only had it running for a short time, but is there a need to mutate the field before it's indexed to homogenize the letter case, or is Elasticsearch smart enough to know that This = this = THIS?

Hm, you can add a lowercase filter to the custom analyzer. The documentation on custom analyzers has an example of adding a lowercase filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html#_example_configuration_5

Here's an example I came up with, using the stopword list from the my_stop_analyzer in the pipeline I shared with you:

Make an index with the custom analyzer in its settings:

PUT /cool_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "have", "i", "if", "in", "into", "is", "it", "my", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with", "what", "you", "https", "t", "co", "www", "http", "com"
          ]
        }
      },
      "analyzer": {
        "my_stop_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_stopwords" ]
        }
      }
    }
  }
}

Run a test:

POST cool_example/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "Be here on Sunday Sunday SUNDAY"
}

Elasticsearch returns only lowercased non-stopword tokens:

{
  "tokens": [
    {
      "token": "here",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "sunday",
      "start_offset": 11,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "sunday",
      "start_offset": 18,
      "end_offset": 24,
      "type": "word",
      "position": 4
    },
    {
      "token": "sunday",
      "start_offset": 25,
      "end_offset": 31,
      "type": "word",
      "position": 5
    }
  ]
}
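To get from tokens to a "most-mentioned words" chart, the query underneath a Kibana Tag Cloud or bar chart is a terms aggregation on the analyzed field. A sketch, assuming the hypothetical tweets-example index and message.analyzed sub-field from earlier (fielddata must be enabled on the field for this to work):

POST /tweets-example/_search
{
  "size": 0,
  "aggs": {
    "top_words": {
      "terms": {
        "field": "message.analyzed",
        "size": 20
      }
    }
  }
}

The response buckets are the distinct words with their document counts, which is exactly what the visualization plots.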

Sorry, I forgot to reply to this. Yes, the configured stopwords tell Elasticsearch which words to drop during analysis: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html

Huh, I thought Elasticsearch was just the index and search engine for the data, but it looks like it can do data manipulation as well. So when both Logstash and Elasticsearch can manipulate the data, in this case lowercasing, which one is better or more efficient to use, Logstash or Elasticsearch?
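In Logstash I assume that would be a mutate filter in the pipeline config, something like this (the field name message is just a placeholder):

filter {
  mutate {
    lowercase => [ "message" ]
  }
}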

Personally, I'm not really sure! I'm really more of a Kibana question-answerer.
