Extract Hashtags and Mentions into separate fields

prakhar_nigam · December 13, 2021, 1:34pm

I am building a DIY tweet sentiment analyser. I have an index of tweets like these:

"_source" : {
      "id" : 26930655,
      "status" : 1,
      "title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow:  @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
      "hashtags" : null,
      "created_at" : 1622390229,
      "category" : null,
      "language" : 50
    },
    {
      "id" : 22521897,
      "status" : 1,
      "title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳""",
      "hashtags" : null,
      "created_at" : 1620045296,
      "category" : null,
      "language" : 50
    }

The mappings and settings are:

"sentiment-en" : {
    "mappings" : {
      "properties" : {
        "category" : {
          "type" : "text"
        },
        "created_at" : {
          "type" : "integer"
        },
        "hashtags" : {
          "type" : "text"
        },
        "id" : {
          "type" : "long"
        },
        "language" : {
          "type" : "integer"
        },
        "status" : {
          "type" : "integer"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "raw" : {
              "type" : "keyword"
            },
            "raw_text" : {
              "type" : "text"
            },
            "stop" : {
              "type" : "text",
              "index_options" : "docs",
              "analyzer" : "stop_words_filter"
            },
            "syn" : {
              "type" : "text",
              "index_options" : "docs",
              "analyzer" : "synonyms_filter"
            }
          },
          "index_options" : "docs",
          "analyzer" : "all_ok_filter"
        }
      }
    }
  }
}

"settings" : {
      "index" : {
        "number_of_shards" : "10",
        "provided_name" : "sentiment-en",
        "creation_date" : "1627975717560",
        "analysis" : {
          "filter" : {
            "stop_words" : {
              "type" : "stop",
              "stopwords" : [ ]
            },
            "synonyms" : {
              "type" : "synonym",
              "synonyms" : [ ]
            }
          },
          "analyzer" : {
            "stop_words_filter" : {
              "filter" : [ "stop_words" ],
              "tokenizer" : "standard"
            },
            "synonyms_filter" : {
              "filter" : [ "synonyms" ],
              "tokenizer" : "standard"
            },
            "all_ok_filter" : {
              "filter" : [ "stop_words", "synonyms" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_replicas" : "0",
        "uuid" : "Q5yDYEXHSM-5kvyLGgsYYg",
        "version" : {
          "created" : "7090199"
        }
      }

Now the problem: I want to extract all the hashtags and mentions into a separate field.

The output I want:

      "id" : 26930655,
      "status" : 1,
      "title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow:  @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
      "hashtags" : "BTC",
      "created_at" : 1622390229,
      "category" : null,
      "language" : 50
    },
    {
      "id" : 22521897,
      "status" : 1,
      "title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳""",
      "hashtags" : "bulls, bears, ATH, ALTSEASON, BSCGem, eth, btc, memecoin, 100xgems, satyasanatan",
      "created_at" : 1620045296,
      "category" : null,
      "language" : 50
    }

I wanted to try the reindex API, but Ruby code does not work there, and I cannot use a Ruby script in a filter plugin as mentioned in the blog.

Or is there a tokenizer that would be handy in this case? One limitation: I cannot use Logstash here, because I am inserting data using API calls.

cbuescher · December 14, 2021, 10:03am

Try using two separate analyzers for the hashtag / mentions fields. Both would probably need to tokenize on whitespace, but then you can add token filters (e.g. maybe the conditional filter) that only let the hashtag or mention through and discard the rest. Hope that helps.
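
As a rough sketch of the two-analyzer idea (not necessarily the exact setup meant above): both analyzers tokenize on whitespace, and each uses a `predicate_token_filter`, which removes every token for which its script returns false. That filter is swapped in here in place of the conditional filter. The index name `tweets-demo`, the filter and analyzer names, and the `hashtags` / `mentions` sub-fields are made up for illustration. Note that analyzers only change what is indexed; the `_source` of each document stays as it was sent.

```console
# Hypothetical index, filter, analyzer and sub-field names, for illustration only
PUT /tweets-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "only_hashtags": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.getTerm().toString().startsWith('#')"
          }
        },
        "only_mentions": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.getTerm().toString().startsWith('@')"
          }
        }
      },
      "analyzer": {
        "hashtag_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "only_hashtags", "lowercase" ]
        },
        "mention_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "only_mentions", "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "hashtags": { "type": "text", "analyzer": "hashtag_analyzer" },
          "mentions": { "type": "text", "analyzer": "mention_analyzer" }
        }
      }
    }
  }
}
```

With a mapping like this, a `match` query for `#btc` on `title.hashtags` should only hit tweets that carry that hashtag, since the same analyzer is applied to the query text. Tokens keep their `#` or `@` prefix, and punctuation glued to a tag (for example `#BTC,`) stays part of the token.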

prakhar_nigam · December 16, 2021, 6:48am

Thanks Christoph for your answer; it seems like a very feasible approach.
I am very new to Painless scripting. I tried to write something like this to extract hashtags, but it returns all the tokens. Can you point me in the right direction?


GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().toString().startsWith('#')"
      }
    }
  ],
  "text": "Crypto bull run in #BTC #ETH I was 🤓 damn #ICL2020#ICL"
}

Error : 
"caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "method [java.lang.CharSequence, indexOf/1] not found"
    }
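
A likely reason every token comes back: the `condition` filter only decides which tokens the wrapped filters (here `lowercase`) are applied to, and tokens that fail the script still pass through unchanged. The `predicate_token_filter` drops them instead. The error itself says that the term is a `CharSequence`, on which Painless does not find `String` methods such as `indexOf`, so calling `toString()` first, as the script above already does, should avoid it. A sketch of the same request with the filter type swapped:

```console
# Same request as above, with "condition" replaced by "predicate_token_filter"
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": "token.getTerm().toString().startsWith('#')"
      }
    },
    "lowercase"
  ],
  "text": "Crypto bull run in #BTC #ETH I was 🤓 damn #ICL2020#ICL"
}
```

This should leave only the tokens that start with `#`. Because the whitespace tokenizer splits on whitespace only, the glued `#ICL2020#ICL` stays a single token.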

Can I use Java patterns, something like this, in a Painless script to filter words?

TAG_PATTERN = Pattern.compile("(?:^|\\s|[\\p{Punct}&&[^/]])(#[\\p{L}0-9-_]+)")
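
Painless does support regex literals (`/.../`) and `java.util.regex.Matcher`, subject to the `script.painless.regex.enabled` node setting, which in some versions is off by default. Since analyzers never change `_source`, one way to get the values into their own stored fields without Logstash is an ingest pipeline with a `script` processor. A minimal sketch follows; the pipeline name `extract-tags`, the `mentions` field, and the simplified patterns are assumptions for illustration, not the pattern quoted above.

```console
# Hypothetical pipeline name; needs Painless regexes to be enabled on the node
PUT _ingest/pipeline/extract-tags
{
  "description": "Copy #hashtags and @mentions from title into their own fields",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          def tags = [];
          def mentions = [];
          if (ctx.title != null) {
            // Collect the text after each '#'
            def m = /#([\p{L}0-9_-]+)/.matcher(ctx.title);
            while (m.find()) { tags.add(m.group(1)); }
            // Collect the text after each '@'
            m = /@(\w+)/.matcher(ctx.title);
            while (m.find()) { mentions.add(m.group(1)); }
          }
          ctx.hashtags = tags;
          ctx.mentions = mentions;
        """
      }
    }
  ]
}

# Dry run against a shortened sample tweet
POST _ingest/pipeline/extract-tags/_simulate
{
  "docs": [
    { "_source": { "title": "#bulls gonna overtake the #bears soon #ATH coming" } }
  ]
}
```

The pipeline stores the values as arrays rather than a comma-separated string. New documents can be sent through it by adding `?pipeline=extract-tags` to the index request, and existing ones could be backfilled with `POST sentiment-en/_update_by_query?pipeline=extract-tags`.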

system · January 13, 2022, 6:49am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
