Extract Hashtags and Mentions into separate fields

I am building a DIY tweet sentiment analyser, and I have an index of tweets like these:

"_source" : {
      "id" : 26930655,
      "status" : 1,
      "title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow:  @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
      "hashtags" : null,
      "created_at" : 1622390229,
      "category" : null,
      "language" : 50
    },
    {
          "id" : 22521897,
          "status" : 1,
          "title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳""",
          "hashtags" : null,
          "created_at" : 1620045296,
          "category" : null,
          "language" : 50
    }

The mappings and settings are as follows:

"sentiment-en" : {
    "mappings" : {
      "properties" : {
        "category" : {
          "type" : "text"
        },
        "created_at" : {
          "type" : "integer"
        },
        
        "hashtags" : {
          "type" : "text"
        },
        "id" : {
          "type" : "long"
        },
        "language" : {
          "type" : "integer"
        },
        "status" : {
          "type" : "integer"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "raw" : {
              "type" : "keyword"
            },
            "raw_text" : {
              "type" : "text"
            },
            "stop" : {
              "type" : "text",
              "index_options" : "docs",
              "analyzer" : "stop_words_filter"
            },
            "syn" : {
              "type" : "text",
              "index_options" : "docs",
              "analyzer" : "synonyms_filter"
            }
          },
          "index_options" : "docs",
          "analyzer" : "all_ok_filter"
        }
      }
    }
  }




"settings" : {
      "index" : {
        "number_of_shards" : "10",
        "provided_name" : "sentiment-en",
        "creation_date" : "1627975717560",
        "analysis" : {
          "filter" : {
            "stop_words" : {
              "type" : "stop",
              "stopwords" : [ ]
            },
            "synonyms" : {
              "type" : "synonym",
              "synonyms" : [ ]
            }
          },
          "analyzer" : {
            "stop_words_filter" : {
              "filter" : [ "stop_words" ],
              "tokenizer" : "standard"
            },
            "synonyms_filter" : {
              "filter" : [ "synonyms" ],
              "tokenizer" : "standard"
            },
            "all_ok_filter" : {
              "filter" : [ "stop_words", "synonyms" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_replicas" : "0",
        "uuid" : "Q5yDYEXHSM-5kvyLGgsYYg",
        "version" : {
          "created" : "7090199"
        }
      }
    }
Now the problem is that I want to extract all the hashtags and mentions into a separate field.

What I want as output:

"id" : 26930655,
          "status" : 1,
          "title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow:  @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
          "hashtags" : BTC,
          "created_at" : 1622390229,
          "category" : null,
          "language" : 50
        },
        {
              "id" : 22521897,
              "status" : 1,
              "title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳""",
              "hashtags" : bulls,bears,ATH, ALTSEASON, BSCGem, eth , btc, memecoin, 100xGem, satyasanatan
              "created_at" : 1620045296,
              "category" : null,
              "language" : 50
        }

I wanted to try the reindex API, but Ruby code doesn't work there, and I can't use a Ruby script in the filter plugin as mentioned in the blog.

Or is there any tokenizer which could be handy in this case? I have a limitation that I can't use Logstash here; I am inserting data using API calls.
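Since the documents are inserted via API calls rather than Logstash, one workaround is to extract the fields client-side before sending each document to Elasticsearch. A minimal sketch in Python; the regexes and the `enrich` helper are assumptions, not part of any existing pipeline:

```python
import re

# \w covers letters, digits and underscores, so tags like #100xgems are
# captured, while trailing emoji and punctuation are ignored.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")

def enrich(doc):
    """Populate 'hashtags' and 'mentions' from the tweet title before indexing."""
    title = doc.get("title") or ""
    doc["hashtags"] = HASHTAG_RE.findall(title)
    doc["mentions"] = MENTION_RE.findall(title)
    return doc
```

You would call `enrich(doc)` on each tweet just before the bulk/index API call, so the `hashtags` field arrives already populated.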

Try using two separate analyzers for the hashtag / mentions fields. Both would probably need to tokenize on whitespace, but then you can add token filters (e.g. maybe the conditional filter) that only let the hashtag or mention through and discard the rest. Hope that helps.
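A rough sketch of what that could look like, assuming Painless scripting is enabled. Here `predicate_token_filter` keeps only the tokens for which the script returns true; the index, filter, and analyzer names are made up for illustration:

```json
PUT /sentiment-en-tags
{
  "settings": {
    "analysis": {
      "filter": {
        "keep_hashtags": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.term.length() > 1 && token.term.charAt(0) == (char)'#'"
          }
        },
        "keep_mentions": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.term.length() > 1 && token.term.charAt(0) == (char)'@'"
          }
        }
      },
      "analyzer": {
        "hashtags_only": { "tokenizer": "whitespace", "filter": [ "keep_hashtags" ] },
        "mentions_only": { "tokenizer": "whitespace", "filter": [ "keep_mentions" ] }
      }
    }
  }
}
```

The whitespace tokenizer is important here: the standard tokenizer strips `#` and `@` before the filters ever see them.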


Thanks Christoph for your answer; it seems like a very feasible approach.
I am quite new to Painless scripting. I tried to write something like this to extract hashtags, but it results in all the tokens being returned. Can you point me in the right direction?


GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().toString().startsWith('#')"
      }
    }
  ],
  "text": "Crypto bull run in #BTC #ETH I was 🤓 damn #ICL2020#ICL"


}

Error:
"caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "method [java.lang.CharSequence, indexOf/1] not found"
    }
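Two things seem to be going on here, as far as I can tell. First, the `condition` filter only decides whether to apply the wrapped filter (`lowercase` here); every token still passes through, which is why you see all the tokens. Second, `token.term` exposes a `CharSequence` whose whitelisted Painless API is limited, so `String` methods like `startsWith` aren't available (hence the `indexOf/1 not found` error); comparing the first character with `charAt` sidesteps that. A corrected sketch using `predicate_token_filter`, which actually discards non-matching tokens:

```json
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": "token.term.length() > 0 && token.term.charAt(0) == (char)'#'"
      }
    }
  ],
  "text": "Crypto bull run in #BTC #ETH I was 🤓 damn #ICL2020#ICL"
}
```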


Can I use Java Patterns in a Painless script to filter words, something like this?

TAG_PATTERN = Pattern.compile("(?:^|\\s|[\\p{Punct}&&[^/]])(#[\\p{L}0-9-_]+)")

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.