We're relatively new to Elastic and need to better understand how often particular hashtags and Twitter handles appear in paragraphs stored in full-text fields. For example, I'd like a way to locate all the hashtags that appear in a document and place them in a new array in that document. I think that would enable, or at least simplify, processes like aggregations on hashtags, so I could easily generate a report showing the most popular hashtags with a count. Is this possible? Is it the best way to solve the problem? Thanks
In my opinion, the best way to tackle your problem is to collect all the data in one index and process it into a new one with Logstash.
This may help.
https://www.elastic.co/es/elasticon/2015/sf/building-entity-centric-indexes
This sounds like an entity extraction problem and, fortunately, Twitter handles and hashtags are easily identified using a simple regular expression.
I tend to use Python code to prepare docs, but this is a personal choice; Logstash and ingest pipelines are other document-enrichment tools. This question explores the same problem.
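As a rough illustration of the regex approach, here is a minimal Python sketch of preparing a doc before indexing. The patterns, the field names (`text`, `hashtags`, `handles`), and the choice to lowercase and strip the leading `#`/`@` are all assumptions for the example, not anything Elasticsearch requires.

```python
import re

# Assumed patterns: '#' or '@' followed by word characters, and not preceded
# by a word character (so the '@' in an email address is skipped).
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
HANDLE_RE = re.compile(r"(?<!\w)@(\w+)")


def extract_entities(text):
    """Return the unique, lowercased hashtags and handles found in text."""
    hashtags = sorted({tag.lower() for tag in HASHTAG_RE.findall(text)})
    handles = sorted({name.lower() for name in HANDLE_RE.findall(text)})
    return {"hashtags": hashtags, "handles": handles}


def enrich(doc, text_field="text"):
    """Copy the doc and add 'hashtags' and 'handles' arrays next to the text."""
    enriched = dict(doc)
    enriched.update(extract_entities(doc.get(text_field, "")))
    return enriched


if __name__ == "__main__":
    print(enrich({"text": "Loving #Elastic with @elastic, thanks #kibana"}))
```

The enriched dict can then be sent to Elasticsearch with whichever client or bulk loader you already use. Lowercasing is a design choice so that `#Elastic` and `#elastic` count as the same tag; drop it if case matters to you.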
Either way, you should be OK to have a plain doc with your original text field and use a keyword type structured field with an array of the extracted handles or tags. If you want to remember where these handles were extracted from the text it might be an idea to use an annotated_text field instead.
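To make that layout concrete, here is a hedged sketch of what the index mapping and the "most popular hashtags" request could look like, written as plain Python dicts. The field names (`text`, `hashtags`, `handles`) and the aggregation name `popular_hashtags` are made up for the example; the small local helper only mimics what a terms aggregation would report (documents per tag) so you can sanity-check results without a cluster.

```python
from collections import Counter

# Assumed mapping: the original text stays a full-text field, and the
# extracted values go into keyword fields (a keyword field can hold an array).
INDEX_BODY = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "hashtags": {"type": "keyword"},
            "handles": {"type": "keyword"},
        }
    }
}

# Terms aggregation on the keyword field; size 0 skips returning the hits.
TOP_HASHTAGS_QUERY = {
    "size": 0,
    "aggs": {
        "popular_hashtags": {"terms": {"field": "hashtags", "size": 10}}
    },
}


def top_hashtags_locally(docs, size=10):
    """Count how many docs contain each hashtag, most frequent first."""
    counts = Counter(
        tag for doc in docs for tag in set(doc.get("hashtags", []))
    )
    return counts.most_common(size)


if __name__ == "__main__":
    sample = [{"hashtags": ["elastic", "kibana"]}, {"hashtags": ["elastic"]}]
    print(top_hashtags_locally(sample))
```

`INDEX_BODY` would be the body of the create-index request and `TOP_HASHTAGS_QUERY` the body of a search request; both are shown as dicts so they work with whichever client you prefer.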