Let's say we are analyzing posts to a sports forum, where the records contain a title, a body (text) field, and keyword fields; a keyword might be the name of the sport the post concerns.
I know how to do aggregations around keywords, so that I could report on all the posts that have the keyword 'tennis'.
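E.g. something like this, which counts posts per sport keyword (assuming my keyword field is literally called "keywords" and the index "posts" - those names are just how I've set things up):

GET posts/_search
{
  "size": 0,
  "aggs": {
    "posts_per_sport": {
      "terms": { "field": "keywords" }
    }
  }
}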
But let's say I want to run the same kind of query when the term is inside the body text field, not in a keyword. So e.g. I want to aggregate on every post that contains the word 'exercise' somewhere in the body field, just as though it were a keyword field even though it isn't.
A related question: would it be possible instead to do an aggregation that shows e.g. the top-20 Twitter handles (any term that starts with '@') mentioned in the text field of all the records? That is, show me who is being mentioned the most in a corpus of documents, where I can also use filters to define what that corpus is, e.g. last month's documents.
Yes. This is best achieved if you prepare the content first, a step known as "entity extraction". Twitter handles are easy to spot and can be extracted with a regular expression into a structured field. So given a doc like this:
{ "text": "Elasticsearch was created by @kimchy"}
You would turn it into:
{ "text": "Elasticsearch was created by @kimchy", "mentions": ["@kimchy"] }
You could then use the terms aggregation or significant_terms aggregation on the mentions field.
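For example, a sketch of your "top-20 handles in last month's posts" request, assuming the mapping above plus a date field (all names illustrative):

GET posts/_search
{
  "size": 0,
  "query": {
    "range": { "date": { "gte": "now-1M/d" } }
  },
  "aggs": {
    "top_mentions": {
      "terms": { "field": "mentions", "size": 20 }
    }
  }
}

The "size": 0 just suppresses the individual hits, so the response contains only the 20 buckets and their counts.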
The entity extraction could be done upstream of Elasticsearch (in Logstash, or in whatever programming language you use to create the docs) or inside Elasticsearch, using an ingest pipeline configured with a processor like grok.
In each case a regular expression typically does the job of spotting the entities of interest (in your case, Twitter handles).
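One caveat: a grok pattern captures only the first occurrence in a field, so to collect every handle in a post you'd want upstream code or something like a script processor instead. A rough sketch of the latter (the pipeline and field names are up to you, and regex in Painless scripts may need to be enabled via the script.painless.regex.enabled setting):

PUT _ingest/pipeline/extract_mentions
{
  "description": "Copy every @handle found in the text field into a mentions array",
  "processors": [
    {
      "script": {
        "source": "def m = /@\\w+/.matcher(ctx.text); def handles = []; while (m.find()) { handles.add(m.group()); } ctx.mentions = handles;"
      }
    }
  ]
}

Docs indexed with ?pipeline=extract_mentions on the index request would then get the mentions field populated automatically.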
Thanks, this is excellent. I've pored over the docs recently looking for a capability (processor? plugin?) that makes it straightforward to 1) extract terms and then 2) write them back to the document. For the second part: does that have to be coded explicitly by hand, or is there some construct that does that for me?
I do most of my doc preparation in Python, so I'm probably not best positioned to comment on Logstash/ingest pipelines. This tool is great for testing and generating regex code to do the parsing.
To get more eyeballs, I suggest opening another issue to discuss "how to extract Twitter handles using Logstash or ingest pipelines".
Good post. My impression of the grok processor from the examples was that it is used for parsing log files with strictly ordered sequences of values in a line. Obviously free text with phone numbers, Twitter handles, etc. is more free-form, so I'm keen to learn whether it can be used for that.
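For instance, I'd want to know whether a custom pattern like this would find a handle buried in the middle of a sentence (just my guess at the syntax, using the simulate API to test):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "text",
          "patterns": ["%{HANDLE:mention}"],
          "pattern_definitions": { "HANDLE": "@\\w+" }
        }
      }
    ]
  },
  "docs": [
    { "_source": { "text": "Elasticsearch was created by @kimchy" } }
  ]
}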