We often need to extract special terms from our documents, e.g. phone numbers, Twitter handles, hashtags. Often this can be done with regular expressions rather than fancy entity extraction tools. We realize that this could be done during ingestion in our Python ingest scripts, but I'd prefer to run this process during re-indexing: if we identify new entities to extract next month, I want to re-index, not re-ingest. We are not using Logstash yet and would prefer not to take on that learning curve unless we have to.
So, the question is: how does one complement the grok processor with the ability to take the term(s) it extracts and insert them into a field in the document that contains (most likely) an array of such terms? Thanks. Edit: I see now that the grok processor inserts scalar results into fields automatically. What if the grok processor finds multiple results for one expression? Is it smart enough to put those in an array? I see that the grok filter has a break_on_match option, which would allow finding multiple matches of the same item, but I don't see this option for the grok processor.
Please add some more context about your question. First and foremost: are you talking about Elasticsearch or Logstash here? The option you quoted only exists in the Logstash grok filter.
Second, it would be tremendously useful if you could provide a sample document and a sample grok expression plus the desired output, as this would make it much easier to follow your examples.
We would prefer not to use Logstash yet if we can avoid the learning curve of a new tool.
I have since learned that the grok processor can extract regexp groups into named fields, which is very nearly what we need. However, I am unable to figure out how to extract multiple matches of a pattern, e.g. multiple hashtags in the content. I'm not sure whether that's possible. If it is possible in Logstash, then we'll learn Logstash.
A sample document:
DateTime: 03-01-2019
Body: "I'm watching the #Lakers destroy the #Warriors"
I suppose the saved document could look like this:
{"DateTime": "03-01-2019", "Body": "I'm watching the #Lakers destroy the #Warriors", "entities": {"hashtags": ["Lakers", "Warriors"]}}
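To make the desired result concrete, here is a minimal Python sketch of the transformation. It is not a grok or ingest-pipeline solution; it only shows the "find every match and collect them in an array" behaviour being asked for. The pattern `#(\w+)` and the helper name `add_hashtag_entities` are assumptions made up for this illustration.

```python
import re

# Assumed pattern: a '#' followed by one or more word characters.
# The capture group keeps only the text after the '#'.
HASHTAG_RE = re.compile(r"#(\w+)")


def add_hashtag_entities(doc):
    """Return a copy of doc with entities.hashtags filled from doc['Body']."""
    enriched = dict(doc)
    entities = dict(enriched.get("entities", {}))
    # findall returns every non-overlapping match, not just the first one.
    entities["hashtags"] = HASHTAG_RE.findall(enriched.get("Body", ""))
    enriched["entities"] = entities
    return enriched


doc = {
    "DateTime": "03-01-2019",
    "Body": "I'm watching the #Lakers destroy the #Warriors",
}
enriched = add_hashtag_entities(doc)
print(enriched["entities"])  # {'hashtags': ['Lakers', 'Warriors']}
```

The open question remains how to get this same "all matches into an array" result at re-index time, instead of only the first match per named field.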