We often need to extract special terms from our documents, e.g. phone numbers, Twitter handles, and hashtags. Often this can be done with regular expressions rather than fancy entity-extraction tools. We realize this could be done during ingestion in our Python ingest scripts, but I'd prefer to run this process during re-indexing: if we identify new entities to extract next month, I want to re-index, not re-ingest. We are not using Logstash yet and would prefer to avoid that learning curve unless we have to.
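To make the use case concrete, here is a sketch of the kind of extraction we currently do in our Python scripts (the patterns and field names are just illustrations, and real phone-number patterns would be more involved):

```python
import re

# Illustrative patterns for the kinds of terms mentioned above.
HASHTAG = re.compile(r"#\w+")
HANDLE = re.compile(r"@\w+")

def extract_terms(text):
    """Return all hashtags and handles found in the text, as lists."""
    return {
        "hashtags": HASHTAG.findall(text),
        "handles": HANDLE.findall(text),
    }

doc = "Ping @alice about #elasticsearch and #grok"
print(extract_terms(doc))
# {'hashtags': ['#elasticsearch', '#grok'], 'handles': ['@alice']}
```

Note that `findall` naturally returns a list of every match, which is exactly the array-of-terms behavior I'm hoping to get at indexing time instead.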
So, the question is: how does one complement the grok processor with the ability to take the term(s) it extracts and insert them into a document field containing (most likely) an array of such terms? Thanks.
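For reference, the sort of ingest pipeline I have in mind looks roughly like this (the source field and pattern names here are placeholders, not our actual schema):

```python
# Sketch of an ingest pipeline body using the grok processor. The
# %{PATTERN:name} syntax writes each capture into a field called "name"
# on the document.
pipeline = {
    "description": "Extract special terms during (re)indexing",
    "processors": [
        {
            "grok": {
                "field": "message",
                "patterns": ["%{EMAILADDRESS:email}"],
                "ignore_failure": True,
            }
        }
    ],
}
```

This handles the single-capture case; what I can't see is how to get from one capture per pattern to a field holding all matches.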
Edit: I see now that the grok processor inserts scalar results into fields automatically. What if the grok processor finds multiple results for one expression? Is it smart enough to put those in an array? I see that the Logstash grok filter has a break_on_match option that allows finding multiple matches of the same item, but I don't see this option on the grok processor.