How to extract terms into a new field while indexing

We often encounter a need to extract special terms from our documents, e.g. phone numbers, Twitter handles, hashtags. Often this can be done with regular expressions rather than fancy entity-extraction tools. We realize this could be done during ingestion in our Python ingest scripts, but I'd prefer to be able to run this process during re-indexing: if we identify new entities to extract next month, I want to re-index, not re-ingest. We are not using Logstash yet and would prefer not to take on that learning curve unless we have to.

So, the question is: how does one complement the grok processor with the capability to take the term(s) it extracts and insert them into a field in the document containing (most likely) an array of such terms? Thanks.
Edit: I see now that the grok processor inserts scalar results into fields automatically. What if the grok processor finds multiple results for one expression? Is it smart enough to put those in an array? I see that the grok filter has a break_on_match option that allows finding multiple matches of the same item, but I don't see this option for the grok processor.

Hey,

please add some more context to your question. First and foremost: are you talking about Elasticsearch or Logstash here? The option you quoted only exists in the Logstash grok filter.

Second, it would be tremendously useful if you could provide a sample document and a sample grok expression, plus the desired output, as this would make it much easier to follow your examples.

Thanks a lot!

--Alex

Hi, thanks for responding:

  • we would prefer not to use Logstash yet, if we can avoid the learning curve of a new tool
  • I have since learned that the grok processor can extract regexp groups into named fields, which is nearly exactly what we need. However, I am unable to figure out how to extract multiple matches of a pattern, e.g. multiple hashtags in the content, and I'm not sure that is possible. If it is possible in Logstash, then we'll learn Logstash

Do you have a sample message/document?

DateTime: 03-01-2019
Body: "I'm watching the #Lakers destroy the #Warriors"

The saved document, I suppose, could look like this:

{
  "DateTime": "03-01-2019",
  "Body": "I'm watching the #Lakers destroy the #Warriors",
  "entities": { "hashtags": ["Lakers", "Warriors"] }
}
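For what it's worth, the transformation itself is easy to prototype outside Elasticsearch. A minimal Python sketch (plain re, no pipeline involved; the function name is mine) of the desired input-to-output mapping:

```python
import re

def extract_hashtags(doc):
    """Return a copy of the document with hashtag terms collected
    into entities.hashtags, matching the desired output above."""
    # The capture group keeps the word after '#' and drops the '#' itself.
    tags = re.findall(r"#(\w+)", doc["Body"])
    return {**doc, "entities": {"hashtags": tags}}

doc = {"DateTime": "03-01-2019",
       "Body": "I'm watching the #Lakers destroy the #Warriors"}
print(extract_hashtags(doc)["entities"])  # {'hashtags': ['Lakers', 'Warriors']}
```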

how about using a script ingest processor?

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "source" : """
// split the message on spaces, keep only the tokens starting with '#'
ctx.tags = Arrays.asList(/ /.split(ctx.message)).stream().filter(s -> s.startsWith("#")).collect(Collectors.toList());
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "I'm watching the #Lakers destroy the #Warriors"
      }
    }
  ]
}

Note that the triple-double-quote (""") syntax is specific to the Kibana console.
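One caveat: this split-on-space approach behaves slightly differently from the desired output sketched earlier. The tokens keep their leading '#', and any trailing punctuation stays attached. A quick Python translation of the script's logic (mine, not anything Elasticsearch ships) makes this visible:

```python
def tags_like_painless(message):
    # Same logic as the Painless script above: split on single spaces,
    # keep tokens that start with '#'.
    return [tok for tok in message.split(" ") if tok.startswith("#")]

print(tags_like_painless("I'm watching the #Lakers destroy the #Warriors"))
# ['#Lakers', '#Warriors']  -- the '#' prefix is retained
```

If the bare terms are wanted instead, the script would need to strip the prefix (e.g. substring from index 1) or match with a regex as in the earlier sketch.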

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.