How to extract terms into a new field while indexing

David9 · April 1, 2019, 3:11pm

We often need to extract special terms from our documents, e.g. phone numbers, Twitter handles, hashtags. Often this can be done with regular expressions rather than fancy entity extraction tools. We realize that this could be done during ingestion in our Python ingest scripts, but I'd prefer to run this process during re-indexing: if we identify new entities to extract next month, I want to re-index, not re-ingest. We are not using Logstash yet and would prefer not to take on that learning curve unless we have to.

So, the question is: how does one complement the grok processor with the ability to take the term(s) it extracts and insert them into a field in the document that contains (most likely) an array of such terms? Thanks.
Edit: I see now that the grok processor inserts scalar results into fields automatically. What if the grok processor finds multiple results for one expression? Is it smart enough to put those in an array? I see that the grok filter has a break_on_match option, which allows finding multiple matches of the same item, but I don't see this option for the grok processor.

spinscale · April 2, 2019, 10:55am

Hey,

Please add some more context about your question. First and foremost: are you talking about Elasticsearch or Logstash here? The option you quoted only exists in the Logstash grok filter.

Second, it would be tremendously useful if you could provide a sample document and a sample grok expression plus the desired output, as this would make it much easier to follow your examples.

Thanks a lot!

--Alex

David9 · April 2, 2019, 12:01pm

Hi, thanks for responding:

We would prefer not to use Logstash yet if we can avoid the learning curve of a new tool.
I have since learned that the grok processor can extract regexp groups into named fields, which is almost exactly what we need. However, I am unable to figure out how to extract multiple matches of a pattern, e.g. multiple hashtags in the content. I'm not sure whether that's possible. If it is possible in Logstash, then we'll learn Logstash.

spinscale · April 2, 2019, 12:08pm

Do you have a sample message/document?

David9 · April 2, 2019, 12:11pm

DateTime: 03-01-2019
Body: "I'm watching the #Lakers destroy the #Warriors"

The saved document, I suppose, could look like this:
{"DateTime": "03-01-2019", "Body": "I'm watching the #Lakers destroy the #Warriors", "entities": {"hashtags": ["Lakers", "Warriors"]}}
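
For reference, here is a minimal Python sketch of the regex-based extraction that would produce that document shape in an ingest script. The pattern and the helper name `extract_hashtags` are assumptions for illustration only; a real hashtag rule may need to be stricter or looser.

```python
import re
from typing import List

# Assumed pattern: '#' followed by one or more word characters.
# The capture group drops the leading '#'.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(body: str) -> List[str]:
    """Return every hashtag found in body, in order, without the '#'."""
    return HASHTAG_RE.findall(body)


doc = {
    "DateTime": "03-01-2019",
    "Body": "I'm watching the #Lakers destroy the #Warriors",
}
# Store all matches as an array under entities.hashtags.
doc["entities"] = {"hashtags": extract_hashtags(doc["Body"])}
print(doc["entities"])  # prints {'hashtags': ['Lakers', 'Warriors']}
```

The point is that `findall` returns all matches as a list, which is the behaviour I am looking for on the Elasticsearch side.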

spinscale · April 3, 2019, 2:09pm

How about using a script ingest processor?

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "source" : """
ctx.tags = Arrays.asList(/ /.split(ctx.message)).stream().filter(s -> s.startsWith("#")).collect(Collectors.toList());
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "I'm watching the #Lakers destroy the #Warriors"
      }
    }
  ]
}

Note that the triple-double-quote syntax (""") is specific to the Kibana Console.
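
To apply this while re-indexing rather than only in a simulation, one possible approach is to store the processor as a named pipeline and reference it from the reindex request through `dest.pipeline`. The sketch below only builds and prints the two request bodies from Python, so the script is written as an ordinary JSON string instead of the triple-quoted form. The pipeline id and index names are hypothetical placeholders, and depending on your Elasticsearch version, regex literals in Painless may have to be enabled via `script.painless.regex.enabled`.

```python
import json

# Hypothetical names, chosen only for illustration.
PIPELINE_ID = "extract-hashtags"
SOURCE_INDEX = "tweets-v1"
DEST_INDEX = "tweets-v2"

# The same Painless one-liner as in the simulate request above,
# written as a plain string so json.dumps can escape it.
SCRIPT = (
    "ctx.tags = Arrays.asList(/ /.split(ctx.message)).stream()"
    '.filter(s -> s.startsWith("#")).collect(Collectors.toList());'
)


def pipeline_body() -> dict:
    """Body for: PUT _ingest/pipeline/<PIPELINE_ID>"""
    return {
        "description": "Collect whitespace-separated tokens starting with '#'",
        "processors": [{"script": {"source": SCRIPT}}],
    }


def reindex_body() -> dict:
    """Body for: POST _reindex, routing documents through the pipeline."""
    return {
        "source": {"index": SOURCE_INDEX},
        "dest": {"index": DEST_INDEX, "pipeline": PIPELINE_ID},
    }


print("PUT _ingest/pipeline/" + PIPELINE_ID)
print(json.dumps(pipeline_body(), indent=2))
print("POST _reindex")
print(json.dumps(reindex_body(), indent=2))
```

If new entities need to be extracted later, only the pipeline definition changes and the reindex can be run again against a new destination index.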

system · May 1, 2019, 2:14pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
