Partition and Index by Text Field Contents

wpm · March 19, 2025, 7:11pm

I have an index with the following records.

text
The triangle is blue.
The square is red.
There are two red circles.
The triangle is green.

I want to partition them by mentions of color in the text field, meaning that I want to add a color column like so.

text                          color
The triangle is blue.         blue
The square is red.            red
There are two red circles.    red
The triangle is green.        green

The color is determined by running a regular expression. Assume for simplicity's sake that exactly one color is mentioned in every text field. I want to perform a transform with color as the pivot field.

What is the best way to do this?

dadoonet · March 19, 2025, 8:32pm

What do you think of this?

POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_index": "foo", "_source": { "text": "The triangle is blue." } },
    { "_index": "foo", "_source": { "text": "The square is red." } },
    { "_index": "foo", "_source": { "text": "There are two red circles." } },
    { "_index": "foo", "_source": { "text": "The triangle is green." } }
  ],
  "pipeline": {
    "processors": [
      {
        "lowercase": {
          "field": "text",
          "target_field": "tmp"
        }
      },
      {
        "grok": {
          "field": "tmp",
          "patterns": [
            "%{COLOR:color}"
          ],
          "pattern_definitions": {
            "COLOR": "red|green|blue"
          }
        }
      },
      {
        "remove": {
          "field": "tmp"
        }
      }
    ]
  }
}

This gives:

{
  "docs": [
    {
      "doc": {
        "_index": "foo",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "color": "blue",
          "text": "The triangle is blue."
        },
        "_ingest": {
          "timestamp": "2025-03-19T20:32:02.568078692Z"
        }
      }
    },
    {
      "doc": {
        "_index": "foo",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "color": "red",
          "text": "The square is red."
        },
        "_ingest": {
          "timestamp": "2025-03-19T20:32:02.568106186Z"
        }
      }
    },
    {
      "doc": {
        "_index": "foo",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "color": "red",
          "text": "There are two red circles."
        },
        "_ingest": {
          "timestamp": "2025-03-19T20:32:02.568112315Z"
        }
      }
    },
    {
      "doc": {
        "_index": "foo",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "color": "green",
          "text": "The triangle is green."
        },
        "_ingest": {
          "timestamp": "2025-03-19T20:32:02.568117517Z"
        }
      }
    }
  ]
}

wpm · March 19, 2025, 9:12pm

Yes, that's what I'm looking for. I'm still inexperienced with Grok.

How would I make this a runtime mapping instead? (I'm playing around but can't quite get the syntax.)

dadoonet · March 19, 2025, 9:28pm

It's definitely better to do that at index time. I wouldn't use runtime fields for this.
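
As a rough sketch of what the index-time route could look like: store the simulated pipeline, make it the default pipeline for the index, backfill existing documents, then preview a pivot on the new field. The pipeline ID `extract-color` is a made-up name for illustration, the index name `foo` is taken from the simulation above, and `color.keyword` assumes `color` is left to dynamic mapping (text with a keyword subfield). If `color` is mapped explicitly as `keyword`, group on `color` instead.

```
# Store the pipeline from the simulation under a hypothetical ID.
PUT _ingest/pipeline/extract-color
{
  "description": "Copy the color mentioned in text into a color field",
  "processors": [
    { "lowercase": { "field": "text", "target_field": "tmp" } },
    {
      "grok": {
        "field": "tmp",
        "patterns": ["%{COLOR:color}"],
        "pattern_definitions": { "COLOR": "red|green|blue" }
      }
    },
    { "remove": { "field": "tmp" } }
  ]
}

# Run the pipeline on every document indexed into foo from now on.
PUT foo/_settings
{
  "index.default_pipeline": "extract-color"
}

# Backfill documents that were indexed before the pipeline existed.
POST foo/_update_by_query?pipeline=extract-color

# Preview a pivot that groups by color and counts documents per color.
POST _transform/_preview
{
  "source": { "index": "foo" },
  "pivot": {
    "group_by": {
      "color": { "terms": { "field": "color.keyword" } }
    },
    "aggregations": {
      "doc_count": { "value_count": { "field": "color.keyword" } }
    }
  }
}
```

Note that the grok processor fails a document when no pattern matches, so this relies on the assumption from the question that every text field mentions exactly one color.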
