How to replace multiple new lines with one in Ingest Pipeline gsub

Bowfish · December 29, 2023, 12:39pm

I want to replace runs of multiple newlines (\n\n+) with a single newline (\n) using a gsub processor in an ingest pipeline.

This is my gsub processor:

  {
    "gsub": {
      "field": "attachment.content_processed",
      "pattern": "\\n\\n",
      "replacement": "\\n",
      "ignore_missing": true,
      "if": "ctx?._replace_bullets == true",
      "tag": "replace_bullets",
      "description": "Replace multiple newlines",
      "on_failure": [
        {
          "append": {
            "description": "Record error information",
            "field": "_ingestion_errors",
            "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
          }
        }
      ]
    }
  },

With this configuration, \n\n is replaced with the literal character n instead of \n.

What is the correct "replacement" string?
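
A likely explanation for the stray n, offered as a sketch rather than a confirmed diagnosis: in JSON, "\\n" decodes to two characters, a backslash followed by n. Used as a pattern, the regex engine reads that pair as the newline escape, so the match works. Used as a replacement, Java-style regex replacement (which the gsub processor is assumed to use here) treats a backslash as "take the next character literally", so the output is a plain n. Writing the replacement as "\n" lets JSON decode it to a real newline character before the processor sees it. The fragment below reuses the field name from the post; the {2,} quantifier is an addition that restricts the match to runs of two or more newlines.

```json
{
  "gsub": {
    "field": "attachment.content_processed",
    "pattern": "\\n{2,}",
    "replacement": "\n",
    "ignore_missing": true
  }
}
```

The if, tag, and on_failure settings from the original processor are left out only to keep the fragment short; they should not affect the escaping.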

stephenb · December 29, 2023, 4:14pm

Hi @Bowfish

The best way for us to help is if you provide a JSON document with the actual (anonymized) data, i.e. the attachment.content_processed string, and the desired output, or at least a representative sample.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\n",
          "replacement": "",
          "ignore_missing": false,
          "description": "Replace multiple newlines",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """String with 
more than 1



newline """
      }
    }
  ]
}
# Result
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": "String with more than 1 newline "
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:27:48.349457039Z"
        }
      }
    }
  ]
}

Bowfish · December 29, 2023, 4:45pm

stephenb:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": "String with more than 1 newline "
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:27:48.349457039Z"
        }
      }
    }
  ]
}

The result of your example should be:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": "String with 
more than 1

newline "
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:38:18.121740582Z"
        }
      }
    }
  ]
}

If the string contains more than one consecutive newline, the run should be replaced with a single newline. A single newline on its own should be kept.

stephenb · December 29, 2023, 4:57pm

Now it is just a regex exercise. I am not a regex expert, but a quick Google search turned up this:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\n+",
          "replacement": "\\\n",
          "ignore_missing": false,
          "description": "Replace multiple newlines",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """String with 
1 newline
Then 2 newlines

Then 3 Newlines



The End """
      }
    }
  ]
}

# Result
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": """String with 
1 newline
Then 2 newlines
Then 3 Newlines
The End """
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:58:37.484845559Z"
        }
      }
    }
  ]
}
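
For readers who want simpler escaping, here is a minimal sketch of a variant that only touches runs of two or more newlines and leaves single newlines unmatched. It is meant as the request body for POST _ingest/pipeline/_simulate. The {2,} quantifier and the plain "\n" replacement are my assumptions about the cleanest way to express the requirement, not something confirmed in this thread, and the sample message is written with \n escapes instead of a triple-quoted string so the body is plain JSON.

```json
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\n{2,}",
          "replacement": "\n",
          "description": "Collapse runs of two or more newlines into one"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "String with\n1 newline\nThen 2 newlines\n\nThen 3 Newlines\n\n\nThe End"
      }
    }
  ]
}
```

With this input the simulated document should come back with one newline between each of the five lines. If the extracted attachment text uses Windows line endings, the pattern would need to account for \r as well, which this sketch does not attempt.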

system · January 26, 2024, 4:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
