How to replace multiple new lines with one in Ingest Pipeline gsub

I want to replace multiple consecutive newlines (\n\n+) with a single newline (\n) using a gsub processor in an ingest pipeline.

This is my gsub processor:

  {
    "gsub": {
      "field": "attachment.content_processed",
      "pattern": "\\n\\n",
      "replacement": "\\n",
      "ignore_missing": true,
      "if": "ctx?._replace_bullets == true",
      "tag": "replace_bullets",
      "description": "Replace multiple newlines",
      "on_failure": [
        {
          "append": {
            "description": "Record error information",
            "field": "_ingestion_errors",
            "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
          }
        }
      ]
    }
  },

With this configuration, \n\n is replaced with a literal n rather than with \n.

What is the correct "replacement" string?

Hi @Bowfish

The best way for us to help is for you to provide a JSON document with the actual (anonymized) data, i.e. the attachment.content_processed string, and the desired output, or at least a representative sample.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\n",
          "replacement": "",
          "ignore_missing": false,
          "description": "Replace multiple newlines",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """String with 
more than 1



newline """
      }
    }
  ]
}
# Result
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": "String with more than 1 newline "
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:27:48.349457039Z"
        }
      }
    }
  ]
}

The result of your example should be:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": "String with 
more than 1

newline "
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:38:18.121740582Z"
        }
      }
    }
  ]
}

If there is more than one consecutive newline in the string, they should be replaced with a single newline. A single newline on its own should be kept.

Now it is just a regex exercise. I am not a regex expert, but a quick Google search turned up this:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\n+",
          "replacement": "\\\n",
          "ignore_missing": false,
          "description": "Replace multiple newlines",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """String with 
1 newline
Then 2 newlines

Then 3 Newlines



The End """
      }
    }
  ]
}

# Result
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": """String with 
1 newline
Then 2 newlines
Then 3 Newlines
The End """
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:58:37.484845559Z"
        }
      }
    }
  ]
}
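
For anyone wondering why the original "replacement": "\\n" produced a literal n, here is a minimal Java sketch. It assumes the gsub processor applies its pattern the way java.util.regex's Matcher.replaceAll does, which is an assumption about the implementation and not something confirmed in this thread. In a Java replacement string a backslash only escapes the next character, so the decoded \n (backslash plus n) becomes a plain n, whereas a real newline character stays a newline. The class name GsubNewlineDemo, the helper gsub, and the sample text are made up for illustration.

```java
import java.util.regex.Pattern;

class GsubNewlineDemo {
    // Hypothetical stand-in for the gsub processor (assumption: it behaves
    // like Pattern.compile(pattern).matcher(value).replaceAll(replacement)).
    static String gsub(String value, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(value).replaceAll(replacement);
    }

    // Make newlines visible when printing.
    static String show(String s) {
        return s.replace("\n", "<NL>");
    }

    public static void main(String[] args) {
        // Sample text: a, 2 newlines, b, 3 newlines, c, 1 newline, d
        String text = "a\n\nb\n\n\nc\nd";

        // Java literals below decode the same way as the JSON strings.
        // Pattern \n\n matches two newlines, but the replacement \n
        // (backslash + n) is read as an escaped literal 'n'.
        System.out.println(show(gsub(text, "\\n\\n", "\\n")));   // anbn<NL>c<NL>d

        // Replacement is a real newline character (JSON "\n"), and the
        // pattern only matches runs of two or more newlines.
        System.out.println(show(gsub(text, "\\n{2,}", "\n")));   // a<NL>b<NL>c<NL>d

        // Backslash + real newline, as in "\\\n+" / "\\\n" from this thread:
        // the backslash escapes the newline character in both strings.
        System.out.println(show(gsub(text, "\\\n+", "\\\n")));   // a<NL>b<NL>c<NL>d
    }
}
```

If that assumption holds, "pattern": "\\n{2,}" with "replacement": "\n" should also work in the pipeline and touches only runs of two or more newlines. It is worth confirming with _simulate before relying on it.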
