How to replace Unicode \u00a0 with a space in an ingest pipeline processor

I want to replace all non-breaking space (\u00a0) characters with normal spaces using a gsub processor in an ingest pipeline.

I tried it with this:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "^[\\\n]+",
          "replacement": "",
          "ignore_missing": false,
          "description": "Replace multiple newlines at the beginning",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      },
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\u00a0",
          "replacement": "",
          "ignore_missing": false,
          "description": "Replace non breaking spaces",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      },
      {
        "gsub": {
          "field": "message",
          "pattern": "Page \\d of \\d",
          "replacement": "",
          "ignore_missing": false,
          "description": "Remove Page x of x",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      },
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\n ",
          "replacement": "",
          "ignore_missing": false,
          "description": "Replace multiple newlines",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      },
      {
        "gsub": {
          "field": "message",
          "pattern": "[\\\n\\\n]+",
          "replacement": "\\\n\\\n",
          "ignore_missing": false,
          "description": "Replace multiple newlines",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "\n\n\n\nMarketing Intern\nMay 2009 - August 2009 (4 months)\nNew York City Metropolitan Area\n\nJump PR\nPublic Relations Intern\n\n  Page 2 of 3\n\n\n\n   \n\nMay 2008 - August 2008 (4 months)\nNew York City Metropolitan Area\n\nEducation\nUniversity\nBachelor of Arts - BA, Communication and Media Studies · (August 2010 - May\n2014)\n\nSenior High School\n · (September 2006 - June 2010)\n\n  Page 3 of 3"
      }
    }
  ]
}

but it doesn't remove the \u00a0 character.

How can I remove it? And how can I optimize the gsub processors so that I get the output shown below?

I would like the output to look like this:

Marketing Intern
May 2009 - August 2009 (4 months) 
New York City Metropolitan Area Jump PR

Public Relations Intern  
May 2008 - August 2008 (4 months) 
New York City Metropolitan Area

Education
University
Bachelor of Arts - BA, Communication and Media Studies · (August 2010 - May 2014)
Senior High School· (September 2006 - June 2010)

Thanks for your help

Hi @Bowfish

It seems like you have an additional \ in your regex. Maybe you can give this a try:

"pattern": "^[\\\n]+" -> "pattern": "^[\\n]+"

"pattern": "\\\u00a0" -> "pattern": "\\u00a0",
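For anyone who wants to check the non-breaking space part in isolation, here is a minimal simulate request, offered as a sketch rather than a tested fix for the whole pipeline. It assumes the gsub processor hands the pattern to Java's regex engine, where \u00a0 (written as "\\u00a0" in JSON) matches the non-breaking space. The replacement is set to a single space because the question asks for normal spaces, whereas the original processor used an empty string. The sample message is made up for illustration.

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\u00a0",
          "replacement": " ",
          "description": "Replace non-breaking spaces with regular spaces"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "New\u00a0York\u00a0City"
      }
    }
  ]
}
```

In the docs section the single-backslash "\u00a0" is a JSON escape, so the test message really contains the non-breaking space character. If the processor works as assumed, the simulated document should come back with regular spaces in message.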
