I want to replace all non-breaking space (\u00a0) characters with normal spaces using a gsub processor in an ingest pipeline.
I tried it with this:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"gsub": {
"field": "message",
"pattern": "^[\\\n]+",
"replacement": "",
"ignore_missing": false,
"description": "Replace multiple newlines at the beginning",
"on_failure": [
{
"append": {
"description": "Record error information",
"field": "_ingestion_errors",
"value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
}
}
]
}
},
{
"gsub": {
"field": "message",
"pattern": "\\\u00a0",
"replacement": "",
"ignore_missing": false,
"description": "Replace non breaking spaces",
"on_failure": [
{
"append": {
"description": "Record error information",
"field": "_ingestion_errors",
"value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
}
}
]
}
},
{
"gsub": {
"field": "message",
"pattern": "Page \\d of \\d",
"replacement": "",
"ignore_missing": false,
"description": "Remove Page x of x",
"on_failure": [
{
"append": {
"description": "Record error information",
"field": "_ingestion_errors",
"value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
}
}
]
}
},
{
"gsub": {
"field": "message",
"pattern": "\\\n ",
"replacement": "",
"ignore_missing": false,
"description": "Replace multiple newlines",
"on_failure": [
{
"append": {
"description": "Record error information",
"field": "_ingestion_errors",
"value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
}
}
]
}
},
{
"gsub": {
"field": "message",
"pattern": "[\\\n\\\n]+",
"replacement": "\\\n\\\n",
"ignore_missing": false,
"description": "Replace multiple newlines",
"on_failure": [
{
"append": {
"description": "Record error information",
"field": "_ingestion_errors",
"value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
}
}
]
}
}
]
},
"docs": [
{
"_source": {
"message": "\n\n\n\nMarketing Intern\nMay 2009 - August 2009 (4 months)\nNew York City Metropolitan Area\n\nJump PR\nPublic Relations Intern\n\n Page 2 of 3\n\n\n\n \n\nMay 2008 - August 2008 (4 months)\nNew York City Metropolitan Area\n\nEducation\nUniversity\nBachelor of Arts - BA, Communication and Media Studies · (August 2010 - May\n2014)\n\nSenior High School\n · (September 2006 - June 2010)\n\n Page 3 of 3"
}
}
]
}
However, it doesn't remove the \u00a0 character.
How can I remove it? And how can I optimize the gsub processors so that I get the output shown below?
I would like the output to look like this:
Marketing Intern
May 2009 - August 2009 (4 months)
New York City Metropolitan Area Jump PR
Public Relations Intern
May 2008 - August 2008 (4 months)
New York City Metropolitan Area
Education
University
Bachelor of Arts - BA, Communication and Media Studies · (August 2010 - May 2014)
Senior High School· (September 2006 - June 2010)
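To check the intended substitutions outside Elasticsearch, here is a minimal Python sketch of the same idea. It is only an illustration: `clean_message` is a hypothetical helper, not part of any Elasticsearch API, and Python's `re` module is not the Java regex engine that gsub uses, so behavior may differ in edge cases. It also uses `\d+` instead of `\d` so that multi-digit page numbers match, and it collapses every run of blank lines into a single newline, which is close to, but not exactly, the output above.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def clean_message(text: str) -> str:
    """Hypothetical helper mirroring the gsub steps from the pipeline."""
    # 1. Replace every non-breaking space with a normal space.
    text = text.replace(NBSP, " ")
    # 2. Remove "Page x of y" markers (one or more digits on each side).
    text = re.sub(r"Page \d+ of \d+", "", text)
    # 3. Collapse each newline, together with any whitespace around it
    #    (including further newlines), into a single newline.
    text = re.sub(r"\s*\n\s*", "\n", text)
    # 4. Drop leading and trailing whitespace of the whole message.
    return text.strip()


if __name__ == "__main__":
    sample = "\n\n\u00a0\nMarketing Intern\n\n\u00a0 Page 2 of 3\n\n \nEducation"
    print(repr(clean_message(sample)))  # 'Marketing Intern\nEducation'
```

If this produces the expected text, the same three patterns could be tried as three gsub processors; whether they behave identically there is an assumption that would need to be confirmed with the simulate API.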
Thanks for your help.