The best way for us to help is for you to provide a JSON document with the actual (anonymized) data, i.e. the attachment.content_processed string, and the desired output, or at least a representative sample.
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\\n",
          "replacement": "",
          "ignore_missing": false,
          "description": "Replace multiple newlines",
          "on_failure": [
            {
              "append": {
                "description": "Record error information",
                "field": "_ingestion_errors",
                "value": "Processor 'gsub' with tag 'remove_page_numbers' in pipeline '{{ _ingest.on_failure_pipeline }}' failed with message '{{ _ingest.on_failure_message }}'"
              }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """String with
more than 1
newline """
      }
    }
  ]
}
# Result
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "message": "String with more than 1 newline "
        },
        "_ingest": {
          "timestamp": "2023-12-29T16:27:48.349457039Z"
        }
      }
    }
  ]
}
If the string contains several newlines in a row, the processor should replace them with a single newline. If there is only one newline, it should be kept.
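One way to express that requirement is a gsub processor whose pattern only matches runs of two or more newlines. The request below is a minimal sketch, not a tested pipeline: it assumes the text lives in a field named message (as in the simulation above), that lines end with plain \n characters, and the sample document is made up for illustration.

```console
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "gsub": {
          "field": "message",
          "pattern": "\\n{2,}",
          "replacement": "\n",
          "description": "Collapse runs of two or more newlines into one"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "String with\n\n\nmore than 1\nnewline"
      }
    }
  ]
}
```

With this pattern the three consecutive newlines after "String with" should collapse to one, while the single newline before "newline" is left alone, because {2,} never matches a lone newline. If the data uses Windows line endings, a pattern such as "(\\r?\\n){2,}" may be needed instead.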