I am indexing documents with Elasticsearch, and it's working well. My problem is that some documents contain hyperlinks, and search matches terms inside those links, which I don't want.
I tried adding an html_strip processor to the ingest pipeline to remove the links at ingest time, like this:
PIPELINE = {
    "description": "Extract attachment information",
    "processors": [
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "attachment": {
                        "target_field": "_ingest._value.attachment",
                        "field": "_ingest._value.data",
                        "ignore_missing": 1,
                        "indexed_chars": -1,
                        "ignore_failure": 1
                    }
                }
            }
        },
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "remove": {
                        "field": "_ingest._value.data"
                    }
                }
            }
        },
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "html_strip": {
                        "field": "_ingest._value.attachment"
                    }
                }
            }
        }
    ]
}
This does not work. I re-sent the mapping and re-indexed the attachments, but I still get hits on text inside hyperlinks in the documents. Got any tips?
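For debugging, here is a minimal sketch of how a request body for the simulate pipeline API (`POST /_ingest/pipeline/_simulate`) could be built, so the output of each processor can be inspected without re-indexing. The helper name `build_simulate_body`, the stand-in pipeline, and the sample HTML are assumptions made up for illustration; actually sending the request is left out.

```python
import base64
import json


def build_simulate_body(pipeline, raw_documents):
    """Build a body for POST /_ingest/pipeline/_simulate (hypothetical helper)."""
    # The attachment processor expects base64-encoded content in the source field.
    attachments = [
        {"data": base64.b64encode(raw).decode("ascii")}
        for raw in raw_documents
    ]
    return {
        "pipeline": pipeline,
        "docs": [{"_source": {"attachments": attachments}}],
    }


# Stand-in pipeline for illustration; in practice pass the PIPELINE dict above.
demo_pipeline = {
    "description": "Extract attachment information",
    "processors": [
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "attachment": {
                        "target_field": "_ingest._value.attachment",
                        "field": "_ingest._value.data",
                    }
                },
            }
        }
    ],
}

# Made-up sample document containing a hyperlink.
sample = b'<p>See <a href="http://example.com/secret">this page</a></p>'
body = build_simulate_body(demo_pipeline, [sample])
print(json.dumps(body, indent=2))
```

The simulate response should show the document as it would look after the pipeline runs, which makes it easier to see whether the link text is still present in the extracted attachment fields.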