How to use html_strip in an attachment pipeline?

I am indexing documents with Elasticsearch, and it's working well. My problem is that some documents have hyperlinks in them. Search is finding terms in these links, which I don't want.

I tried to add an html_strip processor to the pipeline to remove the links on ingest, like this:

PIPELINE = {
    "description": "Extract attachment information",
    "processors": [
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "attachment": {
                        "target_field": "_ingest._value.attachment",
                        "field": "_ingest._value.data",
                        "ignore_missing": 1,
                        "indexed_chars": -1,
                        "ignore_failure": 1
                    }
                }
            }
        },
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "remove": {
                        "field": "_ingest._value.data"
                    }
                }
            }
        },
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "html_strip": {
                        "field": "_ingest._value.attachment"
                    }
                }
            }
        }
    ]
}

This does not work. I re-sent the mapping and re-indexed the attachments, but I still find hits inside hyperlinks in documents. Got any tips?

Hey,

Please provide a fully reproducible but minimal example. I have created a snippet that works for me, but that won't help you or me :)

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "foreach": {
          "field": "attachments",
          "processor": {
            "html_strip": {
              "field": "_ingest._value.attachment"
            }
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "attachments": [
          {
            "attachment": "This is <b>a test</b>"
          }
        ]
      }
    }
  ]
}

Maybe the structure of your documents is different?

--Alex

Thanks Alex, good point. I'll test this way.
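
Since my pipeline is defined in Python, testing this way could look roughly like the sketch below. It only builds the request body for POST _ingest/pipeline/_simulate from a list of processors and some sample documents; the helper name build_simulate_body is made up for illustration, and actually sending the body to the cluster (with whatever HTTP client is at hand) is left out.

```python
import json


def build_simulate_body(processors, sources):
    """Wrap processors and sample sources in a simulate request body.

    Hypothetical helper: it mirrors the structure of the snippet above,
    i.e. {"pipeline": {"processors": [...]}, "docs": [{"_source": ...}]}.
    """
    return {
        "pipeline": {"processors": processors},
        "docs": [{"_source": source} for source in sources],
    }


# The html_strip step from the pipeline, applied to each attachment.
strip_step = {
    "foreach": {
        "field": "attachments",
        "processor": {"html_strip": {"field": "_ingest._value.attachment"}},
    }
}

body = build_simulate_body(
    [strip_step],
    [{"attachments": [{"attachment": "This is <b>a test</b>"}]}],
)

# Print the JSON that would be sent as the request body.
print(json.dumps(body, indent=2))
```

Swapping in a real document from the index as the sample source would show whether its structure matches what the processors expect.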

I found the issue. I'm using Elasticsearch 6.4. It looks like there is no html_strip processor in this version.
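
For anyone hitting the same thing: a quick way to check this up front is to look at the response of GET _nodes/ingest, which lists the ingest processor types each node has installed. Below is a minimal sketch that inspects such a response once it has been parsed into a dict. The function names and the sample response are made up for illustration (the node IDs and processor lists are not from my cluster), and fetching the response is left to whatever HTTP client you use.

```python
def processors_by_node(nodes_ingest_response):
    """Map each node ID to the set of ingest processor types it reports.

    Assumes the parsed JSON shape of GET _nodes/ingest:
    {"nodes": {"<node id>": {"ingest": {"processors": [{"type": "..."}]}}}}
    """
    result = {}
    for node_id, node in nodes_ingest_response.get("nodes", {}).items():
        processors = node.get("ingest", {}).get("processors", [])
        result[node_id] = {p["type"] for p in processors}
    return result


def nodes_missing(processor_type, nodes_ingest_response):
    """Return the sorted IDs of nodes that do not report the given processor."""
    by_node = processors_by_node(nodes_ingest_response)
    return sorted(n for n, types in by_node.items() if processor_type not in types)


# Hypothetical, heavily shortened response for illustration only.
sample_response = {
    "nodes": {
        "node-a": {"ingest": {"processors": [
            {"type": "attachment"}, {"type": "foreach"}, {"type": "remove"},
        ]}},
        "node-b": {"ingest": {"processors": [
            {"type": "attachment"}, {"type": "foreach"},
            {"type": "remove"}, {"type": "html_strip"},
        ]}},
    }
}

missing = nodes_missing("html_strip", sample_response)
print(missing)
```

If the list is not empty, a pipeline that uses that processor cannot work on those nodes.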
