How can I strip text from a document before it goes into the index?

Hi all,

I have created an index of an intranet site that is written in classic ASP, the idea being that a search can be performed across the entire site. The trouble is that the content also includes my VBScript, which sits inside '<% %>' tags. When I first create the index, is there any kind of filter I can apply to say 'do not include anything between the above tags' in the document?

Depending on whether you want to keep the source intact, you can use a regex token filter or an ingest grok processor.
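For the ingest side, here is a rough sketch of what a pipeline could look like. Note that it uses the gsub processor rather than grok, since gsub does a plain regex replace on a field; that swap is my assumption about what fits here. The field name "content" is made up for illustration. The small helper only previews the replacement locally with Python's re module and does not talk to Elasticsearch. Keep in mind that an ingest pipeline changes the stored field itself, so this is the option where the source is not kept intact.

```python
import re


def build_pipeline() -> dict:
    """Body for an ingest pipeline that blanks out <% ... %> blocks."""
    return {
        "description": "Strip classic ASP script blocks before indexing",
        "processors": [
            {
                "gsub": {
                    "field": "content",           # hypothetical field name
                    "pattern": "(?s)<%.*?%>",     # (?s) lets '.' match newlines
                    "replacement": " ",
                }
            }
        ],
    }


def preview_gsub(doc: dict, gsub: dict) -> dict:
    """Local preview of the replacement; returns a copy, input is unchanged."""
    out = dict(doc)
    out[gsub["field"]] = re.sub(gsub["pattern"], gsub["replacement"], doc[gsub["field"]])
    return out


if __name__ == "__main__":
    sample = {"content": "<p>Welcome</p><% Dim x\nx = 1 %><p>Staff directory</p>"}
    gsub = build_pipeline()["processors"][0]["gsub"]
    print(preview_gsub(sample, gsub)["content"])
```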

No, the actual ASP page obviously needs to stay as is for the website, but the script doesn't need to be kept in the indexed document.

I wonder whether Apache Tika (and the Ingest Attachment plugin) would automatically remove that content. I don't believe so, though.

Anyway, to keep the source intact, you can do that with a regex: https://www.elastic.co/guide/en/elasticsearch/reference/6.0/analysis-pattern_replace-tokenfilter.html
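As a minimal sketch of the analysis-side approach, something like the following might work. One caveat: the link above is for the pattern_replace token filter, which runs on individual tokens after tokenizing, so a '<% ... %>' block that spans several tokens would not match there. The sketch therefore uses the pattern_replace character filter instead, which sees the raw text before the tokenizer. The names "strip_asp" and "asp_content" are made up, and the tokenizer and lowercase filter are just assumed defaults. The regex is also checked locally with Python's re module.

```python
import json
import re

# Non-greedy match for one ASP script block; DOTALL lets '.' span newlines.
ASP_BLOCK = re.compile(r"<%.*?%>", re.DOTALL)


def strip_asp_blocks(text: str) -> str:
    """Local check of the regex: replace each <% ... %> block with a space."""
    return ASP_BLOCK.sub(" ", text)


def build_index_settings() -> dict:
    """Index settings with a custom analyzer (all names are hypothetical)."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "strip_asp": {
                        "type": "pattern_replace",
                        "pattern": "(?s)<%.*?%>",
                        "replacement": " ",
                    }
                },
                "analyzer": {
                    "asp_content": {
                        "type": "custom",
                        # Drop script blocks first, then the remaining HTML tags.
                        "char_filter": ["strip_asp", "html_strip"],
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(strip_asp_blocks("<p>Welcome</p><% Dim x\nx = 1 %><p>Staff</p>"))
    print(json.dumps(build_index_settings(), indent=2))
```

Because the character filter only affects analysis, the stored _source still contains the original page text, script included; only the indexed terms leave it out.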
