Hi there,
Is there a way to prevent the Elasticsearch crawler from indexing the content of and HTML tags and their content?
Specifically, we'd like to remove them from the headings and main_content fields (extracted by default). By default, the tags are stripped, but their contents are kept (which seems odd for these types of tag).
I am aware of the data-elastic-exclude attribute, but in our case, these tags are generated automatically by a framework over which we have limited control. We considered regex, but because the tags are extracted and processed by default, this isn't feasible. If this is not possible, I suppose we could extract the headings into separate fields and apply regex to remove the and content, though we'd prefer to use the existing default fields.
Hi @_Pontes, we usually recommend using ingest pipelines to manipulate the content on ingest: Customize crawler field values using an ingest pipeline | Enterprise Search documentation [8.12] | Elastic
Yes, that's the approach we ended up following: we used a gsub processor step in the pipeline to strip out the undesired CSS.
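For anyone landing here later, a rough sketch of what such a pipeline could look like, written in Python so the regex can be previewed locally. The field names headings and main_content come from the question above; the pattern itself is purely hypothetical (it only matches simple CSS rules such as `.title{color:red}`) and would need adapting to whatever text the framework actually leaves behind. The gsub processor uses Java regular expressions, so the local preview with Python's `re` is only an approximation.

```python
import json
import re

# Hypothetical pattern: matches simple CSS rules like ".title{color:red}".
# Adjust it to the leftover text your own pages produce.
CSS_RULE_PATTERN = r"[.#][\w-]+\s*\{[^{}]*\}"


def build_pipeline(fields):
    """Build an ingest pipeline body with one gsub processor per field."""
    return {
        "description": "Strip leftover CSS text from crawler fields (sketch)",
        "processors": [
            {
                "gsub": {
                    "field": field,
                    "pattern": CSS_RULE_PATTERN,
                    "replacement": "",
                    # Skip documents where the field is absent.
                    "ignore_missing": True,
                }
            }
            for field in fields
        ],
    }


def preview(text):
    """Approximate locally what the gsub processor would do to one string."""
    return re.sub(CSS_RULE_PATTERN, "", text).strip()


if __name__ == "__main__":
    # Field names taken from the question; check them against your index.
    print(json.dumps(build_pipeline(["headings", "main_content"]), indent=2))
    print(preview(".title{color:red} Welcome"))
```

The printed JSON is the kind of body that could be sent to `PUT _ingest/pipeline/<pipeline-name>`; which pipeline name the crawler picks up depends on your setup, so see the documentation linked above for that part.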