Elastic App Search Crawler

_Pontes · January 24, 2024, 9:07am

Hi there,
Is there a way to prevent the Elasticsearch crawler from indexing the content of and HTML tags?
Specifically, we'd like to remove them from the headings and main_content fields (extracted by default). By default, the tags are stripped but their contents are kept, which seems odd for these types of tags.
I am aware of the data-elastic-exclude attribute, but in our case these tags are generated automatically by a framework over which we have limited control. We considered regex, but because the tags are extracted and processed by default, this isn't feasible. If this is not possible, I suppose we could extract the headings into separate fields and apply regex to remove the and content, though we'd prefer to use the existing default fields.

Sander_Philipse · January 26, 2024, 8:03pm

Hi @_Pontes, we usually recommend using ingest pipelines to manipulate the content on ingest: Customize crawler field values using an ingest pipeline | Enterprise Search documentation [8.12] | Elastic

_Pontes · January 29, 2024, 8:37am

Yes, that's the approach we ended up following: we used a gsub step in the pipeline to strip out the undesired CSS.
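
For anyone landing here later, below is a minimal sketch of what such a gsub step could look like. It is not the exact pipeline used in this thread. The CSS_RULE pattern and the sample document are assumptions made up for illustration, and the field names are simply the ones mentioned in the first post. The helper only previews the substitutions locally with Python's re module; Elasticsearch evaluates gsub patterns with Java regular expressions, so the result should be checked against the real pipeline as well.

```python
import re

# Hypothetical pattern: a CSS rule that leaked into the text, such as
# " .css-1x2y{color:red}". Adjust it to whatever your framework emits.
CSS_RULE = r"\s*[.#][\w-]+\s*\{[^}]*\}"


def build_pipeline(fields, pattern=CSS_RULE):
    """Build an ingest pipeline body with one gsub processor per field."""
    return {
        "description": "Strip leftover CSS text from crawler fields",
        "processors": [
            {
                "gsub": {
                    "field": field,
                    "pattern": pattern,
                    "replacement": "",
                    "ignore_missing": True,
                }
            }
            for field in fields
        ],
    }


def preview(pipeline, doc):
    """Apply the gsub processors to a sample document locally.

    Like the gsub processor, this handles both a single string and an
    array of strings. It is only an approximation of the real pipeline.
    """
    out = dict(doc)
    for processor in pipeline["processors"]:
        cfg = processor["gsub"]
        value = out.get(cfg["field"])
        if value is None:
            continue  # mirrors ignore_missing
        if isinstance(value, list):
            out[cfg["field"]] = [
                re.sub(cfg["pattern"], cfg["replacement"], item) for item in value
            ]
        else:
            out[cfg["field"]] = re.sub(cfg["pattern"], cfg["replacement"], value)
    return out


# Field names taken from the first post; yours may differ.
PIPELINE = build_pipeline(["headings", "main_content"])

# Made-up sample document, for illustration only.
SAMPLE = {
    "headings": ["Welcome .css-1x2y{color:red;margin:0}"],
    "main_content": "Intro text .btn{padding:4px} more text",
}

CLEANED = preview(PIPELINE, SAMPLE)
```

The PIPELINE dict is the request body you would send to PUT _ingest/pipeline/<your-pipeline-name>, where the pipeline name is whichever one your crawler index is configured to use.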

system · February 26, 2024, 8:37am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
