How can I disable content extraction?

maddy30 · March 14, 2024, 4:55pm

I am using App Search-based engines. By default, my crawler extracts the content from web pages and PDFs. But when I run the crawl for one particular App Search engine, I only want the metadata of the web pages and PDFs to be extracted, not their content. How can I achieve this? Any help would be appreciated. Thanks.

Sean_Story · March 14, 2024, 5:56pm

Hi @maddy30 ,

It looks like this might be related to your other question: "How can i update the pipeline used for a app search engine?"

The configurations to extract content from files (like PDFs) are made at a deployment level, not on an engine-by-engine basis. What you could do is add conditionals to your ingest pipeline to run certain processors only if the URL matches a certain domain or pattern.
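As a rough sketch of the conditional approach, the snippet below builds an ingest pipeline body with a `remove` processor that only fires when the document URL contains a given host. The `url` field name, the `docs.example.com` host, and the helper name are assumptions for illustration, so check them against your own documents. Note that this drops the field before indexing; it does not stop the crawler from extracting the content in the first place.

```python
import json

# Hypothetical host; replace with the site that should be indexed as metadata only.
METADATA_ONLY_HOST = "docs.example.com"


def build_conditional_pipeline(host: str) -> dict:
    """Build a pipeline body that removes body_content for URLs on one host."""
    # Painless condition evaluated per document. Assumes the crawled URL is
    # stored as a string in a field named `url`.
    condition = f"ctx.url != null && ctx.url.contains('{host}')"
    return {
        "description": f"Drop body_content for documents crawled from {host}",
        "processors": [
            {
                "remove": {
                    "field": "body_content",
                    "ignore_missing": True,  # do not fail on docs without the field
                    "if": condition,
                }
            }
        ],
    }


pipeline = build_conditional_pipeline(METADATA_ONLY_HOST)
print(json.dumps(pipeline, indent=2))
```

The printed JSON is what you would place in the body of the pipeline definition, or append as an extra processor to the pipeline your engine already uses.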

Alternatively, you can take the approach I suggest in the other post: use a different pipeline per index, and have some pipelines remove the body_content field from your documents before indexing them.
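For the per-index pipeline approach, here is a minimal sketch of a pipeline that always strips `body_content`, wrapped in a request body for the ingest simulate API (`POST /_ingest/pipeline/_simulate`) so you can check the result before attaching it to an index. The sample document and its `title` and `url` fields are made up for illustration, and the helper names are my own.

```python
import json


def build_metadata_only_pipeline() -> dict:
    """Pipeline body that unconditionally drops the extracted page/PDF text."""
    return {
        "description": "Keep metadata only: remove body_content before indexing",
        "processors": [
            {"remove": {"field": "body_content", "ignore_missing": True}}
        ],
    }


def build_simulate_request(pipeline: dict, sample_source: dict) -> dict:
    """Wrap a pipeline and one sample document in a simulate request body."""
    return {"pipeline": pipeline, "docs": [{"_source": sample_source}]}


# Hypothetical crawled document; real field names depend on your engine's schema.
sample = {
    "title": "Annual report",
    "url": "https://example.com/report.pdf",
    "body_content": "extracted text that should not be indexed",
}

request_body = build_simulate_request(build_metadata_only_pipeline(), sample)
print(json.dumps(request_body, indent=2))
```

If the simulated output no longer contains `body_content` but still has the metadata fields you care about, the same pipeline body can be used for the index that should hold metadata only.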

