Web crawler body_content should not include the contents of the title tag

Antonio_Gutierrez · July 18, 2022, 8:06pm

When using the web crawler feature, resulting documents have a body_content that starts with the same string as the <title>. I have no idea why.

The documentation (Web crawler reference, Elastic App Search Documentation [8.3]) specifically says that body_content comes from the <body> tag. Having the title prepended screws up the relevancy of snippets when the search term appears in the title.

Carlos_D · July 20, 2022, 12:38pm

Hi @Antonio_Gutierrez ! Could you share the URL or page source, so we can take a closer look?

Thanks!

Antonio_Gutierrez · July 20, 2022, 1:20pm

Open support case: #00994044
Example:

Crawled doc:

The title should not be in the body. The h2 should be, and is as expected.
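
To make the expectation concrete, here is a minimal sketch in Python using only the standard library. It is not the crawler's actual extraction code. The sample page, the `BodyTextExtractor` class, and the `strip_title_prefix` helper are all hypothetical names invented for illustration. The first part shows what "body_content comes from the <body> tag" would mean for a simple page. The second part is one possible client-side workaround, assuming the unwanted text is an exact leading copy of the title.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect <title> text and <body> text separately (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._in_title = False
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only text nodes
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body:
            self.body_parts.append(text)


def extract(html):
    """Return (title, body_text) for an HTML string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)


def strip_title_prefix(body_content, title):
    """Hypothetical workaround: drop a leading copy of the title, if present."""
    if title and body_content.startswith(title):
        return body_content[len(title):].lstrip()
    return body_content


# Hypothetical page, NOT the one from the support case.
SAMPLE = """<html><head><title>Pricing</title></head>
<body><h2>Plans</h2><p>Choose a plan.</p></body></html>"""

title, body = extract(SAMPLE)
# Expected behavior: the body text contains the h2 but not the title.
# A crawled document showing the reported problem would instead start
# with the title, which strip_title_prefix() could remove after the fact.
cleaned = strip_title_prefix(title + " " + body, title)
```

The prefix check is deliberately strict (exact match at the start of the string), so a page whose body legitimately begins with different text is left untouched. It would, however, also strip a heading that happens to repeat the title verbatim, so treat it as a stopgap rather than a fix.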

Carlos_D · July 26, 2022, 9:47am

This has been confirmed as a bug, thanks for raising it!

We'll keep you updated on the fix version. Stay tuned!
