Web crawler body_content should not include the contents of the title tag

When using the web crawler feature, the resulting documents have a body_content that starts with the contents of the <title> tag. I have no idea why.

The Web crawler reference in the Elastic App Search documentation [8.3] specifically says that body_content comes from the <body> tag. Having the title prepended screws up the relevance of snippets when the search term appears in the title.
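Until this is resolved, one possible client-side workaround is to strip the title from the start of body_content before showing snippets. The following is only a minimal sketch, assuming each crawled document is available as a dict with `title` and `body_content` fields; the helper name `strip_title_prefix` and the sample document are hypothetical and not part of App Search.

```python
def strip_title_prefix(doc: dict) -> dict:
    """Return a copy of doc whose body_content no longer starts with the title.

    Hypothetical helper: assumes the document has `title` and `body_content`
    string fields, as described in this thread.
    """
    title = (doc.get("title") or "").strip()
    body = doc.get("body_content") or ""
    # Only strip when the body really begins with the title text.
    if title and body.startswith(title):
        body = body[len(title):].lstrip()
    return {**doc, "body_content": body}


# Made-up document for illustration, not real crawler output.
doc = {
    "title": "Example page",
    "body_content": "Example page Section heading Some body text.",
}
print(strip_title_prefix(doc)["body_content"])  # Section heading Some body text.
```

Note that a page whose body legitimately begins with the same text as its title (for example an h1 that repeats the title) would also be trimmed, so this is a stopgap rather than a fix.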

Hi @Antonio_Gutierrez! Could you share the URL or page source, so we can take a closer look?

Thanks!

Open support case: #00994044
Example:

Crawled doc:

The title should not be in the body. The h2 should be, and is as expected.
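To make the expectation concrete, here is a rough sketch of body-only text extraction using Python's standard `html.parser`. It is not how the crawler is implemented; the class name `BodyTextExtractor`, the function `body_text`, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text that appears between <body> and </body> only."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        text = data.strip()
        # Text outside <body> (such as the <title>) is ignored.
        if self.in_body and text:
            self.chunks.append(text)


def body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page, not the one from the support case.
page = """<html><head><title>Page title</title></head>
<body><h2>Section heading</h2><p>Some body text.</p></body></html>"""
print(body_text(page))  # Section heading Some body text.
```

With this kind of extraction the h2 text is present and the title text is not, which is the behavior I expected from body_content.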

This has been confirmed as a bug, thanks for raising it!

We'll keep you updated on the fix version. Stay tuned!
