Little help needed with crawler content exclusion 7.14

sanderunive · December 7, 2021, 7:16pm

Hi all,

Our company is currently in transition between two CMS systems. We therefore have different implementations on the old and the new CMS to prevent certain elements, mainly the header and footer, from being indexed by the Elastic Crawler.

We are following the crawler documentation on content inclusion and exclusion: Web crawler (beta) reference | Elastic App Search Documentation [7.14]. We currently use version 7.14.1.

So far, what works for us and what does not:

Works:

<div data-elastic-exclude> and its child elements
<footer data-elastic-exclude id="footer"> and its child elements

Does not work:

<header class="u-pb-8" data-elastic-exclude id="header">

Putting an exclude attribute on a top-level element and include attributes only on main

Is there something we are overlooking, or are there extra requirements on the use of these attributes? Should the attribute always be the first one declared on the tag, or something like that?
If you care to help us, please take a look at Motorverzekering premie berekenen en afsluiten - Univé (unive.nl) and share your thoughts on why the header is NOT excluded by the Elastic crawler while the footer is.

Any help is much appreciated!

Best regards and have a nice day,
Sander

oleksiy-elastic · December 7, 2021, 8:02pm

Hello, Sander!

Thank you for trying out our Web Crawler!

I've just checked on the most recent version of the product and I'm fairly certain we exclude both the header and the footer from that page. Here is how I tested it:

$ curl -s -u elastic:changeme "http://localhost:3002/api/as/v0/engines/test/crawler/extract_url" -d '{ "url": "https://www.unive.nl/motorverzekering" }' | jq .

You can see the results here: extraction.json · GitHub

As you can see, the results.extraction.content_fields.body_content does not contain the text from the header section (like Geen winstoogmerk), so we do exclude all of the strings found within the header tag and all of its children.
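
If you save the output of that command to a file, a check like the one below can confirm whether a given header phrase made it into the body content. This is only a minimal sketch: the sample JSON is a made-up, heavily trimmed stand-in for the real response, leaked_phrases is a hypothetical helper name, and the only thing taken from the actual output is the results.extraction.content_fields.body_content path.

```ruby
require 'json'

# Hypothetical, trimmed-down stand-in for a saved extract_url response.
# Only the results.extraction.content_fields.body_content path comes from
# the real output; the sample text is invented for illustration.
sample_response = <<~JSON
  {
    "results": {
      "extraction": {
        "content_fields": {
          "body_content": "Motorverzekering premie berekenen en afsluiten"
        }
      }
    }
  }
JSON

# Return the phrases that still appear in the extracted body content.
def leaked_phrases(response_json, phrases)
  body = JSON.parse(response_json)
             .dig('results', 'extraction', 'content_fields', 'body_content')
             .to_s
  phrases.select { |phrase| body.include?(phrase) }
end

leaked = leaked_phrases(sample_response, ['Geen winstoogmerk'])
if leaked.empty?
  puts 'header text excluded'
else
  puts "header text still indexed: #{leaked.join(', ')}"
end
```

Passing a few strings that only occur in the header or footer makes it easy to re-run the same check after each change to the markup.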

I have not tested the 7.14 release (that's a very old release for such a new and rapidly changing product), but if you are seeing issues with it, I would recommend trying it out in the latest release (7.16.0 is being released today).

Hope this helps.

sanderunive · December 8, 2021, 8:56am

Thanks for checking and letting us know the results.

I checked the results, but there is still a lot of content before the first text in Main (which is our goal). On closer inspection, this is partly because there is some text even before the Header. Sorry for that.

But a lot of text from the Header is also still included in body_content, so I guess there is still something wrong with how we implemented the attributes. See, for example, the buttons and other hidden menu items in the header on the second image.

oleksiy-elastic · December 8, 2021, 4:45pm

One way to fix the issue would be to flip the configuration to include-only. Here is a relevant snippet from the docs:

For all pages that contain HTML tags with a data-elastic-include attribute, the crawler will only index content within those tags.

So, if you remove all of those exclusion rules and just add a data-elastic-include attribute to your <main> tag, the crawler will only index that content on the page, ignoring all the buttons, etc.

sanderunive · December 10, 2021, 12:46pm

Hi, thanks for the suggestion. Originally we used an Exclude at the top, an Include on main and an Exclude on the footer, but that did not work.
We have now tried only an Include on main, with no Excludes, on our test environment. Unfortunately, nothing gets excluded by the ES crawler in that case.

oleksiy-elastic · December 12, 2021, 5:06pm

Ok, I've checked the code and it looks like the docs are wrong: the top-level rule is always an implicit allow. So, to make sure you can exclude absolutely everything, you need to wrap the whole thing in an element with an exclude attribute and then use includes on everything you want to index.

Something like this:

<html>
<body>
<span data-elastic-exclude>

Some control elements here, menu, etc.

<main data-elastic-include>
Your content here
</main>

Your footer here

</span>
</body>
</html>

I'm going to talk to the team to see if we could make it possible to apply the data-elastic-exclude attribute to the body tag or in some other way to change the default, but for now all exclusions have to be explicit.
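
The rule described above can be modelled roughly as follows. This is a simplified sketch and not the crawler's code: Node, el and extract_text are hypothetical names, and the page is built by hand as a tiny tree instead of being parsed from HTML. It only illustrates the idea that everything starts out included, an exclude attribute switches a subtree off, and an include attribute switches it back on.

```ruby
# Hypothetical node type for illustration; not part of the crawler.
Node = Struct.new(:tag, :attrs, :children, keyword_init: true)

# Small helper to build nodes; children are Nodes or plain text Strings.
def el(tag, attrs = [], children = [])
  Node.new(tag: tag, attrs: attrs, children: children)
end

# Collect the text that would be kept. `included` is inherited from the
# nearest ancestor with a data-elastic-* attribute and starts out true,
# which models the implicit top-level allow.
def extract_text(node, included = true)
  return (included ? [node] : []) if node.is_a?(String)

  included = false if node.attrs.include?('data-elastic-exclude')
  included = true  if node.attrs.include?('data-elastic-include')
  node.children.flat_map { |child| extract_text(child, included) }
end

# Include on <main> only: nothing switches the other elements off.
include_only = el('body', [], [
  el('menu', [], ['menu']),
  el('main', ['data-elastic-include'], ['content']),
  el('footer', [], ['footer'])
])

# Everything wrapped in an excluded <span>, with <main> re-included.
wrapped = el('body', [], [
  el('span', ['data-elastic-exclude'], [
    el('menu', [], ['menu']),
    el('main', ['data-elastic-include'], ['content']),
    el('footer', [], ['footer'])
  ])
])

puts extract_text(include_only).join(' ') # menu content footer
puts extract_text(wrapped).join(' ')      # content
```

The first case mirrors what you observed on your test environment: with an include on main alone, there is no surrounding exclude for it to override, so the menu and footer text are still kept.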

P.S. We'll update the docs to make sure they align with the actual rules applied by the crawler. Sorry for the confusion there!

oleksiy-elastic · December 12, 2021, 5:13pm

Here is a test case I've added to cover this specific scenario and to make sure it passes:

    context 'only include the content we want' do
      let(:html) do
        <<-HTML
          <body>
            <span data-elastic-exclude>
              <menu>menu</menu>
              <main data-elastic-include>content</main>
              <footer>footer</footer>
            </span>
          </body>
        HTML
      end

      it 'should return expected document body' do
        expect(document_body).to eq('content')
      end
    end

oleksiy-elastic · December 12, 2021, 5:16pm

Actually, looks like we already support <body> attributes to control the default behaviour! Added this test to the product and it passes:

    context 'using body attributes to set the default behavior' do
      let(:html) do
        <<-HTML
          <body data-elastic-exclude>
            <menu>menu</menu>
            <main data-elastic-include>content</main>
            <footer>footer</footer>
          </body>
        HTML
      end

      it 'should return expected document body' do
        expect(document_body).to eq('content')
      end
    end

This means that in your case you can add the data-elastic-exclude attribute to your body tag and then use the data-elastic-include attribute to indicate the content you want to index.

Sorry for the confusion, I hope this helps.

sanderunive · December 14, 2021, 8:18am

Many thanks for the additional information. Will pass it on internally and follow up. When I have an update I will post it to this topic.

Best regards, Sander

system · January 11, 2022, 8:18am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
