Correct way to only include specific SharePoint sites in Workplace Search

Robin_Vanderschueren · February 26, 2022, 5:20pm

I've configured Workplace Search following this guide: Customize indexing for a content source | Workplace Search Guide [8.4] | Elastic

This gave me the following rule:

        "default_action": "include",
        "rules": [
            {
                "filter_type": "path_template",
                "exclude": "/sites/ENGINEERING/**/*"
            }
        ]

Strangely enough, this results in no files being indexed from SharePoint. Is there something wrong with my configuration?

Sean_Story · February 28, 2022, 8:50pm

Hi @Robin_Vanderschueren,

Good question. This configuration looks reasonable to me. Can you confirm:

Do you have documents visible to the connected user in sites other than /sites/ENGINEERING/**/*?
Do you get results synced if you remove this rule?
Do you see any errors in your server logs? Could there be some other issue besides this config?
Does the sync complete? If you have MANY documents in /sites/ENGINEERING, it may take Enterprise Search a while to move on to documents that it can index, but that doesn't mean that the job isn't progressing.

If none of the above seems to be helpful, try setting your log level to DEBUG (if you're on Elastic Cloud, you'll need to contact support for help accessing your logs and increasing the verbosity). Then you should be able to find DEBUG log statements about your indexing rules that help you understand why documents are or are not being filtered.

Robin_Vanderschueren · March 1, 2022, 1:16pm

Hey Sean,

Thanks for the response!

The request body above does not fully match my actual API call. This is the complete call I execute:

curl \
--request "PUT" \
--url "https://xxxx.ent.eu-central-1.aws.cloud.es.io/api/ws/v1/sources/xxxxx" \
--header "Authorization: Bearer xxxxxx" \
--header "Content-Type: application/json" \
--data '
{
    "name": "Sharepoint",
    "is_searchable": true,
    "indexing": {
        "default_action": "exclude",
        "rules": [
            {
                "filter_type": "path_template",
                "include": "/sites/ENGINEERING/**/*"
            },
            {
                "filter_type": "path_template",
                "include": "/sites/WAITERSSITEII/**/*"
            }
        ]
    }
}
'

Sean_Story · March 1, 2022, 3:54pm

@Robin_Vanderschueren that makes more sense. My guess is that your paths of /sites/ENGINEERING/**/* and /sites/WAITERSSITEII/**/* aren't matching anything, so the "default_action": "exclude" is applying to every single document. Paths can be non-intuitive. I recommend:

Disable these indexing rules.
Sync the data.
On the search page, open your browser dev tools and look at the document results you get back, particularly their path fields.
Modify your indexing rules accordingly.

Hopefully once you see what the actual paths are for your documents, it will be easier to form up functional include paths.
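
For a larger result set, the inspection step can be scripted. This is a minimal sketch, assuming you have saved the search response from the dev tools as JSON and that each result exposes its path as `path.raw`. The response shape, the sample data, and the `top_level_segments` helper are all assumptions for illustration, not confirmed API details.

```python
import json
from collections import Counter


def top_level_segments(results):
    """Count the first path segment of every result that has a path field."""
    counts = Counter()
    for doc in results:
        path_field = doc.get("path")
        # Assumed shape: {"path": {"raw": "/Shared Documents/specs/a.docx"}}
        path = path_field.get("raw") if isinstance(path_field, dict) else path_field
        if not path:
            continue
        first = path.strip("/").split("/", 1)[0]
        counts[first] += 1
    return counts


# Hypothetical sample standing in for a saved response body.
sample = json.loads("""
{"results": [
  {"path": {"raw": "/Shared Documents/specs/a.docx"}},
  {"path": {"raw": "/Shared Documents/b.xlsx"}},
  {"path": {"raw": "/Site Pages/home.aspx"}}
]}
""")

segments = top_level_segments(sample["results"])
for segment, count in segments.most_common():
    print(f"/{segment}: {count}")
```

The distinct top-level segments it prints are the prefixes your include rules would need to start with.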

Robin_Vanderschueren · March 2, 2022, 6:45pm

@Sean_Story I've looked into it. Could it be that path_template does not work for sites?

It seems that only lists (or folders) are used as the top level of a path. Is there a reason why this is the case?

If this is true, the rule does not seem very useful to me. A list or folder name is not required to be unique across sites, so there is no way to include only one folder or list when several sites contain one with the same name.

Also, most enterprise SharePoint deployments have thousands of folders and lists, so figuring out which lists need to be included seems close to impossible.

Any help is welcome! It could be that I misunderstood something of course.

Sean_Story · March 2, 2022, 9:00pm

Hi @Robin_Vanderschueren ,

the path_template does not work for sites?

To make sure I'm understanding, you're saying that the path_template indexing rule won't meet your needs as it is, because the path field we apply the rule against doesn't include the site name as a prefix? If so, I can confirm. The path as we build it up looks like: "#{parent_folder_path}/#{item.name}", where the site name is not included in the parent_folder_path.
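
To illustrate why a site-prefixed rule matches nothing, here is a rough sketch. It assumes Python's `fnmatch` is a close enough stand-in for the path_template matching, which it may not be in every detail; the `build_path` helper and the sample folder and file names are hypothetical.

```python
from fnmatch import fnmatchcase


def build_path(parent_folder_path, item_name):
    # Mirrors the "#{parent_folder_path}/#{item.name}" form quoted above.
    return f"{parent_folder_path}/{item_name}"


rule = "/sites/ENGINEERING/**/*"

# Hypothetical document: the parent folder path carries no site name.
indexed_path = build_path("/Shared Documents/specs", "design.docx")
# What the rule would need the path to look like in order to match.
expected_path = "/sites/ENGINEERING" + indexed_path

# fnmatchcase is only an approximation of the real template matching.
matches_indexed = fnmatchcase(indexed_path, rule)
matches_expected = fnmatchcase(expected_path, rule)

print(indexed_path, matches_indexed)
print(expected_path, matches_expected)
```

Under this approximation the path as indexed does not match the rule, while the same path with the site prefix does, which is consistent with every document falling through to the default action.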

Your point about folder names not being required to be unique is a fair one. I'll flag this to our product managers. If you have a support relationship with Elastic, I suggest that you file an Enhancement Request as well.

As a workaround, you could:

Create a top-level directory that matches the site name, and put all the other current top-level directories underneath it.
Create a service account "fake user" who only has access to the sites you want indexed, and use this user's credentials to connect from Workplace Search.

