Confluence connector - how to exclude some spaces

Hi,

We have a Confluence Cloud site to index. It has 3 spaces, and we only want the pages from one of them, and definitely not the user pages.
But I can't seem to get the indexing exclude rules to work.
I've tried many combinations. Each time, I make sure the full sync runs, and I also run the deletions sync.

For example, I put in this set of rules:


Include /wiki/spaces/goodspace/overview/**/*
Include /wiki/spaces/goodspace/pages/**/*
Exclude /**/*

Yet when syncing is finished, and I look at
Enterprise Search > Workplace Search > Sources > Confluence > Content,
I still see results that we want to exclude, such as


https://xyz.atlassian.net/wiki/spaces/BADSpace/pages/67713778/blahblah

Am I doing it wrong? Thank you again

Hi @tonyfam ,

If you make a change in indexing rules, it does not apply to documents that were synced previously. You'd have to delete them from Workplace Search. A deletions sync will not do this; it only removes documents that were deleted from Confluence.

Hi Irina,
Can you please say what "delete them from Workplace Search" means? Can I do it without deleting the connection?

i.e., is "Remove the Content Source" the only way, and if I do that, will I have to set up the connection again (which was tricky for us, but we finally got it)?

Thank you

Hi @tonyfam,

Scratch what I said above - I meant deleting the document via API, but that was wrong advice, sorry. Since this is not a custom source, you can't delete documents via API.

Also yes, the documents that were already synced but don't match the new rules should be deleted during the next full or incremental sync. As for why this didn't happen, it must be something about the paths that you're configuring.

I'll take a closer look, and I'll get back to you.

Hi @tonyfam,

So when the documents are synced by Workplace Search, they are indexed into an Elasticsearch index with a name that looks like this:

.ent-search-engine-documents-{SOURCE_TYPE}-{SOURCE_ID}

If you look at the raw documents in Kibana Dev Tools, they would have an attribute called path. This is what the rules are trying to match.

I would advise you to take a look at the documents in your index, and see what the path is set to:

GET .ent-search-engine-documents-{SOURCE_TYPE}-{SOURCE_ID}/_search

Based on this, you can build your include / exclude mask, or maybe see what's wrong with your current mask and why it's not matching. I suspect that path may not start with a /.
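
To illustrate why a missing leading / would matter, here is a small Python sketch. It uses the standard library's fnmatchcase as a stand-in glob matcher. That is an assumption: the matcher Workplace Search uses may treat ** differently. The two sample path values are made up for illustration.

```python
from fnmatch import fnmatchcase

# The catch-all exclude rule from the original post.
EXCLUDE_ALL = "/**/*"

def matches(path, pattern=EXCLUDE_ALL):
    """Return True if the path matches the glob pattern (case-sensitive).

    fnmatchcase is only a stand-in for whatever matcher the product uses.
    """
    return fnmatchcase(path, pattern)

# Hypothetical path values, not taken from a real index.
print(matches("/wiki/spaces/BADSpace/pages/1/example"))  # True
print(matches("BADSpace/Overview/example"))              # False: no leading "/"
```

If the stored path has no leading /, a rule written as /**/* would never match it under this kind of matching, so the exclude would silently do nothing.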

I hope this helps - let me know what happens.

Hi Irina, Thank you.
I tried the GET command in the developer console. It returns about 10 docs (the results report about 900 in total), and I can't find any mention of "path" in them. The ones that show up do have a proper URL.

I did read somewhere that the Confluence Cloud connector is regarded as a "custom" connector. Could this be a source of my issues?

@tonyfam

How does a single document look?

Does a document have description and / or title?

path is deduced as description + "/" + title. If title is empty, then just description. If description is empty, then just title.
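
As a rough Python sketch of that rule (the function name deduce_path and the sample values are hypothetical, and this is not the connector's actual code):

```python
def deduce_path(description, title):
    """Join description and title with "/", skipping whichever is empty."""
    parts = [part for part in (description, title) if part]
    return "/".join(parts)

# Hypothetical values, for illustration only.
print(deduce_path("Some Space/Parent Page", "Child Page"))  # Some Space/Parent Page/Child Page
print(deduce_path("", "Child Page"))                        # Child Page
print(deduce_path("Some Space/Parent Page", ""))            # Some Space/Parent Page
```

Note that a path built this way would not start with a /, which is worth keeping in mind when writing the rules.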

Hi Irina, is it possible that this connector does not support "Indexing Rules"?
Do you know of a way to find out for sure whether indexing rules are supported in a connector? (Other than trying it and not being sure whether it's my own error when it doesn't work :-))

Here is one item from the "hits" array returned by the GET
(I obscured our atlassian url)

{
        "_index": ".ent-search-engine-documents-confluence-cloud-630e5c215f75c5c8932fb0b8",
        "_id": "confluence_content_67962084",
        "_score": 1,
        "_ignored": [
          "last_updated.float",
          "title.location",
          "url.location",
          "created_at.location",
          "body.date",
          "type.location",
          "content_source_id.date",
          "description.float",
          "description.location",
          "title.date",
          "body.location",
          "created_by.date",
          "updated_at.float",
          "source.location",
          "type.date",
          "source.date",
          "created_by.float",
          "created_at.float",
          "created_by.location",
          "description.date",
          "title.float",
          "url.float",
          "comments.location",
          "project.date",
          "last_updated.location",
          "type.float",
          "updated_at.location",
          "source.float",
          "body.float",
          "project.float",
          "project.location",
          "comments.date",
          "content_source_id.float",
          "url.date"
        ],
        "_source": {
          "last_updated": "2022-08-29T18:32:20+00:00",
          "comments": "",
          "description": "ajp2795/Overview/Sample Pages",
          "project": "~630d038f6856bdd60a9e4bdc",
          "created_at": "2022-08-29T18:32:20+00:00",
          "source": "confluence_cloud",
          "title": "Decision",
          "type": "page",
          "body": "Click to edit this page. To delete, open the ··· menu and select Delete . Status NOT STARTED / IN PROGRESS / COMPLETE Impact HIGH / MEDIUM / LOW Driver Approver Contributors Informed Due date Resources Relevant data Background Options considered Option 1 Option 2 Description Pros and cons Estimated cost LARGE MEDIUM Action items Outcome",
          "created_by": "ajp2795",
          "url": "https://XYZ.atlassian.net/wiki/spaces/~630d038f6856bdd60a9e4bdc/pages/67962084/Decision",
          "updated_at": "2022-08-31T23:23:10+00:00",
          "content_source_id": "630e5c215f75c5c8932fb0b8",
          "id": "confluence_content_67962084"
        }
      },

Hey @tonyfam,

What version of Enterprise Search are you running? I'm surprised the example document you provided doesn't have a path property.

Hi Ross, we are running 8.4, and we started using this index yesterday, after we upgraded our deployment to 8.4.

oops, my apologies. We are running 8.4.1.

Hey @tonyfam,

My apologies right back at you. What Irina described about the path field is correct, except I think you've run into a bug with the Confluence Cloud connector: it's not indexing a path value, which leaves the indexing rule with nothing to match against.
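
If you want to confirm this on your own deployment, one option is to count the documents that have no path field at all. Here is a minimal Python sketch that builds such a search body using the standard Elasticsearch exists query; the helper name missing_path_query is made up. You would send the printed JSON as the body of the same GET .../_search request Irina posted.

```python
import json

def missing_path_query():
    """Build a search body that counts documents lacking a "path" field."""
    return {
        "size": 0,                 # return no hits, only the total count
        "track_total_hits": True,  # report the exact total
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": "path"}}]
            }
        },
    }

# Print the body so it can be pasted into Kibana Dev Tools.
print(json.dumps(missing_path_query(), indent=2))
```

If the total reported equals the number of documents in the index, none of them carry a path value.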

I think your only workaround would be to modify the Confluence Cloud user that is used to connect to Enterprise Search so that it only has access to the one space you want to see in Enterprise Search.

Let us know if you end up trying that!

Ross

Hi Irina and Ross,

OK good to know. We will try the workaround.

Thank you very much for the help!
