Trouble syncing from Confluence connector

Hello, I just did a make clean install and make run using Elasticsearch 8.16.1, and am running into issues syncing from Confluence Cloud.

This is what my local make run returned:

[FMWK][20:34:58][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Executing full sync
[FMWK][20:34:58][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Filtering validation started
[FMWK][20:34:58][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Filtering validation result: FilteringValidationState.VALID
[FMWK][20:34:59][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Collecting local document ids
[FMWK][20:34:59][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Iterating on remote documents
[FMWK][20:34:59][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Successfully connected to Confluence
[FMWK][20:34:59][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Fetching spaces and its permissions from Confluence
[FMWK][20:35:23][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Sync progress -- created: 0 | updated: 0 | deleted: 0
[FMWK][20:35:23][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Fetching docs from space: ECOP
[FMWK][20:35:23][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Fetching blogpost and its permissions from Confluence
[FMWK][20:35:23][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Fetching page and its permissions from Confluence
[FMWK][20:35:26][ERROR] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Error while fetching pages and blogposts with query 'cql=space in ('ECOP') AND type=page&limit=50&expand=ancestors,children.attachment,history.lastUpdated,body.storage,space,space.permissions,restrictions.read.restrictions.user,restrictions.read.restrictions.group': 'publicName'
Traceback (most recent call last):
  File "/Users/me/vscode_stuff/connectors/connectors/sources/confluence.py", line 1176, in _page_blog_coro
    async for (
  File "/Users/me/vscode_stuff/connectors/connectors/sources/confluence.py", line 918, in fetch_documents
    "author": document["history"]["createdBy"][self.authorkey],
KeyError: 'publicName'
[FMWK][20:35:26][ERROR] Exception found for task Task-484
Traceback (most recent call last):
  File "/Users/me/vscode_stuff/connectors/connectors/sources/confluence.py", line 1176, in _page_blog_coro
    async for (
  File "/Users/me/vscode_stuff/connectors/connectors/sources/confluence.py", line 918, in fetch_documents
    "author": document["history"]["createdBy"][self.authorkey],
KeyError: 'publicName'
[FMWK][20:40:55][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Both Extractor and Sink tasks are successfully stopped.
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Sync ended with status completed -- created: 1 | updated: 0 | deleted: 1 (took 682 seconds)
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] --- Counters ---
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'bulk_item_responses._ids_changed_after_request' : 2
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'bulk_item_responses.delete' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'bulk_item_responses.index' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'bulk_operations.delete' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'bulk_operations.index' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'deleted_document_count' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'doc_creates_queued' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'doc_deletes_queued' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'docs_extracted' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'indexed_document_count' : 1
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'indexed_document_volume' : 0
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] 'result_successes' : 2
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] full counters dictionary: {'docs_extracted': 1, 'doc_creates_queued': 1, 'doc_deletes_queued': 1, 'bulk_operations.index': 1, 'bulk_operations.delete': 1, 'bulk_item_responses.index': 1, 'bulk_item_responses._ids_changed_after_request': 2, 'result_successes': 2, 'bulk_item_responses.delete': 1, 'indexed_document_count': 1, 'indexed_document_volume': 0, 'deleted_document_count': 1}
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] ----------------
[FMWK][20:40:57][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Job terminated. Cleaning up.
[FMWK][20:40:58][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: qm82jpMBqZFPm85GdOfj] Both Extractor and Sink tasks are successfully stopped.

I see there is a KeyError. Despite this, the job indexed one document for the space, containing just the id, space key, and name. None of the pages in the Confluence space were indexed.

Any help is appreciated, thanks!

Hi @robert.bonagura ,
Sorry you're having this issue.

This is unexpected behavior, for sure. It looks to me like self.authorkey is being evaluated as the string "publicName", which seems like an... unlikely author name. If you execute the CQL yourself:

space in ('ECOP') AND type=page&limit=50&expand=ancestors,children.attachment,history.lastUpdated,body.storage,space,space.permissions,restrictions.read.restrictions.user,restrictions.read.restrictions.group

Do you get back reasonable results? Do you see pages that are created by some sort of uncommon automation or something? Can you identify what the key should be in these cases? You'd be looking for a createdBy that's odd. Can you share what creators are credited?
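For reference, here's a minimal sketch of building the kind of search request the connector issues. The site URL and the trimmed-down expand list are placeholders, not the connector's exact values; a real call would also need Basic auth headers:

```python
import urllib.parse

# Placeholder site -- substitute your own Confluence base URL.
BASE_URL = "https://your-site.atlassian.net/wiki"
CQL = "space in ('ECOP') AND type=page"

# Build the search URL; expand is an illustrative subset of what
# the connector requests in the log line above.
params = urllib.parse.urlencode({
    "cql": CQL,
    "limit": 50,
    "expand": "history.lastUpdated,body.storage,space",
})
url = f"{BASE_URL}/rest/api/search?{params}"
print(url)

# A real request would add an Authorization header, e.g.:
# req = urllib.request.Request(url, headers={"Authorization": "Basic ..."})
```

Pasting the printed URL into a browser while logged in to Confluence is usually the quickest way to eyeball the raw response.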

Hey Sean,

So when I execute the query, I see that the response does not contain a "createdBy" field. This is consistent with why I am seeing "publicName" as the default value; it is defined by confluence.py:

class ConfluenceDataSource(BaseDataSource):
    """Confluence"""

    ...
    ...

    def __init__(self, configuration):
        """Setup the connection to Confluence

        Args:
            configuration (DataSourceConfiguration): Object of DataSourceConfiguration class.
        """
        ...
        ...

        self.authorkey = (
            "username"
            if self.confluence_client.data_source_type == "confluence_data_center"
            else "publicName"
        )

This is the default behavior when the data_source_type is not confluence_data_center

This is the default behavior when the data_source_type is not confluence_data_center

My mistake, should have done a closer grep through the code. Ok, so publicName is a red herring, and is something we clearly do expect to be present.

I see that response does not contain a "createdBy" field

this is surprising to me. None of the returned Pages have createdBy as a field? https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-content/#api-wiki-rest-api-content-get shows that results should have a results[].history.createdBy.publicName

I expect you executed the CQL directly and not through the Get Content REST API, but I'd still expect the pages you got as a result to all have creators.
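As an aside, one way to make the author lookup tolerant of both response shapes is to fall back across candidate keys instead of indexing directly. This is a sketch of the idea, not the shipped connector code, and the key names beyond publicName/username are assumptions:

```python
def extract_author(document, preferred_key="publicName"):
    """Pull the page author out of a Confluence result, tolerating
    responses that omit history or createdBy entirely."""
    created_by = (document.get("history") or {}).get("createdBy") or {}
    # Try the expected key first, then common alternatives.
    for key in (preferred_key, "username", "displayName"):
        if key in created_by:
            return created_by[key]
    return None  # no author information in this response shape

# Example shapes from this thread:
cloud_doc = {"history": {"createdBy": {"publicName": "jdoe"}}}
server_doc = {"history": {"createdBy": {"username": "jdoe"}}}
bare_doc = {"title": "some-title"}  # no history at all
```

With this shape, the KeyError above would become a None author rather than a crashed sync task.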

Yeah, strange right?

Neither results[].history nor results[].history.createdBy.publicName is present in my response. The shape of my response looks like this:

"results": [
      {
          "content": {
              "id": "1420110996",
              "type": "page",
              "status": "current",
              "title": "some-title",
              "restrictions": {},
              "_links": {
                  "webui": "/pages/viewpage.action?pageId=1420110996",
                  "tinyui": "/x/lCylV",
                  "self": "https://some-host/rest/api/content/1420110996"
              },
              "_expandable": {
                  "container": "",
                  "metadata": "",
                  "extensions": "",
                  "operations": "",
                  "children": "",
                  "history": "/rest/api/content/1420110996/history",
                  "ancestors": "",
                  "body": "",
                  "version": "",
                  "descendants": "",
                  "space": "/rest/api/space/some-space"
              }
          },
          "title": "some-title",
          "excerpt": "some-excerpt",
          "url": "/pages/viewpage.action?pageId=1420110996",
          "resultGlobalContainer": {
              "title": "some title",
              "displayUrl": "/display/ECOP"
          },
          "entityType": "content",
          "iconCssClass": "aui-icon content-type-page",
          "lastModified": "2024-11-26T16:49:15.370-05:00",
          "friendlyLastModified": "Nov 26, 2024",
          "timestamp": 1732657755370
      },

I am using the Confluence 8.9.3 search API (GET /rest/api/search) to generate this, based on the query we discussed in this thread.

EDIT: please disregard, I will try using the API you shared instead of the GET /rest/api/search

Thanks @Sean_Story I think I have discovered the issue.

It has to do with that initial code block I shared:

        self.authorkey = (
            "username"
            if self.confluence_client.data_source_type == "confluence_data_center"
            else "publicName"
        )

I need to use the username field here, not publicName, despite the fact that my client type is NOT confluence_data_center.

            # else "publicName"
            else "username"

Does this have to do with the fact that I am doing auth with a username and password? Maybe this is not preferred? I have configured the connector using Confluence Server.

Anyways, I have made progress. So far I have indexed 832 documents, but have encountered a new issue: six 504 Gateway Timeout errors. Is this just a matter of updating my connector configuration to allow a longer time window? For what it is worth, I am starting with one of my smaller Confluence spaces, and was planning to scale this to multiple spaces, some much larger than this one.

Any guidance regarding this would be really appreciated. I am not certain how well this connector is supposed to scale. I am running locally on a 32GB M2 MacBook, and I am not familiar with whether this job processor utilizes multi-threading.

I need to use the username field here, not publicName, despite the fact that my client type is NOT confluence_data_center.

It's possible that Confluence Cloud changed their data model, but even the docs I linked earlier say that this should be publicName. If you're finding that their API response differs from their documentation, you may want to file a support ticket with Atlassian; I'll try a few things on my side to see if I can reproduce.

but have encountered a new issue, six 504 gateway timeout errors.

I'm assuming those are coming from Confluence, not Elasticsearch? Connectors themselves don't have a REST API, so they aren't the source of the 504s.

5xx errors are the "fault" of whatever produces them. If I'm right and these are coming from Confluence, I'd suggest filing a support ticket with them to inquire as to why you're getting 504 errors.

Is this just a matter of updating my connector configuration to allow for a longer time window?

If it were a client timeout, maybe. But Gateway Timeout means that something between the source and the destination decided to give up, without the consent of the client or the server. So changing client-side timeouts likely won't help here.
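If the 504s turn out to be transient, a generic client-side retry with exponential backoff can paper over them. This is a sketch of the pattern, not a connector configuration option; fetch_with_retry and the fake flaky endpoint are hypothetical names:

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0):
    """Call `fetch` (a zero-argument callable), retrying with
    exponential backoff when it raises; re-raise after the last try."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# A fake flaky endpoint: fails twice with a simulated 504, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("504 Gateway Timeout")
    return "ok"
```

Backoff helps when the intermediary is overloaded, but it won't fix a hard limit on the gateway's side; that's still a question for Atlassian support.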

I am not familiar with whether or not this job processor utilizes multi-threading or not

The connectors framework is built on Python and does not start parallel threads/processes. It does make heavy use of async Python, which runs numerous coroutines on a single real core. It should be pretty snappy, and is usually bottlenecked more by the speed at which data can be fetched from the third party (Confluence) and ingested into Elasticsearch.
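To illustrate the single-threaded concurrency model (a standalone sketch, unrelated to the connector's internals): several simulated "requests" overlap on one thread, so the total wall time is roughly the slowest one, not the sum:

```python
import asyncio

async def fetch_page(page_id, delay):
    # Simulate a slow network call without blocking the event loop.
    await asyncio.sleep(delay)
    return page_id

async def main():
    # All three coroutines run concurrently on a single OS thread;
    # gather preserves the order of its arguments in the result.
    return await asyncio.gather(
        fetch_page("a", 0.03),
        fetch_page("b", 0.01),
        fetch_page("c", 0.02),
    )

results = asyncio.run(main())
```

This is why slow I/O on either end (Confluence or Elasticsearch) dominates sync time, while adding CPU cores doesn't.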

How many documents are you expecting? Does the sync continue (just slowly) or is it crashing after 832 documents and those 504 errors?

I am a little unsure of how many I am expecting, to be honest. But after I re-ran the sync process I did not get those 504s, so, like you said, it may just be related to something on the third party's side.

I've expanded the configuration to multiple spaces this time, though, and am seeing a new error that I saw once before on one of my first attempts:

[FMWK][20:47:08][INFO] Running connector service version 8.16.1
[FMWK][20:47:08][INFO] Loading config from /Users/user/vscode_stuff/connectors/connectors/../config.yml
[FMWK][20:47:08][INFO] Running preflight checks
[FMWK][20:47:08][INFO] Waiting for Elasticsearch at https://host.com/ (so far: 0 secs)
[FMWK][20:47:09][INFO] Elasticsearch 8.16.1 and Connectors 8.16.1 are compatible
[FMWK][20:47:09][INFO] Extraction service is not configured, skipping its preflight check.
[FMWK][20:47:09][INFO] [job_scheduling_service] Job Scheduling Service started, listening to events from https://host.com/
[FMWK][20:47:09][INFO] [content_sync_job_execution_service] Content sync job execution service started, listening to events from https://host.com/
[FMWK][20:47:09][INFO] [access_control_sync_job_execution_service] Access control sync job execution service started, listening to events from https://host.com/
[FMWK][20:54:19][INFO] [job_cleanup_service] Successfully marked #2 out of #2 idle jobs as error.

Any idea what would cause a job to go idle?

Also, you mentioned it should be pretty snappy, but I find that it usually takes 5+ minutes to make the initial connection, as you can see from the timestamps in these info logs.

[FMWK][21:04:50][INFO] Running connector service version 8.16.1
[FMWK][21:04:50][INFO] Loading config from /Users/my-user/vscode_stuff/connectors/connectors/../config.yml
[FMWK][21:04:50][INFO] Running preflight checks
[FMWK][21:04:50][INFO] Waiting for Elasticsearch at https://masked-host.com/ (so far: 0 secs)
[FMWK][21:04:50][INFO] Elasticsearch 8.16.1 and Connectors 8.16.1 are compatible
[FMWK][21:04:50][INFO] Extraction service is not configured, skipping its preflight check.
[FMWK][21:04:51][INFO] [job_scheduling_service] Job Scheduling Service started, listening to events from https://masked-host.com/
[FMWK][21:04:51][INFO] [content_sync_job_execution_service] Content sync job execution service started, listening to events from https://masked-host.com/
[FMWK][21:04:51][INFO] [access_control_sync_job_execution_service] Access control sync job execution service started, listening to events from https://masked-host.com/


[FMWK][21:10:04][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: wb-jmJMBafUUoRu236tw] Executing full sync
[FMWK][21:10:04][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: wb-jmJMBafUUoRu236tw] Filtering validation started
[FMWK][21:10:04][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: wb-jmJMBafUUoRu236tw] Filtering validation result: FilteringValidationState.VALID
[FMWK][21:10:05][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: wb-jmJMBafUUoRu236tw] Collecting local document ids
[FMWK][21:10:05][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: wb-jmJMBafUUoRu236tw] Iterating on remote documents
[FMWK][21:10:05][INFO] [Connector id: Y7_GjZMBafUUoRu2nVr9, index name: connector-confluence-068d, Sync job id: wb-jmJMBafUUoRu236tw] Successfully connected to Confluence

It seems like this 5 minutes is the Elastic cluster finding the connector, not the actual connection to the third-party tool, no?

Are you Ctrl+C'ing the connectors process while syncs are running, rather than canceling the jobs in the UI?

"Idle" jobs are jobs that haven't made any progress or sent a heartbeat during some window of time. We determine something must have gone wrong and put them into an error state so that other pending jobs can begin.
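Conceptually (a sketch of the idea, not the framework's actual implementation, and the 5-minute window is an illustrative value rather than the real default), idle detection just compares the job's last recorded progress or heartbeat against a timeout window:

```python
from datetime import datetime, timedelta, timezone

IDLE_WINDOW = timedelta(minutes=5)  # illustrative, not the real default

def is_idle(last_seen, now=None):
    """A job counts as idle when neither progress nor a heartbeat
    has been recorded within the window."""
    now = now or datetime.now(timezone.utc)
    return now - last_seen > IDLE_WINDOW

now = datetime.now(timezone.utc)
```

This is why killing the connector process mid-sync produces idle jobs: the job document in Elasticsearch stops receiving heartbeats, and the cleanup service eventually marks it as errored.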

it usually takes 5+ minutes to make the initial connection

What's triggering the sync? Are you clicking the "Sync now" button, or is there a schedule you've configured?

On-demand syncs might take up to 30 seconds to start because, like I said, Connectors don't have a REST API; they communicate with Kibana asynchronously through Elasticsearch. But unless you've changed the default config in connectors/config.yml to increase that polling window, 5 minutes is definitely longer than expected.

If none of that helps you identify what's happening, can you enable debug logs and share them from service start-up to job start-up?