I'm running the GitHub connector on our enterprise instance, and while I am able to do a full sync on smaller repos, we have a large repo with over 78k pull requests, and we hit the rate limit during the full sync.
Is there a way to throttle the GraphQL calls the connector makes on full sync? Is this something that we would need to implement ourselves?
Thank you for your question! The GitHub connector has rate limiting implemented and should throttle the calls to the GraphQL API. May I ask you to provide the Elasticsearch and connectors versions you're running? Also, do you have any logs similar to these? (You must enable the DEBUG log level to observe these logs.)
We are running Elasticsearch version 8.11.3, and connector version 8.11.5 built from branch 8.11.
Here is what I see in the DEBUG logs for the connector service:
[FMWK][11:09:42][INFO] [Connector id: CONNECTOR_ID, index name: search-ghes-drivers, Sync job id: lk8Ceo0BZ1EMbxmageMz] Sync progress -- created: 0 | updated: 9900 | deleted: 0
[FMWK][11:09:42][DEBUG] [Connector id: CONNECTOR_ID, index name: search-ghes-drivers, Sync job id: lk8Ceo0BZ1EMbxmageMz] Sending POST to GHES_SERVER_URL/api/graphql with body: '
[FMWK][11:09:42][DEBUG] Retrying (1 of 3) with interval: 2 and strategy: EXPONENTIAL_BACKOFF
[FMWK][11:09:44][DEBUG] [Connector id: CONNECTOR_ID, index name: search-ghes-drivers, Sync job id: lk8Ceo0BZ1EMbxmageMz] Sending POST to GHES_SERVER_URL/api/graphql with body: '
[FMWK][11:09:45][DEBUG] Retrying (2 of 3) with interval: 2 and strategy: EXPONENTIAL_BACKOFF
[FMWK][11:09:49][WARNING] [Connector id: CONNECTOR_ID, index name: search-ghes-drivers, Sync job id: lk8Ceo0BZ1EMbxmageMz] Something went wrong while fetching the pull requests. Exception: 406, message='Not Acceptable', url=URL('GHES_SERVER_URL/login?return_to=GHES_SERVER_URL/rate_limit')
It looks to me like it is not waiting long enough before retrying (it should wait roughly 1 hour for the rate limit to reset), but I don't see where to set the retry interval in the config. Or is it hard-coded somewhere in the connector?
Piggy-back question: I see that there is only an option to schedule full syncs and no incremental sync for the GitHub connector, is that right?
The retry logic is hardcoded in the connectors. This is something we probably want to change in the future and allow either framework-level or per-connector configuration. I'll go ahead and create an issue for that, though I cannot guarantee when or if it will be picked up.
I'm kinda surprised, though, that this logic doesn't kick in correctly. I'll take that back to the team.
The debug logs could also be more useful, as they don't show the unit of the retry interval; that's something I can change directly.
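For context, the behaviour in your logs (3 attempts, a 2-second base interval, EXPONENTIAL_BACKOFF) corresponds roughly to a retry wrapper like the sketch below. This is only an illustration of the pattern, not the actual framework code, so the name and signature are mine:

```python
import asyncio
import functools

# Illustrative sketch, not the actual connectors framework code. It mirrors
# what the DEBUG logs show: 3 attempts, a 2-second base interval, and an
# exponential backoff strategy.
def retryable(retries=3, interval=2):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise
                    # Waits 2s, then 4s -- nowhere near the roughly one-hour
                    # GitHub rate-limit reset window.
                    await asyncio.sleep(interval ** attempt)
        return wrapper
    return decorator
```

With values like these the connector only waits a few seconds in total, which is why it gives up long before the rate limit actually resets.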
> Piggy-back question: I see that there is only an option to schedule full syncs and no incremental sync for the GitHub connector, is that right?
Yes, that's right. You can see here that only advanced sync rules are implemented for the GitHub connector. For example, here you can see that incremental syncs are supported for a different connector.
Anything else I can help with? And by the way, feel free to simply create an issue in our repository if you encounter something that is not working; we always appreciate feedback and input from the community!
I've done some debugging on my own. Though I'm not a Python dev, I found my way around the relevant code.
It seems that the following happens:
- the connector makes POST requests while the rate limit allows it
- it hits the rate limit, executes this line, then goes into _put_to_sleep and _get_retry_after from line 570. Inside _get_retry_after, it fails on line 560, because that resolves to another API call, and the rate limit is already exceeded at this point, so it fails (rough sketch of this flow below)
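Simplified, this is what I think is going on. The function names match what I found in the source, but the bodies and the endpoint path are my own sketch, not the connector's exact code:

```python
import asyncio
import time

import aiohttp

GHES_URL = "https://GHES_SERVER_URL"  # placeholder, as in the logs above

async def _get_retry_after(session: aiohttp.ClientSession) -> float:
    # This is itself another API call -- it is the request that comes back
    # as the 406 / login redirect in the logs once the limit is exceeded.
    async with session.get(f"{GHES_URL}/rate_limit") as resp:  # path illustrative
        data = await resp.json()
        reset_at = data["resources"]["graphql"]["reset"]  # epoch seconds
        return max(reset_at - time.time(), 0)

async def _put_to_sleep(session: aiohttp.ClientSession) -> None:
    # Intended to sleep until the limit resets, but _get_retry_after raises
    # before we ever get a usable value here.
    await asyncio.sleep(await _get_retry_after(session))
```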
I did a rough implementation of throttling the POST call by just running this for 45 seconds before line 624. Not the greatest example of dev work, but for now it works: I am able to pull data, though it obviously takes some time with over 80k PRs in the repo. But I can look into making it a more elegant solution for our use case, at least.
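For reference, the crude version amounts to something like this (post_query is my own name for the call site, not the connector's, and 45 seconds is just a value that happens to keep this repo under the hourly budget):

```python
import asyncio

THROTTLE_SECONDS = 45  # arbitrary pause that keeps us under the hourly budget

async def post_query(session, url, query):
    # Sleep unconditionally before every GraphQL POST. Wasteful, but it keeps
    # the full sync below the rate limit for our ~80k-PR repository.
    await asyncio.sleep(THROTTLE_SECONDS)
    async with session.post(url, json={"query": query}) as resp:
        return await resp.json()
```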
I did try to do the same (i.e. run the same sleep utility function, hardcoded to an hour so the rate limit can reset) here, but after an hour of waiting the sync job fails because this response object resolves to NoneType and thus does not have a get attribute. I don't quite understand why that happens, so I have no idea how to resolve it, though this would be the preferred solution, I suppose.
Another approach would be to monitor the remaining rate limit with every POST call and to run this implementation before we run into the rate limit.
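Something along these lines is what I have in mind. The rateLimit { remaining resetAt } fields are part of GitHub's GraphQL API, but the function around them is just a sketch of the idea, not connector code:

```python
import asyncio
from datetime import datetime, timezone

MIN_REMAINING = 100  # arbitrary safety margin before we start sleeping

async def post_with_budget(session, url, query):
    # Assumes the query already selects `rateLimit { remaining resetAt }`,
    # so every response tells us how much budget is left and when it resets.
    async with session.post(url, json={"query": query}) as resp:
        data = await resp.json()

    rate = (data.get("data") or {}).get("rateLimit") or {}
    if rate.get("remaining", MIN_REMAINING + 1) <= MIN_REMAINING:
        # Sleep until the reported reset time instead of running into the limit.
        reset_at = datetime.fromisoformat(rate["resetAt"].replace("Z", "+00:00"))
        wait = (reset_at - datetime.now(timezone.utc)).total_seconds()
        await asyncio.sleep(max(wait, 0))
    return data
```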
I guess, given these different options for fixing this, what would the maintainers suggest?