I am looking for information on whether the Elastic Crawler (Cloud deployment) can enable debug logging for just a single crawl job.
I am hitting 599 timeout errors during a crawl, and once they start the crawl is no longer productive. I'm not sure whether a debug crawl log would surface any additional details.
The target is an enterprise environment behind a WAF, and it has been confirmed that the WAF is not blocking the crawler.
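My understanding is that the crawler in a Cloud deployment ships its crawl events to Elasticsearch, so even without a debug flag the individual 599 events for one crawl job should be queryable. A rough sketch of what I mean — the data stream pattern, field names, endpoint, and API key below are all assumptions/placeholders to check against the actual deployment:

```python
# Sketch: pull per-URL crawl events with 599 responses for one crawl job.
# ASSUMPTIONS: the data stream pattern ("logs-elastic_crawler*") and the
# field names below may differ in your deployment -- verify them first.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:443",  # placeholder Cloud endpoint
    api_key="YOUR_API_KEY",                                  # placeholder credential
)

resp = es.search(
    index="logs-elastic_crawler*",        # assumed event-log data stream pattern
    size=100,
    sort=[{"@timestamp": "desc"}],
    query={
        "bool": {
            "filter": [
                {"term": {"crawler.crawl.id": "CRAWL_REQUEST_ID"}},  # assumed field name
                {"term": {"http.response.status_code": 599}},        # assumed field name
            ]
        }
    },
)

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("url", {}).get("full"))    # assumed field name
```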
Things I have done:
• Scaled the crawl threads back to 1; same result.
• Other, smaller subdomain sites complete their crawls fine.
• The URLs that get the 599 network-timeout response open fine manually from a browser at the moment they error in the Elastic crawl log (see the probe sketch after this list).
• Starting a new crawl on the same domain runs fine at first, but at an unpredictable point the 599s return.
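For what it's worth, one way to sanity-check that last point is to hit the same failing URLs from a plain HTTP client on a schedule and log how long each response takes, to see whether the origin (or the WAF) intermittently stalls non-browser clients. A rough sketch — the URL list, timeout, and User-Agent are placeholders, not the crawler's actual settings:

```python
# Sketch: repeatedly fetch the URLs the crawler reports as 599 and log how
# long each response takes, to spot intermittent stalls from outside the crawler.
# The URL list, interval, timeout, and User-Agent are placeholders only.
import time
import requests

URLS = [
    "https://www.example.com/some/page",   # replace with URLs from the crawl log
]
TIMEOUT_SECONDS = 30   # rough stand-in for the crawler's request timeout
HEADERS = {"User-Agent": "crawler-timeout-probe"}  # arbitrary label

while True:
    for url in URLS:
        start = time.monotonic()
        try:
            r = requests.get(url, headers=HEADERS, timeout=TIMEOUT_SECONDS)
            print(f"{url} -> {r.status_code} in {time.monotonic() - start:.1f}s")
        except requests.exceptions.RequestException as exc:
            print(f"{url} -> FAILED after {time.monotonic() - start:.1f}s: {exc}")
    time.sleep(60)  # poll once a minute; adjust as needed
```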
Content is retrieved during the functional crawls; the document count from this domain is up to 8,700k. The smaller subdomains I mentioned index all discovered documents and complete with 'success'.
Any insight is appreciated.
P.S. This may need to move to the 'Elastic Enterprise Search' section.
Under 'Indices' > index_name > 'Crawl' dropdown > 'Crawl with custom settings' > 'Seed URLs', there is an option to toggle individual 'Entry points' in the list, if applicable.
This particular case was just referencing the single root domain URL. I have added some seed URLs/entry points as an experiment. Ultimately the aim is to have a full crawl of the desired domain complete without timeouts. I'm not sure if there is a way to schedule a crawl of just a subset of seed/entry-point URLs to keep the Elastic crawler from hitting timeouts.
I kicked off a crawl with the seeds/entry points added this time, in hopes of focusing the crawl a bit. It is a large-ish site, so there's no telling how long it will take to complete - or to hit timeouts again.
The full crawl I kicked off this morning ran for ~3 hrs and then began hitting the 599 timeouts, as the earlier crawls did. I canceled it shortly after, since once the timeouts start the crawl usually never gets back to indexing new URLs/links.
I left the domain with a list of entry points to include in the full crawl.
I then went back and tried the manual re-crawl, toggling certain entry points only, and those partial crawls did finish. However, a large amount of the site is still not being indexed due to the 599s. It's kind of puzzling, as I'm not able to discern any pattern leading up to the timeouts.
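One way to look for a pattern is to bucket the 599 events from the crawler event logs over time, per crawl, to see whether they always start a fixed interval into the crawl or at a particular time of day. A rough sketch, with the same caveats as the earlier query (data stream pattern and field names are assumptions):

```python
# Sketch: bucket 599 events over time per crawl to look for a pattern
# (e.g. timeouts that always begin ~3 hours in, or at a fixed time of day).
# Data stream pattern and field names are assumptions -- verify them first.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:443",  # placeholder Cloud endpoint
    api_key="YOUR_API_KEY",                                  # placeholder credential
)

resp = es.search(
    index="logs-elastic_crawler*",                           # assumed data stream pattern
    size=0,
    query={"term": {"http.response.status_code": 599}},      # assumed field name
    aggs={
        "per_crawl": {
            "terms": {"field": "crawler.crawl.id", "size": 10},  # assumed field name
            "aggs": {
                "over_time": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "fixed_interval": "10m",
                    }
                }
            },
        }
    },
)

for crawl in resp["aggregations"]["per_crawl"]["buckets"]:
    print("crawl:", crawl["key"])
    for bucket in crawl["over_time"]["buckets"]:
        print(" ", bucket["key_as_string"], bucket["doc_count"])
```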
Thanks for confirming, @alongaks. We're getting to the edge of my crawler knowledge, so aside from following the troubleshooting documentation to dig specific errors out of the logs, or splitting the crawl into batches instead of a full run, I'm not sure what else to suggest.
Now that the issue is tagged in the right topic, someone in the know should pick it up. But if you still don't have any clear errors in the logs to share, I'd also recommend raising a support case to get some targeted help.