Hi,
This question is about the Elastic Web Crawler, not the App Search Crawler. I'm having a really hard time with the following issue:
- When the crawler finishes a particular crawl, I can see its status gets updated to Success:
- However, upon clicking on that "success" request, the stats shown appear to be for the final stage -> the "purge" stage, instead of the "primary" stage:
- The reason I know for sure these are the "purge" stage stats is that, according to the logs, the crawl ran for ~40 minutes, which is about how long it usually takes for our big website (we have about ~6k pages), yet the stats below report a crawl_duration_msec of only ~15 seconds. This is what the RAW JSON tab displays:
{
  "id": "6658400f5295451c68e6cf79",
  "type": "full",
  "status": "success",
  "created_at": "2024-05-30T08:59:59Z",
  "begun_at": "2024-05-30T09:00:03Z",
  "completed_at": "2024-05-30T09:40:03Z",
  "crawl_config": {
    "crawl_rules": [
      {
        "policy": "deny",
        "url_pattern": "\\Ahttps://centernet\\.fredhutch\\.org.*\\?",
        "source": "domain=https://centernet.fredhutch.org policy=deny rule=contains pattern=?"
      },
      {
        "policy": "allow",
        "url_pattern": "\\Ahttps://centernet\\.fredhutch\\.org.*",
        "source": "domain=https://centernet.fredhutch.org, default allow all rule"
      }
    ],
    "max_crawl_depth": 10,
    "seed_urls": [
      "https://centernet.fredhutch.org/"
    ],
    "sitemap_urls": [
      "https://centernet.fredhutch.org/sitemap.xml"
    ],
    "domain_allowlist": [
      "https://centernet.fredhutch.org"
    ],
    "deduplication_settings": [
      {
        "fields": [
          "body_content",
          "domains",
          "headings",
          "links",
          "meta_description",
          "meta_keywords",
          "title"
        ],
        "url_pattern": "\\Ahttps://centernet\\.fredhutch\\.org.*",
        "source": "domain=https://centernet.fredhutch.org deduplication_enabled=true deduplication_fields=[\"title\", \"body_content\", \"meta_keywords\", \"meta_description\", \"links\", \"headings\"]"
      }
    ],
    "content_extraction_enabled": true,
    "content_extraction_mime_types": [
      "application/pdf",
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "application/msword",
      "application/rtf",
      "application/vnd.ms-excel",
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
      "application/vnd.oasis.opendocument.text",
      "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "application/vnd.ms-powerpoint",
      "application/vnd.openxmlformats-officedocument.presentationml.slideshow"
    ]
  },
  "stats": {
    "timestamp": "2024-05-30T09:40:03Z",
    "event_id": "66584973608533fc3e4d61be",
    "status": {
      "crawl_duration_msec": 14595,
      "http_client": {
        "max_connections": 100,
        "used_connections": 2
      },
      "crawling_time_msec": 1531,
      "active_threads": 0,
      "pages_visited": 21,
      "queue_size": 0,
      "urls_allowed": 21,
      "avg_response_time_msec": 72.9047619047619,
      "status_codes": {
        "200": 21
      }
    },
    "crawl": {
      "id": "6658400f5295451c68e6cf79",
      "stage": "purge"
    }
  }
}
- When I check the actual logs of the dataset -> elastic_crawler, the log details match the RAW JSON: the final stats event is logged for the "purge" stage instead of the "primary" stage.
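Since each stage emits its own stats event into the logs, one workaround I'm considering is to read the "primary" stage stats straight from the event documents instead of relying on the final event. A minimal Python sketch of that idea, assuming one stats event per stage and using the field names from the RAW JSON above (the numeric values here are made up for illustration):

```python
# Workaround sketch: pick the latest stats event whose crawl stage is
# "primary", instead of the final event (which is the "purge" stage).
# Field names ("timestamp", "status", "crawl.stage") mirror the RAW JSON above.

def primary_stage_stats(events):
    """Return the most recent event from the 'primary' stage, or None."""
    primary = [e for e in events if e.get("crawl", {}).get("stage") == "primary"]
    if not primary:
        return None
    # ISO-8601 timestamps in the same timezone sort correctly as strings.
    return max(primary, key=lambda e: e["timestamp"])

# Example: a primary-stage summary event followed by the final purge-stage event.
events = [
    {"timestamp": "2024-05-30T09:38:00Z",
     "status": {"pages_visited": 6000, "urls_allowed": 6000},
     "crawl": {"id": "6658400f5295451c68e6cf79", "stage": "primary"}},
    {"timestamp": "2024-05-30T09:40:03Z",
     "status": {"pages_visited": 21, "urls_allowed": 21},
     "crawl": {"id": "6658400f5295451c68e6cf79", "stage": "purge"}},
]
best = primary_stage_stats(events)
print(best["status"]["pages_visited"])  # primary-stage count, not the purge count
```

This is only a client-side sketch of what I'd expect the Crawl requests UI to do; it doesn't fix the underlying logging behavior.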
This is NOT the expected logging behavior for either the Elastic Web Crawler or the App Search Crawler. The final log entry of these crawlers should report the "primary" stage so that the Crawl requests window/tab can display the correct stats for the main stage of the crawl. In my case, numbers such as "URLs" and "Pages" should be somewhere in the ~6,000 range, not 22 as in the screenshot above.
I know the final stage of a crawl is always the "purge" stage. But believe me, you can set up a few test crawls in your own instances and see it for yourself: the final log entry should report the "primary" stage and display that stage's numbers, since the primary-stage stats are the most important summary of the whole crawl.
I have no idea how to resolve this problem, and to make matters worse, I have two clusters: QA and PROD. The crawl in the QA cluster does show the correct numbers, i.e. the stats for the "primary" stage, but the PROD cluster's numbers are skewed and display the "purge" stage stats only.