Elastic Web Crawler -> Crawl requests -> Incorrect stats

Hi,

This question is about the Elastic Web Crawler, not the App Search Crawler. I'm struggling with the following issue:

  • When the crawler finishes a crawl, I can see the status gets updated to Success:

  • However, upon clicking on that "success" request, the stats shown appear to come from the final "purge" stage instead of the "primary" stage:

  • The reason I know for sure these are the "purge" stage stats is that, according to the log, the crawl ran for ~40 minutes, which is the time it usually takes for our big website (about ~6k pages). This is what the RAW JSON tab displays:

{
  "id": "6658400f5295451c68e6cf79",
  "type": "full",
  "status": "success",
  "created_at": "2024-05-30T08:59:59Z",
  "begun_at": "2024-05-30T09:00:03Z",
  "completed_at": "2024-05-30T09:40:03Z",
  "crawl_config": {
    "crawl_rules": [
      {
        "policy": "deny",
        "url_pattern": "\\Ahttps://centernet\\.fredhutch\\.org.*\\?",
        "source": "domain=https://centernet.fredhutch.org policy=deny rule=contains pattern=?"
      },
      {
        "policy": "allow",
        "url_pattern": "\\Ahttps://centernet\\.fredhutch\\.org.*",
        "source": "domain=https://centernet.fredhutch.org, default allow all rule"
      }
    ],
    "max_crawl_depth": 10,
    "seed_urls": [
      "https://centernet.fredhutch.org/"
    ],
    "sitemap_urls": [
      "https://centernet.fredhutch.org/sitemap.xml"
    ],
    "domain_allowlist": [
      "https://centernet.fredhutch.org"
    ],
    "deduplication_settings": [
      {
        "fields": [
          "body_content",
          "domains",
          "headings",
          "links",
          "meta_description",
          "meta_keywords",
          "title"
        ],
        "url_pattern": "\\Ahttps://centernet\\.fredhutch\\.org.*",
        "source": "domain=https://centernet.fredhutch.org deduplication_enabled=true deduplication_fields=[\"title\", \"body_content\", \"meta_keywords\", \"meta_description\", \"links\", \"headings\"]"
      }
    ],
    "content_extraction_enabled": true,
    "content_extraction_mime_types": [
      "application/pdf",
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "application/msword",
      "application/rtf",
      "application/vnd.ms-excel",
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
      "application/vnd.oasis.opendocument.text",
      "application/vnd.openxmlformats-officedocument.presentationml.presentation",
      "application/vnd.ms-powerpoint",
      "application/vnd.openxmlformats-officedocument.presentationml.slideshow"
    ]
  },
  "stats": {
    "timestamp": "2024-05-30T09:40:03Z",
    "event_id": "66584973608533fc3e4d61be",
    "status": {
      "crawl_duration_msec": 14595,
      "http_client": {
        "max_connections": 100,
        "used_connections": 2
      },
      "crawling_time_msec": 1531,
      "active_threads": 0,
      "pages_visited": 21,
      "queue_size": 0,
      "urls_allowed": 21,
      "avg_response_time_msec": 72.9047619047619,
      "status_codes": {
        "200": 21
      }
    },
    "crawl": {
      "id": "6658400f5295451c68e6cf79",
      "stage": "purge"
    }
  }
}
  • When I check the actual logs in the dataset -> elastic_crawler, the log details match the RAW JSON: the final stats event is logged for the "purge" stage instead of the "primary" stage.

  • This is NOT the expected logging behavior for either the Elastic Web Crawler or the App Search Crawler. The final log of these crawlers should report the stage as "primary" instead of "purge", so that the Crawl requests tab can display the correct stats for the main stage of the crawl. In my case, numbers such as "URLs" and "Pages" should be somewhere in the ~6,000 range, not 22 as in the screenshot above.

  • I know the final stage of a crawl will always be the "purge" stage, but you can set up a few test crawls in your own instances and see this for yourself. The final log of the crawler should say "primary" and display the numbers for this stage, since the stats for this stage are the most important summary stats for the whole crawl.

I have no idea how to resolve this problem, and to make matters worse, I have two clusters: QA and PROD. The crawl in the QA cluster does show the correct numbers, i.e. the stats for the "primary" stage, but the PROD cluster is skewed and displays the "purge" stage stats only. :sob:
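For anyone hitting the same symptom, one way to sanity-check which stage's stats the UI is showing is to pull the crawl-end events yourself and filter by stage. The sketch below assumes the event documents mirror the RAW JSON shape above (stage under `stats.crawl.stage`, counters under `stats.status`); the sample documents and their numbers are purely illustrative.

```python
# Sketch: given crawl events exported from the crawler logs, pick out the
# "primary"-stage stats for one crawl. Field paths follow the RAW JSON
# shape shown above; the sample data below is illustrative only.

def primary_stage_stats(events, crawl_id):
    """Return the status block of the primary-stage event for one crawl,
    or None if no such event exists."""
    for event in events:
        crawl = event.get("stats", {}).get("crawl", {})
        if crawl.get("id") == crawl_id and crawl.get("stage") == "primary":
            return event["stats"]["status"]
    return None

# Two hypothetical events for the same crawl: one per stage.
events = [
    {"stats": {"crawl": {"id": "6658400f5295451c68e6cf79", "stage": "primary"},
               "status": {"pages_visited": 6012, "urls_allowed": 6400}}},
    {"stats": {"crawl": {"id": "6658400f5295451c68e6cf79", "stage": "purge"},
               "status": {"pages_visited": 21, "urls_allowed": 21}}},
]

print(primary_stage_stats(events, "6658400f5295451c68e6cf79"))
# → {'pages_visited': 6012, 'urls_allowed': 6400}
```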

Hi Tung,

It seems to be a bug on the Web Crawler side. I will file a bug with the team, but in the meantime I need to double-check:

You need the crawler stats for the primary stage - e.g. how much time it took to ingest the pages before duplicates were purged. Is that correct?

How much does it block you from using the Web Crawler right now?

Theoretically it's possible to access these stats from Elasticsearch directly. They will not be displayed correctly in the UI until we fix it, but you might be able to collect them for your own purposes.
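As a rough sketch of that workaround: you could query the crawler's log documents for the primary-stage event of a given crawl. The index pattern and field paths below are assumptions, not confirmed - the events in this thread show the stage under `stats.crawl.stage`, but check your own log documents (the elastic_crawler dataset) and adjust accordingly.

```python
# Sketch: build an Elasticsearch query for the primary-stage stats event of
# one crawl. Field paths ("stats.crawl.id", "stats.crawl.stage",
# "stats.timestamp") are taken from the RAW JSON in this thread and may
# differ from your actual log document mapping.

def build_primary_stats_query(crawl_id):
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"stats.crawl.id": crawl_id}},
                    {"term": {"stats.crawl.stage": "primary"}},
                ]
            }
        },
        "sort": [{"stats.timestamp": "desc"}],  # newest matching event first
        "size": 1,
    }

# Hypothetical usage with the official Python client, against whatever
# index pattern your crawler logs actually live in:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("https://localhost:9200", api_key="...")
#   resp = es.search(index="logs-elastic_crawler*",
#                    body=build_primary_stats_query("6658400f5295451c68e6cf79"))

query = build_primary_stats_query("6658400f5295451c68e6cf79")
print(query["query"]["bool"]["filter"][1])
# → {'term': {'stats.crawl.stage': 'primary'}}
```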

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.