We use a Swiftype engine to manage search across our documentation site. There are over 1600 pages/documents in our engine right now.
We are trying to write a few internal controls/QA checks against Swiftype to ensure that Swiftype is seeing all our pages, and to confirm that the last full crawl of our engine happened within the past 7 days. This data is available via the userPortal ..
Hi @Andrew_Sepic, the updated_at value should be accurate if you are looking at a document rather than the engine, see Crawler Overview | Swiftype Documentation. The updated_at should be the date the doc was last indexed (i.e. last crawl date).
Would this be the value you're looking for in the checks you're writing?
If I'm seeing that my engine completed its most recent full crawl today, and plans to start a new full site crawl today, I'm assuming that the updated_at date should be similar, i.e. it should be the current date or within 24 hrs. Is that accurate?
@Andrew_Sepic yes that's correct, the updated_at value should update for the documents of any URL endpoints encountered during that crawl. It will differ by n seconds/minutes/hours per doc depending on how long the crawl takes, as the value is specifically when that document was indexed.
If the Swiftype crawler doesn't encounter a URL for an existing doc during its full crawl, it should delete the doc. So I don't think there should be a situation where there are large discrepancies between updated_at values.
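As a minimal sketch of the kind of QA check being discussed, assuming the documents have already been fetched as plain objects carrying an ISO 8601 `updated_at` string: the helper below flags any document last indexed more than 7 days ago. `findStaleDocuments` and its parameters are hypothetical names for illustration, not part of the Swiftype API or the Node client.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: returns the documents whose updated_at is older
// than maxAgeDays relative to `now`.
function findStaleDocuments(documents, maxAgeDays = 7, now = new Date()) {
  const cutoff = now.getTime() - maxAgeDays * DAY_MS;
  return documents.filter((doc) => {
    const indexedAt = Date.parse(doc.updated_at);
    // A missing or unparseable timestamp is treated as stale so it gets flagged.
    return Number.isNaN(indexedAt) || indexedAt < cutoff;
  });
}
```

If the crawler behaves as described, this should return an empty array shortly after a full crawl completes.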
@nfeekery Thanks for verifying that. Seems to make sense to me.
I don't understand why I'm not seeing that though..
I make a request via the Node.js client @elastic/site-search-node and get 200 documents back from my engine.
I randomly choose an externalId and request that document with a URL that looks like this: https://api.swiftype.com/api/v1/engines/${engine}/document_types/page/documents/${externalId}?auth_token=${apiKey}
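A rough sketch of that single-document request, assuming Node 18+ (for the global `fetch`) and the URL shape quoted above. The function names `buildDocumentUrl` and `fetchDocument` are hypothetical, and the `page` document type is taken from the URL in the post.

```javascript
// Builds the single-document URL, encoding each dynamic path segment.
function buildDocumentUrl(engine, externalId, apiKey, documentType = "page") {
  const url = new URL(
    "https://api.swiftype.com/api/v1/engines/" +
      encodeURIComponent(engine) +
      "/document_types/" +
      encodeURIComponent(documentType) +
      "/documents/" +
      encodeURIComponent(externalId)
  );
  url.searchParams.set("auth_token", apiKey);
  return url.toString();
}

// Fetches one document and returns the parsed JSON body.
async function fetchDocument(engine, externalId, apiKey) {
  const response = await fetch(buildDocumentUrl(engine, externalId, apiKey));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

The `updated_at` field of the returned object can then be compared against the expected crawl date.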
The object I get back for that document (as of today, June 21, 2024) is below. It looks like the content is duplicated in the response, but maybe I'm interpreting that wrong. The updated_at date, though, is clearly March 21, 2024: suspiciously, exactly 3 months behind today. If I request other documents, I get the same date of 2024-03-21.
Any idea what's going on?
{
"external_id": "f8a46f9624ddee1152b9e56865bbf468e69a8a19",
"engine_id": "5c6adb81d3b68758d0c5c15a",
"document_type_id": "5c6adb82d3b68758d0c5c15b",
"id": "65fb78a0196a678073395947",
"updated_at": "2024-03-21T00:00:32Z",
"title": "Directions API Playground",
"excerpt": "Retrieve turn-by-turn instructions using four different Mapbox routing profiles.",
"image": "https://static-assets.mapbox.com/branding/social/social-120x120.v2.png",
"site": "Developer Playgrounds",
"contentType": "playground",
"sections": ["Directions API Playground"],
"body": "...",
"type": "",
"published_at": "2024-06-21T12:19:01Z",
"popularity": 1,
"info": "",
"url": "https://docs.mapbox.com/playground/directions/",
"updated_at": "2024-03-21T00:00:32Z",
"title": "Directions API Playground",
"excerpt": "Retrieve turn-by-turn instructions using four different Mapbox routing profiles.",
"image": "https://static-assets.mapbox.com/branding/social/social-120x120.v2.png",
"site": "Developer Playgrounds",
"contentType": "playground",
"sections": ["Directions API Playground"],
"body": "...",
"type": "",
"published_at": "2024-06-21T12:19:01Z",
"popularity": 1,
"info": "",
"url": "https://docs.mapbox.com/playground/directions/"
}
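To see whether every document really shares the same indexed date, one option is to tally documents by the date part of `updated_at`. This is only a sketch: `countByIndexedDate` is a hypothetical helper, and it assumes the documents are already available as an array of objects shaped like the response above.

```javascript
// Hypothetical helper: counts documents per calendar day (UTC) of updated_at.
function countByIndexedDate(documents) {
  const counts = {};
  for (const doc of documents) {
    // ISO 8601 timestamps start with YYYY-MM-DD, so the first 10 characters are the date.
    const day =
      typeof doc.updated_at === "string" ? doc.updated_at.slice(0, 10) : "unknown";
    counts[day] = (counts[day] || 0) + 1;
  }
  return counts;
}
```

A single key such as `2024-03-21` holding all documents would confirm that the whole engine reports the same stale date rather than a handful of pages.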