Hi @jimczi, thanks for your reply.
The queries based on field collapsing look like this:
```json
{
  "from": 0,
  "size": 2500,
  "query": {
    "range": {
      "nextFetchDate": {
        "from": null,
        "to": "2018-10-13T06:34:36+01:00",
        "include_lower": true,
        "include_upper": true,
        "boost": 1.0
      }
    }
  },
  "explain": false,
  "sort": [{
    "nextFetchDate": {
      "order": "asc"
    }
  }],
  "track_total_hits": false,
  "collapse": {
    "field": "hostname",
    "inner_hits": {
      "name": "urls_per_bucket",
      "ignore_unmapped": false,
      "from": 0,
      "size": 2,
      "version": false,
      "explain": false,
      "track_scores": false,
      "sort": [{
        "nextFetchDate": {
          "order": "asc"
        }
      }]
    }
  }
}
```
I compare the two approaches both in situ, by measuring how long the queries take while the crawler is running, and externally, using the console in Kibana.
For this crawl, each shard has around 68K unique values for the hostname field.
Here are the times I got from the console, using a match_all query to keep things simple => Benchmark ES queries
NOTE: I haven't quite finished measuring with sampling enabled.
As expected, the measurements for the (non-sampled) aggregations are pretty constant, which is why my strategy is to get loads of buckets with 2 URLs in each: it takes some time, but we don't need to query often.
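For comparison, the aggregation-based version is roughly the sketch below: a terms aggregation on hostname with a top_hits sub-aggregation. The bucket count of 1250 and top_hits size of 2 are illustrative values chosen to mirror the 2500 docs / 2 per bucket of the collapsing query above, not necessarily what the crawler actually uses:

```json
{
  "size": 0,
  "query": {
    "range": {
      "nextFetchDate": {
        "lte": "2018-10-13T06:34:36+01:00"
      }
    }
  },
  "aggs": {
    "hostname": {
      "terms": {
        "field": "hostname",
        "size": 1250
      },
      "aggs": {
        "urls_per_bucket": {
          "top_hits": {
            "size": 2,
            "sort": [{
              "nextFetchDate": {
                "order": "asc"
              }
            }]
          }
        }
      }
    }
  }
}
```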
With field collapsing, the times are proportional to the number of buckets retrieved, but looking at the average time per doc, it is worth retrieving a large number of buckets as well.
In practice, we try to reuse the same nextFetchDate from one query to the next to benefit from any caching. I will rerun the queries in Kibana using a different nextFetchDate every time, set to some point in the distant future, so that they match the same number of documents but without the caching that a `"query": { "match_all": {} }` could have caused.
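In other words, replacing the range clause in the queries above with something like the following, the exact far-future date being an arbitrary illustrative value:

```json
"query": {
  "range": {
    "nextFetchDate": {
      "lte": "2030-01-01T00:00:00+01:00"
    }
  }
}
```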
The values observed in situ were imperfect (more docs were being added to the index between the different runs) but gave some idea of which strategy to adopt:
AGGREG + sampling (see the sampler sketch below) => AVERAGE QUERY TIME 2214 msec @ 18.57M docs
AGGREG - sampling => AVERAGE QUERY TIME 3170 msec @ 17.6M docs
COLLAPSING => AVERAGE QUERY TIME 6290 msec @ 19.6M docs
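For reference, assuming the sampling above refers to Elasticsearch's sampler aggregation, the sampled variant just wraps the terms aggregation in it, roughly like this (the shard_size value is illustrative):

```json
"aggs": {
  "sample": {
    "sampler": {
      "shard_size": 5000
    },
    "aggs": {
      "hostname": {
        "terms": {
          "field": "hostname",
          "size": 1250
        },
        "aggs": {
          "urls_per_bucket": {
            "top_hits": {
              "size": 2,
              "sort": [{ "nextFetchDate": { "order": "asc" } }]
            }
          }
        }
      }
    }
  }
}
```

The sampler limits the sub-aggregations to the top-scoring shard_size documents collected on each shard, which would explain the lower query times.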
Field collapsing triggered loads of CircuitBreakingExceptions until I raised the Xmx.
What I should add to my logs is the average time per document retrieved: we don't necessarily get the exact number of documents we want, as some buckets can contain only 1 doc, for instance.
I am wondering whether there are any scenarios where field collapsing has benefits, e.g. when the number of hosts is limited, which would be the case for a vertical crawl.
I will have a look at Mark's suggestion. Thanks!