Performance: aggregations vs collapsing

I can't use reindex, as _source is disabled for this index to save space; see https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/ES_IndexInit.sh#L38

OK, too bad. Then you should be able to restrict the index to one shard with an alias:

POST /_aliases
{
    "actions" : [
        {
            "add" : {
                 "index" : "index", # the source_index
                 "alias" : "source_alias",
                 "routing" : "routing" # a routing value to restrict the index to one shard
            }
        }
    ]
}

Since you already use routing, you should be able to choose the shard by setting the routing value to a host that appears in it.
You can use the alias name to create the snapshot.

Here is what I did:

POST /_aliases
{
    "actions" : [
        {
            "add" : {
                 "index" : "status",
                 "alias" : "debug2",
                 "routing" : "elastic.co"
            }
        }
    ]
}

PUT /_snapshot/stormcrawler_status
{
  "type": "fs",
  "settings": {
    "location": "/data/esbackups/stormcrawler_status"
  }
}

# DELETE /_snapshot/stormcrawler_status/snapshot_1

PUT /_snapshot/stormcrawler_status/snapshot_1
{
  "indices": "debug2",
  "ignore_unavailable": true,
  "include_global_state": false
}

GET /_snapshot/stormcrawler_status/snapshot_1

Looking at the stats for debug2 and the content of the snapshot, it looks like the entire index is there, not just the shard that 'elastic.co' routes to.

Indeed, sorry, the snapshot does not apply the routing. What is the total size of the snapshot on disk?

44GB

Is there a way to nuke shards so that I could keep only one of them? Delete by query everything which is not in shard X?

Delete by query everything which is not in shard X?

Delete by query handles ?routing=, so you could delete shard by shard. However, 44GB is not that big to transfer, so we could also proceed with the current snapshot.
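
As a sketch of the shard-by-shard deletion, a delete-by-query restricted to one routing value could look like this, assuming the status index and a hypothetical host value that routes to the shard you want to empty:

POST /status/_delete_by_query?routing=somehost.example.com
{
  "query": {
    "match_all": {}
  }
}

The routing parameter is passed on to the underlying search, so only the shard(s) that routing value maps to are touched.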

Thanks Mark.

I currently shard by host or domain name then query per shard in parallel. One side effect of doing this is that the distribution of documents across the shards can be very uneven and I can end up with some shards much larger than others.
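
For context, a simplified sketch of one of those per-shard requests, using the preference parameter to target a single shard (the field names are assumptions), might look like:

GET /status/_search?preference=_shards:0
{
  "size": 0,
  "query": {
    "range": { "nextFetchDate": { "lte": "now" } }
  },
  "aggs": {
    "hosts": {
      "terms": { "field": "metadata.hostname", "size": 50 },
      "aggs": {
        "urls": {
          "top_hits": {
            "size": 2,
            "sort": [ { "nextFetchDate": { "order": "asc" } } ]
          }
        }
      }
    }
  }
}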

With terms partitioning, I could use the standard routing (and have URLs belonging to the same host in various shards) but have one thread in charge of one partition instead of a shard.

How would you expect this to fare, performance-wise, compared to my current approach of running aggregations shard by shard?

It would obviously be less focused than routing to specific shards. Each request for partition N of X would trawl through all documents that match the query (a time range) and access the hostname doc value, doing the hash/modulo computation to see whether it comes up equal to N. That cost is a straight multiple of the number of docs that match the query, but hopefully not too expensive. Only benchmarking will give you the truth.
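
A partitioned request of that kind, using the include option of the terms aggregation (field name and partition counts are placeholders), would look roughly like this:

GET /status/_search
{
  "size": 0,
  "query": {
    "range": { "nextFetchDate": { "lte": "now" } }
  },
  "aggs": {
    "hosts_partition": {
      "terms": {
        "field": "metadata.hostname",
        "include": { "partition": 2, "num_partitions": 20 },
        "size": 50
      }
    }
  }
}

Here partition 2 of 20 is requested; each of the 20 requests runs against all shards but only returns the hostnames whose hash modulo 20 equals 2.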

Thanks @jnioche, I was able to reproduce the slow search with your snapshot.
I have some remarks:

The fact that field collapsing is slow for large top-hits retrieval is expected, though. It is optimized for top hits, which means that the number of operations (comparison, insertion, retrieval) depends on the requested size. When the size is small (10), we can save a lot of comparisons because competitive documents (those that can enter the top 10) are less frequent. With a large size, the top hits are very unstable because the long tail of results makes every document a potential top hit. I tried several things to speed up field collapsing on large top hits, but I couldn't find a way to make it work seamlessly on both small and large top hits. It is not optimized for analytics queries that need to return a lot of documents, but rather for search use cases where you need to retrieve a few collapsed hits.
The bottom line here is that field collapsing is not suited to your use case. It is fast for retrieving a few documents, but the cost grows significantly with the number of collapsed groups to retrieve.
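
For reference, the kind of collapse request being discussed could be sketched as follows (the field names are assumptions); the cost described above grows with the size parameter:

GET /status/_search
{
  "size": 10,
  "query": {
    "range": { "nextFetchDate": { "lte": "now" } }
  },
  "collapse": { "field": "metadata.hostname" },
  "sort": [ { "nextFetchDate": "asc" } ]
}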

Thanks @jimczi

_source: you're absolutely right, I don't know why I thought that wasn't the case.

I will try field collapsing again with 100 buckets or fewer, just to make sure I haven't overlooked anything. I suppose it could be a good approach when the number of hosts is limited, as it would be with a vertical crawl. I'll spin up a new crawl with only 1K hostnames spread over 10 shards and see how it fares vs aggregations.

I might also try Mark's suggestion for the sake of comparison.

I am doing these benchmarks for a presentation I'll be giving at the Big Data conference in Vilnius at the end of the month.

Thanks for your help

The number of unique hosts is not the issue here. In fact it should be better with more unique hosts, since it would be faster to fill the top docs during the collection. One thing that could speed up the sort is to reduce the precision of nextFetchDate: currently the date has millisecond precision, but I doubt that it is useful. You could try rounding down to the hour (i.e. round the date to the start of the hour) in order to reduce the differences between URLs and, by extension, the number of comparisons during the search.
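
One way to apply that rounding at index time would be an ingest pipeline; a minimal sketch, assuming nextFetchDate is sent as an ISO-8601 instant string (the pipeline name is made up):

PUT _ingest/pipeline/round_nextfetchdate
{
  "description": "Round nextFetchDate down to the start of the hour",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.nextFetchDate != null) { ctx.nextFetchDate = Instant.parse(ctx.nextFetchDate).truncatedTo(ChronoUnit.HOURS).toString(); }"
      }
    }
  ]
}

Indexing requests would then pass ?pipeline=round_nextfetchdate; doing the rounding in StormCrawler itself before indexing would have the same effect.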

Rounding down to the second, minute or hour would make loads of sense. I've opened an issue for it in StormCrawler.

Would the aggregation queries benefit from that as well?
