Performance: aggregations vs collapsing

I can't use reindex, as _source is disabled for this index to save space; see https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/ES_IndexInit.sh#L38

OK, too bad. Then you should be able to restrict the index to one shard with an alias:

POST /_aliases
{
    "actions" : [
        {
            "add" : {
                 "index" : "index", # the source_index
                 "alias" : "source_alias",
                 "routing" : "routing" # a routing value to restrict the index to one shard
            }
        }
    ]
}

Since you already use routing, you should be able to choose the shard by setting the routing value to a host that appears in it.
You can use the alias name to create the snapshot.

Here is what I did:

POST /_aliases
{
    "actions" : [
        {
            "add" : {
                 "index" : "status",
                 "alias" : "debug2",
                 "routing" : "elastic.co"
            }
        }
    ]
}

PUT /_snapshot/stormcrawler_status
{
  "type": "fs",
  "settings": {
    "location": "/data/esbackups/stormcrawler_status"
  }
}

# DELETE /_snapshot/stormcrawler_status/snapshot_1

PUT /_snapshot/stormcrawler_status/snapshot_1
{
  "indices": "debug2",
  "ignore_unavailable": true,
  "include_global_state": false
}

GET /_snapshot/stormcrawler_status/snapshot_1

Looking at the stats for debug2 and the content of the snapshot, it looks like the entire index is there, not just the shard that 'elastic.co' routes to.

Indeed, sorry, the snapshot does not apply the routing. What is the total size of the snapshot on disk?

44GB

Is there a way to nuke shards so that I could keep only one of them? Delete by query everything which is not in shard X?

Delete by query everything which is not in shard X?

Delete by query handles ?routing=, so you could delete shard by shard. However, 44GB is not that big to transfer, so we could also proceed with the current snapshot.
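
As a sketch of the shard-by-shard deletion, a delete-by-query restricted to one routing value could look like this, assuming the status index and a hypothetical host value that routes to the shard you want to empty:

POST /status/_delete_by_query?routing=somehost.example.com
{
  "query": {
    "match_all": {}
  }
}

The routing parameter is passed on to the underlying search, so only the shard(s) that routing value maps to are touched.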

Thanks Mark.

I currently shard by host or domain name then query per shard in parallel. One side effect of doing this is that the distribution of documents across the shards can be very uneven and I can end up with some shards much larger than others.
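
For context, a simplified sketch of one of those per-shard requests, using the preference parameter to target a single shard (the field names are assumptions), might look like:

GET /status/_search?preference=_shards:0
{
  "size": 0,
  "query": {
    "range": { "nextFetchDate": { "lte": "now" } }
  },
  "aggs": {
    "hosts": {
      "terms": { "field": "metadata.hostname", "size": 50 },
      "aggs": {
        "urls": {
          "top_hits": {
            "size": 2,
            "sort": [ { "nextFetchDate": { "order": "asc" } } ]
          }
        }
      }
    }
  }
}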

With terms partitioning, I could use the standard routing (and have URLs belonging to the same host in various shards) but have one thread in charge of one partition instead of a shard.

How would you expect this to fare, performance-wise, compared to my current approach of running aggregations shard by shard?

It would obviously be less focused than routing to specific shards. Each request for partition N of X would trawl through all documents that match the query (a time range) and access the hostname doc value, doing the hash/modulo computation to see whether it comes up equal to N. That cost is a straight multiple of the number of docs that match the query, but hopefully not too expensive. Only benchmarking will give you the truth.
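
A partitioned request of that kind, using the include option of the terms aggregation (field name and partition counts are placeholders), would look roughly like this:

GET /status/_search
{
  "size": 0,
  "query": {
    "range": { "nextFetchDate": { "lte": "now" } }
  },
  "aggs": {
    "hosts_partition": {
      "terms": {
        "field": "metadata.hostname",
        "include": { "partition": 2, "num_partitions": 20 },
        "size": 50
      }
    }
  }
}

Here partition 2 of 20 is requested; each of the 20 requests runs against all shards but only returns the hostnames whose hash modulo 20 equals 2.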

Thanks @jnioche, I was able to reproduce the slow search with your snapshot.
I have some remarks:

The fact that field collapsing is slow for large top-hits retrieval is expected, though. It is optimized for top hits, which means that the number of operations (comparison, insertion, retrieval) depends on the requested size. When the size is small (10), we can save a lot of comparisons because competitive documents (those that can enter the top 10) are less frequent. With a large size, the top hits are very unstable because the long tail of results makes every document a potential top hit. I tried several things to speed up field collapsing on large top hits, but I couldn't find a way to make it work seamlessly on both small and large top hits. It is not optimized for analytics queries that need to return a lot of documents, but rather for search use cases where you need to retrieve a few collapsed hits.
The bottom line here is that field collapsing is not suited to your use case. It is fast for retrieving a few documents, but the cost grows significantly with the number of collapsed groups to retrieve.
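
For reference, the kind of collapse request being discussed could be sketched as follows (the field names are assumptions); the cost described above grows with the size parameter:

GET /status/_search
{
  "size": 10,
  "query": {
    "range": { "nextFetchDate": { "lte": "now" } }
  },
  "collapse": { "field": "metadata.hostname" },
  "sort": [ { "nextFetchDate": "asc" } ]
}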

Thanks @jimczi

_source: you're absolutely right, I don't know why I thought that wasn't the case.

I will try field collapsing again with 100 buckets or fewer, just to make sure I haven't overlooked anything. I suppose it could be a good approach when the number of hosts is limited, as it would be with a vertical crawl. I'll spin up a new crawl with only 1K hostnames spread over 10 shards and see how it fares vs aggregations.

I might also try Mark's suggestion for the sake of comparison.

I am doing these benchmarks for a presentation I'll be giving at the Big Data conference in Vilnius at the end of the month.

Thanks for your help

The number of unique hosts is not the issue here. In fact it should be better with more unique hosts, since it would be faster to fill the top docs during the collection. One thing that could speed up the sort is to reduce the precision of nextFetchDate: currently the date has millisecond precision, but I doubt that it is useful. You could try rounding down to the hour (i.e. round the date to the start of the hour) in order to reduce the differences between URLs and, by extension, the number of comparisons during the search.
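
One way to apply that rounding at index time would be an ingest pipeline; a minimal sketch, assuming nextFetchDate is sent as an ISO-8601 instant string (the pipeline name is made up):

PUT _ingest/pipeline/round_nextfetchdate
{
  "description": "Round nextFetchDate down to the start of the hour",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.nextFetchDate != null) { ctx.nextFetchDate = Instant.parse(ctx.nextFetchDate).truncatedTo(ChronoUnit.HOURS).toString(); }"
      }
    }
  ]
}

Indexing requests would then pass ?pipeline=round_nextfetchdate; doing the rounding in StormCrawler itself before indexing would have the same effect.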

Rounding down to the second, minute or hour would make loads of sense. I've opened an issue for it in StormCrawler.

Would the aggregation queries benefit from that as well?
