Checking for sparse statistics and problems with shard routing


(Vaughn Dickson) #1

Hi,

We have a search index consisting of 206735 web pages (7gb) stored on a Found 2 node cluster with 5 shards. Our _id is the url of the page, so it seems like each domain will get routed to the same shard?
Will this cause us problems with relevancy scoring? I don't fully understand the sparse statistics problem yet.

Should we be using custom routing to shuffle documents across shards more, and/or use fewer shards?

Kind regards,
Vaughn Dickson


(Isabel Drost-Fromm) #2

[quote="Vaughn_Dickson, post:1, topic:33227"]
We have a search index consisting of 206735 web pages (7gb) stored on a Found 2 node cluster with 5 shards. Our _id is the url of the page, so it seems like each domain will get routed to the same shard?[/quote]

Reading https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html, I don't think so, unless by "URL of the page" you really mean "domain name of the page".
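To make that concrete, here is a rough Python sketch of the routing rule as that page describes it: the shard is picked by hashing the routing value (the `_id` by default) and taking it modulo the number of primary shards. The hash below is MD5 from the standard library, used purely as a stand-in and not Elasticsearch's actual hash function. The `shard_for` helper and the `example.com` URLs are made up for illustration.

```python
import hashlib
from collections import Counter

NUM_PRIMARY_SHARDS = 5  # the 5-shard index from the question


def shard_for(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard as hash(routing_value) % num_shards.

    Assumption: MD5 is only a stand-in for Elasticsearch's internal hash.
    The point is that the whole routing string is hashed, not a prefix.
    """
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


# Hypothetical pages that all belong to one domain.
urls = [f"https://example.com/page/{i}" for i in range(1000)]

# Routing on the full URL (the default when _id is the URL).
by_full_url = Counter(shard_for(u) for u in urls)

# Routing on the domain only (what custom routing by domain would do).
by_domain = Counter(shard_for("example.com") for _ in urls)

print("full URL as routing value:", dict(by_full_url))
print("domain as routing value:  ", dict(by_domain))
```

With the full URL as the routing value, pages from one domain should spread over all shards. Only when the domain itself is used as the routing value do they all land on a single shard.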

Thinking about this question independently of whether or not your particular setup actually suffers from this issue, here's what IMHO happens when sharding web pages by domain name:

Hope this helps,
Isabel


(Vaughn Dickson) #3

Thanks so much Isabel! That blog post really helped. Our _id is the full URL to the webpage and not just the domain, so I'm guessing we'll have a fairly decent spread across our shards. I'll investigate the hashing algo just in case.
And the global document frequency stats are most likely fairly similar across shards, so I don't think we're going to run into problems.
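For anyone wondering why similar per-shard document frequencies matter, here is a small Python sketch. It assumes a BM25-style IDF formula and uses invented per-shard counts (not our real numbers). Each shard scores with its own local statistics by default, so a term concentrated on one shard gets a very different IDF there than on the others.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style inverse document frequency (assumed for illustration)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_spread(shards):
    """Difference between the largest and smallest per-shard IDF."""
    values = [idf(n, df) for n, df in shards]
    return max(values) - min(values)


# Hypothetical (doc_count, doc_freq) pairs for one term on each of 5 shards.
# Balanced: the term is equally common everywhere (documents well shuffled).
balanced = [(40000, 400)] * 5

# Skewed: almost all matching documents sit on one shard
# (e.g. every page of one domain routed to the same shard).
skewed = [(40000, 1960), (40000, 10), (40000, 10), (40000, 10), (40000, 10)]

print("balanced spread:", idf_spread(balanced))
print("skewed spread:  ", idf_spread(skewed))
```

In the balanced case every shard computes the same IDF, so scores are comparable across shards. In the skewed case the same term is weighted very differently depending on which shard a document lives on.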

