Checking for sparse statistics and problems with shard routing

Vaughn_Dickson · October 29, 2015, 8:48am

Hi,

We have a search index consisting of 206735 web pages (7gb) stored on a Found 2 node cluster with 5 shards. Our _id is the url of the page, so it seems like each domain will get routed to the same shard?
Will this cause us problems with relevancy scoring? I don't fully understand the sparse statistics problem yet.

Should we be using custom routing to shuffle documents across shards more, and/or using fewer shards?

Kind regards,
Vaughn Dickson

mainec · October 29, 2015, 10:24am

[quote="Vaughn_Dickson, post:1, topic:33227"]
We have a search index consisting of 206735 web pages (7gb) stored on a Found 2 node cluster with 5 shards. Our _id is the url of the page, so it seems like each domain will get routed to the same shard?[/quote]

Reading the _routing field documentation (_routing field | Elasticsearch Guide [8.11]), I don't think so, unless by "URL of the page" you really mean "domain name of the page".

Thinking about this question independently of whether or not your particular setup actually suffers from this issue, here's what IMHO happens when sharding web pages by domain name:

For scoring, you only run into issues if the distribution of your terms across shards is highly uneven. For more information on how this affects scoring, see the Elastic blog post Understanding "Query Then Fetch" vs "DFS Query Then Fetch".
If we are talking about indexing crawled web pages, you will run into a different problem though: looking at the number of pages per domain, there are plenty of domains out there that publish only a handful of pages. However, there are a few domains (think yahoo.com and the like) that publish tons of pages. As a result, I imagine you'd be running into the "one big user" issue described in the "One Big User" section of Elasticsearch: The Definitive Guide [2.x].
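
To make the difference between routing by full URL and routing by domain concrete, here is a rough simulation. It assumes the documented rule that a document's shard is picked as hash(_routing) % number_of_primary_shards, with _routing defaulting to _id. The MD5 hash, the helper names (shard_for, shard_counts), and the example URLs are all stand-ins made up for illustration; Elasticsearch uses its own hash function, so the actual shard numbers will differ and only the shape of the spread is meaningful.

```python
import hashlib
from collections import Counter
from urllib.parse import urlparse

NUM_SHARDS = 5  # matches the index described in the question


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard as hash(routing) % number of primary shards.

    MD5 is only a stand-in: Elasticsearch uses a different hash,
    so real shard numbers will not match these.
    """
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def shard_counts(urls, key):
    """Count documents per shard when routing on key(url)."""
    counts = Counter(shard_for(key(u)) for u in urls)
    return [counts.get(s, 0) for s in range(NUM_SHARDS)]


# Hypothetical crawl: one big domain plus many small ones.
urls = [f"http://big.example/page/{i}" for i in range(2000)]
urls += [f"http://site{d}.example/page/{i}" for d in range(50) for i in range(5)]

by_url = shard_counts(urls, key=lambda u: u)                      # _id = full URL
by_domain = shard_counts(urls, key=lambda u: urlparse(u).netloc)  # routing = domain

print("routed by full URL:", by_url)
print("routed by domain:  ", by_domain)
```

With the full URL as the routing value, the pages of the big domain are scattered over all shards. With the domain as the routing value, every page of the big domain lands on one shard, which is the "one big user" situation.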

Hope this helps,
Isabel

Vaughn_Dickson · October 30, 2015, 7:50am

Thanks so much, Isabel! That blog post really helped. Our _id is the full URL of the webpage and not just the domain, so I'm guessing we'll have a fairly decent spread across our shards. I'll investigate the hashing algorithm just in case.
The document frequency stats are most likely fairly similar across shards as well, so I don't think we're going to run into problems.
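
A quick way to reason about whether per-shard statistics matter is to compare the IDF a term would get on each shard with the IDF computed over the whole index. The sketch below assumes a classic TF/IDF-style formula, 1 + ln(numDocs / (docFreq + 1)); the similarity actually in use on a given cluster may differ. The document counts are invented for illustration and the function names are hypothetical.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the style of classic TF/IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def idf_report(shard_stats):
    """Return (per-shard IDFs, IDF over the summed statistics).

    shard_stats is a list of (docs containing the term, docs in shard).
    """
    per_shard = [classic_idf(df, n) for df, n in shard_stats]
    total_df = sum(df for df, _ in shard_stats)
    total_docs = sum(n for _, n in shard_stats)
    return per_shard, classic_idf(total_df, total_docs)


# Invented numbers: 5 shards of 40,000 docs, term present in 1,000 docs overall.
even = [(200, 40_000)] * 5                        # term spread evenly
skewed = [(960, 40_000)] + [(10, 40_000)] * 4     # term concentrated on one shard

for name, stats in (("even", even), ("skewed", skewed)):
    per_shard, overall = idf_report(stats)
    spread = max(per_shard) - min(per_shard)
    print(f"{name}: per-shard={[round(x, 2) for x in per_shard]} "
          f"overall={overall:.2f} spread={spread:.2f}")
```

When the term is spread evenly, every shard arrives at the same IDF and scores are comparable across shards. When it is concentrated on one shard, the per-shard values diverge widely, which is the case the DFS Query Then Fetch search type is meant to address.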
