We have a search index consisting of 206735 web pages (7gb) stored on a Found 2 node cluster with 5 shards. Our _id is the url of the page, so it seems like each domain will get routed to the same shard?
Will this cause us problems with relevancy scoring? I don't fully understand the sparse statistics problem yet.
Should we be using custom routing to shuffle documents across shards more, and/or use less shards?
[quote="Vaughn_Dickson, post:1, topic:33227"]
We have a search index consisting of 206735 web pages (7gb) stored on a Found 2 node cluster with 5 shards. Our _id is the url of the page, so it seems like each domain will get routed to the same shard?[/quote]
Thinking about this question independently of whether or not your particular setup actually suffers from this issue here's what IMHO happens when sharding web pages by domain name:
If we are talking about indexing crawled web pages you will run into different problem though: Looking at the number of pages per domain there's plenty of domains out there that publish only a handful of pages. However there's a few domains (think yahoo.com and the like) that publish tons of pages. As a result I imagine you'd be running into the "one big user" issue described here: One Big User | Elasticsearch: The Definitive Guide [2.x] | Elastic
Thanks so much Isabel! That blog post really helped. Our _id is the full URL to the webpage and not just the domain, so I'm guessing we'll have a fairly decent spread across our shards. I'll investigate the hashing algo just in case.
And the global document frequency stats are most likely fairly similar across shards, so I don't think we're going to run into problems.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.