I want to use elasticsearch to index inlinks of websites, currently 2 Billion Websites and ~ 500 Billion links, but only 100 Million domains. My queries shall only focus inlinks by domain - is there any chance to exploit this natural partitioning?
I want to index fulltext of anchor-links and some integer and double properties, say in average 100 bytes per document. Some domains like Google can have up to billions of inlinks, other a very low number. I could maybe use a large number of shards to adress the 2B border, but I fear the non-linear scaling behaviour of searches when having all 500B inlinks in one index. And I have no idea how indexing would behave for that number of documents.
I was also thinking to use something like https://code.google.com/p/lucene-on-cassandra/ where I could create a separate index per domain. But I would really like to use ES, certainly.
Would ES filters lead to the same behaviour when working on scale? Means that only links/documents from a domain are considered and hence response times can be reduced? Is there any other solution to achieve my goals?
Thanks for any advice!