Multi billion documents index

Adam_Smith1 · November 12, 2015, 12:30am

Dear community,

I want to use elasticsearch to index inlinks of websites, currently 2 Billion Websites and ~ 500 Billion links, but only 100 Million domains. My queries shall only focus inlinks by domain - is there any chance to exploit this natural partitioning?

I want to index fulltext of anchor-links and some integer and double properties, say in average 100 bytes per document. Some domains like Google can have up to billions of inlinks, other a very low number. I could maybe use a large number of shards to adress the 2B border, but I fear the non-linear scaling behaviour of searches when having all 500B inlinks in one index. And I have no idea how indexing would behave for that number of documents.

I was also thinking to use something like https://code.google.com/p/lucene-on-cassandra/ where I could create a separate index per domain. But I would really like to use ES, certainly.

Would ES filters lead to the same behaviour when working on scale? Means that only links/documents from a domain are considered and hence response times can be reduced? Is there any other solution to achieve my goals?

Thanks for any advice!
Adam

warkolm · November 12, 2015, 3:39am

An index per domain would mean a lot of indices, and you may run into this.

Maybe look at an index per country code/TLD, and use routing?

mainec · November 12, 2015, 11:31am

Your problem sounds very much like One Big User | Elasticsearch: The Definitive Guide [2.x] | Elastic to me. Though not focussed on web crawling maybe the post helps you anyhow.

Isabel