Multi-billion document index

(Adam Smith) #1

Dear community,

I want to use Elasticsearch to index the inlinks of websites: currently 2 billion websites and ~500 billion links, but only 100 million domains. My queries will only look at inlinks by domain. Is there any chance to exploit this natural partitioning?

I want to index the full text of anchor links plus some integer and double properties, say 100 bytes per document on average. Some domains like Google can have billions of inlinks, others very few. I could maybe use a large number of shards to address the 2B documents-per-shard limit, but I fear non-linear scaling of searches when all 500B inlinks sit in one index. And I have no idea how indexing would behave for that number of documents.
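
For a rough sense of scale, the figures in this post can be plugged into a quick back-of-the-envelope calculation. This is only a sketch: it assumes the hard Lucene limit of 2,147,483,519 documents per shard and counts raw document bytes only, ignoring index overhead, replicas, and compression. All variable names are illustrative.

```python
import math

# Figures taken from the post above.
total_links = 500_000_000_000      # ~500 billion inlink documents
total_domains = 100_000_000        # ~100 million domains
avg_doc_bytes = 100                # stated average document size

# Assumption: Lucene's hard cap on documents in a single index (= one ES shard).
LUCENE_MAX_DOCS = 2_147_483_519

# Lower bound on primary shards, if every shard were filled to the cap.
min_shards = math.ceil(total_links / LUCENE_MAX_DOCS)

# Raw payload size, before any index overhead or replication.
raw_bytes = total_links * avg_doc_bytes
raw_terabytes = raw_bytes / 10**12

# Mean inlinks per domain; the real distribution is heavily skewed.
avg_links_per_domain = total_links // total_domains

print(min_shards, raw_terabytes, avg_links_per_domain)
```

In practice shards are kept far below the hard cap, so the real shard count would be considerably higher than this lower bound.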

I was also thinking of using something like where I could create a separate index per domain. But I would really prefer to use ES.

Would ES filters give the same effect at scale, meaning that only links/documents from one domain are considered and response times drop accordingly? Is there any other way to achieve my goals?

Thanks for any advice!

(Mark Walkom) #2

An index per domain would mean a lot of indices, and you may run into this.

Maybe look at an index per country code/TLD, and use routing?
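
A minimal sketch of what that could look like, building the request payloads as plain Python data so it runs without a cluster. The index naming scheme and the field names `target_domain` and `anchor_text` are assumptions for illustration, not taken from the thread, and the TLD extraction is naive (it just takes the last label, so `example.co.uk` ends up under `uk`). Older Elasticsearch versions spell the bulk metadata key `_routing` instead of `routing`.

```python
import json


def tld_index(domain: str, prefix: str = "inlinks") -> str:
    """Hypothetical naming scheme: one index per top-level domain."""
    tld = domain.lower().rstrip(".").rsplit(".", 1)[-1]
    return f"{prefix}-{tld}"


def bulk_lines(link: dict) -> str:
    """Two NDJSON lines for the _bulk API: action metadata, then the document.

    Routing by target domain keeps all inlinks of one domain on one shard.
    """
    domain = link["target_domain"]
    action = {"index": {"_index": tld_index(domain), "routing": domain}}
    return json.dumps(action) + "\n" + json.dumps(link) + "\n"


def search_request(domain: str, anchor_query: str) -> dict:
    """Describe a search that only visits the shard holding this domain.

    The term filter is still required, because other domains hash to the
    same shard. It assumes target_domain is mapped as a non-analyzed field.
    """
    return {
        "path": f"/{tld_index(domain)}/_search",
        "params": {"routing": domain},
        "body": {
            "query": {
                "bool": {
                    "must": [{"match": {"anchor_text": anchor_query}}],
                    "filter": [{"term": {"target_domain": domain}}],
                }
            }
        },
    }


link = {"target_domain": "google.com", "anchor_text": "search engine", "weight": 0.7}
print(bulk_lines(link), end="")
print(json.dumps(search_request("google.com", "search engine"), indent=2))
```

One consequence of routing by domain is that a domain with billions of inlinks lands entirely on a single shard, so very large domains may need separate treatment.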

(Isabel Drost-Fromm) #3

Your problem sounds very much like to me. Though it is not focused on web crawling, maybe the post helps you anyhow.
