Hi, I need to scroll over a billion docs, sorted by a single field Foo. If I store this field Foo (and possibly, though I guess not necessarily, all the other fields retrieved by the scroll) as "doc_values", is it going to work? Currently I can do a sorted scroll over about 10-20 million documents; any more than that causes an OutOfMemory error.
I currently have a billion documents across 100 indices (500 shards) on 3 nodes.
Using doc_values is probably the only way this will work, since it doesn't load the field values into heap memory. The OOM you got was probably caused by pulling too many field values into the field data cache and blowing it up. Doc_values keep the values on disk (served through the OS filesystem cache), so heap usage shouldn't be a worry.
Just remember that after changing the mapping format for the Foo field you will have to re-index all of your data so that the doc_values get written for the existing documents.
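For reference, here is a minimal sketch of what such a mapping body could look like on ES 1.x, built as a plain Python dict. The type name "doc" and the field type "long" are assumptions made for illustration (the thread does not say what type Foo is); as far as I know, on 1.x doc_values only apply to numeric, date and not_analyzed string fields.

```python
import json
from typing import Any, Dict


def doc_values_mapping(type_name: str, field: str, field_type: str) -> Dict[str, Any]:
    """Build a put-mapping body that stores `field` as doc_values (ES 1.x style)."""
    return {
        type_name: {
            "properties": {
                field: {
                    "type": field_type,
                    # ES 1.x fielddata format that keeps values on disk
                    # instead of loading them into the field data cache.
                    "fielddata": {"format": "doc_values"},
                }
            }
        }
    }


# "doc" and "long" are hypothetical; substitute your own type name and field type.
mapping = doc_values_mapping("doc", "Foo", "long")
print(json.dumps(mapping, indent=2))
```

The body would go into the mapping of a freshly created index, with the data then re-indexed into it.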
I have personally used doc_values for queries hitting 1 billion+ documents and it worked fine.
I just reindexed 300 million docs in 100 indices (500 shards total) on 3 nodes (c4.large instances) with doc_values on 4 fields. I'm sort-scrolling over those 300 million documents with just those 4 fields loaded, and it would take something like 720 hours with CPU at 100%... There seems to be enough heap memory, but I suspect the remaining 1.8GB of off-heap OS memory might be the bottleneck... Also
Hi, I cannot reproduce the OOME stack trace, since I reindexed everything to doc_values and now it is just slow: CPUs at 100% and no other obvious bottleneck like IO wait or GC choking. Probably more heap memory would help. We have 300 million documents in 100 indices with 5 shards each, on 3 nodes; each node is a c4.large instance and each ES instance has 2048MB of RAM. It is otherwise the default configuration, and I'm not sure whether the field data cache would improve things radically.
I'm using ES 1.5.1
However, I tried setting the scroll batch size to 2, 10, 40 and up to 200, which should roughly correspond to 0.1 to 5MB after multiplying by the number of shards the request might hit. Not more... And it is still insanely slow. I played with it all day, and a sorted scroll over 300 million documents with just 4 doc_values fields loaded would take hundreds of hours. It was scanning about 100-400 documents per second, not more.
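To put the "hundreds of hours" figure in numbers, here is a quick back-of-the-envelope sketch. The helper name is made up for illustration; the document count and the scan rates are the ones quoted above.

```python
def estimated_hours(total_docs: int, docs_per_second: float) -> float:
    """Hours needed to scroll `total_docs` at a steady `docs_per_second`."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return total_docs / docs_per_second / 3600.0


TOTAL_DOCS = 300_000_000  # 300 million documents

for rate in (100, 400):
    print(f"{rate} docs/s -> {estimated_hours(TOTAL_DOCS, rate):.0f} hours")
# 100 docs/s -> 833 hours
# 400 docs/s -> 208 hours
```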
I'm actually not sure whether the EC2 c4.large instance storage has any filesystem cache and how big it is. But it most probably does; it wouldn't make sense otherwise. We are using instance storage, not EBS.
Btw I started calling optimize before scrolling and I increased the scroll search size, and now it seems to be scrolling 1000 small docs per second, which might be sufficient! Thanks
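For anyone landing here later, a rough sketch of the sorted-scroll loop described above. It is not tied to any client library: `start` and `fetch_next` are hypothetical callables you would wire to your own HTTP client (the initial search with a `scroll` timeout, and the follow-up scroll request). The sketch only assumes the usual response shape with `_scroll_id` and `hits.hits`; the batch size of 1000 is just an example value.

```python
from typing import Any, Callable, Dict, Iterator, List


def build_sorted_scroll_body(sort_field: str, fields: List[str], batch_size: int) -> Dict[str, Any]:
    """Search body for a scroll sorted by one field, returning only `fields`."""
    return {
        "query": {"match_all": {}},
        "sort": [{sort_field: {"order": "asc"}}],
        "fields": fields,
        "size": batch_size,
    }


def scroll_all(
    start: Callable[[], Dict[str, Any]],
    fetch_next: Callable[[str], Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield every hit, following scroll ids until a page comes back empty."""
    page = start()
    while True:
        hits = page["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        # Always use the scroll id from the most recent response.
        page = fetch_next(page["_scroll_id"])


# Hypothetical usage; the callables stand in for real HTTP requests:
# body = build_sorted_scroll_body("Foo", ["Foo"], 1000)
# for hit in scroll_all(lambda: my_search(body), my_scroll):
#     process(hit)
```

Larger batches mean fewer round trips, which is presumably why increasing the size helped here.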