Slow results retrieval


(TDZ) #1

I have 5 billion docs in 1 index with 5 shards, all on one machine; each shard has 1 segment.
ES is running on 4 CPUs with 26GB of RAM, 18GB of which is heap for ES.
Each doc has 4 ints and 2 floats.
I'm running a query that uses a range query on one field and a terms query on another; both are in the filter part of a bool query.
Then I try to retrieve 500k results using the scroll API with 5 slices (I have 5 threads running at the same time, one per slice). I'm fetching only one int field per result using _source, with a 9K page size, and I'm using the transport client for Java.
It's taking me ~50 seconds. Does that make sense, or am I doing something wrong?
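
For readers following along, here is a rough sketch of what one per-slice search body for the setup described above might look like. It is written as plain Python dicts rather than Java transport client calls, and the field names (`int_a`, `int_b`, `float_a`) and the filter values are hypothetical placeholders, since the actual mapping is not given in the thread.

```python
import json


def build_slice_request(slice_id, max_slices, page_size=9000):
    """Build the search body for one slice of a sliced scroll.

    Field names and filter values are assumptions for illustration only.
    """
    return {
        # Sliced scroll: each of the 5 threads gets its own slice id.
        "slice": {"id": slice_id, "max": max_slices},
        # Page size per scroll round trip (9K in the post above).
        "size": page_size,
        # Fetch a single int field from _source.
        "_source": ["int_a"],
        "query": {
            "bool": {
                # Both clauses sit in filter context, so no scoring is done.
                "filter": [
                    {"range": {"float_a": {"gte": 0.0, "lt": 1.0}}},
                    {"terms": {"int_b": [1, 2, 3]}},
                ]
            }
        },
    }


# One request body per slice, matching the 5 threads described above.
requests = [build_slice_request(i, 5) for i in range(5)]
print(json.dumps(requests[0], indent=2))
```

Each body would be sent as the initial search of its own scroll, and the returned scroll id is then used to page through that slice.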


(Christian Dahlqvist) #2

It is generally recommended to give no more than 50% of available RAM to heap. Elasticsearch requires off-heap memory for optimal performance.
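
As a quick check of that guideline against the numbers in the first post (26GB RAM, 18GB heap), the arithmetic can be sketched as follows. The 50% figure is the rule of thumb stated above, not a hard limit.

```python
total_ram_gb = 26   # RAM on the machine, from the first post
heap_gb = 18        # heap currently given to ES, from the first post

# Rule of thumb from this post: heap should be at most 50% of RAM.
recommended_max_heap_gb = total_ram_gb * 0.5
heap_share = heap_gb / total_ram_gb
off_heap_gb = total_ram_gb - heap_gb

print(f"recommended max heap: {recommended_max_heap_gb:.0f}GB")
print(f"current heap share:   {heap_share:.0%}")
print(f"left for off-heap:    {off_heap_gb}GB")
```

With these inputs the heap takes roughly 69% of RAM, leaving 8GB for everything outside the heap.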

What if you instead fetch the full document, so that Elasticsearch does not need to parse it?


(TDZ) #3

Done both, didn't change anything.


(Christian Dahlqvist) #4

What do disk I/O and iowait look like during retrieval?


(TDZ) #5

I don't know about iowait, but I assume it's not a problem since I'm on an SSD. I/O is 300 IOPS and ~38MB/s at the highest spike.
But I think that's beside the point. I want to use the search described above for real-time use, so I wonder what a reasonable expectation is. For example, can ES provide 100k results in 2-3 seconds? What about 500k?
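
To put those targets next to the observed numbers, here is a small back-of-the-envelope sketch. It uses only figures from this thread (500k results in ~50 seconds, 5 slices, 9K page size) and assumes the results are split evenly across slices, which may not hold in practice.

```python
import math


def docs_per_second(n_docs, seconds):
    """Average retrieval rate in documents per second."""
    return n_docs / seconds


# Observed: 500k results in ~50 seconds.
observed = docs_per_second(500_000, 50)

# Scroll round trips per slice, assuming an even split over 5 slices.
pages_per_slice = math.ceil(500_000 / 5 / 9_000)

# Targets asked about above, and the speedup each one would need.
targets = {
    "100k in 3 s": docs_per_second(100_000, 3),
    "100k in 2 s": docs_per_second(100_000, 2),
    "500k in 3 s": docs_per_second(500_000, 3),
    "500k in 2 s": docs_per_second(500_000, 2),
}
speedups = {name: rate / observed for name, rate in targets.items()}

print(f"observed: {observed:,.0f} docs/s, {pages_per_slice} pages per slice")
for name, rate in targets.items():
    print(f"{name}: {rate:,.0f} docs/s ({speedups[name]:.1f}x observed)")
```

On these assumptions the current rate is about 10k docs/s, so the 100k targets need roughly a 3x to 5x improvement and the 500k targets roughly 17x to 25x.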