On Tuesday, February 12, 2013 3:11:45 PM UTC-5, Justin wrote:
Hello all,
I'm wondering if there are any pointers you could give me for search
queries that return large result sets. The number of hits could be anywhere
from 10,000 to 2,000,000. As indicated by the slow log, the time-consuming
part is fetching the data. I presume this fetch phase also includes sorting
the data?
The initial invocation of a query may take upwards of 5 minutes; if I then
run the same query again, it returns in under 500 ms.
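For reference, the fetch phase can be singled out in the slow log by lowering its thresholds through the dynamic index settings API; a sketch, where the index name "myindex" and the threshold values are placeholders:

```shell
# Lower the slow-log thresholds for one index so slow fetch and query
# phases are logged separately ("myindex" is a placeholder).
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index.search.slowlog.threshold.fetch.warn": "10s",
  "index.search.slowlog.threshold.fetch.info": "5s",
  "index.search.slowlog.threshold.query.warn": "10s"
}'
```

This requires a running cluster, so it is shown only as a settings fragment.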
On Tuesday, February 12, 2013 11:49:49 PM UTC-5, Otis Gospodnetic wrote:
Hi,
Yes. If you know you will be running a query X, and it is expensive, loads
data that will be reused by other queries, or initializes some data
structures, then using such a query for warming up is a good idea.
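A minimal sketch of that warming approach, assuming a node on localhost and a made-up index and field: fire the known-expensive query once after startup, so its data is loaded before real traffic arrives.

```shell
# Run the expensive query once at startup to warm caches.
# "myindex" and the query body are placeholders for your own.
curl -s -XPOST 'http://localhost:9200/myindex/_search' -d '{
  "query": { "match": { "body": "common warmup terms" } }
}' > /dev/null
```

This, too, assumes a live cluster and is only an illustration of the pattern, not a ready-made command.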
On 13.02.13 07:18, Justin wrote:
Unfortunately, we do not know the queries beforehand. By the way, they are
text queries.
We had to act fast. It may not be pretty, but this is the current solution
we're using to "force" the data into the file system cache:
find /var/lib/elasticsearch/data -type f -exec cat {} \; > /dev/null
We've yet to see any query take more than 500 ms, so it seems to be working
well.
Any thoughts on this approach?
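That one-liner only works if the semicolon is escaped (an unescaped ';' terminates the shell command and find aborts with a missing-argument error). A sketch against a throwaway directory, where DATA_DIR stands in for /var/lib/elasticsearch/data; using '+' instead of '\;' batches many files into each cat invocation instead of forking cat per file:

```shell
# Create a stand-in data directory with a couple of files.
DATA_DIR=$(mktemp -d)
printf 'segment data' > "$DATA_DIR/_0.cfs"
printf 'more data'    > "$DATA_DIR/_1.cfs"

# Read every regular file once so its pages land in the OS page cache.
# '+' passes many files per cat invocation; '\;' would fork cat per file.
find "$DATA_DIR" -type f -exec cat {} + > /dev/null

files=$(find "$DATA_DIR" -type f | wc -l)
echo "primed $((files)) files"
```

The redirection to /dev/null applies to find itself, and cat inherits it, so nothing is printed while the pages are pulled in.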
Your "find loop" loads everything from the data folder, but only once, for
the "cat" process, including files that may not even be used by your
Elasticsearch workload. In contrast, mmapfs reads and keeps just the
relevant files in memory and lets the OS virtual memory subsystem manage
the cache. Together with bootstrap.mlockall: true, such a page cache will
stay in RAM until eviction (mostly at exit of the Elasticsearch process,
assuming enough RAM), while the files loaded by your "find loop" are only
read into RAM once before they are evicted, so the find command would have
to be repeated over and over during the lifetime of the ES process, and
that would bog down your overall system performance.
If you want another "no need to think" solution, you could also use ZFS
with L2ARC (adaptive replacement cache, intro http://dtrace.org/blogs/brendan/2008/07/22/zfs-l2arc/ ), where you don't
have to tinker with resources like RAM, fs cache, disks, SSDs and so on;
ZFS manages them for you.
Best regards,
Jörg
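For reference, the two settings Jörg mentions are plain elasticsearch.yml entries (0.90-era names; later versions rename them):

```yaml
# elasticsearch.yml fragment: memory-map the index store and lock
# process memory so the OS does not swap it out.
index.store.type: mmapfs
bootstrap.mlockall: true
```

Note that mlockall only takes effect if the memlock ulimit for the Elasticsearch user is raised (e.g. `ulimit -l unlimited`); otherwise the lock attempt fails and is logged at startup.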