Hey guys, I'm running a 4-node ES cluster on EC2 m1.large instances with 7.5GB
of RAM each (the index is stored on /mnt/ ephemeral storage). My index is ~70GB
(with _source enabled; ~20GB with it disabled) and holds over 40M docs. We
index 100 new documents per minute. Some of our configuration (almost all
other settings are default):
On each machine we currently give 2.5GB to the JVM heap, and we haven't run
into out-of-memory errors or anything like that. The filter cache was
decreased to 10%.
Our "problem" is that some of our queries take < 50 ms while others take
around 600ms and up to 1 second. When we repeat a "slow" query seconds after
the first one, it is usually very fast (< 100 ms).
After some tests, it seems that queries for rarely-searched terms are the slow
ones. That is, when the inverted list has to be read from disk, the query is
slow.
My question has two parts:
1 - Can I "force" the whole inverted lists into memory? Or is all I can do to
hope the OS cache keeps them there? Without _source the whole dataset is 20GB,
which should fit across the four 7.5GB instances.
2 - Can I do anything else to make reading inverted lists from disk faster?
Our system has low search usage, so one of my concerns is that indexing is
forcing parts of the inverted lists out of the OS cache. Any help is
appreciated.
Not unless you disable _source and set the index store type to "memory"; but
then, if a node goes down, since you have no replicas, you'll lose data. The
OS, on the other hand, will cache whatever you read and write, and that
includes indices, fields and sources.
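For anyone reading later, a rough sketch of that in-memory store setting as it looked in 0.20-era Elasticsearch; the index name is made up, and you should check the store module docs for your version before relying on this:

```shell
# Create an index whose Lucene store lives on the heap instead of disk.
# WARNING: with zero replicas, a memory-backed index is lost if the node dies.
curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings": {
    "index.store.type": "memory",
    "index.number_of_replicas": 0
  }
}'
```

In practice, relying on the OS page cache (by leaving enough RAM outside the JVM heap) is the safer route the reply above is pointing at.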
On Monday, February 4, 2013 4:24:39 PM UTC+1, Felipe Hummel wrote:
Hey guys, I'm running a 4-node ES cluster. They're EC2 m1.large with 7.5GB
(index is stored in /mnt/ ephemeral storage). My index has ~70GB (with
_sources enabled, ~20GB with them disabled) with +40M docs. We index 100
new documents per minute. Some of our configuration (almost all other
settings are default):
In each machine, we're currently setting 2.5GB to the JVM, and we haven't
run into out of memory or anything alike. The filter cache was decreased to
10%.
Our "problem" is that some of our queries are taking < 50 ms and others
are taking around 600ms up to 1 second. When we repeat a "slow" query
seconds after the first one, the query is usually very fast (< 100 ms).
This sounds like a reopen problem. When you refresh (reopen) your index, the
first query against the new segments is likely to be slowish. You should try
the warmer API to load / warm your new segments before the first search hits
them (see the index warmers documentation).
After some tests, it seems that when searching for unused terms the
queries are slower. That is, when the inverted list is on disk queries are
slow.
This problem always arises: if you keep on merging / indexing in the
background, some postings lists might not be "hot" yet. I guess with
representative queries you can also warm your postings up pretty well.
My question has two parts:
1 - can I "force" the whole inverted lists to be in memory? Or all I can
do is hope the OS cache to keep them in memory? Without _sources the whole
dataset is 20GB which should fit in the four 7.5GB instances.
2 - Can I do something else to make the reading of inverted lists faster
from disk?
Our system has low search usage. So one of my concerns is that indexing is
forcing parts of the inverted lists out of the OS cache. Any help is
appreciated.
Yeah, the warmer API should help you, but note it's a 0.20 feature.
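Registering a warmer in 0.20 looked roughly like this; the index name, warmer name, field, and query below are illustrative, not taken from the poster's setup:

```shell
# Register a warmer that runs against new segments on each refresh,
# pulling the postings (and any facet data it touches) into cache
# before real searches hit them.
curl -XPUT 'http://localhost:9200/myindex/_warmer/warm_new_segments' -d '{
  "query": { "match_all": {} },
  "facets": {
    "popular_tags": { "terms": { "field": "tags" } }
  }
}'
```

The closer the warmer's query is to your real traffic, the more of the relevant postings lists it keeps hot.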
Thanks for the tips. One last thing: we only keep _source around to allow
updates. At search time we actually disable _source loading.
On 04/02/2013 17:48, "simonw" simon.willnauer@elasticsearch.com wrote: