Fitting inverted lists in memory

Hey guys, I'm running a 4-node ES cluster. The nodes are EC2 m1.large
instances with 7.5GB of RAM each (the index is stored on the /mnt/ ephemeral
storage). My index is ~70GB (with _source enabled, ~20GB with it disabled)
and holds 40M+ docs. We index 100 new documents per minute. Some of our
configuration (almost all other settings are default):

"number_of_shards": 8,
"number_of_replicas": 0,
"index.refresh_interval" : "5s",
"index.store.compress.stored" : "true"

On each machine we currently give the JVM 2.5GB, and we haven't run into
out-of-memory errors or anything similar. The filter cache was decreased to
10%.

Our "problem" is that some of our queries take < 50 ms while others take
from around 600 ms up to 1 second. When we repeat a "slow" query seconds
after the first run, it is usually very fast (< 100 ms).

After some tests, it seems that queries are slower when they search for
terms that have not been used recently. That is, queries are slow when the
inverted list has to be read from disk.

My question has two parts:
1 - Can I "force" all the inverted lists to be in memory? Or is all I can do
hope that the OS cache keeps them in memory? Without _source the whole
dataset is 20GB, which should fit in the four 7.5GB instances.
2 - Can I do something else to make reading the inverted lists from disk
faster?
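
As a rough check on the first question, the figures above can be run through a back-of-the-envelope calculation. This is only a sketch: it assumes that everything outside the 2.5GB JVM heap is available to the OS page cache, which overstates the real figure because the kernel and other processes need memory too. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope page cache budget, using the figures from this post.
NODES = 4
RAM_PER_NODE_GB = 7.5           # m1.large
JVM_HEAP_GB = 2.5               # heap given to Elasticsearch on each node
INDEX_WITH_SOURCE_GB = 70.0
INDEX_WITHOUT_SOURCE_GB = 20.0


def page_cache_budget_gb(nodes, ram_gb, heap_gb):
    """Upper bound on memory left for the OS page cache across the cluster.

    Assumption: everything outside the JVM heap is usable as page cache.
    """
    return nodes * (ram_gb - heap_gb)


budget = page_cache_budget_gb(NODES, RAM_PER_NODE_GB, JVM_HEAP_GB)
print("page cache budget: %.1f GB" % budget)
print("fits without _source:", INDEX_WITHOUT_SOURCE_GB <= budget)
print("fits with _source:", INDEX_WITH_SOURCE_GB <= budget)
```

Even under that optimistic assumption the budget only just matches the 20GB figure, so there is no headroom, and the ~70GB actually on disk while _source is kept is several times larger than what can be cached.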

Our system has low search usage. So one of my concerns is that indexing is
forcing parts of the inverted lists out of the OS cache. Any help is
appreciated.

Thanks

Felipe Hummel

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Felipe,

  1. Not unless you disable _source and set the store type to "memory", but
    then, since you have no replicas, you'll lose data if a node goes down.
    The OS, on the other hand, will cache whatever you read and write. That
    includes indices, fields and sources.

  2. Since you seem to have quite light indexing and searching, I think the
    Warmers API is what you need.

Put your typical queries in there and it should keep your caches warm.
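
For illustration, registering a warmer could look roughly like the sketch below. It assumes the `PUT /{index}/_warmer/{name}` endpoint form documented for the warmers API of that era, with an ordinary search request as the body; check the documentation for your version before relying on it. The host, index name, warmer name, field and term are all hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_warmer_request(base_url, index, warmer_name, search_body):
    """Build (but do not send) a PUT request that registers a warmer."""
    url = "%s/%s/_warmer/%s" % (base_url.rstrip("/"), index, warmer_name)
    data = json.dumps(search_body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical names: adjust host, index, warmer name, field and term.
warmer_body = {"query": {"term": {"body": "elasticsearch"}}}
request = build_warmer_request(
    "http://localhost:9200", "myindex", "typical_queries", warmer_body
)
print(request.get_method(), request.full_url)
# To actually register the warmer: urllib.request.urlopen(request)
```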

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

Hey,

On Monday, February 4, 2013 4:24:39 PM UTC+1, Felipe Hummel wrote:

> Our "problem" is that some of our queries are taking < 50 ms and others
> are taking around 600ms up to 1 second. When we repeat a "slow" query
> seconds after the first one, the query is usually very fast (< 100 ms).

This sounds like a reopen problem. When you refresh (reopen) your index, the
first query is likely to be slow. You should try the warmer API to load and
warm your new segments before the first search hits them.

> After some tests, it seems that when searching for unused terms the
> queries are slower. That is, when the inverted list is on disk queries are
> slow.

This problem always arises: if you keep merging and indexing in the
background, some postings lists might not be "hot" yet. I guess that with a
sensible set of queries you can also warm your postings up reasonably well.
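
One way to pick such queries, sketched here under stated assumptions, is to take the most frequent terms from a search log and combine them into a single bool/should query that can then be used as a warm-up search. The log contents and the field name are hypothetical, and lowercasing plus whitespace splitting is only a stand-in for whatever analyzer the index really uses.

```python
from collections import Counter


def build_warmup_query(logged_queries, field, top_n=3):
    """Build a bool/should query over the most frequent logged terms."""
    counts = Counter(
        term
        for query in logged_queries
        for term in query.lower().split()
    )
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    top_terms = [term for term, _ in ranked[:top_n]]
    return {
        "query": {
            "bool": {
                "should": [{"term": {field: term}} for term in top_terms]
            }
        }
    }


# Hypothetical search log and field name.
log = ["cheap flights", "flights to rio", "hotel rio", "cheap hotel rio"]
warmup = build_warmup_query(log, field="body", top_n=2)
print(warmup)
```

Note that this only touches the postings of terms that show up in the log; terms nobody has searched for yet would still be read from disk on first use.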

> My question has two parts:
> 1 - can I "force" the whole inverted lists to be in memory? Or all I can
> do is hope the OS cache to keep them in memory? Without _sources the whole
> dataset is 20GB which should fit in the four 7.5GB instances.
> 2 - Can I do something else to make the reading of inverted lists faster
> from disk?
>
> Our system has low search usage. So one of my concerns is that indexing is
> forcing parts of the inverted lists out of the OS cache. Any help is
> appreciated.

Yeah, the warmer API should help you, but note that it's a 0.20 feature.

simon

Thanks for the tips. One last thing: we only maintain _source to allow
updates. In the search we actually disable _source loading.

Does anyone know if this feature still works?

Telling Elasticsearch to cache specific (Lucene index) files makes sense as
a way to speed up operations.

It seems this could solve this problem somehow.

Are there any problems with that approach? Does it really work?
