Fitting Inverted list in memory

Felipe_Hummel · February 4, 2013, 3:24pm

Hey guys, I'm running a 4-node ES cluster on EC2 m1.large instances with
7.5GB of RAM each (the index is stored on /mnt/ ephemeral storage). My index
is ~70GB with _source enabled (~20GB with it disabled) and holds 40M+ docs.
We index 100 new documents per minute. Some of our configuration (almost all
other settings are default):

"number_of_shards": 8,
"number_of_replicas": 0,
"index.refresh_interval" : "5s",
"index.store.compress.stored" : "true"

On each machine we currently give the JVM 2.5GB, and we haven't run into
out-of-memory errors or anything similar. The filter cache was decreased to
10%.

Our "problem" is that some of our queries are taking < 50 ms and others are
taking around 600ms up to 1 second. When we repeat a "slow" query seconds
after the first one, the query is usually very fast (< 100 ms).

After some tests, it seems that queries are slower when they search for
terms that have not been used recently. That is, queries are slow when the
inverted list has to be read from disk.

My question has two parts:
1 - Can I "force" the whole inverted lists to be in memory? Or is all I can
do hope that the OS cache keeps them in memory? Without _source the whole
dataset is 20GB, which should fit in the four 7.5GB instances.
2 - Can I do something else to make reading inverted lists from disk
faster?
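
As a rough check of the sizing in question 1, the sketch below works out how much RAM is left for the OS file cache once the JVM heaps are subtracted. The node count, RAM, heap and index sizes come from the post; the `os_overhead_gb` parameter is a hypothetical allowance for the kernel and other processes, not a measured value.

```python
# Back-of-the-envelope page-cache budget for the cluster described above.
NODES = 4
RAM_PER_NODE_GB = 7.5
JVM_HEAP_PER_NODE_GB = 2.5
INDEX_WITHOUT_SOURCE_GB = 20.0


def page_cache_budget_gb(nodes, ram_gb, heap_gb, os_overhead_gb=0.0):
    """Upper bound on the RAM available to the OS file cache, cluster-wide.

    os_overhead_gb is an assumed per-node allowance for everything that is
    neither JVM heap nor file cache.
    """
    return nodes * (ram_gb - heap_gb - os_overhead_gb)


budget = page_cache_budget_gb(NODES, RAM_PER_NODE_GB, JVM_HEAP_PER_NODE_GB)
print(f"total RAM:         {NODES * RAM_PER_NODE_GB} GB")
print(f"page-cache budget: {budget} GB")
print(f"index w/o _source: {INDEX_WITHOUT_SOURCE_GB} GB")
print("fits with headroom:", budget > INDEX_WITHOUT_SOURCE_GB)
```

Under these assumptions the file cache can hold at most about as much as the index without _source, so there is no headroom, and any _source data read or written during indexing competes for the same cache.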

Our system has low search usage. So one of my concerns is that indexing is
forcing parts of the inverted lists out of the OS cache. Any help is
appreciated.

Thanks

Felipe Hummel

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

radu_gheorghe · February 4, 2013, 5:09pm

Hi Felipe,

You can't force that unless you disable _source and set the store to
"memory", but then, since you have no replicas, you'll lose data if a node
goes down. The OS, on the other hand, will cache whatever you read and
write. That includes indices, fields and sources.

Since you seem to have quite light indexing and searching, I think the
Warmers API is what you need.

Put in your typical queries there and it should keep your caches warm.
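
A minimal sketch of registering such a warmer, assuming the 0.20-era endpoint shape `PUT /{index}/_warmer/{name}` with a search body (warmers were removed from later Elasticsearch releases). The host, index name, warmer name, field and terms below are all hypothetical placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_warmer_request(base_url, index, warmer_name, field, terms):
    """Build (but do not send) a PUT request that registers a warmer.

    The warmer body is an ordinary search request; here it is a bool query
    with one term clause per warm-up term.
    """
    body = {
        "query": {
            "bool": {
                "should": [{"term": {field: term}} for term in terms]
            }
        }
    }
    url = f"{base_url}/{index}/_warmer/{warmer_name}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values; replace with your own host, index, field and terms.
req = build_warmer_request(
    "http://localhost:9200", "myindex", "typical_terms",
    "body", ["elasticsearch", "lucene"],
)
print(req.get_method(), req.full_url)
# To register it against a running cluster: urllib.request.urlopen(req)
```

The terms you list decide which postings get touched on each new segment, so they should be drawn from the queries you actually expect.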

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


simonw_2 · February 4, 2013, 7:48pm

Hey,


The slow-first, fast-on-repeat pattern sounds like a reopen problem. When
you refresh (reopen) your index, the first query is likely to be slowish.
You should try the warmer API to load / warm your new segments before the
first search hits them.


The slowness on rarely searched terms always arises: if you keep merging /
indexing in the background, some postings lists might not be "hot" yet. I
guess with reasonable queries you can also warm your postings up pretty well.


Yeah, the warmer API should help you, but note it's a 0.20 feature.

simon


Felipe_Hummel · February 4, 2013, 9:13pm

Thanks for the tips. One last thing: we only keep _source to allow
updates. At search time we actually disable _source loading.

Onilton_Maciel · February 5, 2013, 11:35pm

Does anyone know if this feature still works?

github.com/elastic/elasticsearch

Index FS Store: Allow to cache (in memory) specific files

kimchy · opened 22 Mar 2010, 03:08PM UTC · closed 22 Mar 2010, 03:09PM UTC · >feature v0.06.0

When using file system based storage, caching specific (lucene index) files make sense to speed up operations. The ability to specify which files to cache in memory (with all the parameters the memory store provides) based on their suffix should be provided. This allows, for example, to store term freq in memory, while keeping the field store on disk (which takes much more space). The configuration should look something like this:

```
index:
  store:
    fs:
      memory:
        enabled: true
        extensions: ["", "del", "gen"]
```

Telling Elasticsearch to cache specific (Lucene index) files makes sense
to speed up operations.

It seems this could solve this problem somehow.

Any problems with that approach? Does it really work?

