I am trying to roll out a system for storing logs (similar to Logstash and
Graylog), where the amount of indexed log data has no upper bound. In my
experiments so far, I have observed that as I index more and more log data
(no searching yet), the number of open file descriptors and the memory
usage gradually grow, until they hit their limits and errors occur.
I have been experimenting on a fixed number of nodes (3 nodes, 8GB each for
Elasticsearch) with index rotation, different shard counts, segment sizes
and merging schemes. I saw an impact on resource usage, but the general
tendency of constant growth in the number of FDs and memory usage stayed
the same.
Is there any way to make Elasticsearch release excess FDs and memory, in a
similar fashion to an LRU cache, even if it comes at the expense of poorer
performance?
Both Logstash and Graylog simply suggest that you estimate the required
resources for the given amount of data and delete any excess (by deleting
old date-rotated indices). I would like to avoid removing this old data,
but don't mind if the data is not cached and always loaded from disk
on-demand so that it does not hold on to any resources. I don't even mind
if the whole system becomes 10 times slower, as long as it doesn't throw an
OOM or "Too many open files".
Fitting a quart in a pint pot, eh?
I suspect you should probably look into closing old (rarely-used) indices. This means that they'll continue to take up disk space, but won't consume file handles etc.
Note that you can't read or write to a closed index, you have to open it again - your application will have to manage that side of things, opening an index before querying it. Clint warns that this process can take a few seconds to a couple of minutes, so you'll need to manage users' expectations. But - it should probably help you avoid OOMs or other badness.
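For illustration, a minimal sketch of that close-when-idle / open-before-query flow (assumptions: a node on localhost:9200 and a made-up date-rotated index name; only the _close, _open and _search endpoints are the standard index APIs):

    import requests

    ES = "http://localhost:9200"
    index = "logs-2014.01.01"   # hypothetical date-rotated index name

    # Close an old index so it stops consuming file handles and heap.
    requests.post(ES + "/" + index + "/_close")

    # Before querying it again, the application has to reopen it first.
    requests.post(ES + "/" + index + "/_open")
    response = requests.post(ES + "/" + index + "/_search",
                             json={"query": {"match_all": {}}})
    print(response.json())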
Note that you can't read or write to a closed index, you have to open it
again ...
I have considered it, and it will definitely work for the indexing.
As for searching, it sounds a bit more tricky, especially in a multi-user
environment ... but probably doable. Opening an old index for searching may
load most of its data into memory, depending on the search query, so I'd
only be able to have one (or a limited number) of old indices open at a
time, making other search clients wait. Well, I guess that is the price I
have to pay, sigh.
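Something like this is roughly what I have in mind, sketch only (the limit of 2, the helper name and localhost:9200 are made up for illustration; a semaphore makes other search clients wait for a slot):

    import threading
    import requests

    ES = "http://localhost:9200"
    open_slots = threading.Semaphore(2)   # at most 2 old indices open at once

    def search_old_index(index, query):
        open_slots.acquire()               # other clients block here until a slot frees
        try:
            requests.post(ES + "/" + index + "/_open")
            r = requests.post(ES + "/" + index + "/_search", json={"query": query})
            return r.json()
        finally:
            requests.post(ES + "/" + index + "/_close")
            open_slots.release()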
Do your old-data queries need to be realtime?
You can always redirect those old indices to separate servers, which take
their time loading the needed data, and then lazy-load it / push it to your
frontend (and then close these indices again if they're not used within a
TTL, or if other indices need loading)... i.e., separate your unbounded
long tail from your current-data search.
Yes, you will (maybe) need more resources than your 3 machines, but you
shouldn't need as much as keeping everything in memory.
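Rough sketch of the TTL part (the one-hour TTL, the index names and the idea of tracking last access in the application are just assumptions for illustration; only the _close call is the actual Elasticsearch API):

    import time
    import requests

    ES = "http://localhost:9200"
    TTL = 3600                         # close indices idle for more than an hour
    last_used = {}                     # index name -> last access timestamp

    def touch(index):
        # call this whenever one of the long-tail indices is queried
        last_used[index] = time.time()

    def close_idle_indices():
        now = time.time()
        for index, ts in list(last_used.items()):
            if now - ts > TTL:
                requests.post(ES + "/" + index + "/_close")
                del last_used[index]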
No, realtimeness is not a requirement. I am not sure I understand how I
would set up those separate servers, though. What do you mean by
"redirect"? Do you mean moving the data for closed indices to a backup
storage of some sort?
Thanks.