Is it problematic to create a large number of indices?
Because the memory drain on my machine is growing with the number of documents I've placed in ES (I've logged some OutOfMemory errors lately), I am working on a purging strategy that is based on expiration dates. To create maximum flexibility in my purging schedule, I am creating a separate index for each calendar date and document source. Indices are named like "2011-11-08-twitter"; if twitter is scheduled for a weekly purge, then I would delete this index on Nov. 15.
I can see this means I'll have about 175 indices at any given time (25 sources * 7 days). And the number will grow as I add more sources. I considered subdividing the indices into separate types according to source name, but then I'd be forced to purge using delete-by-query, which others have advised against. I want to simply delete an entire index.
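For concreteness, a minimal sketch of what that daily purge could boil down to against the ES REST API (assuming a node on localhost:9200, the third-party Python requests library, and made-up source names and retention windows):

import datetime
import requests  # third-party HTTP client, assumed installed

ES = "http://localhost:9200"          # hypothetical local node
RETENTION = {"twitter": 7}            # source name -> days to keep (example values)

def purge_expired_indices(today=None):
    """Delete the per-day, per-source index that has just fallen out of its retention window."""
    today = today or datetime.date.today()
    for source, keep_days in RETENTION.items():
        expired = today - datetime.timedelta(days=keep_days)
        index = "%s-%s" % (expired.strftime("%Y-%m-%d"), source)
        # One DELETE drops the whole index (e.g. "2011-11-08-twitter") in a single call.
        resp = requests.delete("%s/%s" % (ES, index))
        print(index, resp.status_code)

purge_expired_indices()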
On Tue, 2011-11-08 at 19:06 -0800, searchersteve wrote:
You are correct: purging by deleting old indices is much more efficient
than using delete-by-query, which needs to (1) mark all the old docs as
deleted and (2) expunge those deleted docs by merging (optimizing) to new segments.
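Roughly, on the ES API of that era, the delete-by-query route would involve something like the sketch below (endpoint names as they existed in the 0.x releases; the "tweets" index, field name and query are made up; assumes a node on localhost:9200 and the Python requests library), versus the single call needed to drop a whole dated index:

import requests  # third-party HTTP client, assumed installed

ES = "http://localhost:9200"

# Step 1: mark the matching old docs as deleted (old delete-by-query endpoint).
requests.delete(ES + "/tweets/_query", params={"q": "created:[* TO 2011-11-08]"})

# Step 2: expunge the deleted docs by merging to new segments (old _optimize endpoint).
requests.post(ES + "/tweets/_optimize", params={"only_expunge_deletes": "true"})

# By contrast, dropping an entire dated index is a single, cheap operation:
requests.delete(ES + "/2011-11-08-twitter")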
Each index consists of 1 or more primary shards, and each shard has zero
or more replica shards.
Each shard is a Lucene instance, and thus there is a memory cost for
each shard that you have.
So two things I'd suggest:

1. Only have one primary shard for each index, and probably just 1 replica shard (see the sketch below).
2. Close indices that you don't actually need access to.
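As a minimal sketch of that first suggestion (assuming a node on localhost:9200 and the Python requests library; the index name just follows the date-source scheme from the question):

import requests  # third-party HTTP client, assumed installed

ES = "http://localhost:9200"

settings = {
    "settings": {
        "number_of_shards": 1,    # one primary shard per small, per-day index
        "number_of_replicas": 1,  # plus one replica
    }
}

# Create the dated index with those settings in place.
resp = requests.put(ES + "/2011-11-08-twitter", json=settings)
print(resp.status_code, resp.text)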
I feel like I've been using ES for a long time now, but I'm still learning every day. What's the difference between deleting an index and closing it? Does closing mean I can reduce memory consumption for now but maintain the option of re-opening the index later if needed, whereas deleting means losing that option?
I feel like I've been using ES for a long time now, but I'm still learning
every day.
You're not alone
What's the difference between deleting an index and closing it?
Does closing mean I can reduce memory consumption for now but maintain the
option of re-opening the index later if needed, whereas deleting means
losing that option?
Closing an index means that ES still knows about the index (which uses
up a tiny bit of memory) but it doesn't open it, so it is essentially
free.
You can just reopen the index when you need to access it.
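A quick sketch of that close/reopen cycle using the standard _close and _open index APIs (assuming a node on localhost:9200 and the Python requests library):

import requests  # third-party HTTP client, assumed installed

ES = "http://localhost:9200"
index = "2011-11-08-twitter"

# Close the index: ES keeps its metadata but releases the Lucene resources,
# so it stops costing (much) memory.
requests.post("%s/%s/_close" % (ES, index))

# ...and later, reopen it when you need to search or index into it again.
requests.post("%s/%s/_open" % (ES, index))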
Splitting the index is a means to an end. It makes it easier for me to delete old data, keeping the total volume of data limited and reducing the strain on memory. That's the thought, anyway.
Less than 10k indices should be OK (but that means a massive number of open
files!); a lot more should be avoided. Why split them by source?
Because the memory drain on my machine is growing with the number of documents I've placed in ES
But this is not solved by splitting the index, or what do you mean here?
Peter.
Now I'm reading about the _ttl (time to live) field and how ES automatically purges old docs. That seems like an even more straightforward approach than creating a separate index for each day's worth of data. I could put all my docs in one index that way.
Is THIS the optimal purge solution? I know this treads on ground previously discussed on the list, but there seem to be varying themes over time.
TTL still means that those docs are deleted from the index (similar to
delete-by-query): they are marked as deleted, and then eventually need to be
merged out of the segments. That part (the merging) might be OK for your case,
but deleting a whole index is much more lightweight.
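For reference, a rough sketch of what enabling _ttl looked like on the ES versions of that era (the _ttl field was later removed entirely; the "tweets" index and "tweet" type are made up; assumes a node on localhost:9200 and the Python requests library):

import requests  # third-party HTTP client, assumed installed

ES = "http://localhost:9200"

# Enable _ttl in the type mapping, with a default expiry of 7 days.
mapping = {
    "tweet": {
        "_ttl": {"enabled": True, "default": "7d"}
    }
}
requests.put(ES + "/tweets/tweet/_mapping", json=mapping)

# Docs indexed into this type are marked as deleted once their TTL expires;
# the space is only reclaimed when those segments get merged, which is the
# extra work that deleting a whole index avoids.
requests.put(ES + "/tweets/tweet/1", json={"user": "steve", "message": "trying out _ttl"})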