Efficient Index Concatenation Without Reindexing

Hello all,

I am currently working on a project that requires indexing substantial
amounts of time-series event data. Based on feedback from other
projects doing similar work (e.g. Logstash), I've decided to try rolling
indices, with a single active index being written to at any given time. This
has several advantages: rapid cleanup of old content, the ability to
optimize historical data for querying, and the option of separating
indexing and query workloads into zones with appropriate hardware.

I would like to aggregate older indices in order to prevent excessive
shard allocations. Given that the mappings are identical for all indices,
this should be a fairly efficient operation - a higher-level segment merge
across multiple Lucene shards. However, I haven't been able to determine
whether ElasticSearch currently has any built-in support for this.
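
To make the shard-allocation concern concrete, here is a rough planning sketch. It does not call ES or Lucene at all; it only groups daily index names by month and counts shard copies before and after a hypothetical merge. The index prefix `events`, the Logstash-style date suffix, and the figures of 5 shards and 1 replica per index are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_index_names(prefix, start, days):
    """Generate daily index names such as 'events-2013.11.02' (assumed naming)."""
    return [
        f"{prefix}-{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days)
    ]


def plan_monthly_merges(index_names, active_index):
    """Group daily indices by month, skipping the index still being written."""
    groups = defaultdict(list)
    for name in sorted(index_names):
        if name == active_index:
            continue  # never touch the active index
        prefix, day = name.rsplit("-", 1)
        month = day[:7]  # 'YYYY.MM'
        groups[f"{prefix}-{month}"].append(name)
    return dict(groups)


def shard_copies(index_count, shards_per_index, replicas):
    """Total shard copies (primaries plus replicas) the cluster must allocate."""
    return index_count * shards_per_index * (1 + replicas)


names = daily_index_names("events", date(2013, 10, 1), 33)
active = names[-1]
plan = plan_monthly_merges(names, active)

before = shard_copies(len(names), shards_per_index=5, replicas=1)
# After merging: one index per month, plus the active daily index.
after = shard_copies(len(plan) + 1, shards_per_index=5, replicas=1)
print(f"shard copies before: {before}, after: {after}")
```

The sketch assumes each merged monthly index keeps the same shard and replica counts as a daily index, which is a simplification.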

Note that Lucene provides a standalone tool for just this purpose:
http://lucene.apache.org/core/3_6_0/api/contrib-misc/org/apache/lucene/misc/IndexMergeTool.html

So, my questions are:

  1. Does ES already support this behavior in some fashion? Unfortunately
    it's very difficult to search for answers to this problem, as 'merge' is a
    very overloaded word.

If not:

  1. Is there already a roadmap to include similar functionality?

And finally:

  1. If there's nothing currently planned, how viable would it be to get this
    functionality integrated into ES? I am happy to start working on an
    implementation if that is what's needed.

Cheers,
Eric Cornelius

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

In addition to a shard merge tool, I'd love to have a full ES shard
merger/splitter.

Idea:

Jörg


Jörg, have you attempted any operations like this on the Lucene indices
outside of Elasticsearch? Once an index is closed, it's effectively
ignored by ES, but I'm wondering whether there are any negative implications
on reopening if the state has been changed externally. I remember looking for
information on importing Lucene indices directly into ES and coming up
empty, but I don't yet know enough about ES internals to understand why.
If that's actually feasible, it certainly makes this all a lot simpler.

I'm only just beginning to familiarize myself with parts of the Lucene API,
but the addIndexes method certainly sounds like what's needed:
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html

On Saturday, November 2, 2013 6:16:12 PM UTC-4, Jörg Prante wrote:

In addition to a shard merge tool, I'd love to have a full ES shard
merger/splitter.

Idea:

Jörg


All I did was write a discovery tool that can walk through Lucene
structures inside ES, just to educate myself:

Since each shard is a Lucene index, the main processing could be done
fairly well with standard Lucene procedures and tools.

There are some add-ons in ES that ensure a Lucene index is recognized as
a valid and operational ES shard: the uid field, and the cluster state
info, most notably the mappings, and the murmur djb hash that distributes
docs across shards. So a post-processing index tool would have to span
several indexes, which is challenging for a standalone tool, because the
shards reside on different nodes. Maybe a plugin approach is preferable.
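
To illustrate why the hash matters when shards are merged or split, here is a small sketch of hash-based routing. It assumes a djb2-style string hash wrapped to a signed 32-bit integer and a non-negative modulo over the shard count; whether this matches the exact function in any given ES version is not confirmed here, and `shard_for` and the `doc-N` ids are hypothetical names for illustration.

```python
def djb2_hash(value: str) -> int:
    """djb2 string hash, wrapped to a signed 32-bit integer (assumed variant)."""
    h = 5381
    for ch in value:
        # h * 33 + character code, truncated to 32 bits
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed.
    return h - (1 << 32) if h >= (1 << 31) else h


def shard_for(routing: str, num_shards: int) -> int:
    """Pick a shard number; Python's % is non-negative for a positive modulus."""
    return djb2_hash(routing) % num_shards


doc_ids = [f"doc-{i}" for i in range(1000)]
five = {d: shard_for(d, 5) for d in doc_ids}
four = {d: shard_for(d, 4) for d in doc_ids}

# Count the documents whose shard number differs between the two layouts.
moved = sum(1 for d in doc_ids if five[d] != four[d])
print(f"{moved} of {len(doc_ids)} documents change shard number going from 5 to 4 shards")
```

Under these assumptions, changing the shard count changes where many documents are expected to live, so a merger or splitter would have to redistribute documents rather than only concatenate Lucene segments.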

Jörg
