Efficient Index Concatenation Without Reindexing

EricMCornelius · November 2, 2013, 8:14pm

Hello all,

I am currently working on a project which requires indexing substantial
amounts of timeseries event data. Based on feedback from other
projects doing similar work (e.g. Logstash), I've decided to try rolling
indices, with a single active index being written to at any given time. This
has several advantages: old content can be cleaned up quickly, historical
data can be optimized for querying, and indexing and query workloads can be
placed in separate zones with appropriate hardware.
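
A minimal sketch of the rolling-index bookkeeping described above, assuming one index per day named in the Logstash style (prefix-YYYY.MM.DD). The "events" prefix, the retention window, and the function names are hypothetical and only illustrate how the active, historical, and expired indices can be told apart:

```python
from datetime import date, datetime, timedelta


def index_name(prefix: str, day: date) -> str:
    """Build a daily index name such as 'events-2013.11.02'."""
    return f"{prefix}-{day:%Y.%m.%d}"


def classify_indices(prefix: str, today: date, days_kept: int, existing: list[str]):
    """Split existing daily indices into (active, historical, expired).

    active     - the single index currently being written to
    historical - read-only indices still inside the retention window
    expired    - indices older than the retention window, safe to drop whole
    """
    active = index_name(prefix, today)
    cutoff = today - timedelta(days=days_kept)
    historical, expired = [], []
    for name in existing:
        if name == active:
            continue
        # Parse the date suffix that follows "<prefix>-".
        day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        (expired if day < cutoff else historical).append(name)
    return active, sorted(historical), sorted(expired)
```

Dropping an expired index removes a whole day of data at once, which is the rapid cleanup that makes this layout attractive.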

I would like to aggregate older indices, in order to prevent excessive
shard allocations. Given that the mappings are identical for all indices,
this should be a fairly efficient operation - a higher level segment merge
across multiple Lucene shards. However, I haven't been able to determine
whether ElasticSearch currently has any built-in support for this.

Note that Lucene provides a standalone tool for just this purpose:
http://lucene.apache.org/core/3_6_0/api/contrib-misc/org/apache/lucene/misc/IndexMergeTool.html

So, my question is:

Does ES already support this behavior in some fashion? Unfortunately
it's very difficult to search for answers to this problem, as 'merge' is a
very overloaded word.

If not:

Is there already a roadmap to include similar functionality?

And finally:

If there's nothing currently planned, how viable would it be to get this
functionality integrated into ES? I am happy to start working on an
implementation if that is what's needed.

Cheers,
Eric Cornelius

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · November 2, 2013, 10:16pm

In addition to a shard merge tool, I'd love to have a full ES shard
merger/splitter.

Idea:

1. close index
2. start stand alone tool: iterate through docs, rehash (a Lucene 3.6
   approach can be found in
   https://github.com/healthonnet/hash-based-index-splitter, a Lucene 4.4
   splitter is
   http://lucene.apache.org/core/4_4_0/misc/org/apache/lucene/index/PKIndexSplitter.html)
3. copy each doc from old ES index to a new ES index with old shard number
   plus/minus 1
4. remove old ES index
5. reopen index
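
The rehash step above can be sketched roughly as follows. This is only an illustration: the DJB2-style hash is a stand-in and is not guaranteed to match the routing function ES actually uses, and the function names are hypothetical. It shows why documents have to be redistributed when the shard count changes:

```python
def djb2_hash(key: str) -> int:
    """DJB2-style string hash, truncated to 32 bits (illustrative stand-in)."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick the shard a document belongs to from the hash of its ID."""
    return djb2_hash(doc_id) % num_shards


def reshard(shards: list[list[str]], new_count: int) -> list[list[str]]:
    """Iterate through every doc in the old shards and rehash it
    into a layout with new_count shards."""
    new_shards: list[list[str]] = [[] for _ in range(new_count)]
    for shard in shards:
        for doc_id in shard:
            new_shards[shard_for(doc_id, new_count)].append(doc_id)
    return new_shards
```

Because the target shard depends on the hash modulo the shard count, changing the count by even one moves most documents, so a splitter or merger has to touch every doc rather than just relabel shards.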

Jörg

EricMCornelius · November 3, 2013, 5:38am

Jörg, have you attempted any operations on the Lucene indices outside of
Elasticsearch like this? Once the index is closed, it's effectively
ignored by ES, but I'm wondering if there are any negative implications on
reopening if the state has been changed externally. I remember looking for
information on importing Lucene indices directly into ES and coming up
empty, but I don't yet know enough about ES internals to understand why.
If that's actually feasible, it certainly makes this all a lot simpler.

I'm only just beginning to familiarize myself with parts of the Lucene API,
but the addIndexes method certainly sounds like what's needed:
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html
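
To make the idea of concatenating indices without reindexing concrete, here is a toy model in Python. It is not Lucene's implementation and does not call the Lucene API; the dictionary-based postings and the function name are assumptions for illustration. It only shows the principle that existing postings can be appended with shifted document numbers instead of re-analyzing the original documents:

```python
def concat_indexes(target, target_max_doc, sources):
    """Append toy inverted indexes to a target index.

    target         - dict mapping term -> sorted list of doc numbers
    target_max_doc - number of docs in the target index
    sources        - list of (postings dict, max_doc) pairs to append

    Returns (merged postings, total doc count). Each source's doc numbers
    are shifted by the number of docs that precede it, so no document has
    to be tokenized or indexed again.
    """
    merged = {term: list(docs) for term, docs in target.items()}
    base = target_max_doc
    for postings, max_doc in sources:
        for term, docs in postings.items():
            merged.setdefault(term, []).extend(base + d for d in docs)
        base += max_doc
    return merged, base
```

For example, appending a two-document index to a three-document index shifts the second index's doc numbers 0 and 1 to 3 and 4, and the postings lists stay sorted because each source is appended in order.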

jprante · November 3, 2013, 8:22am

All I did was write a discovery tool that can walk through Lucene
structures inside ES, just to educate myself:

Since each shard is a Lucene index, the main processing could be done
fairly well with standard Lucene procedures and tools.

There are some add-ons in ES to ensure that a Lucene index is recognized as
a valid and operational ES shard: the uid field, and the cluster state
info, most notably the mappings, and the murmur djb hash that distributes
docs across shards. So a post-processing index tool should span several
indexes, which is challenging for a standalone tool, because the
shards reside on different nodes. Maybe a plugin approach is preferable.

Jörg
