I am currently working on a project that requires indexing substantial
amounts of time-series event data. Based on feedback from other
projects doing similar work (e.g. Logstash), I've decided to try rolling
indices, with a single active index being written to at any given time. This
has several advantages: rapid cleanup of old content, the ability to
optimize historical data for querying, and the option of placing indexing
and querying in separate zones, each with appropriate hardware.
I would like to aggregate older indices in order to prevent excessive
shard allocations. Given that the mappings are identical for all indices,
this should be a fairly efficient operation: a higher-level segment merge
across multiple Lucene shards. However, I haven't been able to determine
whether Elasticsearch currently has any built-in support for this.
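To make the goal concrete, here is a minimal Python sketch of the planning
step only: given daily index names, it groups those older than a cutoff
into one target index per month. The `events-YYYY.MM.DD` naming pattern
(modelled on the Logstash convention), the function name
`plan_monthly_merges` and the 7-day cutoff are all assumptions made for
illustration. It does not talk to ES and does not perform any merge.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta

PREFIX = "events-"  # hypothetical index prefix, not taken from any real setup


def plan_monthly_merges(index_names, today, keep_days=7):
    """Map each monthly target index to the daily indices it would absorb.

    Only daily indices strictly older than `keep_days` are considered.
    Names that do not match PREFIX + YYYY.MM.DD are ignored.
    """
    cutoff = today - timedelta(days=keep_days)
    plan = defaultdict(list)
    for name in sorted(index_names):
        if not name.startswith(PREFIX):
            continue
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index (e.g. already a monthly one)
        if day < cutoff:
            plan[PREFIX + day.strftime("%Y.%m")].append(name)
    return dict(plan)


if __name__ == "__main__":
    names = ["events-2013.04.29", "events-2013.04.30", "events-2013.05.20"]
    print(plan_monthly_merges(names, today=date(2013, 5, 21)))
```

The open question is what would carry out each entry of that plan.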
Does ES already support this behavior in some fashion? Unfortunately,
it's very difficult to search for answers to this problem, as 'merge' is a
very overloaded word.
If not:
Is there already a roadmap to include similar functionality?
And finally:
If there's nothing currently planned, how viable would it be to get this
functionality integrated into ES? I am happy to start working on an
implementation if that is what's needed.
Jorg, have you attempted any operations like this on the Lucene indices
outside of Elasticsearch? Once an index is closed, it's effectively
ignored by ES, but I'm wondering whether there are any negative
implications on reopening it if the state has been changed externally. I
remember looking for information on importing Lucene indices directly into
ES and coming up empty, but I don't yet know enough about ES internals to
understand why. If that's actually feasible, it certainly makes this all a
lot simpler.
All I did was write a discovery tool that can walk through the Lucene
structures inside ES, just to educate myself:
Since each shard is a Lucene index, the main processing could be done
fairly well with standard Lucene procedures and tools.
ES adds a few things to ensure that a Lucene index is recognized as a
valid and operational ES shard: the uid field, the cluster state info
(most notably the mappings), and the murmur djb hash that distributes
docs across shards. So a post-processing index tool would have to span
several indexes, which is challenging for a standalone tool because the
shards reside on different nodes. Maybe a plugin approach is preferable.
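To illustrate why the hash matters for a merge tool, here is a rough Python
sketch of hash-based shard assignment using the classic DJB2 string hash.
This is not ES's actual routing code: the function names `djb2` and
`shard_for`, the 32-bit truncation, and the use of the document id as the
routing key are assumptions for illustration, and the real implementation
may differ in detail.

```python
def djb2(key: str) -> int:
    """Classic DJB2 string hash (h = h * 33 + c), truncated to 32 bits."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard for a document id (assumed routing key)."""
    return djb2(doc_id) % num_shards


if __name__ == "__main__":
    for doc_id in ("event-1", "event-2", "event-3"):
        print(doc_id, shard_for(doc_id, 5), shard_for(doc_id, 2))
```

Under this scheme the shard number depends on the shard count, so a tool
that copies documents into an index with a different number of shards
would have to place each one according to the target's count. Otherwise
lookups by id would probably go to the wrong shard.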