Spark/Hadoop batch build shards/Lucene indices

Is it possible to use Spark/Hadoop to write a Lucene index direct to disk that could be "imported" into ES as a shard instead of relying on ES to perform the indexing itself?

In production environments, an ES cluster is often not provisioned to handle a heavy streaming indexing load and a batch indexing load at the same time. This means that having Spark perform index-heavy operations in batch can quickly bring a cluster down.

Instead, it'd be awesome to have Spark/Hadoop jobs write the segments of a Lucene index to a filesystem like HDFS, then simply bulk import the indices into ES afterwards to act as shards.

A Lucene segment cannot simply be imported into Elasticsearch, since Elasticsearch handles not just the index but also the distribution of data. Even supposing it could work, processing files locally means the work isn't done in a distributed fashion, so you'd end up with unbalanced segments - some holding more data than others, since each piece of data is processed in isolation.

When you import them into the cluster, you would have to reindex the data anyway, since the segments would have to be distributed across the cluster - which essentially means redoing the work.

Processing files locally doesn't necessarily mean you'll get unbalanced segments; it depends on how the data is partitioned in Spark/Hadoop when performing the local writes.
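To illustrate the partitioning point: if the batch job hash-partitions documents by ID (the way Spark's `HashPartitioner` would), each locally written segment receives a roughly equal share of the data. A minimal stdlib sketch, with plain Python standing in for the Spark job and `md5` standing in for the partitioner's hash:

```python
from collections import Counter
import hashlib

def partition_for(doc_id: str, num_partitions: int) -> int:
    # Stable hash modulo partition count (Python's built-in hash() is
    # salted per process, so a digest is used instead).
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_partitions

num_partitions = 5
counts = Counter(partition_for(f"doc-{i}", num_partitions)
                 for i in range(10_000))

# 10,000 uniformly hashed IDs over 5 partitions land close to
# 2,000 documents each - i.e. the "segments" stay balanced.
print(sorted(counts.values()))
```

So whether the offline segments come out balanced is a property of the partitioning scheme, not of local processing per se.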

As for reindexing data to distribute the segments, my hope would be to actually use something like snapshot and restore to accomplish this.

It's worth noting that for users of Elasticsearch who use doc routing during indexing, the same concern about shards/segments not being distributed equally exists.
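The routing behavior behind this concern can be sketched as well. Elasticsearch picks a shard as a hash of the routing value modulo the number of primary shards (Murmur3 in real ES; `md5` here keeps the sketch stdlib-only). With the default per-document routing the spread is even; with custom routing on a skewed key, documents pile up on one shard:

```python
from collections import Counter
import hashlib

def shard_for(routing: str, num_shards: int) -> int:
    # ES-style placement: shard = hash(_routing) % num_primary_shards.
    return int(hashlib.md5(routing.encode()).hexdigest(), 16) % num_shards

num_shards = 5

# Default routing (the document ID): spreads evenly.
by_id = Counter(shard_for(f"doc-{i}", num_shards) for i in range(10_000))

# Custom routing on a skewed key: 80% of docs belong to one tenant,
# so every one of them hashes to the same shard.
tenants = ["tenant-big"] * 8_000 + [f"tenant-{i}" for i in range(2_000)]
by_tenant = Counter(shard_for(t, num_shards) for t in tenants)

print(sorted(by_id.values()))      # roughly equal
print(sorted(by_tenant.values()))  # one shard holds at least the 8,000 big-tenant docs
```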

I'm sure there are a few more gotchas I haven't thought of, but on the surface I can't see a huge impediment to this being possible, or even recommended, for intense indexing loads.

It's not possible and undesirable for several reasons:

  1. one of the main features of Elastic is that it splits an index across multiple shards. This is not an afterthought but a core feature - Elastic is distributed from day 0, and that applies to the index itself.
    Note that a Lucene index can work stand-alone; it is self-contained, which is why having one index per host means having different indices that work in isolation.
    ES goes beyond this by applying the sharding during indexing - it takes into account things like topology, routing, replication, etc. and based on that decides which doc needs to sit where. The end result is not multiple indices across nodes but rather one index split across multiple nodes.
    In other words, if you'd like to do this, you'll basically have to replicate a lot of Elasticsearch features outside Elastic itself.
  2. As an alternative you can create the indices in isolation, however to "put them together" you'll end up reindexing all your data, which means twice the work.
    Mark's point is actually quite valid; you are right that with routing you can have unbalanced shards, but the point is that this is something you choose and can control. With separate indexing that is not the case, as you are prone to the particularities of your dataset.

As a simple solution, why not create an Elasticsearch staging cluster? To avoid pushing the data to your production cluster, simply set up a staging cluster (on whatever hardware you have), push the data there, validate it (a good idea before going into production), snapshot it, and then restore it into your production environment.
This guarantees the data moves safely in stages from staging to production, and allows you to isolate 'defects' outside production itself and track down abnormal behavior.
It is also fully supported, and gives you the flexibility of manipulating the topology (you can modify the production cluster without having to worry about the staging one and vice versa).
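For reference, the staging-to-production flow maps onto the snapshot REST API roughly as follows. The repository, snapshot, and index names below are placeholders, and the repository must be registered on both clusters against a shared location (a filesystem mount here; S3 etc. also work). A sketch that just prints the call sequence:

```python
import json

def snapshot_calls(repo: str, snapshot: str, indices: str):
    """Sketch of the REST calls behind the staging -> production flow.

    Names like "my_backups" are hypothetical, not fixed API values.
    """
    return [
        # 1. Register the repository on BOTH clusters (shared location).
        ("PUT", f"/_snapshot/{repo}",
         json.dumps({"type": "fs",
                     "settings": {"location": "/mnt/backups"}})),
        # 2. On the staging cluster: snapshot the batch-built indices.
        ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
         json.dumps({"indices": indices})),
        # 3. On the production cluster: restore them.
        ("POST", f"/_snapshot/{repo}/{snapshot}/_restore",
         json.dumps({"indices": indices})),
    ]

for method, path, body in snapshot_calls("my_backups", "batch_2024",
                                         "batch-index"):
    print(method, path, body)
```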

P.S. In case you were unaware, for Spark/Hadoop questions there's a dedicated forum.


Good point, let me move it :slight_smile:

Fair enough, but then the better question is: what is the right solution for a cluster that's already dealing with a heavy indexing workload? Spin up a new cluster for batch updates, then do a backup and restore of whatever indices were created during the batch op?

Ideally you'd add more nodes to cope with both loads.

Otherwise your suggestion, which Costin also mentioned, is a good one.

As you pointed out, you want to isolate the load from the production cluster. This means creating a buffer/staging environment that takes care of adding the new content and then loading it into production with minimal effort.
As I've mentioned, a separate staging ES instance can solve this nicely.