I sent the following Slack message to a colleague today. He does elasticsearch consulting and usually keeps pretty current on all things elasticsearch. Then I thought maybe other people would have ideas too. Here it is:
Man, every possible solution for getting bulk data into elasticsearch seems kludgy or has some drawbacks. I've thought of 6 different ways of doing it, but none of them really seems good. Or I haven't found the right documentation for a solution.
-
Option 1: Fire up a standalone elasticsearch instance on each pipeline node and build an index in it containing all of the documents from that node, but use the number of shards that the final index in production will use and specify the routing that the final index will use. Snapshot each standalone index. Load the production index from all of the snapshots, doing a lucene index merge per shard, since each standalone index has the same number of shards and routing. Rebuild the cluster metadata.
-
Problems: Elasticsearch simply doesn't have this feature as far as I can tell, but it should. Unfortunately I don't know the inner workings of elasticsearch well enough to implement the underlying per-shard lucene index merge the way solr does.
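The per-node snapshot half of this is at least ordinary snapshot API usage; it's only the merge-on-restore piece that's missing. A rough sketch of the snapshot step, with made-up index names, repository paths, and hosts (settings as in reasonably recent elasticsearch versions):

```python
import requests

# Made-up names: each pipeline node runs a standalone elasticsearch on
# localhost and has built an index called "docs-node-0" for its partition.
ES = "http://localhost:9200"
REPO = "per_node_repo"
INDEX = "docs-node-0"

# Register a filesystem snapshot repository (the location has to be listed
# under path.repo in elasticsearch.yml).
requests.put(f"{ES}/_snapshot/{REPO}", json={
    "type": "fs",
    "settings": {"location": "/mnt/snapshots/node-0"},
}).raise_for_status()

# Snapshot just this node's index and wait for it to finish.
requests.put(
    f"{ES}/_snapshot/{REPO}/{INDEX}-snap",
    params={"wait_for_completion": "true"},
    json={"indices": INDEX, "include_global_state": False},
).raise_for_status()
```

The part that doesn't exist is the restore that takes shard 0 from every per-node snapshot and lucene-merges them into a single shard 0 of the production index, and so on for each shard.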
-
Option 2: Fire up an elasticsearch cluster on the pipeline nodes, with each node running an elasticsearch member. Set custom shard placement so that shard 0 goes to node 0, shard 1 goes to node 1, etc., using the same number of shards as pipeline nodes. Set custom routing to match the pipeline partitioning so that all documents on node 0 go to shard 0, etc. Snapshot the index and load it as normal.
-
Problems: Elasticsearch doesn't seem to support this level of detail for routing. You can specify which parameter to route on, but not exactly how that parameter maps to a shard. It also doesn't seem to support this level of shard placement config: you can specify which nodes the shards for a given index may be placed on, but you can't specify exactly which shard (with which routing) must land on which node. This solution also forces you into a shard count that might not be optimal for the production cluster.
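For concreteness, this is roughly the level of control the index settings do give you; node names, index names, and shard counts below are made up, and the URLs assume a reasonably recent elasticsearch:

```python
import requests

ES = "http://pipeline-node-0:9200"  # made-up cluster endpoint

# Create the index with one shard per pipeline node, restricted to the
# pipeline nodes via allocation filtering -- but there is no setting that
# says "shard 0 must live on pipeline-node-0".
requests.put(f"{ES}/docs", json={
    "settings": {
        "number_of_shards": 4,
        "number_of_replicas": 0,
        "index.routing.allocation.include._name":
            "pipeline-node-0,pipeline-node-1,pipeline-node-2,pipeline-node-3",
    }
}).raise_for_status()

# Custom routing: every document with the same routing value lands in the
# same shard, but the value -> shard mapping is roughly
# hash(routing) % number_of_shards, which isn't something you get to choose.
requests.put(
    f"{ES}/docs/_doc/1",
    params={"routing": "partition-0"},
    json={"field": "value"},
).raise_for_status()
```

So you can confine the index to the pipeline nodes and keep each partition's documents together in one shard, but which shard ends up on which node is still elasticsearch's call.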
-
Option 3: Fire up an indexing machine for each pipeline node and have them join the production elasticsearch cluster while indexing is happening. Use shard placement (allocation filtering or something like it) to send all of the shards being indexed to the new indexing machines, and to keep the new machines from taking any query traffic. Increase replicas after indexing and use placement to move all the replicas onto the query nodes. Shut down the indexing nodes and hope one of the replicas on the query nodes is promoted to primary appropriately.
-
Problems: We need to pay for a bunch of extra nodes, and everything has to go over the network instead of all indexing staying on the local machine. I'm not sure how much this kind of thing affects the production cluster. It seems risky poking at the prod cluster like this and trying to get all the placement configuration just right, and it's not clear how much load the rebalancing will put on the rest of the cluster or how much control we have over that.
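On the "how much control do we have over the rebalancing" question, the knobs I'm aware of are the cluster-level recovery throttles plus node-attribute allocation filtering to steer the build shards onto the indexing boxes. A sketch of what that might look like, with a made-up box_type node attribute and endpoint (this still doesn't solve keeping query traffic off the indexing nodes):

```python
import requests

ES = "http://prod-es:9200"  # made-up production cluster endpoint

# Throttle how aggressively shards move around while the temporary
# indexing nodes join and later leave the cluster.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {
        "indices.recovery.max_bytes_per_sec": "40mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    }
}).raise_for_status()

# Tag-based placement: if the indexing nodes are started with a node
# attribute like box_type: indexing, the new index can be pinned to them
# while it's being built...
requests.put(f"{ES}/docs/_settings", json={
    "index.routing.allocation.require.box_type": "indexing"
}).raise_for_status()

# ...and flipped to the query nodes once indexing is done, which kicks off
# exactly the relocation the throttles above are meant to keep in check.
requests.put(f"{ES}/docs/_settings", json={
    "index.routing.allocation.require.box_type": "query"
}).raise_for_status()
```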
-
Option 4: Fire up a completely separate elasticsearch cluster, similar to option 3, but don't join it to the production cluster. Build the index on this cluster and snapshot it. Load the snapshot into prod as normal.
-
Problems: We need to pay for a whole extra elasticsearch cluster, but this is probably still a better solution than option 3 since it avoids the rest of the drawbacks. It still seems pretty clunky: we have to run way more machines and do way more copying over the network than an optimal solution would require.
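The "load the snapshot to prod" step here is at least a standard restore from a repository both clusters can see (S3, a shared filesystem, whatever). A sketch with made-up names:

```python
import requests

PROD = "http://prod-es:9200"  # made-up production endpoint

# Both clusters point at the same repository; the build cluster wrote the
# snapshot, prod reads it back (readonly so prod can't clobber it).
requests.put(f"{PROD}/_snapshot/build_repo", json={
    "type": "fs",
    "settings": {"location": "/mnt/shared-snapshots", "readonly": True},
}).raise_for_status()

# Restore the freshly built index into prod, optionally renaming it so an
# alias can be flipped over to it afterwards.
requests.post(
    f"{PROD}/_snapshot/build_repo/nightly-build/_restore",
    params={"wait_for_completion": "true"},
    json={
        "indices": "docs-build",
        "rename_pattern": "docs-build",
        "rename_replacement": "docs-v2",
    },
).raise_for_status()
```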
-
Option 5: Switch to using the REST client only. Put elasticsearch behind an ELB. Run 2 elasticsearch clusters for the batch indices: one cluster for indexing, one cluster for querying. Double buffer the clusters: index into the back cluster while the front cluster takes traffic. After indexing, flip the load balancer, and the new back cluster is ready for the next round of indexing.
-
Problems: We need to switch to using the REST client only, and we need to run 2 clusters. Otherwise this solution is fairly safe and convenient, but we still have the issue of the indexing being suboptimal because of the extra network hops.
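The flip itself can be pretty dumb once everything speaks REST through the load balancer. A sketch of the idea, assuming the two clusters have distinct cluster names and an elasticsearch recent enough that the root endpoint reports cluster_name; endpoints are placeholders, and the actual flip is whatever the load balancer's API calls for, not an elasticsearch operation:

```python
import requests

# Made-up endpoints: queries always go through the load balancer; indexing
# talks directly to whichever cluster is currently the "back" one.
QUERY_LB = "http://es-query-lb.internal:9200"
CLUSTERS = ["http://es-blue.internal:9200", "http://es-green.internal:9200"]

def back_cluster():
    """Return the cluster the load balancer is NOT currently fronting."""
    front = requests.get(QUERY_LB).json()["cluster_name"]
    for cluster in CLUSTERS:
        if requests.get(cluster).json()["cluster_name"] != front:
            return cluster
    raise RuntimeError("load balancer isn't pointing at a known cluster")

# Bulk index into the back cluster while the front cluster serves queries.
target = back_cluster()
requests.post(
    f"{target}/docs/_bulk",
    headers={"Content-Type": "application/x-ndjson"},
    data='{"index":{"_id":"1"}}\n{"field":"value"}\n',
).raise_for_status()

# Once the new index checks out, repoint the load balancer at `target`
# (an ELB/DNS change, outside elasticsearch).
```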