Moving big indices around with two or more instances running in the same box, working separately

Hi there,

I would like an opinion about this scenario:

  • 4 ES instances, one node each, running in the same box but using different cluster names.
    So they won't be working together, but they will all store the same type of data.
  • A proxy running on top of these instances to route my indices to the right node.
  • ZFS snapshots as backup.

I am not completely aware of ES's capabilities, and the main reason here would be to easily move an entire index to new servers as it grows.

How does it sound to you?

Thanks

Why not just cluster the new node to the old one, then use shard allocation filtering to shift the data across?
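
As a rough sketch, assuming an index named my-index and the old node named old-node (both placeholders), excluding the old node tells Elasticsearch to relocate that index's shards onto the remaining nodes, such as a newly joined one:

PUT my-index/_settings
{
  "index.routing.allocation.exclude._name": "old-node"
}

You can watch the shards move with GET _cat/recovery?v.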

I did not know about this, but I have one question.
With this approach, wouldn't I have to declare my indices in the file?
My indices are generated on the fly, so I do not know their names before they are created.

In what file?

Sorry, I meant not necessarily in the conf file.
I would have to send a PUT for each new index I create on the fly.

For instance, for a new index 1000:
PUT 1000/_settings
{
  "index.routing.allocation.include._name": "my-node-A"
}
...

Then I would send my bulk request and repeat this process for each new index, passing the correct node name.

Currently I have around 6K indices, and that number will get much bigger.

I don't understand why you'd want to move stuff around, nor why you have this standalone "cluster" setup.

I must admit that I also do not understand exactly what you are trying to achieve. The fact that you have 6,000 indices on a cluster of that size and expect that number to grow is, however, a concern. Each shard in Elasticsearch is a Lucene index and carries some resource overhead (memory, file handles, CPU). Having that many indices and shards will unnecessarily use up a lot of system resources and is unlikely to scale well. Why do you have so many indices?
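
As an aside, you can get a quick read on how many indices and shards a cluster is carrying with the cluster stats API (the filter_path here is just one way to trim the response):

GET _cluster/stats?filter_path=indices.count,indices.shards.total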

Here is my case:

  • One client can archive many websites.
  • Each archived website has a unique id.
  • Each time the same client archives the same website, a new snapshot id is created.
  • The document content is the website's html/css/js/etc.
  • The documents are stored following this path:
    /website_id/snapshot_id/document_id
  • Each new index is dedicated to the snapshots of a particular website.
  • Each website has its own index.

That's why so many indices.

Now, about the standalone instances. The idea was proposed in order to facilitate data migration to another server as the data grows. For instance, moving an entire index's data to another dedicated server. Since the data would be stored on only one node, we could easily move the entire index.

I am trying to understand if this would be viable. That is why I would like to hear from you guys and know if there are other ways to achieve that. :slightly_smiling:

Thanks a lot

Why not have a cluster, then, when you need to, add larger nodes and use filtering to move data off the smaller nodes?
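
A minimal sketch of that, assuming the small nodes are named small-node-1 and small-node-2 (placeholder names):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "small-node-1,small-node-2"
  }
}

Elasticsearch will then rebalance the shards onto the remaining, larger nodes, and the small nodes can be decommissioned once they are empty.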

Having a separate index per website wastes a lot of resources and will not scale. If the structures of the documents you are indexing for the different websites are similar, and/or you can control the mappings, I would recommend storing multiple websites, if not all of them, in a single index. You can then either add filters at the application layer or use filtered aliases when accessing the data. If you always query per website, you can also use routing to ensure that all documents belonging to a single website reside in a single shard, which can improve query latency and throughput.
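
For example, a filtered alias with routing might look like this (a sketch, assuming a shared index named websites and a website_id field in the documents; both names are placeholders):

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "websites",
        "alias": "website-1000",
        "filter": { "term": { "website_id": "1000" } },
        "routing": "1000"
      }
    }
  ]
}

Searches against website-1000 then only touch the single shard that routing value maps to.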