Dynamically loading Elasticsearch indices

I have a specific requirement to load different Elasticsearch index data for different users without restarting Elasticsearch. I can do this using the open and close index APIs.

  1. Create an Elasticsearch index
    loop:
  2. Index some data into this index (user A's data)
  3. Close the index
  4. Copy \data\nodes\0\indices\<index_uuid>\0\index and \data\nodes\0\indices\<index_uuid>\0\translog to another folder
  5. Open the index
    Repeat the loop for other users' data (steps 3-5 are sketched below)
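A minimal Python sketch of steps 3-5, assuming a local single-node cluster on localhost:9200; the index name and backup location are hypothetical, and note that copying shard files on disk is not a supported operation:

```python
import shutil
import requests

ES = "http://localhost:9200"
INDEX = "user-index"                      # hypothetical index name
DATA = r"\data\nodes\0\indices"           # shard data path from the steps above

def save_user_data(index_uuid: str, user_id: str) -> None:
    """Close the index, copy its shard files aside for this user, reopen it."""
    requests.post(f"{ES}/{INDEX}/_close").raise_for_status()
    for part in ("index", "translog"):
        src = rf"{DATA}\{index_uuid}\0\{part}"
        dst = rf"\backup\{user_id}\{part}"   # hypothetical backup location
        shutil.copytree(src, dst)            # fails if dst already exists
    requests.post(f"{ES}/{INDEX}/_open").raise_for_status()
```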

Now, if I want to load user A's data again:

  1. Close the index
  2. Copy the \data\nodes\0\indices\<index_uuid>\0\index and \data\nodes\0\indices\<index_uuid>\0\translog folders saved for user A back into the index folder
  3. Open the index

So now user A's data is ready for indexing or search.
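The reverse direction would look like this, under the same assumptions (again, the backup location and index name are hypothetical, and the file copy itself is the unsupported part):

```python
import shutil
import requests

ES = "http://localhost:9200"
INDEX = "user-index"                      # hypothetical index name
DATA = r"\data\nodes\0\indices"

def load_user_data(index_uuid: str, user_id: str) -> None:
    """Close the index, swap user A's saved shard files back in, reopen it."""
    requests.post(f"{ES}/{INDEX}/_close").raise_for_status()
    for part in ("index", "translog"):
        dst = rf"{DATA}\{index_uuid}\0\{part}"
        shutil.rmtree(dst)                                  # drop the currently loaded files
        shutil.copytree(rf"\backup\{user_id}\{part}", dst)  # restore this user's files
    requests.post(f"{ES}/{INDEX}/_open").raise_for_status()
```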

Just to provide more info:

  1. I am using a single-node Elasticsearch cluster
  2. The index has a single shard

Do you find any obvious problems with the above approach?
Will it cause any data corruption if I continue this for long? Are there any better solutions for dynamically loading Elasticsearch indices?

Seems like a very odd approach. What is the point of this? What is the problem you are trying to solve? What is the high-level requirement?

We are indexing a huge amount of data (a few terabytes) into Elasticsearch, belonging to a few hundred thousand users. A user only ever searches within his own data (a few megabytes). So keeping the full data set loaded in Elasticsearch would slow down search for all users.

If we have an index per user, that makes ~100k ES indices, which will slow Elasticsearch down even more. So what we do is load the user's data into an index and keep the index open only during indexing or search operations; after that we close the index. The same index can then be reused for another user's indexing and search by loading that user's data. So basically we keep each user's index data stored separately and load it into Elasticsearch only when required.

A delay on the first search request, for loading the user's data and opening the index, is acceptable; after that, search would be faster. I know this is a weird usage of Elasticsearch.
Are there any problems with this approach, or any other suggestions for my problem?

How much are you indexing per day? What type of data? How much data in total? Which version are you on? What is the size and specification of your cluster?

Elasticsearch 6.0
4 CPUs
8 GB RAM
ES heap size - 4 GB
Single-node Elasticsearch cluster

Each Elasticsearch document holds 15-20 keyword fields.
Every user will index maybe a few thousand documents every day.
Over time, all user data is expected to grow to a few terabytes.

As we can live with a slightly slower search for a user's first request, I am exploring the option of dynamically loading Elasticsearch indices. With this approach, I can provide search with a limited amount of hardware.

Having individual indices per customer seems excessive. An alternative approach would be to have all users share indices. This could be a single index or a series of time based ones. Each document would have a user ID and you would filter by this when you search. As you have a large number of users that only search within their own indices, you could benefit from applying routing at index and search time. This ensures that all documents end up in the same shard and allows only one shard per index to be queried at search time. If you e.g. have a single index with 100 primary shards, each customer query would only hit one of these, which is quite efficient.

This would eliminate the complex operational procedure and avoid a very slow first search.
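As an illustration, routed indexing and searching could look like this (a Python sketch using the requests library against a 6.x node; the index name, user ID, and fields are made up):

```python
import requests

ES = "http://localhost:9200"

# Index a document for user A; the routing value sends all of user A's
# documents to the same shard of the shared index.
doc = {"user_id": "userA", "field1": "value1"}
requests.put(f"{ES}/shared-index/_doc/1",
             params={"routing": "userA"}, json=doc).raise_for_status()

# Search with the same routing value so only that one shard is queried,
# and filter on user_id so results cannot leak across users.
query = {"query": {"bool": {"filter": [{"term": {"user_id": "userA"}}]}}}
resp = requests.get(f"{ES}/shared-index/_search",
                    params={"routing": "userA"}, json=query)
print(resp.json()["hits"]["total"])
```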

Assume we have 10 TB of index data and 100 primary shards; each shard still holds 100 GB of data.
So I would not be able to handle this amount of data and provide search with the hardware specs I mentioned.

The approach I described is a bit weird and not the conventional way of using Elasticsearch, but it can still provide the functionality I need with limited hardware. So do you see any problems with it,
or will it break in any scenario?

With 10TB of data you would probably need more shards, which is why I suggested the use of time-based indices.

There are a couple of things I can think of that may cause problems:

  • Having lots of indices may result in a very large cluster state even if they are closed. Frequent changes, e.g. opening and closing indices, could cause performance issues.
  • Closed indices are not managed by Elasticsearch, so if you lose a node, Elasticsearch will not replicate the closed shards that were lost. If you lose nodes you could therefore lose data.

Basically, we are taking out the Lucene index part of the data and plugging it into a live Elasticsearch index. In this way we can load an Elasticsearch index with different user data without restarting Elasticsearch.

Elasticsearch scans the indices folder on startup and loads all the indices in it. Is it possible to load an index without a restart? For example: I copy a new index data folder into the Elasticsearch data indices folder, and Elasticsearch loads this index without a restart.

Use the APIs. Messing with the data at the file system level is not supported and bound to cause a lot of problems.

Is there any API for rescanning indices in Elasticsearch?

There is an API for opening and closing indices. I do not understand why you need to move them.
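Those are plain POST calls against the _close and _open endpoints; a minimal sketch, assuming a local node and a hypothetical index name:

```python
import requests

ES = "http://localhost:9200"
INDEX = "user-index"   # hypothetical

# Close the index: it stays registered in the cluster state but is not searchable.
requests.post(f"{ES}/{INDEX}/_close").raise_for_status()

# Reopen it when the user needs to index or search again.
requests.post(f"{ES}/{INDEX}/_open").raise_for_status()
```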

I am sorry, Christian. I know my questions are weird, and I really appreciate your responses. Maybe this is unusual for you, but we have more reasons than I have explained for having an index per user. If we have a huge number of indices (100k) but only a few (100) open at a time, do the closed indices also put load on Elasticsearch?

Closed indices do not take up system resources, but I believe they do have an impact on the size of the cluster state. It is quite easy to test though. Create an index template and then create a large number of indices that you close after creation. You should then see what impact they do or do not have. Open an index periodically to see how long this takes.
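A sketch of such a test in Python (the template name, index pattern, and the count of 10,000 are arbitrary choices; `index_patterns` is the ES 6.x template syntax):

```python
import time
import requests

ES = "http://localhost:9200"

# Template so every test index is created with one shard and no replicas.
template = {
    "index_patterns": ["user-*"],
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
}
requests.put(f"{ES}/_template/user-template", json=template).raise_for_status()

# Create a large number of indices and close each one right after creation.
for i in range(10000):
    requests.put(f"{ES}/user-{i}").raise_for_status()
    requests.post(f"{ES}/user-{i}/_close").raise_for_status()

# Periodically reopen one and time it to see how the open cost develops.
start = time.time()
requests.post(f"{ES}/user-42/_open").raise_for_status()
print(f"open took {time.time() - start:.2f} s")
```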

I have tested this approach; after a few thousand indices, I started experiencing long open-index times.
That's what made me look for other weird approaches.

That is what I suspected.

I still think the shared-indices approach is better and will lead to fewer problems. If it means you need a bit more hardware (not necessarily a given), you will probably make that back through fewer operational issues.

I did a POC of the approach I described in the main post, with a decent amount of data (100s of GB), for quite some time, and I did not come across any issues. So I wanted to check whether there are any obvious issues I am missing or specific issues I could encounter in the future.

It sounds like you are in uncharted waters, so I do not really have any additional feedback.
